Hi Jimmy,

Couple of thoughts:

Yeah, that output doesn't really give me enough info. I'd follow Andrew's
recommendation - grab those debug libraries if possible and re-run the
sampling.

In addition, you might want to try enabling call graph sampling. That should
tell us which code path is calling into those libraries. opcontrol accepts a
--callgraph=depth parameter, as does opreport. My notes also indicate that
the images have to be built with the "-fno-omit-frame-pointer" compiler flag
in order to get the call stack. That may already be the case for your debug
images - just something to keep in mind.
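For the debug symbols, if the box is yum-based, something like the following
usually does it (the package names here are from memory - substitute whatever
actually provides your qpidd and libstdc++):

$ debuginfo-install qpid-cpp-server libstdc++

For the call graph capture, again from (admittedly old) memory, the sequence
looks roughly like this - treat it as a sketch and double-check the option
names against the opcontrol/opreport man pages on your system:

$ opcontrol --setup --no-vmlinux --session-dir=/root/oprof --image=/usr/sbin/qpidd --separate=library --callgraph=16
$ opcontrol --start
  ... reproduce the stall ...
$ opcontrol --stop
$ opreport --callgraph --demangle=smart --session-dir=/root/oprof | less

The depth (16 here) is an arbitrary pick - just make it deep enough to get
out of the libstdc++/deque internals and back into the qpid code that's
driving them.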
thanks,

-K

----- Original Message -----
> From: "Jimmy Jones" <[email protected]>
> To: [email protected]
> Sent: Monday, September 9, 2013 12:04:31 PM
> Subject: Re: System stalling
>
> Hi Ken,
>
> Had a play with oprofile... but it seems to have lumped everything into
> glibc, any ideas? The queue is set up as a ring, max size 2GB.
>
> # qpid-stat -q
>
> Queues
>   queue                                      dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
>   ========================================================================================================================
>   9d587baf-edc4-694b-bd43-f0accdf77a44:0.0  Y  Y  0  0  0  0  0  0  1  2
>   ingest  Y  16.4k  49.9k  33.5k  2.13g  7.17g  5.04g  1  2
>
> # opreport --long-filenames --session-dir=/root/oprof
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
>   TIMER:0|
>   samples|      %|
> ------------------
>     78847 100.000 /usr/sbin/qpidd
>         TIMER:0|
>         samples|      %|
>       ------------------
>         53196 67.4674 /usr/lib64/libstdc++.so.6.0.8
>         14220 18.0349 /usr/lib/libqpidbroker.so.2.0.0
>          7368  9.3447 /lib64/libc-2.5.so
>          3833  4.8613 /usr/lib/libqpidcommon.so.2.0.0
>            93  0.1179 /usr/lib/libqpidtypes.so.1.0.0
>            70  0.0888 /lib64/libpthread-2.5.so
>            43  0.0545 /lib64/ld-2.5.so
>            19  0.0241 /lib64/librt-2.5.so
>             4  0.0051 /lib64/libuuid.so.1.2
>             1  0.0013 /usr/sbin/qpidd
>
> # opreport --demangle=smart --session-dir=/root/oprof --symbols `which qpidd`
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name              symbol name
> 53196    67.4674  libstdc++.so.6.0.8      /usr/lib64/libstdc++.so.6.0.8
> 10597    13.4400  libqpidbroker.so.2.0.0  void
> deque<qpid::framing::AMQFrame>::_M_range_insert_aux<_Deque_iterator<qpid::framing::AMQFrame,
> qpid::framing::AMQFrame&,
> qpid::framing::AMQFrame*>>(_Deque_iterator<qpid::framing::AMQFrame,
> qpid::framing::AMQFrame&, qpid::framing::AMQFrame*>,
> _Deque_iterator<qpid::framing::AMQFrame, qpid::framing::AMQFrame&,
> qpid::framing::AMQFrame*>, _Deque_iterator<qpid::framing::AMQFrame,
> qpid::framing::AMQFrame&, qpid::framing::AMQFrame*>, forward_iterator_tag)
>  2922     3.7059  libqpidcommon.so.2.0.0  qpid::framing::AMQFrame::~AMQFrame()
>  2833     3.5930  libqpidbroker.so.2.0.0  deque<qpid::framing::AMQFrame>::clear()
>  2486     3.1529  libc-2.5.so             _int_malloc
>  1882     2.3869  libc-2.5.so             _int_free
>  1757     2.2284  libc-2.5.so             malloc
>   589     0.7470  libc-2.5.so             memcpy
>   384     0.4870  libc-2.5.so             free
> ...
>
> Cheers,
>
> Jimmy
>
> ----- Original Message -----
> From: Ken Giusti
> Sent: 09/06/13 06:27 PM
> To: [email protected]
> Subject: Re: System stalling
>
> Hi Jimmy,
>
> Have you ever used the oprofile tool before?
>
> http://oprofile.sourceforge.net/about/
>
> I've found this tool useful when I need to get a sense of where the broker
> is spending its time, especially when it is compute-bound.
>
> You'll need to be able to install oprofile on the system that is running
> the broker, and you'd need root permission to run it.
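>
> (If oprofile isn't already on the box, it's usually just a "yum install
> oprofile" on RHEL-ish systems - but grab it from wherever your distro
> normally provides it.)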
>
> The approach I take is to configure oprofile to analyze the broker, then
> perform whatever actions I need to get the broker into the compute-bound
> state. Once the broker is acting up, I then trigger oprofile to start a
> capture. That results in a capture that best represents what the broker is
> doing when it is in that compute-bound state.
>
> It's been a while since I used oprofile, but here's a summary of the
> commands I used last.
>
> First, the path to the broker executable for this example is
> /home/kgiusti/mrg/qpid/cpp/src/.libs/lt-qpidd. Be sure you're referencing
> the actual executable image and not the shell wrapper that autotools
> generates!
>
> After starting the broker daemon, I delete any old oprofile configuration
> and captures. I then configure and start the oprofile daemon using the
> following commands (done as root):
>
> $ rm -rf /root/oprof
> $ rm -rf ~/.oprofile
> $ opcontrol --shutdown
> $ opcontrol --init
> $ opcontrol --reset
> $ opcontrol --setup --no-vmlinux --session-dir=/root/oprof --image=/home/kgiusti/mrg/qpid/cpp/src/.libs/lt-qpidd --separate=library --event=INST_RETIRED_ANY_P:6000:0:0:1 --cpu-buffer-size=1000000
> $ opcontrol --start-daemon
>
> Once that is done, you should try to reproduce the problem. Once the broker
> is in that weird state, start the capture:
>
> $ opcontrol --start
>
> Capture for a while, then stop the capture and dump the results:
>
> $ opcontrol --stop
> $ opreport --long-filenames --session-dir=/root/oprof
>
> opreport will dump the methods where the broker is spending most of its
> compute time. You might need to also provide the paths to the shared
> libraries, e.g.:
>
> $ opreport --long-filenames --session-dir=/root/oprof -l /home/kgiusti/mrg/qpid/cpp/src/.libs/libqpidbroker.so.2.0.0
>
> These notes are a bit old, and opcontrol/opreport's options may have
> changed a bit, but this should give you a general idea of how to use it.
>
> -K
>
>
> ----- Original Message -----
> > From: "Jimmy Jones" <[email protected]>
> > To: [email protected]
> > Sent: Friday, September 6, 2013 9:50:17 AM
> > Subject: Re: System stalling
> >
> > I've done some further digging, and managed to simplify the system a
> > little to reproduce the problem. The system is now an external process
> > that posts messages to the default headers exchange on my machine, which
> > has a ring queue to receive effectively all messages from the default
> > headers exchange, process them, and post to another headers exchange.
> > There is now nothing listening on the subsequent headers exchange, and
> > all exchanges are non-durable. I've also tried Fraser's suggestion of
> > marking the link as unreliable on the queue, which seems to have no
> > effect (is there any way in the qpid utilities to confirm the link has
> > been set to unreliable?)
> >
> > So essentially what happens is: the system happily processes away,
> > normally with an empty ring queue; sometimes it spikes up a bit and goes
> > back down again, with my ingest process using ~70% CPU and qpidd ~50%
> > CPU, on a machine with 8 CPU cores. However, sometimes the queue spikes
> > up to 2GB (the max) and starts throwing messages away, qpidd hits 100%+
> > CPU, and the ingest process drops to about 3% CPU. I can see messages
> > are being very slowly processed.
> >
> > I've tried attaching to qpidd with gdb a few times, and all threads
> > apart from one seem to be idle in epoll_wait or pthread_cond_wait. The
> > running thread always seems to be somewhere under
> > DispatchHandle::processEvent.
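> >
> > For reference, the rough command I've been using to grab those stacks
> > (just attach and dump every thread's backtrace) is something like:
> >
> > $ gdb -p `pidof qpidd` -batch -ex "thread apply all bt"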
> >
> > I'm at a bit of a loss for what I can do to fix this!
> >
> > Jimmy
> >
> > ----- Original Message -----
> > From: Fraser Adams
> > Sent: 08/23/13 09:09 AM
> > To: [email protected]
> > Subject: Re: System stalling
> >
> > Hi Jimmy, hope you are well!
> >
> > As an experiment, one thing that you could try is messing with the link
> > "reliability". As you know, in the normal mode of operation it's
> > necessary to periodically send acknowledgements from the consumer client
> > application, which then get passed back ultimately to the broker.
> >
> > I'm no expert on this, but from my recollection, if you are in a
> > position particularly where circular queues are overflowing, you are
> > continually trying to produce and consume, and you have some fair level
> > of prefetch/capacity on the consumer, the mechanism for handling the
> > acknowledgements on the broker is "sub-optimal" - I think it's a linear
> > search or some such, and there are conditions where catching up with
> > acknowledgements becomes a bit "N squared".
> >
> > Gordon would be able to explain this way better than me - that's
> > assuming this hypothesis is even relevant :-)
> >
> > Anyway, try adding a link: {reliability: unreliable} stanza to your
> > consumer address string. As an example, one of mine looks like the
> > following (the address string syntax isn't exactly trivial :-)):
> >
> > string address = "test_consumer; {create: receiver, node: {x-declare:
> > {auto-delete: True, exclusive: True, arguments: {'qpid.policy_type':
> > ring, 'qpid.max_size': 100000000}}, x-bindings: [{exchange: 'amq.match',
> > queue: 'test_consumer', key: 'test1', arguments: {x-match: all,
> > data-format: test}}]}, link: {reliability: unreliable}}";
> >
> > Clearly your arguments would be different, but hopefully it'll give you
> > a kick start.
> >
> > The main downside of disabling link reliability is that if you have
> > enabled prefetch and the consumer unexpectedly dies, then all of the
> > messages on the prefetch queue will be lost, whereas with reliable
> > messaging the broker maintains references to all unacknowledged
> > messages, so it would resend them (I *think* that's how it works.....)
> >
> > At the very least it's a fairly simple tweak to your consumer addresses
> > that might rule out (or point to) acknowledgement shenanigans as the
> > root of your problem. From my own experience I always end up blaming
> > this first if I hit performance weirdness with ring queues :-)
> >
> > HTH,
> > Frase
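> >
> > PS: in case it helps, the consumer-side code that goes with an address
> > string like that is roughly the following. It's a trimmed, untested
> > sketch - the broker URL, the capacity of 100 and the 5 second fetch
> > timeout are just placeholders, so adjust to taste:
> >
> > #include <qpid/messaging/Connection.h>
> > #include <qpid/messaging/Session.h>
> > #include <qpid/messaging/Receiver.h>
> > #include <qpid/messaging/Message.h>
> > #include <qpid/messaging/Duration.h>
> > #include <iostream>
> > #include <string>
> >
> > using namespace qpid::messaging;
> >
> > int main() {
> >     Connection connection("localhost:5672");   // placeholder broker URL
> >     connection.open();
> >     Session session = connection.createSession();
> >
> >     // same address as above - the link: {reliability: unreliable} part
> >     // is the bit we're experimenting with
> >     std::string address =
> >         "test_consumer; {create: receiver, node: {x-declare: "
> >         "{auto-delete: True, exclusive: True, arguments: {'qpid.policy_type': "
> >         "ring, 'qpid.max_size': 100000000}}, x-bindings: [{exchange: 'amq.match', "
> >         "queue: 'test_consumer', key: 'test1', arguments: {x-match: all, "
> >         "data-format: test}}]}, link: {reliability: unreliable}}";
> >
> >     Receiver receiver = session.createReceiver(address);
> >     receiver.setCapacity(100);   // prefetch - these messages are lost if we die
> >
> >     Message message;
> >     while (receiver.fetch(message, Duration::SECOND * 5)) {
> >         std::cout << message.getContent() << std::endl;
> >         session.acknowledge();   // only meaningful on a reliable link
> >     }
> >     connection.close();
> >     return 0;
> > }
> >
> > It builds with something along the lines of
> > "g++ test_consumer.cpp -lqpidmessaging -lqpidtypes".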
> >
> > On 21/08/13 17:08, Jimmy Jones wrote:
> > >>>>> I've got a simple processing system using the 0.22 C++ broker, all
> > >>>>> on one box, where an external system posts messages to the default
> > >>>>> headers exchange, and an ingest process receives them using a ring
> > >>>>> queue, transforms them and outputs to a different headers exchange.
> > >>>>> Various other processes pick messages of interest off that exchange
> > >>>>> using ring queues. Recently however the system has been stalling -
> > >>>>> I'm still receiving lots of data from the other system, but the
> > >>>>> ingest process suddenly goes to <5% CPU usage, its queue fills up
> > >>>>> and messages start getting discarded from the ring, the follow-on
> > >>>>> processes go to practically 0% CPU, and qpidd hovers around 95-120%
> > >>>>> CPU (normally it's ~75%) while the rest of the system pretty much
> > >>>>> goes idle (no swapping, there is free memory).
> > >>>>>
> > >>>>> I attached to the ingest process with gdb and it was stuck in send
> > >>>>> (waitForCapacity/waitForCompletionImpl) - I notice this can block.
> > >>>> Is there any queue bound to the second headers exchange, i.e. to
> > >>>> the one this ingest process is sending to, that is not a ring
> > >>>> queue? (If you run qpid-config queue -r, you get a quick listing of
> > >>>> the queues and their bindings.)
> > >>> I've run qpid-config queue, and all my queues have
> > >>> --limit-policy=ring, apart from a UUID one which I presume is
> > >>> qpid-config itself. Are there any other useful debugging things I
> > >>> can do?
> > >> What does qpid-stat -q show? Is it possible to test whether the
> > >> broker is still responsive, e.g. by sending and receiving messages
> > >> through a test queue/exchange? Are there any errors in the logs? Are
> > >> any of the queues durable (and messages persistent)?
> > > qpid-stat -q is all zeros in the msg & bytes columns, apart from the
> > > ingest queue, and another overflowing ring queue I have.
> > >
> > > I did run qpid-tool when the system was broken to dump some stats.
> > > msgTotalDequeues was slowly incrementing on the ingest queue, so I
> > > presume messages were still being delivered and the broker was
> > > responsive?
> > >
> > > The only logging I've got is syslog, and I just see a warning about
> > > unsent data, presumably when the ingest process receives a SIGALRM.
> > > I'm happy to switch on more logging, what would you recommend?
> > >
> > > None of my queues are durable, but I think incoming messages from the
> > > other system are marked as durable. The exchange that the ingest
> > > process sends to is durable, but I'm not setting any durable flags on
> > > outgoing messages (I presume the default is off).
> > >
> > >> Another thing might be a ptrace of the broker process. Maybe two or
> > >> three with a short delay between them.
> > > I'll try this next time it goes haywire.
> > >
> > >> For some reason it seems like the broker is not sending back
> > >> confirmation to the sender in the ingest process, causing that to
> > >> block. Ring queues shouldn't be subject to producer flow control so
> > >> we need to figure out what other reason there could be for that.
> > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [email protected] > > > For additional commands, e-mail: [email protected] > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > > > -- > -K > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- -K --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
