Hi Jimmy,

Couple of thoughts:

Yeah, that output doesn't really give me enough info. I'd follow Andrew's
recommendation - grab those debug libraries if possible and re-run the
sampling.

In addition, you might want to try enabling call graph sampling. That should
tell us which code path is calling into those libraries. opcontrol accepts a
--callgraph=depth parameter, as does opreport. My notes also indicate that
the images have to be built with the "-fno-omit-frame-pointer" compiler flag
in order to get the call stack. That may already be the case for your debug
images - just something to keep in mind.
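For the debug symbols, if the box is yum-based, something like the following
usually does it (the package names here are from memory - substitute whatever
actually provides your qpidd and libstdc++):

$ debuginfo-install qpid-cpp-server libstdc++

For the call graph capture, again from (admittedly old) memory, the sequence
looks roughly like this - treat it as a sketch and double-check the option
names against the opcontrol/opreport man pages on your system:

$ opcontrol --setup --no-vmlinux --session-dir=/root/oprof --image=/usr/sbin/qpidd --separate=library --callgraph=16
$ opcontrol --start
  ... reproduce the stall ...
$ opcontrol --stop
$ opreport --callgraph --demangle=smart --session-dir=/root/oprof | less

The depth (16 here) is an arbitrary pick - just make it deep enough to get
out of the libstdc++/deque internals and back into the qpid code that's
driving them.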
thanks,

-K

----- Original Message -----
> From: "Jimmy Jones" <[email protected]>
> To: [email protected]
> Sent: Monday, September 9, 2013 12:04:31 PM
> Subject: Re: System stalling
>
> Hi Ken,
>
> Had a play with oprofile... but it seems to have lumped everything into
> glibc, any ideas? The queue is set up as a ring, max size 2GB.
>
> # qpid-stat -q
>
> Queues
>   queue                                      dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
>   ========================================================================================================================
>   9d587baf-edc4-694b-bd43-f0accdf77a44:0.0  Y  Y  0  0  0  0  0  0  1  2
>   ingest  Y  16.4k  49.9k  33.5k  2.13g  7.17g  5.04g  1  2
>
> # opreport --long-filenames --session-dir=/root/oprof
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
>   TIMER:0|
>   samples|      %|
> ------------------
>     78847 100.000 /usr/sbin/qpidd
>         TIMER:0|
>         samples|      %|
>       ------------------
>         53196 67.4674 /usr/lib64/libstdc++.so.6.0.8
>         14220 18.0349 /usr/lib/libqpidbroker.so.2.0.0
>          7368  9.3447 /lib64/libc-2.5.so
>          3833  4.8613 /usr/lib/libqpidcommon.so.2.0.0
>            93  0.1179 /usr/lib/libqpidtypes.so.1.0.0
>            70  0.0888 /lib64/libpthread-2.5.so
>            43  0.0545 /lib64/ld-2.5.so
>            19  0.0241 /lib64/librt-2.5.so
>             4  0.0051 /lib64/libuuid.so.1.2
>             1  0.0013 /usr/sbin/qpidd
>
> # opreport --demangle=smart --session-dir=/root/oprof --symbols `which qpidd`
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name              symbol name
> 53196    67.4674  libstdc++.so.6.0.8      /usr/lib64/libstdc++.so.6.0.8
> 10597    13.4400  libqpidbroker.so.2.0.0  void
> deque<qpid::framing::AMQFrame>::_M_range_insert_aux<_Deque_iterator<qpid::framing::AMQFrame,
> qpid::framing::AMQFrame&,
> qpid::framing::AMQFrame*>>(_Deque_iterator<qpid::framing::AMQFrame,
> qpid::framing::AMQFrame&, qpid::framing::AMQFrame*>,
> _Deque_iterator<qpid::framing::AMQFrame, qpid::framing::AMQFrame&,
> qpid::framing::AMQFrame*>, _Deque_iterator<qpid::framing::AMQFrame,
> qpid::framing::AMQFrame&, qpid::framing::AMQFrame*>, forward_iterator_tag)
>  2922     3.7059  libqpidcommon.so.2.0.0  qpid::framing::AMQFrame::~AMQFrame()
>  2833     3.5930  libqpidbroker.so.2.0.0  deque<qpid::framing::AMQFrame>::clear()
>  2486     3.1529  libc-2.5.so             _int_malloc
>  1882     2.3869  libc-2.5.so             _int_free
>  1757     2.2284  libc-2.5.so             malloc
>   589     0.7470  libc-2.5.so             memcpy
>   384     0.4870  libc-2.5.so             free
> ...
>
> Cheers,
>
> Jimmy
>
> ----- Original Message -----
> From: Ken Giusti
> Sent: 09/06/13 06:27 PM
> To: [email protected]
> Subject: Re: System stalling
>
> Hi Jimmy,
>
> Have you ever used the oprofile tool before?
>
> http://oprofile.sourceforge.net/about/
>
> I've found this tool useful when I need to get a sense of where the broker
> is spending its time, especially when it is compute-bound.
>
> You'll need to be able to install oprofile on the system that is running
> the broker, and you'd need root permission to run it.
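>
> (If oprofile isn't already on the box, it's usually just a "yum install
> oprofile" on RHEL-ish systems - but grab it from wherever your distro
> normally provides it.)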
>
> The approach I take is to configure oprofile to analyze the broker, then
> perform whatever actions I need to get the broker into the compute-bound
> state. Once the broker is acting up, I then trigger oprofile to start a
> capture. That results in a capture that best represents what the broker is
> doing when it is in that compute-bound state.
>
> It's been a while since I used oprofile, but here's a summary of the
> commands I used last.
>
> First, the path to the broker executable for this example is
> /home/kgiusti/mrg/qpid/cpp/src/.libs/lt-qpidd. Be sure you're referencing
> the actual executable image and not the shell wrapper that autotools
> generates!
>
> After starting the broker daemon, I delete any old oprofile configuration
> and captures. I then configure and start the oprofile daemon using the
> following commands (done as root):
>
> $ rm -rf /root/oprof
> $ rm -rf ~/.oprofile
> $ opcontrol --shutdown
> $ opcontrol --init
> $ opcontrol --reset
> $ opcontrol --setup --no-vmlinux --session-dir=/root/oprof --image=/home/kgiusti/mrg/qpid/cpp/src/.libs/lt-qpidd --separate=library --event=INST_RETIRED_ANY_P:6000:0:0:1 --cpu-buffer-size=1000000
> $ opcontrol --start-daemon
>
> Once that is done, you should try to reproduce the problem. Once the broker
> is in that weird state, start the capture:
>
> $ opcontrol --start
>
> Capture for a while, then stop the capture and dump the results:
>
> $ opcontrol --stop
> $ opreport --long-filenames --session-dir=/root/oprof
>
> opreport will dump the methods where the broker is spending most of its
> compute time. You might need to also provide the paths to the shared
> libraries, e.g.:
>
> $ opreport --long-filenames --session-dir=/root/oprof -l /home/kgiusti/mrg/qpid/cpp/src/.libs/libqpidbroker.so.2.0.0
>
> These notes are a bit old, and opcontrol/opreport's options may have
> changed a bit, but this should give you a general idea of how to use it.
>
> -K
>
>
> ----- Original Message -----
> > From: "Jimmy Jones" <[email protected]>
> > To: [email protected]
> > Sent: Friday, September 6, 2013 9:50:17 AM
> > Subject: Re: System stalling
> >
> > I've done some further digging, and managed to simplify the system a
> > little to reproduce the problem. The system is now an external process
> > that posts messages to the default headers exchange on my machine, which
> > has a ring queue to receive effectively all messages from the default
> > headers exchange, process them, and post to another headers exchange.
> > There is now nothing listening on the subsequent headers exchange, and
> > all exchanges are non-durable. I've also tried Fraser's suggestion of
> > marking the link as unreliable on the queue, which seems to have no
> > effect (is there any way in the qpid utilities to confirm the link has
> > been set to unreliable?)
> >
> > So essentially what happens is: the system happily processes away,
> > normally with an empty ring queue; sometimes it spikes up a bit and goes
> > back down again, with my ingest process using ~70% CPU and qpidd ~50%
> > CPU, on a machine with 8 CPU cores. However, sometimes the queue spikes
> > up to 2GB (the max) and starts throwing messages away, qpidd hits 100%+
> > CPU, and the ingest process drops to about 3% CPU. I can see messages
> > are being very slowly processed.
> >
> > I've tried attaching to qpidd with gdb a few times, and all threads
> > apart from one seem to be idle in epoll_wait or pthread_cond_wait. The
> > running thread always seems to be somewhere under
> > DispatchHandle::processEvent.
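> >
> > For reference, the rough command I've been using to grab those stacks
> > (just attach and dump every thread's backtrace) is something like:
> >
> > $ gdb -p `pidof qpidd` -batch -ex "thread apply all bt"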
> >
> > I'm at a bit of a loss for what I can do to fix this!
> >
> > Jimmy
> >
> > ----- Original Message -----
> > From: Fraser Adams
> > Sent: 08/23/13 09:09 AM
> > To: [email protected]
> > Subject: Re: System stalling
> >
> > Hi Jimmy, hope you are well!
> >
> > As an experiment, one thing that you could try is messing with the link
> > "reliability". As you know, in the normal mode of operation it's
> > necessary to periodically send acknowledgements from the consumer client
> > application, which then get passed back ultimately to the broker.
> >
> > I'm no expert on this, but from my recollection, if you are in a
> > position particularly where circular queues are overflowing, you are
> > continually trying to produce and consume, and you have some fair level
> > of prefetch/capacity on the consumer, the mechanism for handling the
> > acknowledgements on the broker is "sub-optimal" - I think it's a linear
> > search or some such, and there are conditions where catching up with
> > acknowledgements becomes a bit "N squared".
> >
> > Gordon would be able to explain this way better than me - that's
> > assuming this hypothesis is even relevant :-)
> >
> > Anyway, try adding a link: {reliability: unreliable} stanza to your
> > consumer address string. As an example, one of mine looks like the
> > following (the address string syntax isn't exactly trivial :-)):
> >
> > string address = "test_consumer; {create: receiver, node: {x-declare:
> > {auto-delete: True, exclusive: True, arguments: {'qpid.policy_type':
> > ring, 'qpid.max_size': 100000000}}, x-bindings: [{exchange: 'amq.match',
> > queue: 'test_consumer', key: 'test1', arguments: {x-match: all,
> > data-format: test}}]}, link: {reliability: unreliable}}";
> >
> > Clearly your arguments would be different, but hopefully it'll give you
> > a kick start.
> >
> > The main downside of disabling link reliability is that if you have
> > enabled prefetch and the consumer unexpectedly dies, then all of the
> > messages on the prefetch queue will be lost, whereas with reliable
> > messaging the broker maintains references to all unacknowledged
> > messages, so it would resend them (I *think* that's how it works.....)
> >
> > At the very least it's a fairly simple tweak to your consumer addresses
> > that might rule out (or point to) acknowledgement shenanigans as the
> > root of your problem. From my own experience I always end up blaming
> > this first if I hit performance weirdness with ring queues :-)
> >
> > HTH,
> > Frase
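> >
> > PS: in case it helps, the consumer-side code that goes with an address
> > string like that is roughly the following. It's a trimmed, untested
> > sketch - the broker URL, the capacity of 100 and the 5 second fetch
> > timeout are just placeholders, so adjust to taste:
> >
> > #include <qpid/messaging/Connection.h>
> > #include <qpid/messaging/Session.h>
> > #include <qpid/messaging/Receiver.h>
> > #include <qpid/messaging/Message.h>
> > #include <qpid/messaging/Duration.h>
> > #include <iostream>
> > #include <string>
> >
> > using namespace qpid::messaging;
> >
> > int main() {
> >     Connection connection("localhost:5672");   // placeholder broker URL
> >     connection.open();
> >     Session session = connection.createSession();
> >
> >     // same address as above - the link: {reliability: unreliable} part
> >     // is the bit we're experimenting with
> >     std::string address =
> >         "test_consumer; {create: receiver, node: {x-declare: "
> >         "{auto-delete: True, exclusive: True, arguments: {'qpid.policy_type': "
> >         "ring, 'qpid.max_size': 100000000}}, x-bindings: [{exchange: 'amq.match', "
> >         "queue: 'test_consumer', key: 'test1', arguments: {x-match: all, "
> >         "data-format: test}}]}, link: {reliability: unreliable}}";
> >
> >     Receiver receiver = session.createReceiver(address);
> >     receiver.setCapacity(100);   // prefetch - these messages are lost if we die
> >
> >     Message message;
> >     while (receiver.fetch(message, Duration::SECOND * 5)) {
> >         std::cout << message.getContent() << std::endl;
> >         session.acknowledge();   // only meaningful on a reliable link
> >     }
> >     connection.close();
> >     return 0;
> > }
> >
> > It builds with something along the lines of
> > "g++ test_consumer.cpp -lqpidmessaging -lqpidtypes".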
> >
> > On 21/08/13 17:08, Jimmy Jones wrote:
> > >>>>> I've got a simple processing system using the 0.22 C++ broker, all
> > >>>>> on one box, where an external system posts messages to the default
> > >>>>> headers exchange, and an ingest process receives them using a ring
> > >>>>> queue, transforms them and outputs to a different headers exchange.
> > >>>>> Various other processes pick messages of interest off that exchange
> > >>>>> using ring queues. Recently however the system has been stalling -
> > >>>>> I'm still receiving lots of data from the other system, but the
> > >>>>> ingest process suddenly goes to <5% CPU usage, its queue fills up
> > >>>>> and messages start getting discarded from the ring, the follow-on
> > >>>>> processes go to practically 0% CPU, and qpidd hovers around 95-120%
> > >>>>> CPU (normally it's ~75%) while the rest of the system pretty much
> > >>>>> goes idle (no swapping, there is free memory).
> > >>>>>
> > >>>>> I attached to the ingest process with gdb and it was stuck in send
> > >>>>> (waitForCapacity/waitForCompletionImpl) - I notice this can block.
> > >>>> Is there any queue bound to the second headers exchange, i.e. to
> > >>>> the one this ingest process is sending to, that is not a ring
> > >>>> queue? (If you run qpid-config queue -r, you get a quick listing of
> > >>>> the queues and their bindings.)
> > >>> I've run qpid-config queue, and all my queues have
> > >>> --limit-policy=ring, apart from a UUID one which I presume is
> > >>> qpid-config itself. Are there any other useful debugging things I
> > >>> can do?
> > >> What does qpid-stat -q show? Is it possible to test whether the
> > >> broker is still responsive, e.g. by sending and receiving messages
> > >> through a test queue/exchange? Are there any errors in the logs? Are
> > >> any of the queues durable (and messages persistent)?
> > > qpid-stat -q is all zeros in the msg & bytes columns, apart from the
> > > ingest queue, and another overflowing ring queue I have.
> > >
> > > I did run qpid-tool when the system was broken to dump some stats.
> > > msgTotalDequeues was slowly incrementing on the ingest queue, so I
> > > presume messages were still being delivered and the broker was
> > > responsive?
> > >
> > > The only logging I've got is syslog, and I just see a warning about
> > > unsent data, presumably when the ingest process receives a SIGALRM.
> > > I'm happy to switch on more logging, what would you recommend?
> > >
> > > None of my queues are durable, but I think incoming messages from the
> > > other system are marked as durable. The exchange that the ingest
> > > process sends to is durable, but I'm not setting any durable flags on
> > > outgoing messages (I presume the default is off).
> > >
> > >> Another thing might be a ptrace of the broker process. Maybe two or
> > >> three with a short delay between them.
> > > I'll try this next time it goes haywire.
> > >
> > >> For some reason it seems like the broker is not sending back
> > >> confirmation to the sender in the ingest process, causing that to
> > >> block. Ring queues shouldn't be subject to producer flow control so
> > >> we need to figure out what other reason there could be for that.
> > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [email protected] > > > For additional commands, e-mail: [email protected] > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > > > -- > -K > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- -K --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
