Hi Ken,

Had a play with oprofile... but it seems to have lumped everything into libstdc++/glibc - any ideas? The queue is set up as a ring queue, max size 2GB.
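(For reference, a 2GB ring queue like this one can be created along the following lines with qpid-config - illustrative only, the queue name and exact options shown aren't necessarily what's in use here:)

# qpid-config add queue ingest --limit-policy=ring --max-queue-size=2147483648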
# qpid-stat -q
Queues
  queue  dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ==========================================================================================================================
  9d587baf-edc4-694b-bd43-f0accdf77a44:0.0  Y  Y  0  0  0  0  0  0  1  2
  ingest  Y  16.4k  49.9k  33.5k  2.13g  7.17g  5.04g  1  2

# opreport --long-filenames --session-dir=/root/oprof
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
          TIMER:0|
  samples|      %|
------------------
    78847 100.000 /usr/sbin/qpidd
        TIMER:0|
          samples|      %|
        ------------------
            53196 67.4674 /usr/lib64/libstdc++.so.6.0.8
            14220 18.0349 /usr/lib/libqpidbroker.so.2.0.0
             7368  9.3447 /lib64/libc-2.5.so
             3833  4.8613 /usr/lib/libqpidcommon.so.2.0.0
               93  0.1179 /usr/lib/libqpidtypes.so.1.0.0
               70  0.0888 /lib64/libpthread-2.5.so
               43  0.0545 /lib64/ld-2.5.so
               19  0.0241 /lib64/librt-2.5.so
                4  0.0051 /lib64/libuuid.so.1.2
                1  0.0013 /usr/sbin/qpidd

# opreport --demangle=smart --session-dir=/root/oprof --symbols `which qpidd`
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name               symbol name
53196    67.4674  libstdc++.so.6.0.8       /usr/lib64/libstdc++.so.6.0.8
10597    13.4400  libqpidbroker.so.2.0.0   void deque<qpid::framing::AMQFrame>::_M_range_insert_aux<_Deque_iterator<qpid::framing::AMQFrame, qpid::framing::AMQFrame&, qpid::framing::AMQFrame*>>(_Deque_iterator<qpid::framing::AMQFrame, qpid::framing::AMQFrame&, qpid::framing::AMQFrame*>, _Deque_iterator<qpid::framing::AMQFrame, qpid::framing::AMQFrame&, qpid::framing::AMQFrame*>, _Deque_iterator<qpid::framing::AMQFrame, qpid::framing::AMQFrame&, qpid::framing::AMQFrame*>, forward_iterator_tag)
 2922     3.7059  libqpidcommon.so.2.0.0   qpid::framing::AMQFrame::~AMQFrame()
 2833     3.5930  libqpidbroker.so.2.0.0   deque<qpid::framing::AMQFrame>::clear()
 2486     3.1529  libc-2.5.so              _int_malloc
 1882     2.3869  libc-2.5.so              _int_free
 1757     2.2284  libc-2.5.so              malloc
  589     0.7470  libc-2.5.so              memcpy
  384     0.4870  libc-2.5.so              free
...

Cheers,

Jimmy

----- Original Message -----
From: Ken Giusti
Sent: 09/06/13 06:27 PM
To: [email protected]
Subject: Re: System stalling

Hi Jimmy,

Have you ever used the oprofile tool before?

http://oprofile.sourceforge.net/about/

I've found this tool useful when I need to get a sense of where the broker is spending its time, especially when it is compute-bound. You'll need to be able to install oprofile on the system that is running the broker, and you'd need root permission to run it.

The approach I take is to configure oprofile to analyze the broker, then perform whatever actions I need to get the broker into the compute-bound state. Once the broker is acting up, I then trigger oprofile to start a capture. That results in a capture that best represents what the broker is doing when it is in that compute-bound state.

It's been a while since I used oprofile, but here's a summary of the commands I used last.

First, the path to the broker executable for this example is /home/kgiusti/mrg/qpid/cpp/src/.libs/lt-qpidd. Be sure you're referencing the actual executable image and not the shell wrapper that autotools generates!

After starting the broker daemon, I delete any old oprofile configuration and captures.
I then configure and start the oprofile daemon using the following commands (done as root):

$ rm -rf /root/oprof
$ rm -rf ~/.oprofile
$ opcontrol --shutdown
$ opcontrol --init
$ opcontrol --reset
$ opcontrol --setup --no-vmlinux --session-dir=/root/oprof --image=/home/kgiusti/mrg/qpid/cpp/src/.libs/lt-qpidd --separate=library --event=INST_RETIRED_ANY_P:6000:0:0:1 --cpu-buffer-size=1000000
$ opcontrol --start-daemon

Once that is done, you should try to reproduce the problem. Once the broker is in that weird state, start the capture:

$ opcontrol --start

Capture for a while, then stop the capture and dump the results:

$ opcontrol --stop
$ opreport --long-filenames --session-dir=/root/oprof

opreport will dump the methods where the broker is spending most of its compute time. You might also need to provide the paths to the link libraries, e.g.:

$ opreport --long-filenames --session-dir=/root/oprof -l /home/kgiusti/mrg/qpid/cpp/src/.libs/libqpidbroker.so.2.0.0

These notes are a bit old, and opcontrol/opreport's options may have changed a bit, but this should give you a general idea of how to use it.

-K

----- Original Message -----
> From: "Jimmy Jones" <[email protected]>
> To: [email protected]
> Sent: Friday, September 6, 2013 9:50:17 AM
> Subject: Re: System stalling
>
> I've done some further digging, and managed to simplify the system a little
> to reproduce the problem. The system is now an external process that posts
> messages to the default headers exchange on my machine, which has a ring
> queue to receive effectively all messages from the default headers exchange,
> process them, and post to another headers exchange. There is now nothing
> listening on the subsequent headers exchange, and all exchanges are
> non-durable. I've also tried Fraser's suggestion of marking the link as
> unreliable on the queue, which seems to have had no effect (is there any way
> in the qpid utilities to confirm the link has been set to unreliable?)
>
> So essentially what happens is that the system happily processes away,
> normally with an empty ring queue; sometimes it spikes up a bit and goes
> back down again, with my ingest process using ~70% CPU and qpidd ~50% CPU,
> on a machine with 8 CPU cores. However, sometimes the queue spikes up to 2GB
> (the max) and starts throwing messages away, qpidd hits 100%+ CPU, and the
> ingest process drops to about 3% CPU. I can see messages are being
> processed, but very slowly.
>
> I've tried attaching to qpidd with gdb a few times, and all threads apart
> from one seem to be idle in epoll_wait or pthread_cond_wait. The running
> thread always seems to be somewhere under DispatchHandle::processEvent.
>
> I'm at a bit of a loss for what I can do to fix this!
>
> Jimmy
>
> ----- Original Message -----
> From: Fraser Adams
> Sent: 08/23/13 09:09 AM
> To: [email protected]
> Subject: Re: System stalling
>
> Hi Jimmy, hope you are well!
>
> As an experiment, one thing that you could try is messing with the link
> "reliability". As you know, in the normal mode of operation it's
> necessary to periodically send acknowledgements from the consumer client
> application, which then get passed back ultimately to the broker.
>
> I'm no expert on this, but from my recollection, if you are in a position
> particularly where circular queues are overflowing, you are continually
> trying to produce and consume, and you have some fair level of
> prefetch/capacity on the consumer, then the mechanism for handling the
> acknowledgements on the broker is "sub-optimal" - I think it's a linear
> search or some such, and there are conditions where catching up with
> acknowledgements becomes a bit "N squared".
>
> Gordon would be able to explain this way better than me - that's
> assuming this hypothesis is even relevant :-)
>
> Anyway, try having a link: {reliability: unreliable} stanza in
> your consumer address string (as an example, one of mine looks like the
> following - the address string syntax isn't exactly trivial :-)).
>
> string address = "test_consumer; {create: receiver, node: {x-declare:
> {auto-delete: True, exclusive: True, arguments: {'qpid.policy_type':
> ring, 'qpid.max_size': 100000000}}, x-bindings: [{exchange: 'amq.match',
> queue: 'test_consumer', key: 'test1', arguments: {x-match: all,
> data-format: test}}]}, link: {reliability: unreliable}}";
>
> Clearly your arguments would be different, but hopefully it'll give you a
> kick start.
>
> The main downside of disabling link reliability is that if you have
> enabled prefetch and the consumer unexpectedly dies, then all of the
> messages on the prefetch queue will be lost, whereas with reliable
> messaging the broker maintains references to all unacknowledged messages
> and so would resend them (I *think* that's how it works.....)
>
> At the very least it's a fairly simple tweak to your consumer addresses
> that might rule out (or point to) acknowledgement shenanigans as being
> the root of your problem. From my own experience I always end up blaming
> this first if I hit performance weirdness with ring queues :-)
>
> HTH,
> Frase
>
> On 21/08/13 17:08, Jimmy Jones wrote:
> >>>>> I've got a simple processing system using the 0.22 C++ broker, all
> >>>>> on one box, where an external system posts messages to the default
> >>>>> headers exchange, and an ingest process receives them using a ring
> >>>>> queue, transforms them, and outputs to a different headers exchange.
> >>>>> Various other processes pick messages of interest off that exchange
> >>>>> using ring queues. Recently, however, the system has been stalling -
> >>>>> I'm still receiving lots of data from the other system, but the
> >>>>> ingest process suddenly goes to <5% CPU usage, its queue fills up,
> >>>>> and messages start getting discarded from the ring; the follow-on
> >>>>> processes go to practically 0% CPU, qpidd hovers around 95-120%
> >>>>> CPU (normally it's ~75%), and the rest of the system pretty much goes
> >>>>> idle (no swapping, there is free memory).
> >>>>>
> >>>>> I attached to the ingest process with gdb and it was stuck in send
> >>>>> (waitForCapacity/waitForCompletionImpl) - I notice this can block.
> >>>> Is there any queue bound to the second headers exchange, i.e. to the one
> >>>> this ingest process is sending to, that is not a ring queue? (If you run
> >>>> qpid-config queue -r, you get a quick listing of the queues and their
> >>>> bindings.)
> >>> I've run qpid-config queue, and all my queues have --limit-policy=ring,
> >>> apart from a UUID one which I presume is qpid-config itself. Are there
> >>> any other useful debugging things I can do?
> >> What does qpid-stat -q show?
> >> Is it possible to test whether the broker is still responsive, e.g. by
> >> sending and receiving messages through a test queue/exchange? Are there
> >> any errors in the logs? Are any of the queues durable (and messages
> >> persistent)?
> > qpid-stat -q is all zeros in the msg & bytes columns, apart from the
> > ingest queue and another overflowing ring queue I have.
> >
> > I did run qpid-tool when the system was broken to dump some stats.
> > msgTotalDequeues was slowly incrementing on the ingest queue, so I presume
> > messages were still being delivered and the broker was responsive?
> >
> > The only logging I've got is syslog, and I just see a warning about unsent
> > data, presumably when the ingest process receives a SIGALRM. I'm happy to
> > switch on more logging - what would you recommend?
> >
> > None of my queues are durable, but I think incoming messages from the
> > other system are marked as durable. The exchange that the ingest process
> > sends to is durable, but I'm not setting any durable flags on outgoing
> > messages (I presume the default is off).
> >
> >> Another thing might be a ptrace of the broker process. Maybe two or
> >> three with a short delay between them.
> > I'll try this next time it goes haywire.
> >
> >> For some reason it seems like the broker is not sending back
> >> confirmation to the sender in the ingest process, causing that to block.
> >> Ring queues shouldn't be subject to producer flow control, so we need to
> >> figure out what other reason there could be for that.

--
-K

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
