Jimmy,

Do your ring queues have any flow-control configuration set up? This would be --flow-* thresholds in qpid-config.
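For reference, those get set when the queue is created - a sketch only, with
the queue name and sizes as placeholders:

  qpid-config add queue my-ring-queue --limit-policy=ring \
      --max-queue-size=2000000000 \
      --flow-stop-size=1600000000 --flow-resume-size=1400000000

There are also broker-wide defaults (default-flow-stop-threshold /
default-flow-resume-threshold in qpidd.conf), so it may be worth checking
those too.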

Also, it would be helpful to see the output of a pstack on the qpidd process when the condition occurs. I think almost everything happens under DispatchHandle::processEvent :)
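Something along these lines should do - a couple of snapshots a few seconds
apart are usually more telling than one:

  pstack $(pidof qpidd) > qpidd-stack-1.txt
  sleep 5
  pstack $(pidof qpidd) > qpidd-stack-2.txt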

-Ted

On 09/06/2013 09:50 AM, Jimmy Jones wrote:
I've done some further digging and managed to simplify the system a little to
reproduce the problem. It now consists of an external process that posts
messages to the default headers exchange on my machine, plus my ingest process,
which receives effectively all of those messages via a ring queue bound to that
exchange, processes them, and posts to another headers exchange. There is now
nothing listening on that second headers exchange, and all exchanges are
non-durable. I've also tried Fraser's suggestion of marking the link as
unreliable on the queue, which seems to have no effect (is there any way in the
qpid utilities to confirm the link has been set to unreliable?).

Essentially the system happily processes away with a normally empty ring queue
(it occasionally spikes up a bit and comes back down), with my ingest process
using ~70% CPU and qpidd ~50% CPU on a machine with 8 CPU cores. Sometimes,
however, the queue spikes up to 2GB (the max) and starts throwing messages
away; qpidd hits 100%+ CPU, the ingest process drops to about 3% CPU, and I can
see messages being processed only very slowly.

I've tried attaching to qpidd with gdb a few times, and all threads apart from 
one seem to be idle in epoll_wait or pthread_cond_wait. The running thread 
always seems to be somewhere under DispatchHandle::processEvent.
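In case it's useful, all of the thread backtraces can be captured in one go,
non-interactively, with something like:

  gdb -p $(pidof qpidd) -batch -ex 'thread apply all bt' > qpidd-threads.txt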

I'm at a bit of a loss for what I can do to fix this!

Jimmy
----- Original Message -----
From: Fraser Adams
Sent: 08/23/13 09:09 AM
To: [email protected]
Subject: Re: System stalling
Hi Jimmy, hope you are well!

As an experiment, one thing you could try is messing with the link
"reliability". As you know, in the normal mode of operation the consumer client
application has to periodically send acknowledgements, which ultimately get
passed back to the broker.

I'm no expert on this, but from my recollection, if you're in a situation where
circular (ring) queues are overflowing, you're continually trying to produce
and consume, and you have some fair level of prefetch/capacity on the consumer,
then the mechanism for handling the acknowledgements on the broker is
"sub-optimal" - I think it's a linear search or some such, and there are
conditions where catching up with acknowledgements becomes a bit "N squared".

Gordon would be able to explain this way better than me - that's
assuming this hypothesis is even relevant :-)

Anyway, try adding a link: {reliability: unreliable} stanza to your consumer
address string. As an example, one of mine looks like the following - the
address string syntax isn't exactly trivial :-)

string address = "test_consumer; {create: receiver, "
    "node: {x-declare: {auto-delete: True, exclusive: True, "
    "arguments: {'qpid.policy_type': ring, 'qpid.max_size': 100000000}}, "
    "x-bindings: [{exchange: 'amq.match', queue: 'test_consumer', key: 'test1', "
    "arguments: {x-match: all, data-format: test}}]}, "
    "link: {reliability: unreliable}}";

Clearly your arguments would be different but hopefully it'll give you a
kick start.
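
For completeness, here's a minimal sketch of where that address string sits in
a C++ qpid::messaging consumer. The connection URL, the capacity value and the
binding details are placeholders rather than anything taken from your setup:

  #include <qpid/messaging/Connection.h>
  #include <qpid/messaging/Session.h>
  #include <qpid/messaging/Receiver.h>
  #include <qpid/messaging/Message.h>
  #include <qpid/messaging/Duration.h>
  #include <iostream>

  using namespace qpid::messaging;

  int main() {
      Connection connection("localhost:5672");   // placeholder broker URL
      connection.open();
      Session session = connection.createSession();

      // The address string from above, including the unreliable link stanza.
      Receiver receiver = session.createReceiver(
          "test_consumer; {create: receiver, "
          "node: {x-declare: {auto-delete: True, exclusive: True, "
          "arguments: {'qpid.policy_type': ring, 'qpid.max_size': 100000000}}, "
          "x-bindings: [{exchange: 'amq.match', queue: 'test_consumer', key: 'test1', "
          "arguments: {x-match: all, data-format: test}}]}, "
          "link: {reliability: unreliable}}");
      receiver.setCapacity(100);   // prefetch; placeholder value

      Message msg;
      while (receiver.fetch(msg, Duration::SECOND * 5)) {
          std::cout << msg.getContent() << std::endl;
          // no session.acknowledge() here - with an unreliable link the
          // broker isn't waiting for acknowledgements
      }
      connection.close();
      return 0;
  }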


The main down side of disabling link reliability is that if you have
enabled prefetch and the consumer unexpectedly dies then all of the
messages on the prefetch queue will be lost, whereas with reliable
messaging the broker maintains references to all unacknowledged messages
so would resend them (I *think* that's how it works...)


At the very least it's a fairly simple tweak to your consumer addresses
that might rule out (or point to) acknowledgement shenanigans as being
the root of your problem. From my own experience I always end up blaming
this first if I hit performance weirdness with ring queues :-)

HTH,
Frase



On 21/08/13 17:08, Jimmy Jones wrote:
I've got a simple processing system using the 0.22 C++ broker, all on one box,
where an external system posts messages to the default headers exchange, and an
ingest process receives them using a ring queue, transforms them and outputs to
a different headers exchange. Various other processes pick messages of interest
off that exchange using ring queues. Recently, however, the system has been
stalling. I'm still receiving lots of data from the other system, but the
ingest process suddenly drops to <5% CPU usage, its queue fills up and messages
start getting discarded from the ring, the follow-on processes go to
practically 0% CPU, and qpidd hovers around 95-120% CPU (normally it's ~75%)
while the rest of the system pretty much goes idle (no swapping, and there is
free memory).

I attached to the ingest process with gdb and it was stuck in send
(waitForCapacity/waitForCompletionImpl) - I notice this can block.
Is there any queue bound to the second headers exchange, i.e. to the one
this ingest process is sending to, that is not a ring queue? (If you run
qpid-config queue -r, you get a quick listing of the queues and their
bindings).
I've run qpid-config queue, and all my queues have --limit-policy=ring, apart
from a UUID one which I presume is qpid-config itself. Are there any other
useful debugging things I can do?
What does qpid-stat -q show? Is it possible to test whether the broker is
still responsive, e.g. by sending and receiving messages through a test
queue/exchange? Are there any errors in the logs? Are any of the queues durable
(and messages persistent)?
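(For that responsiveness check, something along these lines with the
qpid-send/qpid-receive test tools would do - 'ping.test' is just a throwaway
queue name:

  qpid-send -a 'ping.test; {create: always}' --content-string=ping -m 1
  qpid-receive -a 'ping.test; {delete: always}' -m 1 --timeout 5
)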
qpid-stat -q shows all zeros in the msg & bytes columns, apart from the ingest
queue and another overflowing ring queue I have.

I did run qpid-tool when the system was broken to dump some stats.
msgTotalDequeues was slowly incrementing on the ingest queue, so I presume
messages were still being delivered and the broker was responsive?

The only logging I've got is syslog, and I just see a warning about unsent
data, presumably when the ingest process receives a SIGALRM. I'm happy to
switch on more logging; what would you recommend?
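(As a starting point, and purely as a sketch: broker logging is controlled by
--log-enable rules, either on the qpidd command line or in qpidd.conf. The
':Queue' filter below is just a guess at a useful selector rather than a
definitive recommendation:

  qpidd --log-enable info+ --log-enable debug+:Queue --log-to-file /var/log/qpidd.log
)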

None of my queues are durable, but I think incoming messages from the other
system are marked as durable. The exchange that the ingest process sends to is
durable, but I'm not setting any durable flags on outgoing messages (I presume
the default is off).

Another thing that might help is a stack trace of the broker process (e.g.
with pstack). Maybe two or three with a short delay between them.
I'll try this next time it goes haywire.

For some reason it seems like the broker is not sending back
confirmation to the sender in the ingest process, causing that to block.
Ring queues shouldn't be subject to producer flow control so we need to
figure out what other reason there could be for that.
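(On the sender side, the number of unconfirmed sends the client will buffer is
controlled by the sender's capacity; once that fills, send() blocks, which may
be related to the waitForCapacity you saw. A fragment only, assuming a session
set up as in the earlier receiver sketch, with the address and capacity value
made up:

  Sender sender = session.createSender("ingest.out");  // placeholder address
  sender.setCapacity(50);   // max messages awaiting broker confirmation before send() blocks
  std::cout << sender.getUnsettled() << std::endl;  // sends the broker hasn't confirmed yet
)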
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
