Hi there, I've deployed qpid on our embedded TI platform, and I'm seeing some
odd behaviour with long-term (36-48 hour) use. I'd like some advice about
whether the way I'm configuring and using qpid is the cause of this.

We have a distributed embedded system spread across 8-12 separate boxes. I've
used Avahi to detect the other units, and I'm using qpid messages as the
system IPC.  We don't have java or python as part of our buildroot rootfs at
the moment, so I've done everything with C++.  The network is fully ad-hoc,
so we can't do any static configuration of exchanges and so on.  Nodes can
come and go during runtime, and the system needs to deal with it.  I've done
it all by creating connections, sessions, senders and receivers at runtime.
Qpid heartbeat messages are used to detect nodes disappearing.

The problem seems to be with the common "give data to each of the other
nodes" situation.  I've implemented it as a push-type operation.  The nodes
only receive locally, so the node with the data will send a message to a
topic on each of the other nodes it knows about.  The topic has been
configured with the following string:

<name> { create: always, assert: never, node : {type: topic, x-declare: {
auto-delete: True, exclusive: False, arguments: {'qpid.policy_type': ring}}}}
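
A minimal sketch of how such an address string could be assembled per topic name in C++. The helper names `makeTopicAddress` and `bracesBalanced` are hypothetical and not taken from the actual code; the option names come from the string above, while the `;` separator between the name and the options and the four closing braces are assumptions based on the usual qpid address syntax.

```cpp
#include <string>

// Hypothetical helper: builds an address string for one topic name.
// Option names are from the post; the ';' separator and the closing
// braces are assumptions.
std::string makeTopicAddress(const std::string& name) {
    return name +
        "; {create: always, assert: never, node: {type: topic, "
        "x-declare: {auto-delete: True, exclusive: False, "
        "arguments: {'qpid.policy_type': ring}}}}";
}

// Sanity check: true if every '{' has a matching '}' in the right order.
bool bracesBalanced(const std::string& s) {
    int depth = 0;
    for (char c : s) {
        if (c == '{') {
            ++depth;
        } else if (c == '}' && --depth < 0) {
            return false;  // a '}' appeared before its '{'
        }
    }
    return depth == 0;
}
```

Running the balance check once at startup on each generated address would catch a truncated option map before it ever reaches the broker.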

I enable a 1s heartbeat on all the connections, but reconnect is off. The
QpidPoller thread spawned by the reconnect would mysteriously crash, so I've
taken it out.

The data profile is pretty modest: messages of perhaps 128 bytes are sent
around the system at about 1Hz, with some other slightly bigger 256 byte
messages sent at perhaps 3-5Hz.

I keep the connections and sessions open rather than opening and closing
them a lot.

I've tried to be conservative with failure cases.  My send / receive failure
code will close the connection object, bin the message and let the calling
code deal with the retry.  When the retry comes in (which I reject until 3s
is up), I open the connection, and then attempt to call getSender/Receiver
for that broker (doing a createSender/Receiver if the get throws an
exception).
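
A minimal sketch of the 3s hold-off described above, kept separate from any qpid calls. `RetryGate` and its methods are hypothetical names for illustration, not the actual code; the sketch assumes one gate per remote broker and uses the monotonic `std::chrono::steady_clock` so that system clock changes do not shorten or lengthen the hold-off.

```cpp
#include <chrono>

// Hypothetical per-peer gate: after a failure, refuse reconnect attempts
// until a hold-off period has elapsed on the monotonic clock.
class RetryGate {
public:
    using Clock = std::chrono::steady_clock;

    explicit RetryGate(Clock::duration holdOff = std::chrono::seconds(3))
        : holdOff_(holdOff), failed_(false) {}

    // Record a send/receive failure observed at time `now`.
    void recordFailure(Clock::time_point now) {
        failed_ = true;
        lastFailure_ = now;
    }

    // True if a reconnect may be attempted at time `now`.
    bool mayRetry(Clock::time_point now) const {
        return !failed_ || (now - lastFailure_) >= holdOff_;
    }

    // Call once the connection has been re-established.
    void recordSuccess() { failed_ = false; }

private:
    Clock::duration holdOff_;
    bool failed_;
    Clock::time_point lastFailure_;
};
```

The calling code would check `mayRetry(RetryGate::Clock::now())` before opening the connection again, and call `recordFailure` or `recordSuccess` depending on the outcome.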

This particular test is with a 2 node system, a "master" which is generally
supplying source data around the system, and a slave node generally just
receiving data.

After several hours (36-48) I see mystery disconnections going into the
syslog, e.g.

Apr 18 10:42:52 [qpidd] 2013-04-18 10:42:52 [Broker] error Connection 127.0.0.1:5672-127.0.0.1:40549 timed out: closing

My qpidd.conf has the interesting lines:

cluster-mechanism=DIGEST-MD5 ANONYMOUS

# Default max size of queue in bytes.  
# Default is 104857600 (100Mb), which is a tad high, try 1Mb
default-queue-limit=1024000

# TTL of messages in system.  Default is 600s
queue-purge-interval=10
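
As a rough cross-check of the queue limit against the data profile described earlier, here is a small worked calculation in C++. The function names are hypothetical, and the calculation assumes that only the payload bytes count towards the limit (no header overhead) and that nothing consumes from the queue.

```cpp
#include <cstdint>

// Payload bytes per second towards one destination, using the figures in
// the post: 128-byte messages at 1 Hz plus 256-byte messages at `fastHz`
// (somewhere between 3 and 5 Hz).
constexpr std::uint64_t payloadBytesPerSecond(std::uint64_t fastHz) {
    return 128u * 1u + 256u * fastHz;
}

// Whole seconds until a queue limited to `limitBytes` would fill if no
// receiver drained it. Assumption: per-message overhead is ignored.
constexpr std::uint64_t secondsToFill(std::uint64_t limitBytes,
                                      std::uint64_t fastHz) {
    return limitBytes / payloadBytesPerSecond(fastHz);
}

// With default-queue-limit=1024000:
//   fastHz = 5 -> 1408 bytes/s -> about 727 s (roughly 12 minutes)
//   fastHz = 3 ->  896 bytes/s -> about 1142 s (roughly 19 minutes)
static_assert(secondsToFill(1024000, 5) == 727, "fill time at 5 Hz");
static_assert(secondsToFill(1024000, 3) == 1142, "fill time at 3 Hz");
```

Under these assumptions an undrained queue would start overwriting old messages (ring policy) after roughly 12 to 19 minutes, far sooner than the 36-48 hours at which the disconnections show up.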


When I use qpid-tool and qpid-stat from my dev box, I see stats like this on
the "slave" node:

Summary of Objects by Type:
    Package                 Class         Active  Deleted
    =======================================================
    org.apache.qpid.broker  binding       13      280
    org.apache.qpid.broker  broker        1       0
    org.apache.qpid.broker  memory        1       0
    org.apache.qpid.broker  system        1       0
    org.apache.qpid.ha      habroker      1       0
    org.apache.qpid.broker  subscription  5       257
    org.apache.qpid.broker  connection    2       27
    org.apache.qpid.broker  session       1       23
    org.apache.qpid.broker  queue         6       147
    org.apache.qpid.broker  exchange      12      0
    org.apache.qpid.broker  vhost         1       0


    
Whereas the "master" node has stats like:

    Package                 Class         Active  Deleted
    =======================================================
    org.apache.qpid.broker  binding       36      0
    org.apache.qpid.broker  broker        1       0
    org.apache.qpid.broker  memory        1       0
    org.apache.qpid.broker  system        1       0
    org.apache.qpid.ha      habroker      1       0
    org.apache.qpid.broker  subscription  23      0
    org.apache.qpid.broker  connection    7       0
    org.apache.qpid.broker  session       7       0
    org.apache.qpid.broker  queue         19      0
    org.apache.qpid.broker  exchange      12      0
    org.apache.qpid.broker  vhost         1       0

So clearly I am creating and destroying a lot of bindings, subscriptions and
queues here.  If I list the queues on the slave I see a lot of this kind of
thing:

    223  20:25:43  -  346.<TopicName>_1af820f0-0224-4f32-8464-081a8020fed4
    224  20:25:43  -  346.<TopicName>_20433cbe-e71c-493d-9f2c-70c6caf89680
    225  20:25:43  -  346.<TopicName>_20bfcaea-d734-45a6-837b-5f7065178f22
    226  20:25:43  -  346.<TopicName>_30f68ad2-8556-4b8b-bf34-3aec299e9270
    227  20:25:43  -  346.<TopicName>_3556f8a5-a233-42a8-a0a5-8e1b91aaeb7d
    228  20:25:43  -  346.<TopicName>_3ec782b5-ce6c-4ec2-8a60-dec9ef797b18
    229  20:25:43  -  346.<TopicName>_4295e25c-163b-41c0-91d8-1a52051a23e0
    230  20:25:43  -  346.<TopicName>_43a772f6-42f3-4262-893e-52294bc901be
    231  20:25:43  -  346.<TopicName>_489b4bda-3e7e-41f3-92db-6c5308245986
    232  20:25:43  -  346.<TopicName>_4a2645fe-bb51-4170-a770-68120c2742b7
    233  20:25:43  -  346.<TopicName>_5cc8a979-f50c-4c06-90fd-81ee3b16ee24
    234  20:25:43  -  346.<TopicName>_6123edf9-553b-4ca0-8bc8-2305c33e71ec
    235  20:25:43  -  346.<TopicName>_750800cc-c913-4cb6-b2bc-4cffa343b335
    236  20:25:43  -  346.<TopicName>_880cff8a-fc4b-414f-8046-d3d09fa2e1a7
    237  20:25:43  -  346.<TopicName>_a06a7268-89ad-4cc0-b6ec-ca44ef9ee787
    238  20:25:43  -  346.<TopicName>_aa4753c5-cc32-40a4-afd4-d677303147df
    239  20:25:43  -  346.<TopicName>_b3f15b2e-f93c-4fad-914d-070344343aaa
    240  20:25:43  -  346.<TopicName>_b5cf4400-fc1b-4bd2-a8a5-7aa388a44e5d
    241  20:25:43  -  346.<TopicName>_bbe3774f-fc6b-4164-bc71-81be966ca598
    242  20:25:43  -  346.<TopicName>_c84620a2-7d29-4de0-bfa0-e7f937ea11ed
    243  20:25:43  -  346.<TopicName>_c8e3c75d-07ab-4847-9fba-8ce929c3b470
    244  20:25:43  -  346.<TopicName>_e1ef9106-07e2-4644-8410-150ad695e0be
    
Why are so many of these being created?  I've put the autodelete on, so I
guess it's the other end at the master somehow keeping the queue in
existence.

So my questions are:

a) It's a wired ethernet connection so I don't think it's connectivity
that's taking down the connections.  I've patched the qpid code to use the
monotonic rather than realtime clock so GPS leap seconds (which change the
system clock) wouldn't cause timeouts.  What could be causing it?
b) Is my backoff logic sensible?  Should I be recreating the connection /
session / sender / receiver instead of trying to reuse them?
c) What's causing the proliferation of binding, subscription and queue
objects?
d) Are there any other settings I can be supplying in the address or broker
config that will mitigate the effect?
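
To make the first question more concrete, here is a minimal sketch of heartbeat timeout detection on a monotonic clock, which is the idea behind the clock patch mentioned there. `HeartbeatMonitor` is a hypothetical class, not qpid code, and the rule that a connection counts as timed out after two missed heartbeat intervals is an assumption about how the timeout is applied.

```cpp
#include <chrono>

// Hypothetical watchdog: tracks when traffic was last seen on a connection
// and reports a timeout after two heartbeat intervals of silence.
class HeartbeatMonitor {
public:
    // steady_clock never jumps when the system (wall) clock is adjusted.
    using Clock = std::chrono::steady_clock;
    static_assert(Clock::is_steady, "timeouts need a monotonic clock");

    HeartbeatMonitor(std::chrono::seconds interval, Clock::time_point start)
        : timeout_(2 * interval), lastSeen_(start) {}

    // Call whenever any frame (data or heartbeat) arrives.
    void onTraffic(Clock::time_point now) { lastSeen_ = now; }

    // True if more than two intervals have passed without traffic.
    bool timedOut(Clock::time_point now) const {
        return (now - lastSeen_) > timeout_;
    }

private:
    Clock::duration timeout_;
    Clock::time_point lastSeen_;
};
```

With a 1s heartbeat this gives a 2s window, so a receive thread that stalls for a couple of seconds would be enough to trigger a "timed out" close even on a healthy wired link.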

Thanks for your help,

Neil


