[ https://issues.apache.org/jira/browse/QPID-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Conway resolved QPID-4201.
-------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 0.19)
                   0.20

This issue affects the old cluster, which is no longer part of Qpid as of the 0.20 release.

> Destination cluster de-sync when federation link used for a longer time
> -----------------------------------------------------------------------
>
>                 Key: QPID-4201
>                 URL: https://issues.apache.org/jira/browse/QPID-4201
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.18
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.20
>
>
> (see also https://bugzilla.redhat.com/show_bug.cgi?id=836141)
> Description of problem:
> Using queue state replication from a broker (possibly clustered; this does
> not matter) to a cluster of brokers causes a cluster de-sync after a long time:
> 2012-06-28 08:28:30 critical Error delivering frames: local error did not
> occur on all cluster members : invalid-argument:
> @QPID.77153a41-7531-47f6-bf55-b30ffed69922: confirmed < (4799+0) but only
> sent < (4797+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
> Version-Release number of selected component (if applicable):
> every version checked
> How reproducible:
> Depends on run time, but 10% for the default scenario
> Steps to Reproduce:
> (Ideally, if possible, rebuild qpid with this change in
> cpp/src/qpid/SessionState.cpp: static const uint32_t
> SPONTANEOUS_REQUEST_INTERVAL = 64. This speeds up the reproducer very
> significantly.)
> 1) Have a source broker (or cluster; this does not matter) and a dest.cluster
> with queue state replication of just one queue from the source to dest.cluster.
> 2) On the federation route, set --ack to a low number (to speed up
> replication; I used --ack 5).
> 3) Randomly produce and consume messages on the src.broker for the queue to be
> replicated. Ideally, alternate the enqueues and dequeues as much as
> possible. I don't know why, but more alternation speeds up the reproducer as well.
> 4) Now, be patient. After SPONTANEOUS_REQUEST_INTERVAL (by default 64k)
> synchronization messages have been sent _from_ the backup cluster (which
> requires around 100 times more messages to be enqueued and dequeued on the
> replicated queue), there is a chance of hitting the bug. It was hit once on
> the first attempt (after 2^16 = 64k such synchronization messages) and once
> after 720896 messages (in the 11th "round" / "trial").
>
> Actual results:
> All brokers in dst.cluster, except the one that has the fed.link established,
> shut down with this log:
> 2012-06-27 15:39:46 critical Error delivering frames: local error did not
> occur on all cluster members : invalid-argument:
> @QPID.314e73e8-8bc3-4f5a-b77d-6bdd4ee17e39: confirmed < (720895+0) but only
> sent < (720893+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
> Expected results:
> No such cluster de-sync
> Additional info:
> - Interesting fact: I was able to reproduce it only with queue state
> replication. Although the bug is on the federation link session, using a
> fed.link without queue state replication did not lead to the bug.
> - The difference comes from the _beginning_ of the session communication.
> According to some traces, these AMQP messages sent from dst.cluster to the
> source are _not_ replayed by (not even multicast to) the "other dst.brokers"
> (those that have the session / connection as shadow, not local).
> So these messages are not replayed:
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent
> cmd 0: {MessageSubscribeBody: queue=replication-queue;
> destination=replication-exchange; accept-mode=0; acquire-mode=0; resume-id
> resume-ttl=0; arguments={qpid.sync_frequency:F4:int32(100)}; }
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent
> cmd 1: {MessageFlowBody: destination=replication-exchange; unit=0;
> value=4294967295; }
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent
> cmd 2: {MessageFlowBody: destination=replication-exchange; unit=1;
> value=4294967295; }
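
As a quick check of the numbers in step 4 and in the two error logs, the
sketch below (Python, not part of the original report) divides each
"confirmed" bound plus one by SPONTANEOUS_REQUEST_INTERVAL and also prints
the gap between the "confirmed" and "sent" bounds. It assumes that the
message count is the confirmed bound plus one (which matches the 720896
quoted in step 4) and that the first log came from a broker rebuilt with the
interval set to 64, which the report does not state. The helper names are
hypothetical.

```python
# Values copied from the report; the interpretation is an assumption.
DEFAULT_INTERVAL = 2 ** 16  # "by default 64k"
REBUILT_INTERVAL = 64       # value suggested for a rebuilt broker (assumed for log 1)


def rounds_completed(confirmed_bound, interval):
    """Return (full rounds, remainder) for confirmed_bound + 1 commands."""
    return divmod(confirmed_bound + 1, interval)


def bound_gap(confirmed_bound, sent_bound):
    """Return how far the confirmed bound runs ahead of the sent bound."""
    return confirmed_bound - sent_bound


# Log from 2012-06-27: confirmed < (720895+0) but only sent < (720893+0)
print(rounds_completed(720895, DEFAULT_INTERVAL), bound_gap(720895, 720893))

# Log from 2012-06-28: confirmed < (4799+0) but only sent < (4797+0)
print(rounds_completed(4799, REBUILT_INTERVAL), bound_gap(4799, 4797))
```

Under these assumptions both failures fall exactly on a multiple of the
interval (11 rounds of 65536 and 75 rounds of 64), and in both logs the
confirmed bound is two commands ahead of the sent bound.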