[jira] [Resolved] (QPID-4201) Destination cluster de-sync when federation link used for a longer time

Alan Conway (JIRA) Thu, 17 Jan 2013 08:34:13 -0800

     [ 
https://issues.apache.org/jira/browse/QPID-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alan Conway resolved QPID-4201.
-------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 0.19)
                   0.20

This issue affects the old cluster, which is no longer part of Qpid as of the 
0.20 release.
                
> Destination cluster de-sync when federation link used for a longer time
> -----------------------------------------------------------------------
>
>                 Key: QPID-4201
>                 URL: https://issues.apache.org/jira/browse/QPID-4201
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.18
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.20
>
>
> (see also  https://bugzilla.redhat.com/show_bug.cgi?id=836141)
> Description of problem:
> Using queue state replication from a broker (possibly clustered; this does 
> not matter) to a cluster of brokers causes the destination cluster to de-sync 
> after a long time:
> 2012-06-28 08:28:30 critical Error delivering frames: local error did not 
> occur on all cluster members : invalid-argument: 
> @QPID.77153a41-7531-47f6-bf55-b30ffed69922: confirmed < (4799+0) but only 
> sent < (4797+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
> Version-Release number of selected component (if applicable):
> every version checked
> How reproducible:
> Depends on run time; about 10% for the default scenario
> Steps to Reproduce:
> (Ideally, if possible, rebuild qpid with cpp/src/qpid/SessionState.cpp 
> changed to: static const uint32_t SPONTANEOUS_REQUEST_INTERVAL = 64. This 
> speeds up the reproducer very significantly.)
> 1) Set up a source broker (or cluster; this does not matter) and a 
> dest.cluster, with queue state replication of just one queue from the source 
> to the dest.cluster.
> 2) On the federation route, set --ack to some low number (to speed up 
> replication; I used --ack 5).
> 3) Randomly produce and consume messages on the src.broker for the queue 
> being replicated. Ideally, alternate the enqueues and dequeues as much as 
> possible. I don't know why, but more alternation also speeds up the 
> reproducer.
> 4) Now, be patient. After SPONTANEOUS_REQUEST_INTERVAL (by default 64k) 
> synchronization messages have been sent _from_ the backup cluster (which 
> requires around 100 times as many messages to be enqueued and dequeued on the 
> replicated queue), there is a probability of hitting the bug. It was hit once 
> on the first attempt (after 2^16 = 64k such synchronization messages) and 
> once after 720896 messages (in the 11th "round" / "trial").
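
The round counts in step 4 can be checked with a little arithmetic. This is only a sketch: it assumes the default interval is exactly 2^16, as the report states, and the helper name rounds_completed is made up for illustration.

```python
# Default interval as described in the report ("by default 64k", i.e. 2^16).
SPONTANEOUS_REQUEST_INTERVAL = 2 ** 16


def rounds_completed(command_count, interval=SPONTANEOUS_REQUEST_INTERVAL):
    """Return (full_rounds, remainder) for a count of synchronization commands.

    Hypothetical helper, not part of qpid.
    """
    return divmod(command_count, interval)


print(rounds_completed(65536))        # (1, 0): first attempt
print(rounds_completed(720896))       # (11, 0): 11th round
print(rounds_completed(720896, 64))   # (11264, 0): with the interval rebuilt to 64
```

Both reported failures fall exactly on a multiple of the interval, which fits the description of the bug only having a chance to trigger once per round.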
>   
> Actual results:
> All brokers in dst.cluster (except the one that has the fed.link 
> established) shut down with this log:
> 2012-06-27 15:39:46 critical Error delivering frames: local error did not 
> occur on all cluster members : invalid-argument: 
> @QPID.314e73e8-8bc3-4f5a-b77d-6bdd4ee17e39: confirmed < (720895+0) but only 
> sent < (720893+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
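
The log line above suggests a sender-side check that rejects a confirmation pointing past what the session believes it has sent. The following is a rough model of that situation, not the qpid implementation: the names sender_confirmed and InvalidArgument are hypothetical, and session points are modelled as plain (command, offset) tuples.

```python
class InvalidArgument(Exception):
    """Stand-in for the invalid-argument error seen in the log."""


def sender_confirmed(session_id, confirmed, send_point):
    """Reject a confirmation that is ahead of the local send point.

    Both points are (command, offset) tuples, compared lexicographically.
    Hypothetical sketch, not the code in qpid/SessionState.cpp.
    """
    if confirmed > send_point:
        raise InvalidArgument(
            f"{session_id}: confirmed < ({confirmed[0]}+{confirmed[1]}) "
            f"but only sent < ({send_point[0]}+{send_point[1]})"
        )


# Assumed send points: the broker owning the fed.link counted every command,
# while a shadow member missed two of them.
members = {"local": (720895, 0), "shadow": (720893, 0)}
confirmed = (720895, 0)

failed = []
for name, send_point in members.items():
    try:
        sender_confirmed("@QPID.example", confirmed, send_point)
    except InvalidArgument as err:
        failed.append(name)
        print(name, "->", err)

# An error on only some members is what the cluster reports as
# "local error did not occur on all cluster members".
print("inconsistent:", 0 < len(failed) < len(members))
```

In this model only the shadow member raises the error, which matches the reported outcome where every broker except the one holding the federation link shuts down.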
> Expected results:
> No such cluster de-sync
> Additional info:
> - Interesting fact: I was able to reproduce it only when using queue state 
> replication. Although the bug is on the federation link session, using a 
> fed.link without queue state replication did not lead to the bug.
> - The difference comes from the _beginning_ of the session communication. 
> According to some traces, these AMQP messages sent from dst.cluster to the 
> source are _not_ replayed by (not even multicast to) the "other dst.brokers" 
> (which have the session / connection as shadow, not local). So these 
> messages are not replayed:
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent 
> cmd 0: {MessageSubscribeBody: queue=replication-queue; 
> destination=replication-exchange; accept-mode=0; acquire-mode=0; resume-id 
> resume-ttl=0; arguments={qpid.sync_frequency:F4:int32(100)}; }
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent 
> cmd 1: {MessageFlowBody: destination=replication-exchange; unit=0; 
> value=4294967295; }
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent 
> cmd 2: {MessageFlowBody: destination=replication-exchange; unit=1; 
> value=4294967295; }
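
For readers decoding the two MessageFlowBody traces: in AMQP 0-10, credit unit 0 means messages, unit 1 means bytes, and a value of 0xFFFFFFFF stands for unlimited credit. A small sketch that interprets such a trace fragment follows; the function name describe_flow is made up, and the input is the trace text joined onto one line.

```python
import re

# AMQP 0-10 message.flow credit units and the "infinite credit" sentinel.
CREDIT_UNITS = {0: "message", 1: "byte"}
INFINITE_CREDIT = 0xFFFFFFFF


def describe_flow(body):
    """Return (unit_name, credit) for a MessageFlowBody trace fragment.

    Hypothetical helper for reading broker traces, not part of qpid.
    """
    match = re.search(r"unit=(\d+);\s*value=(\d+);", body)
    if match is None:
        raise ValueError("not a MessageFlowBody trace fragment")
    unit = CREDIT_UNITS[int(match.group(1))]
    value = int(match.group(2))
    return unit, ("unlimited" if value == INFINITE_CREDIT else value)


cmd1 = "{MessageFlowBody: destination=replication-exchange; unit=0; value=4294967295; }"
cmd2 = "{MessageFlowBody: destination=replication-exchange; unit=1; value=4294967295; }"
print(describe_flow(cmd1))  # ('message', 'unlimited')
print(describe_flow(cmd2))  # ('byte', 'unlimited')
```

So commands 1 and 2 grant the replication subscription unlimited message and byte credit right after command 0 creates it.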
