> On April 1, 2014, 9:17 p.m., Alan Conway wrote:
> > Sorry, catching up. I think this patch is abusing the bootsequence, which 
> > really has nothing at all to do with exchange sequence numbers. It would 
> > work on a standalone broker (but would still be an abuse, IMO), and it 
> > doesn't work for HA. The base for exchange sequence numbers has to be 
> > synchronized across an HA cluster, whereas the number of times each broker 
> > has rebooted *cannot* be synchronized without breaking its existing meaning 
> > as a reboot count. I'm also a bit wary of the division of the sequence 
> > number into 17 bits for "restart" count and 47 bits for "message" count. 
> > What are the grounds for that division? Is it safe to assume no overflows?
> > 
> > I see the goal of avoiding persisting the sequence number on every message, 
> > but how about persisting it (per exchange) every suitably large number of 
> > messages? E.g. persist an initial next-sequence of 64k for each exchange, 
> > with the sequence number starting at 1. Once the sequence number gets 
> > within 32k of next-sequence we increase next-sequence by 64k, etc. On 
> > reboot or on promotion of the primary we jump each exchange's sequence 
> > number ahead to that exchange's next-sequence number.
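> > 
> > A minimal sketch of what I have in mind (illustrative names only, not the 
> > real Exchange code):
> > 
> >     #include <cstdint>
> > 
> >     class SequencedExchange {
> >         static const uint64_t CHUNK  = 64 * 1024;  // next-sequence advances in 64k steps
> >         static const uint64_t MARGIN = 32 * 1024;  // advance early, with 32k numbers still left
> > 
> >         uint64_t sequenceNo;    // last sequence number stamped on a message
> >         uint64_t nextSequence;  // persisted & replicated upper bound
> > 
> >         // Write nextSequence to the per-exchange store record (stub here).
> >         void persistNextSequence() {}
> >         // Emit a management event so backups learn the new value (stub here).
> >         void replicateNextSequence() {}
> > 
> >     public:
> >         SequencedExchange() : sequenceNo(0), nextSequence(CHUNK) {}
> > 
> >         // Called for every message routed through the exchange (preRoute).
> >         uint64_t assignSequence() {
> >             ++sequenceNo;
> >             if (nextSequence - sequenceNo <= MARGIN) {
> >                 nextSequence += CHUNK;   // move the bound well before we need it
> >                 persistNextSequence();
> >                 replicateNextSequence();
> >             }
> >             return sequenceNo;
> >         }
> > 
> >         // Called on reboot of a standalone broker or on promotion of a primary.
> >         void jumpToPersistedBound(uint64_t storedNextSequence) {
> >             sequenceNo   = storedNextSequence;  // skip anything possibly used earlier
> >             nextSequence = storedNextSequence + CHUNK;
> >             persistNextSequence();
> >             replicateNextSequence();
> >         }
> >     };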
> > 
> > That lets us take smaller jumps on each restart/failover (16 bits vs. 47 
> > bits) and avoids using a single counter for every exchange which makes 
> > overflow less likely. The next-sequence number for each exchange would need 
> > to be persisted, replicated on joining (both can be done by putting it back 
> > in the args) and replicated via a management event when it is increased on 
> > the primary. Jumping the next-sequence number ahead when we still have 32k 
> > messages left to go means that we have time to store & replicate before we 
> > pass the new limit. The next-sequence doesn't have to be precisely 
> > synchronized in a cluster provided it is always a good margin bigger than 
> > the highest sequence number used by that exchange on the primary so far. 
> > During promotion (before allowing backups to join) we would need to jump 
> > each exchange sequence ahead to the next-sequence for the exchange and 
> > increase next-sequence by 64k to make sure the new primary is using higher 
> > sequence numbers than the old.
> > 
> > Are we sure nobody assumes that exchange sequence numbers are sequential? 
> > The above is fine for duplicate detection but are we sure nobody does 
> > missed message detection by checking for gaps in the exchange sequence 
> > numbers? That would complicate everything quite a bit...
> > 
> > 
> >
> 
> Pavel Moravec wrote:
>     This seems sound to me. To sum it up (a rough sketch follows the list):
>     - exchanges with sequencing enabled get a new argument 
> qpid.next_sequenceNo, set to 64k by default for new exchanges, or to the 
> value read from the store during startup (or updated from the primary broker)
>       - if the argument is missing in the stored args, it will be added there 
> automatically (to support upgrades)
>     - when incrementing sequenceNo in preRoute, if args(qpid.next_sequenceNo) 
> - sequenceNo == 32k, then execute 3 steps:
>       - increment qpid.next_sequenceNo by 64k
>       - update the store (BDB) entry for the exchange
>       - if a clustered broker and (primary or being promoted), notify backups 
> about the increment of qpid.next_sequenceNo
>     - when joining a cluster, the active broker will update the joining one 
> automatically (via exchange arguments) - no change needed
>     - when promoting a node to active:
>       - set sequenceNo = args(qpid.next_sequenceNo)
>       - execute the same 3 steps as in "when incrementing sequenceNo"
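>
>     A small sketch of the rules above (a std::map stands in for the exchange 
> argument table; all names are illustrative, not the actual broker code):
>
>         #include <cstdint>
>         #include <map>
>         #include <string>
>
>         typedef std::map<std::string, int64_t> ExchangeArgs;
>         const int64_t STEP = 64 * 1024, MARGIN = 32 * 1024;
>
>         // The three steps: bump the bound, persist it, tell the backups.
>         void bumpNextSequenceNo(ExchangeArgs& args) {
>             args["qpid.next_sequenceNo"] += STEP;
>             // ...update the exchange's BDB entry here...
>             // ...if clustered and primary (or being promoted), notify backups...
>         }
>
>         // preRoute: increment and, once the margin is reached, run the three steps.
>         void onPreRoute(int64_t& sequenceNo, ExchangeArgs& args) {
>             ++sequenceNo;
>             if (args["qpid.next_sequenceNo"] - sequenceNo == MARGIN)
>                 bumpNextSequenceNo(args);
>         }
>
>         // promotion to active: jump ahead, then run the same three steps.
>         void onPromotion(int64_t& sequenceNo, ExchangeArgs& args) {
>             sequenceNo = args["qpid.next_sequenceNo"];
>             bumpNextSequenceNo(args);
>         }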
>     
>     Two questions:
>     1) Can a node that is being promoted already update backups? Or is there 
> a way of postponing the update? (see the last point above)
>     2) What if - after a new promotion - the broker sends a few messages with 
> sequenceNo of, say, 128k but dies before it actually writes 
> args(qpid.next_sequenceNo)=(128+64)k to disk? Can that even happen?
>     
>     
>     The sequence number will then jump ahead by at most 96k whenever a 
> standalone broker is restarted or a new promotion happens in the cluster (or 
> the cluster is cold-started). That means there can be 93824992236885 such 
> "jump events" without sequence number overflow (or 93824671436104 "jump 
> events" with a steady message flow of 1M messages per second for 10 years). 
> More than fair enough.
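>
>     (For reference, the first figure corresponds to assuming a signed 64-bit 
> sequence counter, i.e. 2^63 usable values, and a worst-case jump of 
> 96k = 98304 numbers: floor(2^63 / 98304) = 93824992236885.)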
>     
>     
>     Comment on "Are we sure nobody assumes that exchange sequence numbers are 
> sequential?" - Consumers get messages from queues, and no queue can be 
> guaranteed to receive _all_ messages routed by the exchange (unless one knows 
> the exchange is a fanout, or the binding key matches everything, or similar - 
> but I think nobody can build on that premise). So gaps in exchange sequence 
> numbers can always be interpreted as "the skipped messages were not meant for 
> this queue".
>
> 
> Alan Conway wrote:
>     Notifying backups of changes is done via QMF events, so a broker can 
> always emit a "nextSequenceNo changed" event and if there are backups 
> listening they will pick it up. 
>     
>     So to your questions:
>     
>     1) A node being promoted can emit QMF events, but there is no guaranteed 
> ordering that the Exchange response will be seen before any QMF events for 
> that exchange. This is a real race condition; I have made fixes for other 
> special cases of this problem. Either we can make another special case, or I 
> have an idea for a more general fix. Let me look at this; for your work, 
> assume that it will be OK to send a nextSequenceNo event during update. I'll 
> update this review when I have a patch for the race condition.
>     
>     2) The reason I propose updating nextSeqNo when we get within 32k of it 
> is to ensure we have time to write to disk & do the updates before the actual 
> sequence number catches up. E.g. say nextSeqNo=128k and actualSeqNo reaches 
> 96k. At this point we update nextSeqNo to 192k. I am assuming that we will 
> not receive 32k new messages before we can write this to disk, so even if we 
> crash before the write, with nextSeqNo=128k still on disk, it will still be 
> higher than any actual sequence number we have sent. I am also assuming that 
> the nextSeqNo update QMF event will be processed by backups before we receive 
> 32k new messages, to ensure that if we fail the new primary will have a 
> sufficiently high nextSeqNo. We could increase the safety margin at the cost 
> of decreasing our jumps-to-overflow.
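>
>     In timeline form (same assumptions as above: 64k chunks, 32k margin):
>
>         on disk / on backups: nextSeqNo = 128k
>         in memory:            actualSeqNo reaches 96k  -> margin hit
>         in memory:            nextSeqNo = 192k; start the store write + QMF event
>         crash before either completes -> disk and backups still say 128k,
>         which is still above 96k, the highest sequence number actually used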
>     
>     The alternative to a "safety margin" would be to effectively lock the 
> Exchange while we update nextSeqNo on disk and in the cluster. That might be 
> OK for performance, since we don't do it often. However, it would be hard to 
> implement without deadlocks, especially the cluster-update part, which would 
> also require a new feedback mechanism in the cluster. I prefer the 
> safety-margin approach as the simpler one, provided the margin is generous 
> enough.
>     
>     
>
> 
> Pavel Moravec wrote:
>     Re 2) I meant the scenario where a broker is being promoted. It then sets 
> actualSeqNo=nextSeqNo; nextSeqNo+=64k; (and writes that to disk and notifies 
> backups). Until the broker writes the update to disk, there is in fact no 
> safety margin available. But to actually end up with a duplicate sequence 
> number assigned to a message, _all_ brokers in the cluster would have to be 
> killed before the updated nextSequenceNo is written to disk - a 
> low-probability event we can live with, in my opinion.
> 
> Alan Conway wrote:
>     We can solve this by making sure a newly promoted primary does not go 
> READY until all nextSeqNo values have been written to disk and all exchanges 
> have been replicated with the updated nextSeqNo. That's not the current 
> behavior - we only wait for queues to be ready - but I'll fix that.

Hi Alan,
has the above fix been implemented?


- Pavel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19566/#review39200
-----------------------------------------------------------


On April 1, 2014, 10:45 a.m., Pavel Moravec wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19566/
> -----------------------------------------------------------
> 
> (Updated April 1, 2014, 10:45 a.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Kim van der Riet.
> 
> 
> Bugs: QPID-5642
>     https://issues.apache.org/jira/browse/QPID-5642
> 
> 
> Repository: qpid
> 
> 
> Description
> -------
> 
> An elegant (but not performance-optimal) way to patch this:
> 
> 1) In Exchange::PreRoute::PreRoute, update the exchange in the store (the 
> whole entry).
> 2) The update method is handled very similarly to the 
> MessageStoreImpl::create(db_ptr db, IdSequence& seq, const 
> qpid::broker::Persistable& p) method, i.e. it builds a BufferValue that calls 
> Exchange::encode.
> 
> Here the code could be unified by merging MessageStoreImpl::update into the 
> MessageStoreImpl::create method, as the code is almost duplicated.
> 
> However, I do not see the patch as performance efficient: with every message 
> preroute a new qpid::framing::Buffer is filled in the Exchange::encode 
> method, the data are copied from it to char* BufferValue::data, and only then 
> are they actually written to the BDB - while in fact we just update a single 
> number in the Buffer.
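> 
> A rough sketch of the cost being described (the StoreHandle type and the 
> rewriteExchangeRecord call are hypothetical stand-ins; only encodedSize() and 
> encode() are the real Persistable methods):
> 
>     #include <cstddef>
>     #include <vector>
>     #include "qpid/broker/Exchange.h"
>     #include "qpid/framing/Buffer.h"
> 
>     // Hypothetical stand-in for whatever store interface the patch uses.
>     struct StoreHandle {
>         virtual void rewriteExchangeRecord(const char* data, std::size_t size) = 0;
>     };
> 
>     // Called from Exchange::PreRoute::PreRoute for every message routed.
>     void rewriteWholeExchange(qpid::broker::Exchange& exchange, StoreHandle& store)
>     {
>         // The whole exchange record (name, durability, args including the
>         // sequence number) is re-encoded into a fresh framing::Buffer...
>         std::vector<char> record(exchange.encodedSize());
>         qpid::framing::Buffer buffer(&record[0], record.size());
>         exchange.encode(buffer);
> 
>         // ...and the bytes are then copied again and written to BDB, although
>         // only one number in the whole record has actually changed.
>         store.rewriteExchangeRecord(&record[0], record.size());
>     }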
> 
> I tried to come up with a less performance-invasive approach (for those 
> interested, see 
> https://bugzilla.redhat.com/attachment.cgi?id=877576&action=diff - if you 
> don't have access there, let me know), which keeps a qpid::framing::Buffer 
> for every durable exchange with sequencing enabled, but it led (among other 
> things) to the need to change the way the store encodes/decodes an Exchange 
> instance (a change in the Exchange::encode / decode methods), which would 
> make the broker backward incompatible.
> 
> Is the performance penalty (due to the Exchange::encode method being called 
> for every message preroute) acceptable?
> 
> Is it worth merging MessageStoreImpl::update into the 
> MessageStoreImpl::create method?
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/Exchange.cpp 1582719 
> 
> Diff: https://reviews.apache.org/r/19566/diff/
> 
> 
> Testing
> -------
> 
> - reproducer from JIRA verified
> - automated tests passed (except for those known to fail due to QPID-5641 
> (valgrind & legacystore))
> 
> 
> Thanks,
> 
> Pavel Moravec
> 
>
