> On April 1, 2014, 9:17 p.m., Alan Conway wrote:
> > Sorry, catching up. I think this patch is abusing the bootsequence, which really has nothing at all to do with exchange sequence numbers. It would work on a standalone broker (but still be an abuse, IMO) but it doesn't work for HA. The base for exchange sequence numbers has to be synchronized across an HA cluster, whereas the number of times each broker has rebooted *cannot* be synchronized without breaking its existing meaning as a reboot count. I'm also a bit wary of the division of the sequence number into 17 bits of "restart" count and 47 bits of "message" count. What are the grounds for that division? Is it safe to assume no overflows?
> > 
> > I see the goal of avoiding persisting the sequence number on every message, but how about persisting it (per exchange) once every suitably large number of messages? E.g. persist an initial next-sequence of 64k for each exchange, with the sequence number starting at 1. Once the sequence number gets within 32k of next-sequence we increase next-sequence by 64k, etc. On reboot or on promotion of the primary we jump each exchange's sequence number ahead to the exchange's next-sequence number.
> > 
> > That lets us take smaller jumps on each restart/failover (16 bits vs. 47 bits) and avoids using a single counter for every exchange, which makes overflow less likely. The next-sequence number for each exchange would need to be persisted, replicated on joining (both can be done by putting it back in the args) and replicated via a management event when it is increased on the primary. Jumping the next-sequence number ahead while we still have 32k messages left to go means that we have time to store & replicate before we pass the new limit. The next-sequence doesn't have to be precisely synchronized in a cluster provided it is always a good margin bigger than the highest sequence number used by that exchange on the primary so far. During promotion (before allowing backups to join) we would need to jump each exchange's sequence number ahead to the next-sequence for that exchange and increase next-sequence by 64k, to make sure the new primary is using higher sequence numbers than the old.
> > 
> > Are we sure nobody assumes that exchange sequence numbers are sequential? The above is fine for duplicate detection, but are we sure nobody does missed-message detection by checking for gaps in the exchange sequence numbers? That would complicate everything quite a bit...
> 
> Pavel Moravec wrote:
>     This seems sound to me.
>     To sum it up:
>     - exchanges with sequencing get a new argument qpid.next_sequenceNo, by default set to 64k for new exchanges, or to the value read from the store during startup (or updated from the primary broker)
>     - if the argument is missing in the stored args, it will be added there automatically (to support upgrades)
>     - when incrementing sequenceNo in preRoute, if args(qpid.next_sequenceNo) - sequenceNo == 32k, then execute 3 steps:
>       - increment qpid.next_sequenceNo by 64k
>       - update the store BDB entry for the exchange
>       - if a clustered broker and (primary or being promoted), notify backups about the increment of qpid.next_sequenceNo
>     - when joining a cluster, the active broker will update the joiner automatically (via exchange arguments) - no change needed
>     - when promoting a node to active:
>       - set sequenceNo = args(qpid.next_sequenceNo)
>       - execute the same 3 steps as in "when incrementing sequenceNo"
> 
>     Two questions:
>     1) Can a node being promoted already update backups? Or is there a way of postponing the update? (see the last point above)
>     2) What if - after a new promotion - the broker sends a few messages with sequenceNo of, say, 128k, but dies before it really writes args(qpid.next_sequenceNo)=(128+64)k to disk? Can that even happen?
> 
>     The sequence number will then jump ahead by at most 96k whenever a standalone broker is restarted or a new promotion in the cluster happens (or a cold start of the cluster happens). That means there can be 93824992236885 such "jump events" without sequence number overflow (or 93824671436104 "jump events" with a steady message flow of 1M messages per second for 10 years). More than enough.
> 
>     Comment on "Are we sure nobody assumes that exchange sequence numbers are sequential?" - Consumers get messages from queues, and no queue can be guaranteed to get _all_ messages routed by the exchange (unless all know the exchange is fanout, or the binding key is empty, or so - but I think nobody can build on this premise). So jumps in exchange sequence numbers can always be read as "the skipped messages were not meant for this queue".
> 
> Alan Conway wrote:
>     Notifying backups of changes is done via QMF events, so a broker can always emit a "nextSequenceNo changed" event, and if there are backups listening they will pick it up.
> 
>     So to your questions:
> 
>     1) A node being promoted can emit QMF events, but there is no guaranteed ordering that the Exchange response will be seen before any QMF events for that exchange. This is a real race condition; I have made fixes for other special cases of this problem. Either we can make another special case, or I have an idea for a more general fix for the problem. Let me look at this; for your work, assume that it will be OK to send a nextSequenceNo event during update. I'll update this review when I have a patch for the race condition.
> 
>     2) The reason I propose updating nextSequenceNo when we get within 32k of the current value is to ensure we have time to write the disk & do the updates before the actual sequence number is used. E.g. say nextSeqNo=128k, actualSeqNo=96k. At this point we will update nextSeqNo to 192k. I am assuming that we will not receive 32k new messages before we can write this to disk, so even if we crash before the write, with nextSeqNo=128k still on disk, it will still be higher than any actual sequence number we have sent. I am also assuming that the nextSeqNo update QMF event will be processed by backups before we receive 32k new messages, to ensure that if we fail, the new primary will have a sufficient nextSeqNo. We could increase the safety margin at the cost of decreasing our jumps-to-overflow.
> 
>     The alternative to a "safety margin" would be to effectively lock the Exchange while we are updating the nextSeqNo on disk and in the cluster. That might be OK for performance since we don't do it often. However, it would be hard to implement without deadlocks, especially the cluster update part, which would also require a new feedback mechanism in the cluster. I prefer the safety-margin approach as being simpler, provided the margin is generous enough.
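
For concreteness, here is a minimal C++ sketch of the margin scheme being discussed. The class and the persistNextSequenceNo/notifyBackups hooks are illustrative stand-ins, not the actual Qpid broker API:

    #include <cstdint>

    // Sketch of the proposed scheme; all names are illustrative.
    class SequencedExchange {
        static const uint64_t JUMP   = 64 * 1024; // qpid.next_sequenceNo increment
        static const uint64_t MARGIN = 32 * 1024; // headroom that triggers the bump

        uint64_t sequenceNo;     // per-message counter incremented in preRoute
        uint64_t nextSequenceNo; // persisted/replicated bound (qpid.next_sequenceNo)

        void persistNextSequenceNo() { /* stand-in for the BDB store update */ }
        void notifyBackups()         { /* stand-in for the QMF event to backups */ }

    public:
        SequencedExchange() : sequenceNo(0), nextSequenceNo(JUMP) {}

        // Called once per routed message.
        uint64_t preRoute() {
            ++sequenceNo;
            if (nextSequenceNo - sequenceNo == MARGIN) {
                // Still 32k messages of headroom: bump, persist and replicate
                // the bound before the old one can be reached.
                nextSequenceNo += JUMP;
                persistNextSequenceNo();
                notifyBackups();
            }
            return sequenceNo;
        }

        // On restart or promotion: jump past anything the old primary may have used.
        void promote() {
            sequenceNo = nextSequenceNo;
            nextSequenceNo += JUMP;
            persistNextSequenceNo();
            notifyBackups();
        }
    };

With these constants the store is written once per 64k messages, and, as computed above, each restart or promotion jumps the sequence number ahead by at most 96k.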
> Pavel Moravec wrote:
>     Re 2) I meant the scenario when a broker is being promoted. It then sets actualSeqNo=nextSeqNo; nextSeqNo+=64k; (and writes that to disk and notifies backups). Until the broker writes the update to disk, there is in fact no safety margin available. But to really assign a duplicate sequence number to a message, _all_ brokers in the cluster would have to be killed before writing the updated nextSequenceNo to disk - a low-probability event we can live with, in my opinion.
> 
> Alan Conway wrote:
>     We can solve this by making sure a newly promoted primary does not go READY until all nextSeqNo values have been written to disk and all exchanges have been replicated with the updated nextSeqNo. That's not the current behavior (we only wait for queues to be ready), but I'll fix that.
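
Alan's fix amounts to gating the READY transition of a newly promoted primary on the persistence and replication of the bumped bounds. A rough sketch of that ordering, reusing the SequencedExchange sketch above (the state and wait helpers are hypothetical, not existing broker functions):

    #include <vector>

    enum BrokerState { RECOVERING, READY };

    void waitForStoreFlush() { /* hypothetical: block until BDB writes land */ }
    void waitForBackupAcks() { /* hypothetical: block until backups saw the events */ }
    void setState(BrokerState s) { /* stand-in for the broker state machine */ }

    void promoteToPrimary(std::vector<SequencedExchange*>& exchanges) {
        // Bump every exchange first: sequenceNo = nextSeqNo; nextSeqNo += 64k.
        for (SequencedExchange* ex : exchanges)
            ex->promote();
        waitForStoreFlush();   // every nextSeqNo durable on disk
        waitForBackupAcks();   // every exchange replicated with the new bound
        setState(READY);       // only now accept clients and let backups join
    }

The point of the ordering is that the duplicate-sequence-number window Pavel describes is closed before any client can route a message through the new primary.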
Hi Alan, is the above fix implemented?

- Pavel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19566/#review39200
-----------------------------------------------------------


On April 1, 2014, 10:45 a.m., Pavel Moravec wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19566/
> -----------------------------------------------------------
> 
> (Updated April 1, 2014, 10:45 a.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Kim van der Riet.
> 
> 
> Bugs: QPID-5642
>     https://issues.apache.org/jira/browse/QPID-5642
> 
> 
> Repository: qpid
> 
> 
> Description
> -------
> 
> An elegant (but not performance-optimal) shape of the patch:
> 
> 1) In Exchange::PreRoute::PreRoute, update the exchange in the store (the whole entry).
> 2) The update method is handled very similarly to the MessageStoreImpl::create(db_ptr db, IdSequence& seq, const qpid::broker::Persistable& p) method, i.e. it calls BufferValue, which calls Exchange::encode.
> 
> Here the code can be unified by merging MessageStoreImpl::update into the MessageStoreImpl::create method, where the code is almost duplicated.
> 
> However, I do not see the patch as performance-efficient: with every message preroute, a new qpid::framing::Buffer is filled in the Exchange::encode method, data are copied from it to char* BufferValue::data, and only then are they actually written to the BDB - while in fact we just update a single number in the Buffer.
> 
> I tried to come up with a less performance-invasive approach (for those interested, see https://bugzilla.redhat.com/attachment.cgi?id=877576&action=diff - if you don't have access there, let me know), which keeps a qpid::framing::Buffer for every durable exchange with sequencing enabled, but it led (among other things) to the need to change the way the store encodes/decodes an Exchange instance (a change in the Exchange::encode / decode methods), which would make the broker backward-incompatible.
> 
> Is the performance penalty (due to the Exchange::encode method being called for every message preroute) acceptable?
> 
> Is it worth merging MessageStoreImpl::update into the MessageStoreImpl::create method?
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/Exchange.cpp 1582719 
> 
> Diff: https://reviews.apache.org/r/19566/diff/
> 
> 
> Testing
> -------
> 
> - reproducer from JIRA verified
> - automated tests passed (except for those known to fail due to QPID-5641 (valgrind & legacystore))
> 
> 
> Thanks,
> 
> Pavel Moravec
> 
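
To illustrate the performance concern raised in the description above: the posted patch's approach amounts to re-encoding and rewriting the whole exchange record once per routed message. A simplified C++ sketch follows; Store, ExchangeRecord and their members are stand-ins, not the legacystore API:

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Store { // stand-in for the BDB-backed store
        void put(const std::string& key, const char* data, size_t size) {
            (void)key; (void)data; (void)size; // real code writes the BDB record
        }
    };

    struct ExchangeRecord { // stand-in for qpid::broker::Exchange
        std::string name;
        std::vector<char> staticPart; // name, type, args... rarely change
        uint64_t sequenceNo;          // the only field that actually changes

        std::vector<char> encode() const { // mimics re-encoding the whole entry
            std::vector<char> buf(staticPart);
            const char* p = reinterpret_cast<const char*>(&sequenceNo);
            buf.insert(buf.end(), p, p + sizeof sequenceNo);
            return buf;
        }
    };

    // Called once per routed message under the posted patch's approach:
    // a full re-encode and a full record rewrite just to bump one counter.
    void updateOnPreRoute(Store& db, ExchangeRecord& ex) {
        ++ex.sequenceNo;
        std::vector<char> buf = ex.encode();
        db.put(ex.name, buf.data(), buf.size());
    }

The margin scheme discussed earlier in the thread reduces this to one such write per 64k messages.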