Hi Carlos, You have mentioned "we get the 'ERR_554 double get for block' and then we need to restore. We have an automated way of doing this now so it's not too disruptive when it happens." . I am running into the same ERR_554 issue, can you let me know how you fixed it?
Thanks, Ashma On Mon, Sep 28, 2015 at 9:15 AM, <[email protected]> wrote: > Emmanuel! Wow!! Thank you for that most detailed and comprehensive write > up of the concurrency problem you are working to solve with Mavibot. The > design sounds world class. > I now fully appreciate the complexity of this transaction issue and the > ASCII art helped make it clear. We realize you're all doing your > best to devise a robust solution for this and so we'll sit tight for now. > I appreciate the time you took to explain this to us. Have fun at the > conference. > Best, Carlo Accorsi > > > > > -----Original Message----- > From: Emmanuel Lécharny [mailto:[email protected]] > Sent: Sunday, September 27, 2015 1:23 PM > To: [email protected] > Subject: Re: ApacheDS with Mavibot anytime soon? > > Le 23/09/15 21:41, [email protected] a écrit : > > Hi, > > hi, > > sorry for the delayed response ! I saw your mail 3 days ago, but I forgot > to answer back then, being caught in a storm of delaying events... > > > > We have one customer that wants to get replication implemented. We've > set it up in house and it seems to work well. > > The trouble is there are 10's of thousands of users and all sorts of > concurrent activity. They're on M16 and a few times per month, we get the > 'ERR_554 double get for block' and then we need to restore. > > We have an automated way of doing this now so it's not too disruptive > when it happens. > Hopefully. Still I understand this is not convenient. > > > > My question is, would adding replication just add more concurrent > activity to the servers and lead more frequently to the ERR_554 situation? > For the consumer, it will be just as if the updates were coming from a > standard user. So, no, I don't think it's a real problem. > > > > Also he's asked to upgrade to M20 but my hunch is that for this issue it > won't matter much because we're still on JDBM? > > True. It's just that some bugs have been fixed in M20. > > > > Finally, we're eager to test out a release with the Mavibot backed. > > I'm sure it's possible to build ourselves but we're trying to keep the > moving parts to a minimum and use only ApacheDS releases. > I understand. > > Mavibot work has been delayed, and we are working on it as much as we can, > but it's not enough. Kiran and I have day jobs atm that forbid us to > dedicate as much time as would be necessary to complete the last part that > is mandatory to get this problem fixed : the transaction system. We > restarted to 'think' about the best implementation 2 weeks ago (on your > free time) and we expect to be able to spend more time on it next week, as > we will both be at Budapest, during the Apache Conference. > > Transaction support (the way we need it, ie cross-btree) is not that > simple to impelment. We have a pretty clear idea on how it should work, but > that mpacts the current design, as we need to woirk in memory, and avoid to > copy pages that have already been modifed in the transaction boundaries. We > also have to deal with the cleanup of old versions, which also means we > need to implement the version manager. > > To give you a insight on what I'm coming for, here is a draft of my > thoughts about this transaction manager (note that this is my vision, that > I haven't shared yet, because I don't think it's covering all the moving > parts, but still, I think it won't be that different to what we will end > with ) : > > Transaction support > ------------------- > > Mavibot must support transaction across many B-tres (ie, whatever the > number of b-trees we are updating, we should wait before committing the > changes up to the point the transaction is closed). That means we may have > many updates pending in memory. In any case, if a rollback is done, the > modified b-trees will remain intact, in the same state they were before the > transaction started. > > To achive this, we need to work in a working memory. As we may have to > fetch pages that are to be updated, we will face three cases : > - the page is not in memory : > we have to fetch them from disk, update them and store it in the working > memory we can read it back if needed > - the page is in the cache : > we have to copy it to the working memory > - the page is in the working memory : we simply update it, leaving it in > the working memory. > > So if we have to modify the same page many times, it will be updated in > the working memory, we won't need to copy it. > > That actually change the current Mavibot implementation, as it's copying > pages no matter what. Here, if the page is coming from the working memory, > we just update it. > > > Layout > ------ > > Disk Cache Working > ___ Memory > / \ .------. > + + / /| .------. > |\___/| .------. | .------. | > + + | | | | | | > |\___/| ----> | | | ----> | | | > + + | |/ | |/ > |\___/| .------. .------. > + + > \___/ > > When the transaction is rollbacked, we simply delete the content of the > working memory. > When the transaction is committed, we write all the pages in the working > memory on disk, and into the cache, we update the header, and we purge the > working memory. > > If a crash occurs during a transaction, then either the transaction is > rollbacked (if the process is still running) - which is just about purging > the working memory -, or there will be nothing to do if we have to restart > the application, as the modified pages were in memory nad have been removed > while the process crashed. > > OTOH, when we read, we must be sure that the B-trees are present, and that > we always use the B-tree revisions that are correlated. This is critical > when using LDAP, where many B-trees are updated during a sinlge write. > > The way to do that is to use a revision, and always use the B-trees which > revision is just below or equal to this revision. Let's see with an example > > +-----+---+---+---+---+---+---+---+... > |\ Rev| | | | | | | | > | \ | | | | | | | | > | \ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 > |B- \ | | | | | | | | > |tree\| | | | | | | L | N > +-----+---+---+---+---+---+---+---+... > | aaa | X | | | X | X | X | | > | | 1 | 1 | 1 | 4 | 5 | 6 | 6 | > +-----+---+---+---+---+---+---+---+... > | bbb | X | X | X | | X | X | X | > | | 1 | 2 | 3 | 3 | 5 | 6 | 7 | > +-----+---+---+---+---+---+---+---+... > | ccc | X | | X | X | X | | X | > | | 1 | 1 | 3 | 4 | 5 | 5 | 7 | > +-----+---+---+---+---+---+---+---+... > | ddd | X | X | | | X | | | > | | 1 | 2 | 2 | 2 | 5 | 5 | 5 | > +-----+---+---+---+---+---+---+---+... > | eee | X | | X | X | | X | | > | | 1 | 1 | 3 | 4 | 4 | 6 | 6 | > +-----+---+---+---+---+---+---+---+... > > L : Latest revision > N : New revision > > In this table, we have had 7 updates, and 7 revsions. As all the B-trees > haven't been updated, we may have some 'holes' in the revision associated > to a B-tree. Typically, here, the 'aaa' B-tree will have 4 revisions (1, 4, > 5 and 6) created, because this B-tree has been updated > 4 times. > > Doing a read on revision 7 will use the following B-trees : aaa[6], > bbb[7], ccc[7], ddd[5], eee[6]. > > > All in all, a read is associated to a global transaction revision which is > used as a marker. Every B-tree update is done using the new revision. > > Managing on-going reads > ----------------------- > > We maintain all the B-tree revisions associated with a read, until the > read is completed, we need to get rid of the old revision when we are done. > This is critical to keep the database size to a minimum. In order to do > that, we must know if an older revision is still in use. We then need to > keep a common data structure that is shared across threads, which will hold > all the used revisions, ordered. With the previous example, assuming > revisions 2, 4 and 6 are in use, we may use a linked list for that purpose : > > +---+ +---+ +---+ > o--| 2 |<--| 4 |<--| 6 | > +--2+ +--1+ +--3+ > > When the read using rev 4 is done, we remove it from this list : > +-------------+ > +---+ | +---+ | +---+ > o--| 2 |<-+ X-| 4 | +-| 6 | > +--2+ +--0+ +--3+ > > The two remaining revisions are still in use, and one of them is older. > We can clean the B-tree pages associated with revision 4 if there is > already a new revision for this B-tree, and for the pages that have been > copied. > > When the read using rev 2 is done, we can clean all the B-tree pages that > are associated with every revision below 6, as soon as there is a B-tree > revision above the oldest one. > > An other important point : each revision might be used more than once. > Releasing a read will decrement the count, and when it goes down to 0, we > can safely remove the element from the list. We need a thread-safe data > structure for that. > > Managing the list > ----------------- > > let's go a bit deeper in the management of this list. It has interesting > characteristics : > > - elements are added by a unique thread : the writter > - elements are not removed unless the count becomes 0 > - the count is equal to the maximum number of thread that can use the > element at some point > - it's not because the counter is N that N thread *are* using this element > : we may eventually have no thread using it ! (but they *may* come back > later...) > - each element must be be protected agains concurrent update (which is a > different requirement than protecting the full list) > - any thread requiring an element in the list will always use the head of > the least (which is the most recent) > - we don't do any list traversal ! Once a thread uses an element, it > remembers about its position > - no user Thread is allowed to remove an element from this list. The only > thread that can do that is the Cleaner Thread > > > The last characteristic is interesting : that means we can safely depends > on one single thread to do the cleaning, so there is no concurrent access > on teh list while deleting elements. > > The deletion can be done in two ways : > - from the tail of the list, if the element's counter is equal to zero > (and we can iterate) > - from the inner of the list, when the element's counter to be removed is > zero > > The first way is simpler, but it holds potentially a lot of element if the > oldest one is there for a long time The second way implies we can relink > elements in the list, ie this is a double-linked list > > Let's assume we go for the first way. In any case, the list will > *always* have at least one element, even if its counter is zero : > > head > +---+ > | N | revision > +---+ > | 0 | users counter (AtomicInteger) > +---+ > | |--o > +---+ > o--| | > +---+ > > A thread wanting to use revision N will simply increment the counter, > which is an AtomicInteger, guaranteing that we can't have a wrong update of > this value. The revision will never change. > At this point, using the head is quite straight-forward. > > * Adding an element > > So the writter creates a new revision. This will create a new element : > > new head > +---+ +---+ > |N+1| | N |<<--- T1, T2, T5 > +---+ +---+ > | 0 | | 3 | > +---+ +---+ > | |--o | |--o > +---+ +---+ > o--| | o--| | > +---+ +---+ > > The Head should always be pointing to one single cell, and any thread > should be able to get it. This is easy to do if we use an AtomicReference : > > new [head] > +---+ +---+ > |N+1| | N |<<--- T1, T2, T5 > +---+ +---+ > | 0 | | 3 | > +---+ +---+ > | |--o | |--o > +---+ +---+ > o--| | o--| | > +---+ +---+ > > Swapping the head is done after we have updated the links, which is done > by the writter thread : > > Step 1 : > > Attach the head to the new element > > new [head] > +---+ +---+ > |N+1| | N |<<--- T1, T2, T5 > +---+ +---+ > | 0 | | 3 | > +---+ +---+ > | |--o | |--o > +---+ +---+ > o--| |<----| | > +---+ +---+ > > Step 2 : > > Attach the new element to the head (Here, at the same time, a new thread > needs the latest revison (still N)) > > > new [head] > +---+ +---+ > |N+1| | N |<<--- T1, T2, T5, T7 > +---+ +---+ > | 0 | | 4 | > +---+ +---+ > | |---->| |--o > +---+ +---+ > o--| |<----| | > +---+ +---+ > > Step 3 : > > Swap the head > > [head] > +---+ +---+ > |N+1| | N |<<--- T1, T2, T5, T7 > +---+ +---+ > | 0 | | 4 | > +---+ +---+ > | |---->| |--o > +---+ +---+ > o--| |<----| | > +---+ +---+ > > Step 4 : > > A new thread need the latest revision > > > +--- T8 > [head] | > +---+ | +---+ > |N+1|<<-+ | N |<<--- T1, T2, T5, T7 > +---+ +---+ > | 1 | | 4 | > +---+ +---+ > | |---->| |--o > +---+ +---+ > o--| |<----| | > +---+ +---+ > > At this point, no other thread will *ever* be attached to revision N > > This can go on an on, with new revisions attached to the list by the > writter thread. Now, let's consider two different use case : > 1) when the oldest revision has a counter going down to zero > 2) when a revision which is not the head nor the oldest that is going to > zero > > First case, a first approach : > > This is the simplest. We have to remove the element from the lists, and > check if its parent's counter is also zero and is not the head. If so, we > iterate until we meet an element which counter is not zero and is not the > head. > > In any case, we simply have to change its parent's next pointer, and to > remove the element from the list : > > +--- T8 +-- T1, T2, T5, T7 > [head] | | > +---+ | +---+ | +---+ +---+ > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 This thread will release > the element, setting teh counter to 0 > +---+ +---+ +---+ +---+ > | 1 | | 4 | | 0 | | 0 | Was 1 > +---+ +---+ +---+ +---+ > | |---->| |---->| |---->| |--o > +---+ +---+ +---+ +---+ > o--| |<----| |<----| |<----| | > +---+ +---+ +---+ +---+ > > The thread that set the counter to 0 > > Step 1 : > > Detach the last element parent's next pointer > > +--- T8 +-- T1, T2, T5, T7 > [head] | | > +---+ | +---+ | +---+ +---+ > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 > +---+ +---+ +---+ +---+ > | 1 | | 4 | | 0 | | 0 | > +---+ +---+ +---+ +---+ > | |---->| |---->| |--o | |--o > +---+ +---+ +---+ +---+ > o--| |<----| |<----| |<----| | > +---+ +---+ +---+ +---+ > > Step 2 : > Iterate if the parent's counter is 0 > > +--- T8 +-- T1, T2, T5, T7 > [head] | | > +---+ | +---+ | +---+ +---+ > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 > +---+ +---+ +---+ +---+ > | 1 | | 4 | | 0 | | 0 | > +---+ +---+ +---+ +---+ > | |---->| |--o | |--o | |--o > +---+ +---+ +---+ +---+ > o--| |<----| |<----| |<----| | > +---+ +---+ +---+ +---+ > > Step 3 : > Detach the last element which counter is 0 prev pointer > > +--- T8 +-- T1, T2, T5, T7 > [head] | | > +---+ | +---+ | +---+ +---+ > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 > +---+ +---+ +---+ +---+ > | 1 | | 4 | | 0 | | 0 | > +---+ +---+ +---+ +---+ > | |---->| |--o | |--o | |--o > +---+ +---+ +---+ +---+ > o--| |<----| | o--| |<----| | > +---+ +---+ +---+ +---+ > > Step 4 : > > We can now dispose the unlinked elements. > > What if a thread is freeing an element while we are already free another > one ? This can be exposed using the previous sample : > > +--- T8 +-- T1, T2, T5, T7 > [head] | | > +---+ | +---+ | +---+ +---+ > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 This thread will release > the element, setting teh counter to 0 > +---+ +---+ +---+ +---+ > | 1 | | 4 | | 1 | | 0 | Was 1 > +---+ +---+ +---+ +---+ > | |---->| |---->| |---->| |--o > +---+ +---+ +---+ +---+ > o--| |<----| |<----| |<----| | > +---+ +---+ +---+ +---+ > > Step 0+ > > A thread (10) release the N-1 revison. We have two possibilities here : > > 1) The N-2 parent's next is not yet detached > > > +--- T8 +-- T1, T2, T5, T7 > [head] | | > +---+ | +---+ | +---+ +---+ > |N+1|<<-+ | N |<<-+ |N-1| |N-2|x---- T9 This thread will release > the element, setting teh counter to 0 > +---+ +---+ +---+ +---+ > | 1 | | 4 | | 0 | | 0 | Was 1 > +---+ +---+ +---+ +---+ > | |---->| |---->| |---->| |--o > +---+ +---+ +---+ +---+ > o--| |<----| |<----| |<----| | > +---+ +---+ +---+ +---+ > > We just continue walking the list up to the first element with a non zero > counter > > 2) The N-2 paren'ts next is already detached > > > > +--- T8 +-- T1, T2, T5, T7 > [head] | | +-- T10 > +---+ | +---+ | +---+ | +---+ > |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will release > the element, setting teh counter to 0 > +---+ +---+ +---+ +---+ > | 1 | | 4 | | 0 | | 0 | Was 1 > +---+ +---+ +---+ +---+ > | |---->| |---->| |--o | |--o > +---+ +---+ +---+ +---+ > o--| |<----| |<----| |<----| | > +---+ +---+ +---+ +---+ > > In this case, we will have 2 threads dealing with two elements to be > cleaned up. This is not a problem, as T9 will now detach the N-2 element > from teh N-1 one and T9 will detach the N-1 element from the list. All is > safe. > > About the removed elements > -------------------------- > > Once a thread has detached some elements which counters are equal to 0, > what should we do with them ? As we will update the disk with the removed > pages, we have to send them to teh Writter thread. This can be done by > pushing teminto a deque (and more specifically a > ConcurrentLinkedDeque) > > Another option is that the writer thread handle the cleaning of this queue > when it is weaken up or periodically. In this case, we don't need to push > the element sinto a deque, and the reader threads will just set the counter > to 0. > > Second case > ----------- > > What if we accept the removal of elements from the middle of the list ? > This is way more complex, as we have to be sure we don't break the chain. > One way to get rid of this constraint is to enforce the writer thread to do > the job, by periodically checing the whole list (which shuld not be too > costly). In this case, the reader thread never update the element's > parent's next pointer, they just decrement teh counter. > When this counter becomes 0, then we can remove the element. > > Solution > ======== > > We will separate the threads into 2 categories : > - readers (we may have many) > - writer (we have only one) > > The readers always peek the latest revision (which is the head of the > list). It then increment the counter. When teh thread does not need the > revision anymore, the counter is decremented. This is the only action the > reader thread will ever do on the elements and on the list, but it will > always keep a reference to the element into the session. > > The writer thread has a lot more to do : > - it is responsible for adding element at the top of the list (see upper). > Once the head is swapped, every new reader will get the added element as a > reference. > - it also has to check in the list for the elements which counter is 0, if > teh list is longer than 1 element. This check can be done after each > update, or periodically (every N seconds if the number of elements is > 1, > or after M additions) if tht help saving updates on disk. > > Referencing pages > ----------------- > > All the B-tree revisions are kept into the BoB B-tree, so it's easy to > retrieve the rootPage for a given revision. > > As all the pages can't be hold in memory, we sometime have to reach a page > from disk or from the cache. This is done following this simple algorithm : > > for a given page with offset O > if the cache contains an instance for O > return the page > else > fetch the page with offset O from disk > store it in the cache > > In our case, we just add a layer, which will check in the working memory > for the instance of a page, given its offset. > > > -- Ashma Shrestha PHP Developer (770) 310-9456 [email protected]
