Hi Carlos,

You have mentioned "we get the 'ERR_554 double get for block' and then we
need to restore. We have an automated way of doing this now so it's not too
disruptive when it happens." . I am running into the same ERR_554 issue,
can you let me know how you fixed it?

Thanks,

Ashma

On Mon, Sep 28, 2015 at 9:15 AM, <[email protected]> wrote:

> Emmanuel! Wow!! Thank you for that most detailed and comprehensive write
> up of the concurrency problem you are working to solve with Mavibot.  The
> design sounds world class.
> I now fully appreciate the complexity of this transaction issue and the
> ASCII art helped make it clear. We realize you're all doing your
> best to devise a robust solution for this and so we'll sit tight for now.
> I appreciate the time you took to explain this to us. Have fun at the
> conference.
> Best, Carlo Accorsi
>
>
>
>
> -----Original Message-----
> From: Emmanuel Lécharny [mailto:[email protected]]
> Sent: Sunday, September 27, 2015 1:23 PM
> To: [email protected]
> Subject: Re: ApacheDS with Mavibot anytime soon?
>
> Le 23/09/15 21:41, [email protected] a écrit :
> > Hi,
>
> hi,
>
> sorry for the delayed response ! I saw your mail 3 days ago, but I forgot
> to answer back then, being caught in a storm of delaying events...
> >
> > We have one customer that wants to get replication implemented. We've
> set it up in house and it seems to work well.
> > The trouble is there are 10's of thousands of users and all sorts of
> concurrent activity. They're on M16 and a few times per month, we get the
> 'ERR_554 double get for block' and then we need to restore.
> > We have an automated way of doing this now so it's not too disruptive
> when it happens.
> Hopefully. Still I understand this is not convenient.
> >
> > My question is, would adding replication just add more concurrent
> activity to the servers and lead more frequently to the ERR_554 situation?
> For the consumer, it will be just as if the updates were coming from a
> standard user. So, no, I don't think it's a real problem.
> >
> > Also he's asked to upgrade to M20 but my hunch is that for this issue it
> won't matter much because we're still on JDBM?
>
> True. It's just that some bugs have been fixed in M20.
> >
> > Finally, we're eager to test out a release with the Mavibot backed.
> > I'm sure it's possible to build ourselves but we're trying to keep the
> moving parts to a minimum and use only ApacheDS releases.
> I understand.
>
> Mavibot work has been delayed, and we are working on it as much as we can,
> but it's not enough. Kiran and I have day jobs atm that forbid us to
> dedicate as much time as would be necessary to complete the last part that
> is mandatory to get this problem fixed : the transaction system. We
> restarted to 'think' about the best implementation 2 weeks ago (on your
> free time) and we expect to be able to spend more time on it next week, as
> we will both be at Budapest, during the Apache Conference.
>
> Transaction  support (the way we need it, ie cross-btree) is not that
> simple to impelment. We have a pretty clear idea on how it should work, but
> that mpacts the current design, as we need to woirk in memory, and avoid to
> copy pages that have already been modifed in the transaction boundaries. We
> also have to deal with the cleanup of old versions, which also means we
> need to implement the version manager.
>
> To give you a insight on what I'm coming for, here is a draft of my
> thoughts about this transaction manager (note that this is my vision, that
> I haven't shared yet, because I don't think it's covering all the moving
> parts, but still, I think it won't be that different to what we will end
> with ) :
>
> Transaction support
> -------------------
>
> Mavibot must support transaction across many B-tres (ie, whatever the
> number of b-trees we are updating, we should wait before committing the
> changes up to the point the transaction is closed). That means we may have
> many updates pending in memory. In any case, if a rollback is done, the
> modified b-trees will remain intact, in the same state they were before the
> transaction started.
>
> To achive this, we need to work in a working memory. As we may have to
> fetch pages that are to be updated, we will face three cases :
> - the page is not in memory :
> we have to fetch them from disk, update them and store it in the working
> memory we can read it back if needed
> - the page is in the cache :
> we have to copy it to the working memory
> - the page is in the working memory : we simply update it, leaving it in
> the working memory.
>
> So if we have to modify the same page many times, it will be updated in
> the working memory, we won't need to copy it.
>
> That actually change the current Mavibot implementation, as it's copying
> pages no matter what. Here, if the page is coming from the working memory,
> we just update it.
>
>
> Layout
> ------
>
>  Disk            Cache           Working
>   ___                            Memory
>  /   \          .------.
> +     +        /      /|         .------.
> |\___/|       .------. |       .------. |
> +     +       |      | |       |      | |
> |\___/| ----> |      | | ----> |      | |
> +     +       |      |/        |      |/
> |\___/|       .------.         .------.
> +     +
>  \___/
>
>  When the transaction is rollbacked, we simply delete the content of the
> working memory.
>  When the transaction is committed, we write all the pages in the working
> memory on disk, and into the cache, we update the header, and we purge the
> working memory.
>
>  If a crash occurs during a transaction, then either the transaction is
> rollbacked (if the process is still running) - which is just about purging
> the working memory -, or there will be nothing to do if we have to restart
> the application, as the modified pages were in memory nad have been removed
> while the process crashed.
>
> OTOH, when we read, we must be sure that the B-trees are present, and that
> we always use the B-tree revisions that are correlated. This is critical
> when using LDAP, where many B-trees are updated during a sinlge write.
>
> The way to do that is to use a revision, and always use the B-trees which
> revision is just below or equal to this revision. Let's see with an example
>
>  +-----+---+---+---+---+---+---+---+...
>  |\ Rev|   |   |   |   |   |   |   |
>  | \   |   |   |   |   |   |   |   |
>  |  \  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
>  |B- \ |   |   |   |   |   |   |   |
>  |tree\|   |   |   |   |   |   | L | N
>  +-----+---+---+---+---+---+---+---+...
>  | aaa | X |   |   | X | X | X |   |
>  |     | 1 | 1 | 1 | 4 | 5 | 6 | 6 |
>  +-----+---+---+---+---+---+---+---+...
>  | bbb | X | X | X |   | X | X | X |
>  |     | 1 | 2 | 3 | 3 | 5 | 6 | 7 |
>  +-----+---+---+---+---+---+---+---+...
>  | ccc | X |   | X | X | X |   | X |
>  |     | 1 | 1 | 3 | 4 | 5 | 5 | 7 |
>  +-----+---+---+---+---+---+---+---+...
>  | ddd | X | X |   |   | X |   |   |
>  |     | 1 | 2 | 2 | 2 | 5 | 5 | 5 |
>  +-----+---+---+---+---+---+---+---+...
>  | eee | X |   | X | X |   | X |   |
>  |     | 1 | 1 | 3 | 4 | 4 | 6 | 6 |
>  +-----+---+---+---+---+---+---+---+...
>
> L : Latest revision
> N : New revision
>
> In this table, we have had 7 updates, and 7 revsions. As all the B-trees
> haven't been updated, we may have some 'holes' in the revision associated
> to a B-tree. Typically, here, the 'aaa' B-tree will have 4 revisions (1, 4,
> 5 and 6) created, because this B-tree has been updated
> 4 times.
>
> Doing a read on revision 7 will use the following B-trees : aaa[6],
> bbb[7], ccc[7], ddd[5], eee[6].
>
>
> All in all, a read is associated to a global transaction revision which is
> used as a marker. Every B-tree update is done using the new revision.
>
> Managing on-going reads
> -----------------------
>
> We maintain all the B-tree revisions associated with a read, until the
> read is completed, we need to get rid of the old revision when we are done.
> This is critical to keep the database size to a minimum. In order to do
> that, we must know if an older revision is still in use. We then need to
> keep a common data structure that is shared across threads, which will hold
> all the used revisions, ordered. With the previous example, assuming
> revisions 2, 4 and 6 are in use, we may use a linked list for that purpose :
>
>    +---+   +---+   +---+
> o--| 2 |<--| 4 |<--| 6 |
>    +--2+   +--1+   +--3+
>
> When the read using rev 4 is done, we remove it from this list :
>           +-------------+
>    +---+  |    +---+    | +---+
> o--| 2 |<-+  X-| 4 |    +-| 6 |
>    +--2+       +--0+      +--3+
>
> The two remaining revisions are still in use, and one of them is older.
> We can clean the B-tree pages associated with revision 4 if there is
> already a new revision for this B-tree, and for the pages that have been
> copied.
>
> When the read using rev 2 is done, we can clean all the B-tree pages that
> are associated with every revision below 6, as soon as there is a B-tree
> revision above the oldest one.
>
> An other important point : each revision might be used more than once.
> Releasing a read will decrement the count, and when it goes down to 0, we
> can safely remove the element from the list. We need a thread-safe data
> structure for that.
>
> Managing the list
> -----------------
>
> let's go a bit deeper in the management of this list. It has interesting
> characteristics :
>
> - elements are added by a unique thread : the writter
> - elements are not removed unless the count becomes 0
> - the count is equal to the maximum number of thread that can use the
> element at some point
> - it's not because the counter is N that N thread *are* using this element
> : we may eventually have no thread using it ! (but they *may* come back
> later...)
> - each element must be be protected agains concurrent update (which is a
> different requirement than protecting the full list)
> - any thread requiring an element in the list will always use the head of
> the least (which is the most recent)
> - we don't do any list traversal ! Once a thread uses an element, it
> remembers about its position
> - no user Thread is allowed to remove an element from this list. The only
> thread that can do that is the Cleaner Thread
>
>
> The last characteristic is interesting : that means we can safely depends
> on one single thread to do the cleaning, so there is no concurrent access
> on teh list while deleting elements.
>
> The deletion can be done in two ways :
> - from the tail of the list, if the element's counter is equal to zero
> (and we can iterate)
> - from the inner of the list, when the element's counter to be removed is
> zero
>
> The first way is simpler, but it holds potentially a lot of element if the
> oldest one is there for a long time The second way implies we can relink
> elements in the list, ie this is a double-linked list
>
> Let's assume we go for the first way. In any case, the list will
> *always* have at least one element, even if its counter is zero :
>
>    head
>    +---+
>    | N | revision
>    +---+
>    | 0 | users counter (AtomicInteger)
>    +---+
>    |   |--o
>    +---+
> o--|   |
>    +---+
>
> A thread wanting to use revision N will simply increment the counter,
> which is an AtomicInteger, guaranteing that we can't have a wrong update of
> this value. The revision will never change.
> At this point, using the head is quite straight-forward.
>
> * Adding an element
>
> So the writter creates a new revision. This will create a new element :
>
>     new      head
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5
>    +---+     +---+
>    | 0 |     | 3 |
>    +---+     +---+
>    |   |--o  |   |--o
>    +---+     +---+
> o--|   |  o--|   |
>    +---+     +---+
>
> The Head should always be pointing to one single cell, and any thread
> should be able to get it. This is easy to do if we use an AtomicReference :
>
>     new      [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5
>    +---+     +---+
>    | 0 |     | 3 |
>    +---+     +---+
>    |   |--o  |   |--o
>    +---+     +---+
> o--|   |  o--|   |
>    +---+     +---+
>
> Swapping the head is done after we have updated the links, which is done
> by the writter thread :
>
> Step 1 :
>
> Attach the head to the new element
>
>     new      [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5
>    +---+     +---+
>    | 0 |     | 3 |
>    +---+     +---+
>    |   |--o  |   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> Step 2 :
>
> Attach the new element to the head (Here, at the same time, a new thread
> needs the latest revison (still N))
>
>
>     new      [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5, T7
>    +---+     +---+
>    | 0 |     | 4 |
>    +---+     +---+
>    |   |---->|   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> Step 3 :
>
> Swap the head
>
>    [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5, T7
>    +---+     +---+
>    | 0 |     | 4 |
>    +---+     +---+
>    |   |---->|   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> Step 4 :
>
> A new thread need the latest revision
>
>
>            +--- T8
>    [head]  |
>    +---+   | +---+
>    |N+1|<<-+ | N |<<--- T1, T2, T5, T7
>    +---+     +---+
>    | 1 |     | 4 |
>    +---+     +---+
>    |   |---->|   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> At this point, no other thread will *ever* be attached to revision N
>
> This can go on an on, with new revisions attached to the list by the
> writter thread. Now, let's consider two different use case :
> 1) when the oldest revision has a counter going down to zero
> 2) when a revision which is not the head nor the oldest that is going to
> zero
>
> First case, a first approach :
>
> This is the simplest. We have to remove the element from the lists, and
> check if its parent's counter is also zero and is not the head. If so, we
> iterate until we meet an element which counter is not zero and is not the
> head.
>
> In any case, we simply have to change its parent's next pointer, and to
> remove the element from the list :
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |---->|   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> The thread that set the counter to 0
>
> Step 1 :
>
> Detach the last element parent's next pointer
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |--o  |   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> Step 2 :
> Iterate if the parent's counter is 0
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |
>    +---+     +---+     +---+     +---+
>    |   |---->|   |--o  |   |--o  |   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> Step 3 :
> Detach the last element which counter is 0 prev pointer
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   |     +---+     +---+
>    |N+1|<<-+ | N |<<-+     |N-1|     |N-2|x---- T9
>    +---+     +---+         +---+     +---+
>    | 1 |     | 4 |         | 0 |     | 0 |
>    +---+     +---+         +---+     +---+
>    |   |---->|   |--o      |   |--o  |   |--o
>    +---+     +---+         +---+     +---+
> o--|   |<----|   |      o--|   |<----|   |
>    +---+     +---+         +---+     +---+
>
> Step 4 :
>
>  We can now dispose the unlinked elements.
>
> What if a thread is freeing an element while we are already free another
> one ? This can be exposed using the previous sample :
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 1 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |---->|   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> Step 0+
>
> A thread (10) release the N-1 revison. We have two possibilities here :
>
> 1) The N-2 parent's next is not yet detached
>
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |---->|   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> We just continue walking the list up to the first element with a non zero
> counter
>
> 2) The N-2 paren'ts next is already detached
>
>
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |         +-- T10
>    +---+   | +---+   | +---+   | +---+
>    |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |--o  |   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> In this case, we will have 2 threads dealing with two elements to be
> cleaned up. This is not a problem, as T9 will now detach the N-2 element
> from teh N-1 one and T9 will detach the N-1 element from the list. All is
> safe.
>
> About the removed elements
> --------------------------
>
> Once a thread has detached some elements which counters are equal to 0,
> what should we do with them ? As we will update the disk with the removed
> pages, we have to send them to teh Writter thread. This can be done by
> pushing teminto a deque (and more specifically a
> ConcurrentLinkedDeque)
>
> Another option is that the writer thread handle the cleaning of this queue
> when it is weaken up or periodically. In this case, we don't need to push
> the element sinto a deque, and the reader threads will just set the counter
> to 0.
>
> Second case
> -----------
>
> What if we accept the removal of elements from the middle of the list ?
> This is way more complex, as we have to be sure we don't break the chain.
> One way to get rid of this constraint is to enforce the writer thread to do
> the job, by periodically checing the whole list (which shuld not be too
> costly). In this case, the reader thread never update the element's
> parent's next pointer, they just decrement teh counter.
> When this counter becomes 0, then we can remove the element.
>
> Solution
> ========
>
> We will separate the threads into 2 categories :
> - readers (we may have many)
> - writer (we have only one)
>
> The readers always peek the latest revision (which is the head of the
> list). It then increment the counter. When teh thread does not need the
> revision anymore, the counter is decremented. This is the only action the
> reader thread will ever do on the elements and on the list, but it will
> always keep a reference to the element into the session.
>
> The writer thread has a lot more to do :
> - it is responsible for adding element at the top of the list (see upper).
> Once the head is swapped, every new reader will get the added element as a
> reference.
> - it also has to check in the list for the elements which counter is 0, if
> teh list is longer than 1 element. This check can be done after each
> update, or periodically (every N seconds if the number of elements is > 1,
> or after M additions) if tht help saving updates on disk.
>
> Referencing pages
> -----------------
>
> All the B-tree revisions are kept into the BoB B-tree, so it's easy to
> retrieve the rootPage for a given revision.
>
> As all the pages can't be hold in memory, we sometime have to reach a page
> from disk or from the cache. This is done following this simple algorithm :
>
>   for a given page with offset O
>     if the cache contains an instance for O
>       return the page
>     else
>       fetch the page with offset O from disk
>       store it in the cache
>
> In our case, we just add a layer, which will check in the working memory
> for the instance of a page, given its offset.
>
>
>


-- 
Ashma Shrestha
PHP Developer
(770) 310-9456
[email protected]

Reply via email to