On 8/18/11 10:41 PM, Selcuk AYA wrote:
Hi,
Today we had some discussion with Alex, Emmanuel and others on how we
can improve jdbm consistency semantics. I had spent sometime looking
into this issue and thought it could be useful to put a summary of my
findings here.
Currently, jdbm has issues with both concurrency and consistency:
1) jdbm table lookups, insert and remove interfaces are synchronized
methods. So even if all the directory server does is to lookups on
tables, all lookups will be serialized. Moreover, the record manager
operations are all synchronized methods too. This means, for example,
while sync of dirty pages to disk goes on, no lookup operation can go
ahead.
Absolutely. Those synchronized were added a while back to try to
guaranty some kind of consistency... This is a ugly hack, and a
contention point for sure.
2) jdbm browser interface does not provide any consistency guarantees.
If there are underlying changes to the store while the browser is
open, then it might return inconsistent results. I think the situation
is even worse if the underlying record manager is CacheRecordManager
as the same page could be modified and read by a browser concurrently.
We already experience such inconsistency in the replication tests. Not
often, but still.
The reason we haven't noticed them up to now is that we have tests that
are run in one single thread, and for the concurrent tests we have done,
we never mixed searches and modifications together.
I have been working on a scheme which introduces what can be defined
as action consistency into the jdbm store.
1) Actions are lookup, insert, remove and browse. Each action is
assigned a unique version. Actions are ReadWrite or ReadOnly.
2) We allow one ReadWrite action and multiple ReadOnly actions to run
concurrently.So synchronized methods will be removed.
Good.
3)We introduce a new record manager which caches jdbm B+ pages. Each
page in the cache has a [startVersion, endVersion). When an action
with version V1 wants to read a page, its read can be satisfied
satisfied from that page's version with V1>= startVersion&& V1<
endVersion.
4) Pages' previous versions are kept in memory. A page can be purged
when the minimum version among all active actions is>= endVersion.
So say we have three pages in a chain (A0->B0->C0) and each of them
has version range [0, infinity). An write action starts and gets the
version number 1. It updates B0 and C0 to B1 and C1 in any order.
After these two updates, B0 and C0 will have version range [0,1) and
and B1 and C1 will have version range [1,infinity). Before the write
action completes, a read action comes, gets the current read version
which is 0 and reads the chain. Since B0 and C0 will be the versions
that can satisfy this read, the read only action will read the chain
A0->B0->C0. When write action completes, it posts version 1 as the new
read version. First read action completes, a second one starts with
version 1 and that one will read A0->B1->C1. Since the minimum read
version is now 1, B0 and C0 can be zapped.
Good.
Concerns:
1)Previous versions of B+ tree pages could consume too much memory. As
long as actions are kept small, this is not a problem. Only the
browsing action does not follow this rule. There are a couple of
options to deal with it. We can maybe spill previous versions to disk
after some memory limit.
That's an option. For big bases, this is mandatory.
Or we can think about chopping down browsing
action into smaller read actions. Another way to deal with this
problem would be to keep the previous versions of pages on disk rather
than in memory. On disk, versions for a B+ page would form a
chain.This is a radically different way of introducing action
consistency but I thought this unneccesarily complicates free space
management while all we need to do with old versions of pages after a
restart is to toss them away.
I'd rather go for a flush to disk when necessary.
In any case, this is a very interesting analysis of our problem, and the
proposed solutions seems to fit with the current code base, and won't
necessarily lead to a huge modification of JDBM.
Thanks a lot for this mail Selcuk !
--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com