Hi !

That was not the most efficient weekend of my life :-) Not that I
didn't spend time on the code, but I faced some unexpected issues...
This is where coding is far less productive when you haven't used a
piece of paper beforehand. I would blame my fragmented time...


Here is the problem I faced : when I update a B-tree many times in one
single transaction, page splits will occur, and that will impact the
B-tree structure. So far, so good, nothing new under the sun. The
problem is that I'm working in memory, and when I have to flush the pages
to disk, I have to do it in reverse order (ie, the leaves first, then
the parent node, etc, up to the root page, and ending with the btree
header). This is mandatory, because on disk, pages refer to each other by
their offset, and if I haven't written a leaf to disk before its
parent node, I can't update the offset in the node.

I don't use the offset as a page's unique identifier any more, because it
would force me to flush new pages prematurely to disk (well, not exactly
: I would just have to ask for a new page with an offset, and that requires an
access to the free page list). So in order to avoid such a problem, I
now attribute a unique ID to each page. This unique ID is incremental,
and I use a Long for that (which allows Mavibot to create up to 9
exa-pages - remember : kilo, mega, giga, tera, peta, exa - or roughly 10^19
pages. If we created 1 billion pages per second, it would take around 10 billion
seconds to reach the limit, ie roughly 300 years.)
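The ID allocation itself is trivial - a minimal sketch, with hypothetical names (this is not Mavibot's actual code, just the idea of an incremental Long counter):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of an incremental page-ID allocator (hypothetical names, not
// Mavibot's real implementation). A Long offers 2^63 ~ 9.2 * 10^18
// positive values, so the counter will not realistically wrap.
public class PageIdAllocator {
    private final AtomicLong nextId = new AtomicLong( 0L );

    // Hand out a fresh, monotonically increasing page ID.
    public long allocate() {
        return nextId.getAndIncrement();
    }
}
```

The point is that allocating an ID costs nothing and touches no disk structure, unlike allocating an offset, which requires a trip to the free page list.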

The consequence is that I have to use a stack of modified pages, onto
which I push the leaves, the parent node, the node's parent node, etc,
up to the root node, and finally the btree header. Then I can flush all
the pages to disk, and the offsets will be updated properly.
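The flush-order idea can be sketched like this - hypothetical names and a simulated disk, not Mavibot's actual classes; the only point is that pages registered bottom-up get written child-before-parent, so each parent can record its children's freshly assigned offsets:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of bottom-up flushing (hypothetical names). Modified pages
// are queued leaf first and btree header last; draining in insertion
// order writes every child before its parent.
public class FlushOrderSketch {
    static class Page {
        final long id;      // in-memory unique ID
        long offset = -1L;  // disk offset, unknown until written
        Page( long id ) { this.id = id; }
    }

    static long nextFreeOffset = 0L;
    static final int PAGE_SIZE = 512; // hypothetical page size

    // Simulate a disk write : assign the next free offset to the page.
    static void writeToDisk( Page p ) {
        p.offset = nextFreeOffset;
        nextFreeOffset += PAGE_SIZE;
    }

    public static void main( String[] args ) {
        Deque<Page> modified = new ArrayDeque<>();
        // Bottom-up registration : leaf, then its parent, then the root.
        modified.addLast( new Page( 1 ) ); // leaf
        modified.addLast( new Page( 2 ) ); // parent node
        modified.addLast( new Page( 3 ) ); // root / btree header

        // Drain in the same order : children get offsets before parents.
        while ( !modified.isEmpty() ) {
            Page p = modified.pollFirst();
            writeToDisk( p );
            System.out.println( "page " + p.id + " -> offset " + p.offset );
        }
    }
}
```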

But at the same time, I want to be able to retrieve a page in memory
quickly, in order to reuse it within a transaction, instead of having
to fetch it from disk. So I needed a different data structure : a stack
is backed by a List, and searching in a list is linear. I'm now using a
TreeMap. That solves the issue, *but* I do think it's not ideal.
Managing a tree-map is costly, compared to scanning a list, when it
comes to managing a few items. For a big B-tree, we are likely to have
only 6 or 7 pages in the stack, so it *might* be faster to use a list
rather than a TreeMap... Ok, ok, we are talking about
micro-optimisation here :-)
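The trade-off looks roughly like this - again a sketch with made-up names, just contrasting the two lookups:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the two lookup structures discussed above (hypothetical
// names). A TreeMap finds a page by ID in O(log n); a plain list scan
// is O(n), but for the handful of pages touched in one transaction
// the list's lower constant factor might win.
public class ModifiedPagesLookup {
    static class Page {
        final long id;
        Page( long id ) { this.id = id; }
    }

    public static void main( String[] args ) {
        // TreeMap keyed by page ID : O(log n) lookup.
        TreeMap<Long, Page> byId = new TreeMap<>();
        for ( long id : new long[] { 3, 1, 7 } ) {
            byId.put( id, new Page( id ) );
        }
        Page hit = byId.get( 7L );

        // Equivalent linear scan over a list : O(n) lookup.
        List<Page> list = new ArrayList<>( byId.values() );
        Page scanned = null;
        for ( Page p : list ) {
            if ( p.id == 7L ) { scanned = p; break; }
        }

        // Both structures find the same page.
        System.out.println( hit.id == scanned.id );
    }
}
```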


So at the end of the day, I'm facing some issues with offset
computations in some cases, because I don't write pages in the correct
order in all the possible cases (typically, I fixed an issue where one
of the B-trees was split after 7 insertions, then another issue after 8
insertions, then another after 11 insertions, and now I have another
incorrect offset after 24 insertions...). It's all about pushing the
pages in the correct order, which depends on the Page ID, which
sometimes is not correctly updated.


It looks like moving forward, and then backward, but at the end of the
day, it's really about fixing fundamental design issues that became
apparent after doing some more thorough tests...


I'll keep you posted next weekend !


-- 
Emmanuel Lecharny

Symas.com
directory.apache.org
