On Sun, Nov 11, 2001 at 07:02:23PM -0500, Tavin Cole wrote:
> On Sun, Nov 11, 2001 at 05:49:05PM -0500, Gianni Johansson wrote:
> > Until the DataStore bug is killed, it will be impossible to keep such a
> > collection of nodes running, so I think that fixing it has to be the
> > project's highest priority. It would be good if whatever is known about
> > this bug could be dumped into the Wiki. It would be great if Tavin came
> > through with a perfectly debugged DataStore implementation, but we
> > shouldn't wait for him. One good side effect of an "all hands on deck"
> > debugging session would be to get more eyes across this crucial
> > component of the system.
> 
> What makes the datastore so nasty to fix is that it's not a matter of
> weeding the bugs out of a sufficient design..  there are just big
> architectural holes that still need to be filled.  I'm not trying
> to discourage anyone from finding and fixing bugs in the current
> system, but I'm totally convinced that any serious effort to make
> it really robust will lead the designer down the same paths of
> extensive re-designing that I've been down.

Oops, I accidentally sent this before I finished it.

Anyway, the current system has a lot of rough edges because it grew from
the intersection of independent designs -- Scott's original DatastoreFS
code written back in 0.3 days with my 0.4-node-side datastore code.  I
made a lot of modifications to Scott's code to get the subsystems working
with each other and that got us where we are now.  I always knew there
were some serious problems in the system although I didn't expect them
to manifest in the way we are seeing them.

I'm sure the "datastore bug" (probably a family of poorly understood flaws
working in concert) is tied into one of these systemic holes -- the
insufficient treatment of the problem of keeping the on-disk files and
accounting data in a continuously consistent state, so that the
file-system can start up properly no matter what was happening the last
time it was shut down (^C and killall java being the most popular
ways :).

The current system just writes a snapshot of the datastore directory
to a file within the datastore at every "node checkpoint" (recording
the address of this file in the node data file store_XXXX).  Maybe we
could get away with this if we could somehow guarantee the snapshot
would be written before the node gets shut down..  but then again if
we have access to genies we should make better wishes.
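
To make the fragility concrete, the current scheme amounts to roughly
this (a from-memory sketch in Java; the class and file names are made
up, not the real ones):

    import java.io.*;
    import java.util.Map;

    class CheckpointWriter {
        // At each node checkpoint, dump the whole in-memory directory
        // to a fresh file inside the store, then record where it lives
        // so startup can find it.
        void checkpoint(Map<String, Long> directory, File snapshotFile,
                        File storeIndexFile) throws IOException {
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(
                        new FileOutputStream(snapshotFile)))) {
                out.writeInt(directory.size());
                for (Map.Entry<String, Long> e : directory.entrySet()) {
                    out.writeUTF(e.getKey());    // the key
                    out.writeLong(e.getValue()); // where its data lives
                }
            }
            // The fragile part: everything that happened since the last
            // successful pass through here is lost if the node dies now.
            try (PrintWriter pw =
                    new PrintWriter(new FileWriter(storeIndexFile))) {
                pw.println(snapshotFile.getPath());
            }
        }
    }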

The current design also fails to address the realities of dealing with
very large stores that might contain many thousands of files and possibly
on the order of a million fragments.  I don't think it's reasonable to
keep the accounting information for all that in memory (at even a few
dozen bytes per fragment, that's tens of megabytes of RAM), but that's
what we do.

So my major target in the rewrite is a system that 1) maintains the
files and accounting data in a continuously consistent state, and
2) can handle, say, 10^6 files in 10^7 distinct fragments with
celerity and yet a modesty of memory use.

It turns out these two goals are deeply intertwined, since what you
need are on-disk accounting structures that you can make atomic updates
to, and that you can also organize into binary search trees.  I have
worked out most of the details of such a system.  It reserves space at
the beginning of the store for a collection of independently updateable
accounting blocks.  Each of these records data about a certain range
of keys or storage fragments.  They are indexed by in-memory trees so
that any table look-up (i.e., checking whether a key is in the store) requires
at most one disk access.  Each block has a checksum so it is possible to
tell if one was incompletely written.  Updates are made by writing a
special block in an empty slot to declare that an update is in progress,
then writing new versions of the updated blocks into empty slots, and
finally destroying the update-block and then queueing the old slots
of the updated blocks for re-use (i.e., deleting them).
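
In rough Java the update protocol looks like this (the slot layout,
record format, and sync placement here are simplified stand-ins, not
the real code):

    import java.io.*;
    import java.util.zip.CRC32;

    class AccountingStore {
        static final int SLOT_SIZE = 4096;  // one accounting block per slot
        final RandomAccessFile store;

        AccountingStore(RandomAccessFile store) { this.store = store; }

        // Write one block into a slot: length, payload, then a CRC so
        // an incompletely written block is detectable at startup.
        void writeSlot(int slot, byte[] payload) throws IOException {
            CRC32 crc = new CRC32();
            crc.update(payload);
            store.seek((long) slot * SLOT_SIZE);
            store.writeInt(payload.length);
            store.write(payload);
            store.writeLong(crc.getValue());
            store.getFD().sync();
        }

        // Replace a set of accounting blocks atomically.
        void update(int updateSlot, int[] newSlots, byte[][] newVersions,
                    int[] oldSlots) throws IOException {
            // 1) declare the update in progress, in an empty slot
            writeSlot(updateSlot, updateRecord(newSlots, oldSlots));
            // 2) write the new versions of the blocks into empty slots
            for (int i = 0; i < newSlots.length; i++)
                writeSlot(newSlots[i], newVersions[i]);
            // 3) destroy the update-block...
            writeSlot(updateSlot, new byte[0]);
            // 4) ...then queue the old slots for re-use (left out here)
        }

        byte[] updateRecord(int[] newSlots, int[] oldSlots)
                throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(newSlots.length);
            for (int s : newSlots) out.writeInt(s);
            out.writeInt(oldSlots.length);
            for (int s : oldSlots) out.writeInt(s);
            return buf.toByteArray();
        }
    }

The intent is that recovery can always tell where a crash happened: an
update-block that is still intact and checksums cleanly means the
update never committed, so the old versions, still sitting in their
slots, remain authoritative.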

Another eventual goal for the datastore is automatic defragmentation.
I don't expect to have this right away, but I'm making sure the system
can handle it (so that the only hurdle will be figuring out an economical
algorithm to do it).  For example, the new system will make it possible
to identify which file any byte in the store belongs to.  It will also
track which files are in active use so we know which files cannot be
moved/deleted at a given time (also important for dropping files when
the store fills up).
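
For instance, an index over fragment offsets gives you the byte-to-file
mapping cheaply (FragmentInfo and the TreeMap here are just for
illustration, collapsing the on-disk accounting blocks into a plain
in-memory map for the sake of the sketch):

    import java.util.Map;
    import java.util.TreeMap;

    class FragmentMap {
        static class FragmentInfo {
            final long fileId;       // which stored file owns the fragment
            final long start, length;
            boolean inUse;           // held by an active reader/writer?
            FragmentInfo(long fileId, long start, long length) {
                this.fileId = fileId;
                this.start = start;
                this.length = length;
            }
        }

        // keyed by fragment start offset within the store
        private final TreeMap<Long, FragmentInfo> byOffset = new TreeMap<>();

        void add(FragmentInfo f) { byOffset.put(f.start, f); }

        // Map any byte offset in the store back to its owning file,
        // or null if the byte falls in free space.
        FragmentInfo ownerOf(long offset) {
            Map.Entry<Long, FragmentInfo> e = byOffset.floorEntry(offset);
            if (e == null) return null;
            FragmentInfo f = e.getValue();
            return offset < f.start + f.length ? f : null;
        }

        // A defragmenter (or the drop-on-full logic) skips fragments
        // whose files are in active use.
        boolean movable(long offset) {
            FragmentInfo f = ownerOf(offset);
            return f != null && !f.inUse;
        }
    }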

Something else that needs improving is the "sliding locks."  Right now
you can have a read lock slide on a write lock but not the other way
around.  The new system overcomes this limitation and in general handles
locks very nicely.  This leads to much better handling of the "circular
buffers" used to tunnel large keys through small stores, and also makes
it possible to run the lengthy cache initialization in a background
thread without delaying node startup the first time it runs.
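
To show what the sliding means in the simpler direction (this is a toy
reconstruction, not the real lock code): the writer publishes how far
it has gotten, and a reader blocks only until the write has slid past
the range it needs.

    class SlidingRange {
        private long writtenUpTo = 0;  // exclusive bound of valid data
        private boolean writerDone = false;

        // The writer calls this as it flushes each chunk.
        synchronized void advance(long newBound) {
            writtenUpTo = Math.max(writtenUpTo, newBound);
            notifyAll();
        }

        synchronized void finish() {
            writerDone = true;
            notifyAll();
        }

        // A reader blocks until [0, end) is readable -- i.e. until the
        // write has slid past the point it needs.
        synchronized void awaitReadable(long end)
                throws InterruptedException {
            while (writtenUpTo < end && !writerDone)
                wait();
        }
    }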

Last but not least, I've thrown in the often-requested support for
spreading the store across multiple files.
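
The addressing for that is simple: a global offset resolves to a
(file, local offset) pair (the store.N naming below is just for
illustration):

    import java.io.File;

    class MultiFileStore {
        final File dir;
        final long pieceSize;  // size of each piece except maybe the last

        MultiFileStore(File dir, long pieceSize) {
            this.dir = dir;
            this.pieceSize = pieceSize;
        }

        File fileFor(long globalOffset) {
            return new File(dir, "store." + (globalOffset / pieceSize));
        }

        long localOffset(long globalOffset) {
            return globalOffset % pieceSize;
        }
    }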

Most of this code is finished.  Where I am stuck right now is in the
nitty-gritty details of managing the accounting blocks I was talking about.

So I'm going to go bang my head on that some more.  Cheers.

-- 

:: tavin cole (tcole at espnow.com) ::

if there's been a way to build it
there'll be a way to destroy it
things are not all that out of control

                        - stereolab
