Re: Journaled filesystem in CURRENT

Terry Lambert Thu, 26 Sep 2002 11:35:41 -0700

David Malone wrote:
> > > On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote:
> > > > I think that what you were probably testing was directory entry
> > > > layout and O(N) (linear) vs. O(log2(N)+1) search times for both
> > > > non-existant entries on creates, and for any entry on lookup
> > > > ( / 2 on lookup) .
> > >
> > > Though dirhash should eliminate most of this...
> 
> > Everybody alsways says that, and then backs off, when they realize
> > that a traversal of a mail queue of 100,000 entries, in which the
> > destination is known by the contents of the file, rather than the
> > file name, is involved.  8-).
> 
> If you are searching based on contents of a file, then any directory
> layout scheme will require mean N/2 probes on success and N on
> failure surely? And if these probes are linear (ie. in the order
> they are in the directory) then this really is O(N) both with and
> without dirhash 'cos the probles will be O(1).


~O((N^4)/8)/2, actually.

You linearly traverse for the queue element files, and then the
queue elelement files tell you name of the queue content file,
which you you have to look up.  So it's a combined traversal
and lookup on the same directory (in fact: a "dirhash buster",
with some of the least optimal behaviour possible).  There are
two additional lookups, which occur to unlink the queue entry

file, and the message file, so it's really (for n queue entries,
which means twice that many directory entries):

 N = n*2
 ----
 \
  >    O(N) * O(N/2) * O(N/2) * O((N-1)/2)
 /
 ----
 N = 0

(Assuming successful delivery and removal of the queue files
on each element iterated).

The way this is "fixed" in ext3 or most JFS implementations
(both XFS and IBM's OS/2 JFS for Linux) is that the linear
traversal is linear... meaning you don't restart the scan
each time... and the explicit file lookup is O(log2(N+1)).

N * log2(N+1)^3 is significantly smaller than (N^4)/8 (in case
you were wondering about the "/8", it's because statistically,
you only have to traverse 50% of the directory entries, on
average, for a linear lookup that results in a "hit", but that
only applies to explicit lookups, not the traversal); the result
is (again for n queue entries):

 N = n*2
 ----
 \
  >    O(N) * O(log2(N+1)) * O(log2(N+1)) * O(log2(N))
 /
 ----
 N = 0

There are other data structures that could reduce this further
than btree, FWIW, but implementing them in directories is
moderately hard because of metadata ordering guarantees, and
directory entry locking.  Still, it's probably worth doing, if
you can figure out a way to eliminate the need for directory
vnode locking for modification operations (or can make them
over into range-lock operations, instead).

One "obvious" fix is to time-order file creations, to try and
keep block locality "close" to time locality (i.e., if you are
going to create 2 files (f1,f2) in some time interval [t1..t2],
then you try to guarantee that the directory entry block that
contains f2 is after the cone that contains f1, so that a linear
progressive search from the current linear traversal location
that resulted in f1 being found is likely to find f2, either
immediately, or at least within the next block or two).  The
problem with doing this is the inability to ensure that the
file you are creating does not exist... without a full traversal.

This requires that there is some "cooperation" involved, so
that the lookup traversal is picked up following the current
offset of the linear traversal in progress.  It also fails,
if simultaneos traversals occur... assuming that the offset is
not maintained per-process, but instead, per directory.  If it's
per-process, it actually works out, but only because each process
in the sendmail case only has one queue run going on at a time.

Note that none of this accounts for the queue entry creation of
two additional files; that's a O(4*N+1), since create requires
that the file not already exist (and is not helped by dirhash
at all, being linear, by definition).  For a btree, that's only
O(2*log2(N+1)+2) for the two insertions.

-- Terry

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message

Re: Journaled filesystem in CURRENT

Reply via email to