Re: Tailmerging - detailed design (was: Questions...)

Daniel Phillips Wed, 02 Aug 2000 03:24:29 -0700
On Tue, 01 Aug 2000, Andreas Dilger wrote:
> You never mention what happens when a file doesn't have a tail yet,
> but then gets one...  

Every file has a tail, unless it is exactly some number of blocks long.  The
tail is the part at the end that doesn't fill a block completely.  In my
terminology, a "tail" differs from a "tail block" in that the tail is actual
data whereas the tail block is a place to put it, hopefully together with other
tails.

> Also, (you may have already considered this) you
> should design your code to work with an existing ext2 filesystem
> (i.e. partially used blocks at the end of a file, rather than tails)
> so you can simply "turn on" the tail feature and it will start working.

That's the idea.  See my original post on the subject of tailmerging where I
show how the current contents of the (as yet unused) fragment fields are to be
reinterpreted as tailmerge fields.

> > This doesn't attempt to do the tail merge incrementally - instead it's
> > going to be done at mount time, just for the prototype.
> 
> Could you explain what you mean by "incremental tail merge"?

The first, prototype implementation will have a mount option that says "make
one pass through all the inodes and try to do as much tailmerging as
possible".  No other tailmerging will be done after that, until the next mount. 
This is obviously wrong for the final implementation, in which tail merging will
be done on the fly, i.e., incrementally.

> > Synchronization:
> >   - Two processes attempting to unmerge at the same time could 
> >     deadlock, as each of them might attempts to lock the same 
> >     two inodes to perform a ring delete, and each succeed in 
> >     getting only one of them.  This impasse has to be resolved 
> >     in some deterministic way.
> 
> Normally the way you handle double locking is to sort the items and
> always lock in one direction.  For inodes, for example, you would always
> lock the lowest-numbered inode first, and then if there is any contention
> you will not deadlock.

Yes, that's the obvious resolution and I should have thought of it.  I guess
it must be a standard algorithm mentioned in any text on OS design.  I
suppose I should read one sometime. :-)

> One other thing you haven't mentioned is how to find space for new tails.
> Will it simply be a matter of "scan inodes (before and/or) after this one
> to find one with enough space for the new tail" or will tails always be new
> blocks?  Any kind of "best fit" algorithm to be used?

That remains to be determined.  Please check out the algorithm described
below.  It is admittedly naive - the main goal at this point is simplicity. 
Once we have working code (in the form of a patch to Ext2) feel free to plug in
your own algorithm and see if it works better.

> Scanning existing inodes to add tails would allow a lot of space to be
> recovered in an existing filesystem. Maybe you can generate a (partial)
> list of tails at mount time by scanning all the inodes (or some inodes in
> each group) and keep a cache to speed locating the best sized tail quickly.
> 
> Will directory inodes be treated differently from other inodes?

Here's the proposed mount-time tailmerging algorithm which was originally
posted as an attachment summarizing a series of emails between me and Stephen
Tweedie on this subject:
-------------------------------

Tail merge pass at mount time: 

Here is a relatively simple, non-incremental algorithm.  It makes one pass 
through all  inodes at mount time, trying to merge as many tails as possible. 
We keep a working  list of merge rings that we have already seen and that have 
more than some threshold  amount of slack space available in the tail block. 

Each entry in the working list has the following fields: 

  next ring in the working list 
  prev ring in the working list 
  inode (any inode in the ring) 
  lowest inode in ring 
  highest inode in ring 
  slack size (total free space in tail block) 

The "center" of a ring is the number halfway between the lowest and highest 
inode number. The working list will be kept sorted by the distance of ring 
center from the center of  the "current" ring, thus when looking for a merge 
candidate we will choose the "closest"  ring with that has enough tail slack. 
A practical result of this is that sequences of  many small files with adjacent 
inode numbers will tend to be merged together into a  single block, which is 
exactly what we want. 

Algorithm: 

  For each (allocated) inode: 

     (Each inode is a member of a tail merge ring consisting of one or more 
     inodes.)  Traverse the current ring starting with this inode.  If any link 
     points to an inode lower-numbered than the current one then we have 
     already seen this ring and we are finished with this inode.  Otherwise, 
     accumulate the ring statistics: total tail size; highest numbered inode. 
     While doing this, make a map of space allocation in the tail node. 
     Afterwards, use the map to check that the tail node is densely packed, and 
     if it isn't, repack the tail node and adjust the tail offsets of the ring 
     inodes accordingly.  (Now we can be sure that the tail node of any ring 
     in the working list is densely packed and thereby avoid the need to map 
     the tail node allocation again later.) 

     (Try to merge the ring with a ring in the working list.)  Working through 
     the rings in the working list in sorted order: 

        If a ring in the working list has enough tail slack space, merge the 
        current ring with it: 

          Copy in the tail fragments from the tail block of the current inode. 
          Free the tail block of the current ring and set the tail pointers to 
          the tail block of  the working list ring (not necessarily in that 
          order). 

          Then let: 

             pred1 be the current inode 
             succ1 be the successor inode in pred1's tail ring 
             pred2 be an inode from the working list ring 
             succ2 be the succesor inode to pred2 

          and set: 

             prec1.next = succ2 
             succ2.prev = prec1 
             prec2.next = succ1 
             succ1.prev = prec2 

          Update the statistics for the working list ring (remaining tail slack; 
          lowest and highest inode numbers) unless slack size has fallen to some 
          threshold (possibly  zero).  In that case, remove the ring from the 
          working list, otherwise resort the  list. 

    If it was not possible to merge the current ring then add it to the 
    working list in  the proper sorted location, setting the tail slack and 
    lowest/highest inode numbers. 

    (drop distant rings from the working list) 

This merge pass is intended to be run optionally at filesystem mount time. 
This is not  as good as a incremental merge but it will answer three important 
questions: (1) How much  space is saved by tail merging in real filesystems? 
(2) Are filesystem operations still  stable when merged tail blocks are 
present?  (3) What is the performance impact? 

Later, after things are stable, the one-time merge pass will be replaced by 
an  incremental merge performed at file close time.  The merge pass will still 
remain useful  as a means of checking that an incremental merging algorithm is 
close to optimal, and  also for one-time optimal packing of files in a 
read-only file system. 

-- 
Daniel
Re: Tailmerging - detailed design (was: Questions...)

Reply via email to