Re: Deltifying directories on the server

2011-02-01 Thread Branko Čibej
You do know that diff and delta are two different beasts, and that
the diff optimizations have no effect on deltas? :)

The problem with directory deltification lies in the length of the delta
chain and the frequency of directory lookup compared to file access. The
sad fact is that our directory storage (/and/ our API) are woefully
unsuited to their tasks. The way they're stored now (in both BDB and
FSFS back-ends) requires the whole directory to be read into memory and
hashed in order to find a single entry, and you have to do this for each
level of directories when resolving a path. It doesn't help that most of
the APIs are strictly path based, e.g., editor drives will do the lookup
any number of times.

The whole concept of directory storage needs to be changed. The easiest
way would be to store directories as B-trees, however, that doesn't play
nice with versioning. On the other hand, directory structure is well
known and there's no reason to use a generic delta algorithm to store
them. I could probably come up with a better storage schema for
directories in a couple weeks, but I don't have the time to implement
such a thing.

-- Brane

On 01.02.2011 05:29, Hyrum K Wright wrote:
 Philip and I had an interesting conversation with some users this
 evening, and I'm just archiving my brain dump here.

 These users have a large repository with a large number of branches in
 the /branches directory (~35k).  We described the well-known
 phenomenon in which directories aren't deltified on commit, and thus
 cause the repository to have very large revisions, even when the
 actual content changes are fairly small.  This is due to bubble up and
 having to re-write the entire directory list of the /branches
 directory.

 Philip recalled a time several years ago when he enabled directory
 deltification, but the performance was awful, and we've never released
 it.  In our discussion, we mentioned that directory deltification may
 be better performing now, especially in light of the imminent merge of
 the diff-bytes-optimizations branch.  In the case of a bubble-up
 directory modification, the prefix and suffix matching would simplify
 the problem space, leaving a very small diff.

 The only trouble with the above theory is if directory entry lists are
 stored in a hash, and are serialized in an unordered manner, thus
 negating any benefits prefix-scanning would provide (and potentially
 causing the horrific delta performance in the first place).

 Anyway, that was the kernel of our discussion.  I haven't dug around
 in the code to determine how much of it is true or not, but if anybody
 wants something to do, this might be interesting.

 -Hyrum



Re: Deltifying directories on the server

2011-02-01 Thread C. Michael Pilato
On 02/01/2011 04:43 AM, Branko Čibej wrote:
 You do know that diff and delta are two different beasts, and that
 the diff optimizations have no effect on deltas? :)
 
 The problem with directory deltification lies in the length of the delta
 chain and the frequency of directory lookup compared to file access. The
 sad fact is that our directory storage (/and/ our API) are woefully
 unsuited to their tasks. The way they're stored now (in both BDB and
 FSFS back-ends) requires the whole directory to be read into memory and
 hashed in order to find a single entry, and you have to do this for each
 level of directories when resolving a path. It doesn't help that most of
 the APIs are strictly path based, e.g., editor drives will do the lookup
 any number of times.
 
 The whole concept of directory storage needs to be changed. The easiest
 way would be to store directories as B-trees, however, that doesn't play
 nice with versioning. On the other hand, directory structure is well
 known and there's no reason to use a generic delta algorithm to store
 them. I could probably come up with a better storage schema for
 directories in a couple weeks, but I don't have the time to implement
 such a thing.
 
 -- Brane

I can only really speak for the BDB side of things, but... what he said.

-- 
C. Michael Pilato cmpil...@collab.net
CollabNet  www.collab.net  Distributed Development On Demand



signature.asc
Description: OpenPGP digital signature


Re: Deltifying directories on the server

2011-02-01 Thread Greg Hudson
On Tue, 2011-02-01 at 10:29 -0500, C. Michael Pilato wrote:
 I can only really speak for the BDB side of things, but... what he said.

I'll elaborate a little bit.  API issues aside, we're used to putting
artifacts from different versions in different places.  More so in FSFS,
where it was baked into the initial architecture, but also in BDB for
the most part.

The most efficient storage for large directories which frequently change
by small deltas would be some kind of multi-rooted B-tree.  To do that
efficiently (that is, without scattering each tiny change into a
separate disk block, requiring lots and lots of opens/seeks/reads),
you'd want to put artifacts from different versions of a directory all
in the same place.  You might be able to arrange it so that modifying a
directory is an append-only operation, avoiding the need for a lot of
copying, but you'd still want a place to append to for each directory,
which isn't how FSFS or BDB works.

So, I'm not sure we can ever have efficient handling of this use case
without designing a completely new back end--which wouldn't be a
terrible idea if someone's up to it.





Re: Deltifying directories on the server

2011-02-01 Thread Hyrum K Wright
On Tue, Feb 1, 2011 at 10:54 AM, Greg Hudson ghud...@mit.edu wrote:
 On Tue, 2011-02-01 at 10:29 -0500, C. Michael Pilato wrote:
 I can only really speak for the BDB side of things, but... what he said.

 I'll elaborate a little bit.  API issues aside, we're used to putting
 artifacts from different versions in different places.  More so in FSFS,
 where it was baked into the initial architecture, but also in BDB for
 the most part.

 The most efficient storage for large directories which frequently change
 by small deltas would be some kind of multi-rooted B-tree.  To do that
 efficiently (that is, without scattering each tiny change into a
 separate disk block, requiring lots and lots of opens/seeks/reads),
 you'd want to put artifacts from different versions of a directory all
 in the same place.  You might be able to arrange it so that modifying a
 directory is an append-only operation, avoiding the need for a lot of
 copying, but you'd still want a place to append to for each directory,
 which isn't how FSFS or BDB works.

 So, I'm not sure we can ever have efficient handling of this use case
 without designing a completely new back end--which wouldn't be a
 terrible idea if someone's up to it.

There's been a bit of talk here and there about a new back end, but
nothing concrete just yet.

Thanks all for the insight.

-Hyrum


Deltifying directories on the server

2011-01-31 Thread Hyrum K Wright
Philip and I had an interesting conversation with some users this
evening, and I'm just archiving my brain dump here.

These users have a large repository with a large number of branches in
the /branches directory (~35k).  We described the well-known
phenomenon in which directories aren't deltified on commit, and thus
cause the repository to have very large revisions, even when the
actual content changes are fairly small.  This is due to bubble up and
having to re-write the entire directory list of the /branches
directory.

Philip recalled a time several years ago when he enabled directory
deltification, but the performance was awful, and we've never released
it.  In our discussion, we mentioned that directory deltification may
be better performing now, especially in light of the imminent merge of
the diff-bytes-optimizations branch.  In the case of a bubble-up
directory modification, the prefix and suffix matching would simplify
the problem space, leaving a very small diff.

The only trouble with the above theory is if directory entry lists are
stored in a hash, and are serialized in an unordered manner, thus
negating any benefits prefix-scanning would provide (and potentially
causing the horrific delta performance in the first place).

Anyway, that was the kernel of our discussion.  I haven't dug around
in the code to determine how much of it is true or not, but if anybody
wants something to do, this might be interesting.

-Hyrum