Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)

2005-04-20 Thread Martin Uecker
On Wed, Apr 20, 2005 at 10:11:10PM +1000, Jon Seymour wrote:
 On 4/20/05, Linus Torvalds [EMAIL PROTECTED] wrote:
  
  
  I converted my git archives (kernel and git itself) to do the SHA1 hash
  _before_ the compression phase.
  
 
 Linus,
  
  Am I correct to understand that with this change, all the objects in
 the database are still being compressed (so no net performance benefit
 now), but by doing the SHA1 calculations before compression you are
 keeping open the possibility that at some point in the future you may
 use a different compression technique (including none at all) for some
 or all of the objects?

The main point is not about trying different compression
techniques, but that you don't need to compress data at all
just to calculate its hash (for example, to check whether it
is unchanged).
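
For illustration, a minimal sketch in Python of what that buys you
(the exact header format here is an assumption, not necessarily what
git uses): checking whether a working file still matches a stored
object costs one hash over the raw data and no compression at all.

  import hashlib

  def object_id(data, obj_type=b"blob"):
      # Hash the *uncompressed* payload, prefixed with a small type/size
      # header (the exact header format is an assumption), so no zlib
      # pass is needed just to compare a file against the database.
      header = obj_type + b" " + str(len(data)).encode() + b"\0"
      return hashlib.sha1(header + data).hexdigest()

  def is_unchanged(path, stored_id):
      with open(path, "rb") as f:
          return object_id(f.read()) == stored_id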

There are still some other design decisions I am worried
about:

The database is stored as a collection of files in the
underlying file system. Because the hashes are effectively
random, this leads to a horrible amount of seeking for every
operation that walks the logical structure of a tree stored
in the database.

Why not store all objects linearized in one or more flat
files?
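
Something like this, as a rough sketch (a hypothetical layout, not an
existing git format): append every object to one flat file and keep a
small index from hash to offset, so walking a tree turns into mostly
sequential reads instead of one open() and seek per object.

  import hashlib, struct

  class FlatStore:
      # Append-only object file plus an in-memory offset index (sketch).

      def __init__(self, path):
          self.f = open(path, "a+b")
          self.index = {}              # sha1 hex -> (offset, length)

      def put(self, data):
          oid = hashlib.sha1(data).hexdigest()
          if oid not in self.index:
              self.f.seek(0, 2)        # append at the end of the file
              self.f.write(struct.pack(">I", len(data)))
              start = self.f.tell()
              self.f.write(data)
              self.index[oid] = (start, len(data))
          return oid

      def get(self, oid):
          offset, length = self.index[oid]
          self.f.seek(offset)
          return self.f.read(length)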


The other thing I don't like is the use of a single SHA1 for
a complete file. Switching to some kind of hash tree would
make it possible to introduce chunks later. This has two
advantages:

It would allow git to scale to repositories of large binary
files. And it would allow building a very cool content
transport algorithm for those repositories, one that could
combine all the advantages of bittorrent and rsync (without
the CPU load).


And it would allow trivial merging of patches which apply to
different chunks of a file, in exactly the same way as merging
changesets which apply to different files in a tree.
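
To make the idea concrete, here is a minimal sketch (fixed-size chunks
and a plain binary Merkle tree, chosen only for brevity; real chunk
boundaries would have to be content-defined): the root hash identifies
the whole file, while every chunk and subtree can be verified on its
own, which is exactly what such a transport algorithm needs.

  import hashlib

  CHUNK = 8192                      # illustrative chunk size

  def sha1(data):
      return hashlib.sha1(data).digest()

  def tree_hash(chunks):
      # Leaves are the hashes of the chunks; each internal node hashes
      # the concatenation of its children's hashes (a plain Merkle tree).
      level = [sha1(c) for c in chunks]
      while len(level) > 1:
          level = [sha1(b"".join(level[i:i + 2]))
                   for i in range(0, len(level), 2)]
      return level[0]

  def file_id(data):
      chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
      return tree_hash(chunks).hex()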


Martin



Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)

2005-04-20 Thread Martin Uecker
On Wed, Apr 20, 2005 at 10:30:15AM -0400, C. Scott Ananian wrote:

Hi,

Your code looks pretty cool. Thank you!

 On Wed, 20 Apr 2005, Martin Uecker wrote:
 
 The other thing I don't like is the use of a sha1
 for a complete file. Switching to some kind of hash
 tree would allow to introduce chunks later. This has
 two advantages:
 
 You can (and my code demonstrates/will demonstrate) still use a whole-file 
 hash to use chunking.  With content prefixes, this takes O(N ln M) time 
 (where N is the file size and M is the number of chunks) to compute all 
 hashes; if subtrees can share the same prefix, then you can do this in 
 O(N) time (ie, as fast as possible, modulo a constant factor, which is 
 '2').  You don't *need* internal hashing functions.

I don't understand this paragraph. What is an internal
hash function? Your code seems to do exactly what I want.
The hashes are computed recursively as in a hash tree,
in O(N ln N) time. The only difference between your design
and a design based on a conventional (binary) hash tree
seems to be that data is stored in the intermediate nodes
too.

 It would allow git to scale to repositories of large
 binary files. And it would allow to build a very cool
 content transport algorithm for those repositories.
 This algorithm could combine all the advantages of
 bittorrent and rsync (without the cpu load).
 
 Yes, the big benefit of internal hashing is that it lets you check 
 validity of a chunk w/o having the entire file available.  I'm not sure 
 that's terribly useful in this case.  [And, if it is, then it can 
 obviously be done w/ other means.]

Unless I am missing something essential, you can validate
each treap piece against its SHA1 hash the moment you get it
from the network, and then proceed to download the prefix
and suffix subtrees (in parallel if you have more than one
peer, a la bittorrent).
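
Roughly like this, as a sketch (the piece layout and fetch_piece() are
made up for illustration): a piece carries the ids of its prefix and
suffix subtrees, so it can be checked against the SHA1 it was requested
under before anything below it is downloaded.

  import hashlib

  HASH = 20                         # bytes in a SHA1 digest

  def parse_piece(raw):
      # Assumed piece layout: 20-byte prefix id, 20-byte suffix id
      # (all zeros means "no subtree"), then the chunk data itself.
      return raw[:HASH], raw[HASH:2 * HASH], raw[2 * HASH:]

  def fetch_and_verify(piece_id, fetch_piece):
      # fetch_piece(piece_id) returns the raw piece bytes from some peer.
      raw = fetch_piece(piece_id)
      if hashlib.sha1(raw).digest() != piece_id:
          raise ValueError("corrupt or forged piece")   # reject bad data early
      prefix_id, suffix_id, data = parse_piece(raw)
      # Subtrees are fetched recursively here; a real client could request
      # them in parallel from different peers, a la bittorrent.
      out = b""
      if prefix_id != b"\0" * HASH:
          out += fetch_and_verify(prefix_id, fetch_piece)
      out += data
      if suffix_id != b"\0" * HASH:
          out += fetch_and_verify(suffix_id, fetch_piece)
      return out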

 And it would allow trivial merging of patches which
 apply to different chunks of a file in exact the same
 way as merging changesets which apply to different
 files in a tree.
 
 I'm not sure anyone should be looking at chunks.  To me, at least, they 
 are an object-store-implementation detail only.  For merging, etc, we 
 should be looking at whole files, or (better) the whole repository.
 The chunking algorithm is guaranteed not to respect semantic boundaries 
 (for *some* semantics of *some* file).

You might be right. I just wanted to point out this
possibility because it would avoid calling external merge
code for a lot of trivial merges.

bye,
Martin





Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)

2005-04-20 Thread Martin Uecker
On Wed, Apr 20, 2005 at 11:28:20AM -0400, C. Scott Ananian wrote:

Hi,
 
 A merkle-tree (which I think you initially pointed me at) makes the hash 
 of the internal nodes be a hash of the chunk's hashes; ie not a straight 
 content hash.  This is roughly what my current implementation does, but
 I would like to identify each subtree with the hash of the 
 *(expanded) contents of that subtree* (ie no explicit reference to 
 subtree hashes).  This makes it interoperable with non-chunked or 
 differently-chunked representations, in that the top-level hash is *just 
 the hash of the complete content*, not some hash-of-subtree-hashes.  Does 
 that make more sense?

Yes, thank you. But I would like to argue against this:

You can make the representations interoperable if you
calculate the hash for the non-chunked representation
exactly as if the file were stored chunked, but simply do
not store it that way.

Of course this is not backward compatible with the
monolithic hash, and not compatible with a differently
chunked representation (but you could store subtrees
unchunked if you think your chunks are too small).
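
As a sketch of what I mean (fixed-size chunks and a flat hash over the
chunk hashes, just to keep it short; the real thing would be the
content-defined chunk tree): the id of a file is always computed as if
it were chunked, and whether the store keeps it as one blob or as
pieces is purely a storage decision.

  import hashlib

  CHUNK = 8192    # illustrative; real boundaries would be content-defined

  def chunk_hashes(data):
      return [hashlib.sha1(data[i:i + CHUNK]).digest()
              for i in range(0, len(data), CHUNK)] or [hashlib.sha1(b"").digest()]

  def file_id(data):
      # Always hash over the chunk hashes, even if the object is going
      # to be stored as a single unchunked blob.
      return hashlib.sha1(b"".join(chunk_hashes(data))).hexdigest()

  def store(db, data, chunked):
      oid = file_id(data)
      if chunked:
          db[oid] = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
      else:
          db[oid] = [data]
      return oid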

 The code I posted doesn't demonstrate this very well, but now that Linus 
 has abandoned the 'hash of compressed content' stuff, my next code posting 
 should show this more clearly.

I think the hash of a treap piece should be calculated from
the hashes of the prefix and suffix subtrees and the already
calculated hash of the uncompressed data. This makes hashing
nearly as cheap as in Linus' version, which is important
because checking whether a given file has identical content
to a stored version should be fast.
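
Something along these lines (a sketch of the composition; the exact
byte layout is my own assumption): the expensive part, hashing the
uncompressed chunk data, happens once per chunk, and a piece id then
only hashes three small digests.

  import hashlib

  def sha1(data):
      return hashlib.sha1(data).digest()

  def piece_id(data_hash, prefix_id=b"", suffix_id=b""):
      # Combine the already computed hash of the uncompressed chunk data
      # with the ids of the prefix and suffix subtrees; rehashing after a
      # change in a subtree touches ~60 bytes, not the chunk data itself.
      return sha1(prefix_id + suffix_id + data_hash)

  # Example: a three-piece file (prefix chunk, root chunk, suffix chunk).
  prefix = piece_id(sha1(b"first part of the file"))
  suffix = piece_id(sha1(b"last part of the file"))
  root   = piece_id(sha1(b"middle part"), prefix, suffix)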

 If I don't miss anything essential, you can validate
 each treap piece at the moment you get it from the
 network with its SHA1 hash and then proceed with
 downloading the prefix and suffix tree (in parallel
 if you have more than one peer a la bittorrent).
 
 Yes, I guess this is the detail I was going to abandon. =)
 
 I viewed the fact that the top-level hash was dependent on the exact chunk 
 makeup a 'misfeature', because it doesn't allow easy interoperability with 
 existing non-chunked repos.

I thought of this as a misfeature too, before I realized how
many advantages it has.

Martin
 



Re: space compression (again)

2005-04-19 Thread Martin Uecker
On Sat, Apr 16, 2005 at 07:37:02PM +0200, Martin Uecker wrote:
 On Sat, Apr 16, 2005 at 11:11:00AM -0400, C. Scott Ananian wrote:
 
  The rsync approach does not use fixed chunk boundaries; this is necessary 
  to ensure good storage reuse for the expected case (ie; inserting a single 
  line at the start or in the middle of the file, which changes all the 
  chunk boundaries).
 
 Yes. The chunk boundaries should be determined deterministically
 from local properties of the data. Use a rolling checksum over
 some small window and split the file if it hits a special value (0).
 This is what the rsyncable patch to zlib does.

This is certainly uninteresting for source code repositories,
but for people who manage repositories of rsyncable binary
packages this would save a lot of space, bandwidth and CPU
time (compared to rsync, because the scanning phase is no
longer necessary).
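
For illustration, a minimal sketch of that kind of boundary detection
(a toy rolling sum over a small window, not the actual algorithm of
the rsyncable zlib patch): a boundary is declared wherever the window
checksum hits a chosen value, so boundaries depend only on nearby
bytes and survive insertions elsewhere in the file.

  WINDOW = 48          # bytes of context the rolling checksum looks at
  MASK   = 0x1fff      # on average one boundary every ~8 KiB

  def chunk_boundaries(data):
      # Yield end offsets of chunks, determined only by local content.
      s = 0
      for i in range(len(data)):
          s += data[i]
          if i >= WINDOW:
              s -= data[i - WINDOW]            # slide the window
              if (s & MASK) == 0:
                  yield i + 1                  # boundary after this byte
      yield len(data)                          # the final chunk

  def chunks(data):
      start = 0
      for end in chunk_boundaries(data):
          if end > start:
              yield data[start:end]
              start = end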

Martin
