Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)
On Wed, Apr 20, 2005 at 10:11:10PM +1000, Jon Seymour wrote:

> On 4/20/05, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> > I converted my git archives (kernel and git itself) to do the SHA1
> > hash _before_ the compression phase.
>
> Linus,
>
> Am I correct to understand that with this change, all the objects in
> the database are still being compressed (so no net performance
> benefit now), but by doing the SHA1 calculations before compression
> you are keeping open the possibility that at some point in the
> future you may use a different compression technique (including none
> at all) for some or all of the objects?

The main point is not about trying different compression techniques,
but that you don't need to compress at all just to calculate the hash
of some data (to know whether it is unchanged, for example).

There are still some other design decisions I am worried about:

The database stores its objects as a collection of files in the
underlying file system. Because of the random nature of the hashes,
this leads to a horrible amount of seeking in all operations which
walk the logical structure of some tree stored in the database. Why
not store all objects linearized in one or more flat files?

The other thing I don't like is the use of a SHA1 hash for a complete
file. Switching to some kind of hash tree would allow chunks to be
introduced later. This has two advantages:

It would allow git to scale to repositories of large binary files.
And it would allow building a very cool content transport algorithm
for those repositories. This algorithm could combine all the
advantages of BitTorrent and rsync (without the CPU load).

And it would allow trivial merging of patches which apply to
different chunks of a file, in exactly the same way as merging
changesets which apply to different files in a tree.

Martin
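To make the hash-before-compression point concrete, here is a minimal
sketch in Python, using hashlib and zlib, with a dict standing in for
the object database (the function names are illustrative, not git's
actual code):

    import hashlib
    import zlib

    def object_id(data: bytes) -> str:
        # The id depends only on the raw bytes, so checking whether
        # some content is already stored never requires compression.
        return hashlib.sha1(data).hexdigest()

    def store(db: dict, data: bytes) -> str:
        oid = object_id(data)
        if oid not in db:
            # Compression happens only when new content is written.
            db[oid] = zlib.compress(data)
        return oid

Under the old hash-of-compressed-bytes scheme, store() would have had
to run zlib.compress() on every file just to learn its id.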
Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)
On Wed, Apr 20, 2005 at 10:30:15AM -0400, C. Scott Ananian wrote:

Hi,

your code looks pretty cool. Thank you!

> On Wed, 20 Apr 2005, Martin Uecker wrote:
>
> > The other thing I don't like is the use of a SHA1 hash for a
> > complete file. Switching to some kind of hash tree would allow
> > chunks to be introduced later. This has two advantages:
>
> You can (and my code demonstrates/will demonstrate) still use a
> whole-file hash to use chunking. With content prefixes, this takes
> O(N ln M) time (where N is the file size and M is the number of
> chunks) to compute all hashes; if subtrees can share the same
> prefix, then you can do this in O(N) time (ie, as fast as possible,
> modulo a constant factor, which is '2'). You don't *need* internal
> hashing functions.

I don't understand this paragraph. What is an internal hash function?

Your code seems to do exactly what I want. The hashes are computed
recursively as in a hash tree, in O(N ln N). The only difference
between your design and a design based on a conventional (binary)
hash tree seems to be that data is stored in the intermediate nodes
too.

> > It would allow git to scale to repositories of large binary files.
> > And it would allow building a very cool content transport
> > algorithm for those repositories. This algorithm could combine all
> > the advantages of BitTorrent and rsync (without the CPU load).
>
> Yes, the big benefit of internal hashing is that it lets you check
> the validity of a chunk w/o having the entire file available. I'm
> not sure that's terribly useful in this case. [And, if it is, then
> it can obviously be done w/ other means.]

If I don't miss anything essential, you can validate each treap piece
at the moment you get it from the network with its SHA1 hash, and
then proceed with downloading the prefix and suffix trees (in
parallel if you have more than one peer, a la BitTorrent).

> > And it would allow trivial merging of patches which apply to
> > different chunks of a file, in exactly the same way as merging
> > changesets which apply to different files in a tree.
>
> I'm not sure anyone should be looking at chunks. To me, at least,
> they are an object-store-implementation detail only. For merging,
> etc., we should be looking at whole files, or (better) the whole
> repository. The chunking algorithm is guaranteed not to respect
> semantic boundaries (for *some* semantics of *some* file).

You might be right. I just wanted to point out this possibility
because it would allow a lot of trivial merges to be handled without
calling external merging code.

bye,
Martin
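For contrast with Scott's design, here is what a conventional binary
hash tree (intermediate nodes carrying only hashes, no data) might
look like; the pairing scheme and the empty-input case are arbitrary
choices in this sketch, not anyone's posted code:

    import hashlib

    def sha1(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()

    def tree_hash(chunks: list) -> bytes:
        # Hash every chunk, then repeatedly hash pairs of child
        # hashes until one root remains: O(N) work at the leaves
        # plus O(M) internal nodes for M chunks.
        if not chunks:
            return sha1(b"")
        level = [sha1(c) for c in chunks]
        while len(level) > 1:
            pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
            level = [sha1(b"".join(p)) for p in pairs]
        return level[0]

A downloaded chunk can be checked against its leaf hash the moment it
arrives, which is the validate-on-receipt property discussed above.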
Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)
On Wed, Apr 20, 2005 at 11:28:20AM -0400, C. Scott Ananian wrote:

Hi,

> A merkle-tree (which I think you initially pointed me at) makes the
> hash of the internal nodes be a hash of the chunks' hashes; ie, not
> a straight content hash. This is roughly what my current
> implementation does, but I would like to identify each subtree with
> the hash of the *(expanded) contents of that subtree* (ie, no
> explicit reference to subtree hashes). This makes it interoperable
> with non-chunked or differently-chunked representations, in that the
> top-level hash is *just the hash of the complete content*, not some
> hash-of-subtree-hashes. Does that make more sense?

Yes, thank you. But I would like to argue against this:

You can make the representations interoperable if you calculate the
hash for the non-chunked representation exactly as if the file were
stored chunked, but simply do not store it that way. Of course, this
is not backward compatible with the monolithic hash, and not
compatible with a differently chunked representation (but you could
store subtrees unchunked if you think your chunks are too small).

> The code I posted doesn't demonstrate this very well, but now that
> Linus has abandoned the 'hash of compressed content' stuff, my next
> code posting should show this more clearly.

I think the hash of a treap piece should be calculated from the
hashes of the prefix and suffix trees and the already calculated hash
of the uncompressed data. This makes hashing nearly as cheap as in
Linus' version, which is important because checking whether a given
file has identical content to a stored version should be fast.

> > If I don't miss anything essential, you can validate each treap
> > piece at the moment you get it from the network with its SHA1
> > hash, and then proceed with downloading the prefix and suffix
> > trees (in parallel if you have more than one peer, a la
> > BitTorrent).
>
> Yes, I guess this is the detail I was going to abandon. =) I viewed
> the fact that the top-level hash was dependent on the exact chunk
> makeup as a 'misfeature', because it doesn't allow easy
> interoperability with existing non-chunked repos.

I thought of this as a misfeature too, before I realized how many
advantages it has.

Martin
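A sketch of the combination step proposed above, assuming a piece
consists of a prefix subtree, its own data, and a suffix subtree; the
field order and the marker for a missing subtree are guesses for
illustration, not a specification:

    import hashlib

    # Assumed marker for an absent prefix or suffix subtree.
    EMPTY = hashlib.sha1(b"").digest()

    def piece_hash(prefix_hash: bytes, data: bytes,
                   suffix_hash: bytes) -> bytes:
        # The expensive part, SHA1 over the piece's own data, is
        # computed once; the piece id then just combines three
        # fixed-size digests, so identifying a whole file stays
        # nearly as cheap as one flat SHA1 pass.
        data_hash = hashlib.sha1(data).digest()
        return hashlib.sha1(prefix_hash + data_hash
                            + suffix_hash).digest()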
Re: space compression (again)
On Sat, Apr 16, 2005 at 07:37:02PM +0200, Martin Uecker wrote:

> On Sat, Apr 16, 2005 at 11:11:00AM -0400, C. Scott Ananian wrote:
>
> > The rsync approach does not use fixed chunk boundaries; this is
> > necessary to ensure good storage reuse for the expected case (ie,
> > inserting a single line at the start or in the middle of the file,
> > which changes all the chunk boundaries).
>
> Yes. The chunk boundaries should be determined deterministically
> from local properties of the data.

Use a rolling checksum over some small window and split the file when
it hits a special value (0). This is what the rsyncable patch to zlib
does.

This is certainly uninteresting for source code repositories, but for
people who manage repositories of rsyncable binary packages this
would save a lot of space, bandwidth, and CPU time (compared to
rsync, because the scanning phase is no longer necessary).

Martin
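A sketch of boundary selection with a rolling checksum; the window
size, mask, and minimum chunk length are illustrative parameters, and
the actual rsyncable patch to zlib uses its own sum and threshold:

    def chunk_boundaries(data: bytes, window: int = 48,
                         mask: int = 0xFFF,
                         min_len: int = 256) -> list:
        # Additive sum over a sliding window; cut where the masked
        # sum hits the special value 0. An insertion only disturbs
        # the cuts near the edit, so later chunks keep their hashes.
        s = 0
        cuts = []
        last = 0
        for i in range(len(data)):
            s += data[i]
            if i >= window:
                s -= data[i - window]
            if i + 1 - last >= min_len and (s & mask) == 0:
                cuts.append(i + 1)
                last = i + 1
        return cuts

Two versions of a package that differ by one inserted byte share
every chunk after the window re-synchronizes, which is where the
space and bandwidth savings come from.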