On Wed, Apr 20, 2005 at 10:11:10PM +1000, Jon Seymour wrote:
> On 4/20/05, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> >
> > I converted my git archives (kernel and git itself) to do the SHA1 hash
> > _before_ the compression phase.
>
> Linus,
>
> Am I correct to understand that with this change, all the objects in
> the database are still being compressed (so no net performance benefit
> now), but by doing the SHA1 calculations before compression you are
> keeping open the possibility that at some point in the future you may
> use a different compression technique (including none at all) for some
> or all of the objects?

The main point is not about trying different compression techniques, but
that you don't need to compress anything at all just to calculate the hash
of some data (to check whether it is unchanged, for example).

There are still some other design decisions I am worried about:

The first is the storage method: the database is a collection of files in
the underlying file system. Because the hashes are effectively random, this
leads to a horrible amount of seeking for every operation that walks the
logical structure of a tree stored in the database. Why not store all
objects linearized in one or more flat files?

The other thing I don't like is the use of one SHA1 for a complete file.
Switching to some kind of hash tree would make it possible to introduce
chunks later. This has two advantages:

It would allow git to scale to repositories of large binary files. And it
would make it possible to build a very cool content transport algorithm for
those repositories. This algorithm could combine all the advantages of
bittorrent and rsync (without the CPU load).

And it would allow trivial merging of patches that apply to different
chunks of a file, in exactly the same way as merging changesets that apply
to different files in a tree.

Martin

-- 
One night, when little Giana from Milano was fast asleep, she had a strange dream.
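
To make the chunk idea above a bit more concrete, here is a rough Python
sketch. It is not git code: the fixed chunk size, the function names and the
two-level layout (one SHA1 per chunk, one SHA1 over the list of chunk ids)
are all assumptions made for illustration. Note that every hash is computed
on the uncompressed bytes, so no compression step is needed to find out what
changed.

```python
import hashlib

# Hypothetical chunk size, kept tiny so the example is easy to follow.
CHUNK_SIZE = 4


def chunk_ids(data, chunk_size=CHUNK_SIZE):
    """Return the SHA1 hex digest of each fixed-size chunk of the raw data."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def file_id(data, chunk_size=CHUNK_SIZE):
    """Top-level id: SHA1 over the concatenated chunk ids (a two-level hash tree)."""
    joined = "".join(chunk_ids(data, chunk_size)).encode("ascii")
    return hashlib.sha1(joined).hexdigest()


def changed_chunks(old, new, chunk_size=CHUNK_SIZE):
    """Indices of chunks whose ids differ, or that exist on only one side."""
    a = chunk_ids(old, chunk_size)
    b = chunk_ids(new, chunk_size)
    longest = max(len(a), len(b))
    return [
        i for i in range(longest)
        if i >= len(a) or i >= len(b) or a[i] != b[i]
    ]


if __name__ == "__main__":
    old = b"aaaabbbbcccc"
    new = b"aaaaXbbbcccc"  # one byte changed inside the second chunk
    print(file_id(old) != file_id(new))
    print(changed_chunks(old, new))
```

Only the chunks reported by changed_chunks() would have to be transferred or
merged. Fixed-size chunks are the simplest possible choice here: a byte
inserted near the start of a file shifts every later chunk boundary, so a
real design would probably want content-defined boundaries instead.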