On Wed, Apr 20, 2005 at 10:11:10PM +1000, Jon Seymour wrote:
> On 4/20/05, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > 
> > 
> > I converted my git archives (kernel and git itself) to do the SHA1 hash
> > _before_ the compression phase.
> > 
> 
> Linus,
>  
>  Am I correct to understand that with this change, all the objects in
> the database are still being compressed (so no net performance benefit
> now), but by doing the SHA1 calculations before compression you are
> keeping open the possibility that at some point in the future you may
> use a different compression technique (including none at all) for some
> or all of the objects?

The main point is not about trying different compression
techniques, but that you don't need to compress at all
just to calculate the hash of some data (for example, to
know whether it is unchanged).
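
To put the difference in a few lines of Python (purely an
illustration, and ignoring the small header git actually
prepends to each object before hashing):

import hashlib
import zlib

def object_name_old(data):
    # old scheme: the name depends on the compressed stream, so you
    # have to run zlib even just to check whether data is unchanged
    return hashlib.sha1(zlib.compress(data)).hexdigest()

def object_name_new(data):
    # new scheme: hash the raw data; compression becomes a pure
    # storage detail that can be skipped, changed, or done lazily
    return hashlib.sha1(data).hexdigest()

With the new scheme an "is this file unchanged?" check is a
single SHA1 over the working-tree data, with no zlib involved.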

There are still some other design decisions I am worried
about:

The first is the way the database is stored: as a
collection of small files in the underlying file system.
Because the hashes are essentially random, this causes a
horrible amount of seeking for every operation that walks
the logical structure of some tree stored in the database.

Why not store all objects linearized in one or more
flat files?
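
Something along these lines is what I have in mind (a Python
sketch with invented names, not a concrete proposal; compression
and persisting the index are left out):

import hashlib

class FlatStore:
    # hypothetical linearized object store: one data file plus an
    # in-memory index {sha1 -> (offset, length)}
    def __init__(self, path):
        self.path = path
        self.index = {}

    def add(self, data):
        name = hashlib.sha1(data).hexdigest()
        if name not in self.index:           # objects are immutable
            with open(self.path, "ab") as f:
                f.seek(0, 2)                 # append at end of file
                self.index[name] = (f.tell(), len(data))
                f.write(data)
        return name

    def get(self, name):
        offset, length = self.index[name]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

Objects written together, e.g. all the blobs and trees of one
commit, end up adjacent in the file, so walking that tree turns
into mostly sequential reads instead of one open() and seek per
object.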


The other thing I don't like is the use of a single SHA1
for a complete file. Switching to some kind of hash tree
would make it possible to introduce chunks later. This
has two advantages:

It would allow git to scale to repositories of large
binary files, and it would make it possible to build a
very cool content transport algorithm for those
repositories. This algorithm could combine all the
advantages of bittorrent and rsync (without the CPU load).
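
To make the chunk idea concrete, here is one possible shape for
such a hash tree (only a sketch; the chunk size, tree depth and
exact format are all open). The file object becomes a list of
chunk hashes, so a receiver can compare chunk lists and fetch
only the chunks it is missing, which is where the
bittorrent/rsync-like transport would come from:

import hashlib

CHUNK_SIZE = 64 * 1024   # made-up size; a real scheme might pick
                         # content-defined chunk boundaries instead

def chunk_hashes(data):
    # leaves of the hash tree: one SHA1 per chunk
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

def file_name(data):
    # root of this (two-level) tree: a SHA1 over the chunk hashes
    return hashlib.sha1("".join(chunk_hashes(data)).encode()).hexdigest()

def chunks_to_fetch(local_data, remote_chunk_list):
    # transfer step: only fetch the chunks we don't already have
    have = set(chunk_hashes(local_data))
    return [h for h in remote_chunk_list if h not in have]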


And it would allow trivial merging of patches that apply
to different chunks of a file, in exactly the same way as
merging changesets that apply to different files in a
tree.
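
The chunk-level merge could then be as trivial as this sketch
(again purely illustrative, with the hard cases punted on): take
each changed chunk from whichever side changed it, and only treat
a chunk as a conflict when both sides touched it.

def merge_chunk_lists(base, ours, theirs):
    # assumes all three versions have the same number of chunks; a
    # real scheme would also have to handle insertions and deletions
    merged = []
    for b, o, t in zip(base, ours, theirs):
        if o == t:                 # both sides agree (or neither changed it)
            merged.append(o)
        elif b == o:               # only "theirs" touched this chunk
            merged.append(t)
        elif b == t:               # only "ours" touched this chunk
            merged.append(o)
        else:                      # both changed the same chunk
            raise ValueError("conflict in chunk %d" % len(merged))
    return merged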


Martin

-- 
One night, when little Giana from Milano was fast asleep,
she had a strange dream.
