Blob chunking code. [First look.]

2005-04-20 Thread C. Scott Ananian
So I wrote up my ideas regarding blob chunking as code; see attached. This is against git-0.4 (I know, ancient, but I had to start somewhere.) The idea here is that blobs are chunked using a rolling checksum (so the chunk boundaries are content-dependent and stay fixed even if you mutate pieces
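
Editor's illustration (not the attached patch, which is not reproduced in this listing): a minimal sketch of content-dependent chunking, assuming a plain rolling byte sum as the checksum. The function name `split_chunks`, the 16-byte window, and the 6-bit mask are hypothetical choices made only for this example. Because each cut decision looks only at the last few bytes, an insertion early in a blob should disturb only the chunks near the edit.

```python
import random


def split_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut wherever the sum of the last `window` bytes has all `mask` bits clear.

    The cut decision depends only on nearby content, not on the offset.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        if i + 1 >= window and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


# Hypothetical test data: 4000 pseudo-random bytes, then one byte inserted.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4000))
edited = original[:100] + b"X" + original[100:]

before = split_chunks(original)
after = split_chunks(edited)
print(len(before), "chunks;", len(set(before) & set(after)), "shared after the edit")
```

A plain sum is a weak checksum; a real implementation would more likely use something like an Adler-style or polynomial rolling hash, plus minimum and maximum chunk sizes.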

Re: [PATCH] write-tree performance problems

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Chris Mason wrote: With the basic changes I described before, the 100 patch time only goes down to 40s. Certainly not fast enough to justify the changes. In this case, the bulk of the extra time comes from write-tree writing the index file, so I split write-tree.c up into

Re: [PATCH] write-tree performance problems

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Linus Torvalds wrote: I was considering using a chunked representation for *all* files (not just blobs), which would avoid the original 'trees must reference other trees or they become too large' issue -- and maybe the performance issue you're referring to, as well? No. The

Re: [PATCH] Some documentation...

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, David Greaves wrote: In doing this I noticed a couple of points: * update-cache won't accept ./file or fred/./file The comment in update-cache.c reads: /* * We fundamentally don't like some paths: we don't want * dot or dot-dot anywhere, and in fact, we don't even want *

Blob chunking code. [Second look]

2005-04-20 Thread C. Scott Ananian
store. This way * similar files will be expected to share chunks, saving space. * Files less than one disk block long are expected to fit in a single * chunk, so there is no extra indirection overhead for this case. * * Copyright (C) 2005 C. Scott Ananian [EMAIL PROTECTED] */ /* * We assume

Re: [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Petr Baudis wrote: I think one thing git's objects database is not very well suited for are network transports. You want to have something smart doing the transports, comparing trees so that it can do some delta compression; that could probably reduce the amount of data needed

Re: chunking (Re: [ANNOUNCEMENT] /Arch/ embraces `git')

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Linus Torvalds wrote: What's the disk usage results? I'm on ext3, for example, which means that even small files invariably take up 4.125kB on disk (with the inode). Even uncompressed, most source files tend to be small. Compressed, I'm seeing the median blob size being ~1.6kB
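
Editor's illustration: the 4.125kB figure quoted above is consistent with one 4096-byte block plus a 128-byte inode. The sketch below works through that arithmetic; the block size, the inode size, and the helper name `on_disk_bytes` are assumptions for this example, not values stated in the thread.

```python
def on_disk_bytes(file_size: int, block_size: int = 4096, inode_size: int = 128) -> int:
    """Approximate footprint: whole blocks for the data, plus one inode."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size + inode_size


# A blob near the quoted ~1.6kB median still occupies a full block.
print(on_disk_bytes(1600) / 1024)
```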

Re: [PATCH] write-tree performance problems

2005-04-19 Thread C. Scott Ananian
On Tue, 19 Apr 2005, Linus Torvalds wrote: (*) Actually, I think it's the compression that ends up being the most expensive part. You're also using the equivalent of '-9', too -- and *that's slow*. Changing to Z_NORMAL_COMPRESSION would probably help a lot (but would break all existing
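
Editor's illustration: zlib's named levels are `Z_BEST_SPEED` (1), `Z_DEFAULT_COMPRESSION` (-1) and `Z_BEST_COMPRESSION` (9); the "-9" in the quoted text corresponds to the last of these. The sketch below compares them through Python's `zlib` module on a made-up payload. It is only a way to observe the speed and size trade-off, not a measurement of git itself.

```python
import time
import zlib

# Hypothetical payload standing in for a source-file blob.
payload = b"static int write_tree(const char *base, int len);\n" * 2000

for name, level in [
    ("Z_BEST_SPEED", zlib.Z_BEST_SPEED),
    ("Z_DEFAULT_COMPRESSION", zlib.Z_DEFAULT_COMPRESSION),
    ("Z_BEST_COMPRESSION", zlib.Z_BEST_COMPRESSION),
]:
    started = time.perf_counter()
    packed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - started
    # Every level must round-trip to the same bytes.
    assert zlib.decompress(packed) == payload
    print(f"{name:>22} level={level:>2} size={len(packed)} time={elapsed:.6f}s")
```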

Re: SHA1 hash safety

2005-04-19 Thread C. Scott Ananian
On Tue, 19 Apr 2005, David Meybohm wrote: But doesn't this require assuming the distribution of MD5 is uniform, and don't the papers finding collisions in less show it's not? So, your birthday-argument for calculating the probability wouldn't apply, because it rests on the assumption MD5 is
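
Editor's illustration: the birthday argument referred to above can be written as p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n objects and a b-bit hash. The sketch below evaluates that approximation. It assumes a uniformly distributed hash, which is exactly the assumption this message questions, and the function name `collision_probability` is hypothetical.

```python
import math


def collision_probability(n_objects: int, hash_bits: int) -> float:
    """Birthday-bound estimate of at least one collision among n_objects.

    Assumes hash outputs are uniform over 2**hash_bits values.
    """
    pairs = n_objects * (n_objects - 1) // 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2 ** hash_bits)


print(collision_probability(10 ** 9, 160))  # a billion objects, 160-bit hash
print(collision_probability(2 ** 64, 128))  # around the birthday bound for 128 bits
```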

Re: SHA1 hash safety

2005-04-18 Thread C. Scott Ananian
On Mon, 18 Apr 2005, Andy Isaacson wrote: If you had actual evidence of a collision, I'd love to see it - even if it's just the equivalent of % md5 foo d3b07384d113edec49eaa6238ad5ff00 foo % md5 bar d3b07384d113edec49eaa6238ad5ff00 bar % cmp foo bar foo bar differ: byte 25, line 1 % But in the

Re: another perspective on renames.

2005-04-15 Thread C. Scott Ananian
On Thu, 14 Apr 2005, Paul Jackson wrote: To me, rename is a special case of the more general case of a big chunk of code (a portion of a file) that was in one place either being moved or copied to another place. I wonder if there might be some way to use the tools that biologists use to analyze DNA