Blob chunking code. [First look.]

2005-04-20 Thread C. Scott Ananian
So I wrote up my ideas regarding blob chunking as code; see attached.
This is against git-0.4 (I know, ancient, but I had to start somewhere.)
The idea here is that blobs are chunked using a rolling checksum (so the 
chunk boundaries are content-dependent and stay fixed even if you mutate 
pieces of the file).  The chunks are then tree-structured as 'treaps', 
which will ensure that chunk trees can be profitably reused.  (If you 
create a flat 'chunk index' instead of tree-structuring it, then you need 
to write two files even if you make a small change to a small file.  If 
you use a full binary tree, then insertions at the beginning (say) still 
change the entire tree structure.  The treap ensures that on average O(ln N)
chunks need to be written per change, where N is the number of chunks in
the file).  More details are in the code.
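
To make the boundary rule concrete, here's a minimal standalone sketch
(the additive rolling sum, window, and target size below are illustrative
stand-ins, not the exact checksum the patch uses):

/* content-defined chunk boundaries via a simple rolling sum */
#include <stdio.h>
#include <stddef.h>

#define WINDOW 30   /* bytes of context in the rolling sum */
#define TARGET 64   /* desired average chunk size (tiny, for the demo) */

int main(void) {
	const unsigned char buf[] =
		"The quick brown fox jumps over the lazy dog; "
		"the quick brown fox jumps over the lazy dog again.";
	size_t n = sizeof(buf) - 1, start = 0;
	unsigned long sum = 0;
	for (size_t i = 0; i < n; i++) {
		sum += buf[i];                  /* byte enters the window */
		if (i >= WINDOW)
			sum -= buf[i - WINDOW]; /* byte leaves the window */
		/* Cut when the window sum hits a fixed residue: on random data
		 * this fires with probability ~1/TARGET per byte, so chunks
		 * average TARGET bytes, and a local edit only moves the cuts
		 * within one window of the change. */
		if (sum % TARGET == TARGET - 1) {
			printf("chunk: [%zu, %zu)\n", start, i + 1);
			start = i + 1;
		}
	}
	if (start < n)
		printf("chunk: [%zu, %zu)\n", start, n);
	return 0;
}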

Compatibility with existing archives in git-0.4 was tricky, because of 
git's 'compress-before-hash' thingy.  Moving to 'hash before compress' is 
*much* better, although, because the file size is included in the hash, I
will need to perform (the equivalent of) O(ln N) hashes over the complete
file.
If the file size weren't included, or if it were put at the end, then 2 
hashes would suffice. (Basically, we can save work hashing subranges which 
are prefix-identical, but including the length means that no subtrees are
prefix-identical.)
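
Concretely, with the '<type> <size>\0' header hashed up front (a sketch of
the scheme only -- blob_id is a hypothetical helper, not code from the
patch):

/* git-style object id = SHA1("blob <size>\0" || data) */
#include <stdio.h>
#include <openssl/sha.h>

static void blob_id(const unsigned char *data, unsigned long len,
                    unsigned char sha1[20]) {
	char hdr[32];
	int hdrlen = snprintf(hdr, sizeof(hdr), "blob %lu", len) + 1; /* keep NUL */
	SHA_CTX c;
	SHA1_Init(&c);
	SHA1_Update(&c, hdr, hdrlen); /* the length goes in FIRST, so the hash
	                               * state after the header differs for every
	                               * size: no SHA1 midstate can be shared
	                               * between prefix-identical subranges */
	SHA1_Update(&c, data, len);
	SHA1_Final(sha1, &c);
}

int main(void) {
	unsigned char sha1[20];
	blob_id((const unsigned char *)"hello\n", 6, sha1);
	for (int i = 0; i < 20; i++)
		printf("%02x", sha1[i]);
	printf("\n");
	return 0;
}

If the length came last instead, the midstate after hashing a shared prefix
could seed the hash of every enclosing range, and two passes would do.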

I'll work on bringing this forward to the latest git, but I thought I'd 
post it here for early reviews and comments.  My informal testing shows 
that 1) my chunk size is currently too small, and 2) subrange sharing 
works well even on relatively small files.  I'll be working on getting 
concrete numbers for larger archives.
 --scott

 ( http://cscott.net/ )
--- begin chunk.c ---
#include <stdlib.h>

/* we could be clever and do this even if we don't fit in memory...
 * ... but we're going to be quick and dirty. */
/* C source has approx 5 bits per character of entropy.
 * We'd like to get 32 bits of good entropy; that means 7 bytes is a
 * reasonable minimum for the window size. */
#define ROLLING_WINDOW 30
#define CHUNK_SIZE 1023 /* desired block size */
#include <assert.h>
#include "cache.h"
/*
 * This file implements a treap-based chunked content store.  The
 * idea is that every stored file is broken down into tree-structured
 * chunks (that is, every chunk has an optional 'prefix' and 'suffix'
 * chunk), and these chunks are put in the object store.  This way
 * similar files will be expected to share chunks, saving space.
 * Files less than one disk block long are expected to fit in a single
 * chunk, so there is no extra indirection overhead for this case.
 */
/* First, some data structures: */
struct chunk {
	/* a chunk represents some range of the underlying file */
	size_t start /* inclusive */, end /* exclusive */;
	unsigned char sha1[20]; /* sha1 for this chunk; used as the heap key */
};
struct chunklist {
	/* a dynamically-sized list of chunks */
	struct chunk *chunk; /* an array of chunks */
	size_t num_items; /* how many items are currently in the list */
	size_t allocd;    /* how many items we've allocated space for */
};
struct treap {
	/* A treap node represents a run of consecutive chunks. */
	struct chunk *chunk; /* some chunk in the run. */
	/* treaps representing the run before 'chunk' (left) and
	 * after 'chunk' (right).  */
	struct treap *left, *right;
	/* sha1 for the run represented by this treap */
	unsigned char sha1[20];
};
static struct chunklist *
create_chunklist(int expected_items) {
	struct chunklist *cl = malloc(sizeof(*cl));
	cl->num_items = 0;
	cl->allocd = expected_items;
	cl->chunk = malloc(sizeof(cl->chunk[0]) * cl->allocd);
	return cl;
}
static void
free_chunklist(struct chunklist *cl) {
	free(cl->chunk);
	free(cl);
}
/* Add a chunk to the chunk list, calculating its SHA1 in the process. */
/* The chunk includes buf[start] to buf[end-1].*/
static void
add_chunk(struct chunklist *cl, char *buf, size_t start, size_t end) {
	struct chunk *ch;
	SHA_CTX c;
	assert(start < end); assert(cl); assert(buf);
	if (cl->num_items >= cl->allocd) {
		cl->allocd = cl->allocd*3/2 + 1; /* +1 so tiny lists still grow */
		cl->chunk = realloc(cl->chunk, cl->allocd * sizeof(*(cl->chunk)));
	}
	assert(cl->num_items < cl->allocd);
	ch = cl->chunk + (cl->num_items++);
	ch->start = start;
	ch->end = end;
	/* compute SHA-1 */
	SHA1_Init(&c);
	SHA1_Update(&c, buf+start, end-start);
	SHA1_Final(ch->sha1, &c);
	/* done! */
}
/* Split a buffer into chunks, using a rolling checksum over ROLLING_WINDOW
 * bytes to determine chunk boundaries.  We try to split chunks into pieces
 * whose size averages out to be 'CHUNK_SIZE'. */
static void
chunkify(struct chunklist *cl, char *buf, size_t size) {
	int i, 
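/* [The message is truncated here in the archive.  As a stand-in, here is
 * a sketch of the shape such a loop could take -- an illustration, not
 * the original patch: slide a rolling sum over ROLLING_WINDOW bytes and
 * cut wherever it hits a fixed residue mod CHUNK_SIZE.] */
static void
chunkify_sketch(struct chunklist *cl, char *buf, size_t size) {
	size_t i, start = 0;
	unsigned long sum = 0;
	for (i = 0; i < size; i++) {
		sum += (unsigned char) buf[i];
		if (i >= ROLLING_WINDOW)
			sum -= (unsigned char) buf[i - ROLLING_WINDOW];
		if (sum % CHUNK_SIZE == CHUNK_SIZE - 1) {
			add_chunk(cl, buf, start, i + 1); /* emit [start, i] */
			start = i + 1;
		}
	}
	if (start < size)
		add_chunk(cl, buf, start, size); /* emit the tail */
}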

Re: [PATCH] write-tree performance problems

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Chris Mason wrote:
> With the basic changes I described before, the 100 patch time only goes down
> to 40s.  Certainly not fast enough to justify the changes.  In this case, the
> bulk of the extra time comes from write-tree writing the index file, so I
> split write-tree.c up into libwrite-tree.c, and created update-cache
> --write-tree.
Hmm.  Are our index files too large, or is there some other factor?
I was considering using a chunked representation for *all* files (not just 
blobs), which would avoid the original 'trees must reference other trees 
or they become too large' issue -- and maybe the performance issue you're 
referring to, as well?
 --scott

 ( http://cscott.net/ )


Re: [PATCH] write-tree performance problems

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Linus Torvalds wrote:
> > I was considering using a chunked representation for *all* files (not just
> > blobs), which would avoid the original 'trees must reference other trees
> > or they become too large' issue -- and maybe the performance issue you're
> > referring to, as well?
> No. The most common index file operation is reading, and that's the one
> that has to be _fast_. And it is - it's a single mmap and some parsing.
OK, sure.  But how 'bout chunking trees?  Have you grown happy with the new
trees-reference-other-trees paradigm, or is there a deep longing in your
heart for the simplicity of 'trees-reference-blobs-period'?  I'm fairly
certain that chunking could get you the space savings you need without
multi-level trees, if the simplicity of that is still appealing.

Not necessarily for rev.1 of the chunking code, but I'm curious as to
whether it's still of interest at all.  I don't know how deeply ingrained
multi-level trees have become since they were adopted.
 --scott

 ( http://cscott.net/ )


Re: [PATCH] Some documentation...

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, David Greaves wrote:
> In doing this I noticed a couple of points:
> * update-cache won't accept ./file or fred/./file
The comment in update-cache.c reads:
/*
 * We fundamentally don't like some paths: we don't want
 * dot or dot-dot anywhere, and in fact, we don't even want
 * any other dot-files (.git or anything else). They
 * are hidden, for chist sake.
 *
 * Also, we don't want double slashes or slashes at the
 * end that can make pathnames ambiguous.
 */
It could be argued that './' is a special case... but at the moment this
is definitely a designed 'feature', not a 'bug'.
 --scott

 ( http://cscott.net/ )


Blob chunking code. [Second look.]

2005-04-20 Thread C. Scott Ananian
Here's a quick rev of the chunking code.  This is compatible with 
git-current, where the hashes are of the *uncompressed* file.
The 'chunk' file gets dropped in at the same SHA1 filename as the
'blob' file, as it represents identical contents.  Martin won't like
this (because of how the hash is computed), but this is the short-term
direction I want to pursue to validate the concept: it means I can
run a simple converter over all the blob objects and don't have to
rewrite tree and commit objects.

If the approach is seen to have merit, then we can perhaps think about 
doing another bulk repository format conversion where all the hashes
change.  But (IMO) it's a little early to be thinking of this yet.
 --scott

 ( http://cscott.net/ )
--- begin chunk.c ---
/*
 * This file implements a treap-based chunked content store.  The
 * idea is that every stored file is broken down into tree-structured
 * chunks (that is, every chunk has an optional 'prefix' and 'suffix'
 * chunk), and these chunks are put in the object store.  This way
 * similar files will be expected to share chunks, saving space.
 * Files less than one disk block long are expected to fit in a single
 * chunk, so there is no extra indirection overhead for this case.
 *
 * Copyright (C) 2005 C. Scott Ananian [EMAIL PROTECTED]
 */

/*
 * We assume that the file and the chunk information all fits in memory.
 * A slightly more-clever implementation would work even if the file
 * didn't fit.  Basically, we could scan it and keep the
 * 'N' lowest heap keys (chunk hashes), where 'N' is chosen to fit
 * comfortably in memory.  These would form the root and top
 * of the resulting treap, constructing it top-down.  Then we'd scan
 * again and only keep the next 'N' lowest heap keys, etc.
 *
 * But we're going to keep things simple.  We do try to maintain locality
 * where possible, so if you need to swap, things still shouldn't be too bad.
 */
#include <assert.h>
#include <stdlib.h>
#include "cache.h"
#include "chunk.h"
typedef unsigned long ch_size_t;
/* Our magic numbers: these can be tuned without breaking files already
 * in the archive, although space re-use is only expected between files which
 * have these constants set to the same values. */
/* The window size determines how much context we use when looking for a
 * chunk boundary.
 * C source has approx 5 bits per character of entropy.
 * We'd like to get 32 bits of good entropy into our boundary checksum;
 * that means 7 bytes is a rough minimum for the window size.
 * 30 bytes is what 'rsyncable zlib' uses; that should be fine. */
#define ROLLING_WINDOW 30
/* The ideal chunk size will fit most chunks into a disk block.  A typical
 * disk block size is 4k, and we expect (say) 50% compression. */
#define CHUNK_SIZE 7901 /* primes are nice to use */
/* Data structures: */
struct chunk {
	/* a chunk represents some range of the underlying file */
	ch_size_t start /* inclusive */, end /* exclusive */;
	unsigned char sha1[20]; /* sha1 for this chunk; used as the heap key */
};
struct chunklist {
	/* a dynamically-sized list of chunks */
	struct chunk *chunk; /* an array of chunks */
	ch_size_t num_items; /* how many items are currently in the list */
	ch_size_t allocd;    /* how many items we've allocated space for */
};
struct treap {
	/* A treap node represents a run of consecutive chunks. */
	/* the start and end of the run: */
	ch_size_t start /* inclusive */, end /* exclusive */;
	struct chunk *chunk; /* some chunk in the run. */
	/* treaps representing the run before 'chunk' (left) and
	 * after 'chunk' (right).  */
	struct treap *left, *right;
	/* sha1 for the run represented by this treap */
	unsigned char sha1[20];
};
static struct chunklist *
create_chunklist(int expected_items) {
	struct chunklist *cl = malloc(sizeof(*cl));
	cl->num_items = 0;
	cl->allocd = expected_items;
	cl->chunk = malloc(sizeof(cl->chunk[0]) * cl->allocd);
	return cl;
}
static void
free_chunklist(struct chunklist *cl) {
	free(cl->chunk);
	free(cl);
}
/* Add a chunk to the chunk list, calculating its SHA1 in the process. */
/* The chunk includes buf[start] to buf[end-1].*/
static void
add_chunk(struct chunklist *cl, char *buf, ch_size_t start, ch_size_t end) {
	struct chunk *ch;
	SHA_CTX c;
	assert(start < end); assert(cl); assert(buf);
	if (cl->num_items >= cl->allocd) {
		cl->allocd = cl->allocd*3/2 + 1; /* +1 so tiny lists still grow */
		cl->chunk = realloc(cl->chunk, cl->allocd * sizeof(*(cl->chunk)));
	}
	assert(cl->num_items < cl->allocd);
	ch = cl->chunk + (cl->num_items++);
	ch->start = start;
	ch->end = end;
	/* compute SHA-1 of the chunk. */
	SHA1_Init(&c);
	SHA1_Update(&c, buf+start, end-start);
	SHA1_Final(ch->sha1, &c);
	/* done! */
}
/* Split a buffer into chunks, using

Re: [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Petr Baudis wrote:
> I think one thing git's objects database is not very well suited for are
> network transports. You want to have something smart doing the
> transports, comparing trees so that it can do some delta compression;
> that could probably reduce the amount of data needed to be sent
> significantly.
I'm hoping my 'chunking' patches will fix this.  This ought to reduce the 
size of the object store by (in effect) doing delta compression; rsync
will then Do The Right Thing and only transfer the needed deltas.
Running some benchmarks right now to see how well it lives up to this 
promise...
 --scott

 ( http://cscott.net/ )


Re: chunking (Re: [ANNOUNCEMENT] /Arch/ embraces `git')

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Linus Torvalds wrote:
> What's the disk usage results? I'm on ext3, for example, which means that
> even small files invariably take up 4.125kB on disk (with the inode).
> Even uncompressed, most source files tend to be small. Compressed, I'm
> seeing the median blob size being ~1.6kB in my trivial checks. That's
> blobs only, btw.
I'm working on it.  The format was chosen so that blobs under 1 block long 
*stay* 1 block long; i.e. there's no 'chunk plus index file' overhead.
So the chunking should only kick in on multiple-block files.
I hacked 'convert-cache' to do the conversion, but it's running out of
memory on linux-2.6.git --- I found a few memory leaks in your code =),
but I certainly seem to be missing a big one still (maybe it's in my
code!).

When I get this working properly, my plan is to do a number of runs over 
the linux-2.6 archive trying out various combinations of chunking 
parameters.  I *will* be watching both 'real' disk usage (rounded up to 
block boundaries) and 'ideal' disk usage (on a reiserfs-type system).
The goal is to improve both, but if I can improve 'ideal' usage 
significantly with a minimal penalty in 'real' usage then I would argue 
it's still worth doing, since that will improve network times.

The handshaking penalties you mention are significant, but that's why 
rsync uses a pipelined approach.  The 'upstream' part of your full-duplex 
pipe is 'free' while you've got bits clogging your 'downstream' 
pipe.  The wonders of full-duplex...

Anyway, numbers talk, etc.  I'm working on them.
 --scott
 ( http://cscott.net/ )


Re: [PATCH] write-tree performance problems

2005-04-19 Thread C. Scott Ananian
On Tue, 19 Apr 2005, Linus Torvalds wrote:
> (*) Actually, I think it's the compression that ends up being the most
> expensive part.
You're also using the equivalent of '-9' -- and *that's slow*.
Changing to Z_DEFAULT_COMPRESSION would probably help a lot
(but would break all existing repositories, sigh).
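
A quick sketch of the knob in question (illustrative: at the time git passed
Z_BEST_COMPRESSION to zlib, and since objects are hashed *after* compression,
changing the level renames every existing object):

/* compare zlib effort levels on a stand-in blob */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
	unsigned char src[4096], out[8192];
	uLongf outlen;
	memset(src, 'x', sizeof(src)); /* stand-in for a blob */
	outlen = sizeof(out);
	compress2(out, &outlen, src, sizeof(src), Z_BEST_COMPRESSION); /* '-9' */
	printf("level 9: %lu -> %lu bytes\n",
	       (unsigned long)sizeof(src), (unsigned long)outlen);
	outlen = sizeof(out);
	compress2(out, &outlen, src, sizeof(src), 6); /* zlib's default level */
	printf("level 6: %lu -> %lu bytes\n",
	       (unsigned long)sizeof(src), (unsigned long)outlen);
	return 0;
}
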
 --scott
 ( http://cscott.net/ )


Re: SHA1 hash safety

2005-04-19 Thread C. Scott Ananian
On Tue, 19 Apr 2005, David Meybohm wrote:
> But doesn't this require assuming the distribution of MD5 is uniform,
> and don't the papers finding collisions in less show it's not? So, your
> birthday-argument for calculating the probability wouldn't apply, because
> it rests on the assumption MD5 is uniform, and it isn't.
No, the collision papers don't show this at all.
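
For scale: the birthday estimate puts the chance of *any* accidental
collision among n random 128-bit values at roughly n^2 / 2^129.  Even
n = 10^9 documents gives about 2^60 / 2^129 = 2^-69, i.e. ~10^-21.  The
known attacks construct special colliding *pairs*; they say nothing about
collisions among ordinary, non-adversarial inputs.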
 --scott
 ( http://cscott.net/ )


Re: SHA1 hash safety

2005-04-18 Thread C. Scott Ananian
On Mon, 18 Apr 2005, Andy Isaacson wrote:
> If you had actual evidence of a collision, I'd love to see it - even if
> it's just the equivalent of
> % md5 foo
> d3b07384d113edec49eaa6238ad5ff00 foo
> % md5 bar
> d3b07384d113edec49eaa6238ad5ff00 bar
> % cmp foo bar
> foo bar differ: byte 25, line 1
> %
> But in the absence of actual evidence, we have to assume (just based on
> the probabilities) that there was some error in your testing.
I've already had a long correspondence with this poster.  He claims that
this happened 7 years ago, involved a commercial contract covered by
Swiss Banking Law (with caps!), and that, of course, he certainly doesn't
retain [his] client's documents -- and even if he *did*, he wouldn't show
them to *me*.

And then he was unable to comprehend that I couldn't accept his word alone 
as prima facie evidence that the laws of probability did not apply to him or 
his clients.

I've been a coder far too long to attribute to The Mysterious Hand Of 
God what can adequately be described by subtle programmer error.

The most reasonable explanation, given the (lack of) evidence, is that 
the programmer involved quickly took refuge in a (wildly improbable, but 
his clients'll never know) MD5 collision instead of buckling down and 
finding the bug in his code.
 --scott

 ( http://cscott.net/ )


Re: another perspective on renames.

2005-04-15 Thread C. Scott Ananian
On Thu, 14 Apr 2005, Paul Jackson wrote:
> To me, rename is a special case of the more general case of a
> big chunk of code (a portion of a file) that was in one place
> either being moved or copied to another place.
> I wonder if there might be someway to use the tools that biologists use
> to analyze DNA sequences, to track the evolution of source code,
> identifying things like common chunks of code that differ in just a few
> mutations, and presenting the history of the evolution, at selectable
> levels of detail.
The rsync algorithm (http://samba.anu.edu.au/rsync/tech_report/node2.html) 
is probably a good place to start, although it is relatively sensitive to 
mutations.  It will be able to efficiently detect identical blocks larger 
than some block size N (512 bytes or so for rsync).  You might well 
consider smaller blocks to be irrelevant.  The data can be made 
considerably more useful to developers by canonicalizing before searching 
(ie, compressing whitespace to ' ', etc)[*].  Note that the identical 
regions do *not* have to line up on block boundaries; see the rsync 
algorithm for more detail.
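
For reference, the weak rolling checksum from that tech report looks like
this (a sketch: the block size and test data are illustrative):

/* rsync's weak checksum: s = (a mod 2^16) + 2^16 * (b mod 2^16) */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 512 /* rsync-ish block size */

int main(void) {
	unsigned char buf[2048];
	uint32_t a = 0, b = 0;
	size_t i, k;
	memset(buf, 'a', sizeof(buf));
	buf[700] = 'b'; /* a small "mutation" */
	/* checksum of the first window... */
	for (i = 0; i < BLOCK; i++) {
		a += buf[i];
		b += (uint32_t)(BLOCK - i) * buf[i];
	}
	printf("s(0) = %08x\n", (unsigned)((b << 16) | (a & 0xffff)));
	/* ...then each one-byte slide is O(1): subtract the byte leaving the
	 * window, add the byte entering it.  This is what lets rsync test a
	 * block match at *every* offset, so identical regions are found even
	 * when they don't line up on block boundaries. */
	for (k = 1; k + BLOCK <= sizeof(buf); k++) {
		a += buf[k + BLOCK - 1] - buf[k - 1];
		b += a - (uint32_t)BLOCK * buf[k - 1];
		if (k % 512 == 0)
			printf("s(%zu) = %08x\n", k,
			       (unsigned)((b << 16) | (a & 0xffff)));
	}
	return 0;
}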

I think Linus has made a persuasive case that the 'developer-friendly' 
features of an SCM (ie annotate, log, and friends) can be built *on top* 
of GIT.   This is a perfect example.  Since the computation is non-trivial 
(although linear in the number of lines of code involved in the history of 
a file; ie doesn't depend on the unrelated size of the archive), it might 
make sense for the front-end SCM to maintain its own caches --- for 
example, of the block and rolling checksums for each file required by the 
rsync algorithm.  The key point being that these are just *caches*, not 
essential history information, and can always be wiped and regenerated.

The nice 'feature' of this system (some may disagree, I guess) is that it 
does *not* depend on extensive programmer annotation of file changes (ie, 
chunk A in file B came from lines C-D of file D, or file E was once named 
F, etc).  By inferring history from content-similar files and blocks, it 
seems that it would be more able to generate useful results after 
importing third-party sources, which may come in distinct 'releases' but 
lack explicit history annotations.
  --scott

[*] In general, I will be *glad* to see source-management move away from
CVS' line-oriented style; there's no good reason we should still be worrying
about whitespace changes, etc.  When we build 'developer-friendly' tools 
we should make every effort to auto-detect source code, image formats, 
etc, and automatically perform appropriate canonicalization and 
beautification of diffs, because this can be/should be/is entirely 
separate from git's underlying storage representation.

 ( http://cscott.net/ )