Blob chunking code. [First look.]
So I wrote up my ideas regarding blob chunking as code; see attached. This is against git-0.4 (I know, ancient, but I had to start somewhere). The idea here is that blobs are chunked using a rolling checksum, so the chunk boundaries are content-dependent and stay fixed even if you mutate pieces of the file. The chunks are then tree-structured as 'treaps', which ensures that chunk trees can be profitably reused. (If you create a flat 'chunk index' instead of tree-structuring it, then you need to write two files even if you make a small change to a small file. If you use a full binary tree, then insertions at the beginning (say) still change the entire tree structure. The treap ensures that on average O(ln N) chunks need to be written per change, where N is the number of chunks in the file.) More details are in the code.

Compatibility with existing archives in git-0.4 was tricky, because of git's 'compress-before-hash' scheme. Moving to 'hash before compress' is *much* better, although because the file size is included in the hash, I will need to perform (the equivalent of) O(ln N) hashes of the complete file. If the file size weren't included, or if it were put at the end, then two hashes would suffice. (Basically, we can save work hashing subranges which are prefix-identical, but including the length means that no subtrees are prefix-identical.)

I'll work on bringing this forward to the latest git, but I thought I'd post it here for early reviews and comments. My informal testing shows that 1) my chunk size is currently too small, and 2) subrange sharing works well even on relatively small files. I'll be working on getting concrete numbers for larger archives.
 --scott

DNC NRA Kojarena ESCOBILLA QKENCHANT STANDEL shotgun ESGAIN KGB Mossad overthrow ASW cracking HOPEFUL KUBARK counter-intelligence Yakima
 ( http://cscott.net/ )
--- begin chunk.c --
#include <stdlib.h>

/* we could be clever and do this even if we don't fit in memory...
 * ... but we're going to be quick and dirty. */

/* C source has approx 5 bits per character of entropy.
 * We'd like to get 32 bits of good entropy; that means 7 bytes is a
 * reasonable minimum for the window size. */
#define ROLLING_WINDOW 30
#define CHUNK_SIZE 1023 /* desired block size */

#include <assert.h>
#include "cache.h"

/*
 * This file implements a treap-based chunked content store.  The
 * idea is that every stored file is broken down into tree-structured
 * chunks (that is, every chunk has an optional 'prefix' and 'suffix'
 * chunk), and these chunks are put in the object store.  This way
 * similar files will be expected to share chunks, saving space.
 * Files less than one disk block long are expected to fit in a single
 * chunk, so there is no extra indirection overhead for this case.
 */

/* First, some data structures: */
struct chunk { /* a chunk represents some range of the underlying file */
	size_t start /* inclusive */, end /* exclusive */;
	unsigned char sha1[20]; /* sha1 for this chunk; used as the heap key */
};
struct chunklist { /* a dynamically-sized list of chunks */
	struct chunk *chunk; /* an array of chunks */
	size_t num_items; /* how many items are currently in the list */
	size_t allocd; /* how many items we've allocated space for */
};
struct treap { /* A treap node represents a run of consecutive chunks. */
	struct chunk *chunk; /* some chunk in the run. */
	/* treaps representing the run before 'chunk' (left) and
	 * after 'chunk' (right). */
	struct treap *left, *right;
	/* sha1 for the run represented by this treap */
	unsigned char sha1[20];
};

static struct chunklist *
create_chunklist(int expected_items)
{
	struct chunklist *cl = malloc(sizeof(*cl));
	cl->num_items = 0;
	cl->allocd = expected_items;
	cl->chunk = malloc(sizeof(cl->chunk[0]) * cl->allocd);
	return cl;
}

static void
free_chunklist(struct chunklist *cl)
{
	free(cl->chunk);
	free(cl);
}

/* Add a chunk to the chunk list, calculating its SHA1 in the process. */
/* The chunk includes buf[start] to buf[end-1]. */
static void
add_chunk(struct chunklist *cl, char *buf, size_t start, size_t end)
{
	struct chunk *ch;
	SHA_CTX c;
	assert(start < end);
	assert(cl);
	assert(buf);
	if (cl->num_items >= cl->allocd) {
		cl->allocd = cl->allocd * 3 / 2;
		cl->chunk = realloc(cl->chunk, cl->allocd * sizeof(*(cl->chunk)));
	}
	assert(cl->num_items < cl->allocd);
	ch = cl->chunk + (cl->num_items++);
	ch->start = start;
	ch->end = end;
	// compute SHA-1
	SHA1_Init(&c);
	SHA1_Update(&c, buf + start, end - start);
	SHA1_Final(ch->sha1, &c);
	// done!
}

/* Split a buffer into chunks, using a rolling checksum over ROLLING_WINDOW
 * bytes to determine chunk boundaries. We try to split chunks into pieces
 * whose size averages out to be 'CHUNK_SIZE'. */
static void
chunkify(struct chunklist *cl, char *buf, size_t size)
{
	int i,
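The chunkify() routine is cut off in the archive, but the boundary rule it describes (a rolling checksum over the last ROLLING_WINDOW bytes, declaring a boundary when the checksum hits a magic residue, for an average chunk size of about CHUNK_SIZE) can be sketched as below. This is an illustration only, using a simple polynomial rolling hash; the actual checksum in the patch may differ.

```c
#include <stddef.h>

#define ROLLING_WINDOW 30
#define CHUNK_SIZE 1023

/* Record content-defined chunk boundaries in 'bounds'; returns how many
 * were found.  A boundary depends only on the ROLLING_WINDOW bytes that
 * precede it, never on absolute file position. */
static size_t chunkify(const unsigned char *buf, size_t size,
                       size_t *bounds, size_t max_bounds)
{
	const unsigned long MULT = 257;
	unsigned long pow = 1, hash = 0;
	size_t i, n = 0;

	for (i = 0; i < ROLLING_WINDOW; i++)
		pow *= MULT;	/* MULT^ROLLING_WINDOW (mod 2^wordsize) */

	for (i = 0; i < size; i++) {
		hash = hash * MULT + buf[i];	/* byte enters the window */
		if (i >= ROLLING_WINDOW) {
			/* byte leaves the window */
			hash -= pow * buf[i - ROLLING_WINDOW];
			/* fires with probability ~1/CHUNK_SIZE */
			if (hash % CHUNK_SIZE == CHUNK_SIZE - 1 && n < max_bounds)
				bounds[n++] = i + 1;	/* next chunk starts here */
		}
	}
	return n;
}
```

Because each boundary depends only on the previous ROLLING_WINDOW bytes, inserting or deleting data early in the file leaves all sufficiently-later boundaries attached to the same content, which is exactly the property that lets mutated files share chunks.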
Re: [PATCH] write-tree performance problems
On Wed, 20 Apr 2005, Chris Mason wrote:
> With the basic changes I described before, the 100 patch time only goes
> down to 40s.  Certainly not fast enough to justify the changes.  In this
> case, the bulk of the extra time comes from write-tree writing the index
> file, so I split write-tree.c up into libwrite-tree.c, and created
> update-cache --write-tree.

Hmm. Are our index files too large, or is there some other factor? I was considering using a chunked representation for *all* files (not just blobs), which would avoid the original 'trees must reference other trees or they become too large' issue -- and maybe the performance issue you're referring to, as well?
 --scott

Boston MI6 quiche LPMEDLEY BLUEBIRD PBSUCCESS jihad biowarfare non-violent protest Yakima NRA EZLN DES hack SARANAC KMPLEBE Echelon PBCABOOSE security
 ( http://cscott.net/ )
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] write-tree performance problems
On Wed, 20 Apr 2005, Linus Torvalds wrote:
> > I was considering using a chunked representation for *all* files (not
> > just blobs), which would avoid the original 'trees must reference other
> > trees or they become too large' issue -- and maybe the performance
> > issue you're referring to, as well?
>
> No.  The most common index file operation is reading, and that's the
> one that has to be _fast_.  And it is - it's a single mmap and some
> parsing.

OK, sure. But how 'bout chunking trees? Have you grown happy with the new trees-reference-other-trees paradigm, or is there a deep longing in your heart for the simplicity of 'trees-reference-blobs-period'? I'm fairly certain that chunking could get you the space savings you need without multi-level trees, if the simplicity of that is still appealing. Not necessarily for rev. 1 of the chunking code, but I'm curious as to whether it's still of interest at all. I don't know how far ingrained multilevel trees have become since they were adopted.
 --scott

Japan explosion BLUEBIRD Honduras jihad D5 SLBM Diplomat overthrow JMTIDE CABOUNCE AMTHUG ESODIC Kennedy AVBRANDY CLOWER mail drop PHOENIX
 ( http://cscott.net/ )
Re: [PATCH] Some documentation...
On Wed, 20 Apr 2005, David Greaves wrote:
> In doing this I noticed a couple of points:
> * update-cache won't accept ./file or fred/./file

The comment in update-cache.c reads:

	/*
	 * We fundamentally don't like some paths: we don't want
	 * dot or dot-dot anywhere, and in fact, we don't even want
	 * any other dot-files (.git or anything else). They
	 * are hidden, for chist sake.
	 *
	 * Also, we don't want double slashes or slashes at the
	 * end that can make pathnames ambiguous.
	 */

It could be argued that './' is a special case... but at the moment this is definitely a designed 'feature', not a 'bug'.
 --scott

BLUEBIRD SEQUIN SECANT Waihopai Honduras KUDOVE genetic KUJUMP SCRANTON DES AMLASH Indonesia SLINC cracking ESMERALDITE mustard Uzi KUSODA
 ( http://cscott.net/ )
Blob chunking code. [Second look]
Here's a quick rev of the chunking code. This is compatible with git-current, where the hashes are of the *uncompressed* file. The 'chunk' file gets dropped in at the same SHA1 filename as the 'blob' file, as it represents identical contents. Martin won't like this (because of how the hash is computed), but this is the short-term direction I want to pursue to validate the concept: it means I can run a simple converter over all the blob objects and don't have to rewrite tree and commit objects. If the approach is seen to have merit, then we can perhaps think about doing another bulk repository format conversion where all the hashes change. But (IMO) it's a little early to be thinking of this yet.
 --scott

nuclear RUCKUS KUPALM ODACID LA STANDEL Mossad LITEMPO atomic mail drop Hussein JUBILIST class struggle SSBN 731 Bush quiche Nazi MKULTRA
 ( http://cscott.net/ )
- chunk.c --
/*
 * This file implements a treap-based chunked content store.  The
 * idea is that every stored file is broken down into tree-structured
 * chunks (that is, every chunk has an optional 'prefix' and 'suffix'
 * chunk), and these chunks are put in the object store.  This way
 * similar files will be expected to share chunks, saving space.
 * Files less than one disk block long are expected to fit in a single
 * chunk, so there is no extra indirection overhead for this case.
 *
 * Copyright (C) 2005 C. Scott Ananian [EMAIL PROTECTED]
 */
/*
 * We assume that the file and the chunk information all fit in memory.
 * A slightly more-clever implementation would work even if the file
 * didn't fit.  Basically, we could scan it and keep the 'N' lowest
 * heap keys (chunk hashes), where 'N' is chosen to fit comfortably
 * in memory.  These would form the root and top of the resulting
 * treap, constructing it top-down.  Then we'd scan again and only
 * keep the next 'N' lowest heap keys, etc.
 *
 * But we're going to keep things simple.  We do try to maintain
 * locality where possible, so if you need to swap, things still
 * shouldn't be too bad.
 */
#include <assert.h>
#include <stdlib.h>
#include "cache.h"
#include "chunk.h"

typedef unsigned long ch_size_t;

/* Our magic numbers: these can be tuned without breaking files already
 * in the archive, although space re-use is only expected between files
 * which have these constants set to the same values. */

/* The window size determines how much context we use when looking for a
 * chunk boundary.
 * C source has approx 5 bits per character of entropy.
 * We'd like to get 32 bits of good entropy into our boundary checksum;
 * that means 7 bytes is a rough minimum for the window size.
 * 30 bytes is what 'rsyncable zlib' uses; that should be fine. */
#define ROLLING_WINDOW 30

/* The ideal chunk size will fit most chunks into a disk block.  A typical
 * disk block size is 4k, and we expect (say) 50% compression. */
#define CHUNK_SIZE 7901 /* primes are nice to use */

/* Data structures: */
struct chunk { /* a chunk represents some range of the underlying file */
	ch_size_t start /* inclusive */, end /* exclusive */;
	unsigned char sha1[20]; /* sha1 for this chunk; used as the heap key */
};
struct chunklist { /* a dynamically-sized list of chunks */
	struct chunk *chunk; /* an array of chunks */
	ch_size_t num_items; /* how many items are currently in the list */
	ch_size_t allocd; /* how many items we've allocated space for */
};
struct treap { /* A treap node represents a run of consecutive chunks. */
	/* the start and end of the run: */
	ch_size_t start /* inclusive */, end /* exclusive */;
	struct chunk *chunk; /* some chunk in the run. */
	/* treaps representing the run before 'chunk' (left) and
	 * after 'chunk' (right). */
	struct treap *left, *right;
	/* sha1 for the run represented by this treap */
	unsigned char sha1[20];
};

static struct chunklist *
create_chunklist(int expected_items)
{
	struct chunklist *cl = malloc(sizeof(*cl));
	cl->num_items = 0;
	cl->allocd = expected_items;
	cl->chunk = malloc(sizeof(cl->chunk[0]) * cl->allocd);
	return cl;
}

static void
free_chunklist(struct chunklist *cl)
{
	free(cl->chunk);
	free(cl);
}

/* Add a chunk to the chunk list, calculating its SHA1 in the process. */
/* The chunk includes buf[start] to buf[end-1]. */
static void
add_chunk(struct chunklist *cl, char *buf, ch_size_t start, ch_size_t end)
{
	struct chunk *ch;
	SHA_CTX c;
	assert(start < end);
	assert(cl);
	assert(buf);
	if (cl->num_items >= cl->allocd) {
		cl->allocd = cl->allocd * 3 / 2;
		cl->chunk = realloc(cl->chunk, cl->allocd * sizeof(*(cl->chunk)));
	}
	assert(cl->num_items < cl->allocd);
	ch = cl->chunk + (cl->num_items++);
	ch->start = start;
	ch->end = end;
	/* compute SHA-1 of the chunk. */
	SHA1_Init(&c);
	SHA1_Update(&c, buf + start, end - start);
	SHA1_Final(ch->sha1, &c);
	/* done! */
}

/* Split a buffer into chunks, using
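For readers unfamiliar with treaps: the structure above is a binary search tree on chunk position whose heap priority is the chunk's SHA-1, so the tree shape is a deterministic function of the chunk contents and its expected depth is O(log N). A minimal, hypothetical sketch of that idea (with the 20-byte SHA-1 priority reduced to a single unsigned long for brevity; not code from the patch):

```c
#include <stdlib.h>

/* Simplified treap node: BST key is the chunk's position in the file,
 * heap key ('prio') stands in for the chunk's SHA-1. */
struct tnode {
	size_t pos;
	unsigned long prio;
	struct tnode *left, *right;
};

/* Standard treap insert: BST insert on 'pos', then rotate the new node
 * up while its priority beats its parent's (min-heap on 'prio'). */
static struct tnode *insert(struct tnode *t, size_t pos, unsigned long prio)
{
	if (!t) {
		t = calloc(1, sizeof(*t));
		t->pos = pos;
		t->prio = prio;
		return t;
	}
	if (pos < t->pos) {
		t->left = insert(t->left, pos, prio);
		if (t->left->prio < t->prio) {	/* rotate right */
			struct tnode *l = t->left;
			t->left = l->right;
			l->right = t;
			t = l;
		}
	} else {
		t->right = insert(t->right, pos, prio);
		if (t->right->prio < t->prio) {	/* rotate left */
			struct tnode *r = t->right;
			t->right = r->left;
			r->left = t;
			t = r;
		}
	}
	return t;
}

/* Height of the treap (1 for a single node). */
static int depth(const struct tnode *t)
{
	int l, r;
	if (!t)
		return 0;
	l = depth(t->left);
	r = depth(t->right);
	return 1 + (l > r ? l : r);
}
```

Because the rotations depend only on the priorities, two files containing the same run of chunks rebuild the identical subtree for that run, which is what makes subtree sharing work regardless of where the run sits in each file.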
Re: [ANNOUNCEMENT] /Arch/ embraces `git'
On Wed, 20 Apr 2005, Petr Baudis wrote:
> I think one thing git's objects database is not very well suited for
> are network transports.  You want to have something smart doing the
> transports, comparing trees so that it can do some delta compression;
> that could probably reduce the amount of data needed to be sent
> significantly.

I'm hoping my 'chunking' patches will fix this. This ought to reduce the size of the object store by (in effect) doing delta compression; rsync will then Do The Right Thing and only transfer the needed deltas. Running some benchmarks right now to see how well it lives up to this promise...
 --scott

terrorist AEROPLANE munitions PAPERCLIP MI5 Morwenstow WSHOOFS CABOUNCE colonel Yakima AES MI6 nuclear NSA Cocaine Columbia plastique LICOZY
 ( http://cscott.net/ )
Re: chunking (Re: [ANNOUNCEMENT] /Arch/ embraces `git')
On Wed, 20 Apr 2005, Linus Torvalds wrote:
> What's the disk usage results?  I'm on ext3, for example, which means
> that even small files invariably take up 4.125kB on disk (with the
> inode).  Even uncompressed, most source files tend to be small.
> Compressed, I'm seeing the median blob size being ~1.6kB in my
> trivial checks.  That's blobs only, btw.

I'm working on it. The format was chosen so that blobs under one block long *stay* one block long; i.e. there's no 'chunk plus index file' overhead. So the chunking should only kick in on multiple-block files. I hacked 'convert-cache' to do the conversion, and it's running out of memory on linux-2.6.git, however --- I found a few memory leaks in your code =) but I certainly seem to be missing a big one still (maybe it's in my code!).

When I get this working properly, my plan is to do a number of runs over the linux-2.6 archive, trying out various combinations of chunking parameters. I *will* be watching both 'real' disk usage (bunged up to block boundaries) and 'ideal' disk usage (on a reiserfs-type system). The goal is to improve both, but if I can improve 'ideal' usage significantly with a minimal penalty in 'real' usage, then I would argue it's still worth doing, since that will improve network transfer times.

The handshaking penalties you mention are significant, but that's why rsync uses a pipelined approach. The 'upstream' part of your full-duplex pipe is 'free' while you've got bits clogging your 'downstream' pipe. The wonders of full-duplex...

Anyway, numbers talk, etc. I'm working on them.
 --scott

LIONIZER LCPANES shortwave MKSEARCH ESGAIN Saddam Hussein Rijndael WASHTUB Morwenstow ZPSEMANTIC SKIMMER cryptographic FJHOPEFUL assassination
 ( http://cscott.net/ )
Re: [PATCH] write-tree performance problems
On Tue, 19 Apr 2005, Linus Torvalds wrote:
> (*) Actually, I think it's the compression that ends up being the most
> expensive part.

You're also using the equivalent of '-9' -- and *that's slow*. Changing to Z_DEFAULT_COMPRESSION would probably help a lot (but would break all existing repositories, sigh).
 --scott

DES WTO Indonesia NRA LCPANGS supercomputer plastique class struggle AEFOX Pakistan ODEARL Secretary KUGOWN Cheney ODIBEX SDI AP JMMADD
 ( http://cscott.net/ )
Re: SHA1 hash safety
On Tue, 19 Apr 2005, David Meybohm wrote:
> But doesn't this require assuming the distribution of MD5 is uniform,
> and don't the papers finding collisions in less show it's not?  So,
> your birthday-argument for calculating the probability wouldn't apply,
> because it rests on the assumption MD5 is uniform, and it isn't.

No, the collision papers don't show this at all. The known attacks deliberately *construct* colliding pairs; they say nothing about the distribution of MD5 over ordinary, non-adversarial inputs.
 --scott

atomic strategic HBDRILL SARANAC COBRA JUDY Ft. Meade assassination politics Mossad HOPEFUL ZPSEMANTIC DTFROGS HTKEEPER LITEMPO LIONIZER operation
 ( http://cscott.net/ )
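For reference, the standard birthday approximation under discussion: for an ideal 128-bit hash over n distinct inputs,

```latex
P(\text{collision}) \;\approx\; 1 - \exp\!\left(-\frac{n(n-1)}{2 \cdot 2^{128}}\right)
```

so even at n = 10^9 documents the chance of an accidental MD5 collision is on the order of 10^-21. The published attacks lower the cost of *constructing* a collision, which is a different claim from non-uniformity on benign inputs.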
Re: SHA1 hash safety
On Mon, 18 Apr 2005, Andy Isaacson wrote:
> If you had actual evidence of a collision, I'd love to see it - even if
> it's just the equivalent of
>
> % md5 foo
> d3b07384d113edec49eaa6238ad5ff00  foo
> % md5 bar
> d3b07384d113edec49eaa6238ad5ff00  bar
> % cmp foo bar
> foo bar differ: byte 25, line 1
> %
>
> But in the absence of actual evidence, we have to assume (just based on
> the probabilities) that there was some error in your testing.

I've already had a long correspondence with this poster. He claims that this happened 7 years ago, involved a commercial contract covered by Swiss Banking Law (with caps!), and that, of course, he certainly doesn't retain [his] client's documents -- and even if he *did*, he wouldn't show them to *me*. And then he was unable to comprehend that I couldn't accept his word alone as prima facie evidence that the laws of probability did not apply to him or his clients.

I've been a coder far too long to attribute to The Mysterious Hand Of God what can adequately be described by subtle programmer error. The most reasonable explanation, given the (lack of) evidence, is that the programmer involved quickly took refuge in a (wildly improbable, but his clients'll never know) MD5 collision instead of buckling down and finding the bug in his code.
 --scott

ODOATH Ortega FBI SGUAT AEBARMAN India Peking ODACID operation RYBAT [Hello to all my fans in domestic surveillance] for Dummies KUCLUB
 ( http://cscott.net/ )
Re: another perspective on renames.
On Thu, 14 Apr 2005, Paul Jackson wrote:
> To me, rename is a special case of the more general case of a big chunk
> of code (a portion of a file) that was in one place either being moved
> or copied to another place.
>
> I wonder if there might be some way to use the tools that biologists
> use to analyze DNA sequences, to track the evolution of source code,
> identifying things like common chunks of code that differ in just a
> few mutations, and presenting the history of the evolution, at
> selectable levels of detail.

The rsync algorithm (http://samba.anu.edu.au/rsync/tech_report/node2.html) is probably a good place to start, although it is relatively sensitive to mutations. It can efficiently detect identical blocks larger than some block size N (512 bytes or so for rsync); you might well consider smaller blocks to be irrelevant. The data can be made considerably more useful to developers by canonicalizing before searching (i.e., compressing whitespace to ' ', etc.) [*]. Note that the identical regions do *not* have to line up on block boundaries; see the rsync algorithm for more detail.

I think Linus has made a persuasive case that the 'developer-friendly' features of an SCM (i.e., annotate, log, and friends) can be built *on top* of GIT. This is a perfect example. Since the computation is non-trivial (although linear in the number of lines of code involved in the history of a file; i.e., it doesn't depend on the unrelated size of the archive), it might make sense for the front-end SCM to maintain its own caches --- for example, of the block and rolling checksums for each file required by the rsync algorithm. The key point is that these are just *caches*, not essential history information, and can always be wiped and regenerated.

The nice 'feature' of this system (some may disagree, I guess) is that it does *not* depend on extensive programmer annotation of file changes (i.e., chunk A in file B came from lines C-D of file D, or file E was once named F, etc.). By inferring history from content-similar files and blocks, it seems that it would be better able to generate useful results after importing third-party sources, which may come in distinct 'releases' but lack explicit history annotations.
 --scott

[*] In general, I will be *glad* to see source management move away from CVS's line-oriented style; there's no good reason we should still be worrying about whitespace changes, etc. When we build 'developer-friendly' tools, we should make every effort to auto-detect source code, image formats, etc., and automatically perform appropriate canonicalization and beautification of diffs, because this can be/should be/is entirely separate from git's underlying storage representation.

Mk 48 PANCHO ZPSECANT MKDELTA SCRANTON D5 SLBM JMTRAX Delta Force MI6 SGUAT Khaddafi SMOTH interception mail drop SECANT PBSUCCESS Cocaine
 ( http://cscott.net/ )
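The weak rolling checksum from the rsync technical report cited above can be sketched as follows (BLOCK stands in for rsync's block-size parameter; the report's modulus M = 2^16 is kept; this is an illustration, not rsync's actual source):

```c
#include <stddef.h>

#define BLOCK 512

struct rollsum {
	unsigned a, b;	/* the report's a(k,l) and b(k,l), mod 2^16 */
};

/* Checksum of buf[k .. k+BLOCK-1], computed from scratch:
 * a = sum of the bytes, b = position-weighted sum. */
static struct rollsum weak_sum(const unsigned char *buf, size_t k)
{
	struct rollsum s = { 0, 0 };
	size_t i;

	for (i = 0; i < BLOCK; i++) {
		s.a = (s.a + buf[k + i]) & 0xffff;
		s.b = (s.b + (unsigned)(BLOCK - i) * buf[k + i]) & 0xffff;
	}
	return s;
}

/* Slide the window one byte right: drop buf[k], take in buf[k+BLOCK].
 * Unsigned wraparound plus the mask keeps everything correct mod 2^16. */
static struct rollsum roll(struct rollsum s, const unsigned char *buf, size_t k)
{
	s.a = (s.a - buf[k] + buf[k + BLOCK]) & 0xffff;
	s.b = (s.b - BLOCK * buf[k] + s.a) & 0xffff;
	return s;
}
```

weak_sum() computes the checksum from scratch; roll() slides the window one byte for the price of a few additions, which is what makes it affordable to scan every byte offset of a file for block matches before falling back to the strong checksum.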