> From: John A Meinel <[EMAIL PROTECTED]>
>
> But I have a question about blobs. They are stored compressed, and the sha checksum is for the *compressed* form. I understand this is probably for performance reasons. I'm concerned, though, that compression routines may not be 100% deterministic across all platforms.

That is an *excellent* concern and I implore you to research it further and report back. My understanding is still superficial in that detail: I gather that zip formats are standardized by an IETF document. I am not certain that the spec implies deterministic output. I am not certain that the way I'm driving `libz', so as to be compatible with Linus' code, is the right way to do it. Please, by all means, dig in and nail details. The goal here is to produce the high-quality-gem version of `git' rather than the rough-and-ready-works-for-me version.

It is desirable to checksum the compressed rather than uncompressed blobs so that intermediate nodes in a circuit can validate blobs without having to pay for expanding them.

> Certainly just changing the compression level will change the compressed output.

The actual implementation of `libz' is a train-wreck. It has lots of subtle bugs. I am using the `BEST_COMPRESSION' macro to select the compression method but I won't be surprised if you are right that this isn't the best choice. (I'm just copying Linus in that regard, for speed-of-impl and compatibility.) Rewriting or cleaning up libz would be another great task for someone. One big problem in the current `libz' is that many of the types used for various fields are chosen poorly (e.g., `unsigned long' where `size_t' is the right answer -- that kind of thing).

> Having the handle fixed at 160 bits also seems limiting. It ties the entire archive format into exactly one hash.

Yes it does. That's a longer discussion. Note that there are only a finite number of valid blob contents, too. The situation admits intense mathematical analysis --- in no small part because we pick a particular hash and address size.

BTW -- the handles are actually 192 bits. I've upward-compatibly generalized Linus' code to make clearer something that is muddled in his presentation: the blob size (zip form) is part of the handle (what I call an "address").

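A minimal sketch of the address scheme described above, not the actual code: it assumes the 192 bits are a 160-bit SHA1 of the compressed blob followed by the compressed size as a 32-bit big-endian integer. That layout, and the names `blob_address` and `validate`, are hypothetical; the message only says that the zipped size is part of the address. Python's `zlib.Z_BEST_COMPRESSION` stands in for the `BEST_COMPRESSION' macro.

```python
import hashlib
import zlib


def blob_address(content, level=zlib.Z_BEST_COMPRESSION):
    """Compress content and return (compressed_blob, 24-byte address).

    Assumed layout (not taken from the original spec):
    20 bytes of SHA1 over the *compressed* bytes, then the
    compressed length as a 4-byte big-endian integer.
    """
    compressed = zlib.compress(content, level)
    digest = hashlib.sha1(compressed).digest()   # 160 bits
    size = len(compressed).to_bytes(4, "big")    # 32 bits (assumed encoding)
    return compressed, digest + size


def validate(compressed, address):
    """Check a blob against its address without decompressing it."""
    digest, size = address[:20], address[20:]
    if len(compressed) != int.from_bytes(size, "big"):
        return False
    return hashlib.sha1(compressed).digest() == digest


content = b"hello, blob\n" * 100
blob, address = blob_address(content)
print(len(address) * 8)          # address width in bits
print(validate(blob, address))   # intermediate node check, no inflate needed
```

Because the hash covers the compressed bytes, two hosts only agree on an address if their compressors emit identical output for the same input and parameters, which is exactly the determinism question raised above. Calling `blob_address(content, 1)` instead of the default level is an easy way to see the address change.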
Also -- I have cleaned up Linus' design by making my spec robust against the possibility of a small number of successful SHA1 forgeries. My design *won't* withstand an attack that can turn any text into a semantically equivalent text with a desired SHA1 sum.

> I suppose as long as there is a version marker to allow new blob db versions, and the specific compression routine parameters are well defined. I just want to make sure that is done up front.

Separate concerns. Blobs themselves are one thing -- blob dbs another.

> Also, this doesn't seem to work really well as a revlib format, it probably makes a great archive format, but revlibs need to know the contents so they can diff against each other.

You'll see how it fits :-)

-t

_______________________________________________
Gnu-arch-users mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/gnu-arch-users

GNU arch home page:
http://savannah.gnu.org/projects/gnu-arch/
