Re: [PATCH] pull: do not segfault when HEAD refers to missing object file
On Mon, Mar 06, 2017 at 07:52:10AM +0100, Johannes Sixt wrote:

> > Yeah, it is. You can do it easily with 'sed', of course, but if you want
> > to avoid the extra process and do it in pure shell, it's more like:
> >
> >	last38=${REV#??}
> >	first2=${REV%$last38}
> >	rm -f .git/objects/$first2/$last38
>
> Is it "HEAD points to non-existent object" or ".git/HEAD contains junk"? In
> both cases there are simpler solutions than to remove an object. For
> example, `echo "$_x40" >.git/HEAD` or `echo "this is junk" >.git/HEAD`?

Good point. Unfortunately, it's a little tricky. You can't use "this is
junk" because then git does not recognize it as a git repo at all. You
can't use $_x40 because the null sha1 is used internally to signal an
unborn branch, so we mix it up with that case.

Something like "111..." for 40 characters would work, though at that
point I think just simulating an actual corruption might be less
convoluted.

-Peff
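The pure-shell splitting trick quoted above can be exercised outside of any
repository; here is a minimal POSIX-shell sketch (the 40-hex value is made
up for illustration):

```shell
# Split a 40-hex object name into the 2-char directory and 38-char
# basename used under .git/objects, using only POSIX parameter expansion
# (no sed, no extra processes).
REV=0123456789abcdef0123456789abcdef01234567

last38=${REV#??}        # strip the shortest prefix of two characters
first2=${REV%$last38}   # strip the 38-char suffix, leaving the first two
echo "$first2/$last38"
```

Unlike `${REV:0:2}`, these `#`/`%` expansions are specified by POSIX and
work under dash as well as bash.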
Re: [PATCH] pull: do not segfault when HEAD refers to missing object file
On Mon, Mar 06, 2017 at 12:42:22AM +0100, André Laszlo wrote:

> git pull --rebase on a corrupted HEAD used to segfault; it has been
> corrected to error out with a message. A test has also been added to
> verify this fix.

Your commit message mostly tells us the "what" that we can see from the
diff. But why are we segfaulting? What assumption is being violated, and
why is the fix the right thing? You've stuck some of the details in your
notes, and they should probably make it into the commit message.

> When add_head_to_pending fails to add a pending object, git pull
> --rebase segfaults. This can happen if HEAD is referring to a corrupt
> or missing object.

The other obvious time when HEAD is not valid is when you are on an
unborn branch. So we should also consider how such a case interacts with
the callsites you are touching.

I think it is primarily this hunk:

> --- a/wt-status.c
> +++ b/wt-status.c
> @@ -2252,6 +2252,11 @@ int has_uncommitted_changes(int ignore_submodules)
>  	DIFF_OPT_SET(&rev_info.diffopt, IGNORE_SUBMODULES);
>  	DIFF_OPT_SET(&rev_info.diffopt, QUICK);
>  	add_head_to_pending(&rev_info);
> +
> +	/* The add_head_to_pending call might not have added anything. */
> +	if (!rev_info.pending.nr)
> +		die("bad object %s", "HEAD");
> +
>  	diff_setup_done(&rev_info.diffopt);
>  	result = run_diff_index(&rev_info, 1);
>  	return diff_result_code(&rev_info.diffopt, result);

that is the fix. We assume that add_head_to_pending() put something into
rev_info.pending, but it might not.

In the case of corruption, "bad object HEAD" is a reasonable outcome. In
the case of an unborn branch, we'd probably want to compare against the
empty tree. This trick is used elsewhere in wt-status.c, as in
wt_status_collect_changes_index(). I'm not sure if we need to deal with
that here or not.

I wasn't able to trigger this code with an unborn branch. If you have no
index, then the is_cache_unborn() check triggers earlier in the
function. If you do have an index, then we have an earlier check:

	if (is_null_sha1(orig_head) && !is_cache_unborn())
		die(_("Updating an unborn branch with changes added to the index."));

that covers this case.

I am not 100% sure that check is correct, though. If I do:

	git init
	echo content >foo
	git add foo
	git rm -f foo

then I have an index which is not "unborn", but also does not contain
any changes. I think the conditional above should just be checking
"active_nr". If we were to fix that, then I think
has_uncommitted_changes() would need to be adjusted as well.

I admit that this is probably an absurd corner case. So I think I am OK
with your fix for now, and we can revisit the logic later if anybody
starts to care.

> I discovered this segfault on my machine after pulling a repo from
> GitHub, but I have been unable to reproduce the sequence of events
> that lead to the corrupted HEAD (I think it may have been caused by a
> lost network connection in my case).

Yikes. That should never lead to a corrupted HEAD (unless you are on a
networked filesystem). I can't think of another reason
add_head_to_pending would fail, though, aside from a missing or
corrupted object.

Arguably add_head_to_pending() should die loudly if it can't parse the
object pointed to by HEAD, rather than quietly returning without adding
anything.

> diff --git a/diff-lib.c b/diff-lib.c
> index 52447466b..9d26b18c3 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -512,7 +512,7 @@ int run_diff_index(struct rev_info *revs, int cached)
>  	struct object_array_entry *ent;
>  
>  	ent = revs->pending.objects;
> -	if (diff_cache(revs, ent->item->oid.hash, ent->name, cached))
> +	if (!ent || diff_cache(revs, ent->item->oid.hash, ent->name, cached))
>  		exit(128);

So if I understand correctly, this hunk should not trigger anymore,
because we would never call run_diff_index() without something in
pending. I think it would be a programming error elsewhere to do so, and
we should treat this as an assertion by complaining loudly (for this and
any other confusing cases):

	if (revs->pending.nr != 1)
		die("BUG: run_diff_index requires exactly one rev (got %d)",
		    revs->pending.nr);

Lastly, I think this is your first patch. So welcome, and thank you. :)
I know there was a lot of discussion and critique above, but I think
overall your patch is going in the right direction.

-Peff
Re: [PATCH] pull: do not segfault when HEAD refers to missing object file
Am 06.03.2017 um 04:51 schrieb Jeff King:
> On Sun, Mar 05, 2017 at 11:52:22PM +0000, brian m. carlson wrote:
>> On Mon, Mar 06, 2017 at 12:42:22AM +0100, André Laszlo wrote:
>>> +test_expect_success 'git pull --rebase with corrupt HEAD does not segfault' '
>>> +	mkdir corrupted &&
>>> +	(cd corrupted &&

We usually indent this like so:

	(
		cd corrupted &&
		echo one >file &&
		git add file &&
		...
	) &&

>>> +	git init &&
>>> +	echo one >file && git add file &&
>>> +	git commit -m one &&
>>> +	REV=$(git rev-parse HEAD) &&
>>> +	rm -f .git/objects/${REV:0:2}/${REV:2} &&
>>
>> I think this is a bashism. On dash, I get the following:
>>
>> genre ok % dash -c 'foo=abcdefg; echo ${foo:0:2}; echo ${foo:2}'
>> dash: 1: Bad substitution
>
> Yeah, it is. You can do it easily with 'sed', of course, but if you want
> to avoid the extra process and do it in pure shell, it's more like:
>
>	last38=${REV#??}
>	first2=${REV%$last38}
>	rm -f .git/objects/$first2/$last38

Is it "HEAD points to non-existent object" or ".git/HEAD contains junk"?
In both cases there are simpler solutions than to remove an object. For
example, `echo "$_x40" >.git/HEAD` or `echo "this is junk" >.git/HEAD`?

-- Hannes
[PATCH v2] git svn: fix authentication with 'branch'
Authentication fails with 'git svn branch', while 'git svn rebase' and
'git svn dcommit' work fine without authentication failures.

    $ git svn branch v7_3
    Copying https://xxx at r27519 to https:///v7_3...
    Can't create session: Unable to connect to a repository at URL
    'https://': No more credentials or we tried too many times.
    Authentication failed at C:\Program Files\Git\mingw64/libexec/git-core\git-svn line 1200.

We add auth configuration to SVN::Client->new() to fix the issue.

Signed-off-by: Hiroshi Shirosaki
---
 git-svn.perl | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/git-svn.perl b/git-svn.perl
index fa42364..d240418 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -1175,10 +1175,10 @@ sub cmd_branch {
 	::_req_svn();
 	require SVN::Client;
 
+	my ($config, $baton, undef) = Git::SVN::Ra::prepare_config_once();
 	my $ctx = SVN::Client->new(
-		config => SVN::Core::config_get_config(
-			$Git::SVN::Ra::config_dir
-		),
+		auth => $baton,
+		config => $config,
 		log_msg => sub {
 			${ $_[0] } = defined $_message
 				? $_message
-- 
2.7.4
Re: Git download
On 03/05/2017 09:26 PM, Cory Kilpatrick wrote:
> I have downloaded Git and cannot find the application on my Mac. Should
> I try to download it again?

I don't think so. It could be helpful if we can get some more
information: could you open the terminal application, type

	which git
	git --version

and post the results here?

It may be worth mentioning that Git is a command-line tool, so you may
not see anything in the "Applications" folder.
Re: [PATCH] pull: do not segfault when HEAD refers to missing object file
On Sun, Mar 05, 2017 at 11:52:22PM +0000, brian m. carlson wrote:

> On Mon, Mar 06, 2017 at 12:42:22AM +0100, André Laszlo wrote:
> > +test_expect_success 'git pull --rebase with corrupt HEAD does not
> > segfault' '
> > +	mkdir corrupted &&
> > +	(cd corrupted &&
> > +	git init &&
> > +	echo one >file && git add file &&
> > +	git commit -m one &&
> > +	REV=$(git rev-parse HEAD) &&
> > +	rm -f .git/objects/${REV:0:2}/${REV:2} &&
>
> I think this is a bashism. On dash, I get the following:
>
> genre ok % dash -c 'foo=abcdefg; echo ${foo:0:2}; echo ${foo:2}'
> dash: 1: Bad substitution

Yeah, it is. You can do it easily with 'sed', of course, but if you want
to avoid the extra process and do it in pure shell, it's more like:

	last38=${REV#??}
	first2=${REV%$last38}
	rm -f .git/objects/$first2/$last38

-Peff
Re: Delta compression not so effective
On Sat, Mar 4, 2017 at 12:27 AM, Marius Storm-Olsen wrote:
>
> I reran the repack with the options above (dropping the zlib=9, as you
> suggested)
>
>     $ time git -c pack.threads=4 repack -a -d -F \
>         --window=350 --depth=250 --window-memory=30g
>
> and ended up with
>
>     $ du -sh .
>     205G
>
> In other words, going from 6G to 30G window didn't help a lick on finding
> deltas for those binaries.

Ok.

> I did
>	fprintf(stderr, "%s %u %lu\n",
>		sha1_to_hex(delta_list[i]->idx.sha1),
>		delta_list[i]->hash,
>		delta_list[i]->size);
>
> I assume that's correct?

Looks good.

> I've removed all commit messages, and "sanitized" some filepaths etc, so
> name hashes won't match what's reported, but that should be fine. (the
> object_entry->hash seems to be just a trivial uint32 hash for sorting
> anyways)

Yes. I see your name list and your pack-file index.

> BUT, if I look at the last 3 entries of the sorted git verify-pack output,
> and look for them in the 'git log --oneline --raw -R --abbrev=40' output, I
> get:
...
> while I cannot find ANY of them in the delta_list output??

Yes. You have a lot of object names in that log file you sent in private
that aren't in the delta list.

Now, objects smaller than 50 bytes we don't ever try to even delta. I
can't see the object sizes when they don't show up in the delta list,
but looking at some of those filenames I'd expect them to not fall in
that category.

I guess you could do the printout a bit earlier (on the
"to_pack.objects[]" array - to_pack.nr_objects is the count there). That
should show all of them. But the small objects shouldn't matter.

But if you have a file like

    extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib

I would have assumed that it has a size that is > 50. Unless those
"extern" things are placeholders?

> You might get an idea for how to easily create a repo which reproduces the
> issue, and which would highlight it more easily for the ML.

Looking at your sorted object list ready for packing, it doesn't look
horrible. When sorting for size, it still shows a lot of those large
files with the same name hash, so they sorted together in that form too.

I do wonder if your dll data just simply is absolutely horrible for
xdelta. We've also limited the delta finding a bit, simply because it
had some O(m*n) behavior that gets very expensive on some patterns.
Maybe your blobs trigger some of those cases. The diff-delta work all
goes back to 2005 and 2006, so it's a long time ago.

What I'd ask you to do is try to find out if you could make a repository
of just one of the bigger DLLs with its history, particularly if you can
find some that you don't think is _that_ sensitive.

Looking at it, for example, I see that you have that file

    extern/redhat-5/FlammableV3/x64/plugins/libFlameCUDA-3.0.703.so

that seems to have changed several times, and is a largish blob. Could
you try creating a repository with git fast-import that *only* contains
that file (or pick another one), and see if that deltas well?

And if you find some case that doesn't xdelta well, and that you feel
you could make available outside, we could have a test-case...

Linus
Re: RFC: Another proposed hash function transition plan
On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:

> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder wrote:
> >
> > This document is still in flux but I thought it best to send it out
> > early to start getting feedback.
>
> This actually looks very reasonable if you can implement it cleanly
> enough. In many ways the "convert entirely to a new 256-bit hash" is
> the cleanest model, and interoperability was at least my personal
> concern. Maybe your model solves it (devil in the details), in which
> case I really like it. If you think you can do it, I'm all for it.
>
> Btw, I do think the particular choice of hash should still be on the
> table. sha-256 may be the obvious first choice, but there are
> definitely a few reasons to consider alternatives, especially if it's
> a complete switch-over like this.
>
> One is large-file behavior - a parallel (or tree) mode could improve
> on that noticeably. BLAKE2 does have special support for that, for
> example. And SHA-256 does have known attacks compared to SHA-3-256 or
> BLAKE2 - whether that is due to age or due to more effort, I can't
> really judge. But if we're switching away from SHA1 due to known
> attacks, it does feel like we should be careful.

I agree with Linus on this. SHA-256 is the slowest option, and it's the
one with the most advanced cryptanalysis. SHA-3-256 is faster on 64-bit
machines (which, as we've seen on the list, is the overwhelming majority
of machines using Git), and even BLAKE2b-256 is stronger.

Doing this all over again in another couple of years should also be a
non-goal.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204
Re: [PATCH] pull: do not segfault when HEAD refers to missing object file
On Mon, Mar 06, 2017 at 12:42:22AM +0100, André Laszlo wrote:

> +test_expect_success 'git pull --rebase with corrupt HEAD does not segfault' '
> +	mkdir corrupted &&
> +	(cd corrupted &&
> +	git init &&
> +	echo one >file && git add file &&
> +	git commit -m one &&
> +	REV=$(git rev-parse HEAD) &&
> +	rm -f .git/objects/${REV:0:2}/${REV:2} &&

I think this is a bashism. On dash, I get the following:

genre ok % dash -c 'foo=abcdefg; echo ${foo:0:2}; echo ${foo:2}'
dash: 1: Bad substitution

> +	test_expect_code 128 git pull --rebase > /dev/null
> +	)
> +'
> +
>  test_done
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204
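For readers unfamiliar with the portability issue above: `${var:offset:length}`
is a bash extension that dash rejects, while `cut` (at the cost of extra
processes) works in any POSIX shell. A small sketch, using a made-up 40-hex
value:

```shell
# bash-only forms (dash prints "Bad substitution" for these):
#   first2=${REV:0:2}
#   last38=${REV:2}
# A POSIX-portable equivalent using cut:
REV=0123456789abcdef0123456789abcdef01234567
first2=$(printf '%s\n' "$REV" | cut -c1-2)
last38=$(printf '%s\n' "$REV" | cut -c3-)
echo "$first2 $last38"
```

Peff's pure-shell `${REV#??}` / `${REV%$last38}` suggestion elsewhere in the
thread avoids even these extra processes.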
Re: Transition plan for git to move to a new hash function
On Sun, Mar 05, 2017 at 01:45:46PM +0000, Ian Jackson wrote:

> brian m. carlson writes ("Re: Transition plan for git to move to a new
> hash function"):
> > Instead, I was referring to areas like the notes code. It has extensive
> > use of the last byte as a type of lookup table key. It's very dependent
> > on having exactly one hash, since it will always want to use the last
> > byte.
>
> You mean note_tree_search? (My tree here may be a bit out of date.)
> This doesn't seem difficult to fix. The nontrivial changes would be
> mostly confined to SUBTREE_SHA1_PREFIXCMP and GET_NIBBLE.
>
> It's true that like most of git there's a lot of hardcoded `sha1'.

I'm talking about the entire notes.c file. There are several different
uses of "19" in there, and they compose at least two separate concepts.
My object-id-part9 series tries to split those out into logical
constants.

This code is not going to handle repositories with different-length
objects well, which I believe was your initial proposal. I originally
thought that mixed-hash repositories would be viable as well, but I no
longer do.

> Are you arguing in favour of "replace git with git2 by simply
> s/20/64/g; s/sha1/blake/g"? This seems to me to be a poor idea.
> Takeup of the new `git2' would be very slow because of the pain
> involved.

I'm arguing that the same binary ought to be able to handle both SHA-1
and the new hash. I'm also arguing that a given object have exactly one
hash and that we not mix hashes in the same object.

A repository will be composed of one type of object, and if that's the
new hash, a lookup table will be used to translate SHA-1. We can
synthesize the old objects, should we need them. That allows people to
use the SHA-1 hashes (in my view, with a prefix, such as "sha1:") in
repositories using the new hash. It also allows verifying old tags and
commits if need be.

What I *would* like to see is an extension to the tag and commit objects
which names the hash that was used to make them. That makes it easy to
determine which object the signature should be verified over, as it will
verify over only one of them.

> [1] I've heard suggestions here that instead we should expect users to
> "git1 fast-export", which you would presumably feed into "git2
> fast-import". But what is `git1' here? Is it the current git
> codebase frozen in time? I don't think it can be. With this
> conversion strategy, we will need to maintain git1 for decades. It
> will need portability fixes, security fixes, fixes for new hostile
> compiler optimisations, and so on. The difficulty of conversion means
> there will be pressure to backport new features from `git2' to `git1'.
> (Also this approach means that all signatures are definitively lost
> during the conversion process.)

I'm proposing we have a git hash-convert (the name doesn't matter that
much) that converts in place. It rebuilds the objects and builds a
lookup table. Since the contents of git objects are deterministic, this
makes it possible for each individual user to make the transition in
place.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204
[PATCH] pull: do not segfault when HEAD refers to missing object file
git pull --rebase on a corrupted HEAD used to segfault; it has been
corrected to error out with a message. A test has also been added to
verify this fix.

Signed-off-by: André Laszlo
---

Notes:
    When add_head_to_pending fails to add a pending object, git pull
    --rebase segfaults. This can happen if HEAD is referring to a corrupt
    or missing object.

    I discovered this segfault on my machine after pulling a repo from
    GitHub, but I have been unable to reproduce the sequence of events
    that lead to the corrupted HEAD (I think it may have been caused by a
    lost network connection in my case).

    The following commands use add_head_to_pending:

        format-patch    setup_revisions before add_head_to_pending
        diff            checks rev.pending.nr
        shortlog        checks rev.pending.nr
        log             uses resolve_ref_unsafe

    All of the above return an error code of 128 and print "fatal: bad
    object HEAD" instead of segfaulting, which I think is correct
    behavior. The check and error message have been added to
    has_uncommitted_changes, where they were missing, as well as to
    diff-lib.c (without the error message).

 diff-lib.c      |  2 +-
 t/t5520-pull.sh | 12 ++++++++++++
 wt-status.c     |  5 +++++
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/diff-lib.c b/diff-lib.c
index 52447466b..9d26b18c3 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -512,7 +512,7 @@ int run_diff_index(struct rev_info *revs, int cached)
 	struct object_array_entry *ent;
 
 	ent = revs->pending.objects;
-	if (diff_cache(revs, ent->item->oid.hash, ent->name, cached))
+	if (!ent || diff_cache(revs, ent->item->oid.hash, ent->name, cached))
 		exit(128);
 
 	diff_set_mnemonic_prefix(&revs->diffopt, "c/", cached ? "i/" : "w/");
diff --git a/t/t5520-pull.sh b/t/t5520-pull.sh
index 17f4d0fe4..1edb6a97a 100755
--- a/t/t5520-pull.sh
+++ b/t/t5520-pull.sh
@@ -664,4 +664,16 @@ test_expect_success 'git pull --rebase against local branch' '
 	test file = "$(cat file2)"
 '
 
+test_expect_success 'git pull --rebase with corrupt HEAD does not segfault' '
+	mkdir corrupted &&
+	(cd corrupted &&
+	git init &&
+	echo one >file && git add file &&
+	git commit -m one &&
+	REV=$(git rev-parse HEAD) &&
+	rm -f .git/objects/${REV:0:2}/${REV:2} &&
+	test_expect_code 128 git pull --rebase > /dev/null
+	)
+'
+
 test_done
diff --git a/wt-status.c b/wt-status.c
index d47012048..3d60eaff5 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -2252,6 +2252,11 @@ int has_uncommitted_changes(int ignore_submodules)
 	DIFF_OPT_SET(&rev_info.diffopt, IGNORE_SUBMODULES);
 	DIFF_OPT_SET(&rev_info.diffopt, QUICK);
 	add_head_to_pending(&rev_info);
+
+	/* The add_head_to_pending call might not have added anything. */
+	if (!rev_info.pending.nr)
+		die("bad object %s", "HEAD");
+
 	diff_setup_done(&rev_info.diffopt);
 	result = run_diff_index(&rev_info, 1);
 	return diff_result_code(&rev_info.diffopt, result);
-- 
2.12.0
Re: [PATCH v1] Travis: also test on 32-bit Linux
On 05/03/17 17:38, Lars Schneider wrote:
>> On 02 Mar 2017, at 16:17, Ramsay Jones wrote:
>> On 02/03/17 11:24, Johannes Schindelin wrote:
>>> On Thu, 2 Mar 2017, Lars Schneider wrote:
>>>
>> [snip]
>>>> One thing that still bugs me: In the Linux32 environment prove adds
>>>> the CPU times to every test run: ( 0.02 usr 0.00 sys + 0.00 cusr
>>>> 0.00 csys ... Has anyone an idea why that happens and how we can
>>>> disable it?
>>>
>>> I have no idea.
>>
>> I have no idea either, but it is not unique to this 32bit Linux, but
>> rather the version of prove. For example, I am seeing this on Linux
>> Mint 18.1 (64bit _and_ 32bit), whereas Linux Mint 17.x did not do
>> this. (They used different Ubuntu LTS releases).
>>
>> [Mint 18.1 'prove --version' says: TAP::Harness v3.35 and Perl v5.22.1]
>
> I think I found it. It was introduced in TAP::Harness v3.34:
> https://github.com/Perl-Toolchain-Gang/Test-Harness/commit/66cbf6355928b4828db517a99f1099b7fed35e90
>
> ... and it is enabled with the "--timer" switch.

Yep, that looks like it. When I updated to Mint 18, this broke a perl
script of mine, so I had a quick look to see what I could do to suppress
it. The man page seemed to imply that you could replace the output
formatter, but that didn't take me too far (search CPAN for
TAP::Parser::Formatter: ;-) ). I suppose you could replace
TAP::Formatter::Base, or some such, but I didn't need to go that far - I
simply changed a couple of regexes to ignore the excess output! :-P

Do you really need to suppress that timing information or, like me, can
you simply ignore it?

ATB,
Ramsay Jones
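The "regexes to ignore the excess output" approach Ramsay mentions could look
something like this sketch, which strips the trailing CPU-time annotation
that newer TAP::Harness appends to each test line (the sample line and the
exact regex are assumptions, not his actual script):

```shell
# A prove output line as produced with the --timer behavior enabled:
line='t0001-init.sh .. ok ( 0.02 usr  0.00 sys +  0.00 cusr  0.00 csys =  0.02 CPU)'

# Delete the "( ... usr ... CPU)" suffix, leaving only the test result.
cleaned=$(printf '%s\n' "$line" | sed 's/ *( *[0-9.]* usr.*CPU)$//')
echo "$cleaned"
```

Filtering like this avoids touching the formatter classes at all, at the cost
of post-processing prove's output.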
Git download
I have downloaded Git and cannot find the application on my Mac. Should I try to download it again?
[PATCH v3] Travis: also test on 32-bit Linux
From: Johannes Schindelin

When Git v2.9.1 was released, it had a bug that showed only on Windows
and on 32-bit systems: our assumption that `unsigned long` can hold
64-bit values turned out to be wrong.

This could have been caught earlier if we had a Continuous Testing
setup that includes a build and test run on 32-bit Linux. Let's do this
(and take care of the Windows build later).

This patch asks Travis CI to install a Docker image with 32-bit
libraries and then goes on to build and test Git using this 32-bit
setup.

Signed-off-by: Johannes Schindelin
Signed-off-by: Lars Schneider
---

Hi,

when I looked into the 32-bit/line-log issue [1], I realized that my
proposed docker setup is not ideal for local debugging. Here is an
approach that I think is better. I changed the following:

- disable sudo as it is not required for the Travis job
- keep all docker commands in the .travis.yml
- add option to run commands inside docker with the same UID as the
  host user to make output files accessible
- pass environment variables directly to the docker container

Sorry for the back and forth.

Cheers,
Lars

[1] http://public-inbox.org/git/2205f1a7-a694-4f40-b994-d68c3947f...@gmail.com/

Notes:
    Base Ref: master
    Web-Diff: https://github.com/larsxschneider/git/commit/a6fe1def16
    Checkout: git fetch https://github.com/larsxschneider/git travisci/linux32-v3 && git checkout a6fe1def16

Interdiff (v2..v3):

diff --git a/.travis.yml b/.travis.yml
index fd60fd8328..591cc57b80 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -41,13 +41,25 @@ matrix:
   include:
     - env: Linux32
       os: linux
-      sudo: required
       services:
         - docker
       before_install:
         - docker pull daald/ubuntu32:xenial
       before_script:
-      script: ci/run-linux32-build.sh daald/ubuntu32:xenial
+      script:
+        - >
+          docker run
+          --interactive
+          --env DEFAULT_TEST_TARGET
+          --env GIT_PROVE_OPTS
+          --env GIT_TEST_OPTS
+          --env GIT_TEST_CLONE_2GB
+          --volume "${PWD}:/usr/src/git"
+          daald/ubuntu32:xenial
+          /usr/src/git/ci/run-linux32-build.sh $(id -u $USER)
+# Use the following command to debug the docker build locally:
+# $ docker run -itv "${PWD}:/usr/src/git" --entrypoint /bin/bash daald/ubuntu32:xenial
+# root@container:/# /usr/src/git/ci/run-linux32-build.sh
     - env: Documentation
       os: linux
       compiler: clang
diff --git a/ci/run-linux32-build.sh b/ci/run-linux32-build.sh
index 13c184d41c..f7a3434985 100755
--- a/ci/run-linux32-build.sh
+++ b/ci/run-linux32-build.sh
@@ -1,31 +1,30 @@
 #!/bin/sh
 #
-# Build and test Git in a docker container running a 32-bit Ubuntu Linux
+# Build and test Git in a 32-bit environment
 #
 # Usage:
-#   run-linux32-build.sh [container-image]
+#   run-linux32-build.sh [host-user-id]
 #
-CONTAINER="${1:-daald/ubuntu32:xenial}"
-
-sudo docker run --interactive --volume "${PWD}:/usr/src/git" "$CONTAINER" \
-	/bin/bash -c 'linux32 --32bit i386 sh -c "
-	: update packages &&
+# Update packages to the latest available versions
+linux32 --32bit i386 sh -c '
 	apt update >/dev/null &&
 	apt install -y build-essential libcurl4-openssl-dev libssl-dev \
-		libexpat-dev gettext python >/dev/null &&
+		libexpat-dev gettext python >/dev/null
+' &&
+
+# If this script runs inside a docker container, then all commands are
+# usually executed as root. Consequently, the host user might not be
+# able to access the test output files.
+# If a host user id is given, then create a user "ci" with the host user
+# id to make everything accessible to the host user.
+HOST_UID=$1 &&
+CI_USER=$USER &&
+test -z $HOST_UID || (CI_USER="ci" && useradd -u $HOST_UID $CI_USER) &&
 
-	: build and test &&
+# Build and test
+linux32 --32bit i386 su -m -l $CI_USER -c '
 	cd /usr/src/git &&
-	export DEFAULT_TEST_TARGET='$DEFAULT_TEST_TARGET' &&
-	export GIT_PROVE_OPTS=\"'"$GIT_PROVE_OPTS"'\" &&
-	export GIT_TEST_OPTS=\"'"$GIT_TEST_OPTS"'\" &&
-	export GIT_TEST_CLONE_2GB='$GIT_TEST_CLONE_2GB' &&
 	make --jobs=2 &&
-	make --quiet test || (
-
-	: make test-results readable to non-root user on TravisCI &&
-	test '$TRAVIS' &&
-	find t/test-results/ -type f -exec chmod o+r {} \; &&
-	false )
-"'
+	make --quiet test
+'

 .travis.yml             | 21 +++++++++++++++++++++
 ci/run-linux32-build.sh | 30 ++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)
 create mode 100755 ci/run-linux32-build.sh

diff --git a/.travis.yml b/.travis.yml
index 9c63c8c3f6..591c
Re: [PATCH v1] Travis: also test on 32-bit Linux
> On 02 Mar 2017, at 16:17, Ramsay Jones wrote:
>
> On 02/03/17 11:24, Johannes Schindelin wrote:
>> Hi Lars,
>>
>> On Thu, 2 Mar 2017, Lars Schneider wrote:
>>
> [snip]
>>> One thing that still bugs me: In the Linux32 environment prove adds the
>>> CPU times to every test run: ( 0.02 usr 0.00 sys + 0.00 cusr 0.00
>>> csys ... Has anyone an idea why that happens and how we can disable it?
>>
>> I have no idea.
>
> I have no idea either, but it is not unique to this 32bit Linux, but
> rather the version of prove. For example, I am seeing this on Linux
> Mint 18.1 (64bit _and_ 32bit), whereas Linux Mint 17.x did not do
> this. (They used different Ubuntu LTS releases).
>
> [Mint 18.1 'prove --version' says: TAP::Harness v3.35 and Perl v5.22.1]

I think I found it. It was introduced in TAP::Harness v3.34:
https://github.com/Perl-Toolchain-Gang/Test-Harness/commit/66cbf6355928b4828db517a99f1099b7fed35e90

... and it is enabled with the "--timer" switch.

- Lars
difflame improvements
Hi! Since my last post the biggest improvement is the ability to detect that the user has requested a "reverse" analysis.

Under "normal" circumstances a user would ask difflame to get the diff from an ancestor (call "difflame treeish1 treeish2" so that the merge-base of treeish1 and treeish2 equals treeish1). In this case the blame result is done using straight blame output for added lines and additional analysis to detect where a line was deleted (analysis has improved a lot in this regard; I haven't heard anything from Peff, though). But if the user requests the opposite (call "difflame treeish1 treeish2" so that the merge-base of treeish1 and treeish2 is treeish2) then the analysis has to be driven "in reverse".

Here's one example taken from difflame itself:

normal "forward" call (hope output doesn't get butchered):

$ ./difflame.py HEAD~3 HEAD~2
diff --git a/difflame.py b/difflame.py
index e70154a..04c7577 100755
--- a/difflame.py
+++ b/difflame.py
@@ -365,7 +365,7 @@ def get_full_revision_id(revision):
e5b218e4 (Edmundo 2017-02-01 365) # we already had the revision
50528377 (Edmundo 2017-03-04 366) return REVISIONS_ID_CACHE[revision]
d1d11d8a (Edmundo 2017-02-02 367) # fallback to get it from git
b1a6693 use rev-list to get revision IDs
-b1a6693 (Edmundo 2017-03-04 368) full_revision = run_git_command(["show", "--pretty=%H", revision]).split("\n")[0]
+b1a66932 (Edmundo 2017-03-04 368) full_revision = run_git_command(["rev-list", "--max-count=1", revision]).split("\n")[0]
50528377 (Edmundo 2017-03-04 369) REVISIONS_ID_CACHE[revision] = full_revision
e5b218e4 (Edmundo 2017-02-01 370) return full_revision
91b7d3f5 (Edmundo 2017-01-31 371)

"reverse" call:

$ ./difflame.py HEAD~2 HEAD~3
diff --git a/difflame.py b/difflame.py
index 04c7577..e70154a 100755
--- a/difflame.py
+++ b/difflame.py
@@ -365,7 +365,7 @@ def get_full_revision_id(revision):
e5b218e4 (Edmundo 2017-02-01 365) # we already had the revision
50528377 (Edmundo 2017-03-04 366) return REVISIONS_ID_CACHE[revision]
d1d11d8a (Edmundo 2017-02-02 367) # fallback to get it from git
b1a6693 use rev-list to get revision IDs
-b1a66932 (Edmundo 2017-03-04 368) full_revision = run_git_command(["rev-list", "--max-count=1", revision]).split("\n")[0]
+b1a6693 (Edmundo 2017-03-04 368) full_revision = run_git_command(["show", "--pretty=%H", revision]).split("\n")[0]
50528377 (Edmundo 2017-03-04 369) REVISIONS_ID_CACHE[revision] = full_revision
e5b218e4 (Edmundo 2017-02-01 370) return full_revision
91b7d3f5 (Edmundo 2017-01-31 371)

Notice how the revision reported in both difflame calls is the same:

$ git show b1a66932
commit b1a66932704245fd653f8d48c0a718f168f334a7
Author: Edmundo Carmona Antoranz
Date:   Sat Mar 4 13:59:50 2017 -0600

    use rev-list to get revision IDs

diff --git a/difflame.py b/difflame.py
index e70154a..04c7577 100755
--- a/difflame.py
+++ b/difflame.py
@@ -365,7 +365,7 @@ def get_full_revision_id(revision):
 # we already had the revision
 return REVISIONS_ID_CACHE[revision]
 # fallback to get it from git
-full_revision = run_git_command(["show", "--pretty=%H", revision]).split("\n")[0]
+full_revision = run_git_command(["rev-list", "--max-count=1", revision]).split("\n")[0]
 REVISIONS_ID_CACHE[revision] = full_revision
 return full_revision

If this "detection" to perform reverse analysis hadn't been done, then there wouldn't be a lot of useful information, because there are no revisions in HEAD~2..HEAD~3, and so the output would have been something like:

diff --git a/difflame.py b/difflame.py
index 04c7577..e70154a 100755
--- a/difflame.py
+++ b/difflame.py
@@ -365,7 +365,7 @@ def get_full_revision_id(revision):
e5b218e4 (Edmundo 2017-02-01 365) # we already had the revision
50528377 (Edmundo 2017-03-04 366) return REVISIONS_ID_CACHE[revision]
d1d11d8a (Edmundo 2017-02-02 367) # fallback to get it from git
b1a6693 use rev-list to get revision IDs
%b1a6693 (Edmundo 2017-03-04 368) full_revision = run_git_command(["rev-list", "--max-count=1", revision]).split("\n")[0]
e5b218e printing hints for deleted lines
+e5b218e4 (Edmundo 2017-02-01 368) full_revision = run_git_command(["show", "--pretty=%H", revision]).split("\n")[0]
50528377 (Edmundo 2017-03-04 369) REVISIONS_ID_CACHE[revision] = full_revision
e5b218e4 (Edmundo 2017-02-01 370) return full_revision
91b7d3f5 (Edmundo 2017-01-31 371)

Notice how both the added line and the deleted line are reporting the _wrong_ revision. It should be b1a66932 in all cases.

One question that has been bugging me for a while is what to do in cases where treeish1 and treeish2 are not "direct" descendants (as in, the merge-base of treeish1 and treeish2 is something other than treeish1 or treeish2). Suppose a line was added on an ancestor of treeish1 but it hasn't been merged into treeish2. In this case if we diff treeish1..treeish2 we will ge
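For onlookers, the forward/reverse detection described above reduces to an ancestry test: merge-base(t1, t2) equals t1 exactly when t1 is an ancestor of t2. Here is a rough Python sketch of that check using a toy parent map instead of calling git; the names (`PARENTS`, `analysis_direction`) are illustrative, not difflame's actual code:

```python
# Toy history: E branches off B, so D and E have diverged.
PARENTS = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["B"]}

def is_ancestor(anc, desc):
    """True if `anc` is reachable from `desc` via parent links."""
    stack, seen = [desc], set()
    while stack:
        c = stack.pop()
        if c == anc:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS[c])
    return False

def analysis_direction(treeish1, treeish2):
    # merge-base(t1, t2) == t1  <=>  t1 is an ancestor of t2
    if is_ancestor(treeish1, treeish2):
        return "forward"
    if is_ancestor(treeish2, treeish1):
        return "reverse"
    return "diverged"  # merge-base is some third commit

print(analysis_direction("B", "D"))  # forward
print(analysis_direction("D", "B"))  # reverse
print(analysis_direction("D", "E"))  # diverged
```

The "diverged" branch is exactly the open question at the end of the message: neither endpoint is the merge-base, so neither plain nor reverse blame alone tells the whole story.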
Re: What's cooking in git.git (Mar 2017, #02; Fri, 3)
Hey Stephan

On Sat, Mar 4, 2017 at 8:05 PM, Stephan Beyer wrote:
> Hi Pranit,
>
> On 03/04/2017 12:26 AM, Junio C Hamano wrote:
>> --
>> [Stalled]
> [...]
>>
>> * pb/bisect (2017-02-18) 28 commits
>> - fixup! bisect--helper: `bisect_next_check` & bisect_voc shell function in C
>> - bisect--helper: remove the dequote in bisect_start()
>> - bisect--helper: retire `--bisect-auto-next` subcommand
>> - bisect--helper: retire `--bisect-autostart` subcommand
>> - bisect--helper: retire `--bisect-write` subcommand
>> - bisect--helper: `bisect_replay` shell function in C
>> - bisect--helper: `bisect_log` shell function in C
>> - bisect--helper: retire `--write-terms` subcommand
>> - bisect--helper: retire `--check-expected-revs` subcommand
>> - bisect--helper: `bisect_state` & `bisect_head` shell function in C
>> - bisect--helper: `bisect_autostart` shell function in C
>> - bisect--helper: retire `--next-all` subcommand
>> - bisect--helper: retire `--bisect-clean-state` subcommand
>> - bisect--helper: `bisect_next` and `bisect_auto_next` shell function in C
>> - t6030: no cleanup with bad merge base
>> - bisect--helper: `bisect_start` shell function partially in C
>> - bisect--helper: `get_terms` & `bisect_terms` shell function in C
>> - bisect--helper: `bisect_next_check` & bisect_voc shell function in C
>> - bisect--helper: `check_and_set_terms` shell function in C
>> - bisect--helper: `bisect_write` shell function in C
>> - bisect--helper: `is_expected_rev` & `check_expected_revs` shell function in C
>> - bisect--helper: `bisect_reset` shell function in C
>> - wrapper: move is_empty_file() and rename it as is_empty_or_missing_file()
>> - t6030: explicitly test for bisection cleanup
>> - bisect--helper: `bisect_clean_state` shell function in C
>> - bisect--helper: `write_terms` shell function in C
>> - bisect: rewrite `check_term_format` shell function in C
>> - bisect--helper: use OPT_CMDMODE instead of OPT_BOOL
>>
>> Move more parts of "git bisect" to C.
>>
>> Expecting a reroll.
>
> I guess you are short on time but I am hoping that you are still going
> to send a reroll to the list (probably on top of [1]?). This is because
> I would like to start to work on rerolling my bisect patches from last
> year [2] ... but to avoid a mess of merge conflicts, I am waiting until
> pb/bisect found its way into "next". (There were also recent discussions
> on other bisect strategies [3] and it's probably only a matter of time
> until a new big patchset on bisect--helper comes up...)

I am sorry; I haven't found much time for it. I came across a bug and wasn't able to fix it, so I had stopped working on the series for a while. I almost forgot that you also have a patch series and that this one is important for you. I will start working on it and send a re-roll soon.

Regards,
Pranit Bauva
Re: bisect-helper: we do not bisect --objects
From: "Junio C Hamano"

"Philip Oakley" writes:

> Bikeshedding: If the given boundary is a tag, it could be tagging a blob
> or tree rather than a commit. Would that be a scenario that reaches this
> part of the code?

I do not think it is relevant. Bisection is an operation over a "bisectable" commit DAG, where commits can be partitioned into a "new" (aka "bad") and an "old" (aka "good") camp: all descendants of a "new" commit are "new", and all ancestors of an "old" commit are "old".

Think of the "new"-ness as a 100% inheritable disease that happens spontaneously and without cure. Once you are infected with "new"-ness, all your descendants are forever "new". If you know you are free of the "new"-ness, you know that all your ancestors are not with the "new"-ness, either. The goal of the operation is to find a "new" commit whose parents are all "old".

The bisectability of the commit DAG is what allows you to say "this commit is new" about a commit somewhere in the middle of the history and avoid having to test any descendants, as they all inherit the "new"-ness from it (similarly, when you have one commit that is "old", you do not have to test any ancestor), thereby reducing the number of tests from N (all commits in the good..bad range) to log(N).

There is no room for a tree or a blob to participate in this graph partitioning problem. A "bad" tree that is "new" cannot infect its children with the "new"-ness, and a "good" tree cannot guarantee the lack of "new"-ness of its parents, because a tree or a blob does not have parent or child commits.

Thanks.

I was happy with how the bisect actually works. I was more responding to the open question about how that piece of code may have come into existence, and the potential thought processes that can lead to such 'mistakes'.

My line of reasoning was that it is reasonable to pass both commits and tags, as revisions(7), to the cli as the bad and good points in the graph. This could then lead to an expectation that the objects they point to should be suitably marked, which is quite reasonable for commits, and for most tags. However, there are tags that point to the commit's tree, rather than the commit itself, so if that initial rule were followed in a simplistic manner, then (falsely) the peeled tree of the tag would be marked as good/bad, which, as your patch identifies, would be the wrong thing to do.

The study of human error is quite interesting.

regards
Philip
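The N-to-log(N) reduction described in this thread can be modelled in a few lines. A toy sketch in Python, assuming a strictly linear history where "new"-ness starts at some commit and is inherited by every descendant (the names here are mine, not git's implementation):

```python
def first_new(history, is_new):
    """Binary-search a linear history (oldest..newest) for the first
    "new" commit -- O(log N) tests instead of testing all N commits."""
    lo, hi = 0, len(history) - 1   # history[hi] is known to be "new"
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_new(history[mid]):
            hi = mid        # every descendant of mid is also "new"
        else:
            lo = mid + 1    # every ancestor of mid is also "old"
    return history[lo], tests

history = list(range(1024))   # commit 0 is "good", commit 1023 is "bad"
culprit, tests = first_new(history, lambda c: c >= 700)
print(culprit, tests)  # -> 700 10, i.e. log2(1024) tests
```

The bisectability assumption is what makes each test prune half the remaining range; a tree or blob simply has no place in `history` because it has no parent/child commits to prune.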
Re: Transition plan for git to move to a new hash function
brian m. carlson writes ("Re: Transition plan for git to move to a new hash function"):
> Instead, I was referring to areas like the notes code. It has extensive
> use of the last byte as a type of lookup table key. It's very dependent
> on having exactly one hash, since it will always want to use the last
> byte.

You mean note_tree_search? (My tree here may be a bit out of date.) This doesn't seem difficult to fix. The nontrivial changes would be mostly confined to SUBTREE_SHA1_PREFIXCMP and GET_NIBBLE. It's true that, like most of git, there's a lot of hardcoded `sha1'.

Are you arguing in favour of "replace git with git2 by simply s/20/64/g; s/sha1/blake/g"? This seems to me to be a poor idea. Takeup of the new `git2' would be very slow because of the pain involved.

Any sensible method of moving to a new hash that isn't "make a completely incompatible new version of git" is going to involve teaching the code we have in git right now to handle new hashes as well as sha1 hashes. Even if the plan is to try to convert old data, rather than keep it and be able to refer to it from new data, something will have to be able to parse old packfiles, old commits, old tags, old notes, etc. etc. etc. Either that's going to be some separate conversion utility, or it has to be the same code in git that's there already.[1]

The ability to handle both old-format and new-format data can be achieved in the code by doing away with the hardcoded sha1s, so that instead the hash is an abstract data type with operations like "initialise", "compare", "get a nybble", etc. We've already seen patches going in this direction.

[1] I've heard suggestions here that instead we should expect users to "git1 fast-export", which you would presumably feed into "git2 fast-import". But what is `git1' here? Is it the current git codebase frozen in time? I don't think it can be. With this conversion strategy, we will need to maintain git1 for decades.
It will need portability fixes, security fixes, fixes for new hostile compiler optimisations, and so on. The difficulty of conversion means there will be pressure to backport new features from `git2' to `git1'. (Also, this approach means that all signatures are definitively lost during the conversion process.)

So if we want to provide both `git1' and `git2', it's still better to compile `git' and `git2' from the same codebase. But if we do that, the resulting ifdeffery and/or other hash abstractions are most of the work to be hash-agile. It's just the difference between a compile-time and a runtime switch.

I think the incompatible approach is much more work in the medium and long term - and it leads to a longer transition period.

Bear in mind that our objective is not to minimise the time until the new version of git is available. Our objective is to minimise the time until (most) people are using it. An approach which takes longer for the git community to develop, but which is easier to deploy, can easily be better. Or maybe the objective is to minimise overall effort, in which case more work on git, for an easier transition for all the users, seems like a no-brainer.

I think this is arguably true even from the point of view of effort amongst the community of git contributors. git contributors start out as git users - and if git's users are all busy struggling with a difficult transition, they will have less time to improve other stuff and will tend less to get involved upstream. (And they may be less inclined to feel that the git upstream developers understand their needs well.)

The better alternative is to adopt a plan that has a clear and straightforward transition for users, and ask git users to help with the implementation. I think many git users, including sophisticated users and competent organisations, are concerned about sha1. Currently most of those users will find it difficult to help, because it's not clear to them what needs to be done.

Thanks,
Ian.
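The "hash as an abstract data type" idea can be sketched quickly. A hypothetical Python analogue (git's real abstraction would of course be in C; the names `HashAlgo`, `rawsz`, and `get_nibble` are made up here, and sha256 merely stands in for whatever new hash is chosen):

```python
import hashlib

class HashAlgo:
    """Abstract hash: the rest of the code never hardcodes 20 or sha1."""
    def __init__(self, name, rawsz):
        self.name, self.rawsz = name, rawsz
    def digest(self, data):
        return hashlib.new(self.name, data).hexdigest()
    def get_nibble(self, hexdigest, i):
        # the kind of operation the notes code needs, instead of
        # "the last byte of a 20-byte sha1"
        return int(hexdigest[i], 16)

SHA1   = HashAlgo("sha1", 20)
SHA256 = HashAlgo("sha256", 32)

for algo in (SHA1, SHA256):
    h = algo.digest(b"hello")
    # hex length follows rawsz, so callers need no hardcoded sizes
    print(algo.name, len(h) == 2 * algo.rawsz, algo.get_nibble(h, 0))
```

Code written against such an interface handles both hashes with a runtime switch, which is the point of the compile-time-vs-runtime comparison above.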
Re: [PATCH] line-log: use COPY_ARRAY to fix mis-sized memcpy
On 05/03/2017 12:44, Jeff King wrote:

On Sun, Mar 05, 2017 at 06:36:19AM -0500, Jeff King wrote:

 	range_set_init(dst, src->nr);
-	memcpy(dst->ranges, src->ranges, src->nr*sizeof(struct range_set));
+	memcpy(dst->ranges, src->ranges, src->nr*sizeof(struct range));

I think "sizeof(*dst->ranges)" is probably an even better fix, as it infers the type of "dst". But these days we have COPY_ARRAY() to make it even harder to get this kind of thing wrong.

So here's your fix wrapped up with a commit message, mostly for Junio's convenience. I listed you as the author, since you did the hard part. If you're OK with it, please indicate that it's OK to add your signed-off-by. If you prefer to do it differently, feel free to post your own patch.

Thanks.

-- >8 --
From: Vegard Nossum
Subject: [PATCH] line-log: use COPY_ARRAY to fix mis-sized memcpy

This memcpy meant to get the sizeof a "struct range", not a "range_set", as the former is what our array holds. Rather than swap out the types, let's convert this site to COPY_ARRAY, which avoids the problem entirely (and confirms that the src and dst types match).

Note for curiosity's sake that this bug doesn't trigger on I32LP64 systems, but does on ILP32 systems. The mistaken "struct range_set" has two ints and a pointer. That's 16 bytes on LP64, or 12 on ILP32. The correct "struct range" type has two longs, which is also 16 on LP64, but only 8 on ILP32. Likewise an IL32P64 system would experience the bug.

Signed-off-by: Jeff King
Signed-off-by: Vegard Nossum
---
 line-log.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/line-log.c b/line-log.c
index 65f3558b3..a2477f601 100644
--- a/line-log.c
+++ b/line-log.c
@@ -43,9 +43,10 @@ void range_set_release(struct range_set *rs)
 static void range_set_copy(struct range_set *dst, struct range_set *src)
 {
 	range_set_init(dst, src->nr);
-	memcpy(dst->ranges, src->ranges, src->nr*sizeof(struct range_set));
+	COPY_ARRAY(dst->ranges, src->ranges, src->nr);
 	dst->nr = src->nr;
 }
+
 static void range_set_move(struct range_set *dst, struct range_set *src)
 {
 	range_set_release(dst);

Vegard
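For anyone following along, the LP64-vs-ILP32 size accounting in the commit message can be checked with Python's struct module, using standard (fixed) sizes so the result doesn't depend on the machine running it. Here "iiI"/"iiQ" stand in for range_set's two ints plus a pointer, and "ii"/"qq" for range's two longs:

```python
import struct

# ILP32: int = 4, long = 4, pointer = 4
range_set_32 = struct.calcsize("<iiI")  # two ints + 32-bit pointer
range_32     = struct.calcsize("<ii")   # two 32-bit longs

# LP64: int = 4, long = 8, pointer = 8
range_set_64 = struct.calcsize("<iiQ")  # two ints + 64-bit pointer
range_64     = struct.calcsize("<qq")   # two 64-bit longs

print(range_set_32, range_32)  # 12 8  -> memcpy overruns by nr*4 bytes
print(range_set_64, range_64)  # 16 16 -> sizes coincide, bug stays hidden
```

This is exactly why the Travis 32-bit job caught what LP64 development machines never did.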
Re: [PATCH v1] Travis: also test on 32-bit Linux
On Sun, Mar 05, 2017 at 06:36:19AM -0500, Jeff King wrote:

> I grepped for 'memcpy.*sizeof' and found one other case that's not a
> bug, but is questionable.

And here's the fix for that case. It can be applied separately from the other patch if need be.

-- >8 --
Subject: [PATCH] ewah: fix eword_t/uint64_t confusion

The ewah subsystem typedefs eword_t to be uint64_t, but some code uses a bare uint64_t. This isn't a bug now, but it's a potential maintenance problem if the definition of eword_t ever changes. Let's use the correct type.

Note that we can't use COPY_ARRAY() here because the source and destination point to objects of different sizes. For that reason we'll also skip the usual "sizeof(*dst)" and use the real type, which should make it more clear that there's something tricky going on.

Signed-off-by: Jeff King
---
 ewah/ewah_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ewah/ewah_io.c b/ewah/ewah_io.c
index 61f6a4357..f73210973 100644
--- a/ewah/ewah_io.c
+++ b/ewah/ewah_io.c
@@ -142,8 +142,8 @@ int ewah_read_mmap(struct ewah_bitmap *self, const void *map, size_t len)
 	 * the endianness conversion in a separate pass to ensure
 	 * we're loading 8-byte aligned words.
 	 */
-	memcpy(self->buffer, ptr, self->buffer_size * sizeof(uint64_t));
-	ptr += self->buffer_size * sizeof(uint64_t);
+	memcpy(self->buffer, ptr, self->buffer_size * sizeof(eword_t));
+	ptr += self->buffer_size * sizeof(eword_t);

 	for (i = 0; i < self->buffer_size; ++i)
 		self->buffer[i] = ntohll(self->buffer[i]);
--
2.12.0.426.g9d5d0eeae
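The comment quoted in the hunk describes a two-step load: a bulk copy first, then a separate endianness pass over aligned 8-byte words. A rough Python stand-in for that access pattern (this is only a sketch of the idea, not the git code):

```python
import struct

def read_words(mapped, nwords):
    # step 1: bulk copy of nwords 8-byte words (the memcpy)
    raw = bytes(mapped[:nwords * 8])
    # step 2: separate pass converting each big-endian on-disk word
    # to host order, reading at 8-byte-aligned offsets
    return [struct.unpack_from(">Q", raw, i * 8)[0] for i in range(nwords)]

on_disk = struct.pack(">QQ", 0x0102030405060708, 42)
print(read_words(on_disk, 2) == [0x0102030405060708, 42])  # True
```

In the C code the two steps matter because the mmap'd source may not be aligned; copying first lets the ntohll loop read aligned words.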
[PATCH] line-log: use COPY_ARRAY to fix mis-sized memcpy
On Sun, Mar 05, 2017 at 06:36:19AM -0500, Jeff King wrote:

> >  	range_set_init(dst, src->nr);
> > -	memcpy(dst->ranges, src->ranges, src->nr*sizeof(struct range_set));
> > +	memcpy(dst->ranges, src->ranges, src->nr*sizeof(struct range));
>
> I think "sizeof(*dst->ranges)" is probably an even better fix, as it
> infers the type of "dst". But these days we have COPY_ARRAY() to make it
> even harder to get this kind of thing wrong.

So here's your fix wrapped up with a commit message, mostly for Junio's convenience. I listed you as the author, since you did the hard part. If you're OK with it, please indicate that it's OK to add your signed-off-by. If you prefer to do it differently, feel free to post your own patch.

-- >8 --
From: Vegard Nossum
Subject: [PATCH] line-log: use COPY_ARRAY to fix mis-sized memcpy

This memcpy meant to get the sizeof a "struct range", not a "range_set", as the former is what our array holds. Rather than swap out the types, let's convert this site to COPY_ARRAY, which avoids the problem entirely (and confirms that the src and dst types match).

Note for curiosity's sake that this bug doesn't trigger on I32LP64 systems, but does on ILP32 systems. The mistaken "struct range_set" has two ints and a pointer. That's 16 bytes on LP64, or 12 on ILP32. The correct "struct range" type has two longs, which is also 16 on LP64, but only 8 on ILP32. Likewise an IL32P64 system would experience the bug.

Signed-off-by: Jeff King
---
 line-log.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/line-log.c b/line-log.c
index 65f3558b3..a2477f601 100644
--- a/line-log.c
+++ b/line-log.c
@@ -43,9 +43,10 @@ void range_set_release(struct range_set *rs)
 static void range_set_copy(struct range_set *dst, struct range_set *src)
 {
 	range_set_init(dst, src->nr);
-	memcpy(dst->ranges, src->ranges, src->nr*sizeof(struct range_set));
+	COPY_ARRAY(dst->ranges, src->ranges, src->nr);
 	dst->nr = src->nr;
 }
+
 static void range_set_move(struct range_set *dst, struct range_set *src)
 {
 	range_set_release(dst);
--
2.12.0.426.g9d5d0eeae
Re: [PATCH v1] Travis: also test on 32-bit Linux
On Sat, Mar 04, 2017 at 09:08:40PM +0100, Vegard Nossum wrote:

> > At a glance, looks like range_set_copy() is using
> > sizeof(struct range_set) == 12, but
> > range_set_init/range_set_grow/ALLOC_GROW/REALLOC_ARRAY is using
> > sizeof(rs->range) == 8.
>
> Attached patch seems to fix it -- basically, range_set_copy() is trying
> to copy more than it should. It was uncovered with the test case from
> Allan's commit because it's creating enough ranges to overflow the
> initial allocation on 32-bit.

Ugh, yeah, that is definitely a bug.

> diff --git a/line-log.c b/line-log.c
> index 951029665..cb0dc1110 100644
> --- a/line-log.c
> +++ b/line-log.c
> @@ -43,7 +43,7 @@ void range_set_release(struct range_set *rs)
>  static void range_set_copy(struct range_set *dst, struct range_set *src)
>  {
>  	range_set_init(dst, src->nr);
> -	memcpy(dst->ranges, src->ranges, src->nr*sizeof(struct range_set));
> +	memcpy(dst->ranges, src->ranges, src->nr*sizeof(struct range));

I think "sizeof(*dst->ranges)" is probably an even better fix, as it infers the type of "dst". But these days we have COPY_ARRAY() to make it even harder to get this kind of thing wrong.

I grepped for 'memcpy.*sizeof' and found one other case that's not a bug, but is questionable. Of the "good" cases, I think most of them could be converted into something more obviously-correct, which would make auditing easier. The three main cases I saw were:

 1. Ones which can probably be converted to COPY_ARRAY().

 2. Ones which just copy a single object, like:

      memcpy(&dst, &src, sizeof(dst));

    Perhaps we should be using struct assignment like:

      dst = src;

    here. It's safer and it should give the compiler more room to optimize. The only downside is that if you have pointers, it is easy to write "dst = src" when you meant "*dst = *src".

 3. There were a number of alloc-and-copy instances. The copy part is the same as (2) above, but you have to repeat the size, which is potentially error-prone.
I wonder if we would want something like:

  #define ALLOC_COPY(dst, src) do { \
  	(dst) = xmalloc(sizeof(*(dst))); \
  	COPY_ARRAY(dst, src, 1); \
  } while (0)

That avoids having to specify the size at all, and triggers a compile-time error if "src" and "dst" point to objects of different sizes. I suspect our friendly neighborhood coccinelle wizards could cook up a conversion.

-Peff
Re: RFC: Another proposed hash function transition plan
> Translation table
> ~~~~~~~~~~~~~~~~~
>
> A fast bidirectional mapping between sha1-names and sha256-names of all
> local objects in the repository is kept on disk. The exact format of
> that mapping is to be determined.
>
> All operations that make new objects (e.g., "git commit") add the new
> objects to the translation table.

This seems like a rather nontrivial thing to design. It will need to hold millions of mappings, and be quickly searchable from either direction (sha1->new and new->sha1), while still being fairly fast to insert new records into.

For Linux, just the list of hashes recording the commits is going to be in the millions, while the list of hashes of individual files for all those commits is going to be substantially larger.

David Lang
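A toy in-memory sketch of such a bidirectional table, just to make the two-direction lookup concrete (Python dicts; the real thing is on disk with a still-undecided format, and the two names would come from hashing two different encodings of each object, which this sketch glosses over):

```python
import hashlib

class TranslationTable:
    """Toy bidirectional sha1 <-> sha256 name mapping."""
    def __init__(self):
        self.sha1_to_new = {}
        self.new_to_sha1 = {}

    def add(self, data):
        # Every operation that makes a new object records both names.
        s1   = hashlib.sha1(data).hexdigest()
        s256 = hashlib.sha256(data).hexdigest()
        self.sha1_to_new[s1] = s256
        self.new_to_sha1[s256] = s1
        return s1, s256

tbl = TranslationTable()
s1, s256 = tbl.add(b"blob 5\x00hello")
print(tbl.sha1_to_new[s1] == s256, tbl.new_to_sha1[s256] == s1)  # True True
```

At the millions-of-objects scale David describes, hash tables like this live comfortably in memory but the on-disk format would need something like two sorted indexes (or one sorted pair list per direction) to keep both lookups and inserts cheap.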