Re: [PATCH v2] log -G: Ignore binary files
Junio C Hamano writes: >> +test_expect_success 'log -G ignores binary files' ' >> +git checkout --orphan orphan1 && >> +printf "a\0a" >data.bin && >> +git add data.bin && >> +git commit -m "message" && >> +git log -Ga >result && >> +test_must_be_empty result >> +' > > As this is the first mention of data.bin, this is adding a new file > data.bin that has two 'a' but is a binary file. And that is the > only commit in the history leading to orphan1. > > The fact that "log -Ga" won't find any means it missed the creation > event, because the blob is binary. Good. By the way, this root commit records another file whose path is "file" and has "Picked" in it. If the file had 'a' in it, it would have been included in "git log" output, but that is too subtle a point to be noticed by the readers who are only reading this patch without seeing what has been done to the index before this test piece. If you are going to restructure these tests to create a three-commit history in a single expect_success that is inspected with various "log -Ga" invocations in subsequent tests, it is worth removing that other file (or rather, starting with "read-tree --empty" immediately after checking out the orphan branch, to clarify to the readers that there is nothing but what you add in the set-up step in the index) to make the test more robust.
Re: [PATCH v2] log -G: Ignore binary files
Thomas Braun writes: > Subject: Re: [PATCH v2] log -G: Ignore binary files s/Ig/ig/; (will locally munge--this alone is no reason to reroll). The code changes looked sensible. > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > index 844df760f7..5c3e2a16b2 100755 > --- a/t/t4209-log-pickaxe.sh > +++ b/t/t4209-log-pickaxe.sh > @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing > textconv tool)' ' > rm .gitattributes > ' > > +test_expect_success 'log -G ignores binary files' ' > + git checkout --orphan orphan1 && > + printf "a\0a" >data.bin && > + git add data.bin && > + git commit -m "message" && > + git log -Ga >result && > + test_must_be_empty result > +' As this is the first mention of data.bin, this is adding a new file data.bin that has two 'a' but is a binary file. And that is the only commit in the history leading to orphan1. The fact that "log -Ga" won't find any means it missed the creation event, because the blob is binary. Good. > +test_expect_success 'log -G looks into binary files with -a' ' > + git checkout --orphan orphan2 && > + printf "a\0a" >data.bin && > + git add data.bin && > + git commit -m "message" && This starts from the state left by the previous test piece, i.e. we have a binary data.bin file with two 'a' in it. We pretend to modify and add, but these two steps are no-op if the previous succeeded, but even if the previous step failed, we get what we want in the data.bin file. And then we make an initial commit the same way. > + git log -a -Ga >actual && > + git log >expected && And we ran the same test but this time with "-a" to tell Git that binary-ness should not matter. It will find the sole commit. Good. > + test_cmp actual expected > +' > + > +test_expect_success 'log -G looks into binary files with textconv filter' ' > + git checkout --orphan orphan3 && > + echo "* diff=bin" > .gitattributes && s/> />/; (will locally munge--this alone is no reason to reroll). > + printf "a\0a" >data.bin && > + git add data.bin && > + git commit -m "message" && > + git -c diff.bin.textconv=cat log -Ga >actual && This exposes a slight iffy-ness in the design. The textconv filter used here does not strip the "binary-ness" from the payload, but it is enough to tell the machinery that -G should look into the difference. Is that really desirable, though? IOW, if this weren't the initial commit (which is handled by the codepath to special-case creation and deletion in diff_grep() function), would "log -Ga" show it without "-a"? Should it? I think this test piece (and probably the previous ones for "-a" vs "no -a" without textconv, as well) should be using a history with three commits, where - the root commit introduces "a\0a" to data.bin (creation event) - the second commit adds another instance of "a\0a" to data.bin (forces comparison) - the third commit removes data.bin (deletion event) and make sure that the three are treated identically. If "log -Ga" finds one (with the combination of other conditions like use of textconv or -a option), it should find all three, and vice versa. > + git log >expected && > + test_cmp actual expected > +' > + > +test_expect_success 'log -S looks into binary files' ' > + git checkout --orphan orphan4 && > + printf "a\0a" >data.bin && > + git add data.bin && > + git commit -m "message" && > + git log -Sa >actual && > + git log >expected && > + test_cmp actual expected > +' Likewise. This would also benefit from a three-commit history. Perhaps you can create such a history at the beginning of these additions as another "setup -G/-S binary test" step and test different variations in subsequent tests without the setup? > test_done
Re: [PATCH v2] log -G: Ignore binary files
On Wed, Nov 28 2018, Thomas Braun wrote: Looks much better this time around. > The -G option of log looks for the differences whose patch text > contains added/removed lines that match regex. > > As the concept of patch text only makes sense for text files, we need to > ignore binary files when searching with -G as well. > > The -S option of log looks for differences that changes > the number of occurrences of the specified block of text (i.e. > addition/deletion) in a file. As we want to keep the current behaviour, > add a test to ensure it. > [...] > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt > index c0a60f3158..059ddd3431 100644 > --- a/Documentation/gitdiffcore.txt > +++ b/Documentation/gitdiffcore.txt > @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches > the given > regular expression. This means that it will detect in-file (or what > rename-detection considers the same file) moves, which is noise. The > implementation runs diff twice and greps, and this can be quite > -expensive. > +expensive. Binary files without textconv filter are ignored. Now that we support --text that should be documented. I tried to come up with something on top: diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt index 0378cd574e..42ae65fb57 100644 --- a/Documentation/diff-options.txt +++ b/Documentation/diff-options.txt @@ -524,6 +524,10 @@ struct), and want to know the history of that block since it first came into being: use the feature iteratively to feed the interesting block in the preimage back into `-S`, and keep going until you get the very first version of the block. ++ +Unlike `-G` the `-S` option will always search through binary files +without a textconv filter. [[TODO: Don't we want to support --no-text +then as an optimization?]]. -G:: Look for differences whose patch text contains added/removed @@ -545,6 +549,15 @@ occurrences of that string did not change). + See the 'pickaxe' entry in linkgit:gitdiffcore[7] for more information. ++ +Unless `--text` is supplied binary files without a textconv filter +will be ignored. This was not the case before Git version 2.21.. ++ +With `--text`, instead of patch lines we :: Look for differences that change the number of occurrences of diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt index c0a60f3158..26880b4149 100644 --- a/Documentation/gitdiffcore.txt +++ b/Documentation/gitdiffcore.txt @@ -251,6 +251,10 @@ criterion in a changeset, the entire changeset is kept. This behavior is designed to make reviewing changes in the context of the whole changeset easier. +Both `-S' and `-G' will ignore binary files without a textconv filter +by default, this can be overriden with `--text`. With `--text` the +binary patch we look through is generated as [[TODO: ???]]. + diffcore-order: For Sorting the Output Based on Filenames - But as you can see given the TODO comments I don't know how this works exactly. I *could* dig, but that's my main outstanding problem with this patch, the commit message / docs aren't being updated to reflect the new behavior. I.e. let's leave the docs in some state where the reader can as unambiguously know what to expect with -G and these binary diffs we've been implicitly supporting as with the textual diffs. Ideally with some examples of how to generate them (re my question about the base85 output in v1). Part of that's obviously behavior we've had all along, but it's much more convincing to say: We are changing X which we've done for ages, it works exactly like this, and here's a switch to get it back. Instead of: X doesn't make sense, let's turn it off. Also the diffcore docs already say stuff about how slow/fast things are, and in a side-thread you said: My main motiviation is to speed up "log -G" as that takes a considerable amount of time when it wades through MBs of binary files which change often. Makes sense, but then let's say something about that in that section of the docs. > When `-S` or `-G` are used without `--pickaxe-all`, only filepairs > that match their respective criterion are kept in the output. When > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c > index 69fc55ea1e..4cea086f80 100644 > --- a/diffcore-pickaxe.c > +++ b/diffcore-pickaxe.c > @@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct > diff_options *o, > if (textconv_one == textconv_two && diff_unmodified_pair(p)) > return 0; > > + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) && > + !o->flags
[PATCH v2] log -G: Ignore binary files
The -G option of log looks for the differences whose patch text contains added/removed lines that match regex. As the concept of patch text only makes sense for text files, we need to ignore binary files when searching with -G as well. The -S option of log looks for differences that changes the number of occurrences of the specified block of text (i.e. addition/deletion) in a file. As we want to keep the current behaviour, add a test to ensure it. Signed-off-by: Thomas Braun --- Changes since v1: - Merged both patches into one - Adapted commit messages - Added missing support for -a flag with tests - Placed new code into correct location to be able to reuse an existing optimization - Uses help-suggested -Ga writing without spaces - Uses orphan branches instead of cannonball cleanup with rm -rf - Changed search text to make it clear that it is not about the \0 boundary Documentation/gitdiffcore.txt | 2 +- diffcore-pickaxe.c| 6 ++ t/t4209-log-pickaxe.sh| 40 +++ 3 files changed, 47 insertions(+), 1 deletion(-) diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt index c0a60f3158..059ddd3431 100644 --- a/Documentation/gitdiffcore.txt +++ b/Documentation/gitdiffcore.txt @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given regular expression. This means that it will detect in-file (or what rename-detection considers the same file) moves, which is noise. The implementation runs diff twice and greps, and this can be quite -expensive. +expensive. Binary files without textconv filter are ignored. When `-S` or `-G` are used without `--pickaxe-all`, only filepairs that match their respective criterion are kept in the output. When diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c index 69fc55ea1e..4cea086f80 100644 --- a/diffcore-pickaxe.c +++ b/diffcore-pickaxe.c @@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o, if (textconv_one == textconv_two && diff_unmodified_pair(p)) return 0; + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) && + !o->flags.text && + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) || +(!textconv_two && diff_filespec_is_binary(o->repo, p->two + return 0; + mf1.size = fill_textconv(o->repo, textconv_one, p->one, ); mf2.size = fill_textconv(o->repo, textconv_two, p->two, ); diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh index 844df760f7..5c3e2a16b2 100755 --- a/t/t4209-log-pickaxe.sh +++ b/t/t4209-log-pickaxe.sh @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' ' rm .gitattributes ' +test_expect_success 'log -G ignores binary files' ' + git checkout --orphan orphan1 && + printf "a\0a" >data.bin && + git add data.bin && + git commit -m "message" && + git log -Ga >result && + test_must_be_empty result +' + +test_expect_success 'log -G looks into binary files with -a' ' + git checkout --orphan orphan2 && + printf "a\0a" >data.bin && + git add data.bin && + git commit -m "message" && + git log -a -Ga >actual && + git log >expected && + test_cmp actual expected +' + +test_expect_success 'log -G looks into binary files with textconv filter' ' + git checkout --orphan orphan3 && + echo "* diff=bin" > .gitattributes && + printf "a\0a" >data.bin && + git add data.bin && + git commit -m "message" && + git -c diff.bin.textconv=cat log -Ga >actual && + git log >expected && + test_cmp actual expected +' + +test_expect_success 'log -S looks into binary files' ' + git checkout --orphan orphan4 && + printf "a\0a" >data.bin && + git add data.bin && + git commit -m "message" && + git log -Sa >actual && + git log >expected && + test_cmp actual expected +' + test_done -- 2.19.0.271.gfe8321ec05.dirty
Re: [PATCH v1 1/2] log -G: Ignore binary files
> Ævar Arnfjörð Bjarmason hat am 22. November 2018 um 11:16 > geschrieben: [...] > > > > +test_expect_success 'log -G ignores binary files' ' > > + rm -rf .git && > > + git init && > > + printf "a\0b" >data.bin && > > + git add data.bin && > > + git commit -m "message" && > > + git log -G a >result && > > Would be less confusing as "-Ga" since that's the invocation we > document, even though I see (but wasn't aware that...) "-G a" works too. Done. > > + test_must_be_empty result > > +' > > + > > +test_expect_success 'log -G looks into binary files with textconv filter' ' > > + rm -rf .git && > > + git init && > > + echo "* diff=bin" > .gitattributes && > > + printf "a\0b" >data.bin && > > + git add data.bin && > > + git commit -m "message" && > > + git -c diff.bin.textconv=cat log -G a >actual && > > + git log >expected && > > + test_cmp actual expected > > +' > > + > > test_done > > This patch seems like the wrong direction to me. In particular the > assertion that "the concept of differences only makes sense for text > files". That's just not true. This patch breaks this: > > ( > rm -rf /tmp/g-test && > git init /tmp/g-test && > cd /tmp/g-test && > for i in {1..10}; do > echo "Always matching thensome 5" >file && > printf "a thensome %d binary \0" $i >>file && > git add file && > git commit -m"Bump $i" > done && > git log -Gthensome.*5 > ) > > Right now this will emit 3/10 patches, and the right ones! I.e. "Bump > [156]". The 1st one because it introduces the "Always matching thensome > 5". Then 5/6 because the add/remove the string "a thensome 5 binary", > respectively. Which matches /thensome.*5/. log -p does not show you the patch text in your example because it is treated as binary. And currently "log -G" has a different opinion into what it looks and what it ignores. My patch tries to bring both more in line. > I.e. in the first one we do a regex match against the content here > because we don't have both sides: > https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53 > > And then for the later ones where we have both sides we end up in > diffgrep_consume(): > https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36 > > I think there may be a real issue here to address, which might be some > combination of: > > a) Even though the diffcore can do a binary diff internally, this is > not what it exposes with "-p", we just say "Binary files differ". > > I don't know how to emit the raw version we'll end up passing to > diffgrep_consume() in this case. Is it just --binary without the > encoding? I don't know... > > b) Your test case shows that you're matching a string at a \0 > boundary. Is this perhaps something you ran into? I.e. that we don't > have some -F version of -G so we can't supply regexes that match > past a \0? I had some related work on grep for this that hasn't been > carried over to the diffcore: > > git log --grep='grep:.*\\0' --author=Ævar > > c) Is this binary diff we end up matching against just bad in some > cases? I haven't dug but that wouldn't surprise me, i.e. that it's > trying to be line-based so we'll overmatch in many cases. > > So maybe this is something that should be passed down as a flag? See a > recent discussion at > https://public-inbox.org/git/87lg77cmr1@evledraar.gmail.com/ for how > that could be done. It is not about the \0 boundary. v2 of the patches will clarify that. My main motiviation is to speed up "log -G" as that takes a considerable amount of time when it wades through MBs of binary files which change often. And in multiple places I can already treat binary files differently (e.g. turn off delta compression, skip trying to diff them, no EOL normalization). And for me making log -G ignore what git thinks are binary files is making the line clearer between what should be treated as binary and what as text. > Also if we don't have some tests already that were failing with this > patch we really should have those as "let's test the current behavior > first". Unfortunately tests in this area are really lacking, see > e.g. my: > > git log --author=Junio --min-parents=2 --grep=ab/.*grep > > For some series of patches to grep where to get one patch in I needed to > often lead with 5-10 test patches to convince reviewers that I knew what > I was changing, and also to be comfortable that I'd covered all the edge > cases we currently supported, but weren't testing for. I'm happy to add more test cases to convince everyone involved :)
Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
> Junio C Hamano hat am 22. November 2018 um 02:34 > geschrieben: > > > Thomas Braun writes: > > > The -S option of log looks for differences that changes the > > number of occurrences of the specified string (i.e. addition/deletion) > > in a file. > > s/-S /-S/ and > s/the specified string/the specified block of text/ would make it > more in line with how Documentation/gitdiffcore.txt explains it. > The original discussion from early 2017 also explains with a pointer > why the primary mode of -S is not but is . Thanks for the pointer. I've updated the commit message. > > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > > index 42cc8afd8b..d430f6f2f9 100755 > > --- a/t/t4209-log-pickaxe.sh > > +++ b/t/t4209-log-pickaxe.sh > > @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files > > with textconv filter' ' > > test_cmp actual expected > > ' > > > > +test_expect_success 'log -S looks into binary files' ' > > + rm -rf .git && > > + git init && > > Same comment as the one for 1/2 applies here. Fixed as well. > > + printf "a\0b" >data.bin && > > + git add data.bin && > > + git commit -m "message" && > > + git log -S a >actual && > > + git log >expected && > > + test_cmp actual expected > > +' > > + > > test_done > > Other than these, I think both patches look sensible. Thanks for > resurrecting the old topic and reigniting it. >
Re: [PATCH v1 1/2] log -G: Ignore binary files
> Junio C Hamano hat am 27. November 2018 um 01:51 > geschrieben: > > > Stefan Beller writes: > > > On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun > > wrote: > >> > >> The -G option of log looks for the differences whose patch text > >> contains added/removed lines that match regex. > >> > >> The concept of differences only makes sense for text files, therefore > >> we need to ignore binary files when searching with -G as well. > > > > What about partial text/partial binary files? > > Good point. You'd use "-a" (or "--text") to tell the diff machinery > to treat the contents as text, and the new logic must pay attention > to that command line option. Yes exactly. Either use -a for the occasional use or a textconv filter for permanent use. Coming from the opposite side: I usually mark svg files as binary as the textual diff is well, let's say uninspiring.
Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
> Ævar Arnfjörð Bjarmason hat am 22. November 2018 um 10:14 > geschrieben: > > > > On Wed, Nov 21 2018, Thomas Braun wrote: > > > The -S option of log looks for differences that changes the > > number of occurrences of the specified string (i.e. addition/deletion) > > in a file. > > > > Add a test to ensure that we keep looking into binary files with -S > > as changing that would break backwards compatibility in unexpected ways. > > > > Signed-off-by: Thomas Braun > > --- > > t/t4209-log-pickaxe.sh | 11 +++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > > index 42cc8afd8b..d430f6f2f9 100755 > > --- a/t/t4209-log-pickaxe.sh > > +++ b/t/t4209-log-pickaxe.sh > > @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files > > with textconv filter' ' > > test_cmp actual expected > > ' > > > > +test_expect_success 'log -S looks into binary files' ' > > + rm -rf .git && > > + git init && > > + printf "a\0b" >data.bin && > > + git add data.bin && > > + git commit -m "message" && > > + git log -S a >actual && > > + git log >expected && > > + test_cmp actual expected > > +' > > + > > test_done > > This should just be part of 1/2 since the behavior is changed there & > the commit message should describe both cases. My reasoning was that this is a separate test which does not fit in with the other part. But I'm happy in folding both into one patch. Done.
Re: [PATCH v1 1/2] log -G: Ignore binary files
> Ævar Arnfjörð Bjarmason hat am 22. November 2018 um 11:16 > geschrieben: [...] > > > > +test_expect_success 'log -G ignores binary files' ' > > + rm -rf .git && > > + git init && > > + printf "a\0b" >data.bin && > > + git add data.bin && > > + git commit -m "message" && > > + git log -G a >result && > > Would be less confusing as "-Ga" since that's the invocation we > document, even though I see (but wasn't aware that...) "-G a" works too. Done. > > + test_must_be_empty result > > +' > > + > > +test_expect_success 'log -G looks into binary files with textconv filter' ' > > + rm -rf .git && > > + git init && > > + echo "* diff=bin" > .gitattributes && > > + printf "a\0b" >data.bin && > > + git add data.bin && > > + git commit -m "message" && > > + git -c diff.bin.textconv=cat log -G a >actual && > > + git log >expected && > > + test_cmp actual expected > > +' > > + > > test_done > > This patch seems like the wrong direction to me. In particular the > assertion that "the concept of differences only makes sense for text > files". That's just not true. This patch breaks this: > > ( > rm -rf /tmp/g-test && > git init /tmp/g-test && > cd /tmp/g-test && > for i in {1..10}; do > echo "Always matching thensome 5" >file && > printf "a thensome %d binary \0" $i >>file && > git add file && > git commit -m"Bump $i" > done && > git log -Gthensome.*5 > ) > > Right now this will emit 3/10 patches, and the right ones! I.e. "Bump > [156]". The 1st one because it introduces the "Always matching thensome > 5". Then 5/6 because the add/remove the string "a thensome 5 binary", > respectively. Which matches /thensome.*5/. log -p does not show you the patch text in your example because it is treated as binary. And currently "log -G" has a different opinion into what it looks and what it ignores. My patch tries to bring both more in line. > I.e. in the first one we do a regex match against the content here > because we don't have both sides: > https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53 > > And then for the later ones where we have both sides we end up in > diffgrep_consume(): > https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36 > > I think there may be a real issue here to address, which might be some > combination of: > > a) Even though the diffcore can do a binary diff internally, this is > not what it exposes with "-p", we just say "Binary files differ". > > I don't know how to emit the raw version we'll end up passing to > diffgrep_consume() in this case. Is it just --binary without the > encoding? I don't know... > > b) Your test case shows that you're matching a string at a \0 > boundary. Is this perhaps something you ran into? I.e. that we don't > have some -F version of -G so we can't supply regexes that match > past a \0? I had some related work on grep for this that hasn't been > carried over to the diffcore: > > git log --grep='grep:.*\\0' --author=Ævar > > c) Is this binary diff we end up matching against just bad in some > cases? I haven't dug but that wouldn't surprise me, i.e. that it's > trying to be line-based so we'll overmatch in many cases. > > So maybe this is something that should be passed down as a flag? See a > recent discussion at > https://public-inbox.org/git/87lg77cmr1@evledraar.gmail.com/ for how > that could be done. It is not about the \0 boundary. v2 of the patches will clarify that. My main motiviation is to speed up "log -G" as that takes a considerable amount of time when it wades through MBs of binary files which change often. And in multiple places I can already treat binary files differently (e.g. turn off delta compression, skip trying to diff them, no EOL normalization). And for me making log -G ignore what git thinks are binary files is making the line clearer between what should be treated as binary and what as text. > Also if we don't have some tests already that were failing with this > patch we really should have those as "let's test the current behavior > first". Unfortunately tests in this area are really lacking, see > e.g. my: > > git log --author=Junio --min-parents=2 --grep=ab/.*grep > > For some series of patches to grep where to get one patch in I needed to > often lead with 5-10 test patches to convince reviewers that I knew what > I was changing, and also to be comfortable that I'd covered all the edge > cases we currently supported, but weren't testing for. I'm happy to add more test cases to convince everyone involved :)
Re: [PATCH v1 1/2] log -G: Ignore binary files
> Jeff King hat am 22. November 2018 um 17:20 geschrieben: > > > On Wed, Nov 21, 2018 at 09:52:27PM +0100, Thomas Braun wrote: > > > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c > > index 69fc55ea1e..8c2558b07d 100644 > > --- a/diffcore-pickaxe.c > > +++ b/diffcore-pickaxe.c > > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, > > struct diff_options *o, > > textconv_two = get_textconv(o->repo->index, p->two); > > } > > > > + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) && > > + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) || > > +(!textconv_two && diff_filespec_is_binary(o->repo, p->two > > + return 0; > > If the user passes "-a" to treat binary files as text, we should > probably skip the binary check. I think we'd need to check > "o->flags.text" here. Good point. I missed that flag. Added. > > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > > index 844df760f7..42cc8afd8b 100755 > > --- a/t/t4209-log-pickaxe.sh > > +++ b/t/t4209-log-pickaxe.sh > > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing > > textconv tool)' ' > > [...] > > +test_expect_success 'log -G ignores binary files' ' > > [...] > > +test_expect_success 'log -G looks into binary files with textconv filter' ' > > And likewise add a test here similar to the textconv one. Added as well.
Re: [PATCH v1 1/2] log -G: Ignore binary files
> Junio C Hamano hat am 22. November 2018 um 02:29 > geschrieben: > > > Thomas Braun writes: > > > The -G option of log looks for the differences whose patch text > > contains added/removed lines that match regex. > > > > The concept of differences only makes sense for text files, therefore > > we need to ignore binary files when searching with -G as well. > > > > Signed-off-by: Thomas Braun > > --- > > Documentation/gitdiffcore.txt | 2 +- > > diffcore-pickaxe.c| 5 + > > t/t4209-log-pickaxe.sh| 22 ++ > > 3 files changed, 28 insertions(+), 1 deletion(-) > > OK. > > > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt > > index c0a60f3158..059ddd3431 100644 > > --- a/Documentation/gitdiffcore.txt > > +++ b/Documentation/gitdiffcore.txt > > @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that > > matches the given > > regular expression. This means that it will detect in-file (or what > > rename-detection considers the same file) moves, which is noise. The > > implementation runs diff twice and greps, and this can be quite > > -expensive. > > +expensive. Binary files without textconv filter are ignored. > > OK. > > > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c > > index 69fc55ea1e..8c2558b07d 100644 > > --- a/diffcore-pickaxe.c > > +++ b/diffcore-pickaxe.c > > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, > > struct diff_options *o, > > textconv_two = get_textconv(o->repo->index, p->two); > > } > > > > + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) && > > + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) || > > +(!textconv_two && diff_filespec_is_binary(o->repo, p->two > > + return 0; > > + > > /* > > * If we have an unmodified pair, we know that the count will be the > > * same and don't even have to load the blobs. Unless textconv is in > > Shouldn't this new test come after the existing optimization, which > allows us to leave without loading the blob contents (which is > needed once you call diff_filespec_is_binary())? Yes, good point. > > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > > index 844df760f7..42cc8afd8b 100755 > > --- a/t/t4209-log-pickaxe.sh > > +++ b/t/t4209-log-pickaxe.sh > > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing > > textconv tool)' ' > > rm .gitattributes > > ' > > > > +test_expect_success 'log -G ignores binary files' ' > > + rm -rf .git && > > + git init && > > Please never never ever do the above two unless you are writing a > test that checks low-level repository details. > > If you want a clean history that has specific lineage of commits > without getting affected by commits that have been made by the > previous test pieces, it is OK to "checkout --orphan" to create an > empty history to work with. Thanks for the hint. I thought I had seen a less intrusive way for getting an empty history. Changed. > > + printf "a\0b" >data.bin && > > + git add data.bin && > > + git commit -m "message" && > > + git log -G a >result && > > + test_must_be_empty result > > +' > > + > > +test_expect_success 'log -G looks into binary files with textconv filter' ' > > + rm -rf .git && > > + git init && > > + echo "* diff=bin" > .gitattributes && > > + printf "a\0b" >data.bin && > > + git add data.bin && > > + git commit -m "message" && > > + git -c diff.bin.textconv=cat log -G a >actual && > > + git log >expected && > > + test_cmp actual expected > > +' > > + > > test_done >
Re: [PATCH v1 1/2] log -G: Ignore binary files
Stefan Beller writes: > On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun > wrote: >> >> The -G option of log looks for the differences whose patch text >> contains added/removed lines that match regex. >> >> The concept of differences only makes sense for text files, therefore >> we need to ignore binary files when searching with -G as well. > > What about partial text/partial binary files? Good point. You'd use "-a" (or "--text") to tell the diff machinery to treat the contents as text, and the new logic must pay attention to that command line option.
Re: [PATCH v1 1/2] log -G: Ignore binary files
On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun wrote: > > The -G option of log looks for the differences whose patch text > contains added/removed lines that match regex. > > The concept of differences only makes sense for text files, therefore > we need to ignore binary files when searching with -G as well. What about partial text/partial binary files? I recall using text searching tools (not necessarily git machinery, my memory is fuzzy) to check for strings in pdf files, which are usually marked binary in context of git, such that we do not see their diffs in `log -p`. But I would expect a search with -G or -S to still work... until I find the exception in the docs, only to wonder if there is a switch to turn off this optimisation for this corner case. Stefan
Re: [PATCH v1 1/2] log -G: Ignore binary files
Jeff King writes: >> +if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) && >> +((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) || >> + (!textconv_two && diff_filespec_is_binary(o->repo, p->two >> +return 0; > > If the user passes "-a" to treat binary files as text, we should > probably skip the binary check. I think we'd need to check > "o->flags.text" here. Yeah, I forgot about that option. It would give an escape hatch that has a sane explanation.
Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
Ævar Arnfjörð Bjarmason writes: > On Wed, Nov 21 2018, Thomas Braun wrote: > >> The -S option of log looks for differences that changes the >> number of occurrences of the specified string (i.e. addition/deletion) >> in a file. >> > ... > This should just be part of 1/2 since the behavior is changed there & > the commit message should describe both cases. Sensible suggestion. Thanks.
Re: [PATCH v1 1/2] log -G: Ignore binary files
On Thu, Nov 22, 2018 at 11:16:38AM +0100, Ævar Arnfjörð Bjarmason wrote: > > +test_expect_success 'log -G looks into binary files with textconv filter' ' > > + rm -rf .git && > > + git init && > > + echo "* diff=bin" > .gitattributes && > > + printf "a\0b" >data.bin && > > + git add data.bin && > > + git commit -m "message" && > > + git -c diff.bin.textconv=cat log -G a >actual && > > + git log >expected && > > + test_cmp actual expected > > +' > > + > > test_done > > This patch seems like the wrong direction to me. In particular the > assertion that "the concept of differences only makes sense for text > files". That's just not true. This patch breaks this: But "-G" is defined as "look for differences whose patch text contains added/removed lines that match ". We don't have patch text here, let alone added/removed lines. For binary files, "-Sfoo" is better defined. I think we _could_ define "search for in the added/removed bytes of a binary file". But I don't think that's what the current code does (it really does a line diff on a binary file, which is likely to put tons of unchanged crap into the "added and removed" lines, because the line divisions aren't meaningful in the first place). > ( > rm -rf /tmp/g-test && > git init /tmp/g-test && > cd /tmp/g-test && > for i in {1..10}; do > echo "Always matching thensome 5" >file && > printf "a thensome %d binary \0" $i >>file && > git add file && > git commit -m"Bump $i" > done && > git log -Gthensome.*5 > ) > > Right now this will emit 3/10 patches, and the right ones! I.e. "Bump > [156]". The 1st one because it introduces the "Always matching thensome > 5". Then 5/6 because the add/remove the string "a thensome 5 binary", > respectively. Which matches /thensome.*5/. Right, this will sometimes do the right thing. But it will also often do the wrong thing. It's also very expensive (we specifically avoid feeding large binary files to xdiff, but I think "-G" will happily do so -- I didn't double check, though). -Peff
Re: [PATCH v1 1/2] log -G: Ignore binary files
On Wed, Nov 21, 2018 at 09:52:27PM +0100, Thomas Braun wrote: > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c > index 69fc55ea1e..8c2558b07d 100644 > --- a/diffcore-pickaxe.c > +++ b/diffcore-pickaxe.c > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct > diff_options *o, > textconv_two = get_textconv(o->repo->index, p->two); > } > > + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) && > + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) || > + (!textconv_two && diff_filespec_is_binary(o->repo, p->two > + return 0; If the user passes "-a" to treat binary files as text, we should probably skip the binary check. I think we'd need to check "o->flags.text" here. > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > index 844df760f7..42cc8afd8b 100755 > --- a/t/t4209-log-pickaxe.sh > +++ b/t/t4209-log-pickaxe.sh > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing > textconv tool)' ' > [...] > +test_expect_success 'log -G ignores binary files' ' > [...] > +test_expect_success 'log -G looks into binary files with textconv filter' ' And likewise add a test here similar to the textconv one. -Peff
Re: [PATCH v1 1/2] log -G: Ignore binary files
> On Wed, Nov 21 2018, Thomas Braun wrote: > The -G option of log looks for the differences whose patch text > contains added/removed lines that match regex. > > The concept of differences only makes sense for text files, therefore > we need to ignore binary files when searching with -G as well. > > Signed-off-by: Thomas Braun > --- > Documentation/gitdiffcore.txt | 2 +- > diffcore-pickaxe.c| 5 + > t/t4209-log-pickaxe.sh| 22 ++ > 3 files changed, 28 insertions(+), 1 deletion(-) > > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt > index c0a60f3158..059ddd3431 100644 > --- a/Documentation/gitdiffcore.txt > +++ b/Documentation/gitdiffcore.txt > @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches > the given > regular expression. This means that it will detect in-file (or what > rename-detection considers the same file) moves, which is noise. The > implementation runs diff twice and greps, and this can be quite > -expensive. > +expensive. Binary files without textconv filter are ignored. > > When `-S` or `-G` are used without `--pickaxe-all`, only filepairs > that match their respective criterion are kept in the output. When > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c > index 69fc55ea1e..8c2558b07d 100644 > --- a/diffcore-pickaxe.c > +++ b/diffcore-pickaxe.c > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct > diff_options *o, > textconv_two = get_textconv(o->repo->index, p->two); > } > > + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) && > + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) || > + (!textconv_two && diff_filespec_is_binary(o->repo, p->two > + return 0; > + > /* >* If we have an unmodified pair, we know that the count will be the >* same and don't even have to load the blobs. Unless textconv is in > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > index 844df760f7..42cc8afd8b 100755 > --- a/t/t4209-log-pickaxe.sh > +++ b/t/t4209-log-pickaxe.sh > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing > textconv tool)' ' > rm .gitattributes > ' > > +test_expect_success 'log -G ignores binary files' ' > + rm -rf .git && > + git init && > + printf "a\0b" >data.bin && > + git add data.bin && > + git commit -m "message" && > + git log -G a >result && Would be less confusing as "-Ga" since that's the invocation we document, even though I see (but wasn't aware that...) "-G a" works too. > + test_must_be_empty result > +' > + > +test_expect_success 'log -G looks into binary files with textconv filter' ' > + rm -rf .git && > + git init && > + echo "* diff=bin" > .gitattributes && > + printf "a\0b" >data.bin && > + git add data.bin && > + git commit -m "message" && > + git -c diff.bin.textconv=cat log -G a >actual && > + git log >expected && > + test_cmp actual expected > +' > + > test_done This patch seems like the wrong direction to me. In particular the assertion that "the concept of differences only makes sense for text files". That's just not true. This patch breaks this: ( rm -rf /tmp/g-test && git init /tmp/g-test && cd /tmp/g-test && for i in {1..10}; do echo "Always matching thensome 5" >file && printf "a thensome %d binary \0" $i >>file && git add file && git commit -m"Bump $i" done && git log -Gthensome.*5 ) Right now this will emit 3/10 patches, and the right ones! I.e. "Bump [156]". The 1st one because it introduces the "Always matching thensome 5". Then 5/6 because the add/remove the string "a thensome 5 binary", respectively. Which matches /thensome.*5/. I.e. in the first one we do a regex match against the content here because we don't have both sides: https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53 And then for the later ones where we have both sides we end up in diffgrep_consume(): https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36 I think there may be a real issue here to address, which might be some combination of: a) Even though the diffcore can do a binary diff internally, this is
Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
On Wed, Nov 21 2018, Thomas Braun wrote: > The -S option of log looks for differences that changes the > number of occurrences of the specified string (i.e. addition/deletion) > in a file. > > Add a test to ensure that we keep looking into binary files with -S > as changing that would break backwards compatibility in unexpected ways. > > Signed-off-by: Thomas Braun > --- > t/t4209-log-pickaxe.sh | 11 +++ > 1 file changed, 11 insertions(+) > > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > index 42cc8afd8b..d430f6f2f9 100755 > --- a/t/t4209-log-pickaxe.sh > +++ b/t/t4209-log-pickaxe.sh > @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with > textconv filter' ' > test_cmp actual expected > ' > > +test_expect_success 'log -S looks into binary files' ' > + rm -rf .git && > + git init && > + printf "a\0b" >data.bin && > + git add data.bin && > + git commit -m "message" && > + git log -S a >actual && > + git log >expected && > + test_cmp actual expected > +' > + > test_done This should just be part of 1/2 since the behavior is changed there & the commit message should describe both cases.
Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
Thomas Braun writes: > The -S option of log looks for differences that changes the > number of occurrences of the specified string (i.e. addition/deletion) > in a file. s/-S /-S/ and s/the specified string/the specified block of text/ would make it more in line with how Documentation/gitdiffcore.txt explains it. The original discussion from early 2017 also explains with a pointer why the primary mode of -S is not but is . > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > index 42cc8afd8b..d430f6f2f9 100755 > --- a/t/t4209-log-pickaxe.sh > +++ b/t/t4209-log-pickaxe.sh > @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with > textconv filter' ' > test_cmp actual expected > ' > > +test_expect_success 'log -S looks into binary files' ' > + rm -rf .git && > + git init && Same comment as the one for 1/2 applies here. > + printf "a\0b" >data.bin && > + git add data.bin && > + git commit -m "message" && > + git log -S a >actual && > + git log >expected && > + test_cmp actual expected > +' > + > test_done Other than these, I think both patches look sensible. Thanks for resurrecting the old topic and reigniting it.
Re: [PATCH v1 1/2] log -G: Ignore binary files
Thomas Braun writes: > The -G option of log looks for the differences whose patch text > contains added/removed lines that match regex. > > The concept of differences only makes sense for text files, therefore > we need to ignore binary files when searching with -G as well. > > Signed-off-by: Thomas Braun > --- > Documentation/gitdiffcore.txt | 2 +- > diffcore-pickaxe.c| 5 + > t/t4209-log-pickaxe.sh| 22 ++ > 3 files changed, 28 insertions(+), 1 deletion(-) OK. > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt > index c0a60f3158..059ddd3431 100644 > --- a/Documentation/gitdiffcore.txt > +++ b/Documentation/gitdiffcore.txt > @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches > the given > regular expression. This means that it will detect in-file (or what > rename-detection considers the same file) moves, which is noise. The > implementation runs diff twice and greps, and this can be quite > -expensive. > +expensive. Binary files without textconv filter are ignored. OK. > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c > index 69fc55ea1e..8c2558b07d 100644 > --- a/diffcore-pickaxe.c > +++ b/diffcore-pickaxe.c > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct > diff_options *o, > textconv_two = get_textconv(o->repo->index, p->two); > } > > + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) && > + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) || > + (!textconv_two && diff_filespec_is_binary(o->repo, p->two > + return 0; > + > /* >* If we have an unmodified pair, we know that the count will be the >* same and don't even have to load the blobs. Unless textconv is in Shouldn't this new test come after the existing optimization, which allows us to leave without loading the blob contents (which is needed once you call diff_filespec_is_binary())? > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh > index 844df760f7..42cc8afd8b 100755 > --- a/t/t4209-log-pickaxe.sh > +++ b/t/t4209-log-pickaxe.sh > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing > textconv tool)' ' > rm .gitattributes > ' > > +test_expect_success 'log -G ignores binary files' ' > + rm -rf .git && > + git init && Please never never ever do the above two unless you are writing a test that checks low-level repository details. If you want a clean history that has specific lineage of commits without getting affected by commits that have been made by the previous test pieces, it is OK to "checkout --orphan" to create an empty history to work with. > + printf "a\0b" >data.bin && > + git add data.bin && > + git commit -m "message" && > + git log -G a >result && > + test_must_be_empty result > +' > + > +test_expect_success 'log -G looks into binary files with textconv filter' ' > + rm -rf .git && > + git init && > + echo "* diff=bin" > .gitattributes && > + printf "a\0b" >data.bin && > + git add data.bin && > + git commit -m "message" && > + git -c diff.bin.textconv=cat log -G a >actual && > + git log >expected && > + test_cmp actual expected > +' > + > test_done
[PATCH v1 2/2] log -S: Add test which searches in binary files
The -S option of log looks for differences that changes the number of occurrences of the specified string (i.e. addition/deletion) in a file. Add a test to ensure that we keep looking into binary files with -S as changing that would break backwards compatibility in unexpected ways. Signed-off-by: Thomas Braun --- t/t4209-log-pickaxe.sh | 11 +++ 1 file changed, 11 insertions(+) diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh index 42cc8afd8b..d430f6f2f9 100755 --- a/t/t4209-log-pickaxe.sh +++ b/t/t4209-log-pickaxe.sh @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with textconv filter' ' test_cmp actual expected ' +test_expect_success 'log -S looks into binary files' ' + rm -rf .git && + git init && + printf "a\0b" >data.bin && + git add data.bin && + git commit -m "message" && + git log -S a >actual && + git log >expected && + test_cmp actual expected +' + test_done -- 2.19.0.271.gfe8321ec05.dirty
[PATCH 0/2] Teach log -G to ignore binary files
Based on the previous discussion in [1] I've prepared patches which teach log -G to ignore binary files. log -S keeps its behaviour but got a test to ensure that. Feedback welcome! [1]: https://public-inbox.org/git/7a0992eb-adb9-a7a1-cfaa-3384bc4d3...@virtuell-zuhause.de/ Thomas Braun (2): log -G: Ignore binary files log -S: Add test which searches in binary files Documentation/gitdiffcore.txt | 2 +- diffcore-pickaxe.c| 5 + t/t4209-log-pickaxe.sh| 21 + 3 files changed, 27 insertions(+), 1 deletion(-) -- 2.19.0.271.gfe8321ec05.dirty
[PATCH v1 1/2] log -G: Ignore binary files
The -G option of log looks for the differences whose patch text contains added/removed lines that match regex. The concept of differences only makes sense for text files, therefore we need to ignore binary files when searching with -G as well. Signed-off-by: Thomas Braun --- Documentation/gitdiffcore.txt | 2 +- diffcore-pickaxe.c| 5 + t/t4209-log-pickaxe.sh| 22 ++ 3 files changed, 28 insertions(+), 1 deletion(-) diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt index c0a60f3158..059ddd3431 100644 --- a/Documentation/gitdiffcore.txt +++ b/Documentation/gitdiffcore.txt @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given regular expression. This means that it will detect in-file (or what rename-detection considers the same file) moves, which is noise. The implementation runs diff twice and greps, and this can be quite -expensive. +expensive. Binary files without textconv filter are ignored. When `-S` or `-G` are used without `--pickaxe-all`, only filepairs that match their respective criterion are kept in the output. When diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c index 69fc55ea1e..8c2558b07d 100644 --- a/diffcore-pickaxe.c +++ b/diffcore-pickaxe.c @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o, textconv_two = get_textconv(o->repo->index, p->two); } + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) && + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) || +(!textconv_two && diff_filespec_is_binary(o->repo, p->two + return 0; + /* * If we have an unmodified pair, we know that the count will be the * same and don't even have to load the blobs. Unless textconv is in diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh index 844df760f7..42cc8afd8b 100755 --- a/t/t4209-log-pickaxe.sh +++ b/t/t4209-log-pickaxe.sh @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' ' rm .gitattributes ' +test_expect_success 'log -G ignores binary files' ' + rm -rf .git && + git init && + printf "a\0b" >data.bin && + git add data.bin && + git commit -m "message" && + git log -G a >result && + test_must_be_empty result +' + +test_expect_success 'log -G looks into binary files with textconv filter' ' + rm -rf .git && + git init && + echo "* diff=bin" > .gitattributes && + printf "a\0b" >data.bin && + git add data.bin && + git commit -m "message" && + git -c diff.bin.textconv=cat log -G a >actual && + git log >expected && + test_cmp actual expected +' + test_done -- 2.19.0.271.gfe8321ec05.dirty
[PATCH 0/2] Teach log -G to ignore binary files
Based on the previous discussion in [1] I've prepared patches which teach log -G to ignore binary files. log -S keeps its behaviour but got a test to ensure that. Feedback welcome! [1]: https://public-inbox.org/git/7a0992eb-adb9-a7a1-cfaa-3384bc4d3...@virtuell-zuhause.de/ PS: This is the (possibly missing) cover letter.
RE: [Question] Signature calculation ignoring parts of binary files
On September 13, 2018 1:52 PM, Junio C Hamano wrote: > Junio C Hamano writes: > > > "Randall S. Becker" writes: > > > >> The scenario is slightly different. > >> 1. Person A gives me a new binary file-1 with fingerprint A1. This > >> goes into git unchanged. > >> 2. Person B gives me binary file-2 with fingerprint B2. This does not > >> go into git yet. > >> 3. We attempt a git diff between the committed file-1 and uncommitted > >> file-2 using a textconv implementation that strips what we don't need to > compare. > >> 4. If file-1 and file-2 have no difference when textconv is used, > >> file-2 is not added and not committed. It is discarded with impunity, > >> never to be seen again, although we might whine a lot at the user for > >> attempting to put > >> file-2 in - but that's not git's issue. > > > > You are forgetting that Git is a distributed version control system, > > aren't you? Person A and B can introduce their "moral equivalent but > > bytewise different" copies to their repository under the same object > > name, and you can pull from them--what happens? > > > > It is fundamental that one object name given to Git identifies one > > specific byte sequence contained in an object uniquely. Once you > > broke that, you no longer have Git. > > Having said all that, if you want to keep the original with frills but somehow > give these bytewise different things that reduce to the same essence (e.g. > when passed thru a filter like textconv), I suspect a better approach might be > to store both the "original" and the result of passing the "original" through > the filter in the object database. In the above example, you'll get two > "original" > objects from person A and person B, plus one "canonical" object that are > bytewise different from either of these two originals, but what they reduce > to when you use the filter on them. Then you record the fact that to derive > the "essence" object, you can reduce either person A's or person B's > "original" through the filter, perhaps by using "git notes" attached to the > "essence" object, recording the object names of these originals (the reason > why using notes in this direction is because you can mechanically determine > which "essence" > object any given "original" object reduces to---it is just the matter of passing > it through the filter. But there can be more than one "original" that reduces > to the same "essence"). I like that idea. It turns the reduced object into a contract. Thanks.
Re: [Question] Signature calculation ignoring parts of binary files
Junio C Hamano writes: > "Randall S. Becker" writes: > >> The scenario is slightly different. >> 1. Person A gives me a new binary file-1 with fingerprint A1. This goes into >> git unchanged. >> 2. Person B gives me binary file-2 with fingerprint B2. This does not go >> into git yet. >> 3. We attempt a git diff between the committed file-1 and uncommitted file-2 >> using a textconv implementation that strips what we don't need to compare. >> 4. If file-1 and file-2 have no difference when textconv is used, file-2 is >> not added and not committed. It is discarded with impunity, never to be seen >> again, although we might whine a lot at the user for attempting to put >> file-2 in - but that's not git's issue. > > You are forgetting that Git is a distributed version control system, > aren't you? Person A and B can introduce their "moral equivalent > but bytewise different" copies to their repository under the same > object name, and you can pull from them--what happens? > > It is fundamental that one object name given to Git identifies one > specific byte sequence contained in an object uniquely. Once you > broke that, you no longer have Git. Having said all that, if you want to keep the original with frills but somehow give these bytewise different things that reduce to the same essence (e.g. when passed thru a filter like textconv), I suspect a better approach might be to store both the "original" and the result of passing the "original" through the filter in the object database. In the above example, you'll get two "original" objects from person A and person B, plus one "canonical" object that are bytewise different from either of these two originals, but what they reduce to when you use the filter on them. Then you record the fact that to derive the "essence" object, you can reduce either person A's or person B's "original" through the filter, perhaps by using "git notes" attached to the "essence" object, recording the object names of these originals (the reason why using notes in this direction is because you can mechanically determine which "essence" object any given "original" object reduces to---it is just the matter of passing it through the filter. But there can be more than one "original" that reduces to the same "essence").
RE: [Question] Signature calculation ignoring parts of binary files
On September 13, 2018 11:03 AM, Junio C Hamano wrote: > "Randall S. Becker" writes: > > > The scenario is slightly different. > > 1. Person A gives me a new binary file-1 with fingerprint A1. This > > goes into git unchanged. > > 2. Person B gives me binary file-2 with fingerprint B2. This does not > > go into git yet. > > 3. We attempt a git diff between the committed file-1 and uncommitted > > file-2 using a textconv implementation that strips what we don't need to > compare. > > 4. If file-1 and file-2 have no difference when textconv is used, > > file-2 is not added and not committed. It is discarded with impunity, > > never to be seen again, although we might whine a lot at the user for > > attempting to put > > file-2 in - but that's not git's issue. > > You are forgetting that Git is a distributed version control system, aren't you? > Person A and B can introduce their "moral equivalent but bytewise different" > copies to their repository under the same object name, and you can pull from > them--what happens? > > It is fundamental that one object name given to Git identifies one specific > byte sequence contained in an object uniquely. Once you broke that, you no > longer have Git. At that point I have a morally questionable situation, agreed. However, both are permitted to exist in the underlying tree without conflict in git - which I do consider a legitimately possible situation that will not break the application at all - although there is a semantic conflict in the application (not in git) that requires human decision to resolve. The fact that both objects can exist in git with different fingerprints is a good thing because it provides immutable evidence and ownership of someone bypassing the intent of the application. So, rather than using textconv, I shall implement this rule in the application rather than trying to configure git to do it. If two conflicting objects enter the commit history, the application will have the responsibility to resolve the semantic/legal conflict. Thanks, Randall
Re: [Question] Signature calculation ignoring parts of binary files
"Randall S. Becker" writes: > The scenario is slightly different. > 1. Person A gives me a new binary file-1 with fingerprint A1. This goes into > git unchanged. > 2. Person B gives me binary file-2 with fingerprint B2. This does not go > into git yet. > 3. We attempt a git diff between the committed file-1 and uncommitted file-2 > using a textconv implementation that strips what we don't need to compare. > 4. If file-1 and file-2 have no difference when textconv is used, file-2 is > not added and not committed. It is discarded with impunity, never to be seen > again, although we might whine a lot at the user for attempting to put > file-2 in - but that's not git's issue. You are forgetting that Git is a distributed version control system, aren't you? Person A and B can introduce their "moral equivalent but bytewise different" copies to their repository under the same object name, and you can pull from them--what happens? It is fundamental that one object name given to Git identifies one specific byte sequence contained in an object uniquely. Once you broke that, you no longer have Git.
RE: [Question] Signature calculation ignoring parts of binary files
On September 12, 2018 7:00 PM, Junio C Hamano wrote: > "Randall S. Becker" writes: > > >> author is important to our process. My objective is to keep the > >> original file 100% exact as supplied and then ignore any changes to > >> the metadata that I don't care about (like Creator) if the remainder of the > file is the same. > > That will *not* work. If person A gave you a version of original, which > hashes to X after you strip the cruft you do not care about, you would > register that original with person A's fingerprint on under the name of X. > What happens when person B gives you another version, which is not byte- > for-byte identical to the one you got earlier from person A, but does hash to > the same X after you strip the cruft? If you are going to store it in Git, and if > by SHA-1 you are calling what we perceive as "object name" in Git land, you > must store that one with person B's fingerprint on it also under the name of > X. Now which version will you get from Git when you ask it to give you the > object that hashes to X? The scenario is slightly different. 1. Person A gives me a new binary file-1 with fingerprint A1. This goes into git unchanged. 2. Person B gives me binary file-2 with fingerprint B2. This does not go into git yet. 3. We attempt a git diff between the committed file-1 and uncommitted file-2 using a textconv implementation that strips what we don't need to compare. 4. If file-1 and file-2 have no difference when textconv is used, file-2 is not added and not committed. It is discarded with impunity, never to be seen again, although we might whine a lot at the user for attempting to put file-2 in - but that's not git's issue. 5. If file-1 and file-2 have differences when textconv is used, file-2 is committed with fingerprint B2. 6. Even if an error is made by the user and they commit file-2 with B2 regardless of textconv, there will be a human who complains about it, but git has two unambiguous fingerprints that happen to have no diffs after textconv is applied. My original hope was that textconv could be used to influence the fingerprint, but I do not think that is the case, so I went with an alternative. In the application, I am not allowed to strip any cruft off file-1 when it is stored - it must be byte-for-byte the original file. This application is marginally related to a DRM-like situation where we only care about the original image provided by a user, but any copies that are provided by another user with modified metadata will be disallowed from repository. Does that make more sense? Cheers, Randall
Re: [Question] Signature calculation ignoring parts of binary files
"Randall S. Becker" writes: >> author is important to our process. My objective is to keep the original file >> 100% exact as supplied and then ignore any changes to the metadata that I >> don't care about (like Creator) if the remainder of the file is the same. That will *not* work. If person A gave you a version of original, which hashes to X after you strip the cruft you do not care about, you would register that original with person A's fingerprint on under the name of X. What happens when person B gives you another version, which is not byte-for-byte identical to the one you got earlier from person A, but does hash to the same X after you strip the cruft? If you are going to store it in Git, and if by SHA-1 you are calling what we perceive as "object name" in Git land, you must store that one with person B's fingerprint on it also under the name of X. Now which version will you get from Git when you ask it to give you the object that hashes to X?
RE: [Question] Signature calculation ignoring parts of binary files
On September 12, 2018 4:54 PM, I wrote: > On September 12, 2018 4:48 PM, Johannes Sixt wrote: > > Am 12.09.18 um 21:16 schrieb Randall S. Becker: > > > I feel really bad asking this, and I should know the answer, and yet. > > > > > > I have a binary file that needs to go into a repo intact (unchanged). > > > I also have a program that interprets the contents, like a textconv, > > > that can output the relevant portions of the file in whatever format > > > I like - used for diff typically, dumps in 1K chunks by file section. > > > What I'm looking for is to have the SHA1 signature calculated with > > > just the relevant portions of the file so that two actually > > > different files will be considered the same by git during a commit > > > or status. In real terms, I'm trying to ignore the Creator metadata > > > of a JPG because it is mutable and irrelevant to my repo contents. > > > > > > I'm sorry to ask, but I thought this was in .gitattributes but I > > > can't confirm the SHA1 behaviour. > > > > You are looking for a clean filter. See the 'filter' attribute in > > gitattributes(5). > > Your clean filter program or script should strip the unwanted metadata > > or set it to a constant known-good value. > > > > (You shouldn't need a smudge filter.) > > > > -- Hannes > > Thanks Hannes. I thought about the clean filter, but I don't actually want to > modify the file when going into git, just for SHA calculation. I need to be > able > to keep some origin metadata that might change with subsequent copies, so > just cleaning the origin is not going to work - actually knowing the original > author is important to our process. My objective is to keep the original file > 100% exact as supplied and then ignore any changes to the metadata that I > don't care about (like Creator) if the remainder of the file is the same. I had a thought that might be workable, opinions are welcome on this. The commit of my rather weird project is done by a script so I have flexibility in my approach. What I could do is set up a diff textconv configuration so that the text diff of the two JPG files will show no differences if the immutable fields and the image are the same. I can then trigger a git add and git commit for only those files where git diff reports no differences. That way the actual original file is stored in git with 100% fidelity (no cleaning). It's not as elegant as I'd like, but it does solve what I'm trying to do. Does this sound reasonable and/or is there a better way? Cheers, Randall -- Brief whoami: NonStop developer since approximately 2112884442 UNIX developer since approximately 421664400 -- In my real life, I talk too much.
RE: [Question] Signature calculation ignoring parts of binary files
> -Original Message- > From: git-ow...@vger.kernel.org On Behalf > Of Johannes Sixt > Sent: September 12, 2018 4:48 PM > To: Randall S. Becker > Cc: git@vger.kernel.org > Subject: Re: [Question] Signature calculation ignoring parts of binary files > > Am 12.09.18 um 21:16 schrieb Randall S. Becker: > > I feel really bad asking this, and I should know the answer, and yet. > > > > I have a binary file that needs to go into a repo intact (unchanged). > > I also have a program that interprets the contents, like a textconv, > > that can output the relevant portions of the file in whatever format I > > like - used for diff typically, dumps in 1K chunks by file section. > > What I'm looking for is to have the SHA1 signature calculated with > > just the relevant portions of the file so that two actually different > > files will be considered the same by git during a commit or status. In > > real terms, I'm trying to ignore the Creator metadata of a JPG because > > it is mutable and irrelevant to my repo contents. > > > > I'm sorry to ask, but I thought this was in .gitattributes but I can't > > confirm the SHA1 behaviour. > > You are looking for a clean filter. See the 'filter' attribute in > gitattributes(5). > Your clean filter program or script should strip the unwanted metadata or set > it to a constant known-good value. > > (You shouldn't need a smudge filter.) > > -- Hannes Thanks Hannes. I thought about the clean filter, but I don't actually want to modify the file when going into git, just for SHA calculation. I need to be able to keep some origin metadata that might change with subsequent copies, so just cleaning the origin is not going to work - actually knowing the original author is important to our process. My objective is to keep the original file 100% exact as supplied and then ignore any changes to the metadata that I don't care about (like Creator) if the remainder of the file is the same. Regards, Randall
Re: [Question] Signature calculation ignoring parts of binary files
Am 12.09.18 um 21:16 schrieb Randall S. Becker: I feel really bad asking this, and I should know the answer, and yet. I have a binary file that needs to go into a repo intact (unchanged). I also have a program that interprets the contents, like a textconv, that can output the relevant portions of the file in whatever format I like - used for diff typically, dumps in 1K chunks by file section. What I'm looking for is to have the SHA1 signature calculated with just the relevant portions of the file so that two actually different files will be considered the same by git during a commit or status. In real terms, I'm trying to ignore the Creator metadata of a JPG because it is mutable and irrelevant to my repo contents. I'm sorry to ask, but I thought this was in .gitattributes but I can't confirm the SHA1 behaviour. You are looking for a clean filter. See the 'filter' attribute in gitattributes(5). Your clean filter program or script should strip the unwanted metadata or set it to a constant known-good value. (You shouldn't need a smudge filter.) -- Hannes
[Question] Signature calculation ignoring parts of binary files
I feel really bad asking this, and I should know the answer, and yet. I have a binary file that needs to go into a repo intact (unchanged). I also have a program that interprets the contents, like a textconv, that can output the relevant portions of the file in whatever format I like - used for diff typically, dumps in 1K chunks by file section. What I'm looking for is to have the SHA1 signature calculated with just the relevant portions of the file so that two actually different files will be considered the same by git during a commit or status. In real terms, I'm trying to ignore the Creator metadata of a JPG because it is mutable and irrelevant to my repo contents. I'm sorry to ask, but I thought this was in .gitattributes but I can't confirm the SHA1 behaviour. Sheepishly, Randall -- Brief whoami: NonStop developer since approximately 2112884442 UNIX developer since approximately 421664400 -- In my real life, I talk too much.
Re: How to speedup git clone for big binary files (disable delta compression)
On Thu, Jul 19, 2018 at 12:05:00AM +0200, René Scheibe wrote: > Code: > - > #!/bin/bash > > # setup repository > git init --quiet repo > cd repo > > echo '*.bin binary -delta' > .gitattributes > git add .gitattributes > git commit --quiet -m 'attributes' > > for i in $(seq 10); do > dd if=/dev/urandom of=data.bin bs=1MB count=10 status=none > git add data.bin > git commit --quiet -m "data $i" > done > cd .. > > # create clone repository > time git clone --no-local repo clone This clone won't respect those attributes, because we don't dig into in-repo attributes. There's actually some inconsistency in how Git handles attribute locations. Usually they're just read from the top of the working tree, but in some instances we read them from the tree itself (e.g., git-archive respects some attributes from the tree it's archiving). If you do: echo "*.bin binary -delta" >repo/.git/info/attributes then that does work (we always respect repo-level attributes like that). > # repack original repository > cd repo > time git repack -a -d In this case we're reading the attributes from the working tree, and it does work. In theory the clone case could do so, too, but git-upload-pack, the server side of the clone, avoids looking at the working tree at all. That's something we _could_ address, but it doesn't really fix the general case, since most clones will be from a bare repository anyway. So in summary: 1. Depending on what you're trying to do, the .git/info/attributes trick might be enough for you. 2. I do think it would be nice for more places to respect attributes from in trees. There's a question of which tree, but I think in general reading them from HEAD in a bare repository would do what people want (it's a little funny if you're fetching branch "foo", but HEAD points to "bar", but it's at least consistent with the non-bare case). There's some prior art in the way we treat mailmaps (in a bare repo, we read HEAD:.mailmap). I suspect the patch may not be trivial, as I don't know how ready the attributes code is to handle in-tree lookups (remember that it is not just HEAD:.gitattributes we must care about, but other files sprinkled through the repository, like "HEAD:subdir/.gitattributes". -Peff
How to speedup git clone for big binary files (disable delta compression)
Hi, I was wondering why "git clone" seems to not respect "-delta" in .gitattributes. *Reproduction* I prepared a test repository with: - git v2.17.1 - .gitattributes containing "*.bin binary -delta" - 10 commits with a 10 MB random binary file Code: - #!/bin/bash # setup repository git init --quiet repo cd repo echo '*.bin binary -delta' > .gitattributes git add .gitattributes git commit --quiet -m 'attributes' for i in $(seq 10); do dd if=/dev/urandom of=data.bin bs=1MB count=10 status=none git add data.bin git commit --quiet -m "data $i" done cd .. # create clone repository time git clone --no-local repo clone # repack original repository cd repo time git repack -a -d - Output: - Cloning into 'clone'... remote: Counting objects: 33, done. remote: Compressing objects: 100% (31/31), done. remote: Total 33 (delta 0), reused 0 (delta 0) Receiving objects: 100% (33/33), 95.40 MiB | 19.94 MiB/s, done. real0m25,085s user0m22,749s sys 0m0,948s Counting objects: 33, done. Delta compression using up to 2 threads. Compressing objects: 100% (21/21), done. Writing objects: 100% (33/33), done. Total 33 (delta 0), reused 0 (delta 0) real0m5,652s user0m4,173s sys 0m0,178s - *Observations* _time_ - Cloning: "clone" takes always 25s - Optimizing: "repack" takes 25s with and 5s without delta compression _compressed objects_ - Cloning: "clone" compresses always 31 objects - Optimizing: "repack" compresses 31 objects with and 21 objects without delta compression *Expectations* Both operations ("repack" and "clone") are using "pack-objects". Therefore my expectation is that "clone" should respect "-delta" and be about as fast as "repack". Cheers, René
Re: Add option to git to ignore binary files unless force added
How about a hook to ignore certain files? Then you could ignore based on the contents of the fail instead of just the extension. It’d be very flexible. > On May 18, 2018, at 2:09 PM, Jacob Keller <jacob.kel...@gmail.com> wrote: > > On Fri, May 18, 2018 at 4:31 AM, Anmol Sethi <m...@anmol.io> wrote: >> This definitely works but it would be more clean to just have git ignore the >> binary files from the get go. >> > > Sure it'd be more convenient for you. But there are loads of possible > combinations, and the idea of what constitutes unwanted files is > hugely variable to each user. We don't really want to end up > supporting a million different ways to do this, and the hooks > interface provides a reasonable method for rejecting commits with > unwanted contents. > > > Thanks, > Jake -- Best, Anmol
Re: Add option to git to ignore binary files unless force added
On Fri, May 18, 2018 at 4:31 AM, Anmol Sethi <m...@anmol.io> wrote: > This definitely works but it would be more clean to just have git ignore the > binary files from the get go. > Sure it'd be more convenient for you. But there are loads of possible combinations, and the idea of what constitutes unwanted files is hugely variable to each user. We don't really want to end up supporting a million different ways to do this, and the hooks interface provides a reasonable method for rejecting commits with unwanted contents. Thanks, Jake
RE: Add option to git to ignore binary files unless force added
On May 18, 2018 7:31 AM, Anmol Sethi <m...@anmol.io> > That works but most binaries do not have a file extension. Its just not > standard on linux. > > > On May 17, 2018, at 8:37 AM, Randall S. Becker <rsbec...@nexbridge.com> > wrote: > > > > On May 16, 2018 11:18 PM, Jacob Keller > >> On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote: > >>> I think it’d be great to have an option to have git ignore binary > >>> files. My > >> repositories are always source only, committing a binary is always a > mistake. > >> At the moment, I have to configure the .gitignore to ignore every > >> binary file and that gets tedious. Having git ignore all binary files > >> would be > great. > >>> > >>> This could be achieved via an option in .gitconfig or maybe a > >>> special line in > >> .gitignore. > >>> > >>> I just want to never accidentally commit a binary again. > >> > >> I believe you can do a couple things. There should be a hook which > >> you can modify to validate that there are no binary files on > >> pre-commit[1], or pre- push[2] to verify that you never push commits > with binaries in them. > >> > >> You could also implement the update hook on the server if you control > >> it, to allow it to block pushes which contain binary files. > > > > What about configuring ${HOME}/.config/git/ignore instead (described at > https://git-scm.com/docs/gitignore). Inside, put: > > > > *.o > > *.exe > > *.bin > > *.dat > > Etc I have a similar problem on my platform, with a different solution. My builds involve GCC binaries, NonStop L-series binaries (x86), and a NonStop J-series binaries (itanium). To keep me sane, I have all build targets going to separate directories, like Build/GCC, Build/Lbin, Build/Jbin away from the sources. This allows me to ignore Build/ regardless of extension and also to build different targets without link collisions. This is similar to how Java works (a.k.a. bin/). Much more workable, IMHO, than trying to manage individual binaries name by name or even by extension. I also have a mix of jpg and UTF-16 HTML that would end up in false-positives on a pure binary match and I do want to manage those. What helps me is that I do most of my work in ECLIPSE, so derived resources (objects, generated sources) get auto-ignored by EGit, if you can make your compiler arrange that - but that's an ECLIPSE thing not a file system thing. Cheers, Randall -- Brief whoami: NonStop developer since approximately 2112884442 UNIX developer since approximately 421664400 -- In my real life, I talk too much.
Re: Add option to git to ignore binary files unless force added
This definitely works but it would be more clean to just have git ignore the binary files from the get go. > On May 16, 2018, at 11:18 PM, Jacob Keller <jacob.kel...@gmail.com> wrote: > > On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote: >> I think it’d be great to have an option to have git ignore binary files. My >> repositories are always source only, committing a binary is always a >> mistake. At the moment, I have to configure the .gitignore to ignore every >> binary file and that gets tedious. Having git ignore all binary files would >> be great. >> >> This could be achieved via an option in .gitconfig or maybe a special line >> in .gitignore. >> >> I just want to never accidentally commit a binary again. >> >> -- >> Best, >> Anmol >> > > I believe you can do a couple things. There should be a hook which you > can modify to validate that there are no binary files on > pre-commit[1], or pre-push[2] to verify that you never push commits > with binaries in them. > > You could also implement the update hook on the server if you control > it, to allow it to block pushes which contain binary files. > > Thanks, > Jake > > [1]https://git-scm.com/docs/githooks#_pre_commit > [2]https://git-scm.com/docs/githooks#_pre_push > [3]https://git-scm.com/docs/githooks#update -- Best, Anmol
Re: Add option to git to ignore binary files unless force added
That works but most binaries do not have a file extension. Its just not standard on linux. > On May 17, 2018, at 8:37 AM, Randall S. Becker <rsbec...@nexbridge.com> wrote: > > On May 16, 2018 11:18 PM, Jacob Keller >> On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote: >>> I think it’d be great to have an option to have git ignore binary files. My >> repositories are always source only, committing a binary is always a mistake. >> At the moment, I have to configure the .gitignore to ignore every binary file >> and that gets tedious. Having git ignore all binary files would be great. >>> >>> This could be achieved via an option in .gitconfig or maybe a special line >>> in >> .gitignore. >>> >>> I just want to never accidentally commit a binary again. >> >> I believe you can do a couple things. There should be a hook which you can >> modify to validate that there are no binary files on pre-commit[1], or pre- >> push[2] to verify that you never push commits with binaries in them. >> >> You could also implement the update hook on the server if you control it, to >> allow it to block pushes which contain binary files. > > What about configuring ${HOME}/.config/git/ignore instead (described at > https://git-scm.com/docs/gitignore). Inside, put: > > *.o > *.exe > *.bin > *.dat > Etc > > Cheers, > Randall > > -- Best, Anmol
RE: Add option to git to ignore binary files unless force added
On May 16, 2018 11:18 PM, Jacob Keller > On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote: > > I think it’d be great to have an option to have git ignore binary files. My > repositories are always source only, committing a binary is always a mistake. > At the moment, I have to configure the .gitignore to ignore every binary file > and that gets tedious. Having git ignore all binary files would be great. > > > > This could be achieved via an option in .gitconfig or maybe a special line > > in > .gitignore. > > > > I just want to never accidentally commit a binary again. > > I believe you can do a couple things. There should be a hook which you can > modify to validate that there are no binary files on pre-commit[1], or pre- > push[2] to verify that you never push commits with binaries in them. > > You could also implement the update hook on the server if you control it, to > allow it to block pushes which contain binary files. What about configuring ${HOME}/.config/git/ignore instead (described at https://git-scm.com/docs/gitignore). Inside, put: *.o *.exe *.bin *.dat Etc Cheers, Randall
Re: Add option to git to ignore binary files unless force added
On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote: > I think it’d be great to have an option to have git ignore binary files. My > repositories are always source only, committing a binary is always a mistake. > At the moment, I have to configure the .gitignore to ignore every binary file > and that gets tedious. Having git ignore all binary files would be great. > > This could be achieved via an option in .gitconfig or maybe a special line in > .gitignore. > > I just want to never accidentally commit a binary again. > > -- > Best, > Anmol > I believe you can do a couple things. There should be a hook which you can modify to validate that there are no binary files on pre-commit[1], or pre-push[2] to verify that you never push commits with binaries in them. You could also implement the update hook on the server if you control it, to allow it to block pushes which contain binary files. Thanks, Jake [1]https://git-scm.com/docs/githooks#_pre_commit [2]https://git-scm.com/docs/githooks#_pre_push [3]https://git-scm.com/docs/githooks#update
Add option to git to ignore binary files unless force added
I think it’d be great to have an option to have git ignore binary files. My repositories are always source only, committing a binary is always a mistake. At the moment, I have to configure the .gitignore to ignore every binary file and that gets tedious. Having git ignore all binary files would be great. This could be achieved via an option in .gitconfig or maybe a special line in .gitignore. I just want to never accidentally commit a binary again. -- Best, Anmol
Re: Binary files
On 20/07/2017 22:40, Junio C Hamano wrote: > Igor Djordjevic <igor.d.djordje...@gmail.com> writes: >> On 20/07/2017 09:41, Volodymyr Sendetskyi wrote: >>> It is known, that git handles badly storing binary files in its >>> repositories at all. >>> This is especially about large files: even without any changes to >>> these files, their copies are snapshotted on each commit. So even >>> repositories with a small amount of code can grove very fast in size >>> if they contain some great binary files. Alongside this, the SVN is >>> much better about that, because it make changes to the server version >>> of file only if some changes were done. >> >> You already got some proposals on what you could try for making large >> binary files handling easier, but I just wanted to comment on this >> part of your message, as it doesn`t seem to be correct. > > All correct. Thanks. No problem, thanks for confirmation, being relatively new around it`s appreciated, at least knowing that I got it correct myself :)
Re: Binary files
Igor Djordjevic <igor.d.djordje...@gmail.com> writes: > On 20/07/2017 09:41, Volodymyr Sendetskyi wrote: >> It is known, that git handles badly storing binary files in its >> repositories at all. >> This is especially about large files: even without any changes to >> these files, their copies are snapshotted on each commit. So even >> repositories with a small amount of code can grove very fast in size >> if they contain some great binary files. Alongside this, the SVN is >> much better about that, because it make changes to the server version >> of file only if some changes were done. > > You already got some proposals on what you could try for making large > binary files handling easier, but I just wanted to comment on this > part of your message, as it doesn`t seem to be correct. All correct. Thanks.
Re: Binary files
Hi Volodymyr, On 20/07/2017 09:41, Volodymyr Sendetskyi wrote: > It is known, that git handles badly storing binary files in its > repositories at all. > This is especially about large files: even without any changes to > these files, their copies are snapshotted on each commit. So even > repositories with a small amount of code can grove very fast in size > if they contain some great binary files. Alongside this, the SVN is > much better about that, because it make changes to the server version > of file only if some changes were done. You already got some proposals on what you could try for making large binary files handling easier, but I just wanted to comment on this part of your message, as it doesn`t seem to be correct. Even though each repository file is included in each commit (being a full repository state snapshot), meaning big binary files as well, that`s just from an end-user`s perspective. Actual implementation side is smarter than that - if file hasn`t changed between commits, it won`t get copied/written to Git object database again. Under the hood, many different commits can point to the same (unchanged) file, thus repository size _does not_ grow very fast with each commit if large binary file is without any changes. Usually, the biggest concern with Git and large files[1], in comparison to SVN, for example, is something else - Git model assuming each repository clone holding the complete repository history with all the different file versions included, so you can`t get just some of them, or the last snapshot only, keeping your local repository small in size. If the repository you`re cloning from is a big one, your locally cloned repository will be as well, even if you may not really be interested in the big files at all... but you got some suggestions for handling that already, as pointed out :) Just note that it`s not really Git vs SVN here, but more distributed vs centralized approach in general, as you can`t both have everything and yet skip something at the same time. Different systems may have different workarounds for a specific workflow, though. [1] Besides taking each file version as a full-sized snapshot (at the beginning, at least, until the delta compression packing occurs). Regards, Buga
Re: Binary files
On Thu, Jul 20, 2017 at 12:41 AM, Volodymyr Sendetskyi <volodymy...@devcom.com> wrote: > It is known, that git handles badly storing binary files in its > repositories at all. > This is especially about large files: even without any changes to > these files, their copies are snapshotted on each commit. So even > repositories with a small amount of code can grove very fast in size > if they contain some great binary files. Alongside this, the SVN is > much better about that, because it make changes to the server version > of file only if some changes were done. > > So the question is: why not implementing some feature, that would > somehow handle this problem? There are 'external' solutions such as git LFS and git-annex, mentioned in replies nearby. But note there are also efforts to handle large binary files internally https://public-inbox.org/git/3420d9ae9ef86b78af1abe721891233e3f5865a2.1500508695.git.jonathanta...@google.com/ https://public-inbox.org/git/20170713173459.3559-1-...@jeffhostetler.com/ https://public-inbox.org/git/20170620075523.26961-1-chrisc...@tuxfamily.org/
Re: Binary files
> On 20 Jul 2017, at 09:41, Volodymyr Sendetskyi <volodymy...@devcom.com> wrote: > > It is known, that git handles badly storing binary files in its > repositories at all. > This is especially about large files: even without any changes to > these files, their copies are snapshotted on each commit. So even > repositories with a small amount of code can grove very fast in size > if they contain some great binary files. Alongside this, the SVN is > much better about that, because it make changes to the server version > of file only if some changes were done. > > So the question is: why not implementing some feature, that would > somehow handle this problem? > > Of course, I don't know the internal git structure and the way of > working + some nuances (likely about the snapshots at all and the way > they are done), so handling this may be a great problem. But the > easiest feature for me as an end user will be something like > '.gitbinary', where I can list binary files, that would behave like on > SVN, or even more optimal, if you can implement it. Maybe there will > be a need for separate kinds of repositories, or even servers. But > that would be a great change and a logical way of next git's > evolution. GitLFS [1] might be the workaround you want. There are efforts to bring large file support natively to Git [2]. I tried to explain GitLFS in more detail here: https://www.youtube.com/watch?v=YQzNfb4IwEY - Lars [1] https://git-lfs.github.com/ [2] https://public-inbox.org/git/20170620075523.26961-1-chrisc...@tuxfamily.org/
Re: Binary files
On Thu, Jul 20, 2017 at 10:41:48AM +0300, Volodymyr Sendetskyi wrote: > It is known, that git handles badly storing binary files in its > repositories at all. [...] > So the question is: why not implementing some feature, that would > somehow handle this problem? [...] Have you examined git-lfs and git-annex? (Actually, there are/were more solutions [1] but these two appear to be the most used novadays.) Such solutions allow one to use Git for what it does best and defer handling of big files (or files for which lock-modify-unlock works better than the usual modify-merge) to a specialized solution. 1. http://blog.deveo.com/storing-large-binary-files-in-git-repositories/
Re: Binary files
On Thu, Jul 20, 2017 at 12:41 AM, Volodymyr Sendetskyi <volodymy...@devcom.com> wrote: > It is known, that git handles badly storing binary files in its > repositories at all. > This is especially about large files: even without any changes to > these files, their copies are snapshotted on each commit. So even > repositories with a small amount of code can grove very fast in size > if they contain some great binary files. Alongside this, the SVN is > much better about that, because it make changes to the server version > of file only if some changes were done. > > So the question is: why not implementing some feature, that would > somehow handle this problem? Like Git LFS or git annex? Features have been implemented to better handle large files; they're just not necessarily part of core Git. Have you checked whether one of those solutions might work for your use case? Best regards, Bryan Turner
Re: Binary files
It is known, that git handles badly storing binary files in its repositories at all. This is especially about large files: even without any changes to these files, their copies are snapshotted on each commit. So even repositories with a small amount of code can grove very fast in size if they contain some great binary files. Alongside this, the SVN is much better about that, because it make changes to the server version of file only if some changes were done. So the question is: why not implementing some feature, that would somehow handle this problem? Of course, I don't know the internal git structure and the way of working + some nuances (likely about the snapshots at all and the way they are done), so handling this may be a great problem. But the easiest feature for me as an end user will be something like '.gitbinary', where I can list binary files, that would behave like on SVN, or even more optimal, if you can implement it. Maybe there will be a need for separate kinds of repositories, or even servers. But that would be a great change and a logical way of next git's evolution.
Re: log -S/-G (aka pickaxe) searches binary files by default
Am 03.03.2017 um 17:07 schrieb Junio C Hamano: > Jeff King <p...@peff.net> writes: > >> On Thu, Mar 02, 2017 at 05:36:17PM -0800, Junio C Hamano wrote: >> ... >>>> Is that on purpose? >>> >>> No, it's a mere oversight (as I do not think I never even thought >>> about special casing binary >>> files from day one, it is unlikely that you would find _any_ old >>> version of Git that behaves >>> differently). >> >> The email focuses on "-G", and I think it is wrong to look in binary >> files there, as "grep in diff" does not make sense for a binary file >> that we would refuse to diff. > > Yeah, I agree. > >> But the subject also mentions "-S". I always assumed it was intentional >> to look in binary files there, as it is searching for a pure byte >> sequence. I would not mind an option to disable that, but I think the >> default should remain on. > > As the feature was built to be one of the core ingredients necessary > towards the 'ideal SCM' envisioned in > > > <http://public-inbox.org/git/pine.lnx.4.58.0504150753440.7...@ppc970.osdl.org> > > "-S" is about finding "a block of text". It was merely an oversight > that we didn't add explicit code to ignore binary when we introduced > the concept of "is this text? is it worth finding things in and > diffing binary files?". > > I do agree that it may be too late and/or disruptive to change its > behaviour now, as people may grew expectations different from the > original motivation and design, though. Thanks both for the encouraging answers. I'll try to come up with patches in the next couple of weeks for the following changes: "log -G": disable looking in binaries "log -S": add option to switch looking into binaries, defaults to true Thomas
Re: log -S/-G (aka pickaxe) searches binary files by default
Jeff King <p...@peff.net> writes: > On Thu, Mar 02, 2017 at 05:36:17PM -0800, Junio C Hamano wrote: > ... >> > Is that on purpose? >> >> No, it's a mere oversight (as I do not think I never even thought >> about special casing binary >> files from day one, it is unlikely that you would find _any_ old >> version of Git that behaves >> differently). > > The email focuses on "-G", and I think it is wrong to look in binary > files there, as "grep in diff" does not make sense for a binary file > that we would refuse to diff. Yeah, I agree. > But the subject also mentions "-S". I always assumed it was intentional > to look in binary files there, as it is searching for a pure byte > sequence. I would not mind an option to disable that, but I think the > default should remain on. As the feature was built to be one of the core ingredients necessary towards the 'ideal SCM' envisioned in <http://public-inbox.org/git/pine.lnx.4.58.0504150753440.7...@ppc970.osdl.org> "-S" is about finding "a block of text". It was merely an oversight that we didn't add explicit code to ignore binary when we introduced the concept of "is this text? is it worth finding things in and diffing binary files?". I do agree that it may be too late and/or disruptive to change its behaviour now, as people may grew expectations different from the original motivation and design, though.
Re: log -S/-G (aka pickaxe) searches binary files by default
On Thu, Mar 02, 2017 at 05:36:17PM -0800, Junio C Hamano wrote: > On Thu, Mar 2, 2017 at 4:52 PM, Thomas Braun > <thomas.br...@virtuell-zuhause.de> wrote: > > > > I happen to have quite large binary files in my repos. > > > > Today I realized that a line like > > git log -G a > > searches also files found to be binary (or explicitly marked as binary). > > > > Is that on purpose? > > No, it's a mere oversight (as I do not think I never even thought > about special casing binary > files from day one, it is unlikely that you would find _any_ old > version of Git that behaves > differently). The email focuses on "-G", and I think it is wrong to look in binary files there, as "grep in diff" does not make sense for a binary file that we would refuse to diff. But the subject also mentions "-S". I always assumed it was intentional to look in binary files there, as it is searching for a pure byte sequence. I would not mind an option to disable that, but I think the default should remain on. -Peff
Re: log -S/-G (aka pickaxe) searches binary files by default
On Thu, Mar 2, 2017 at 4:52 PM, Thomas Braun <thomas.br...@virtuell-zuhause.de> wrote: > > I happen to have quite large binary files in my repos. > > Today I realized that a line like > git log -G a > searches also files found to be binary (or explicitly marked as binary). > > Is that on purpose? No, it's a mere oversight (as I do not think I never even thought about special casing binary files from day one, it is unlikely that you would find _any_ old version of Git that behaves differently).
log -S/-G (aka pickaxe) searches binary files by default
Hi, I happen to have quite large binary files in my repos. Today I realized that a line like git log -G a searches also files found to be binary (or explicitly marked as binary). Is that on purpose? The documentation of "-G" states "Look for differences whose patch text contains added/removed lines that match ." which contradicts the current behaviour. At least for me text != binary. To reproduce: $ git init $ echo -e "a\0b" > data.bin $ git add data.bin $ git commit -m "Add new data" $ git log -p [...] diff --git a/data.bin b/data.bin new file mode 100644 index 000..1a23e4b Binary files /dev/null and b/data.bin differ $ git log -G a [...] Add new data I've verified the behaviour with git version 2.12.0.windows.1 and git version 2.12.0.189.g3bc5322 on debian. If it is on purpose is there a config option to disable that? Thanks for reading, Thomas
Re: Bug with disabling compression and 'binary' files.
Douglas Coxwrites: >> This may or may not be related to the symptom >> you are observing (if it is, then you would see a packfile created >> in objects/pack/, not in loose objects in object/??/ directories). > > No, the file is loose (it's in .git/objects/eb in this case). This is > seen immediately after the add, though I believe it's the same way > when doing a commit on a changed file. Then I do not have a guess as to where the symptom you are seeing is coming from.
Re: Bug with disabling compression and 'binary' files.
> This may or may not be related to the symptom > you are observing (if it is, then you would see a packfile created > in objects/pack/, not in loose objects in object/??/ directories). No, the file is loose (it's in .git/objects/eb in this case). This is seen immediately after the add, though I believe it's the same way when doing a commit on a changed file. On Tue, Nov 15, 2016 at 8:42 PM, Junio C Hamano <gits...@pobox.com> wrote: > Douglas Cox <zif...@gmail.com> writes: > >> I narrowed this down to the '-text' attribute that is set when >> specifying 'binary'. For some reason this flag is cancelling out the >> core.compression = 0 setting and I think this is a bug? >> >> Unfortunately core.compression = 0 is also global. Ideally it would be >> great if there was a separate 'compression' attribute that could be >> specified in .gitattributes per wildcard similar to how -delta can be >> used. This way we would still be able to get compression for >> text/source files, while still getting the speed of skipping >> compression for binary files that do not compress well. >> >> Has there been any discussion on having an attribute similar to this? > > Nope. > > I do not offhand think of a way for '-text' attribute (or any > attribute for what matter) to interfere with compression level, but > while reading the various parts of the system that futz with the > compression level configuration, I noticed one thing. When we do an > initial "bulk-checkin" optimization that sends all objects to a > single packfile upon "git add", the packfile creation uses its own > compression level that is not affected by any configuration or > command line option. This may or may not be related to the symptom > you are observing (if it is, then you would see a packfile created > in objects/pack/, not in loose objects in object/??/ directories). > > >
Re: Bug with disabling compression and 'binary' files.
Douglas Cox <zif...@gmail.com> writes: > I narrowed this down to the '-text' attribute that is set when > specifying 'binary'. For some reason this flag is cancelling out the > core.compression = 0 setting and I think this is a bug? > > Unfortunately core.compression = 0 is also global. Ideally it would be > great if there was a separate 'compression' attribute that could be > specified in .gitattributes per wildcard similar to how -delta can be > used. This way we would still be able to get compression for > text/source files, while still getting the speed of skipping > compression for binary files that do not compress well. > > Has there been any discussion on having an attribute similar to this? Nope. I do not offhand think of a way for '-text' attribute (or any attribute for what matter) to interfere with compression level, but while reading the various parts of the system that futz with the compression level configuration, I noticed one thing. When we do an initial "bulk-checkin" optimization that sends all objects to a single packfile upon "git add", the packfile creation uses its own compression level that is not affected by any configuration or command line option. This may or may not be related to the symptom you are observing (if it is, then you would see a packfile created in objects/pack/, not in loose objects in object/??/ directories).
Bug with disabling compression and 'binary' files.
I was doing some experiments today with large-ish (100-200MB) binary files and was trying to determine the best configuration for Git. Here are the steps and timings I saw: git init Test cp .../largemovie.mp4 . time git add largemovie.mp4 This took 6.5s for a 200MB file. This file compressed a little bit, but not enough to matter, so I wanted to see how much faster the add would be with compression disabled. So I ran: git config core.compression = 0 I then completely reran the test above starting with a clean repository. This time the add took only 2.08s. I repeated these two tests about 10 times using the same file each time to verify it wasn't related to disk caching, etc. At this point I decided that this was likely a good setting for this repository, so I also created a .gitattributes file and added a few entries often seen in suggestions for dealing with binary files: *.mp4 binary -delta The goal here was to disable any delta compression done during a gc/pack and use the other settings 'binary' applies. Unfortunately when I ran the test again (still using compression = 0), the test was back to taking 6.5s and the file inside the .git/objects/ folder was compressed again. I narrowed this down to the '-text' attribute that is set when specifying 'binary'. For some reason this flag is cancelling out the core.compression = 0 setting and I think this is a bug? Unfortunately core.compression = 0 is also global. Ideally it would be great if there was a separate 'compression' attribute that could be specified in .gitattributes per wildcard similar to how -delta can be used. This way we would still be able to get compression for text/source files, while still getting the speed of skipping compression for binary files that do not compress well. Has there been any discussion on having an attribute similar to this? Thanks, -Doug
Re: How to have EOL=LF and keep binary files auto-detection?
Thank you for the information. We create products (IDEs for different programming languages) for the three major platforms (Window/Linux/Mac OSX). Our products are written in Java and we want to minimize differences between distributions for different platforms, so we use LF separators for all resource files. On Mon, Apr 25, 2016 at 7:46 PM, Torsten Bögershausen <tbo...@web.de> wrote: > On 25.04.16 16:11, Kirill Likhodedov wrote: >> Hi, >> >> I wonder if it is possible both to have LFs in all and only text files in >> working trees, and keep Git’s binary files auto-detection? >> >> To be more precise: >> * we want all text files to be checked out in LF; >> * we don’t want force people to set “core.autocrlf” to false, preferring to >> keep this configuration in .gitattributes; >> * we obviously don’t want binary files to be touched by eol-normalization; >> * we also don’t want to declare all possible patterns of binary files - Git >> is good enough in detecting them automatically. >> >> However, I’ve found no way to do so. >> >> If I declare `* eol=lf` in .gitattributes, it makes Git treat all files as >> text and thus convert CRLF to LF even in binary files. It is consistent with >> man, but a bit surprising to have e.g. a zip or png file modified in this >> way. >> >> One could expect `* text=auto eol=lf` to work the way we want, but >> unfortunately it doesn’t work either: “eol=lf” forces “text” on all files. >> >> Thanks a lot for your help! >> -- Kirill. > > The short answer: Git doesn't currently do that. > The closest you can get, is to use > echo "* text=auto" >.gitattributes > and > git config core.eol lf > git config core.autocrlf false. > > The longer answer is, that I am working on a patch to allow just > the combination of "* text=auto eol=lf" to work as you want it. > > Which platform do you use ? > And (out of curiosity, why do you want text files with LF ?) > > If you are willing to compile and install Git yourself, > you can use the branch here: > https://github.com/tboegi/git/commits/160421_0706_reliable_t0027_allow_TC_combined_ident_CRLF_v7 > > Feedback is welcome, if it works as expected. > > > > -- Nikolay Chashnikov JetBrains http://www.jetbrains.com The Drive to Develop -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to have EOL=LF and keep binary files auto-detection?
On 25.04.16 16:11, Kirill Likhodedov wrote: > Hi, > > I wonder if it is possible both to have LFs in all and only text files in > working trees, and keep Git’s binary files auto-detection? > > To be more precise: > * we want all text files to be checked out in LF; > * we don’t want force people to set “core.autocrlf” to false, preferring to > keep this configuration in .gitattributes; > * we obviously don’t want binary files to be touched by eol-normalization; > * we also don’t want to declare all possible patterns of binary files - Git > is good enough in detecting them automatically. > > However, I’ve found no way to do so. > > If I declare `* eol=lf` in .gitattributes, it makes Git treat all files as > text and thus convert CRLF to LF even in binary files. It is consistent with > man, but a bit surprising to have e.g. a zip or png file modified in this way. > > One could expect `* text=auto eol=lf` to work the way we want, but > unfortunately it doesn’t work either: “eol=lf” forces “text” on all files. > > Thanks a lot for your help! > -- Kirill. The short answer: Git doesn't currently do that. The closest you can get, is to use echo "* text=auto" >.gitattributes and git config core.eol lf git config core.autocrlf false. The longer answer is, that I am working on a patch to allow just the combination of "* text=auto eol=lf" to work as you want it. Which platform do you use ? And (out of curiosity, why do you want text files with LF ?) If you are willing to compile and install Git yourself, you can use the branch here: https://github.com/tboegi/git/commits/160421_0706_reliable_t0027_allow_TC_combined_ident_CRLF_v7 Feedback is welcome, if it works as expected. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
How to have EOL=LF and keep binary files auto-detection?
Hi, I wonder if it is possible both to have LFs in all and only text files in working trees, and keep Git’s binary files auto-detection? To be more precise: * we want all text files to be checked out in LF; * we don’t want force people to set “core.autocrlf” to false, preferring to keep this configuration in .gitattributes; * we obviously don’t want binary files to be touched by eol-normalization; * we also don’t want to declare all possible patterns of binary files - Git is good enough in detecting them automatically. However, I’ve found no way to do so. If I declare `* eol=lf` in .gitattributes, it makes Git treat all files as text and thus convert CRLF to LF even in binary files. It is consistent with man, but a bit surprising to have e.g. a zip or png file modified in this way. One could expect `* text=auto eol=lf` to work the way we want, but unfortunately it doesn’t work either: “eol=lf” forces “text” on all files. Thanks a lot for your help! -- Kirill. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [git-for-windows] How is detected binary files?
On Fri, Nov 27, 2015 at 03:14:58PM +0100, Johannes Schindelin wrote: > On Wed, 25 Nov 2015, Andrzej Borucki wrote: > > > How git detects that file is binary? This must be safe because it not > > allowed to change line breaks in binary files. > > Binary files can contain byte 0 (zero), but: > > - 16 bit UTF also can contain zero > > - short binary files can not contain zero > > It would probably be better to direct this question to the general Git > mailing list (you reached the Git for Windows one, and this issue is not > specific to Windows). > > To answer your question, a NUL byte within the first 8000 bytes is indeed > considered as an indicator for binary files. > > If you use UTF-16, you will need to mark your files as such explicitly > (Git does not handle UTF-16 internally). I'm not sure if it is a good idea to treat UTF-16 as text. The rest of the diff (headers, etc) will all be in ASCII, so one or the other is going to be mojibake. You can get readable diffs by textconv-ing them to an ASCII-superset encoding like UTF-8. Something like: echo 'myfile diff=utf16' >.gitattributes git config diff.utf16.textconv 'iconv -f utf16 -t utf8' but of course the resulting patches cannot be applied, and you may miss any changes that do not make it through the encoding (e.g., using different bytes to represent the same code point). -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [git-for-windows] How is detected binary files?
Hi Andrzej, On Wed, 25 Nov 2015, Andrzej Borucki wrote: > How git detects that file is binary? This must be safe because it not > allowed to change line breaks in binary files. > Binary files can contain byte 0 (zero), but: > - 16 bit UTF also can contain zero > - short binary files can not contain zero It would probably be better to direct this question to the general Git mailing list (you reached the Git for Windows one, and this issue is not specific to Windows). To answer your question, a NUL byte within the first 8000 bytes is indeed considered as an indicator for binary files. If you use UTF-16, you will need to mark your files as such explicitly (Git does not handle UTF-16 internally). As to short binary files, you will also have to mark those explicitly. Ciao, Johannes -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] t9300: use cmp instead of test_cmp to compare binary files
test_cmp is intended to produce diff output for human consumption. The input in one instance in t9300-fast-import.sh are binary files, however. Use cmp to compare the files. This was noticed because on Windows we have a special implementation of test_cmp in pure bash code (to ignore differences due to intermittent CR in actual output), and bash runs into an infinite loop due to the binary nature of the input. Signed-off-by: Johannes Sixt j...@kdbg.org --- t/t9300-fast-import.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 99f5161..4b13170 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -2687,7 +2687,7 @@ test_expect_success 'R: verify created pack' ' test_expect_success \ 'R: verify written objects' \ 'git --git-dir=R/.git cat-file blob big-file:big1 actual -test_cmp expect actual +cmp expect actual a=$(git --git-dir=R/.git rev-parse big-file:big1) b=$(git --git-dir=R/.git rev-parse big-file:big2) test $a = $b' -- 2.0.0.12.gbcf935e -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t9300: use cmp instead of test_cmp to compare binary files
Johannes Sixt j...@kdbg.org writes: test_cmp is intended to produce diff output for human consumption. The input in one instance in t9300-fast-import.sh are binary files, however. Use cmp to compare the files. Thanks. This was noticed because on Windows we have a special implementation of test_cmp in pure bash code (to ignore differences due to intermittent CR in actual output), and bash runs into an infinite loop due to the binary nature of the input. Signed-off-by: Johannes Sixt j...@kdbg.org --- t/t9300-fast-import.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 99f5161..4b13170 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -2687,7 +2687,7 @@ test_expect_success 'R: verify created pack' ' test_expect_success \ 'R: verify written objects' \ 'git --git-dir=R/.git cat-file blob big-file:big1 actual - test_cmp expect actual + cmp expect actual a=$(git --git-dir=R/.git rev-parse big-file:big1) b=$(git --git-dir=R/.git rev-parse big-file:big2) test $a = $b' -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t9300: use cmp instead of test_cmp to compare binary files
Am 12.09.2014 um 19:58 schrieb Junio C Hamano: Johannes Sixt j...@kdbg.org writes: test_cmp is intended to produce diff output for human consumption. The input in one instance in t9300-fast-import.sh are binary files, however. Use cmp to compare the files. Thanks. This was noticed because on Windows we have a special implementation of test_cmp in pure bash code (to ignore differences due to intermittent CR in actual output), and bash runs into an infinite loop due to the binary nature of the input. Signed-off-by: Johannes Sixt j...@kdbg.org --- t/t9300-fast-import.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 99f5161..4b13170 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -2687,7 +2687,7 @@ test_expect_success 'R: verify created pack' ' test_expect_success \ 'R: verify written objects' \ 'git --git-dir=R/.git cat-file blob big-file:big1 actual - test_cmp expect actual + cmp expect actual a=$(git --git-dir=R/.git rev-parse big-file:big1) b=$(git --git-dir=R/.git rev-parse big-file:big2) test $a = $b' May I suggest to use test_cmp_bin instead of plain cmp? test_cmp_bin was introduced in b93e6e36 (t5000, t5003: do not use test_cmp to compare binary files, 2014-06-04) and by default is plain cmp. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] t9300: use test_cmp_bin instead of test_cmp to compare binary files
test_cmp is intended to produce diff output for human consumption. The input in one instance in t9300-fast-import.sh are binary files, however. Use test_cmp_bin to compare the files. This was noticed because on Windows we have a special implementation of test_cmp in pure bash code (to ignore differences due to intermittent CR in actual output), and bash runs into an infinite loop due to the binary nature of the input. Signed-off-by: Johannes Sixt j...@kdbg.org --- Am 12.09.2014 um 20:14 schrieb Thomas Braun: May I suggest to use test_cmp_bin instead of plain cmp? Of course! I did remember that there was talk about it, but missed that we actually implemented it. Sorry. t/t9300-fast-import.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 99f5161..72845f6 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -2687,7 +2687,7 @@ test_expect_success 'R: verify created pack' ' test_expect_success \ 'R: verify written objects' \ 'git --git-dir=R/.git cat-file blob big-file:big1 actual -test_cmp expect actual +test_cmp_bin expect actual a=$(git --git-dir=R/.git rev-parse big-file:big1) b=$(git --git-dir=R/.git rev-parse big-file:big2) test $a = $b' -- 2.0.0.12.gbcf935e -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
article: Using a rolling hash to break up binary files
I thought it worth bring to the list's attention a recent article on CodeProject that may be of interest to those looking at splitting binary files into deterministic hunks. http://www.codeproject.com/Articles/801608/Using-a-rolling-hash-to-break-up-binary-files It's based on Rabin and Karp's algorithm http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm. -- Philip -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: article: Using a rolling hash to break up binary files
On Thu, Jul 31, 2014 at 3:31 PM, Philip Oakley philipoak...@iee.org wrote: I thought it worth bring to the list's attention a recent article on CodeProject that may be of interest to those looking at splitting binary files into deterministic hunks. http://www.codeproject.com/Articles/801608/Using-a-rolling-hash-to-break-up-binary-files It's based on Rabin and Karp's algorithm http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm. If I remember right, this is how bup[1] works. Its certainly what we do for delta compressing files. [1] https://github.com/bup/bup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] t5000, t5003: do not use test_cmp to compare binary files
test_cmp() is primarily meant to compare text files (and display the difference for debug purposes). Raw cmp is better suited to compare binary files (tar, zip, etc.). On MinGW, test_cmp is a shell function mingw_test_cmp that tries to read both files into environment, stripping CR characters (introduced in commit 4d715ac0). This function usually speeds things up, as fork is extremly slow on Windows. But no wonder that this function is extremely slow and sometimes even crashes when comparing large tar or zip files. Signed-off-by: Stepan Kasal ka...@ucw.cz --- t/t5000-tar-tree.sh | 34 +- t/t5001-archive-attr.sh | 2 +- t/t5003-archive-zip.sh | 6 +++--- t/t5004-archive-corner-cases.sh | 2 +- 4 files changed, 22 insertions(+), 22 deletions(-) diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh index 1cf0a4e..31b1fd1 100755 --- a/t/t5000-tar-tree.sh +++ b/t/t5000-tar-tree.sh @@ -164,7 +164,7 @@ check_tar with_olde-prefix olde- test_expect_success 'git archive on large files' ' test_config core.bigfilethreshold 1 git archive HEAD b3.tar -test_cmp b.tar b3.tar +cmp b.tar b3.tar ' test_expect_success \ @@ -173,15 +173,15 @@ test_expect_success \ test_expect_success \ 'git archive vs. the same in a bare repo' \ -'test_cmp b.tar b3.tar' +'cmp b.tar b3.tar' test_expect_success 'git archive with --output' \ 'git archive --output=b4.tar HEAD -test_cmp b.tar b4.tar' +cmp b.tar b4.tar' test_expect_success 'git archive --remote' \ 'git archive --remote=. HEAD b5.tar -test_cmp b.tar b5.tar' +cmp b.tar b5.tar' test_expect_success \ 'validate file modification time' \ @@ -198,7 +198,7 @@ test_expect_success \ test_expect_success 'git archive with --output, override inferred format' ' git archive --format=tar --output=d4.zip HEAD - test_cmp b.tar d4.zip + cmp b.tar d4.zip ' test_expect_success \ @@ -244,34 +244,34 @@ test_expect_success 'archive --list shows only enabled remote filters' ' test_expect_success 'invoke tar filter by format' ' git archive --format=tar.foo HEAD config.tar.foo tr ab ba config.tar.foo config.tar - test_cmp b.tar config.tar + cmp b.tar config.tar git archive --format=bar HEAD config.bar tr ab ba config.bar config.tar - test_cmp b.tar config.tar + cmp b.tar config.tar ' test_expect_success 'invoke tar filter by extension' ' git archive -o config-implicit.tar.foo HEAD - test_cmp config.tar.foo config-implicit.tar.foo + cmp config.tar.foo config-implicit.tar.foo git archive -o config-implicit.bar HEAD - test_cmp config.tar.foo config-implicit.bar + cmp config.tar.foo config-implicit.bar ' test_expect_success 'default output format remains tar' ' git archive -o config-implicit.baz HEAD - test_cmp b.tar config-implicit.baz + cmp b.tar config-implicit.baz ' test_expect_success 'extension matching requires dot' ' git archive -o config-implicittar.foo HEAD - test_cmp b.tar config-implicittar.foo + cmp b.tar config-implicittar.foo ' test_expect_success 'only enabled filters are available remotely' ' test_must_fail git archive --remote=. --format=tar.foo HEAD \ remote.tar.foo git archive --remote=. --format=bar remote.bar HEAD - test_cmp remote.bar config.bar + cmp remote.bar config.bar ' test_expect_success GZIP 'git archive --format=tgz' ' @@ -280,27 +280,27 @@ test_expect_success GZIP 'git archive --format=tgz' ' test_expect_success GZIP 'git archive --format=tar.gz' ' git archive --format=tar.gz HEAD j1.tar.gz - test_cmp j.tgz j1.tar.gz + cmp j.tgz j1.tar.gz ' test_expect_success GZIP 'infer tgz from .tgz filename' ' git archive --output=j2.tgz HEAD - test_cmp j.tgz j2.tgz + cmp j.tgz j2.tgz ' test_expect_success GZIP 'infer tgz from .tar.gz filename' ' git archive --output=j3.tar.gz HEAD - test_cmp j.tgz j3.tar.gz + cmp j.tgz j3.tar.gz ' test_expect_success GZIP 'extract tgz file' ' gzip -d -c j.tgz j.tar - test_cmp b.tar j.tar + cmp b.tar j.tar ' test_expect_success GZIP 'remote tar.gz is allowed by default' ' git archive --remote=. --format=tar.gz HEAD remote.tar.gz - test_cmp j.tgz remote.tar.gz + cmp j.tgz remote.tar.gz ' test_expect_success GZIP 'remote tar.gz can be disabled' ' diff --git a/t/t5001-archive-attr.sh b/t/t5001-archive-attr.sh index 51dedab..dfc35b3 100755 --- a/t/t5001-archive-attr.sh +++ b/t/t5001-archive-attr.sh @@ -68,7 +68,7 @@ test_expect_missing worktree2/ignored-by-worktree test_expect_success 'git archive vs. bare' ' (cd bare git archive HEAD) bare-archive.tar - test_cmp archive.tar bare-archive.tar + cmp archive.tar bare
Re: [PATCH] t5000, t5003: do not use test_cmp to compare binary files
Am 04.06.2014 13:42, schrieb Stepan Kasal: test_cmp() is primarily meant to compare text files (and display the difference for debug purposes). Raw cmp is better suited to compare binary files (tar, zip, etc.). On MinGW, test_cmp is a shell function mingw_test_cmp that tries to read both files into environment, stripping CR characters (introduced in commit 4d715ac0). This function usually speeds things up, as fork is extremly slow on Windows. But no wonder that this function is extremely slow and sometimes even crashes when comparing large tar or zip files. Signed-off-by: Stepan Kasal ka...@ucw.cz --- t/t5000-tar-tree.sh | 34 +- t/t5001-archive-attr.sh | 2 +- t/t5003-archive-zip.sh | 6 +++--- t/t5004-archive-corner-cases.sh | 2 +- 4 files changed, 22 insertions(+), 22 deletions(-) diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh index 1cf0a4e..31b1fd1 100755 --- a/t/t5000-tar-tree.sh +++ b/t/t5000-tar-tree.sh @@ -164,7 +164,7 @@ check_tar with_olde-prefix olde- test_expect_success 'git archive on large files' ' test_config core.bigfilethreshold 1 git archive HEAD b3.tar -test_cmp b.tar b3.tar +cmp b.tar b3.tar ' Wouldn't a function like test_cmp_bin() be better suited for all? The windows folks can then use cmp inside test_cmp_bin() and all others just use test_cmp. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t5000, t5003: do not use test_cmp to compare binary files
Hello Thomas, On Wed, Jun 04, 2014 at 02:13:44PM +0200, Thomas Braun wrote: Wouldn't a function like test_cmp_bin() be better suited for all? I also considered it. The advantage is that is shows that this intentionally differs from test_cmp. The windows folks can then use cmp inside test_cmp_bin() and all others ... would use cmp as well because it is better suited for the task than diff -u. So test_cmp_bin would be just an alias for cmp, on all platforms. Doesn't that sound weird? Stepan -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [msysGit] Re: [PATCH] t5000, t5003: do not use test_cmp to compare binary files
Hi Stephan, Am 04.06.2014 14:42, schrieb Stepan Kasal: On Wed, Jun 04, 2014 at 02:13:44PM +0200, Thomas Braun wrote: Wouldn't a function like test_cmp_bin() be better suited for all? I also considered it. The advantage is that is shows that this intentionally differs from test_cmp. The windows folks can then use cmp inside test_cmp_bin() and all others ... would use cmp as well because it is better suited for the task than diff -u. So test_cmp_bin would be just an alias for cmp, on all platforms. Doesn't that sound weird? I actually like the idea that the test assertions follow a common naming scheme and can easily be overriden by $arbitrary-crazy-platform. Using test_cmp_bin instead of cmp would result in then four assertions for comparing arbitrary data test_cmp test_18ncmp test_cmp_text test_cmp_bin where I think the purpose of each function is clear from its name. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] t5000, t5003: do not use test_cmp to compare binary files
test_cmp() is primarily meant to compare text files (and display the difference for debug purposes). Raw cmp is better suited to compare binary files (tar, zip, etc.). On MinGW, test_cmp is a shell function mingw_test_cmp that tries to read both files into environment, stripping CR characters (introduced in commit 4d715ac0). This function usually speeds things up, as fork is extremly slow on Windows. But no wonder that this function is extremely slow and sometimes even crashes when comparing large tar or zip files. Signed-off-by: Stepan Kasal ka...@ucw.cz --- Hi Thomas, On Wed, Jun 04, 2014 at 02:59:44PM +0200, Thomas Braun wrote: Using test_cmp_bin instead of cmp would result in then four assertions for comparing arbitrary data test_cmp test_i18ncmp test_cmp_text test_cmp_bin where I think the purpose of each function is clear from its name. [test_cmp_text does not exist (yet)] OK, I agree, hence this modified version of the patch. Stepan t/t5000-tar-tree.sh | 34 +- t/t5001-archive-attr.sh | 2 +- t/t5003-archive-zip.sh | 6 +++--- t/t5004-archive-corner-cases.sh | 2 +- t/test-lib-functions.sh | 6 ++ 5 files changed, 28 insertions(+), 22 deletions(-) diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh index 1cf0a4e..4efaf8c 100755 --- a/t/t5000-tar-tree.sh +++ b/t/t5000-tar-tree.sh @@ -164,7 +164,7 @@ check_tar with_olde-prefix olde- test_expect_success 'git archive on large files' ' test_config core.bigfilethreshold 1 git archive HEAD b3.tar -test_cmp b.tar b3.tar +test_cmp_bin b.tar b3.tar ' test_expect_success \ @@ -173,15 +173,15 @@ test_expect_success \ test_expect_success \ 'git archive vs. the same in a bare repo' \ -'test_cmp b.tar b3.tar' +'test_cmp_bin b.tar b3.tar' test_expect_success 'git archive with --output' \ 'git archive --output=b4.tar HEAD -test_cmp b.tar b4.tar' +test_cmp_bin b.tar b4.tar' test_expect_success 'git archive --remote' \ 'git archive --remote=. HEAD b5.tar -test_cmp b.tar b5.tar' +test_cmp_bin b.tar b5.tar' test_expect_success \ 'validate file modification time' \ @@ -198,7 +198,7 @@ test_expect_success \ test_expect_success 'git archive with --output, override inferred format' ' git archive --format=tar --output=d4.zip HEAD - test_cmp b.tar d4.zip + test_cmp_bin b.tar d4.zip ' test_expect_success \ @@ -244,34 +244,34 @@ test_expect_success 'archive --list shows only enabled remote filters' ' test_expect_success 'invoke tar filter by format' ' git archive --format=tar.foo HEAD config.tar.foo tr ab ba config.tar.foo config.tar - test_cmp b.tar config.tar + test_cmp_bin b.tar config.tar git archive --format=bar HEAD config.bar tr ab ba config.bar config.tar - test_cmp b.tar config.tar + test_cmp_bin b.tar config.tar ' test_expect_success 'invoke tar filter by extension' ' git archive -o config-implicit.tar.foo HEAD - test_cmp config.tar.foo config-implicit.tar.foo + test_cmp_bin config.tar.foo config-implicit.tar.foo git archive -o config-implicit.bar HEAD - test_cmp config.tar.foo config-implicit.bar + test_cmp_bin config.tar.foo config-implicit.bar ' test_expect_success 'default output format remains tar' ' git archive -o config-implicit.baz HEAD - test_cmp b.tar config-implicit.baz + test_cmp_bin b.tar config-implicit.baz ' test_expect_success 'extension matching requires dot' ' git archive -o config-implicittar.foo HEAD - test_cmp b.tar config-implicittar.foo + test_cmp_bin b.tar config-implicittar.foo ' test_expect_success 'only enabled filters are available remotely' ' test_must_fail git archive --remote=. --format=tar.foo HEAD \ remote.tar.foo git archive --remote=. --format=bar remote.bar HEAD - test_cmp remote.bar config.bar + test_cmp_bin remote.bar config.bar ' test_expect_success GZIP 'git archive --format=tgz' ' @@ -280,27 +280,27 @@ test_expect_success GZIP 'git archive --format=tgz' ' test_expect_success GZIP 'git archive --format=tar.gz' ' git archive --format=tar.gz HEAD j1.tar.gz - test_cmp j.tgz j1.tar.gz + test_cmp_bin j.tgz j1.tar.gz ' test_expect_success GZIP 'infer tgz from .tgz filename' ' git archive --output=j2.tgz HEAD - test_cmp j.tgz j2.tgz + test_cmp_bin j.tgz j2.tgz ' test_expect_success GZIP 'infer tgz from .tar.gz filename' ' git archive --output=j3.tar.gz HEAD - test_cmp j.tgz j3.tar.gz + test_cmp_bin j.tgz j3.tar.gz ' test_expect_success GZIP 'extract tgz file' ' gzip -d -c j.tgz j.tar - test_cmp b.tar j.tar + test_cmp_bin b.tar j.tar ' test_expect_success GZIP 'remote tar.gz is allowed by default' ' git
Re: [PATCH v2] t5000, t5003: do not use test_cmp to compare binary files
Stepan Kasal ka...@ucw.cz writes: test_cmp() is primarily meant to compare text files (and display the difference for debug purposes). Raw cmp is better suited to compare binary files (tar, zip, etc.). On MinGW, test_cmp is a shell function mingw_test_cmp that tries to read both files into environment, stripping CR characters (introduced in commit 4d715ac0). This function usually speeds things up, as fork is extremly slow on Windows. But no wonder that this function is extremely slow and sometimes even crashes when comparing large tar or zip files. Signed-off-by: Stepan Kasal ka...@ucw.cz --- Hi Thomas, On Wed, Jun 04, 2014 at 02:59:44PM +0200, Thomas Braun wrote: Using test_cmp_bin instead of cmp would result in then four assertions for comparing arbitrary data test_cmp test_i18ncmp test_cmp_text test_cmp_bin where I think the purpose of each function is clear from its name. [test_cmp_text does not exist (yet)] OK, I agree, hence this modified version of the patch. Yeah, I think the above reasoning is sound. And I do not think we ever need to have test_cmp_text -- our payload and our messages compared by tests to make sure our expectations hold are text by default. Will queue; thanks. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [msysGit] Re: [PATCH v2] t5000, t5003: do not use test_cmp to compare binary files
I have the problem that the overridden test_cmp crashes on a couple of places where it is doing a binary compare, so this is definitely needed. I actually used cmp -q in my override as it's the return code that is most important. //. On Wed, 4 Jun 2014 11:22:56 AM Junio C Hamano wrote: Stepan Kasal ka...@ucw.cz writes: test_cmp() is primarily meant to compare text files (and display the difference for debug purposes). Raw cmp is better suited to compare binary files (tar, zip, etc.). On MinGW, test_cmp is a shell function mingw_test_cmp that tries to read both files into environment, stripping CR characters (introduced in commit 4d715ac0). This function usually speeds things up, as fork is extremly slow on Windows. But no wonder that this function is extremely slow and sometimes even crashes when comparing large tar or zip files. Signed-off-by: Stepan Kasal ka...@ucw.cz --- Hi Thomas, On Wed, Jun 04, 2014 at 02:59:44PM +0200, Thomas Braun wrote: Using test_cmp_bin instead of cmp would result in then four assertions for comparing arbitrary data test_cmp test_i18ncmp test_cmp_text test_cmp_bin where I think the purpose of each function is clear from its name. [test_cmp_text does not exist (yet)] OK, I agree, hence this modified version of the patch. Yeah, I think the above reasoning is sound. And I do not think we ever need to have test_cmp_text -- our payload and our messages compared by tests to make sure our expectations hold are text by default. Will queue; thanks. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep -I: do not bother to read known-binary files
Hello, On Thu, May 15, 2014 at 03:22:26PM -0400, Jeff King wrote: As the person who is proposing the patch for git.git, I would hope Stepan would follow up on such review and confirm whether or not it is still needed. well, I try to. (I verified that less -I works in msysGit before submitting the fixup back there, fox example.) But I think my role is to moderate the reconciliation. In this case, having read the discussion, I decided to to ask Dscho to drop the patch. (It is always about balancing the risks and the expenses.) Stepan -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep -I: do not bother to read known-binary files
On Fri, May 16, 2014 at 10:19:57AM +0200, Stepan Kasal wrote: On Thu, May 15, 2014 at 03:22:26PM -0400, Jeff King wrote: As the person who is proposing the patch for git.git, I would hope Stepan would follow up on such review and confirm whether or not it is still needed. well, I try to. (I verified that less -I works in msysGit before submitting the fixup back there, fox example.) But I think my role is to moderate the reconciliation. In this case, having read the discussion, I decided to to ask Dscho to drop the patch. (It is always about balancing the risks and the expenses.) Sure, I think that is reasonable, and I hope I did not sound like blame Stepan, he was screwed up. After reading Dscho's other message and knowing that he has not much time for git, I was trying to communicate that I did not expect _him_ to be dealing with it. Git.git has existed for a long time without that patch, so from our perspective, it is a new patch. And as with any new patch, I would expect a submitter who receives review of eh, don't we already do this to either look into it, or decide that the review is probably right and not bother with it. And you did the latter, and that is totally fine and reasonable. From msysgit's perspective, they may or may not want to revert the patch that they already have. That is a _separate_ issue, and I think the burden of effort goes the other way. The patch should stay unless somebody goes to the work to confirm that it is not necessary (or unless they want to accept my say-so and indication that I did not do fresh testing, but that is up to them). Sorry if that is long and/or obvious. There have been enough bad feelings on the list lately that I didn't want there to be a misunderstanding about what I meant. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep -I: do not bother to read known-binary files
Hi, On Fri, May 16, 2014 at 04:29:58AM -0400, Jeff King wrote: [..] I hope I did not sound like blame Stepan, he was screwed up. no, you did not, it was ok. From msysgit's perspective, they may or may not want to revert the patch that they already have. That is a _separate_ issue, and I think the I hope Dscho will help with the decision: he can say keep it until we have evidence or at least time to do a more thorough review, for example. Stepan -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep -I: do not bother to read known-binary files
Hi Peff, On Wed, 14 May 2014, Jeff King wrote: On Wed, May 14, 2014 at 05:44:19PM +0200, Stepan Kasal wrote: From: Johannes Schindelin johannes.schinde...@gmx.de Date: Mon, 8 Nov 2010 16:10:43 +0100 Incidentally, this makes grep -I respect the binary attribute (actually, the -text attribute, but binary implies that). Hrm. Is this patch still necessary? In the time since this patch was written, we did 0826579 (grep: load file data after checking binary-ness, 2012-02-02) I have no time to test this but I trust that you made sure that it works as advertised. In my case, there were about 500 gigabytes of image data intermixed with code, and waiting for 'git grep' was not funny at all (and I did not have time back then to go through a full code submission cycle on the Git mailing list, either). So I guess we can drop my patch. Ciao, Johannes -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep -I: do not bother to read known-binary files
On Thu, May 15, 2014 at 07:42:00PM +0200, Johannes Schindelin wrote: Hrm. Is this patch still necessary? In the time since this patch was written, we did 0826579 (grep: load file data after checking binary-ness, 2012-02-02) I have no time to test this but I trust that you made sure that it works as advertised. In my case, there were about 500 gigabytes of image data intermixed with code, and waiting for 'git grep' was not funny at all (and I did not have time back then to go through a full code submission cycle on the Git mailing list, either). So I guess we can drop my patch. Certainly I tested it at the time (those commits I referenced contain timing information), and it should have improved the workload you describe. I did not test before/after the patch in this thread, but only read it and noticed that it was trying to do the same thing (that is why I said I suspect...). As the person who is proposing the patch for git.git, I would hope Stepan would follow up on such review and confirm whether or not it is still needed. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] grep -I: do not bother to read known-binary files
From: Johannes Schindelin johannes.schinde...@gmx.de Date: Mon, 8 Nov 2010 16:10:43 +0100 Incidentally, this makes grep -I respect the binary attribute (actually, the -text attribute, but binary implies that). Since the attributes are not thread-safe, we now need to switch off threading if -I was passed. Signed-off-by: Johannes Schindelin johannes.schinde...@gmx.de Signed-off-by: Stepan Kasal ka...@ucw.cz --- Hi, this patch has been in msysgit for 3.5 years. Stepan builtin/grep.c | 25 + 1 file changed, 25 insertions(+) diff --git a/builtin/grep.c b/builtin/grep.c index 43af5b7..8073fbe 100644 --- a/builtin/grep.c +++ b/builtin/grep.c @@ -18,6 +18,7 @@ #include quote.h #include dir.h #include pathspec.h +#include attr.h static char const * const grep_usage[] = { N_(git grep [options] [-e] pattern [rev...] [[--] path...]), @@ -163,6 +164,22 @@ static void work_done(struct work_item *w) grep_unlock(); } +static int skip_binary(struct grep_opt *opt, const char *filename) +{ + if ((opt-binary GREP_BINARY_NOMATCH)) { + static struct git_attr *attr_text; + struct git_attr_check check; + + if (!attr_text) + attr_text = git_attr(text); + memset(check, 0, sizeof(check)); + check.attr = attr_text; + return !git_check_attr(filename, 1, check) + ATTR_FALSE(check.value); + } + return 0; +} + static void *run(void *arg) { int hit = 0; @@ -173,6 +190,9 @@ static void *run(void *arg) if (!w) break; + if (skip_binary(opt, (const char *)w-source.identifier)) + continue; + opt-output_priv = w; hit |= grep_source(opt, w-source); grep_source_clear_data(w-source); @@ -379,6 +399,9 @@ static int grep_cache(struct grep_opt *opt, const struct pathspec *pathspec, int continue; if (!ce_path_match(ce, pathspec, NULL)) continue; + if (skip_binary(opt, ce-name)) + continue; + /* * If CE_VALID is on, we assume worktree file and its cache entry * are identical, even if worktree file has been modified, so use @@ -803,6 +826,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix) string_list_append(path_list, show_in_pager); use_threads = 0; } + if ((opt.binary GREP_BINARY_NOMATCH)) + use_threads = 0; if (!opt.pattern_list) die(_(no pattern given.)); -- 1.9.2.msysgit.0.335.gd2a461f -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep -I: do not bother to read known-binary files
Stepan Kasal ka...@ucw.cz writes: From: Johannes Schindelin johannes.schinde...@gmx.de Date: Mon, 8 Nov 2010 16:10:43 +0100 Incidentally, this makes grep -I respect the binary attribute (actually, the -text attribute, but binary implies that). Since the attributes are not thread-safe, we now need to switch off threading if -I was passed. Signed-off-by: Johannes Schindelin johannes.schinde...@gmx.de Signed-off-by: Stepan Kasal ka...@ucw.cz --- Hi, this patch has been in msysgit for 3.5 years. Stepan I do not think checking 'text' is the right way to do this. The attribute controls the eof normalization, and people sometimes want to keep CRLF terminated files in the repository no matter what the platform is (an example I heard is a sample protocol exchange over textual network protocol such as HTTP and SMTP), and the way to do so is to unset it. That would still let them look for patterns in and compare the changes to these paths. Looking for Marking files as binary in gitattributes(5) gives us a more authoritative alternative, I think. In short: - If 'diff' is Unset, or - If 'diff' is Set to X and diff.X.binary is true then the contents are not suitable for human consumption. I haven't thought things through to declare that grep -I would want to treat a PostScript file (which is an example in that section) as a binary file and refrain from trying to find substrings in it, but my gut feeling is that we probably should let it look inside such a file, so replacing 'text' with 'diff' in this patch and doing nothing about diff.*.binary would be the way to go. Given that the posted patch was older than 3.5 years, perhaps it needs updating to adhere the advice given in ab435611 (docs: explain diff.*.binary option, 2011-01-09). By the way, I wonder if the patch is about performance, implied by do not bother (to waste time looking inside), though. Is it an overall win to avoid looking inside -diff files, if that requires you to use only 1 core and leaving the other 7 idle? I think the user tells us it is binary with attribute, and the user tells us not to look into binary with -I is a lot better rationale to do this change , and that would characterise it as a correctness patch, not a performance patch. Thanks. builtin/grep.c | 25 + 1 file changed, 25 insertions(+) diff --git a/builtin/grep.c b/builtin/grep.c index 43af5b7..8073fbe 100644 --- a/builtin/grep.c +++ b/builtin/grep.c @@ -18,6 +18,7 @@ #include quote.h #include dir.h #include pathspec.h +#include attr.h static char const * const grep_usage[] = { N_(git grep [options] [-e] pattern [rev...] [[--] path...]), @@ -163,6 +164,22 @@ static void work_done(struct work_item *w) grep_unlock(); } +static int skip_binary(struct grep_opt *opt, const char *filename) +{ + if ((opt-binary GREP_BINARY_NOMATCH)) { + static struct git_attr *attr_text; + struct git_attr_check check; + + if (!attr_text) + attr_text = git_attr(text); + memset(check, 0, sizeof(check)); + check.attr = attr_text; + return !git_check_attr(filename, 1, check) + ATTR_FALSE(check.value); + } + return 0; +} + static void *run(void *arg) { int hit = 0; @@ -173,6 +190,9 @@ static void *run(void *arg) if (!w) break; + if (skip_binary(opt, (const char *)w-source.identifier)) + continue; + opt-output_priv = w; hit |= grep_source(opt, w-source); grep_source_clear_data(w-source); @@ -379,6 +399,9 @@ static int grep_cache(struct grep_opt *opt, const struct pathspec *pathspec, int continue; if (!ce_path_match(ce, pathspec, NULL)) continue; + if (skip_binary(opt, ce-name)) + continue; + /* * If CE_VALID is on, we assume worktree file and its cache entry * are identical, even if worktree file has been modified, so use @@ -803,6 +826,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix) string_list_append(path_list, show_in_pager); use_threads = 0; } + if ((opt.binary GREP_BINARY_NOMATCH)) + use_threads = 0; if (!opt.pattern_list) die(_(no pattern given.)); -- 1.9.2.msysgit.0.335.gd2a461f -- -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep -I: do not bother to read known-binary files
On Wed, May 14, 2014 at 05:44:19PM +0200, Stepan Kasal wrote: From: Johannes Schindelin johannes.schinde...@gmx.de Date: Mon, 8 Nov 2010 16:10:43 +0100 Incidentally, this makes grep -I respect the binary attribute (actually, the -text attribute, but binary implies that). Since the attributes are not thread-safe, we now need to switch off threading if -I was passed. Signed-off-by: Johannes Schindelin johannes.schinde...@gmx.de Signed-off-by: Stepan Kasal ka...@ucw.cz --- Hi, this patch has been in msysgit for 3.5 years. Stepan Hrm. Is this patch still necessary? In the time since this patch was written, we did 0826579 (grep: load file data after checking binary-ness, 2012-02-02), which should do the same thing. It deals with the threading via a lock, but we later learned in 9dd5245 (grep: pre-load userdiff drivers when threaded, 2012-02-02) to hoist that bit out. So I suspect this patch at best is doing nothing, and at worst is wasting extra time doing redundant attribute checks. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep -I: do not bother to read known-binary files
On Wed, May 14, 2014 at 10:52:28AM -0700, Junio C Hamano wrote: I do not think checking 'text' is the right way to do this. The attribute controls the eof normalization, and people sometimes want to keep CRLF terminated files in the repository no matter what the platform is (an example I heard is a sample protocol exchange over textual network protocol such as HTTP and SMTP), and the way to do so is to unset it. That would still let them look for patterns in and compare the changes to these paths. Looking for Marking files as binary in gitattributes(5) gives us a more authoritative alternative, I think. In short: - If 'diff' is Unset, or - If 'diff' is Set to X and diff.X.binary is true then the contents are not suitable for human consumption. I responded elsewhere in the thread that I think the patch under discussion is redundant at this point, but just to clarify: grep currently uses the rules you give above, as it builds on the userdiff driver code. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep -I: do not bother to read known-binary files
Jeff King p...@peff.net writes: On Wed, May 14, 2014 at 05:44:19PM +0200, Stepan Kasal wrote: From: Johannes Schindelin johannes.schinde...@gmx.de Date: Mon, 8 Nov 2010 16:10:43 +0100 Incidentally, this makes grep -I respect the binary attribute (actually, the -text attribute, but binary implies that). Since the attributes are not thread-safe, we now need to switch off threading if -I was passed. Signed-off-by: Johannes Schindelin johannes.schinde...@gmx.de Signed-off-by: Stepan Kasal ka...@ucw.cz --- Hi, this patch has been in msysgit for 3.5 years. Stepan Hrm. Is this patch still necessary? In the time since this patch was written, we did 0826579 (grep: load file data after checking binary-ness, 2012-02-02), which should do the same thing. It deals with the threading via a lock, but we later learned in 9dd5245 (grep: pre-load userdiff drivers when threaded, 2012-02-02) to hoist that bit out. Wow, power of Git history ;-) So I suspect this patch at best is doing nothing, and at worst is wasting extra time doing redundant attribute checks. Sounds like a sensible conclusion. Thanks. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Rebase triggers git diff header lacks filename information on very large patch with binary files
Hi, I just did a rebase, and it throws an error like this: Applying: comment1 Applying: comment2 Applying: comment3 Applying: comment4 Applying: patch_with_binary_files fatal: git diff header lacks filename information when removing 1 leading pathname component (line 7330213) Repository lacks necessary blobs to fall back on 3-way merge. Cannot fall back to three-way merge. Patch failed at 0005 patch_with_binary_files The copy of the patch that failed is found in: /usr/local/git_tree/affiliate_data/.git/rebase-apply/patch When you have resolved this problem, run git rebase --continue. If you prefer to skip this patch, run git rebase --skip instead. To check out the original branch and stop rebasing, run git rebase --abort. The patch is very large, 882453899 bytes. The patch also includes many binary files. Extracting the content around and before line 7330213 and up to the next diff header in the patch I see this: perl -lne'print $.\t$_ if 7330169 .. 7330213' .git/rebase-apply/patch 7330169 diff --git a/dir1/dir2/file.png b/dir1/dir2/file.png 7330170 new file mode 100644 7330171 index ..8a3219cb6545f23e3f7c61f058d82fc2c1bd9aac 7330172 GIT binary patch 7330173 literal 11301 7330174 zcmXYX1ymeO)Ai!+PH-nk@Zb{MHE3{$;O=gVh2Rd06MPp4?hxD|K!5jEd=*(p7;Ov [more lines of binary removed] 7330213 zznckDs-GVJg-A0uD|ONvCQWVX;j!JNnkQI9^=+zJ^SvLe1p-~c7bmY5wu4C=(8F0 [more lines of binary removed] 7330393 literal 0 7330394 HcmV?d1 7330395 7330396 diff --git a/dir1/dir2/file.css b/dir1/dir2/file.css 7330397 new file mode 100644 7330398 index ..75c8afc558424ea185c62b5a1c61ad6c32cddc21 I have munged the filenames. It looks to me like git can't apply patches over a certain size. Any suggestions on how to proceed here? cheers, Yves -- perl -Mre=debug -e /just|another|perl|hacker/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Rebase triggers git diff header lacks filename information on very large patch with binary files
On 14 January 2014 12:48, demerphq demer...@gmail.com wrote: Hi, I just did a rebase, and it throws an error like this: Applying: comment1 Applying: comment2 Applying: comment3 Applying: comment4 Applying: patch_with_binary_files fatal: git diff header lacks filename information when removing 1 leading pathname component (line 7330213) Repository lacks necessary blobs to fall back on 3-way merge. Cannot fall back to three-way merge. Patch failed at 0005 patch_with_binary_files The copy of the patch that failed is found in: /usr/local/git_tree/affiliate_data/.git/rebase-apply/patch When you have resolved this problem, run git rebase --continue. If you prefer to skip this patch, run git rebase --skip instead. To check out the original branch and stop rebasing, run git rebase --abort. The patch is very large, 882453899 bytes. The patch also includes many binary files. Extracting the content around and before line 7330213 and up to the next diff header in the patch I see this: perl -lne'print $.\t$_ if 7330169 .. 7330213' .git/rebase-apply/patch 7330169 diff --git a/dir1/dir2/file.png b/dir1/dir2/file.png 7330170 new file mode 100644 7330171 index ..8a3219cb6545f23e3f7c61f058d82fc2c1bd9aac 7330172 GIT binary patch 7330173 literal 11301 7330174 zcmXYX1ymeO)Ai!+PH-nk@Zb{MHE3{$;O=gVh2Rd06MPp4?hxD|K!5jEd=*(p7;Ov [more lines of binary removed] 7330213 zznckDs-GVJg-A0uD|ONvCQWVX;j!JNnkQI9^=+zJ^SvLe1p-~c7bmY5wu4C=(8F0 [more lines of binary removed] 7330393 literal 0 7330394 HcmV?d1 7330395 7330396 diff --git a/dir1/dir2/file.css b/dir1/dir2/file.css 7330397 new file mode 100644 7330398 index ..75c8afc558424ea185c62b5a1c61ad6c32cddc21 I have munged the filenames. It looks to me like git can't apply patches over a certain size. Any suggestions on how to proceed here? I aborted and then did a merge instead and it seemed to work out. Still, seems like something git should detect BEFORE it tries to do the rebase. cheers, Yves -- perl -Mre=debug -e /just|another|perl|hacker/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gitweb commitdiff page - binary files with ampersands in filename?
Oj W wrote: Change a binary file whose filename contains an ampersand, then view the commitdiff page in gitweb. Git outputs a message like Binary files a/bw.dll and b/bw.dll differ Gitweb format_diff_from_to_header() doesn't notice anything in that output which needs escaping, and writes it directly to the XHTML 1.0 Strict output. Then gitweb's output is invalid XML, meaning that browsers such as Firefox will refuse to display the page. This was because in case of binary files we don't get usual diff, just Binary files a/foo and b/foo differ just after extended git diff headers. There are two problems with gitweb code. First is that git_patchset_body() didn't recognize and handle this situation. This should be fixed by the patch below (I have trouble testing it as git-instaweb keeps using old gitweb version for some reason). Second is that format_diff_from_to_header() doesn't handle unrecognized extended git diff headers well - it doesn't HTML escape them. This is to be fixed. -- 8 -- --- gitweb/gitweb.perl | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index 1309196..33a0de1 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -5345,7 +5345,8 @@ sub git_patchset_body { while ($patch_line = $fd) { chomp $patch_line; - last EXTENDED_HEADER if ($patch_line =~ m/^--- |^diff /); + last EXTENDED_HEADER + if ($patch_line =~ m/^--- |^diff |^Binary files .* differ$/); print format_extended_diff_header_line($patch_line, $diffinfo, \%from, \%to); @@ -5357,7 +5358,12 @@ sub git_patchset_body { print /div\n; # class=patch last PATCH; } - next PATCH if ($patch_line =~ m/^diff /); + if ($patch_line =~ m/^Binary files .* differ$/) { + print div class=\diff bin\ . + esc_html($patch_line, -nbsp = 1) . + /div\n; + } + next PATCH if ($patch_line =~ m/^diff |^Binary files .* differ$/); #assert($patch_line =~ m/^---/) if DEBUG; my $last_patch_line = $patch_line; -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
gitweb commitdiff page - binary files with ampersands in filename?
Change a binary file whose filename contains an ampersand, then view the commitdiff page in gitweb. Git outputs a message like Binary files a/bw.dll and b/bw.dll differ Gitweb format_diff_from_to_header() doesn't notice anything in that output which needs escaping, and writes it directly to the XHTML 1.0 Strict output. Then gitweb's output is invalid XML, meaning that browsers such as Firefox will refuse to display the page. (tested with v 1.7.9.5, but I can't see anything in latest https://github.com/git/git/blob/master/gitweb/gitweb.perl#CL2158 which is looking for text like Binary files ... differ) -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
gitdiffbinstat - git diff --shortstat -like output for changes in binary files
I use git mostly for game-development which means I have to deal with a lot of binary files (images, sound files etc). When I came to a point where I had run image optimization on a branch, I wanted to know of course how much smaller the new branch was in comparison to master. Problem was that 'git diff --stat' would only summerize per-binary-file size changes and 'git diff --shortstat' did skip the binary files entirely. To solve this problem, I wrote a script (gitdiffbinstat) which basically runs 'git diff --stat' and summerizes the output. The script can be found here: https://github.com/matthiaskrgr/gitdiffbinstat/blob/master/gitdiffbinstat.sh Screenshot of example output is attached. I wondered what you guys thought about the script, is there a chance to perhaps get it included as some kind of helper script into the official git repo? Regards, Matthias attachment: screenshot.png
Re: gitdiffbinstat - git diff --shortstat -like output for changes in binary files
On Fri, Mar 29, 2013 at 07:07:32PM +0100, Matthias Krüger wrote: I use git mostly for game-development which means I have to deal with a lot of binary files (images, sound files etc). When I came to a point where I had run image optimization on a branch, I wanted to know of course how much smaller the new branch was in comparison to master. Problem was that 'git diff --stat' would only summerize per-binary-file size changes and 'git diff --shortstat' did skip the binary files entirely. Have you tried --summary? Combined with --stat (or --shortstat) I wonder if it would get you closer to what you want. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gitdiffbinstat - git diff --shortstat -like output for changes in binary files
Jeff King p...@peff.net writes: I use git mostly for game-development which means I have to deal with a lot of binary files (images, sound files etc). When I came to a point where I had run image optimization on a branch, I wanted to know of course how much smaller the new branch was in comparison to master. Problem was that 'git diff --stat' would only summerize per-binary-file size changes and 'git diff --shortstat' did skip the binary files entirely. Have you tried --summary? Combined with --stat (or --shortstat) I wonder if it would get you closer to what you want. diff is not about how much did my project grow or shink. If you moved one block of lines up or down in the same file, you will see double the number of lines moved in the statistics. On the other hand, the use case to measure how much it helped to run png optimizers only cares about the total size. I do not think it is a good match to present the binary stat next to the textual diff stats in the first place. Adding two numbers do not give us any meaningful number. It is an interesting use case, but it just is not a problem core Git, which is a source code management system, particularly wants to bolt on a solution for, that does not mesh well with other parts of the system. We do want to make sure that people who want to deal with binaries have access to raw statistics, so that they can build their solution on top of, though. ls-tree -r -l gives byte-size of each blob, and the attribute system can let you tell which paths are binaries. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bug Report - Binary Files as plain files, line endings conversions
Hi, $ git add . warning: LF will be replaced by CRLF in web/images/logo_twitter.png. The file will have its original line endings in your working directory. JOSE@COMPAQ /d/wamp/www/internationalstudies.co (master) $ git --version git version 1.7.11.msysgit.0 JOSE@COMPAQ /d/wamp/www/internationalstudies.co (master) $ cat .gitattributes # Auto detect text files and perform LF normalization * eol=crlf JOSE@COMPAQ /d/wamp/www/internationalstudies.co (master) Binary files as an image not may perform any line ending operations, many binary files will be corrupt when are restoring from repository, Saludos, José Nobile -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html