Re: [PATCH v2] log -G: Ignore binary files

2018-11-28 Thread Junio C Hamano
Junio C Hamano  writes:

>> +test_expect_success 'log -G ignores binary files' '
>> +git checkout --orphan orphan1 &&
>> +printf "a\0a" >data.bin &&
>> +git add data.bin &&
>> +git commit -m "message" &&
>> +git log -Ga >result &&
>> +test_must_be_empty result
>> +'
>
> As this is the first mention of data.bin, this is adding a new file
> data.bin that has two 'a' but is a binary file.  And that is the
> only commit in the history leading to orphan1.
>
> The fact that "log -Ga" won't find any means it missed the creation
> event, because the blob is binary.  Good.

By the way, this root commit records another file whose path is
"file" and has "Picked" in it.  If the file had 'a' in it, it
would have been included in "git log" output, but that is too subtle
a point to be noticed by the readers who are only reading this patch
without seeing what has been done to the index before this test
piece.

If you are going to restructure these tests to create a three-commit
history in a single expect_success that is inspected with various
"log -Ga" invocations in subsequent tests, it is worth removing that
other file (or rather, starting with "read-tree --empty" immediately
after checking out the orphan branch, to clarify to the readers that
there is nothing but what you add in the set-up step in the index)
to make the test more robust.



Re: [PATCH v2] log -G: Ignore binary files

2018-11-28 Thread Junio C Hamano
Thomas Braun  writes:

> Subject: Re: [PATCH v2] log -G: Ignore binary files

s/Ig/ig/; (will locally munge--this alone is no reason to reroll).

The code changes looked sensible.

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..5c3e2a16b2 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing 
> textconv tool)' '
>   rm .gitattributes
>  '
>  
> +test_expect_success 'log -G ignores binary files' '
> + git checkout --orphan orphan1 &&
> + printf "a\0a" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&
> + git log -Ga >result &&
> + test_must_be_empty result
> +'

As this is the first mention of data.bin, this is adding a new file
data.bin that has two 'a' but is a binary file.  And that is the
only commit in the history leading to orphan1.

The fact that "log -Ga" won't find any means it missed the creation
event, because the blob is binary.  Good.

> +test_expect_success 'log -G looks into binary files with -a' '
> + git checkout --orphan orphan2 &&
> + printf "a\0a" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&

This starts from the state left by the previous test piece, i.e. we
have a binary data.bin file with two 'a' in it.  We pretend to
modify and add, but these two steps are no-op if the previous
succeeded, but even if the previous step failed, we get what we want
in the data.bin file.  And then we make an initial commit the same
way.

> + git log -a -Ga >actual &&
> + git log >expected &&

And we ran the same test but this time with "-a" to tell Git that
binary-ness should not matter.  It will find the sole commit.  Good.

> + test_cmp actual expected
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> + git checkout --orphan orphan3 &&
> + echo "* diff=bin" > .gitattributes &&

s/> />/; (will locally munge--this alone is no reason to reroll).

> + printf "a\0a" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&
> + git -c diff.bin.textconv=cat log -Ga >actual &&

This exposes a slight iffy-ness in the design.  The textconv filter
used here does not strip the "binary-ness" from the payload, but it
is enough to tell the machinery that -G should look into the
difference.  Is that really desirable, though?

IOW, if this weren't the initial commit (which is handled by the
codepath to special-case creation and deletion in diff_grep()
function), would "log -Ga" show it without "-a"?  Should it?

I think this test piece (and probably the previous ones for "-a" vs
"no -a" without textconv, as well) should be using a history with
three commits, where

- the root commit introduces "a\0a" to data.bin (creation event)

- the second commit adds another instance of "a\0a" to data.bin
  (forces comparison)

- the third commit removes data.bin (deletion event)

and make sure that the three are treated identically.  If "log -Ga"
finds one (with the combination of other conditions like use of
textconv or -a option), it should find all three, and vice versa.

> + git log >expected &&
> + test_cmp actual expected
> +'
> +
> +test_expect_success 'log -S looks into binary files' '
> + git checkout --orphan orphan4 &&
> + printf "a\0a" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&
> + git log -Sa >actual &&
> + git log >expected &&
> + test_cmp actual expected
> +'

Likewise.  This would also benefit from a three-commit history.

Perhaps you can create such a history at the beginning of these
additions as another "setup -G/-S binary test" step and test
different variations in subsequent tests without the setup?

>  test_done


Re: [PATCH v2] log -G: Ignore binary files

2018-11-28 Thread Ævar Arnfjörð Bjarmason


On Wed, Nov 28 2018, Thomas Braun wrote:

Looks much better this time around.

> The -G option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> As the concept of patch text only makes sense for text files, we need to
> ignore binary files when searching with -G  as well.
>
> The -S option of log looks for differences that changes
> the number of occurrences of the specified block of text (i.e.
> addition/deletion) in a file. As we want to keep the current behaviour,
> add a test to ensure it.
> [...]
> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches 
> the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.

Now that we support --text that should be documented. I tried to come up
with something on top:

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 0378cd574e..42ae65fb57 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -524,6 +524,10 @@ struct), and want to know the history of that block 
since it first
 came into being: use the feature iteratively to feed the interesting
 block in the preimage back into `-S`, and keep going until you get the
 very first version of the block.
    ++
    +Unlike `-G` the `-S` option will always search through binary files
+without a textconv filter. [[TODO: Don't we want to support --no-text
+then as an optimization?]].

 -G::
Look for differences whose patch text contains added/removed
@@ -545,6 +549,15 @@ occurrences of that string did not change).
 +
 See the 'pickaxe' entry in linkgit:gitdiffcore[7] for more
 information.
++
+Unless `--text` is supplied binary files without a textconv filter
+will be ignored.  This was not the case before Git version 2.21..
++
+With `--text`, instead of patch lines we ::
Look for differences that change the number of occurrences of
diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c0a60f3158..26880b4149 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -251,6 +251,10 @@ criterion in a changeset, the entire changeset is 
kept.  This behavior
 is designed to make reviewing changes in the context of the whole
 changeset easier.

+Both `-S' and `-G' will ignore binary files without a textconv filter
+by default, this can be overriden with `--text`. With `--text` the
+binary patch we look through is generated as [[TODO: ???]].
+
 diffcore-order: For Sorting the Output Based on Filenames
 -

But as you can see given the TODO comments I don't know how this works
exactly. I *could* dig, but that's my main outstanding problem with this
patch, the commit message / docs aren't being updated to reflect the new
behavior.

I.e. let's leave the docs in some state where the reader can as
unambiguously know what to expect with -G and these binary diffs we've
been implicitly supporting as with the textual diffs. Ideally with some
examples of how to generate them (re my question about the base85 output
in v1).

Part of that's obviously behavior we've had all along, but it's much
more convincing to say:

We are changing X which we've done for ages, it works exactly like
this, and here's a switch to get it back.

Instead of:

X doesn't make sense, let's turn it off.

Also the diffcore docs already say stuff about how slow/fast things are,
and in a side-thread you said:

My main motiviation is to speed up "log -G" as that takes a
    considerable amount of time when it wades through MBs of binary
files which change often.

Makes sense, but then let's say something about that in that section of
the docs.

>  When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
>  that match their respective criterion are kept in the output.  When
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..4cea086f80 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct 
> diff_options *o,
>   if (textconv_one == textconv_two && diff_unmodified_pair(p))
>   return 0;
>
> + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> + !o->flags

[PATCH v2] log -G: Ignore binary files

2018-11-28 Thread Thomas Braun
The -G option of log looks for the differences whose patch text
contains added/removed lines that match regex.

As the concept of patch text only makes sense for text files, we need to
ignore binary files when searching with -G  as well.

The -S option of log looks for differences that changes
the number of occurrences of the specified block of text (i.e.
addition/deletion) in a file. As we want to keep the current behaviour,
add a test to ensure it.

Signed-off-by: Thomas Braun 
---

Changes since v1:
- Merged both patches into one
- Adapted commit messages
- Added missing support for -a flag with tests
- Placed new code into correct location to be able to reuse an existing
  optimization
- Uses help-suggested -Ga writing without spaces
- Uses orphan branches instead of cannonball cleanup with rm -rf
- Changed search text to make it clear that it is not about the \0 boundary

 Documentation/gitdiffcore.txt |  2 +-
 diffcore-pickaxe.c|  6 ++
 t/t4209-log-pickaxe.sh| 40 +++
 3 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c0a60f3158..059ddd3431 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches 
the given
 regular expression.  This means that it will detect in-file (or what
 rename-detection considers the same file) moves, which is noise.  The
 implementation runs diff twice and greps, and this can be quite
-expensive.
+expensive.  Binary files without textconv filter are ignored.
 
 When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
 that match their respective criterion are kept in the output.  When
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69fc55ea1e..4cea086f80 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct 
diff_options *o,
if (textconv_one == textconv_two && diff_unmodified_pair(p))
return 0;
 
+   if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
+   !o->flags.text &&
+   ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
+(!textconv_two && diff_filespec_is_binary(o->repo, p->two
+   return 0;
+
mf1.size = fill_textconv(o->repo, textconv_one, p->one, );
mf2.size = fill_textconv(o->repo, textconv_two, p->two, );
 
diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..5c3e2a16b2 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing 
textconv tool)' '
    rm .gitattributes
 '
 
+test_expect_success 'log -G ignores binary files' '
+   git checkout --orphan orphan1 &&
+   printf "a\0a" >data.bin &&
+   git add data.bin &&
+   git commit -m "message" &&
+   git log -Ga >result &&
+   test_must_be_empty result
+'
+
+test_expect_success 'log -G looks into binary files with -a' '
+   git checkout --orphan orphan2 &&
+   printf "a\0a" >data.bin &&
+   git add data.bin &&
+   git commit -m "message" &&
+   git log -a -Ga >actual &&
+   git log >expected &&
+   test_cmp actual expected
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+   git checkout --orphan orphan3 &&
+   echo "* diff=bin" > .gitattributes &&
+   printf "a\0a" >data.bin &&
+   git add data.bin &&
+   git commit -m "message" &&
+   git -c diff.bin.textconv=cat log -Ga >actual &&
+   git log >expected &&
+   test_cmp actual expected
+'
+
+test_expect_success 'log -S looks into binary files' '
+   git checkout --orphan orphan4 &&
+   printf "a\0a" >data.bin &&
+   git add data.bin &&
+   git commit -m "message" &&
+   git log -Sa >actual &&
+   git log >expected &&
+   test_cmp actual expected
+'
+
 test_done
-- 
2.19.0.271.gfe8321ec05.dirty



Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-28 Thread Thomas Braun
> Ævar Arnfjörð Bjarmason  hat am 22. November 2018 um 11:16 
> geschrieben:

[...]

> >
> > +test_expect_success 'log -G ignores binary files' '
> > +   rm -rf .git &&
> > +   git init &&
> > +   printf "a\0b" >data.bin &&
> > +   git add data.bin &&
> > +   git commit -m "message" &&
> > +   git log -G a >result &&
> 
> Would be less confusing as "-Ga" since that's the invocation we
> document, even though I see (but wasn't aware that...) "-G a" works too.

Done.

> > +   test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +   rm -rf .git &&
> > +   git init &&
> > +   echo "* diff=bin" > .gitattributes &&
> > +   printf "a\0b" >data.bin &&
> > +   git add data.bin &&
> > +   git commit -m "message" &&
> > +   git -c diff.bin.textconv=cat log -G a >actual &&
> > +   git log >expected &&
> > +   test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This patch seems like the wrong direction to me. In particular the
> assertion that "the concept of differences only makes sense for text
> files". That's just not true. This patch breaks this:
> 
> (
> rm -rf /tmp/g-test &&
> git init /tmp/g-test &&
> cd /tmp/g-test &&
> for i in {1..10}; do
> echo "Always matching thensome 5" >file &&
> printf "a thensome %d binary \0" $i >>file &&
> git add file &&
> git commit -m"Bump $i"
> done &&
> git log -Gthensome.*5
> )
> 
> Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
> [156]". The 1st one because it introduces the "Always matching thensome
> 5". Then 5/6 because the add/remove the string "a thensome 5 binary",
> respectively. Which matches /thensome.*5/.

log -p does not show you the patch text in your example because it is treated
as binary. And currently "log -G" has a different opinion into what it looks
and what it ignores. My patch tries to bring both more in line.
 
> I.e. in the first one we do a regex match against the content here
> because we don't have both sides:
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53
> 
> And then for the later ones where we have both sides we end up in
> diffgrep_consume():
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36
> 
> I think there may be a real issue here to address, which might be some
> combination of:
> 
>  a) Even though the diffcore can do a binary diff internally, this is
> not what it exposes with "-p", we just say "Binary files differ".
> 
> I don't know how to emit the raw version we'll end up passing to
> diffgrep_consume() in this case. Is it just --binary without the
> encoding? I don't know...
> 
>  b) Your test case shows that you're matching a string at a \0
> boundary. Is this perhaps something you ran into? I.e. that we don't
> have some -F version of -G so we can't supply regexes that match
> past a \0? I had some related work on grep for this that hasn't been
> carried over to the diffcore:
> 
> git log --grep='grep:.*\\0' --author=Ævar
> 
>  c) Is this binary diff we end up matching against just bad in some
> cases? I haven't dug but that wouldn't surprise me, i.e. that it's
> trying to be line-based so we'll overmatch in many cases.
> 
> So maybe this is something that should be passed down as a flag? See a
> recent discussion at
> https://public-inbox.org/git/87lg77cmr1@evledraar.gmail.com/ for how
> that could be done.

It is not about the \0 boundary. v2 of the patches will clarify that. My main
motiviation is to speed up "log -G" as that takes a considerable amount of time 
when it wades through MBs of binary files which change often. And in multiple 
places
I can already treat binary files differently (e.g. turn off delta compression, 
skip
trying to diff them, no EOL normalization). And for me making log -G ignore 
what git 
thinks are binary files is making the line clearer between what should be 
treated
as binary and what as text.

> Also if we don't have some tests already that were failing with this
> patch we really should have those as "let's test the current behavior
> first". Unfortunately tests in this area are really lacking, see
> e.g. my:
> 
> git log --author=Junio --min-parents=2 --grep=ab/.*grep
> 
> For some series of patches to grep where to get one patch in I needed to
> often lead with 5-10 test patches to convince reviewers that I knew what
> I was changing, and also to be comfortable that I'd covered all the edge
> cases we currently supported, but weren't testing for.

I'm happy to add more test cases to convince everyone involved :)


Re: [PATCH v1 2/2] log -S: Add test which searches in binary files

2018-11-28 Thread Thomas Braun
> Junio C Hamano  hat am 22. November 2018 um 02:34 
> geschrieben:
> 
> 
> Thomas Braun  writes:
> 
> > The -S  option of log looks for differences that changes the
> > number of occurrences of the specified string (i.e. addition/deletion)
> > in a file.
> 
> s/-S /-S/ and
> s/the specified string/the specified block of text/ would make it
> more in line with how Documentation/gitdiffcore.txt explains it.
> The original discussion from early 2017 also explains with a pointer
> why the primary mode of -S is not  but is .

Thanks for the pointer. I've updated the commit message.
 
> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 42cc8afd8b..d430f6f2f9 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files 
> > with textconv filter' '
> >     test_cmp actual expected
> >  '
> >  
> > +test_expect_success 'log -S looks into binary files' '
> > +   rm -rf .git &&
> > +   git init &&
> 
> Same comment as the one for 1/2 applies here.

Fixed as well.

> > +   printf "a\0b" >data.bin &&
> > +   git add data.bin &&
> > +   git commit -m "message" &&
> > +   git log -S a >actual &&
> > +   git log >expected &&
> > +   test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> Other than these, I think both patches look sensible.  Thanks for
> resurrecting the old topic and reigniting it.
>


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-28 Thread Thomas Braun


> Junio C Hamano  hat am 27. November 2018 um 01:51 
> geschrieben:
> 
> 
> Stefan Beller  writes:
> 
> > On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun
> >  wrote:
> >>
> >> The -G  option of log looks for the differences whose patch text
> >> contains added/removed lines that match regex.
> >>
> >> The concept of differences only makes sense for text files, therefore
> >> we need to ignore binary files when searching with -G  as well.
> >
> > What about partial text/partial binary files?
> 
> Good point. You'd use "-a" (or "--text") to tell the diff machinery
> to treat the contents as text, and the new logic must pay attention
> to that command line option.

Yes exactly. Either use -a for the occasional use or a textconv filter
for permanent use.

Coming from the opposite side: I usually mark svg files as binary as the
textual diff is well, let's say uninspiring.


Re: [PATCH v1 2/2] log -S: Add test which searches in binary files

2018-11-28 Thread Thomas Braun
> Ævar Arnfjörð Bjarmason  hat am 22. November 2018 um 10:14 
> geschrieben:
> 
> 
> 
> On Wed, Nov 21 2018, Thomas Braun wrote:
> 
> > The -S  option of log looks for differences that changes the
> > number of occurrences of the specified string (i.e. addition/deletion)
> > in a file.
> >
> > Add a test to ensure that we keep looking into binary files with -S
> > as changing that would break backwards compatibility in unexpected ways.
> >
> > Signed-off-by: Thomas Braun 
> > ---
> >  t/t4209-log-pickaxe.sh | 11 +++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 42cc8afd8b..d430f6f2f9 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files 
> > with textconv filter' '
> > test_cmp actual expected
> >  '
> >
> > +test_expect_success 'log -S looks into binary files' '
> > +   rm -rf .git &&
> > +   git init &&
> > +   printf "a\0b" >data.bin &&
> > +   git add data.bin &&
> > +   git commit -m "message" &&
> > +   git log -S a >actual &&
> > +   git log >expected &&
> > +   test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This should just be part of 1/2 since the behavior is changed there &
> the commit message should describe both cases.

My reasoning was that this is a separate test which does not fit in with the 
other part.
But I'm happy in folding both into one patch. Done.


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-28 Thread Thomas Braun
> Ævar Arnfjörð Bjarmason  hat am 22. November 2018 um 11:16 
> geschrieben:

[...]

> >
> > +test_expect_success 'log -G ignores binary files' '
> > +   rm -rf .git &&
> > +   git init &&
> > +   printf "a\0b" >data.bin &&
> > +   git add data.bin &&
> > +   git commit -m "message" &&
> > +   git log -G a >result &&
> 
> Would be less confusing as "-Ga" since that's the invocation we
> document, even though I see (but wasn't aware that...) "-G a" works too.

Done.

> > +   test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +   rm -rf .git &&
> > +   git init &&
> > +   echo "* diff=bin" > .gitattributes &&
> > +   printf "a\0b" >data.bin &&
> > +   git add data.bin &&
> > +   git commit -m "message" &&
> > +   git -c diff.bin.textconv=cat log -G a >actual &&
> > +   git log >expected &&
> > +   test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This patch seems like the wrong direction to me. In particular the
> assertion that "the concept of differences only makes sense for text
> files". That's just not true. This patch breaks this:
> 
> (
> rm -rf /tmp/g-test &&
> git init /tmp/g-test &&
> cd /tmp/g-test &&
> for i in {1..10}; do
> echo "Always matching thensome 5" >file &&
> printf "a thensome %d binary \0" $i >>file &&
> git add file &&
> git commit -m"Bump $i"
> done &&
> git log -Gthensome.*5
> )
> 
> Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
> [156]". The 1st one because it introduces the "Always matching thensome
> 5". Then 5/6 because the add/remove the string "a thensome 5 binary",
> respectively. Which matches /thensome.*5/.

log -p does not show you the patch text in your example because it is treated
as binary. And currently "log -G" has a different opinion into what it looks
and what it ignores. My patch tries to bring both more in line.
 
> I.e. in the first one we do a regex match against the content here
> because we don't have both sides:
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53
> 
> And then for the later ones where we have both sides we end up in
> diffgrep_consume():
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36
> 
> I think there may be a real issue here to address, which might be some
> combination of:
> 
>  a) Even though the diffcore can do a binary diff internally, this is
> not what it exposes with "-p", we just say "Binary files differ".
> 
> I don't know how to emit the raw version we'll end up passing to
> diffgrep_consume() in this case. Is it just --binary without the
> encoding? I don't know...
> 
>  b) Your test case shows that you're matching a string at a \0
> boundary. Is this perhaps something you ran into? I.e. that we don't
> have some -F version of -G so we can't supply regexes that match
> past a \0? I had some related work on grep for this that hasn't been
> carried over to the diffcore:
> 
> git log --grep='grep:.*\\0' --author=Ævar
> 
>  c) Is this binary diff we end up matching against just bad in some
> cases? I haven't dug but that wouldn't surprise me, i.e. that it's
> trying to be line-based so we'll overmatch in many cases.
> 
> So maybe this is something that should be passed down as a flag? See a
> recent discussion at
> https://public-inbox.org/git/87lg77cmr1@evledraar.gmail.com/ for how
> that could be done.

It is not about the \0 boundary. v2 of the patches will clarify that. My main
motiviation is to speed up "log -G" as that takes a considerable amount of time 
when it wades through MBs of binary files which change often. And in multiple 
places
I can already treat binary files differently (e.g. turn off delta compression, 
skip
trying to diff them, no EOL normalization). And for me making log -G ignore 
what git 
thinks are binary files is making the line clearer between what should be 
treated as binary
and what as text.

> Also if we don't have some tests already that were failing with this
> patch we really should have those as "let's test the current behavior
> first". Unfortunately tests in this area are really lacking, see
> e.g. my:
> 
> git log --author=Junio --min-parents=2 --grep=ab/.*grep
> 
> For some series of patches to grep where to get one patch in I needed to
> often lead with 5-10 test patches to convince reviewers that I knew what
> I was changing, and also to be comfortable that I'd covered all the edge
> cases we currently supported, but weren't testing for.

I'm happy to add more test cases to convince everyone involved :)


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-28 Thread Thomas Braun
> Jeff King  hat am 22. November 2018 um 17:20 geschrieben:
> 
> 
> On Wed, Nov 21, 2018 at 09:52:27PM +0100, Thomas Braun wrote:
> 
> > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> > index 69fc55ea1e..8c2558b07d 100644
> > --- a/diffcore-pickaxe.c
> > +++ b/diffcore-pickaxe.c
> > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, 
> > struct diff_options *o,
> > textconv_two = get_textconv(o->repo->index, p->two);
> > }
> >  
> > +   if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> > +   ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> > +(!textconv_two && diff_filespec_is_binary(o->repo, p->two
> > +   return 0;
> 
> If the user passes "-a" to treat binary files as text, we should
> probably skip the binary check. I think we'd need to check
> "o->flags.text" here.

Good point. I missed that flag. Added.

> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..42cc8afd8b 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing 
> > textconv tool)' '
> > [...]
> > +test_expect_success 'log -G ignores binary files' '
> > [...]
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> 
> And likewise add a test here similar to the textconv one.

Added as well.


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-28 Thread Thomas Braun
> Junio C Hamano  hat am 22. November 2018 um 02:29 
> geschrieben:
> 
> 
> Thomas Braun  writes:
> 
> > The -G  option of log looks for the differences whose patch text
> > contains added/removed lines that match regex.
> >
> > The concept of differences only makes sense for text files, therefore
> > we need to ignore binary files when searching with -G  as well.
> >
> > Signed-off-by: Thomas Braun 
> > ---
> >  Documentation/gitdiffcore.txt |  2 +-
> >  diffcore-pickaxe.c|  5 +
> >  t/t4209-log-pickaxe.sh| 22 ++
> >  3 files changed, 28 insertions(+), 1 deletion(-)
> 
> OK.
> 
> > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> > index c0a60f3158..059ddd3431 100644
> > --- a/Documentation/gitdiffcore.txt
> > +++ b/Documentation/gitdiffcore.txt
> > @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that 
> > matches the given
> >  regular expression.  This means that it will detect in-file (or what
> >  rename-detection considers the same file) moves, which is noise.  The
> >  implementation runs diff twice and greps, and this can be quite
> > -expensive.
> > +expensive.  Binary files without textconv filter are ignored.
> 
> OK.
> 
> > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> > index 69fc55ea1e..8c2558b07d 100644
> > --- a/diffcore-pickaxe.c
> > +++ b/diffcore-pickaxe.c
> > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, 
> > struct diff_options *o,
> > textconv_two = get_textconv(o->repo->index, p->two);
> > }
> >  
> > +   if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> > +   ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> > +(!textconv_two && diff_filespec_is_binary(o->repo, p->two
> > +   return 0;
> > +
> > /*
> >  * If we have an unmodified pair, we know that the count will be the
> >  * same and don't even have to load the blobs. Unless textconv is in
> 
> Shouldn't this new test come after the existing optimization, which
> allows us to leave without loading the blob contents (which is
> needed once you call diff_filespec_is_binary())?

Yes, good point.

> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..42cc8afd8b 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing 
> > textconv tool)' '
> > rm .gitattributes
> >  '
> >  
> > +test_expect_success 'log -G ignores binary files' '
> > +   rm -rf .git &&
> > +   git init &&
> 
> Please never never ever do the above two unless you are writing a
> test that checks low-level repository details.
> 
> If you want a clean history that has specific lineage of commits
> without getting affected by commits that have been made by the
> previous test pieces, it is OK to "checkout --orphan" to create an
> empty history to work with.

Thanks for the hint. I thought I had seen a less intrusive way for getting an 
empty history. 
Changed.

> > +   printf "a\0b" >data.bin &&
> > +   git add data.bin &&
> > +   git commit -m "message" &&
> > +   git log -G a >result &&
> > +   test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +   rm -rf .git &&
> > +   git init &&
> > +   echo "* diff=bin" > .gitattributes &&
> > +   printf "a\0b" >data.bin &&
> > +   git add data.bin &&
> > +   git commit -m "message" &&
> > +   git -c diff.bin.textconv=cat log -G a >actual &&
> > +   git log >expected &&
> > +   test_cmp actual expected
> > +'
> > +
> >  test_done
>


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-26 Thread Junio C Hamano
Stefan Beller  writes:

> On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun
>  wrote:
>>
>> The -G  option of log looks for the differences whose patch text
>> contains added/removed lines that match regex.
>>
>> The concept of differences only makes sense for text files, therefore
>> we need to ignore binary files when searching with -G  as well.
>
> What about partial text/partial binary files?

Good point. You'd use "-a" (or "--text") to tell the diff machinery
to treat the contents as text, and the new logic must pay attention
to that command line option.


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-26 Thread Stefan Beller
On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun
 wrote:
>
> The -G  option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> The concept of differences only makes sense for text files, therefore
> we need to ignore binary files when searching with -G  as well.

What about partial text/partial binary files?

I recall using text searching tools (not necessarily git machinery,
my memory is fuzzy) to check for strings in pdf files, which are
usually marked binary in context of git, such that we do not
see their diffs in `log -p`.

But I would expect a search with -G or -S to still work...
until I find the exception in the docs, only to wonder if
there is a switch to turn off this optimisation for this
corner case.

Stefan


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-23 Thread Junio C Hamano
Jeff King  writes:

>> +if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
>> +((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
>> + (!textconv_two && diff_filespec_is_binary(o->repo, p->two
>> +return 0;
>
> If the user passes "-a" to treat binary files as text, we should
> probably skip the binary check. I think we'd need to check
> "o->flags.text" here.

Yeah, I forgot about that option.  It would give an escape hatch
that has a sane explanation.


Re: [PATCH v1 2/2] log -S: Add test which searches in binary files

2018-11-23 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

> On Wed, Nov 21 2018, Thomas Braun wrote:
>
>> The -S  option of log looks for differences that changes the
>> number of occurrences of the specified string (i.e. addition/deletion)
>> in a file.
>>
> ...
> This should just be part of 1/2 since the behavior is changed there &
> the commit message should describe both cases.

Sensible suggestion.  Thanks.


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-22 Thread Jeff King
On Thu, Nov 22, 2018 at 11:16:38AM +0100, Ævar Arnfjörð Bjarmason wrote:

> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +   rm -rf .git &&
> > +   git init &&
> > +   echo "* diff=bin" > .gitattributes &&
> > +   printf "a\0b" >data.bin &&
> > +   git add data.bin &&
> > +   git commit -m "message" &&
> > +   git -c diff.bin.textconv=cat log -G a >actual &&
> > +   git log >expected &&
> > +   test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This patch seems like the wrong direction to me. In particular the
> assertion that "the concept of differences only makes sense for text
> files". That's just not true. This patch breaks this:

But "-G" is defined as "look for differences whose patch text contains
added/removed lines that match ". We don't have patch text here,
let alone added/removed lines.

For binary files, "-Sfoo" is better defined. I think we _could_ define
"search for  in the added/removed bytes of a binary file".  But
I don't think that's what the current code does (it really does a line
diff on a binary file, which is likely to put tons of unchanged crap
into the "added and removed" lines, because the line divisions aren't
meaningful in the first place).

> (
> rm -rf /tmp/g-test &&
> git init /tmp/g-test &&
> cd /tmp/g-test &&
> for i in {1..10}; do
> echo "Always matching thensome 5" >file &&
> printf "a thensome %d binary \0" $i >>file &&
> git add file &&
> git commit -m"Bump $i"
> done &&
> git log -Gthensome.*5
> )
> 
> Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
> [156]". The 1st one because it introduces the "Always matching thensome
> 5". Then 5/6 because the add/remove the string "a thensome 5 binary",
> respectively. Which matches /thensome.*5/.

Right, this will sometimes do the right thing. But it will also often do
the wrong thing. It's also very expensive (we specifically avoid feeding
large binary files to xdiff, but I think "-G" will happily do so -- I
didn't double check, though).

-Peff


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-22 Thread Jeff King
On Wed, Nov 21, 2018 at 09:52:27PM +0100, Thomas Braun wrote:

> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..8c2558b07d 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct 
> diff_options *o,
>   textconv_two = get_textconv(o->repo->index, p->two);
>   }
>  
> + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +  (!textconv_two && diff_filespec_is_binary(o->repo, p->two
> + return 0;

If the user passes "-a" to treat binary files as text, we should
probably skip the binary check. I think we'd need to check
"o->flags.text" here.

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..42cc8afd8b 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing 
> textconv tool)' '
> [...]
> +test_expect_success 'log -G ignores binary files' '
> [...]
> +test_expect_success 'log -G looks into binary files with textconv filter' '

And likewise add a test here similar to the textconv one.

-Peff


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-22 Thread Ævar Arnfjörð Bjarmason
>
On Wed, Nov 21 2018, Thomas Braun wrote:

> The -G  option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> The concept of differences only makes sense for text files, therefore
> we need to ignore binary files when searching with -G  as well.
>
> Signed-off-by: Thomas Braun 
> ---
>  Documentation/gitdiffcore.txt |  2 +-
>  diffcore-pickaxe.c|  5 +
>  t/t4209-log-pickaxe.sh| 22 ++
>  3 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches 
> the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.
>
>  When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
>  that match their respective criterion are kept in the output.  When
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..8c2558b07d 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct 
> diff_options *o,
>   textconv_two = get_textconv(o->repo->index, p->two);
>   }
>
> + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +  (!textconv_two && diff_filespec_is_binary(o->repo, p->two
> + return 0;
> +
>   /*
>* If we have an unmodified pair, we know that the count will be the
>* same and don't even have to load the blobs. Unless textconv is in
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..42cc8afd8b 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing 
> textconv tool)' '
>   rm .gitattributes
>  '
>
> +test_expect_success 'log -G ignores binary files' '
> + rm -rf .git &&
> + git init &&
> + printf "a\0b" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&
> + git log -G a >result &&

Would be less confusing as "-Ga" since that's the invocation we
document, even though I see (but wasn't aware that...) "-G a" works too.

> + test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> + rm -rf .git &&
> + git init &&
> + echo "* diff=bin" > .gitattributes &&
> + printf "a\0b" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&
> + git -c diff.bin.textconv=cat log -G a >actual &&
> + git log >expected &&
> + test_cmp actual expected
> +'
> +
>  test_done

This patch seems like the wrong direction to me. In particular the
assertion that "the concept of differences only makes sense for text
files". That's just not true. This patch breaks this:

(
rm -rf /tmp/g-test &&
git init /tmp/g-test &&
cd /tmp/g-test &&
for i in {1..10}; do
echo "Always matching thensome 5" >file &&
printf "a thensome %d binary \0" $i >>file &&
git add file &&
git commit -m"Bump $i"
done &&
git log -Gthensome.*5
)

Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
[156]". The 1st one because it introduces the "Always matching thensome
5". Then 5/6 because the add/remove the string "a thensome 5 binary",
respectively. Which matches /thensome.*5/.

I.e. in the first one we do a regex match against the content here
because we don't have both sides:
https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53

And then for the later ones where we have both sides we end up in
diffgrep_consume():
https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36

I think there may be a real issue here to address, which might be some
combination of:

 a) Even though the diffcore can do a binary diff internally, this is

Re: [PATCH v1 2/2] log -S: Add test which searches in binary files

2018-11-22 Thread Ævar Arnfjörð Bjarmason


On Wed, Nov 21 2018, Thomas Braun wrote:

> The -S  option of log looks for differences that changes the
> number of occurrences of the specified string (i.e. addition/deletion)
> in a file.
>
> Add a test to ensure that we keep looking into binary files with -S
> as changing that would break backwards compatibility in unexpected ways.
>
> Signed-off-by: Thomas Braun 
> ---
>  t/t4209-log-pickaxe.sh | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 42cc8afd8b..d430f6f2f9 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with 
> textconv filter' '
>   test_cmp actual expected
>  '
>
> +test_expect_success 'log -S looks into binary files' '
> + rm -rf .git &&
> + git init &&
> + printf "a\0b" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&
> + git log -S a >actual &&
> + git log >expected &&
> + test_cmp actual expected
> +'
> +
>  test_done

This should just be part of 1/2 since the behavior is changed there &
the commit message should describe both cases.


Re: [PATCH v1 2/2] log -S: Add test which searches in binary files

2018-11-21 Thread Junio C Hamano
Thomas Braun  writes:

> The -S  option of log looks for differences that changes the
> number of occurrences of the specified string (i.e. addition/deletion)
> in a file.

s/-S /-S/ and
s/the specified string/the specified block of text/ would make it
more in line with how Documentation/gitdiffcore.txt explains it.
The original discussion from early 2017 also explains with a pointer
why the primary mode of -S is not  but is .

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 42cc8afd8b..d430f6f2f9 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with 
> textconv filter' '
>   test_cmp actual expected
>  '
>  
> +test_expect_success 'log -S looks into binary files' '
> + rm -rf .git &&
> + git init &&

Same comment as the one for 1/2 applies here.

> + printf "a\0b" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&
> + git log -S a >actual &&
> + git log >expected &&
> + test_cmp actual expected
> +'
> +
>  test_done

Other than these, I think both patches look sensible.  Thanks for
resurrecting the old topic and reigniting it.


Re: [PATCH v1 1/2] log -G: Ignore binary files

2018-11-21 Thread Junio C Hamano
Thomas Braun  writes:

> The -G  option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> The concept of differences only makes sense for text files, therefore
> we need to ignore binary files when searching with -G  as well.
>
> Signed-off-by: Thomas Braun 
> ---
>  Documentation/gitdiffcore.txt |  2 +-
>  diffcore-pickaxe.c|  5 +
>  t/t4209-log-pickaxe.sh| 22 ++
>  3 files changed, 28 insertions(+), 1 deletion(-)

OK.

> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches 
> the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.

OK.

> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..8c2558b07d 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct 
> diff_options *o,
>   textconv_two = get_textconv(o->repo->index, p->two);
>   }
>  
> + if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> + ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +  (!textconv_two && diff_filespec_is_binary(o->repo, p->two
> + return 0;
> +
>   /*
>* If we have an unmodified pair, we know that the count will be the
>* same and don't even have to load the blobs. Unless textconv is in

Shouldn't this new test come after the existing optimization, which
allows us to leave without loading the blob contents (which is
needed once you call diff_filespec_is_binary())?

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..42cc8afd8b 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing 
> textconv tool)' '
>   rm .gitattributes
>  '
>  
> +test_expect_success 'log -G ignores binary files' '
> + rm -rf .git &&
> + git init &&

Please never never ever do the above two unless you are writing a
test that checks low-level repository details.

If you want a clean history that has specific lineage of commits
without getting affected by commits that have been made by the
previous test pieces, it is OK to "checkout --orphan" to create an
empty history to work with.

> + printf "a\0b" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&
> + git log -G a >result &&
> + test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> + rm -rf .git &&
> + git init &&
> + echo "* diff=bin" > .gitattributes &&
> + printf "a\0b" >data.bin &&
> + git add data.bin &&
> + git commit -m "message" &&
> + git -c diff.bin.textconv=cat log -G a >actual &&
> + git log >expected &&
> + test_cmp actual expected
> +'
> +
>  test_done


[PATCH v1 2/2] log -S: Add test which searches in binary files

2018-11-21 Thread Thomas Braun
The -S  option of log looks for differences that changes the
number of occurrences of the specified string (i.e. addition/deletion)
in a file.

Add a test to ensure that we keep looking into binary files with -S
as changing that would break backwards compatibility in unexpected ways.

Signed-off-by: Thomas Braun 
---
 t/t4209-log-pickaxe.sh | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 42cc8afd8b..d430f6f2f9 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with 
textconv filter' '
test_cmp actual expected
 '
 
+test_expect_success 'log -S looks into binary files' '
+   rm -rf .git &&
+   git init &&
+   printf "a\0b" >data.bin &&
+   git add data.bin &&
+   git commit -m "message" &&
+   git log -S a >actual &&
+   git log >expected &&
+   test_cmp actual expected
+'
+
 test_done
-- 
2.19.0.271.gfe8321ec05.dirty



[PATCH 0/2] Teach log -G to ignore binary files

2018-11-21 Thread Thomas Braun
Based on the previous discussion in [1] I've prepared patches which teach 
log -G to ignore binary files. log -S keeps its behaviour but got a test to 
ensure that.

Feedback welcome!

[1]: 
https://public-inbox.org/git/7a0992eb-adb9-a7a1-cfaa-3384bc4d3...@virtuell-zuhause.de/

Thomas Braun (2):
  log -G: Ignore binary files
  log -S: Add test which searches in binary files

 Documentation/gitdiffcore.txt |  2 +-
 diffcore-pickaxe.c|  5 +
 t/t4209-log-pickaxe.sh| 21 +
 3 files changed, 27 insertions(+), 1 deletion(-)

-- 
2.19.0.271.gfe8321ec05.dirty



[PATCH v1 1/2] log -G: Ignore binary files

2018-11-21 Thread Thomas Braun
The -G  option of log looks for the differences whose patch text
contains added/removed lines that match regex.

The concept of differences only makes sense for text files, therefore
we need to ignore binary files when searching with -G  as well.

Signed-off-by: Thomas Braun 
---
 Documentation/gitdiffcore.txt |  2 +-
 diffcore-pickaxe.c|  5 +
 t/t4209-log-pickaxe.sh| 22 ++
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c0a60f3158..059ddd3431 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches 
the given
 regular expression.  This means that it will detect in-file (or what
 rename-detection considers the same file) moves, which is noise.  The
 implementation runs diff twice and greps, and this can be quite
-expensive.
+expensive.  Binary files without textconv filter are ignored.
 
 When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
 that match their respective criterion are kept in the output.  When
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69fc55ea1e..8c2558b07d 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct 
diff_options *o,
textconv_two = get_textconv(o->repo->index, p->two);
}
 
+   if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
+   ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
+(!textconv_two && diff_filespec_is_binary(o->repo, p->two
+   return 0;
+
/*
 * If we have an unmodified pair, we know that the count will be the
 * same and don't even have to load the blobs. Unless textconv is in
diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..42cc8afd8b 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing 
textconv tool)' '
rm .gitattributes
 '
 
+test_expect_success 'log -G ignores binary files' '
+   rm -rf .git &&
+   git init &&
+   printf "a\0b" >data.bin &&
+   git add data.bin &&
+   git commit -m "message" &&
+   git log -G a >result &&
+   test_must_be_empty result
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+   rm -rf .git &&
+   git init &&
+   echo "* diff=bin" > .gitattributes &&
+   printf "a\0b" >data.bin &&
+   git add data.bin &&
+   git commit -m "message" &&
+   git -c diff.bin.textconv=cat log -G a >actual &&
+   git log >expected &&
+   test_cmp actual expected
+'
+
 test_done
-- 
2.19.0.271.gfe8321ec05.dirty



[PATCH 0/2] Teach log -G to ignore binary files

2018-11-21 Thread Thomas Braun
Based on the previous discussion in [1] I've prepared patches which teach
log -G to ignore binary files. log -S keeps its behaviour but got a test to 
ensure that.

Feedback welcome!

[1]: 
https://public-inbox.org/git/7a0992eb-adb9-a7a1-cfaa-3384bc4d3...@virtuell-zuhause.de/

PS: This is the (possibly missing) cover letter.


RE: [Question] Signature calculation ignoring parts of binary files

2018-09-13 Thread Randall S. Becker
On September 13, 2018 1:52 PM, Junio C Hamano wrote:
> Junio C Hamano  writes:
> 
> > "Randall S. Becker"  writes:
> >
> >> The scenario is slightly different.
> >> 1. Person A gives me a new binary file-1 with fingerprint A1. This
> >> goes into git unchanged.
> >> 2. Person B gives me binary file-2 with fingerprint B2. This does not
> >> go into git yet.
> >> 3. We attempt a git diff between the committed file-1 and uncommitted
> >> file-2 using a textconv implementation that strips what we don't need
to
> compare.
> >> 4. If file-1 and file-2 have no difference when textconv is used,
> >> file-2 is not added and not committed. It is discarded with impunity,
> >> never to be seen again, although we might whine a lot at the user for
> >> attempting to put
> >> file-2 in - but that's not git's issue.
> >
> > You are forgetting that Git is a distributed version control system,
> > aren't you?  Person A and B can introduce their "moral equivalent but
> > bytewise different" copies to their repository under the same object
> > name, and you can pull from them--what happens?
> >
> > It is fundamental that one object name given to Git identifies one
> > specific byte sequence contained in an object uniquely.  Once you
> > broke that, you no longer have Git.
> 
> Having said all that, if you want to keep the original with frills but
somehow
> give these bytewise different things that reduce to the same essence (e.g.
> when passed thru a filter like textconv), I suspect a better approach
might be
> to store both the "original" and the result of passing the "original"
through
> the filter in the object database.  In the above example, you'll get two
> "original"
> objects from person A and person B, plus one "canonical" object that are
> bytewise different from either of these two originals, but what they
reduce
> to when you use the filter on them.  Then you record the fact that to
derive
> the "essence" object, you can reduce either person A's or person B's
> "original" through the filter, perhaps by using "git notes" attached to
the
> "essence" object, recording the object names of these originals (the
reason
> why using notes in this direction is because you can mechanically
determine
> which "essence"
> object any given "original" object reduces to---it is just the matter of
passing
> it through the filter.  But there can be more than one "original" that
reduces
> to the same "essence").

I like that idea. It turns the reduced object into a contract. Thanks.



Re: [Question] Signature calculation ignoring parts of binary files

2018-09-13 Thread Junio C Hamano
Junio C Hamano  writes:

> "Randall S. Becker"  writes:
>
>> The scenario is slightly different.
>> 1. Person A gives me a new binary file-1 with fingerprint A1. This goes into
>> git unchanged.
>> 2. Person B gives me binary file-2 with fingerprint B2. This does not go
>> into git yet.
>> 3. We attempt a git diff between the committed file-1 and uncommitted file-2
>> using a textconv implementation that strips what we don't need to compare.
>> 4. If file-1 and file-2 have no difference when textconv is used, file-2 is
>> not added and not committed. It is discarded with impunity, never to be seen
>> again, although we might whine a lot at the user for attempting to put
>> file-2 in - but that's not git's issue.
>
> You are forgetting that Git is a distributed version control system,
> aren't you?  Person A and B can introduce their "moral equivalent
> but bytewise different" copies to their repository under the same
> object name, and you can pull from them--what happens?
>
> It is fundamental that one object name given to Git identifies one
> specific byte sequence contained in an object uniquely.  Once you
> broke that, you no longer have Git.

Having said all that, if you want to keep the original with frills
but somehow give these bytewise different things that reduce to the
same essence (e.g. when passed thru a filter like textconv), I
suspect a better approach might be to store both the "original" and
the result of passing the "original" through the filter in the
object database.  In the above example, you'll get two "original"
objects from person A and person B, plus one "canonical" object that
are bytewise different from either of these two originals, but what
they reduce to when you use the filter on them.  Then you record the
fact that to derive the "essence" object, you can reduce either
person A's or person B's "original" through the filter, perhaps by
using "git notes" attached to the "essence" object, recording the
object names of these originals (the reason why using notes in this
direction is because you can mechanically determine which "essence"
object any given "original" object reduces to---it is just the
matter of passing it through the filter.  But there can be more than
one "original" that reduces to the same "essence").



RE: [Question] Signature calculation ignoring parts of binary files

2018-09-13 Thread Randall S. Becker
On September 13, 2018 11:03 AM, Junio C Hamano wrote:
> "Randall S. Becker"  writes:
> 
> > The scenario is slightly different.
> > 1. Person A gives me a new binary file-1 with fingerprint A1. This
> > goes into git unchanged.
> > 2. Person B gives me binary file-2 with fingerprint B2. This does not
> > go into git yet.
> > 3. We attempt a git diff between the committed file-1 and uncommitted
> > file-2 using a textconv implementation that strips what we don't need to
> compare.
> > 4. If file-1 and file-2 have no difference when textconv is used,
> > file-2 is not added and not committed. It is discarded with impunity,
> > never to be seen again, although we might whine a lot at the user for
> > attempting to put
> > file-2 in - but that's not git's issue.
> 
> You are forgetting that Git is a distributed version control system,
aren't you?
> Person A and B can introduce their "moral equivalent but bytewise
different"
> copies to their repository under the same object name, and you can pull
from
> them--what happens?
> 
> It is fundamental that one object name given to Git identifies one
specific
> byte sequence contained in an object uniquely.  Once you broke that, you
no
> longer have Git.

At that point I have a morally questionable situation, agreed. However, both
are permitted to exist in the underlying tree without conflict in git -
which I do consider a legitimately possible situation that will not break
the application at all - although there is a semantic conflict in the
application (not in git) that requires human decision to resolve. The fact
that both objects can exist in git with different fingerprints is a good
thing because it provides immutable evidence and ownership of someone
bypassing the intent of the application.

So, rather than using textconv, I shall implement this rule in the
application rather than trying to configure git to do it. If two conflicting
objects enter the commit history, the application will have the
responsibility to resolve the semantic/legal conflict.

Thanks,
Randall




Re: [Question] Signature calculation ignoring parts of binary files

2018-09-13 Thread Junio C Hamano
"Randall S. Becker"  writes:

> The scenario is slightly different.
> 1. Person A gives me a new binary file-1 with fingerprint A1. This goes into
> git unchanged.
> 2. Person B gives me binary file-2 with fingerprint B2. This does not go
> into git yet.
> 3. We attempt a git diff between the committed file-1 and uncommitted file-2
> using a textconv implementation that strips what we don't need to compare.
> 4. If file-1 and file-2 have no difference when textconv is used, file-2 is
> not added and not committed. It is discarded with impunity, never to be seen
> again, although we might whine a lot at the user for attempting to put
> file-2 in - but that's not git's issue.

You are forgetting that Git is a distributed version control system,
aren't you?  Person A and B can introduce their "moral equivalent
but bytewise different" copies to their repository under the same
object name, and you can pull from them--what happens?

It is fundamental that one object name given to Git identifies one
specific byte sequence contained in an object uniquely.  Once you
broke that, you no longer have Git.



RE: [Question] Signature calculation ignoring parts of binary files

2018-09-13 Thread Randall S. Becker
On September 12, 2018 7:00 PM, Junio C Hamano wrote:
> "Randall S. Becker"  writes:
> 
> >> author is important to our process. My objective is to keep the
> >> original file 100% exact as supplied and then ignore any changes to
> >> the metadata that I don't care about (like Creator) if the remainder of
the
> file is the same.
> 
> That will *not* work.  If person A gave you a version of original, which
> hashes to X after you strip the cruft you do not care about, you would
> register that original with person A's fingerprint on under the name of X.
> What happens when person B gives you another version, which is not byte-
> for-byte identical to the one you got earlier from person A, but does hash
to
> the same X after you strip the cruft?  If you are going to store it in
Git, and if
> by SHA-1 you are calling what we perceive as "object name" in Git land,
you
> must store that one with person B's fingerprint on it also under the name
of
> X.  Now which version will you get from Git when you ask it to give you
the
> object that hashes to X?

The scenario is slightly different.
1. Person A gives me a new binary file-1 with fingerprint A1. This goes into
git unchanged.
2. Person B gives me binary file-2 with fingerprint B2. This does not go
into git yet.
3. We attempt a git diff between the committed file-1 and uncommitted file-2
using a textconv implementation that strips what we don't need to compare.
4. If file-1 and file-2 have no difference when textconv is used, file-2 is
not added and not committed. It is discarded with impunity, never to be seen
again, although we might whine a lot at the user for attempting to put
file-2 in - but that's not git's issue.
5. If file-1 and file-2 have differences when textconv is used, file-2 is
committed with fingerprint B2.
6. Even if an error is made by the user and they commit file-2 with B2
regardless of textconv, there will be a human who complains about it, but
git has two unambiguous fingerprints that happen to have no diffs after
textconv is applied.

My original hope was that textconv could be used to influence the
fingerprint, but I do not think that is the case, so I went with an
alternative. In the application, I am not allowed to strip any cruft off
file-1 when it is stored - it must be byte-for-byte the original file. This
application is marginally related to a DRM-like situation where we only care
about the original image provided by a user, but any copies that are
provided by another user with modified metadata will be disallowed from
repository.

Does that make more sense? 

Cheers,
Randall



Re: [Question] Signature calculation ignoring parts of binary files

2018-09-12 Thread Junio C Hamano
"Randall S. Becker"  writes:

>> author is important to our process. My objective is to keep the original file
>> 100% exact as supplied and then ignore any changes to the metadata that I
>> don't care about (like Creator) if the remainder of the file is the same.

That will *not* work.  If person A gave you a version of original,
which hashes to X after you strip the cruft you do not care about,
you would register that original with person A's fingerprint on
under the name of X.  What happens when person B gives you another
version, which is not byte-for-byte identical to the one you got
earlier from person A, but does hash to the same X after you strip
the cruft?  If you are going to store it in Git, and if by SHA-1 you
are calling what we perceive as "object name" in Git land, you must
store that one with person B's fingerprint on it also under the name
of X.  Now which version will you get from Git when you ask it to
give you the object that hashes to X?  


RE: [Question] Signature calculation ignoring parts of binary files

2018-09-12 Thread Randall S. Becker
On September 12, 2018 4:54 PM, I wrote:
> On September 12, 2018 4:48 PM, Johannes Sixt wrote:
> > Am 12.09.18 um 21:16 schrieb Randall S. Becker:
> > > I feel really bad asking this, and I should know the answer, and yet.
> > >
> > > I have a binary file that needs to go into a repo intact (unchanged).
> > > I also have a program that interprets the contents, like a textconv,
> > > that can output the relevant portions of the file in whatever format
> > > I like - used for diff typically, dumps in 1K chunks by file section.
> > > What I'm looking for is to have the SHA1 signature calculated with
> > > just the relevant portions of the file so that two actually
> > > different files will be considered the same by git during a commit
> > > or status. In real terms, I'm trying to ignore the Creator metadata
> > > of a JPG because it is mutable and irrelevant to my repo contents.
> > >
> > > I'm sorry to ask, but I thought this was in .gitattributes but I
> > > can't confirm the SHA1 behaviour.
> >
> > You are looking for a clean filter. See the 'filter' attribute in 
> > gitattributes(5).
> > Your clean filter program or script should strip the unwanted metadata
> > or set it to a constant known-good value.
> >
> > (You shouldn't need a smudge filter.)
> >
> > -- Hannes
> 
> Thanks Hannes. I thought about the clean filter, but I don't actually want to
> modify the file when going into git, just for SHA calculation. I need to be 
> able
> to keep some origin metadata that might change with subsequent copies, so
> just cleaning the origin is not going to work - actually knowing the original
> author is important to our process. My objective is to keep the original file
> 100% exact as supplied and then ignore any changes to the metadata that I
> don't care about (like Creator) if the remainder of the file is the same.

I had a thought that might be workable, opinions are welcome on this.

The commit of my rather weird project is done by a script so I have flexibility 
in my approach. What I could do is set up a diff textconv configuration so that 
the text diff of the two JPG files will show no differences if the immutable 
fields and the image are the same. I can then trigger a git add and git commit 
for only those files where git diff reports no differences. That way the actual 
original file is stored in git with 100% fidelity (no cleaning). It's not as 
elegant as I'd like, but it does solve what I'm trying to do. Does this sound 
reasonable and/or is there a better way?

Cheers,
Randall

-- Brief whoami:
 NonStop developer since approximately 2112884442
 UNIX developer since approximately 421664400
-- In my real life, I talk too much.





RE: [Question] Signature calculation ignoring parts of binary files

2018-09-12 Thread Randall S. Becker
> -Original Message-
> From: git-ow...@vger.kernel.org  On Behalf
> Of Johannes Sixt
> Sent: September 12, 2018 4:48 PM
> To: Randall S. Becker 
> Cc: git@vger.kernel.org
> Subject: Re: [Question] Signature calculation ignoring parts of binary files
> 
> Am 12.09.18 um 21:16 schrieb Randall S. Becker:
> > I feel really bad asking this, and I should know the answer, and yet.
> >
> > I have a binary file that needs to go into a repo intact (unchanged).
> > I also have a program that interprets the contents, like a textconv,
> > that can output the relevant portions of the file in whatever format I
> > like - used for diff typically, dumps in 1K chunks by file section.
> > What I'm looking for is to have the SHA1 signature calculated with
> > just the relevant portions of the file so that two actually different
> > files will be considered the same by git during a commit or status. In
> > real terms, I'm trying to ignore the Creator metadata of a JPG because
> > it is mutable and irrelevant to my repo contents.
> >
> > I'm sorry to ask, but I thought this was in .gitattributes but I can't
> > confirm the SHA1 behaviour.
> 
> You are looking for a clean filter. See the 'filter' attribute in 
> gitattributes(5).
> Your clean filter program or script should strip the unwanted metadata or set
> it to a constant known-good value.
> 
> (You shouldn't need a smudge filter.)
> 
> -- Hannes

Thanks Hannes. I thought about the clean filter, but I don't actually want to 
modify the file when going into git, just for SHA calculation. I need to be 
able to keep some origin metadata that might change with subsequent copies, so 
just cleaning the origin is not going to work - actually knowing the original 
author is important to our process. My objective is to keep the original file 
100% exact as supplied and then ignore any changes to the metadata that I don't 
care about (like Creator) if the remainder of the file is the same.

Regards,
Randall




Re: [Question] Signature calculation ignoring parts of binary files

2018-09-12 Thread Johannes Sixt

Am 12.09.18 um 21:16 schrieb Randall S. Becker:

I feel really bad asking this, and I should know the answer, and yet.

I have a binary file that needs to go into a repo intact (unchanged). I also
have a program that interprets the contents, like a textconv, that can
output the relevant portions of the file in whatever format I like - used
for diff typically, dumps in 1K chunks by file section. What I'm looking for
is to have the SHA1 signature calculated with just the relevant portions of
the file so that two actually different files will be considered the same by
git during a commit or status. In real terms, I'm trying to ignore the
Creator metadata of a JPG because it is mutable and irrelevant to my repo
contents.

I'm sorry to ask, but I thought this was in .gitattributes but I can't
confirm the SHA1 behaviour.


You are looking for a clean filter. See the 'filter' attribute in 
gitattributes(5). Your clean filter program or script should strip the 
unwanted metadata or set it to a constant known-good value.


(You shouldn't need a smudge filter.)

-- Hannes


[Question] Signature calculation ignoring parts of binary files

2018-09-12 Thread Randall S. Becker
I feel really bad asking this, and I should know the answer, and yet.

I have a binary file that needs to go into a repo intact (unchanged). I also
have a program that interprets the contents, like a textconv, that can
output the relevant portions of the file in whatever format I like - used
for diff typically, dumps in 1K chunks by file section. What I'm looking for
is to have the SHA1 signature calculated with just the relevant portions of
the file so that two actually different files will be considered the same by
git during a commit or status. In real terms, I'm trying to ignore the
Creator metadata of a JPG because it is mutable and irrelevant to my repo
contents.

I'm sorry to ask, but I thought this was in .gitattributes but I can't
confirm the SHA1 behaviour.

Sheepishly,
Randall


-- Brief whoami:
 NonStop developer since approximately 2112884442
 UNIX developer since approximately 421664400
-- In my real life, I talk too much.





Re: How to speedup git clone for big binary files (disable delta compression)

2018-07-18 Thread Jeff King
On Thu, Jul 19, 2018 at 12:05:00AM +0200, René Scheibe wrote:

> Code:
> -
> #!/bin/bash
> 
> # setup repository
> git init --quiet repo
> cd repo
> 
> echo '*.bin binary -delta' > .gitattributes
> git add .gitattributes
> git commit --quiet -m 'attributes'
> 
> for i in $(seq 10); do
> dd if=/dev/urandom of=data.bin bs=1MB count=10 status=none
> git add data.bin
> git commit --quiet -m "data $i"
> done
> cd ..
> 
> # create clone repository
> time git clone --no-local repo clone

This clone won't respect those attributes, because we don't dig into
in-repo attributes. There's actually some inconsistency in how Git
handles attribute locations. Usually they're just read from the top of
the working tree, but in some instances we read them from the tree
itself (e.g., git-archive respects some attributes from the tree it's
archiving).

If you do:

  echo "*.bin binary -delta" >repo/.git/info/attributes

then that does work (we always respect repo-level attributes like that).

> # repack original repository
> cd repo
> time git repack -a -d

In this case we're reading the attributes from the working tree, and it
does work. In theory the clone case could do so, too, but git-upload-pack,
the server side of the clone, avoids looking at the working tree at all.
That's something we _could_ address, but it doesn't really fix the
general case, since most clones will be from a bare repository anyway.

So in summary:

  1. Depending on what you're trying to do, the .git/info/attributes
 trick might be enough for you.

  2. I do think it would be nice for more places to respect attributes
 from in trees. There's a question of which tree, but I think in
 general reading them from HEAD in a bare repository would do what
 people want (it's a little funny if you're fetching branch "foo",
 but HEAD points to "bar", but it's at least consistent with the
 non-bare case). There's some prior art in the way we treat mailmaps
 (in a bare repo, we read HEAD:.mailmap).

 I suspect the patch may not be trivial, as I don't know how ready
 the attributes code is to handle in-tree lookups (remember that it
 is not just HEAD:.gitattributes we must care about, but other files
 sprinkled through the repository, like "HEAD:subdir/.gitattributes".

-Peff


How to speedup git clone for big binary files (disable delta compression)

2018-07-18 Thread René Scheibe
Hi,

I was wondering why "git clone" seems to not respect "-delta" in .gitattributes.


*Reproduction*

I prepared a test repository with:

- git v2.17.1
- .gitattributes containing "*.bin binary -delta"
- 10 commits with a 10 MB random binary file

Code:
-
#!/bin/bash

# setup repository
git init --quiet repo
cd repo

echo '*.bin binary -delta' > .gitattributes
git add .gitattributes
git commit --quiet -m 'attributes'

for i in $(seq 10); do
dd if=/dev/urandom of=data.bin bs=1MB count=10 status=none
git add data.bin
git commit --quiet -m "data $i"
done
cd ..

# create clone repository
time git clone --no-local repo clone

# repack original repository
cd repo
time git repack -a -d
-

Output:
-
Cloning into 'clone'...
remote: Counting objects: 33, done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 33 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (33/33), 95.40 MiB | 19.94 MiB/s, done.

real0m25,085s
user0m22,749s
sys 0m0,948s

Counting objects: 33, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (21/21), done.
Writing objects: 100% (33/33), done.
Total 33 (delta 0), reused 0 (delta 0)

real0m5,652s
user0m4,173s
sys 0m0,178s
-


*Observations*

_time_

- Cloning: "clone" takes always 25s
- Optimizing: "repack" takes 25s with and 5s without delta compression

_compressed objects_

- Cloning: "clone" compresses always 31 objects
- Optimizing: "repack" compresses 31 objects with and 21 objects without delta 
compression


*Expectations*

Both operations ("repack" and "clone") are using "pack-objects".

Therefore my expectation is that "clone" should respect "-delta" and be about 
as fast as "repack".


Cheers,
  René


Re: Add option to git to ignore binary files unless force added

2018-05-18 Thread Anmol Sethi
How about a hook to ignore certain files? Then you could ignore based on the 
contents of the fail instead of just the extension. It’d be very flexible.

> On May 18, 2018, at 2:09 PM, Jacob Keller <jacob.kel...@gmail.com> wrote:
> 
> On Fri, May 18, 2018 at 4:31 AM, Anmol Sethi <m...@anmol.io> wrote:
>> This definitely works but it would be more clean to just have git ignore the 
>> binary files from the get go.
>> 
> 
> Sure it'd be more convenient for you. But there are loads of possible
> combinations, and the idea of what constitutes unwanted files is
> hugely variable to each user. We don't really want to end up
> supporting a million different ways to do this, and the hooks
> interface provides a reasonable method for rejecting commits with
> unwanted contents.
> 
> 
> Thanks,
> Jake

-- 
Best,
Anmol



Re: Add option to git to ignore binary files unless force added

2018-05-18 Thread Jacob Keller
On Fri, May 18, 2018 at 4:31 AM, Anmol Sethi <m...@anmol.io> wrote:
> This definitely works but it would be more clean to just have git ignore the 
> binary files from the get go.
>

Sure it'd be more convenient for you. But there are loads of possible
combinations, and the idea of what constitutes unwanted files is
hugely variable to each user. We don't really want to end up
supporting a million different ways to do this, and the hooks
interface provides a reasonable method for rejecting commits with
unwanted contents.


Thanks,
Jake


RE: Add option to git to ignore binary files unless force added

2018-05-18 Thread Randall S. Becker
On May 18, 2018 7:31 AM, Anmol Sethi <m...@anmol.io>
> That works but most binaries do not have a file extension. Its just not
> standard on linux.
> 
> > On May 17, 2018, at 8:37 AM, Randall S. Becker <rsbec...@nexbridge.com>
> wrote:
> >
> > On May 16, 2018 11:18 PM, Jacob Keller
> >> On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote:
> >>> I think it’d be great to have an option to have git ignore binary
> >>> files. My
> >> repositories are always source only, committing a binary is always a
> mistake.
> >> At the moment, I have to configure the .gitignore to ignore every
> >> binary file and that gets tedious. Having git ignore all binary files 
> >> would be
> great.
> >>>
> >>> This could be achieved via an option in .gitconfig or maybe a
> >>> special line in
> >> .gitignore.
> >>>
> >>> I just want to never accidentally commit a binary again.
> >>
> >> I believe you can do a couple things. There should be a hook which
> >> you can modify to validate that there are no binary files on
> >> pre-commit[1], or pre- push[2] to verify that you never push commits
> with binaries in them.
> >>
> >> You could also implement the update hook on the server if you control
> >> it, to allow it to block pushes which contain binary files.
> >
> > What about configuring ${HOME}/.config/git/ignore instead (described at
> https://git-scm.com/docs/gitignore). Inside, put:
> >
> > *.o
> > *.exe
> > *.bin
> > *.dat
> > Etc

I have a similar problem on my platform, with a different solution. My builds 
involve GCC binaries, NonStop L-series binaries (x86), and a NonStop J-series 
binaries (itanium). To keep me sane, I have all build targets going to separate 
directories, like Build/GCC, Build/Lbin, Build/Jbin away from the sources. This 
allows me to ignore Build/ regardless of extension and also to build different 
targets without link collisions. This is similar to how Java works (a.k.a. 
bin/). Much more workable, IMHO, than trying to manage individual binaries name 
by name or even by extension. I also have a mix of jpg and UTF-16 HTML that 
would end up in false-positives on a pure binary match and I do want to manage 
those.

What helps me is that I do most of my work in ECLIPSE, so derived resources 
(objects, generated sources) get auto-ignored by EGit, if you can make your 
compiler arrange that - but that's an ECLIPSE thing not a file system thing.

Cheers,
Randall

-- Brief whoami:
 NonStop developer since approximately 2112884442
 UNIX developer since approximately 421664400
-- In my real life, I talk too much.





Re: Add option to git to ignore binary files unless force added

2018-05-18 Thread Anmol Sethi
This definitely works but it would be more clean to just have git ignore the 
binary files from the get go.

> On May 16, 2018, at 11:18 PM, Jacob Keller <jacob.kel...@gmail.com> wrote:
> 
> On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote:
>> I think it’d be great to have an option to have git ignore binary files. My 
>> repositories are always source only, committing a binary is always a 
>> mistake. At the moment, I have to configure the .gitignore to ignore every 
>> binary file and that gets tedious. Having git ignore all binary files would 
>> be great.
>> 
>> This could be achieved via an option in .gitconfig or maybe a special line 
>> in .gitignore.
>> 
>> I just want to never accidentally commit a binary again.
>> 
>> --
>> Best,
>> Anmol
>> 
> 
> I believe you can do a couple things. There should be a hook which you
> can modify to validate that there are no binary files on
> pre-commit[1], or pre-push[2] to verify that you never push commits
> with binaries in them.
> 
> You could also implement the update hook on the server if you control
> it, to allow it to block pushes which contain binary files.
> 
> Thanks,
> Jake
> 
> [1]https://git-scm.com/docs/githooks#_pre_commit
> [2]https://git-scm.com/docs/githooks#_pre_push
> [3]https://git-scm.com/docs/githooks#update

-- 
Best,
Anmol



Re: Add option to git to ignore binary files unless force added

2018-05-18 Thread Anmol Sethi
That works but most binaries do not have a file extension. Its just not 
standard on linux.

> On May 17, 2018, at 8:37 AM, Randall S. Becker <rsbec...@nexbridge.com> wrote:
> 
> On May 16, 2018 11:18 PM, Jacob Keller
>> On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote:
>>> I think it’d be great to have an option to have git ignore binary files. My
>> repositories are always source only, committing a binary is always a mistake.
>> At the moment, I have to configure the .gitignore to ignore every binary file
>> and that gets tedious. Having git ignore all binary files would be great.
>>> 
>>> This could be achieved via an option in .gitconfig or maybe a special line 
>>> in
>> .gitignore.
>>> 
>>> I just want to never accidentally commit a binary again.
>> 
>> I believe you can do a couple things. There should be a hook which you can
>> modify to validate that there are no binary files on pre-commit[1], or pre-
>> push[2] to verify that you never push commits with binaries in them.
>> 
>> You could also implement the update hook on the server if you control it, to
>> allow it to block pushes which contain binary files.
> 
> What about configuring ${HOME}/.config/git/ignore instead (described at 
> https://git-scm.com/docs/gitignore). Inside, put:
> 
> *.o
> *.exe
> *.bin
> *.dat
> Etc
> 
> Cheers,
> Randall
> 
> 

-- 
Best,
Anmol



RE: Add option to git to ignore binary files unless force added

2018-05-17 Thread Randall S. Becker
On May 16, 2018 11:18 PM, Jacob Keller
> On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote:
> > I think it’d be great to have an option to have git ignore binary files. My
> repositories are always source only, committing a binary is always a mistake.
> At the moment, I have to configure the .gitignore to ignore every binary file
> and that gets tedious. Having git ignore all binary files would be great.
> >
> > This could be achieved via an option in .gitconfig or maybe a special line 
> > in
> .gitignore.
> >
> > I just want to never accidentally commit a binary again.
> 
> I believe you can do a couple things. There should be a hook which you can
> modify to validate that there are no binary files on pre-commit[1], or pre-
> push[2] to verify that you never push commits with binaries in them.
> 
> You could also implement the update hook on the server if you control it, to
> allow it to block pushes which contain binary files.

What about configuring ${HOME}/.config/git/ignore instead (described at 
https://git-scm.com/docs/gitignore). Inside, put:

*.o
*.exe
*.bin
*.dat
Etc

Cheers,
Randall




Re: Add option to git to ignore binary files unless force added

2018-05-16 Thread Jacob Keller
On Wed, May 16, 2018 at 5:45 PM, Anmol Sethi <m...@anmol.io> wrote:
> I think it’d be great to have an option to have git ignore binary files. My 
> repositories are always source only, committing a binary is always a mistake. 
> At the moment, I have to configure the .gitignore to ignore every binary file 
> and that gets tedious. Having git ignore all binary files would be great.
>
> This could be achieved via an option in .gitconfig or maybe a special line in 
> .gitignore.
>
> I just want to never accidentally commit a binary again.
>
> --
> Best,
> Anmol
>

I believe you can do a couple things. There should be a hook which you
can modify to validate that there are no binary files on
pre-commit[1], or pre-push[2] to verify that you never push commits
with binaries in them.

You could also implement the update hook on the server if you control
it, to allow it to block pushes which contain binary files.

Thanks,
Jake

[1]https://git-scm.com/docs/githooks#_pre_commit
[2]https://git-scm.com/docs/githooks#_pre_push
[3]https://git-scm.com/docs/githooks#update


Add option to git to ignore binary files unless force added

2018-05-16 Thread Anmol Sethi
I think it’d be great to have an option to have git ignore binary files. My 
repositories are always source only, committing a binary is always a mistake. 
At the moment, I have to configure the .gitignore to ignore every binary file 
and that gets tedious. Having git ignore all binary files would be great.

This could be achieved via an option in .gitconfig or maybe a special line in 
.gitignore.

I just want to never accidentally commit a binary again.

-- 
Best,
Anmol



Re: Binary files

2017-07-21 Thread Igor Djordjevic
On 20/07/2017 22:40, Junio C Hamano wrote:
> Igor Djordjevic <igor.d.djordje...@gmail.com> writes:
>> On 20/07/2017 09:41, Volodymyr Sendetskyi wrote:
>>> It is known, that git handles badly storing binary files in its
>>> repositories at all.
>>> This is especially about large files: even without any changes to
>>> these files, their copies are snapshotted on each commit. So even
>>> repositories with a small amount of code can grove very fast in size
>>> if they contain some great binary files. Alongside this, the SVN is
>>> much better about that, because it make changes to the server version
>>> of file only if some changes were done.
>>
>> You already got some proposals on what you could try for making large 
>> binary files handling easier, but I just wanted to comment on this 
>> part of your message, as it doesn`t seem to be correct.
> 
> All correct.  Thanks.

No problem, thanks for confirmation, being relatively new around it`s 
appreciated, at least knowing that I got it correct myself :)


Re: Binary files

2017-07-20 Thread Junio C Hamano
Igor Djordjevic <igor.d.djordje...@gmail.com> writes:

> On 20/07/2017 09:41, Volodymyr Sendetskyi wrote:
>> It is known, that git handles badly storing binary files in its
>> repositories at all.
>> This is especially about large files: even without any changes to
>> these files, their copies are snapshotted on each commit. So even
>> repositories with a small amount of code can grove very fast in size
>> if they contain some great binary files. Alongside this, the SVN is
>> much better about that, because it make changes to the server version
>> of file only if some changes were done.
>
> You already got some proposals on what you could try for making large 
> binary files handling easier, but I just wanted to comment on this 
> part of your message, as it doesn`t seem to be correct.

All correct.  Thanks.


Re: Binary files

2017-07-20 Thread Igor Djordjevic
Hi Volodymyr,

On 20/07/2017 09:41, Volodymyr Sendetskyi wrote:
> It is known, that git handles badly storing binary files in its
> repositories at all.
> This is especially about large files: even without any changes to
> these files, their copies are snapshotted on each commit. So even
> repositories with a small amount of code can grove very fast in size
> if they contain some great binary files. Alongside this, the SVN is
> much better about that, because it make changes to the server version
> of file only if some changes were done.

You already got some proposals on what you could try for making large 
binary files handling easier, but I just wanted to comment on this 
part of your message, as it doesn`t seem to be correct.

Even though each repository file is included in each commit (being a 
full repository state snapshot), meaning big binary files as well, 
that`s just from an end-user`s perspective.

Actual implementation side is smarter than that - if file hasn`t 
changed between commits, it won`t get copied/written to Git object 
database again.

Under the hood, many different commits can point to the same 
(unchanged) file, thus repository size _does not_ grow very fast with 
each commit if large binary file is without any changes.

Usually, the biggest concern with Git and large files[1], in 
comparison to SVN, for example, is something else - Git model 
assuming each repository clone holding the complete repository 
history with all the different file versions included, so you can`t 
get just some of them, or the last snapshot only, keeping your local 
repository small in size.

If the repository you`re cloning from is a big one, your locally 
cloned repository will be as well, even if you may not really be 
interested in the big files at all... but you got some suggestions 
for handling that already, as pointed out :)

Just note that it`s not really Git vs SVN here, but more distributed 
vs centralized approach in general, as you can`t both have everything 
and yet skip something at the same time. Different systems may have 
different workarounds for a specific workflow, though.

[1] Besides taking each file version as a full-sized snapshot (at the 
beginning, at least, until the delta compression packing occurs).

Regards,
Buga


Re: Binary files

2017-07-20 Thread Stefan Beller
On Thu, Jul 20, 2017 at 12:41 AM, Volodymyr Sendetskyi
<volodymy...@devcom.com> wrote:
> It is known, that git handles badly storing binary files in its
> repositories at all.
> This is especially about large files: even without any changes to
> these files, their copies are snapshotted on each commit. So even
> repositories with a small amount of code can grove very fast in size
> if they contain some great binary files. Alongside this, the SVN is
> much better about that, because it make changes to the server version
> of file only if some changes were done.
>
> So the question is: why not implementing some feature, that would
> somehow handle this problem?

There are 'external' solutions such as git LFS and git-annex, mentioned
in replies nearby.

But note there are also efforts to handle large binary files internally
https://public-inbox.org/git/3420d9ae9ef86b78af1abe721891233e3f5865a2.1500508695.git.jonathanta...@google.com/
https://public-inbox.org/git/20170713173459.3559-1-...@jeffhostetler.com/
https://public-inbox.org/git/20170620075523.26961-1-chrisc...@tuxfamily.org/


Re: Binary files

2017-07-20 Thread Lars Schneider

> On 20 Jul 2017, at 09:41, Volodymyr Sendetskyi <volodymy...@devcom.com> wrote:
> 
> It is known, that git handles badly storing binary files in its
> repositories at all.
> This is especially about large files: even without any changes to
> these files, their copies are snapshotted on each commit. So even
> repositories with a small amount of code can grove very fast in size
> if they contain some great binary files. Alongside this, the SVN is
> much better about that, because it make changes to the server version
> of file only if some changes were done.
> 
> So the question is: why not implementing some feature, that would
> somehow handle this problem?
> 
> Of course, I don't know the internal git structure and the way of
> working + some nuances (likely about the snapshots at all and the way
> they are done), so handling this may be a great problem. But the
> easiest feature for me as an end user will be something like
> '.gitbinary', where I can list binary files, that would behave like on
> SVN, or even more optimal, if you can implement it. Maybe there will
> be a need for separate kinds of repositories, or even servers. But
> that would be a great change and a logical way of next git's
> evolution.

GitLFS [1] might be the workaround you want. There are efforts to bring 
large file support natively to Git [2].

I tried to explain GitLFS in more detail here: 
https://www.youtube.com/watch?v=YQzNfb4IwEY

- Lars


[1] https://git-lfs.github.com/
[2] https://public-inbox.org/git/20170620075523.26961-1-chrisc...@tuxfamily.org/



Re: Binary files

2017-07-20 Thread Konstantin Khomoutov
On Thu, Jul 20, 2017 at 10:41:48AM +0300, Volodymyr Sendetskyi wrote:

> It is known, that git handles badly storing binary files in its
> repositories at all.
[...]
> So the question is: why not implementing some feature, that would
> somehow handle this problem?
[...]

Have you examined git-lfs and git-annex?
(Actually, there are/were more solutions [1] but these two appear to be
the most used novadays.)

Such solutions allow one to use Git for what it does best and defer
handling of big files (or files for which lock-modify-unlock works better
than the usual modify-merge) to a specialized solution.

1. http://blog.deveo.com/storing-large-binary-files-in-git-repositories/



Re: Binary files

2017-07-20 Thread Bryan Turner
On Thu, Jul 20, 2017 at 12:41 AM, Volodymyr Sendetskyi
<volodymy...@devcom.com> wrote:
> It is known, that git handles badly storing binary files in its
> repositories at all.
> This is especially about large files: even without any changes to
> these files, their copies are snapshotted on each commit. So even
> repositories with a small amount of code can grove very fast in size
> if they contain some great binary files. Alongside this, the SVN is
> much better about that, because it make changes to the server version
> of file only if some changes were done.
>
> So the question is: why not implementing some feature, that would
> somehow handle this problem?

Like Git LFS or git annex? Features have been implemented to better
handle large files; they're just not necessarily part of core Git.
Have you checked whether one of those solutions might work for your
use case?

Best regards,
Bryan Turner


Re: Binary files

2017-07-20 Thread Volodymyr Sendetskyi
It is known, that git handles badly storing binary files in its
repositories at all.
This is especially about large files: even without any changes to
these files, their copies are snapshotted on each commit. So even
repositories with a small amount of code can grove very fast in size
if they contain some great binary files. Alongside this, the SVN is
much better about that, because it make changes to the server version
of file only if some changes were done.

So the question is: why not implementing some feature, that would
somehow handle this problem?

Of course, I don't know the internal git structure and the way of
working + some nuances (likely about the snapshots at all and the way
they are done), so handling this may be a great problem. But the
easiest feature for me as an end user will be something like
'.gitbinary', where I can list binary files, that would behave like on
SVN, or even more optimal, if you can implement it. Maybe there will
be a need for separate kinds of repositories, or even servers. But
that would be a great change and a logical way of next git's
evolution.


Re: log -S/-G (aka pickaxe) searches binary files by default

2017-03-03 Thread Thomas Braun
Am 03.03.2017 um 17:07 schrieb Junio C Hamano:
> Jeff King <p...@peff.net> writes:
> 
>> On Thu, Mar 02, 2017 at 05:36:17PM -0800, Junio C Hamano wrote:
>> ...
>>>> Is that on purpose?
>>>
>>> No, it's a mere oversight (as I do not think I never even thought
>>> about special casing binary
>>> files from day one, it is unlikely that you would find _any_ old
>>> version of Git that behaves
>>> differently).
>>
>> The email focuses on "-G", and I think it is wrong to look in binary
>> files there, as "grep in diff" does not make sense for a binary file
>> that we would refuse to diff.
> 
> Yeah, I agree.
> 
>> But the subject also mentions "-S". I always assumed it was intentional
>> to look in binary files there, as it is searching for a pure byte
>> sequence. I would not mind an option to disable that, but I think the
>> default should remain on.
> 
> As the feature was built to be one of the core ingredients necessary
> towards the 'ideal SCM' envisioned in
> 
>   
> <http://public-inbox.org/git/pine.lnx.4.58.0504150753440.7...@ppc970.osdl.org>
> 
> "-S" is about finding "a block of text". It was merely an oversight
> that we didn't add explicit code to ignore binary when we introduced
> the concept of "is this text?  is it worth finding things in and
> diffing binary files?".
> 
> I do agree that it may be too late and/or disruptive to change its
> behaviour now, as people may grew expectations different from the
> original motivation and design, though.

Thanks both for the encouraging answers.

I'll try to come up with patches in the next couple of weeks for the
following changes:
"log -G": disable looking in binaries
"log -S": add option to switch looking into binaries, defaults to true

Thomas



Re: log -S/-G (aka pickaxe) searches binary files by default

2017-03-03 Thread Junio C Hamano
Jeff King <p...@peff.net> writes:

> On Thu, Mar 02, 2017 at 05:36:17PM -0800, Junio C Hamano wrote:
> ...
>> > Is that on purpose?
>> 
>> No, it's a mere oversight (as I do not think I never even thought
>> about special casing binary
>> files from day one, it is unlikely that you would find _any_ old
>> version of Git that behaves
>> differently).
>
> The email focuses on "-G", and I think it is wrong to look in binary
> files there, as "grep in diff" does not make sense for a binary file
> that we would refuse to diff.

Yeah, I agree.

> But the subject also mentions "-S". I always assumed it was intentional
> to look in binary files there, as it is searching for a pure byte
> sequence. I would not mind an option to disable that, but I think the
> default should remain on.

As the feature was built to be one of the core ingredients necessary
towards the 'ideal SCM' envisioned in

  <http://public-inbox.org/git/pine.lnx.4.58.0504150753440.7...@ppc970.osdl.org>

"-S" is about finding "a block of text". It was merely an oversight
that we didn't add explicit code to ignore binary when we introduced
the concept of "is this text?  is it worth finding things in and
diffing binary files?".

I do agree that it may be too late and/or disruptive to change its
behaviour now, as people may grew expectations different from the
original motivation and design, though.



Re: log -S/-G (aka pickaxe) searches binary files by default

2017-03-02 Thread Jeff King
On Thu, Mar 02, 2017 at 05:36:17PM -0800, Junio C Hamano wrote:

> On Thu, Mar 2, 2017 at 4:52 PM, Thomas Braun
> <thomas.br...@virtuell-zuhause.de> wrote:
> >
> > I happen to have quite large binary files in my repos.
> >
> > Today I realized that a line like
> > git log -G a
> > searches also files found to be binary (or explicitly marked as binary).
> >
> > Is that on purpose?
> 
> No, it's a mere oversight (as I do not think I never even thought
> about special casing binary
> files from day one, it is unlikely that you would find _any_ old
> version of Git that behaves
> differently).

The email focuses on "-G", and I think it is wrong to look in binary
files there, as "grep in diff" does not make sense for a binary file
that we would refuse to diff.

But the subject also mentions "-S". I always assumed it was intentional
to look in binary files there, as it is searching for a pure byte
sequence. I would not mind an option to disable that, but I think the
default should remain on.

-Peff


Re: log -S/-G (aka pickaxe) searches binary files by default

2017-03-02 Thread Junio C Hamano
On Thu, Mar 2, 2017 at 4:52 PM, Thomas Braun
<thomas.br...@virtuell-zuhause.de> wrote:
>
> I happen to have quite large binary files in my repos.
>
> Today I realized that a line like
> git log -G a
> searches also files found to be binary (or explicitly marked as binary).
>
> Is that on purpose?

No, it's a mere oversight (as I do not think I never even thought
about special casing binary
files from day one, it is unlikely that you would find _any_ old
version of Git that behaves
differently).


log -S/-G (aka pickaxe) searches binary files by default

2017-03-02 Thread Thomas Braun
Hi,

I happen to have quite large binary files in my repos.

Today I realized that a line like
git log -G a
searches also files found to be binary (or explicitly marked as binary).

Is that on purpose?
The documentation of "-G" states

"Look for differences whose patch text contains added/removed lines that
match ."

which contradicts the current behaviour. At least for me text != binary.

To reproduce:
$ git init
$ echo -e "a\0b" > data.bin
$ git add data.bin
$ git commit -m "Add new data"
$ git log -p
[...]
diff --git a/data.bin b/data.bin
new file mode 100644
index 000..1a23e4b
Binary files /dev/null and b/data.bin differ
$ git log -G a
[...]

Add new data

I've verified the behaviour with git version 2.12.0.windows.1 and git
version 2.12.0.189.g3bc5322 on debian.

If it is on purpose is there a config option to disable that?

Thanks for reading,
Thomas


Re: Bug with disabling compression and 'binary' files.

2016-11-15 Thread Junio C Hamano
Douglas Cox  writes:

>> This may or may not be related to the symptom
>> you are observing (if it is, then you would see a packfile created
>> in objects/pack/, not in loose objects in object/??/ directories).
>
> No, the file is loose (it's in .git/objects/eb in this case). This is
> seen immediately after the add, though I believe it's the same way
> when doing a commit on a changed file.

Then I do not have a guess as to where the symptom you are seeing is
coming from.


Re: Bug with disabling compression and 'binary' files.

2016-11-15 Thread Douglas Cox
> This may or may not be related to the symptom
> you are observing (if it is, then you would see a packfile created
> in objects/pack/, not in loose objects in object/??/ directories).

No, the file is loose (it's in .git/objects/eb in this case). This is
seen immediately after the add, though I believe it's the same way
when doing a commit on a changed file.


On Tue, Nov 15, 2016 at 8:42 PM, Junio C Hamano <gits...@pobox.com> wrote:
> Douglas Cox <zif...@gmail.com> writes:
>
>> I narrowed this down to the '-text' attribute that is set when
>> specifying 'binary'.  For some reason this flag is cancelling out the
>> core.compression = 0 setting and I think this is a bug?
>>
>> Unfortunately core.compression = 0 is also global. Ideally it would be
>> great if there was a separate 'compression' attribute that could be
>> specified in .gitattributes per wildcard similar to how -delta can be
>> used. This way we would still be able to get compression for
>> text/source files, while still getting the speed of skipping
>> compression for binary files that do not compress well.
>>
>> Has there been any discussion on having an attribute similar to this?
>
> Nope.
>
> I do not offhand think of a way for '-text' attribute (or any
> attribute for what matter) to interfere with compression level, but
> while reading the various parts of the system that futz with the
> compression level configuration, I noticed one thing.  When we do an
> initial "bulk-checkin" optimization that sends all objects to a
> single packfile upon "git add", the packfile creation uses its own
> compression level that is not affected by any configuration or
> command line option.  This may or may not be related to the symptom
> you are observing (if it is, then you would see a packfile created
> in objects/pack/, not in loose objects in object/??/ directories).
>
>
>


Re: Bug with disabling compression and 'binary' files.

2016-11-15 Thread Junio C Hamano
Douglas Cox <zif...@gmail.com> writes:

> I narrowed this down to the '-text' attribute that is set when
> specifying 'binary'.  For some reason this flag is cancelling out the
> core.compression = 0 setting and I think this is a bug?
>
> Unfortunately core.compression = 0 is also global. Ideally it would be
> great if there was a separate 'compression' attribute that could be
> specified in .gitattributes per wildcard similar to how -delta can be
> used. This way we would still be able to get compression for
> text/source files, while still getting the speed of skipping
> compression for binary files that do not compress well.
>
> Has there been any discussion on having an attribute similar to this?

Nope.  

I do not offhand think of a way for '-text' attribute (or any
attribute for what matter) to interfere with compression level, but
while reading the various parts of the system that futz with the
compression level configuration, I noticed one thing.  When we do an
initial "bulk-checkin" optimization that sends all objects to a
single packfile upon "git add", the packfile creation uses its own
compression level that is not affected by any configuration or
command line option.  This may or may not be related to the symptom
you are observing (if it is, then you would see a packfile created
in objects/pack/, not in loose objects in object/??/ directories).





Bug with disabling compression and 'binary' files.

2016-11-15 Thread Douglas Cox
I was doing some experiments today with large-ish (100-200MB) binary
files and was trying to determine the best configuration for Git. Here
are the steps and timings I saw:

git init Test
cp .../largemovie.mp4 .
time git add largemovie.mp4

This took 6.5s for a 200MB file.

This file compressed a little bit, but not enough to matter, so I
wanted to see how much faster the add would be with compression
disabled. So I ran:

git config core.compression = 0

I then completely reran the test above starting with a clean
repository. This time the add took only 2.08s.  I repeated these two
tests about 10 times using the same file each time to verify it wasn't
related to disk caching, etc.

At this point I decided that this was likely a good setting for this
repository, so I also created a .gitattributes file and added a few
entries often seen in suggestions for dealing with binary files:

*.mp4 binary -delta

The goal here was to disable any delta compression done during a
gc/pack and use the other settings 'binary' applies. Unfortunately
when I ran the test again (still using compression = 0), the test was
back to taking 6.5s and the file inside the .git/objects/ folder was
compressed again.

I narrowed this down to the '-text' attribute that is set when
specifying 'binary'.  For some reason this flag is cancelling out the
core.compression = 0 setting and I think this is a bug?

Unfortunately core.compression = 0 is also global. Ideally it would be
great if there was a separate 'compression' attribute that could be
specified in .gitattributes per wildcard similar to how -delta can be
used. This way we would still be able to get compression for
text/source files, while still getting the speed of skipping
compression for binary files that do not compress well.

Has there been any discussion on having an attribute similar to this?

Thanks,
-Doug


Re: How to have EOL=LF and keep binary files auto-detection?

2016-04-26 Thread Nikolay Chashnikov
Thank you for the information.

We create products (IDEs for different programming languages) for the
three major platforms (Window/Linux/Mac OSX). Our products are written
in Java and we want to minimize differences between distributions for
different platforms, so we use LF separators for all resource files.

On Mon, Apr 25, 2016 at 7:46 PM, Torsten Bögershausen <tbo...@web.de> wrote:
> On 25.04.16 16:11, Kirill Likhodedov wrote:
>> Hi,
>>
>> I wonder if it is possible both to have LFs in all and only text files in 
>> working trees, and keep Git’s binary files auto-detection?
>>
>> To be more precise:
>> * we want all text files to be checked out in LF;
>> * we don’t want force people to set “core.autocrlf” to false, preferring to 
>> keep this configuration in .gitattributes;
>> * we obviously don’t want binary files to be touched by eol-normalization;
>> * we also don’t want to declare all possible patterns of binary files - Git 
>> is good enough in detecting them automatically.
>>
>> However, I’ve found no way to do so.
>>
>> If I declare `* eol=lf` in .gitattributes, it makes Git treat all files as 
>> text and thus convert CRLF to LF even in binary files. It is consistent with 
>> man, but a bit surprising to have e.g. a zip or png file modified in this 
>> way.
>>
>> One could expect `* text=auto eol=lf` to work the way we want, but 
>> unfortunately it doesn’t work either: “eol=lf” forces “text” on all files.
>>
>> Thanks a lot for your help!
>> -- Kirill.
>
> The short answer: Git doesn't currently do that.
> The closest you can get, is to use
> echo "* text=auto" >.gitattributes
> and
> git config core.eol lf
> git config core.autocrlf false.
>
> The longer answer is, that I am working on a patch to allow just
> the combination of "* text=auto eol=lf" to work as you want it.
>
> Which platform do you use ?
> And (out of curiosity, why do you want text files with LF ?)
>
> If you are willing to compile and install Git yourself,
> you can use the branch here:
> https://github.com/tboegi/git/commits/160421_0706_reliable_t0027_allow_TC_combined_ident_CRLF_v7
>
> Feedback is welcome, if it works as expected.
>
>
>
>



-- 
Nikolay Chashnikov
JetBrains
http://www.jetbrains.com
The Drive to Develop
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to have EOL=LF and keep binary files auto-detection?

2016-04-25 Thread Torsten Bögershausen
On 25.04.16 16:11, Kirill Likhodedov wrote:
> Hi, 
> 
> I wonder if it is possible both to have LFs in all and only text files in 
> working trees, and keep Git’s binary files auto-detection?
> 
> To be more precise:
> * we want all text files to be checked out in LF; 
> * we don’t want force people to set “core.autocrlf” to false, preferring to 
> keep this configuration in .gitattributes; 
> * we obviously don’t want binary files to be touched by eol-normalization; 
> * we also don’t want to declare all possible patterns of binary files - Git 
> is good enough in detecting them automatically.
> 
> However, I’ve found no way to do so.
> 
> If I declare `* eol=lf` in .gitattributes, it makes Git treat all files as 
> text and thus convert CRLF to LF even in binary files. It is consistent with 
> man, but a bit surprising to have e.g. a zip or png file modified in this way.
> 
> One could expect `* text=auto eol=lf` to work the way we want, but 
> unfortunately it doesn’t work either: “eol=lf” forces “text” on all files.
> 
> Thanks a lot for your help!
> -- Kirill.

The short answer: Git doesn't currently do that.
The closest you can get, is to use
echo "* text=auto" >.gitattributes
and
git config core.eol lf
git config core.autocrlf false.

The longer answer is, that I am working on a patch to allow just
the combination of "* text=auto eol=lf" to work as you want it.

Which platform do you use ?
And (out of curiosity, why do you want text files with LF ?)

If you are willing to compile and install Git yourself,
you can use the branch here:
https://github.com/tboegi/git/commits/160421_0706_reliable_t0027_allow_TC_combined_ident_CRLF_v7

Feedback is welcome, if it works as expected.




--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


How to have EOL=LF and keep binary files auto-detection?

2016-04-25 Thread Kirill Likhodedov
Hi, 

I wonder if it is possible both to have LFs in all and only text files in 
working trees, and keep Git’s binary files auto-detection?

To be more precise:
* we want all text files to be checked out in LF; 
* we don’t want force people to set “core.autocrlf” to false, preferring to 
keep this configuration in .gitattributes; 
* we obviously don’t want binary files to be touched by eol-normalization; 
* we also don’t want to declare all possible patterns of binary files - Git is 
good enough in detecting them automatically.

However, I’ve found no way to do so.

If I declare `* eol=lf` in .gitattributes, it makes Git treat all files as text 
and thus convert CRLF to LF even in binary files. It is consistent with man, 
but a bit surprising to have e.g. a zip or png file modified in this way.

One could expect `* text=auto eol=lf` to work the way we want, but 
unfortunately it doesn’t work either: “eol=lf” forces “text” on all files.

Thanks a lot for your help!
-- Kirill.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [git-for-windows] How is detected binary files?

2015-12-01 Thread Jeff King
On Fri, Nov 27, 2015 at 03:14:58PM +0100, Johannes Schindelin wrote:

> On Wed, 25 Nov 2015, Andrzej Borucki wrote:
> 
> > How git detects that file is binary? This must be safe because it not 
> > allowed to change line breaks in binary files. 
> > Binary files can contain byte 0 (zero), but:
> > - 16 bit UTF also can contain zero
> > - short binary files can not contain zero
> 
> It would probably be better to direct this question to the general Git
> mailing list (you reached the Git for Windows one, and this issue is not
> specific to Windows).
> 
> To answer your question, a NUL byte within the first 8000 bytes is indeed
> considered as an indicator for binary files.
> 
> If you use UTF-16, you will need to mark your files as such explicitly
> (Git does not handle UTF-16 internally).

I'm not sure if it is a good idea to treat UTF-16 as text. The rest of
the diff (headers, etc) will all be in ASCII, so one or the other is
going to be mojibake.

You can get readable diffs by textconv-ing them to an ASCII-superset
encoding like UTF-8. Something like:

echo 'myfile diff=utf16' >.gitattributes
git config diff.utf16.textconv 'iconv -f utf16 -t utf8'

but of course the resulting patches cannot be applied, and you may miss
any changes that do not make it through the encoding (e.g., using
different bytes to represent the same code point).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [git-for-windows] How is detected binary files?

2015-11-27 Thread Johannes Schindelin
Hi Andrzej,

On Wed, 25 Nov 2015, Andrzej Borucki wrote:

> How git detects that file is binary? This must be safe because it not 
> allowed to change line breaks in binary files. 
> Binary files can contain byte 0 (zero), but:
> - 16 bit UTF also can contain zero
> - short binary files can not contain zero

It would probably be better to direct this question to the general Git
mailing list (you reached the Git for Windows one, and this issue is not
specific to Windows).

To answer your question, a NUL byte within the first 8000 bytes is indeed
considered as an indicator for binary files.

If you use UTF-16, you will need to mark your files as such explicitly
(Git does not handle UTF-16 internally).

As to short binary files, you will also have to mark those explicitly.

Ciao,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] t9300: use cmp instead of test_cmp to compare binary files

2014-09-12 Thread Johannes Sixt
test_cmp is intended to produce diff output for human consumption. The
input in one instance in t9300-fast-import.sh are binary files, however.
Use cmp to compare the files.

This was noticed because on Windows we have a special implementation of
test_cmp in pure bash code (to ignore differences due to intermittent CR
in actual output), and bash runs into an infinite loop due to the binary
nature of the input.

Signed-off-by: Johannes Sixt j...@kdbg.org
---
 t/t9300-fast-import.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 99f5161..4b13170 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -2687,7 +2687,7 @@ test_expect_success 'R: verify created pack' '
 test_expect_success \
'R: verify written objects' \
'git --git-dir=R/.git cat-file blob big-file:big1 actual 
-test_cmp expect actual 
+cmp expect actual 
 a=$(git --git-dir=R/.git rev-parse big-file:big1) 
 b=$(git --git-dir=R/.git rev-parse big-file:big2) 
 test $a = $b'
-- 
2.0.0.12.gbcf935e

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t9300: use cmp instead of test_cmp to compare binary files

2014-09-12 Thread Junio C Hamano
Johannes Sixt j...@kdbg.org writes:

 test_cmp is intended to produce diff output for human consumption. The
 input in one instance in t9300-fast-import.sh are binary files, however.
 Use cmp to compare the files.

Thanks.


 This was noticed because on Windows we have a special implementation of
 test_cmp in pure bash code (to ignore differences due to intermittent CR
 in actual output), and bash runs into an infinite loop due to the binary
 nature of the input.

 Signed-off-by: Johannes Sixt j...@kdbg.org
 ---
  t/t9300-fast-import.sh | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
 index 99f5161..4b13170 100755
 --- a/t/t9300-fast-import.sh
 +++ b/t/t9300-fast-import.sh
 @@ -2687,7 +2687,7 @@ test_expect_success 'R: verify created pack' '
  test_expect_success \
   'R: verify written objects' \
   'git --git-dir=R/.git cat-file blob big-file:big1 actual 
 -  test_cmp expect actual 
 +  cmp expect actual 
a=$(git --git-dir=R/.git rev-parse big-file:big1) 
b=$(git --git-dir=R/.git rev-parse big-file:big2) 
test $a = $b'
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t9300: use cmp instead of test_cmp to compare binary files

2014-09-12 Thread Thomas Braun
Am 12.09.2014 um 19:58 schrieb Junio C Hamano:
 Johannes Sixt j...@kdbg.org writes:
 
 test_cmp is intended to produce diff output for human consumption. The
 input in one instance in t9300-fast-import.sh are binary files, however.
 Use cmp to compare the files.
 
 Thanks.
 

 This was noticed because on Windows we have a special implementation of
 test_cmp in pure bash code (to ignore differences due to intermittent CR
 in actual output), and bash runs into an infinite loop due to the binary
 nature of the input.

 Signed-off-by: Johannes Sixt j...@kdbg.org
 ---
  t/t9300-fast-import.sh | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
 index 99f5161..4b13170 100755
 --- a/t/t9300-fast-import.sh
 +++ b/t/t9300-fast-import.sh
 @@ -2687,7 +2687,7 @@ test_expect_success 'R: verify created pack' '
  test_expect_success \
  'R: verify written objects' \
  'git --git-dir=R/.git cat-file blob big-file:big1 actual 
 - test_cmp expect actual 
 + cmp expect actual 
   a=$(git --git-dir=R/.git rev-parse big-file:big1) 
   b=$(git --git-dir=R/.git rev-parse big-file:big2) 
   test $a = $b'

May I suggest to use test_cmp_bin instead of plain cmp?
test_cmp_bin was introduced in
b93e6e36 (t5000, t5003: do not use test_cmp to compare binary files,
2014-06-04) and by default is plain cmp.


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] t9300: use test_cmp_bin instead of test_cmp to compare binary files

2014-09-12 Thread Johannes Sixt
test_cmp is intended to produce diff output for human consumption. The
input in one instance in t9300-fast-import.sh are binary files, however.
Use test_cmp_bin to compare the files.

This was noticed because on Windows we have a special implementation of
test_cmp in pure bash code (to ignore differences due to intermittent CR
in actual output), and bash runs into an infinite loop due to the binary
nature of the input.

Signed-off-by: Johannes Sixt j...@kdbg.org
---
Am 12.09.2014 um 20:14 schrieb Thomas Braun:
 May I suggest to use test_cmp_bin instead of plain cmp?

Of course! I did remember that there was talk about it, but missed
that we actually implemented it. Sorry.

 t/t9300-fast-import.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 99f5161..72845f6 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -2687,7 +2687,7 @@ test_expect_success 'R: verify created pack' '
 test_expect_success \
'R: verify written objects' \
'git --git-dir=R/.git cat-file blob big-file:big1 actual 
-test_cmp expect actual 
+test_cmp_bin expect actual 
 a=$(git --git-dir=R/.git rev-parse big-file:big1) 
 b=$(git --git-dir=R/.git rev-parse big-file:big2) 
 test $a = $b'
-- 
2.0.0.12.gbcf935e

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


article: Using a rolling hash to break up binary files

2014-07-31 Thread Philip Oakley
I thought it worth bring to the list's attention a  recent article on 
CodeProject that may be of interest to those looking at splitting binary 
files into deterministic hunks.


http://www.codeproject.com/Articles/801608/Using-a-rolling-hash-to-break-up-binary-files

It's based on Rabin and Karp's algorithm 
http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm.


--
Philip 


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: article: Using a rolling hash to break up binary files

2014-07-31 Thread Shawn Pearce
On Thu, Jul 31, 2014 at 3:31 PM, Philip Oakley philipoak...@iee.org wrote:
 I thought it worth bring to the list's attention a  recent article on
 CodeProject that may be of interest to those looking at splitting binary
 files into deterministic hunks.

 http://www.codeproject.com/Articles/801608/Using-a-rolling-hash-to-break-up-binary-files

 It's based on Rabin and Karp's algorithm
 http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm.

If I remember right, this is how bup[1] works. Its certainly what we
do for delta compressing files.

[1] https://github.com/bup/bup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] t5000, t5003: do not use test_cmp to compare binary files

2014-06-04 Thread Stepan Kasal
test_cmp() is primarily meant to compare text files (and display the
difference for debug purposes).

Raw cmp is better suited to compare binary files (tar, zip, etc.).

On MinGW, test_cmp is a shell function mingw_test_cmp that tries to
read both files into environment, stripping CR characters (introduced
in commit 4d715ac0).

This function usually speeds things up, as fork is extremly slow on
Windows.  But no wonder that this function is extremely slow and
sometimes even crashes when comparing large tar or zip files.

Signed-off-by: Stepan Kasal ka...@ucw.cz
---
 t/t5000-tar-tree.sh | 34 +-
 t/t5001-archive-attr.sh |  2 +-
 t/t5003-archive-zip.sh  |  6 +++---
 t/t5004-archive-corner-cases.sh |  2 +-
 4 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 1cf0a4e..31b1fd1 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -164,7 +164,7 @@ check_tar with_olde-prefix olde-
 test_expect_success 'git archive on large files' '
 test_config core.bigfilethreshold 1 
 git archive HEAD b3.tar 
-test_cmp b.tar b3.tar
+cmp b.tar b3.tar
 '
 
 test_expect_success \
@@ -173,15 +173,15 @@ test_expect_success \
 
 test_expect_success \
 'git archive vs. the same in a bare repo' \
-'test_cmp b.tar b3.tar'
+'cmp b.tar b3.tar'
 
 test_expect_success 'git archive with --output' \
 'git archive --output=b4.tar HEAD 
-test_cmp b.tar b4.tar'
+cmp b.tar b4.tar'
 
 test_expect_success 'git archive --remote' \
 'git archive --remote=. HEAD b5.tar 
-test_cmp b.tar b5.tar'
+cmp b.tar b5.tar'
 
 test_expect_success \
 'validate file modification time' \
@@ -198,7 +198,7 @@ test_expect_success \
 
 test_expect_success 'git archive with --output, override inferred format' '
git archive --format=tar --output=d4.zip HEAD 
-   test_cmp b.tar d4.zip
+   cmp b.tar d4.zip
 '
 
 test_expect_success \
@@ -244,34 +244,34 @@ test_expect_success 'archive --list shows only enabled 
remote filters' '
 test_expect_success 'invoke tar filter by format' '
git archive --format=tar.foo HEAD config.tar.foo 
tr ab ba config.tar.foo config.tar 
-   test_cmp b.tar config.tar 
+   cmp b.tar config.tar 
git archive --format=bar HEAD config.bar 
tr ab ba config.bar config.tar 
-   test_cmp b.tar config.tar
+   cmp b.tar config.tar
 '
 
 test_expect_success 'invoke tar filter by extension' '
git archive -o config-implicit.tar.foo HEAD 
-   test_cmp config.tar.foo config-implicit.tar.foo 
+   cmp config.tar.foo config-implicit.tar.foo 
git archive -o config-implicit.bar HEAD 
-   test_cmp config.tar.foo config-implicit.bar
+   cmp config.tar.foo config-implicit.bar
 '
 
 test_expect_success 'default output format remains tar' '
git archive -o config-implicit.baz HEAD 
-   test_cmp b.tar config-implicit.baz
+   cmp b.tar config-implicit.baz
 '
 
 test_expect_success 'extension matching requires dot' '
git archive -o config-implicittar.foo HEAD 
-   test_cmp b.tar config-implicittar.foo
+   cmp b.tar config-implicittar.foo
 '
 
 test_expect_success 'only enabled filters are available remotely' '
test_must_fail git archive --remote=. --format=tar.foo HEAD \
remote.tar.foo 
git archive --remote=. --format=bar remote.bar HEAD 
-   test_cmp remote.bar config.bar
+   cmp remote.bar config.bar
 '
 
 test_expect_success GZIP 'git archive --format=tgz' '
@@ -280,27 +280,27 @@ test_expect_success GZIP 'git archive --format=tgz' '
 
 test_expect_success GZIP 'git archive --format=tar.gz' '
git archive --format=tar.gz HEAD j1.tar.gz 
-   test_cmp j.tgz j1.tar.gz
+   cmp j.tgz j1.tar.gz
 '
 
 test_expect_success GZIP 'infer tgz from .tgz filename' '
git archive --output=j2.tgz HEAD 
-   test_cmp j.tgz j2.tgz
+   cmp j.tgz j2.tgz
 '
 
 test_expect_success GZIP 'infer tgz from .tar.gz filename' '
git archive --output=j3.tar.gz HEAD 
-   test_cmp j.tgz j3.tar.gz
+   cmp j.tgz j3.tar.gz
 '
 
 test_expect_success GZIP 'extract tgz file' '
gzip -d -c j.tgz j.tar 
-   test_cmp b.tar j.tar
+   cmp b.tar j.tar
 '
 
 test_expect_success GZIP 'remote tar.gz is allowed by default' '
git archive --remote=. --format=tar.gz HEAD remote.tar.gz 
-   test_cmp j.tgz remote.tar.gz
+   cmp j.tgz remote.tar.gz
 '
 
 test_expect_success GZIP 'remote tar.gz can be disabled' '
diff --git a/t/t5001-archive-attr.sh b/t/t5001-archive-attr.sh
index 51dedab..dfc35b3 100755
--- a/t/t5001-archive-attr.sh
+++ b/t/t5001-archive-attr.sh
@@ -68,7 +68,7 @@ test_expect_missing   worktree2/ignored-by-worktree
 
 test_expect_success 'git archive vs. bare' '
(cd bare  git archive HEAD) bare-archive.tar 
-   test_cmp archive.tar bare-archive.tar
+   cmp archive.tar bare

Re: [PATCH] t5000, t5003: do not use test_cmp to compare binary files

2014-06-04 Thread Thomas Braun
Am 04.06.2014 13:42, schrieb Stepan Kasal:
 test_cmp() is primarily meant to compare text files (and display the
 difference for debug purposes).
 
 Raw cmp is better suited to compare binary files (tar, zip, etc.).
 
 On MinGW, test_cmp is a shell function mingw_test_cmp that tries to
 read both files into environment, stripping CR characters (introduced
 in commit 4d715ac0).
 
 This function usually speeds things up, as fork is extremly slow on
 Windows.  But no wonder that this function is extremely slow and
 sometimes even crashes when comparing large tar or zip files.
 
 Signed-off-by: Stepan Kasal ka...@ucw.cz
 ---
  t/t5000-tar-tree.sh | 34 +-
  t/t5001-archive-attr.sh |  2 +-
  t/t5003-archive-zip.sh  |  6 +++---
  t/t5004-archive-corner-cases.sh |  2 +-
  4 files changed, 22 insertions(+), 22 deletions(-)
 
 diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
 index 1cf0a4e..31b1fd1 100755
 --- a/t/t5000-tar-tree.sh
 +++ b/t/t5000-tar-tree.sh
 @@ -164,7 +164,7 @@ check_tar with_olde-prefix olde-
  test_expect_success 'git archive on large files' '
  test_config core.bigfilethreshold 1 
  git archive HEAD b3.tar 
 -test_cmp b.tar b3.tar
 +cmp b.tar b3.tar
  '

Wouldn't a function like test_cmp_bin() be better suited for all?
The windows folks can then use cmp inside test_cmp_bin() and all others
just use test_cmp.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t5000, t5003: do not use test_cmp to compare binary files

2014-06-04 Thread Stepan Kasal
Hello Thomas,

On Wed, Jun 04, 2014 at 02:13:44PM +0200, Thomas Braun wrote:
 Wouldn't a function like test_cmp_bin() be better suited for all?

I also considered it.  The advantage is that is shows that
this intentionally differs from test_cmp.

 The windows folks can then use cmp inside test_cmp_bin() and all others

... would use cmp as well because it is better suited for the task
than diff -u.  So test_cmp_bin would be just an alias for cmp, on all
platforms.  Doesn't that sound weird?

Stepan
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] Re: [PATCH] t5000, t5003: do not use test_cmp to compare binary files

2014-06-04 Thread Thomas Braun
Hi Stephan,

Am 04.06.2014 14:42, schrieb Stepan Kasal:
 On Wed, Jun 04, 2014 at 02:13:44PM +0200, Thomas Braun wrote:
 Wouldn't a function like test_cmp_bin() be better suited for all?

 I also considered it.  The advantage is that is shows that
 this intentionally differs from test_cmp.

 The windows folks can then use cmp inside test_cmp_bin() and all others

 ... would use cmp as well because it is better suited for the task
 than diff -u.  So test_cmp_bin would be just an alias for cmp, on all
 platforms.  Doesn't that sound weird?

I actually like the idea that the test assertions follow a common naming
scheme and can easily be overriden by $arbitrary-crazy-platform.

Using test_cmp_bin instead of cmp would result in then four assertions
for comparing arbitrary data
test_cmp
test_18ncmp
test_cmp_text
test_cmp_bin
where I think the purpose of each function is clear from its name.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] t5000, t5003: do not use test_cmp to compare binary files

2014-06-04 Thread Stepan Kasal
test_cmp() is primarily meant to compare text files (and display the
difference for debug purposes).

Raw cmp is better suited to compare binary files (tar, zip, etc.).

On MinGW, test_cmp is a shell function mingw_test_cmp that tries to
read both files into environment, stripping CR characters (introduced
in commit 4d715ac0).

This function usually speeds things up, as fork is extremly slow on
Windows.  But no wonder that this function is extremely slow and
sometimes even crashes when comparing large tar or zip files.

Signed-off-by: Stepan Kasal ka...@ucw.cz
---

Hi Thomas,
On Wed, Jun 04, 2014 at 02:59:44PM +0200, Thomas Braun wrote:
 Using test_cmp_bin instead of cmp would result in then four assertions
 for comparing arbitrary data
 test_cmp
 test_i18ncmp
 test_cmp_text
 test_cmp_bin
 where I think the purpose of each function is clear from its name.

[test_cmp_text does not exist (yet)]

OK, I agree, hence this modified version of the patch.

Stepan

 t/t5000-tar-tree.sh | 34 +-
 t/t5001-archive-attr.sh |  2 +-
 t/t5003-archive-zip.sh  |  6 +++---
 t/t5004-archive-corner-cases.sh |  2 +-
 t/test-lib-functions.sh |  6 ++
 5 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 1cf0a4e..4efaf8c 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -164,7 +164,7 @@ check_tar with_olde-prefix olde-
 test_expect_success 'git archive on large files' '
 test_config core.bigfilethreshold 1 
 git archive HEAD b3.tar 
-test_cmp b.tar b3.tar
+test_cmp_bin b.tar b3.tar
 '
 
 test_expect_success \
@@ -173,15 +173,15 @@ test_expect_success \
 
 test_expect_success \
 'git archive vs. the same in a bare repo' \
-'test_cmp b.tar b3.tar'
+'test_cmp_bin b.tar b3.tar'
 
 test_expect_success 'git archive with --output' \
 'git archive --output=b4.tar HEAD 
-test_cmp b.tar b4.tar'
+test_cmp_bin b.tar b4.tar'
 
 test_expect_success 'git archive --remote' \
 'git archive --remote=. HEAD b5.tar 
-test_cmp b.tar b5.tar'
+test_cmp_bin b.tar b5.tar'
 
 test_expect_success \
 'validate file modification time' \
@@ -198,7 +198,7 @@ test_expect_success \
 
 test_expect_success 'git archive with --output, override inferred format' '
git archive --format=tar --output=d4.zip HEAD 
-   test_cmp b.tar d4.zip
+   test_cmp_bin b.tar d4.zip
 '
 
 test_expect_success \
@@ -244,34 +244,34 @@ test_expect_success 'archive --list shows only enabled 
remote filters' '
 test_expect_success 'invoke tar filter by format' '
git archive --format=tar.foo HEAD config.tar.foo 
tr ab ba config.tar.foo config.tar 
-   test_cmp b.tar config.tar 
+   test_cmp_bin b.tar config.tar 
git archive --format=bar HEAD config.bar 
tr ab ba config.bar config.tar 
-   test_cmp b.tar config.tar
+   test_cmp_bin b.tar config.tar
 '
 
 test_expect_success 'invoke tar filter by extension' '
git archive -o config-implicit.tar.foo HEAD 
-   test_cmp config.tar.foo config-implicit.tar.foo 
+   test_cmp_bin config.tar.foo config-implicit.tar.foo 
git archive -o config-implicit.bar HEAD 
-   test_cmp config.tar.foo config-implicit.bar
+   test_cmp_bin config.tar.foo config-implicit.bar
 '
 
 test_expect_success 'default output format remains tar' '
git archive -o config-implicit.baz HEAD 
-   test_cmp b.tar config-implicit.baz
+   test_cmp_bin b.tar config-implicit.baz
 '
 
 test_expect_success 'extension matching requires dot' '
git archive -o config-implicittar.foo HEAD 
-   test_cmp b.tar config-implicittar.foo
+   test_cmp_bin b.tar config-implicittar.foo
 '
 
 test_expect_success 'only enabled filters are available remotely' '
test_must_fail git archive --remote=. --format=tar.foo HEAD \
remote.tar.foo 
git archive --remote=. --format=bar remote.bar HEAD 
-   test_cmp remote.bar config.bar
+   test_cmp_bin remote.bar config.bar
 '
 
 test_expect_success GZIP 'git archive --format=tgz' '
@@ -280,27 +280,27 @@ test_expect_success GZIP 'git archive --format=tgz' '
 
 test_expect_success GZIP 'git archive --format=tar.gz' '
git archive --format=tar.gz HEAD j1.tar.gz 
-   test_cmp j.tgz j1.tar.gz
+   test_cmp_bin j.tgz j1.tar.gz
 '
 
 test_expect_success GZIP 'infer tgz from .tgz filename' '
git archive --output=j2.tgz HEAD 
-   test_cmp j.tgz j2.tgz
+   test_cmp_bin j.tgz j2.tgz
 '
 
 test_expect_success GZIP 'infer tgz from .tar.gz filename' '
git archive --output=j3.tar.gz HEAD 
-   test_cmp j.tgz j3.tar.gz
+   test_cmp_bin j.tgz j3.tar.gz
 '
 
 test_expect_success GZIP 'extract tgz file' '
gzip -d -c j.tgz j.tar 
-   test_cmp b.tar j.tar
+   test_cmp_bin b.tar j.tar
 '
 
 test_expect_success GZIP 'remote tar.gz is allowed by default' '
git

Re: [PATCH v2] t5000, t5003: do not use test_cmp to compare binary files

2014-06-04 Thread Junio C Hamano
Stepan Kasal ka...@ucw.cz writes:

 test_cmp() is primarily meant to compare text files (and display the
 difference for debug purposes).

 Raw cmp is better suited to compare binary files (tar, zip, etc.).

 On MinGW, test_cmp is a shell function mingw_test_cmp that tries to
 read both files into environment, stripping CR characters (introduced
 in commit 4d715ac0).

 This function usually speeds things up, as fork is extremly slow on
 Windows.  But no wonder that this function is extremely slow and
 sometimes even crashes when comparing large tar or zip files.

 Signed-off-by: Stepan Kasal ka...@ucw.cz
 ---

 Hi Thomas,
 On Wed, Jun 04, 2014 at 02:59:44PM +0200, Thomas Braun wrote:
 Using test_cmp_bin instead of cmp would result in then four assertions
 for comparing arbitrary data
 test_cmp
 test_i18ncmp
 test_cmp_text
 test_cmp_bin
 where I think the purpose of each function is clear from its name.

 [test_cmp_text does not exist (yet)]

 OK, I agree, hence this modified version of the patch.

Yeah, I think the above reasoning is sound.  And I do not think we
ever need to have test_cmp_text -- our payload and our messages
compared by tests to make sure our expectations hold are text by
default.

Will queue; thanks.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] Re: [PATCH v2] t5000, t5003: do not use test_cmp to compare binary files

2014-06-04 Thread Michael Geddes
I have the problem that the overridden test_cmp crashes on a couple of places 
where it is doing a binary compare, so this is definitely needed.

I actually used cmp -q   in my override as it's the return code that is most 
important.

//.

On Wed, 4 Jun 2014 11:22:56 AM Junio C Hamano wrote:
 Stepan Kasal ka...@ucw.cz writes:
  test_cmp() is primarily meant to compare text files (and display the
  difference for debug purposes).
  
  Raw cmp is better suited to compare binary files (tar, zip, etc.).
  
  On MinGW, test_cmp is a shell function mingw_test_cmp that tries to
  read both files into environment, stripping CR characters (introduced
  in commit 4d715ac0).
  
  This function usually speeds things up, as fork is extremly slow on
  Windows.  But no wonder that this function is extremely slow and
  sometimes even crashes when comparing large tar or zip files.
  
  Signed-off-by: Stepan Kasal ka...@ucw.cz
  ---
  
  Hi Thomas,
  
  On Wed, Jun 04, 2014 at 02:59:44PM +0200, Thomas Braun wrote:
  Using test_cmp_bin instead of cmp would result in then four assertions
  for comparing arbitrary data
  test_cmp
  test_i18ncmp
  test_cmp_text
  test_cmp_bin
  where I think the purpose of each function is clear from its name.
  
  [test_cmp_text does not exist (yet)]
  
  OK, I agree, hence this modified version of the patch.
 
 Yeah, I think the above reasoning is sound.  And I do not think we
 ever need to have test_cmp_text -- our payload and our messages
 compared by tests to make sure our expectations hold are text by
 default.
 
 Will queue; thanks.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep -I: do not bother to read known-binary files

2014-05-16 Thread Stepan Kasal
Hello,

On Thu, May 15, 2014 at 03:22:26PM -0400, Jeff King wrote:
 As the person who is proposing the patch for git.git, I would hope
 Stepan would follow up on such review and confirm whether or not it is
 still needed.

well, I try to.  (I verified that less -I works in msysGit before
submitting the fixup back there, fox example.)

But I think my role is to moderate the reconciliation.
In this case, having read the discussion, I decided to to ask Dscho
to drop the patch.

(It is always about balancing the risks and the expenses.)

Stepan
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep -I: do not bother to read known-binary files

2014-05-16 Thread Jeff King
On Fri, May 16, 2014 at 10:19:57AM +0200, Stepan Kasal wrote:

 On Thu, May 15, 2014 at 03:22:26PM -0400, Jeff King wrote:
  As the person who is proposing the patch for git.git, I would hope
  Stepan would follow up on such review and confirm whether or not it is
  still needed.
 
 well, I try to.  (I verified that less -I works in msysGit before
 submitting the fixup back there, fox example.)
 
 But I think my role is to moderate the reconciliation.
 In this case, having read the discussion, I decided to to ask Dscho
 to drop the patch.
 
 (It is always about balancing the risks and the expenses.)

Sure, I think that is reasonable, and I hope I did not sound like blame
Stepan, he was screwed up. After reading Dscho's other message and
knowing that he has not much time for git, I was trying to communicate
that I did not expect _him_ to be dealing with it.

Git.git has existed for a long time without that patch, so from our
perspective, it is a new patch. And as with any new patch, I would
expect a submitter who receives review of eh, don't we already do this
to either look into it, or decide that the review is probably right and
not bother with it. And you did the latter, and that is totally fine and
reasonable.

From msysgit's perspective, they may or may not want to revert the patch
that they already have. That is a _separate_ issue, and I think the
burden of effort goes the other way. The patch should stay unless
somebody goes to the work to confirm that it is not necessary (or unless
they want to accept my say-so and indication that I did not do fresh
testing, but that is up to them).

Sorry if that is long and/or obvious. There have been enough bad
feelings on the list lately that I didn't want there to be a
misunderstanding about what I meant.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep -I: do not bother to read known-binary files

2014-05-16 Thread Stepan Kasal
Hi,

On Fri, May 16, 2014 at 04:29:58AM -0400, Jeff King wrote:
 [..] I hope I did not sound like blame Stepan, he was screwed up.

no, you did not, it was ok.

 From msysgit's perspective, they may or may not want to revert the patch
 that they already have. That is a _separate_ issue, and I think the

I hope Dscho will help with the decision: he can say keep it until
we have evidence or at least time to do a more thorough review, for
example.

Stepan
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep -I: do not bother to read known-binary files

2014-05-15 Thread Johannes Schindelin
Hi Peff,

On Wed, 14 May 2014, Jeff King wrote:

 On Wed, May 14, 2014 at 05:44:19PM +0200, Stepan Kasal wrote:
 
  From: Johannes Schindelin johannes.schinde...@gmx.de
  Date: Mon, 8 Nov 2010 16:10:43 +0100
  
  Incidentally, this makes grep -I respect the binary attribute (actually,
  the -text attribute, but binary implies that).
 
 Hrm. Is this patch still necessary? In the time since this patch was
 written, we did 0826579 (grep: load file data after checking
 binary-ness, 2012-02-02)

I have no time to test this but I trust that you made sure that it works
as advertised. In my case, there were about 500 gigabytes of image data
intermixed with code, and waiting for 'git grep' was not funny at all (and
I did not have time back then to go through a full code submission cycle
on the Git mailing list, either).

So I guess we can drop my patch.

Ciao,
Johannes
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep -I: do not bother to read known-binary files

2014-05-15 Thread Jeff King
On Thu, May 15, 2014 at 07:42:00PM +0200, Johannes Schindelin wrote:

  Hrm. Is this patch still necessary? In the time since this patch was
  written, we did 0826579 (grep: load file data after checking
  binary-ness, 2012-02-02)
 
 I have no time to test this but I trust that you made sure that it works
 as advertised. In my case, there were about 500 gigabytes of image data
 intermixed with code, and waiting for 'git grep' was not funny at all (and
 I did not have time back then to go through a full code submission cycle
 on the Git mailing list, either).
 
 So I guess we can drop my patch.

Certainly I tested it at the time (those commits I referenced contain
timing information), and it should have improved the workload you
describe. I did not test before/after the patch in this thread, but only
read it and noticed that it was trying to do the same thing (that is why
I said I suspect...).

As the person who is proposing the patch for git.git, I would hope
Stepan would follow up on such review and confirm whether or not it is
still needed.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] grep -I: do not bother to read known-binary files

2014-05-14 Thread Stepan Kasal
From: Johannes Schindelin johannes.schinde...@gmx.de
Date: Mon, 8 Nov 2010 16:10:43 +0100

Incidentally, this makes grep -I respect the binary attribute (actually,
the -text attribute, but binary implies that).

Since the attributes are not thread-safe, we now need to switch off
threading if -I was passed.

Signed-off-by: Johannes Schindelin johannes.schinde...@gmx.de
Signed-off-by: Stepan Kasal ka...@ucw.cz
---

Hi,
this patch has been in msysgit for 3.5 years.
Stepan

 builtin/grep.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/builtin/grep.c b/builtin/grep.c
index 43af5b7..8073fbe 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -18,6 +18,7 @@
 #include quote.h
 #include dir.h
 #include pathspec.h
+#include attr.h
 
 static char const * const grep_usage[] = {
N_(git grep [options] [-e] pattern [rev...] [[--] path...]),
@@ -163,6 +164,22 @@ static void work_done(struct work_item *w)
grep_unlock();
 }
 
+static int skip_binary(struct grep_opt *opt, const char *filename)
+{
+   if ((opt-binary  GREP_BINARY_NOMATCH)) {
+   static struct git_attr *attr_text;
+   struct git_attr_check check;
+
+   if (!attr_text)
+   attr_text = git_attr(text);
+   memset(check, 0, sizeof(check));
+   check.attr = attr_text;
+   return !git_check_attr(filename, 1, check) 
+   ATTR_FALSE(check.value);
+   }
+   return 0;
+}
+
 static void *run(void *arg)
 {
int hit = 0;
@@ -173,6 +190,9 @@ static void *run(void *arg)
if (!w)
break;
 
+   if (skip_binary(opt, (const char *)w-source.identifier))
+   continue;
+
opt-output_priv = w;
hit |= grep_source(opt, w-source);
grep_source_clear_data(w-source);
@@ -379,6 +399,9 @@ static int grep_cache(struct grep_opt *opt, const struct 
pathspec *pathspec, int
continue;
if (!ce_path_match(ce, pathspec, NULL))
continue;
+   if (skip_binary(opt, ce-name))
+   continue;
+
/*
 * If CE_VALID is on, we assume worktree file and its cache 
entry
 * are identical, even if worktree file has been modified, so 
use
@@ -803,6 +826,8 @@ int cmd_grep(int argc, const char **argv, const char 
*prefix)
string_list_append(path_list, show_in_pager);
use_threads = 0;
}
+   if ((opt.binary  GREP_BINARY_NOMATCH))
+   use_threads = 0;
 
if (!opt.pattern_list)
die(_(no pattern given.));
-- 
1.9.2.msysgit.0.335.gd2a461f

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep -I: do not bother to read known-binary files

2014-05-14 Thread Junio C Hamano
Stepan Kasal ka...@ucw.cz writes:

 From: Johannes Schindelin johannes.schinde...@gmx.de
 Date: Mon, 8 Nov 2010 16:10:43 +0100

 Incidentally, this makes grep -I respect the binary attribute (actually,
 the -text attribute, but binary implies that).

 Since the attributes are not thread-safe, we now need to switch off
 threading if -I was passed.

 Signed-off-by: Johannes Schindelin johannes.schinde...@gmx.de
 Signed-off-by: Stepan Kasal ka...@ucw.cz
 ---

 Hi,
 this patch has been in msysgit for 3.5 years.
 Stepan

I do not think checking 'text' is the right way to do this.  The
attribute controls the eof normalization, and people sometimes want
to keep CRLF terminated files in the repository no matter what the
platform is (an example I heard is a sample protocol exchange over
textual network protocol such as HTTP and SMTP), and the way to do
so is to unset it.  That would still let them look for patterns in
and compare the changes to these paths.

Looking for Marking files as binary in gitattributes(5) gives us a
more authoritative alternative, I think.  In short:

 - If 'diff' is Unset, or
 - If 'diff' is Set to X and diff.X.binary is true

then the contents are not suitable for human consumption.  I haven't
thought things through to declare that grep -I would want to treat
a PostScript file (which is an example in that section) as a binary
file and refrain from trying to find substrings in it, but my gut
feeling is that we probably should let it look inside such a file,
so replacing 'text' with 'diff' in this patch and doing nothing
about diff.*.binary would be the way to go.  Given that the posted
patch was older than 3.5 years, perhaps it needs updating to adhere
the advice given in ab435611 (docs: explain diff.*.binary option,
2011-01-09).

By the way, I wonder if the patch is about performance, implied by
do not bother (to waste time looking inside), though.  Is it an
overall win to avoid looking inside -diff files, if that requires
you to use only 1 core and leaving the other 7 idle?

I think the user tells us it is binary with attribute, and the user
tells us not to look into binary with -I is a lot better rationale
to do this change , and that would characterise it as a correctness
patch, not a performance patch.

Thanks.


  builtin/grep.c | 25 +
  1 file changed, 25 insertions(+)

 diff --git a/builtin/grep.c b/builtin/grep.c
 index 43af5b7..8073fbe 100644
 --- a/builtin/grep.c
 +++ b/builtin/grep.c
 @@ -18,6 +18,7 @@
  #include quote.h
  #include dir.h
  #include pathspec.h
 +#include attr.h
  
  static char const * const grep_usage[] = {
   N_(git grep [options] [-e] pattern [rev...] [[--] path...]),
 @@ -163,6 +164,22 @@ static void work_done(struct work_item *w)
   grep_unlock();
  }
  
 +static int skip_binary(struct grep_opt *opt, const char *filename)
 +{
 + if ((opt-binary  GREP_BINARY_NOMATCH)) {
 + static struct git_attr *attr_text;
 + struct git_attr_check check;
 +
 + if (!attr_text)
 + attr_text = git_attr(text);
 + memset(check, 0, sizeof(check));
 + check.attr = attr_text;
 + return !git_check_attr(filename, 1, check) 
 + ATTR_FALSE(check.value);
 + }
 + return 0;
 +}
 +
  static void *run(void *arg)
  {
   int hit = 0;
 @@ -173,6 +190,9 @@ static void *run(void *arg)
   if (!w)
   break;
  
 + if (skip_binary(opt, (const char *)w-source.identifier))
 + continue;
 +
   opt-output_priv = w;
   hit |= grep_source(opt, w-source);
   grep_source_clear_data(w-source);
 @@ -379,6 +399,9 @@ static int grep_cache(struct grep_opt *opt, const struct 
 pathspec *pathspec, int
   continue;
   if (!ce_path_match(ce, pathspec, NULL))
   continue;
 + if (skip_binary(opt, ce-name))
 + continue;
 +
   /*
* If CE_VALID is on, we assume worktree file and its cache 
 entry
* are identical, even if worktree file has been modified, so 
 use
 @@ -803,6 +826,8 @@ int cmd_grep(int argc, const char **argv, const char 
 *prefix)
   string_list_append(path_list, show_in_pager);
   use_threads = 0;
   }
 + if ((opt.binary  GREP_BINARY_NOMATCH))
 + use_threads = 0;
  
   if (!opt.pattern_list)
   die(_(no pattern given.));
 -- 
 1.9.2.msysgit.0.335.gd2a461f

 -- 
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep -I: do not bother to read known-binary files

2014-05-14 Thread Jeff King
On Wed, May 14, 2014 at 05:44:19PM +0200, Stepan Kasal wrote:

 From: Johannes Schindelin johannes.schinde...@gmx.de
 Date: Mon, 8 Nov 2010 16:10:43 +0100
 
 Incidentally, this makes grep -I respect the binary attribute (actually,
 the -text attribute, but binary implies that).
 
 Since the attributes are not thread-safe, we now need to switch off
 threading if -I was passed.
 
 Signed-off-by: Johannes Schindelin johannes.schinde...@gmx.de
 Signed-off-by: Stepan Kasal ka...@ucw.cz
 ---
 
 Hi,
 this patch has been in msysgit for 3.5 years.
 Stepan

Hrm. Is this patch still necessary? In the time since this patch was
written, we did 0826579 (grep: load file data after checking
binary-ness, 2012-02-02), which should do the same thing. It deals with
the threading via a lock, but we later learned in 9dd5245 (grep:
pre-load userdiff drivers when threaded, 2012-02-02) to hoist that bit
out.

So I suspect this patch at best is doing nothing, and at worst is
wasting extra time doing redundant attribute checks.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep -I: do not bother to read known-binary files

2014-05-14 Thread Jeff King
On Wed, May 14, 2014 at 10:52:28AM -0700, Junio C Hamano wrote:

 I do not think checking 'text' is the right way to do this.  The
 attribute controls the eof normalization, and people sometimes want
 to keep CRLF terminated files in the repository no matter what the
 platform is (an example I heard is a sample protocol exchange over
 textual network protocol such as HTTP and SMTP), and the way to do
 so is to unset it.  That would still let them look for patterns in
 and compare the changes to these paths.
 
 Looking for Marking files as binary in gitattributes(5) gives us a
 more authoritative alternative, I think.  In short:
 
  - If 'diff' is Unset, or
  - If 'diff' is Set to X and diff.X.binary is true

 then the contents are not suitable for human consumption.

I responded elsewhere in the thread that I think the patch under
discussion is redundant at this point, but just to clarify: grep
currently uses the rules you give above, as it builds on the userdiff
driver code.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep -I: do not bother to read known-binary files

2014-05-14 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Wed, May 14, 2014 at 05:44:19PM +0200, Stepan Kasal wrote:

 From: Johannes Schindelin johannes.schinde...@gmx.de
 Date: Mon, 8 Nov 2010 16:10:43 +0100
 
 Incidentally, this makes grep -I respect the binary attribute (actually,
 the -text attribute, but binary implies that).
 
 Since the attributes are not thread-safe, we now need to switch off
 threading if -I was passed.
 
 Signed-off-by: Johannes Schindelin johannes.schinde...@gmx.de
 Signed-off-by: Stepan Kasal ka...@ucw.cz
 ---
 
 Hi,
 this patch has been in msysgit for 3.5 years.
 Stepan

 Hrm. Is this patch still necessary? In the time since this patch was
 written, we did 0826579 (grep: load file data after checking
 binary-ness, 2012-02-02), which should do the same thing. It deals with
 the threading via a lock, but we later learned in 9dd5245 (grep:
 pre-load userdiff drivers when threaded, 2012-02-02) to hoist that bit
 out.

Wow, power of Git history ;-)

 So I suspect this patch at best is doing nothing, and at worst is
 wasting extra time doing redundant attribute checks.

Sounds like a sensible conclusion.  Thanks.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Rebase triggers git diff header lacks filename information on very large patch with binary files

2014-01-14 Thread demerphq
Hi,

I just did a rebase, and it throws an error like this:

Applying: comment1
Applying: comment2
Applying: comment3
Applying: comment4
Applying: patch_with_binary_files
fatal: git diff header lacks filename information when removing 1
leading pathname component (line 7330213)
Repository lacks necessary blobs to fall back on 3-way merge.
Cannot fall back to three-way merge.
Patch failed at 0005 patch_with_binary_files
The copy of the patch that failed is found in:
   /usr/local/git_tree/affiliate_data/.git/rebase-apply/patch

When you have resolved this problem, run git rebase --continue.
If you prefer to skip this patch, run git rebase --skip instead.
To check out the original branch and stop rebasing, run git rebase --abort.

The patch is very large, 882453899 bytes.

The patch also includes many binary files.

Extracting the content around and before line 7330213 and up to the
next diff header in the patch I see this:

perl -lne'print $.\t$_ if 7330169 .. 7330213' .git/rebase-apply/patch
7330169 diff --git a/dir1/dir2/file.png b/dir1/dir2/file.png
7330170 new file mode 100644
7330171 index 
..8a3219cb6545f23e3f7c61f058d82fc2c1bd9aac
7330172 GIT binary patch
7330173 literal 11301
7330174 zcmXYX1ymeO)Ai!+PH-nk@Zb{MHE3{$;O=gVh2Rd06MPp4?hxD|K!5jEd=*(p7;Ov
[more lines of binary removed]
7330213 zznckDs-GVJg-A0uD|ONvCQWVX;j!JNnkQI9^=+zJ^SvLe1p-~c7bmY5wu4C=(8F0
[more lines of binary removed]
7330393 literal 0
7330394 HcmV?d1
7330395
7330396 diff --git a/dir1/dir2/file.css b/dir1/dir2/file.css
7330397 new file mode 100644
7330398 index 
..75c8afc558424ea185c62b5a1c61ad6c32cddc21

I have munged the filenames.

It looks to me like git can't apply patches over a certain size.

Any suggestions on how to proceed here?

cheers,
Yves




-- 
perl -Mre=debug -e /just|another|perl|hacker/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Rebase triggers git diff header lacks filename information on very large patch with binary files

2014-01-14 Thread demerphq
On 14 January 2014 12:48, demerphq demer...@gmail.com wrote:
 Hi,

 I just did a rebase, and it throws an error like this:

 Applying: comment1
 Applying: comment2
 Applying: comment3
 Applying: comment4
 Applying: patch_with_binary_files
 fatal: git diff header lacks filename information when removing 1
 leading pathname component (line 7330213)
 Repository lacks necessary blobs to fall back on 3-way merge.
 Cannot fall back to three-way merge.
 Patch failed at 0005 patch_with_binary_files
 The copy of the patch that failed is found in:
/usr/local/git_tree/affiliate_data/.git/rebase-apply/patch

 When you have resolved this problem, run git rebase --continue.
 If you prefer to skip this patch, run git rebase --skip instead.
 To check out the original branch and stop rebasing, run git rebase --abort.

 The patch is very large, 882453899 bytes.

 The patch also includes many binary files.

 Extracting the content around and before line 7330213 and up to the
 next diff header in the patch I see this:

 perl -lne'print $.\t$_ if 7330169 .. 7330213' .git/rebase-apply/patch
 7330169 diff --git a/dir1/dir2/file.png b/dir1/dir2/file.png
 7330170 new file mode 100644
 7330171 index 
 ..8a3219cb6545f23e3f7c61f058d82fc2c1bd9aac
 7330172 GIT binary patch
 7330173 literal 11301
 7330174 zcmXYX1ymeO)Ai!+PH-nk@Zb{MHE3{$;O=gVh2Rd06MPp4?hxD|K!5jEd=*(p7;Ov
 [more lines of binary removed]
 7330213 zznckDs-GVJg-A0uD|ONvCQWVX;j!JNnkQI9^=+zJ^SvLe1p-~c7bmY5wu4C=(8F0
 [more lines of binary removed]
 7330393 literal 0
 7330394 HcmV?d1
 7330395
 7330396 diff --git a/dir1/dir2/file.css b/dir1/dir2/file.css
 7330397 new file mode 100644
 7330398 index 
 ..75c8afc558424ea185c62b5a1c61ad6c32cddc21

 I have munged the filenames.

 It looks to me like git can't apply patches over a certain size.

 Any suggestions on how to proceed here?

I aborted and then did a merge instead and it seemed to work out.

Still, seems like something git should detect BEFORE it tries to do the rebase.

cheers,
Yves



-- 
perl -Mre=debug -e /just|another|perl|hacker/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gitweb commitdiff page - binary files with ampersands in filename?

2013-04-15 Thread Jakub Narębski
Oj W wrote:

 Change a binary file whose filename contains an ampersand, then view
 the commitdiff page in gitweb.
 
 Git outputs a message like Binary files a/bw.dll and b/bw.dll differ
 
 Gitweb format_diff_from_to_header() doesn't notice anything in that
 output which needs escaping, and writes it directly to the XHTML 1.0
 Strict output.
 
 Then gitweb's output is invalid XML, meaning that browsers such as
 Firefox will refuse to display the page.

This was because in case of binary files we don't get usual diff,
just Binary files a/foo and b/foo differ just after extended git diff 
headers.

There are two problems with gitweb code.  First is that git_patchset_body()
didn't recognize and handle this situation.  This should be fixed by the
patch below (I have trouble testing it as git-instaweb keeps using old
gitweb version for some reason).

Second is that format_diff_from_to_header() doesn't handle unrecognized
extended git diff headers well - it doesn't HTML escape them.  This is to
be fixed.

-- 8 --
---
 gitweb/gitweb.perl |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 1309196..33a0de1 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -5345,7 +5345,8 @@ sub git_patchset_body {
while ($patch_line = $fd) {
chomp $patch_line;
 
-   last EXTENDED_HEADER if ($patch_line =~ m/^--- |^diff 
/);
+   last EXTENDED_HEADER
+   if ($patch_line =~ m/^--- |^diff |^Binary files 
.* differ$/);
 
print format_extended_diff_header_line($patch_line, 
$diffinfo,
   \%from, \%to);
@@ -5357,7 +5358,12 @@ sub git_patchset_body {
print /div\n; # class=patch
last PATCH;
}
-   next PATCH if ($patch_line =~ m/^diff /);
+   if ($patch_line =~ m/^Binary files .* differ$/) {
+   print div class=\diff bin\ .
+ esc_html($patch_line, -nbsp = 1) .
+ /div\n;
+   }
+   next PATCH if ($patch_line =~ m/^diff |^Binary files .* 
differ$/);
#assert($patch_line =~ m/^---/) if DEBUG;
 
my $last_patch_line = $patch_line;
-- 
1.7.10.4


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


gitweb commitdiff page - binary files with ampersands in filename?

2013-04-09 Thread Oj W
Change a binary file whose filename contains an ampersand, then view
the commitdiff page in gitweb.

Git outputs a message like Binary files a/bw.dll and b/bw.dll differ

Gitweb format_diff_from_to_header() doesn't notice anything in that
output which needs escaping, and writes it directly to the XHTML 1.0
Strict output.

Then gitweb's output is invalid XML, meaning that browsers such as
Firefox will refuse to display the page.

(tested with v 1.7.9.5, but I can't see anything in latest
https://github.com/git/git/blob/master/gitweb/gitweb.perl#CL2158 which
is looking for text like Binary files ... differ)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


gitdiffbinstat - git diff --shortstat -like output for changes in binary files

2013-03-29 Thread Matthias Krüger
I use git mostly for game-development which means I have to deal with a 
lot of binary files (images, sound files etc).


When I came to a point where I had run image optimization on a branch, I 
wanted to know of course how much smaller the new branch was in 
comparison to master.
Problem was that 'git diff --stat' would only summerize per-binary-file 
size changes and 'git diff --shortstat' did skip the binary files entirely.


To solve this problem, I wrote a script (gitdiffbinstat) which 
basically runs 'git diff --stat' and summerizes the output.


The script can be found here: 
https://github.com/matthiaskrgr/gitdiffbinstat/blob/master/gitdiffbinstat.sh

Screenshot of example output is attached.

I wondered what you guys thought about the script, is there a chance to 
perhaps get it included as some kind of helper script into the official 
git repo?



Regards, Matthias
attachment: screenshot.png

Re: gitdiffbinstat - git diff --shortstat -like output for changes in binary files

2013-03-29 Thread Jeff King
On Fri, Mar 29, 2013 at 07:07:32PM +0100, Matthias Krüger wrote:

 I use git mostly for game-development which means I have to deal with
 a lot of binary files (images, sound files etc).
 
 When I came to a point where I had run image optimization on a
 branch, I wanted to know of course how much smaller the new branch
 was in comparison to master.
 Problem was that 'git diff --stat' would only summerize
 per-binary-file size changes and 'git diff --shortstat' did skip the
 binary files entirely.

Have you tried --summary? Combined with --stat (or --shortstat) I
wonder if it would get you closer to what you want.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gitdiffbinstat - git diff --shortstat -like output for changes in binary files

2013-03-29 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 I use git mostly for game-development which means I have to deal with
 a lot of binary files (images, sound files etc).
 
 When I came to a point where I had run image optimization on a
 branch, I wanted to know of course how much smaller the new branch
 was in comparison to master.
 Problem was that 'git diff --stat' would only summerize
 per-binary-file size changes and 'git diff --shortstat' did skip the
 binary files entirely.

 Have you tried --summary? Combined with --stat (or --shortstat) I
 wonder if it would get you closer to what you want.

diff is not about how much did my project grow or shink.  If you
moved one block of lines up or down in the same file, you will see
double the number of lines moved in the statistics.

On the other hand, the use case to measure how much it helped to run
png optimizers only cares about the total size.

I do not think it is a good match to present the binary stat next
to the textual diff stats in the first place.  Adding two numbers do
not give us any meaningful number.

It is an interesting use case, but it just is not a problem core
Git, which is a source code management system, particularly wants to
bolt on a solution for, that does not mesh well with other parts of
the system.

We do want to make sure that people who want to deal with binaries
have access to raw statistics, so that they can build their solution
on top of, though.  ls-tree -r -l gives byte-size of each blob,
and the attribute system can let you tell which paths are binaries.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bug Report - Binary Files as plain files, line endings conversions

2012-07-15 Thread Jose Nobile
 Hi,

 $ git add .
 warning: LF will be replaced by CRLF in web/images/logo_twitter.png.
 The file will have its original line endings in your working directory.

 JOSE@COMPAQ /d/wamp/www/internationalstudies.co (master)
 $ git --version
 git version 1.7.11.msysgit.0

 JOSE@COMPAQ /d/wamp/www/internationalstudies.co (master)
 $ cat .gitattributes
 # Auto detect text files and perform LF normalization
 * eol=crlf

 JOSE@COMPAQ /d/wamp/www/internationalstudies.co (master)


 Binary files as an image not may perform any line ending operations,
many binary files will be corrupt when are restoring from repository,

 Saludos,
 José Nobile
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html