Re: Consider escaping special characters like 'less' does

2017-10-22 Thread Jeff King
On Sun, Oct 22, 2017 at 08:27:20AM -0400, Jason Pyeron wrote:

> > ESC (for color)
> > +   if ($line =~ s/[\000-\011\013-\032\034-\037]/?/g) {
> 
> What about CR [0x0D] ?

I assumed that CR was one of the things we'd want to avoid (and it was
in fact what I used to test this). E.g., try:

  echo base >file
  git add file
  printf 'foo\r+bar\n' >file

Running through "less" shows something like:

  diff --git a/file b/file
  index df967b9..5b6ee80 100644
  --- a/file
  +++ b/file
  @@ -1 +1 @@
  -base
  +foo^M+bar

but "git add -p" shows:

  $ git add -p
  diff --git a/file b/file
  index df967b9..5b6ee80 100644
  --- a/file
  +++ b/file
  @@ -1 +1 @@
  -base
  +bar
  Stage this hunk [y,n,q,a,d,/,e,?]? 

For systems where CRLF is the native line ending, we'd still expect to
see only LFs here when line-ending conversion is on (since the diff
shows the canonical in-repo form).

For files where the CRs must remain for some reason, they'd show as a
"?" at the end. The "^M" shown by "less" is a bit more informative. If
we really do want to pursue this direction, it might make sense to use
more descriptive placeholders.

-Peff


RE: Consider escaping special characters like 'less' does

2017-10-22 Thread Jason Pyeron
> -Original Message-
> From: Jeff King
> Sent: Monday, October 16, 2017 6:13 PM
> To: Joris Valette
> Cc: Andreas Schwab; Jason Pyeron; git@vger.kernel.org
> Subject: Re: Consider escaping special characters like 'less' does
> 

> 
> I.e., something like this would probably help your case 
> without hurting
> anybody:
> 
> diff --git a/git-add--interactive.perl b/git-add--interactive.perl
> index 28b325d754..d44e5ea459 100755
> --- a/git-add--interactive.perl
> +++ b/git-add--interactive.perl
> @@ -714,6 +714,16 @@ sub parse_diff {
>   push @{$hunk[-1]{DISPLAY}},
>   (@colored ? $colored[$i] : $diff[$i]);
>   }
> +
> + foreach my $hunk (@hunk) {
> + foreach my $line (@{$hunk->{DISPLAY}}) {
> + # all control chars minus newline and 
> ESC (for color)
> + if ($line =~ s/[\000-\011\013-\032\034-\037]/?/g) {

What about CR [0x0D] ?

> + $hunk->{CONTROLCHARS} = 1;
> + }
> + }
> + }
> +
>   return @hunk;
>  }
>  
> @@ -1407,6 +1417,9 @@ sub patch_update_file {
>   if ($hunk[$ix]{TYPE} eq 'hunk') {
>   $other .= ',e';
>   }
> + if ($hunk[$ix]->{CONTROLCHARS}) {
> + print "warning: control characters in 
> hunk have been replaced by '?'\n";
> + }
>   for (@{$hunk[$ix]{DISPLAY}}) {
>   print;
>   }
> 
> I can't help but feel this is the tip of a larger iceberg, 
> though. E.g.,
> what about characters outside of the terminal's correct encoding? Or
> broken UTF-8 characters?
> 
> -Peff
> 



Re: Consider escaping special characters like 'less' does

2017-10-16 Thread Jeff King
On Tue, Oct 17, 2017 at 10:13:34AM +0900, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > Alternatively, I suppose we could just always escape in
> > add--interactive. I'm trying to think of a case where somebody would
> > really want their diffFilter to see the raw bytes (or vice versa, to
> > show raw bytes produced by their filter), but I'm having trouble coming
> > up with one.
> 
> Your patch below only implements the "tame the raw bytes that come
> out of their filter", which is quite agreeable.

Yes. I think that is probably OK, especially given that we continue to
allow terminal escapes (certainly some filters would want to use colors;
I don't know if any would want to use more exotic control codes).

> > I can't help but feel this is the tip of a larger iceberg, though. E.g.,
> > what about characters outside of the terminal's correct encoding? Or
> > broken UTF-8 characters?
> 
> Hmph.  If you use it as a "built-in" that is a fallback for
> diffFilter, i.e. use it only when the end user does not have one,
> then users can override whatever wrong thing the built-in logic does
> so... ;-)

Yes, and maybe that is the best way to do it. It just seems like it is
opening a can of worms about exactly which things should be filtered and
how.

I also wondered if people would be annoyed that by using a filter, they
don't get the benefit of the escaping, unless their filter implements it
separately on top (and the original purpose of the filter option was for
things like diff-highlight and diff-so-fancy, which do not do such
escaping).

-Peff


Re: Consider escaping special characters like 'less' does

2017-10-16 Thread Junio C Hamano
Jeff King  writes:

> Alternatively, I suppose we could just always escape in
> add--interactive. I'm trying to think of a case where somebody would
> really want their diffFilter to see the raw bytes (or vice versa, to
> show raw bytes produced by their filter), but I'm having trouble coming
> up with one.

Your patch below only implements the "tame the raw bytes that come
out of their filter", which is quite agreeable.

> I.e., something like this would probably help your case without hurting
> anybody:
> ...
>
> I can't help but feel this is the tip of a larger iceberg, though. E.g.,
> what about characters outside of the terminal's correct encoding? Or
> broken UTF-8 characters?

Hmph.  If you use it as a "built-in" that is a fallback for
diffFilter, i.e. use it only when the end user does not have one,
then users can override whatever wrong thing the built-in logic does
so... ;-)



Re: Consider escaping special characters like 'less' does

2017-10-16 Thread Jeff King
On Tue, Oct 17, 2017 at 12:47:01AM +0200, Andreas Schwab wrote:

> On Okt 16 2017, Jeff King  wrote:
> 
> > I can't help but feel this is the tip of a larger iceberg, though. E.g.,
> > what about characters outside of the terminal's correct encoding? Or
> > broken UTF-8 characters?
> 
> Or correctly encoded UTF-8 characters that look confusing?  Or blobs
> with embedded escape sequences?

Yes, leaving ESC unfiltered here made me hesitate, too. We must allow it
to show colors, but showing diffs with raw terminal codes can be quite
confusing.

My general advice on committing unprintable characters is: don't.

-Peff


Re: Consider escaping special characters like 'less' does

2017-10-16 Thread Andreas Schwab
On Okt 16 2017, Jeff King  wrote:

> I can't help but feel this is the tip of a larger iceberg, though. E.g.,
> what about characters outside of the terminal's correct encoding? Or
> broken UTF-8 characters?

Or correctly encoded UTF-8 characters that look confusing?  Or blobs
with embedded escape sequences?

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


Re: Consider escaping special characters like 'less' does

2017-10-16 Thread Jeff King
On Sun, Oct 15, 2017 at 11:37:04PM +0200, Joris Valette wrote:

> 2017-10-15 22:06 GMT+02:00 Jeff King :
> > Git's diff generation will never do such escaping by default, because it
> > means creating a patch that cannot be applied to get back the original
> > content.
> 
> Indeed this would be a problem. That's where my knowledge of git's
> source code ends, but in that case, can't the output be discriminated
> against the command that was executed?
> Command that outputs an applicable patch -> don't escape
> Command that outputs a diff to see changes -> escape

Speaking in a general sense, people use "git diff" for both purposes,
and we can't necessarily tell which[1]. As a matter of fact, that's a
potential problem with textconv filters as well (which are enabled by
default for git-diff, but not for plumbing like diff-tree, format-patch,
etc). I think the feature isn't widely used enough for people to run
into problems (and they also use it only on things that they don't
_expect_ to be able to make patches for, since they're binaries).

[1] Of course we can come up with heuristics, like "is stdout going to a
a terminal"? But in such a case we already kick in the pager, and it
does the exact escaping you're asking for. :)

For a command like add--interactive, it knows which invocations are for
showing to the user and which are for applying (and it already uses
"--color" selectively, for instance). So if there were an "escape this"
option in git's diff generation, we could selectively pass it. But
again, I think the right solution there is not building the escaping
into Git, but just passing it through another filter.

> > It doesn't seem out of the question to me to have an out-of-the-box
> > default for interactive.diffFilter which does some basic escaping (we
> > could even implement it inside the perl script for efficiency).
> 
> Yes I read this thread, but I was left unsatisfied because I would
> like something out-of-the-box.
> Your suggestion might be the best solution then: implement some
> default escaping for interactive.diffFilter.

Alternatively, I suppose we could just always escape in
add--interactive. I'm trying to think of a case where somebody would
really want their diffFilter to see the raw bytes (or vice versa, to
show raw bytes produced by their filter), but I'm having trouble coming
up with one.

I.e., something like this would probably help your case without hurting
anybody:

diff --git a/git-add--interactive.perl b/git-add--interactive.perl
index 28b325d754..d44e5ea459 100755
--- a/git-add--interactive.perl
+++ b/git-add--interactive.perl
@@ -714,6 +714,16 @@ sub parse_diff {
push @{$hunk[-1]{DISPLAY}},
(@colored ? $colored[$i] : $diff[$i]);
}
+
+   foreach my $hunk (@hunk) {
+   foreach my $line (@{$hunk->{DISPLAY}}) {
+   # all control chars minus newline and ESC (for color)
+   if ($line =~ s/[\000-\011\013-\032\034-\037]/?/g) {
+   $hunk->{CONTROLCHARS} = 1;
+   }
+   }
+   }
+
return @hunk;
 }
 
@@ -1407,6 +1417,9 @@ sub patch_update_file {
if ($hunk[$ix]{TYPE} eq 'hunk') {
$other .= ',e';
}
+   if ($hunk[$ix]->{CONTROLCHARS}) {
+   print "warning: control characters in hunk have been 
replaced by '?'\n";
+   }
for (@{$hunk[$ix]{DISPLAY}}) {
print;
}

I can't help but feel this is the tip of a larger iceberg, though. E.g.,
what about characters outside of the terminal's correct encoding? Or
broken UTF-8 characters?

-Peff


Re: Consider escaping special characters like 'less' does

2017-10-15 Thread Joris Valette
2017-10-15 22:06 GMT+02:00 Jeff King :
> Git's diff generation will never do such escaping by default, because it
> means creating a patch that cannot be applied to get back the original
> content.

Indeed this would be a problem. That's where my knowledge of git's
source code ends, but in that case, can't the output be discriminated
against the command that was executed?
Command that outputs an applicable patch -> don't escape
Command that outputs a diff to see changes -> escape

> There _are_ already options to create diffs that cannot be applied, like
> --textconv. So it would be possible to add a similar option for
> escaping. But I don't think we really need or want a separate option,
> when you can already do one of:
>
>   1. If your files have special binary characters that are hard to see,
>  you can use the existing textconv system to do whatever escaping
>  you like. And then the Git will diff the result of the escaping,
>  which means you get readable diffs when they change.
>
>   2. Put the raw output of git's diff through a filter that escapes. We
>  already do this most of the time by piping through less. The most
>  noticeable exception is "add --patch". There you can set up a
>  program to filter as well. There's more information in a recent
>  thread here:
>
>
> https://public-inbox.org/git/20171012184736.rglkbyryauwuv...@sigill.intra.peff.net/
>
> It doesn't seem out of the question to me to have an out-of-the-box
> default for interactive.diffFilter which does some basic escaping (we
> could even implement it inside the perl script for efficiency).

Yes I read this thread, but I was left unsatisfied because I would
like something out-of-the-box.
Your suggestion might be the best solution then: implement some
default escaping for interactive.diffFilter.

Joris


Re: Consider escaping special characters like 'less' does

2017-10-15 Thread Jeff King
On Sun, Oct 15, 2017 at 06:39:45PM +0200, Joris Valette wrote:

> > It's your MUA's fault that you get a ?, the mail didn't contain any.
> 
> Actually, the original mail contained the special BOM sequence but
> it's generally invisible. His MUA shows it with a '?', mine doesn't
> show anything, neither does Firefox on the mailing list page.
> 
> The question remains: could escaping be done?

Git's diff generation will never do such escaping by default, because it
means creating a patch that cannot be applied to get back the original
content.

There _are_ already options to create diffs that cannot be applied, like
--textconv. So it would be possible to add a similar option for
escaping. But I don't think we really need or want a separate option,
when you can already do one of:

  1. If your files have special binary characters that are hard to see,
 you can use the existing textconv system to do whatever escaping
 you like. And then the Git will diff the result of the escaping,
 which means you get readable diffs when they change.

  2. Put the raw output of git's diff through a filter that escapes. We
 already do this most of the time by piping through less. The most
 noticeable exception is "add --patch". There you can set up a
 program to filter as well. There's more information in a recent
 thread here:

   
https://public-inbox.org/git/20171012184736.rglkbyryauwuv...@sigill.intra.peff.net/

It doesn't seem out of the question to me to have an out-of-the-box
default for interactive.diffFilter which does some basic escaping (we
could even implement it inside the perl script for efficiency).

-Peff


Re: Consider escaping special characters like 'less' does

2017-10-15 Thread Joris Valette
2017-10-15 17:46 GMT+02:00 Andreas Schwab :
> On Okt 15 2017, "Jason Pyeron"  wrote:
>
>>> -Original Message-
>>> From: Joris Valette
>>> Sent: Sunday, October 15, 2017 9:34 AM
>>> To: git@vger.kernel.org
>>> Subject: Consider escaping special characters like 'less' does
>>>
>>> The pager 'less' escapes some characters when calling 'git diff'. This
>>> is what I might get:
>>>
>>> $ git diff --cached
>>> diff --git a/some_file b/some_file
>>> new file mode 100644
>>> index 000..357323f
>>> --- /dev/null
>>> +++ b/some_file
>>> @@ -0,0 +1 @@
>>> +Hello
>>> \ No newline at end of file
>>>
>>> This example is a simple file encoded in UTF-8 *with BOM*.
>>> On the other hand, the built-in git output shows this:
>>>
>>> $ git --no-pager diff --cached
>>> diff --git a/some_file b/some_file
>>> new file mode 100644
>>> index 000..357323f
>>> --- /dev/null
>>> +++ b/some_file
>>> @@ -0,0 +1 @@
>>> +?Hello
>>> \ No newline at end of file
>>
>> It is your terminal, not git's fault that you get a ? rendered.
>
> It's your MUA's fault that you get a ?, the mail didn't contain any.

Actually, the original mail contained the special BOM sequence but
it's generally invisible. His MUA shows it with a '?', mine doesn't
show anything, neither does Firefox on the mailing list page.

The question remains: could escaping be done?

Joris


Re: Consider escaping special characters like 'less' does

2017-10-15 Thread Andreas Schwab
On Okt 15 2017, "Jason Pyeron"  wrote:

>> -Original Message-
>> From: Joris Valette
>> Sent: Sunday, October 15, 2017 9:34 AM
>> To: git@vger.kernel.org
>> Subject: Consider escaping special characters like 'less' does
>> 
>> The pager 'less' escapes some characters when calling 'git diff'. This
>> is what I might get:
>> 
>> $ git diff --cached
>> diff --git a/some_file b/some_file
>> new file mode 100644
>> index 000..357323f
>> --- /dev/null
>> +++ b/some_file
>> @@ -0,0 +1 @@
>> +Hello
>> \ No newline at end of file
>> 
>> This example is a simple file encoded in UTF-8 *with BOM*.
>> On the other hand, the built-in git output shows this:
>> 
>> $ git --no-pager diff --cached
>> diff --git a/some_file b/some_file
>> new file mode 100644
>> index 000..357323f
>> --- /dev/null
>> +++ b/some_file
>> @@ -0,0 +1 @@
>> +?Hello
>> \ No newline at end of file
>
> It is your terminal, not git's fault that you get a ? rendered.

It's your MUA's fault that you get a ?, the mail didn't contain any.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


RE: Consider escaping special characters like 'less' does

2017-10-15 Thread Jason Pyeron
> -Original Message-
> From: Joris Valette
> Sent: Sunday, October 15, 2017 9:34 AM
> To: git@vger.kernel.org
> Subject: Consider escaping special characters like 'less' does
> 
> The pager 'less' escapes some characters when calling 'git diff'. This
> is what I might get:
> 
> $ git diff --cached
> diff --git a/some_file b/some_file
> new file mode 100644
> index 000..357323f
> --- /dev/null
> +++ b/some_file
> @@ -0,0 +1 @@
> +Hello
> \ No newline at end of file
> 
> This example is a simple file encoded in UTF-8 *with BOM*.
> On the other hand, the built-in git output shows this:
> 
> $ git --no-pager diff --cached
> diff --git a/some_file b/some_file
> new file mode 100644
> index 000..357323f
> --- /dev/null
> +++ b/some_file
> @@ -0,0 +1 @@
> +?Hello
> \ No newline at end of file

It is your terminal, not git's fault that you get a ? rendered.

$ git diff
diff --git a/test.txt b/test.txt
index af9f93a..294e397 100644
--- a/test.txt
+++ b/test.txt
@@ -1 +1 @@
-the quick brown fox jumps over the lazy dog
+the quick brown fox jumps over the lazy dog

$ git diff | hexdump.exe -C
  64 69 66 66 20 2d 2d 67  69 74 20 61 2f 74 65 73  |diff --git a/tes|
0010  74 2e 74 78 74 20 62 2f  74 65 73 74 2e 74 78 74  |t.txt b/test.txt|
0020  0a 69 6e 64 65 78 20 61  66 39 66 39 33 61 2e 2e  |.index af9f93a..|
0030  32 39 34 65 33 39 37 20  31 30 30 36 34 34 0a 2d  |294e397 100644.-|
0040  2d 2d 20 61 2f 74 65 73  74 2e 74 78 74 0a 2b 2b  |-- a/test.txt.++|
0050  2b 20 62 2f 74 65 73 74  2e 74 78 74 0a 40 40 20  |+ b/test.txt.@@ |
0060  2d 31 20 2b 31 20 40 40  0a 2d 74 68 65 20 71 75  |-1 +1 @@.-the qu|
0070  69 63 6b 20 62 72 6f 77  6e 20 66 6f 78 20 6a 75  |ick brown fox ju|
0080  6d 70 73 20 6f 76 65 72  20 74 68 65 20 6c 61 7a  |mps over the laz|

Note: 0a is newline, 2b is + ef bb bf are the specials added to the file, 74 68 
65 20 71 75 is 'the qu'

0090  79 20 64 6f 67 0a 2b ef  bb bf 74 68 65 20 71 75  |y dog.+...the qu|
00a0  69 63 6b 20 62 72 6f 77  6e 20 66 6f 78 20 6a 75  |ick brown fox ju|
00b0  6d 70 73 20 6f 76 65 72  20 74 68 65 20 6c 61 7a  |mps over the laz|
00c0  79 20 64 6f 67 0a |y dog.|
00c6

$ git diff | cat
diff --git a/test.txt b/test.txt
index af9f93a..294e397 100644
--- a/test.txt
+++ b/test.txt
@@ -1 +1 @@
-the quick brown fox jumps over the lazy dog
+the quick brown fox jumps over the lazy dog

$ git --no-pager diff
diff --git a/test.txt b/test.txt
index af9f93a..294e397 100644
--- a/test.txt
+++ b/test.txt
@@ -1 +1 @@
-the quick brown fox jumps over the lazy dog
+the quick brown fox jumps over the lazy dog

And my UTF-8 capable terminal displays it fine.

What do you get when you pipe to hexdump?

-Jason



Consider escaping special characters like 'less' does

2017-10-15 Thread Joris Valette
The pager 'less' escapes some characters when calling 'git diff'. This
is what I might get:

$ git diff --cached
diff --git a/some_file b/some_file
new file mode 100644
index 000..357323f
--- /dev/null
+++ b/some_file
@@ -0,0 +1 @@
+Hello
\ No newline at end of file

This example is a simple file encoded in UTF-8 *with BOM*.
On the other hand, the built-in git output shows this:

$ git --no-pager diff --cached
diff --git a/some_file b/some_file
new file mode 100644
index 000..357323f
--- /dev/null
+++ b/some_file
@@ -0,0 +1 @@
+Hello
\ No newline at end of file




You can see 'less' shows  explicitly, while it is implicit and
hidden with git's built-in output.
Other characters behave the same, like ZERO WIDTH SPACE , and
I believe many others.
This is particularly annoying when there are other changes on the same
line. Here I add ' world!' but also remove the BOM:

$ git diff
diff --git a/some_file b/some_file
index 357323f..6769dd6 100644
--- a/some_file
+++ b/some_file
@@ -1 +1 @@
-Hello
\ No newline at end of file
+Hello world!
\ No newline at end of file

Compare with:

$ git --no-pager diff
diff --git a/some_file b/some_file
index 357323f..6769dd6 100644
--- a/some_file
+++ b/some_file
@@ -1 +1 @@
-Hello
\ No newline at end of file
+Hello world!
\ No newline at end of file




In the first example I can see both changes, in the second example the
BOM removal is hidden and I won't notice it because there is indeed
another genuine change on the same line.

I'll add a third example with CRLF line-terminators:

$ git diff
diff --git a/some_file b/some_file
index 357323f..dfd6895 100644
--- a/some_file
+++ b/some_file
@@ -1 +1 @@
-Hello
\ No newline at end of file
+Hello world!^M

and:

$ git --no-pager diff
diff --git a/some_file b/some_file
index 357323f..dfd6895 100644
--- a/some_file
+++ b/some_file
@@ -1 +1 @@
-Hello
\ No newline at end of file
+Hello world!




My thoughts are that git should add this feature and tend towards
less's behavior. A diff output should show as much as possible.
I would happily use 'less' all the time, but unfortunately I don't
think I can. For instance, 'git add --patch', which I use quite often,
prints through git's built-in output, and I miss less's features.

Joris Valette.