Re: [PATCH v2 2/2] sh-setup: explicitly mark CR as a field separator

2015-10-27 Thread Johannes Schindelin
Hi Junio,

On Mon, 26 Oct 2015, Junio C Hamano wrote:

> I asked Dscho if the shell is done correctly _for_ the platform.

This assumes that the platform is either CR/LF or LF. That is incorrect.
Windows does *not* dictate the line endings to be CR/LF. Certain
applications do.

The shell is an MSys2 shell, and MSys2 convention is to use LF. Hence the
shell is done correctly for the platform.

Yet on Windows (and in cross-platform projects also on Linux and MacOSX),
you have to deal with the fact that text files produced outside the cozy
little shell can have different line endings.

As I said, I am uninterested in arguing for arguing's sake. Otherwise I
would shoot down your argument that LF in IFS "does not hurt" but CR
somehow magically does. I am sure it would be a helluva discussion and
fun. But I do not have the time.

What I really needed you to do indicate the way forward.

The patch you provided in the mail I am replying to fails to fix the bug.

Ciao,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] sh-setup: explicitly mark CR as a field separator

2015-10-27 Thread Matthieu Moy
Junio C Hamano  writes:

> Matthieu Moy  writes:
>
>> Johannes Schindelin  writes:
>>
>>> This is the correct thing to do, really: we already specify LF as
>>> field separator.
>>
>> I'm almost convinced that this is the right thing to do in the long run
>> ("almost" because I'm not sure, not because I have arguments against). I
>> agree with Junio that the commit message should be more convincing, but
>> indeed, accepting LF and not CR is strange.
>
> If there were a single character that denotes CRLF, I'd say that
> including such a character in IFS would make sense on a system with
> CRLF EOL convention.  But that is not the case.

It's not, but there are platforms where the newline convention is
CR-only. AFAIK, they are all "old" architectures (like Mac OS < X), but
there are still CR-only files lying around.

>  * read is not the only user of IFS.  Expressing "list of things"
>(pre bashism "shell array" days) by concatenating elements into a
>single string variable, separated with LF, and later iterating
>over them is a very common use case, e.g.
>
>   LF='
> '
>   list="$thing1"
>   list="$list$LF$thing2"
>   list="$list$LF$thing3"
> ...
>
>   IFS=$LF
>   for thing in $list
> do
>   ...
>
>And including LF by default in IFS, especially when "things" can
>contain SP/HT, is handy.

I don't get the argument. You're talking about a case where you
explicitly set IFS, and the patch is about changing the default. The
use-case above would work exactly like before if we modify the default
IFS.

> If you
> have a variable in which "A^MB" is there, "set $variable" would
> split them into two and assign B to $2, which is not what the
> scripts would expect.

The same goes if you replace ^M with a tab or a space. Using unquoted
"set $variable" is sane only if you are sure that $variable does not
contain unexpected special characters. I can't imagine a case where one
would accept space, tab or LF as separator and would need to accept CR
as a non-separator.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] sh-setup: explicitly mark CR as a field separator

2015-10-27 Thread Johannes Schindelin
Hi Junio,

On Mon, 26 Oct 2015, Junio C Hamano wrote:

> Subject: [PATCH] rebase-i: work around Windows CRLF line endings
> 
> Editors on Windows can and do save text files with CRLF line
> endings, which is the convention on the platform.  We are seeing
> reports that the "read" command in a port of bash to the environment
> however does not strip the CRLF at the end, not adjusting for the
> same convention on the platform.
> 
> This breaks the recently added sanity checks for the insn sheet fed
> to "rebase -i"; instead of an empty line (hence nothing in $command),
> the script was getting a lone CR in there.
> 
> Special case a lone CR and treat it the same way as an empty line to
> work this around.

You do not need me to tell you that it is likely that the same issue will
arise in other places, but if you are content with this work-around, so be
it.


> The test was stolen from Dscho.

There was no need to steal this, as I made a specific effort to introduce
the test in a separate commit. There was also no need to remove the
attribution for finding the bug.

Let's not do that.

I will submit v3 in a second, including your work-around in favor of the
patch you hate so much.

Ciao,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] sh-setup: explicitly mark CR as a field separator

2015-10-27 Thread Junio C Hamano
Johannes Schindelin  writes:

> This assumes that the platform is either CR/LF or LF. That is incorrect.
> Windows does *not* dictate the line endings to be CR/LF. Certain
> applications do.
>
> The shell is an MSys2 shell, and MSys2 convention is to use LF. Hence the
> shell is done correctly for the platform.
>
> Yet on Windows (and in cross-platform projects also on Linux and MacOSX),
> you have to deal with the fact that text files produced outside the cozy
> little shell can have different line endings.

I know that.  That is why I said this is platform specific.  When
somebody on Ubuntu comes and says "XYZ does not work when I saved my
text file with CRLF", I can say "don't do that then".  I've been
assuming that you cannot say the same to your users, in order to be
nicer to Windows.  You say MSys2 convention is to use LF, but what
do you exactly mean by that?

I am hoping that you do not mean "we treat a file created by a DOS
editor as a text file with CR appended at the end of each line".
I'd instead expect that a file the user wrote ABC on the first line
will give you ABC in $word if the user did this:

#!/bin/sh
read word  As I said, I am uninterested in arguing for arguing's sake.

I already explained why this particular thing does not have anything
to do with IFS but comes from the way your "read" behaves, which is
not in line with the way users on your platform expects.  Is that so
hard to understand?

An argument you make to support adding stuff to IFS _is_ an argument
for arguing's sake.

> What I really needed you to do indicate the way forward.

Yes, and I already did, didn't I?

I already said that mucking with IFS is a wrong thing to do, and
indicated that that is not the way forward.

I also already said that fixing your read is the only sensible thing
to do in the longer term iff you claim to support those who use
"certain applications" that leave CRLF line terminator.

 * http://pubs.opengroup.org/onlinepubs/9699919799/utilities/read.html
   defines the operation as a two-step process.  It is supposed to read
   an input line and then "The terminating  (if any) shall
   be removed from the input".  And then "the result shall be split into
   fields" (at IFS boundaries).

   As POSIX does not say  has to be a single octet with
   value 10 (the same way you are allowed to show CRLF when you call
   printf("\n"), and you do not see CR left at the end when you use
   gets(3)).

   So on a platform that (optionally) accepts CRLF as a line
   terminator, "read" should read up to LF for a line, and if there
   is a CR immediately before that LF, remove that CR, too.  Now you
   got the result of "terminating  (if any) shall be
   removed".  Then look at "the result" and split into fields using
   IFS.  That way, you still process LF terminated files correctly,
   and you would never show a single ^M for an empty line, which
   triggered this thread.

 * Optionally, you may want to implement the "field splitting" (at
   IFS boundaries) with a small twist.  When the script/user
   included LF in IFS, split at LF but remove CR if there is one
   that immediately precedes the LF.

And also I gave you a more localized (i.e. less likely to negatively
affect unrelated parts of the system) workaround that we could use,
until your "read" is fixed.

Mind you, I am not happy with that "$cr may appear at the end of the
last token on the line" workaround.  Other places in the code (both
inside and outside Git scripted porcelains) that calls "read" to
read from text files can suffer from the same issue, until it is
fixed.  Covering up the bug in "read" by throwing extra characters
to IFS, especially when read is not the only user of IFS, is not a
fix.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] sh-setup: explicitly mark CR as a field separator

2015-10-26 Thread Matthieu Moy
Johannes Schindelin  writes:

> This is the correct thing to do, really: we already specify LF as
> field separator.

I'm almost convinced that this is the right thing to do in the long run
("almost" because I'm not sure, not because I have arguments against). I
agree with Junio that the commit message should be more convincing, but
indeed, accepting LF and not CR is strange.

However, is this the right thing to do in the maintainance branch? It
does fix the issue, but does so in a rather intrusive way, so I'd need
more arguments to be convinced that this is safe to merge in maint. Or
have a local fix for rebase to be merged in maint, and apply this in
master for the next feature release.

Sorry for being negative, and especially sorry since I'm partly guilty
for the breakage. I just want to be sure that we don't break anything
while repairing it (we already introduced this breakage while repairing
another one...).

>  # Similarly for IFS, but some shells (e.g. FreeBSD 7.2) are buggy and
>  # do not equate an unset IFS with IFS with the default, so here is
> -# an explicit SP HT LF.
> +# an explicit SP HT LF CR.
>  IFS='
> -'
> +'"$(printf '\r')"

While we're there, it may be better to have a single "printf ' \t\n\r'"
to avoid the whitespace magic in the source code.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] sh-setup: explicitly mark CR as a field separator

2015-10-26 Thread Junio C Hamano
Junio C Hamano  writes:

> Something along the line of the following would be tolerable, even
> though I think in the longer term, not just in Git land but in the
> larger ecosystem to use POSIXy tools on Windows, the best solution
> is to fix the shell so that it matches the expectation of the users
> of its platform.
>
> I say "something along the line of" here because I do not know how
> the problematic shell behaves on the assignment command that stuffs
> a lone CR into a variable.  It _might_ need a similar protection
> against the shell feature "the last EOL is removed from the result
> of command expansion", as I did in the above example, depending on
> how incoherent the implementation is.  The implementation seems to
> accept CRLF and LF in shell scripts proper just fine, but apparently
> its implementation of "read" does not honor such platform EOL
> convention, which caused this issue, and I don't know what it does
> in the codepath that implements command expansion.

I still have to say "something along the lines of" as I do not have
(and I do not wish to have, to be honest) an access to test this
with the problematic shell implementation, but here is an updated
patch.

-- >8 --
Subject: [PATCH] rebase-i: work around Windows CRLF line endings

Editors on Windows can and do save text files with CRLF line
endings, which is the convention on the platform.  We are seeing
reports that the "read" command in a port of bash to the environment
however does not strip the CRLF at the end, not adjusting for the
same convention on the platform.

This breaks the recently added sanity checks for the insn sheet fed
to "rebase -i"; instead of an empty line (hence nothing in $command),
the script was getting a lone CR in there.

Special case a lone CR and treat it the same way as an empty line to
work this around.

The test was stolen from Dscho.

Signed-off-by: Junio C Hamano 
---
 git-rebase--interactive.sh| 13 +
 t/t3404-rebase-interactive.sh | 12 
 2 files changed, 25 insertions(+)

diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index c42ba34..3911711 100644
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -77,6 +77,10 @@ amend="$state_dir"/amend
 rewritten_list="$state_dir"/rewritten-list
 rewritten_pending="$state_dir"/rewritten-pending
 
+# Work around a Windows port of shell that does not strip
+# the newline at the end of a line correctly.
+cr=$(printf "\015")
+
 strategy_args=
 if test -n "$do_merge"
 then
@@ -518,6 +522,11 @@ do_next () {
"$comment_char"*|''|noop|drop|d)
mark_action_done
;;
+   "$cr")
+   # Windows port of shell not stripping the newline
+   # at the end of an empty line correctly.
+   mark_action_done
+   ;;
pick|p)
comment_for_reflog pick
 
@@ -888,6 +897,10 @@ check_bad_cmd_and_sha () {
"$comment_char"*|''|noop|x|exec)
# Doesn't expect a SHA-1
;;
+   "$cr")
+   # Windows port of shell not stripping the newline
+   # at the end of an empty line correctly.
+   ;;
pick|p|drop|d|reword|r|edit|e|squash|s|fixup|f)
if ! check_commit_sha "${rest%%[]*}" "$lineno" 
"$1"
then
diff --git a/t/t3404-rebase-interactive.sh b/t/t3404-rebase-interactive.sh
index 88d7d53..2f97a3d 100755
--- a/t/t3404-rebase-interactive.sh
+++ b/t/t3404-rebase-interactive.sh
@@ -1240,4 +1240,16 @@ test_expect_success 'static check of bad SHA-1' '
test E = $(git cat-file commit HEAD | sed -ne \$p)
 '
 
+test_expect_success 'editor saves as CR/LF' '
+   git checkout -b with-crlf &&
+   write_script add-crs.sh <<-\EOF &&
+   sed -e "s/\$/Q/" <"$1" | tr Q "\\015" >"$1".new &&
+   mv -f "$1".new "$1"
+   EOF
+   (
+   test_set_editor "$(pwd)/add-crs.sh" &&
+   git rebase -i HEAD^
+   )
+'
+
 test_done
-- 
2.6.2-405-g6877da6


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] sh-setup: explicitly mark CR as a field separator

2015-10-26 Thread Junio C Hamano
Matthieu Moy  writes:

> Johannes Schindelin  writes:
>
>> This is the correct thing to do, really: we already specify LF as
>> field separator.
>
> I'm almost convinced that this is the right thing to do in the long run
> ("almost" because I'm not sure, not because I have arguments against). I
> agree with Junio that the commit message should be more convincing, but
> indeed, accepting LF and not CR is strange.

If there were a single character that denotes CRLF, I'd say that
including such a character in IFS would make sense on a system with
CRLF EOL convention.  But that is not the case.

On a platform with LF EOL convention, having LF in IFS makes sense,
due to two reasons.

 * read is not the only user of IFS.  Expressing "list of things"
   (pre bashism "shell array" days) by concatenating elements into a
   single string variable, separated with LF, and later iterating
   over them is a very common use case, e.g.

LF='
'
list="$thing1"
list="$list$LF$thing2"
list="$list$LF$thing3"
...

IFS=$LF
for thing in $list
do
...

   And including LF by default in IFS, especially when "things" can
   contain SP/HT, is handy.

 * It does not hurt on a platform with LF EOL convention to include
   LF in IFS, because you cannot have a "lone LF" in the middle of a
   line.  Presence of a single LF alone will terminate the current
   line and the bytes that follow it will be a next line.

No similarity argument can be made for a lone CR on a platform that
uses CRLF EOL convention.  A "lone CR" can appear in the middle of a
line without terminating the current line, as only a CR that is
immediately followed by a LF is the end of line, so you cannot make
the "It does not hurt" argument for including CR to IFS.  If you
have a variable in which "A^MB" is there, "set $variable" would
split them into two and assign B to $2, which is not what the
scripts would expect.

> However, is this the right thing to do in the maintainance branch? It
> does fix the issue, but does so in a rather intrusive way, so I'd need
> more arguments to be convinced that this is safe to merge in maint.

If this were the "right" thing in general for shell scripts on
systems with CRLF EOL convention, the implementation of the shell
language on such a platform would be doing that upon startup (or
upon "unset IFS"), and we wouldn't be having this discussion.  Don't
you think it is strange that individual applications like Git have
to set IFS to include CR?  I see it as a strong sign that this is
not a correct thing to do.

Intrusive does not begin to describe it.

I think the "right" thing for a shell implementation on a CRLF
platform to do is twofold.

 * Read
   http://pubs.opengroup.org/onlinepubs/9699919799/utilities/read.html
   and notice that the operation is defined as a two-step process.
   It is supposed to read an input line and then "The terminating
(if any) shall be removed from the input".  And then
   "the result shall be split into fields" (at IFS boundaries).

   As POSIX does not say  has to be a single octet with
   value 10 (the same way you are allowed to show CRLF when you call
   printf("\n")), a shell implementation would read CRLF terminated
   lines, and remove the terminating CRLF before doing the field
   splitting using IFS.

   This is why I said in my message to Dscho that IFS does not get
   into the picture.

 * Implement the "field splitting" (at IFS boundaries) with a small
   twist.  When the script/user included '\n' in IFS, split at LF
   but remove CR if there is one that immediately precedes the LF.

The latter would help if you did the previous "read is not the only
user of IFS" example like this instead:

LF=$(printf '\n.')
LF=${LF%.}
list="$thing1"
list="$list$LF$thing2"
list="$list$LF$thing3"
...

IFS=$LF
for thing in $list
do
...

and the platform 'printf' gave CRLF for "\n" (newline), resulting
the list to be concatenated with CRLF, not LF.

And this is why I asked Dscho if the shell is done correctly _for_
the platform.

> Sorry for being negative, and especially sorry since I'm partly guilty
> for the breakage. I just want to be sure that we don't break anything
> while repairing it (we already introduced this breakage while repairing
> another one...).

Something along the line of the following would be tolerable, even
though I think in the longer term, not just in Git land but in the
larger ecosystem to use POSIXy tools on Windows, the best solution
is to fix the shell so that it matches the expectation of the users
of its platform.

I say "something along the line of" here because I do not know how
the problematic shell behaves on the assignment command that stuffs
a lone CR into a variable.  It _might_ need a similar protection
against the shell feature "the last