Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute

2018-01-31 Thread Junio C Hamano
Lars Schneider  writes:

>> I am not sure why this is special cased and other codepaths have "if
>> WRITE_OBJECT then die, otherwise error" checks, so no, I do not
>> agree with your reasoning, at least not yet.
>
> The convert_to_git()/encode_to_git() machinery is used in two different
> kinds of code paths:
>
> Some code paths actually write to the Git database (indicated by the 
> CONV_WRITE_OBJECT flag). I consider these the "critical/important" code 
> paths and I don't want to tolerate any encoding errors in these cases as 
> the errors would be "forever" in the Git database. That's why I call 
> die() on errors for these cases to abort whatever we are doing. 
>
> Other code paths do not write to the Git database (e.g. during "git 
> checkout" we use the code to ensure that we are moving away from the 
> exact state that we think we are moving away). In these code paths I am 
> less concerned about encoding errors. I also don't want to abort the 
> operation (e.g. "git checkout") in these cases. That's why I only inform
> the user about the problem with an error message.

Warning the users early while they are doing non-writing
operation to give them chance to adjust the contents, before they
actually need to register the contents as objects by writing, at
which point we need to die.  That's a reasonable distinction and all
of that I already agree with.

What was questionable and left unexplained was why this roundtrip
thing needs to be different.

> The encoding round-trip check can be expensive. That's why I decided 
> initially to only execute the check in the "critical/important" 
> write-to-Git-database situations (CONV_WRITE_OBJECT flag!). I also 
> decided to run it only if the "SHIFT-JIS" encoding is used as this was
> the only encoding that I could find which reportedly does not round-trip
> with UTF-8 (although I was not able to replicate the round-trip 
> problems). 

I still do not see why you have problems with the approach of
maintaining a configurable set of "iffy" encodings (and throw SJIS
into the default list) to achieve all of the above and more.  For
SJIS users, instead of having to set environment variables to obtain
safe behaviour, they automatically get safe behaviour.  When using
encodings that are not problematic, they do not need to spend cycles
checking round-trip.  And when SJIS users know they do not care about
roundtrip checks, they can just configure SJIS away from the list.


Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute

2018-01-31 Thread Lars Schneider

> On 30 Jan 2018, at 22:56, Junio C Hamano  wrote:
> 
> Lars Schneider  writes:
> 
>>> On 30 Jan 2018, at 21:05, Junio C Hamano  wrote:
>>> 
>>> tbo...@web.de writes:
>>> 
 +  if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, 
 "SHIFT-JIS")) {
 +  char *re_src;
 +  int re_src_len;
>>> 
>>> I think it is a bad idea to 
>>> 
>>> (1) not check without CONV_WRITE_OBJECT here.
>> 
>> The idea is to perform the roundtrip check *only* if we 
>> actually write to Git. In all other cases we don't care
>> if the encoding roundtrips.
>> 
>> "git checkout" is such a case where we don't care as 
>> noted by Peff here:
>> https://public-inbox.org/git/20171215095838.ga3...@sigill.intra.peff.net/
>> 
>> Do you agree?
> 
> I am not sure why this is special cased and other codepaths have "if
> WRITE_OBJECT then die, otherwise error" checks, so no, I do not
> agree with your reasoning, at least not yet.

The convert_to_git()/encode_to_git() machinery is used in two different
kinds of code paths:

Some code paths actually write to the Git database (indicated by the 
CONV_WRITE_OBJECT flag). I consider these the "critical/important" code 
paths and I don't want to tolerate any encoding errors in these cases as 
the errors would be "forever" in the Git database. That's why I call 
die() on errors for these cases to abort whatever we are doing. 

Other code paths do not write to the Git database (e.g. during "git 
checkout" we use the code to ensure that we are moving away from the 
exact state that we think we are moving away). In these code paths I am 
less concerned about encoding errors. I also don't want to abort the 
operation (e.g. "git checkout") in these cases. That's why I only inform
the user about the problem with an error message.

The encoding round-trip check can be expensive. That's why I decided 
initially to only execute the check in the "critical/important" 
write-to-Git-database situations (CONV_WRITE_OBJECT flag!). I also 
decided to run it only if the "SHIFT-JIS" encoding is used as this was
the only encoding that I could find which reportedly does not round-trip
with UTF-8 (although I was not able to replicate the round-trip 
problems). 

I want to change the current implementation as follows:

I want to check the round-trip encoding only if the the environment 
variable "GIT_WORKING_TREE_ENCODING_ROUNDTRIP_CHECK" is set. This way
a user can check the round-trip if necessary for *any* encoding. I
don't want to make it a git config because that setting should only 
rarely be used for debugging purposes.

Performing the round-trip check every time is not necessary from my 
point of view because it can be expensive and I was not able to generate
a test case which *does not* round-trip without triggering any other
iconv error.

- Lars

Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute

2018-01-30 Thread Junio C Hamano
Lars Schneider  writes:

>> On 30 Jan 2018, at 21:05, Junio C Hamano  wrote:
>> 
>> tbo...@web.de writes:
>> 
>>> +   if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, 
>>> "SHIFT-JIS")) {
>>> +   char *re_src;
>>> +   int re_src_len;
>> 
>> I think it is a bad idea to 
>> 
>> (1) not check without CONV_WRITE_OBJECT here.
>
> The idea is to perform the roundtrip check *only* if we 
> actually write to Git. In all other cases we don't care
> if the encoding roundtrips.
>
> "git checkout" is such a case where we don't care as 
> noted by Peff here:
> https://public-inbox.org/git/20171215095838.ga3...@sigill.intra.peff.net/
>
> Do you agree?

I am not sure why this is special cased and other codepaths have "if
WRITE_OBJECT then die, otherwise error" checks, so no, I do not
agree with your reasoning, at least not yet.



Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute

2018-01-30 Thread Lars Schneider

> On 30 Jan 2018, at 21:05, Junio C Hamano  wrote:
> 
> tbo...@web.de writes:
> 
>> +if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, 
>> "SHIFT-JIS")) {
>> +char *re_src;
>> +int re_src_len;
> 
> I think it is a bad idea to 
> 
> (1) not check without CONV_WRITE_OBJECT here.

The idea is to perform the roundtrip check *only* if we 
actually write to Git. In all other cases we don't care
if the encoding roundtrips.

"git checkout" is such a case where we don't care as 
noted by Peff here:
https://public-inbox.org/git/20171215095838.ga3...@sigill.intra.peff.net/

Do you agree?


> (2) hardcode SJIS and do this always and to SJIS alone.
> 
> ...
> 
> For (2), perhaps introduce a multi-value configuration variable
> core.checkRoundtripEncoding, whose default value consists of just
> SJIS, but allow users to add or clear it?

Well, in that case I would make it simpler and make
core.checkRoundtripEncoding a boolean that applies to all encodings
if enabled. We could make even simpler than that by removing the entire 
roundtrip check. The thing is, I was not able to come up with a
sequence that would not generate a iconv error *and* not round trip.
Would that be ok for you to remove all that roundtrip checking code?


>> +re_src = reencode_string_len(dst, dst_len,
>> + enc->name, default_encoding,
>> + _src_len);
>> +
>> +if (!re_src || src_len != re_src_len ||
>> +memcmp(src, re_src, src_len)) {
>> +const char* msg = _("encoding '%s' from %s to %s and "
>> +"back is not the same");
>> +if (conv_flags & CONV_WRITE_OBJECT)
>> +die(msg, path, enc->name, default_encoding);
>> +else
>> +error(msg, path, enc->name, default_encoding);
> 
> The "error" side of this inner if() is dead code, I think.

Good catch. I think this code should go away if we keep the roundtrip
code and you agree with my statement above.


Thanks a lot for the review,
Lars

Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute

2018-01-30 Thread Junio C Hamano
tbo...@web.de writes:

> + if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, 
> "SHIFT-JIS")) {
> + char *re_src;
> + int re_src_len;

I think it is a bad idea to 

 (1) not check without CONV_WRITE_OBJECT here.
 (2) hardcode SJIS and do this always and to SJIS alone.

For (1), a fix would be obvious (and that will resurrect the dead
code below).

For (2), perhaps introduce a multi-value configuration variable
core.checkRoundtripEncoding, whose default value consists of just
SJIS, but allow users to add or clear it?

> + re_src = reencode_string_len(dst, dst_len,
> +  enc->name, default_encoding,
> +  _src_len);
> +
> + if (!re_src || src_len != re_src_len ||
> + memcmp(src, re_src, src_len)) {
> + const char* msg = _("encoding '%s' from %s to %s and "
> + "back is not the same");
> + if (conv_flags & CONV_WRITE_OBJECT)
> + die(msg, path, enc->name, default_encoding);
> + else
> + error(msg, path, enc->name, default_encoding);

The "error" side of this inner if() is dead code, I think.


[PATCH v5 5/7] convert: add 'working-tree-encoding' attribute

2018-01-29 Thread tboegi
From: Lars Schneider 

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
 Documentation/gitattributes.txt  |  60 
 convert.c| 190 -
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 196 +++
 5 files changed, 447 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81..a8dbf4be3 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,66 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure (called "the index"). On checkout the content is encoded
+back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `working-tree-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of January 2018).
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+
+*.txt  text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+
+*.txt  working-tree-encoding=UTF-16LE text eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968..0c372069b 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,147 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static struct encoding {
+   const char *name;
+   struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+struct strbuf *buf, struct encoding *enc, int 
conv_flags)
+{
+   char *dst;
+   int dst_len;
+
+   /*
+* No encoding is specified or there is nothing to encode.
+* Tell the caller that the content was not modified.
+*/
+   if (!enc || (src && !src_len))
+   return 0;
+
+   /*
+* Looks like we got called from "would_convert_to_git()".
+* This means Git wants to know if it would encode (= modify!)
+* the content. Let's answer with "yes", since an encoding was
+