[PATCH v11 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-03-09 Thread lars . schneider
From: Lars Schneider 

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] 
https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider 
---
 Documentation/config.txt |  6 +++
 Documentation/gitattributes.txt  |  8 
 config.c |  5 +++
 convert.c| 79 +++-
 convert.h|  1 +
 environment.c|  1 +
 t/t0028-working-tree-encoding.sh | 39 
 7 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..7dcac9b540 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
This variable can be set to 'input',
in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+   A comma and/or whitespace separated list of encodings that Git
+   performs UTF-8 round trip checks on if they are used in an
+   `working-tree-encoding` attribute (see linkgit:gitattributes[5]).
+   The default value is `SHIFT-JIS`.
+
 core.symlinks::
If false, symbolic links are checked out as small plain files that
contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 31a4f92840..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -312,6 +312,14 @@ number of pitfalls:
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
   That operation will fail and cause an error.
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.checkroundtripencoding")) {
+   check_roundtrip_encoding = xstrdup(value);
+   return 0;
+   }
+
if (!strcmp(var, "core.notesref")) {
notes_ref_name = xstrdup(value);
return 0;
diff --git a/convert.c b/convert.c
index d739078016..c2d24882c1 100644
--- a/convert.c
+++ b/convert.c
@@ -347,6 +347,42 @@ static void trace_encoding(const char *context, const char 
*path,
strbuf_release(&trace);
 }
 
+static int check_roundtrip(const char *enc_name)
+{
+   /*
+* check_roundtrip_encoding contains a string of comma and/or
+* space separated encodings (eg. "UTF-16, ASCII, CP1125").
+* Search for the given encoding in that string.
+*/
+   const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+   const char *next;
+   int len;
+   if (!found)
+   return 0;
+   next = found + strlen(enc_name);
+   len = strlen(check_roundtrip_encoding);
+   return (found && (
+   /*
+* check that the found encoding is at the
+* beginning of check_roundtrip_encoding or
+* that it is prefixed with a space or comma
+*/
+   found == check_roundtrip_encoding || (
+   (isspace(found[-1]) || found[-1] == ',')
+   )
+   ) && (
+   /*
+* check that the found encoding is at the
+* end of check_roundtrip_encoding or
+* that it is suffixed with a space or comma
+*/
+   next == check_roundtrip_encoding + len || (
+   next < check_roundtrip_encoding + len &&
+   (isspace(next[0]) || next[0] == ',')
+   )
+

Re: [PATCH v11 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-03-09 Thread Eric Sunshine
On Fri, Mar 9, 2018 at 12:35 PM,   wrote:
> [...]
> Add 'core.checkRoundtripEncoding', which contains a comma separated
> list of encodings, to define for what encodings Git should check the
> conversion round trip if they are used in the 'working-tree-encoding'
> attribute.
> [...]
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/convert.c b/convert.c
> @@ -1150,7 +1227,7 @@ static const char *git_path_check_encoding(struct 
> attr_check_item *check)
> /* Don't encode to the default encoding */
> -   if (!strcasecmp(value, default_encoding))
> +   if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding))
> return NULL;

This change belongs in 6/10, not 10/10, methinks.


Re: [PATCH v11 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-03-09 Thread Junio C Hamano
Eric Sunshine  writes:

> On Fri, Mar 9, 2018 at 12:35 PM,   wrote:
>> [...]
>> Add 'core.checkRoundtripEncoding', which contains a comma separated
>> list of encodings, to define for what encodings Git should check the
>> conversion round trip if they are used in the 'working-tree-encoding'
>> attribute.
>> [...]
>> Signed-off-by: Lars Schneider 
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -1150,7 +1227,7 @@ static const char *git_path_check_encoding(struct 
>> attr_check_item *check)
>> /* Don't encode to the default encoding */
>> -   if (!strcasecmp(value, default_encoding))
>> +   if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding))
>> return NULL;
>
> This change belongs in 6/10, not 10/10, methinks.

It is actually worse than that, no?  When default_encoding is
(somehow) configured not to be UTF-8, e.g. "Shift_JIS", we used to
avoid converting from Shift_JIS to Shift_JIS, but the optimization
no longer happens with this code.

In any case, I think same_encoding() is probably a good thing to use
here at step 6/10, so the point is moot, I guess.


Re: [PATCH v11 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-03-09 Thread Eric Sunshine
On Fri, Mar 9, 2018 at 3:22 PM, Junio C Hamano  wrote:
> Eric Sunshine  writes:
>> On Fri, Mar 9, 2018 at 12:35 PM,   wrote:
>>> /* Don't encode to the default encoding */
>>> -   if (!strcasecmp(value, default_encoding))
>>> +   if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding))
>>> return NULL;
>>
>> This change belongs in 6/10, not 10/10, methinks.
>
> It is actually worse than that, no?  When default_encoding is
> (somehow) configured not to be UTF-8, e.g. "Shift_JIS", we used to
> avoid converting from Shift_JIS to Shift_JIS, but the optimization
> no longer happens with this code.
>
> In any case, I think same_encoding() is probably a good thing to use
> here at step 6/10, so the point is moot, I guess.

Agreed, same_encoding() would be superior.