Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute

2018-02-28 Thread Eric Sunshine
On Tue, Feb 27, 2018 at 6:16 AM, Lars Schneider
 wrote:
>> On 25 Feb 2018, at 08:15, Eric Sunshine  wrote:
>> On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
>> The above paragraph is giving an example of the scenario described in
>> the paragraph above it. It might, therefore, be clearer to start this
>> paragraph with "For example,...". Also, as an aid to non-Windows
>> developers, it might be helpful to introduce ".proj" files as
>> "Microsoft Visual Studio `*.proj` files...".
>
> Agreed. How about this?
>
>   For example, Microsoft Visual Studio resources files (`*.rc`) or
>   PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
>   If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
>   a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
>   stored as UTF-8 internally. A client without `working-tree-encoding`
>   support will checkout `foo.ps1` as UTF-8 encoded file. This will
>   typically cause trouble for the users of this file.

Since "*.proj" is mentioned a number of times in paragraphs below this
one, perhaps just stick with it instead of introducing "*.rc" and
"*.ps1", which don't necessarily add value. Alternately, if you use
.rc and .ps1, then change the remaining .proj references to follow
suit.

>>> diff --git a/convert.c b/convert.c
>>> +   } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
>>> +   const char *error_msg = _(
>>> +   "BOM is required in '%s' if encoded as %s");
>>> +   const char *advise_msg = _(
>>> +   "The file '%s' is missing a byte order mark (BOM). "
>>> +   "Please use %sBE or %sLE (depending on the byte 
>>> order) "
>>> +   "as working-tree-encoding.");
>>> +   advise(advise_msg, path, enc->name, enc->name);
>>> +   if (conv_flags & CONV_WRITE_OBJECT)
>>> +   die(error_msg, path, enc->name);
>>> +   else
>>> +   error(error_msg, path, enc->name);
>>> +   }
>>
>> For a function which presumably is agnostic about the working-tree
>> encoding (supporting whatever iconv does), it nevertheless has
>> unusually intimate knowledge about UTF BOMs, in general, and the
>> internals of has_prohibited_utf_bom() and
>> is_missing_required_utf_bom(), in particular. It might be cleaner, and
>> would simplify this function, if all this UTF BOM-specific gunk was
>> moved elsewhere and generalized into a "validate conversion" function.
>
> Agreed! How about this?
>
> /*
>  * If we convert to an UTF encoding, then check for common BOM errors.
>  */
> if (!memcmp("UTF-", enc, 4))
> if (has_utf_bom_error(path, enc, src, src_len, die_on_error))
> return 0;

Better. The comment merely repeats the code, which is clear in its
intent, thus doesn't really add value (but that's a minor point).

I'd probably generalize it further to a "validate conversion" function
(which itself would have this UTF-specific knowledge), but that's a
matter of taste and can be done later when/if other types of
validation requirements arise.


Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute

2018-02-27 Thread Lars Schneider

> On 25 Feb 2018, at 08:15, Eric Sunshine  wrote:
> 
> On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
>> Git recognizes files encoded with ASCII or one of its supersets (e.g.
>> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
>> interpreted as binary and consequently built-in Git text processing
>> tools (e.g. 'git diff') as well as most Git web front ends do not
>> visualize the content.
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> 
>> Signed-off-by: Lars Schneider 
>> ---
>> diff --git a/Documentation/gitattributes.txt 
>> b/Documentation/gitattributes.txt
>> @@ -272,6 +272,84 @@ few exceptions.  Even though...
>> +Please note that using the `working-tree-encoding` attribute may have a
>> +number of pitfalls:
>> +
>> +- Alternative Git implementations (e.g. JGit or libgit2) and older Git
>> +  versions (as of March 2018) do not support the `working-tree-encoding`
>> +  attribute. If you decide to use the `working-tree-encoding` attribute
>> +  in your repository, then it is strongly recommended to ensure that all
>> +  clients working with the repository support it.
>> +
>> +  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
>> +  `working-tree-encoding` enabled Git client, then `foo.proj` will be
>> +  stored as UTF-8 internally. A client without `working-tree-encoding`
>> +  support will checkout `foo.proj` as UTF-8 encoded file. This will
>> +  typically cause trouble for the users of this file.
> 
> The above paragraph is giving an example of the scenario described in
> the paragraph above it. It might, therefore, be clearer to start this
> paragraph with "For example,...". Also, as an aid to non-Windows
> developers, it might be helpful to introduce ".proj" files as
> "Microsoft Visual Studio `*.proj` files...".

Agreed. How about this?

  For example, Microsoft Visual Studio resources files (`*.rc`) or
  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16. 
  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with 
  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
  stored as UTF-8 internally. A client without `working-tree-encoding`
  support will checkout `foo.ps1` as UTF-8 encoded file. This will
  typically cause trouble for the users of this file.


>> diff --git a/convert.c b/convert.c
>> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct 
>> text_stat *stats,
>> +static struct encoding {
>> +   const char *name;
>> +   struct encoding *next;
>> +} *encoding, **encoding_tail;
> 
> Why does this struct need to be a linked-list since it contains only a
> name? In fact, why do the struct and linked list exist at all? Back in
> v1 when the struct held more members, it made sense to place them in a
> collection so they could be re-used, but today it just seems an
> unnecessary artifact which buys you nothing.
> 
> Is the plan that some future patch series will add members to the
> struct? If not, then it seems that it should be retired altogether.

Absolutely! Thank you!


>> +static const char *default_encoding = "UTF-8";
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> +struct strbuf *buf, struct encoding *enc, int 
>> conv_flags)
>> +{
>> +   [...]
>> +   if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> +   const char *error_msg = _(
>> +   "BOM is prohibited in '%s' if encoded as %s");
>> +   /*
>> +* This advise is shown for UTF-??BE and UTF-??LE encodings.
> 
> s/advise/advice/

Agreed!


>> +* We truncate the encoding name to 6 chars with %.6s to cut
>> +* off the last two "byte order" characters.
>> +*/
>> +   const char *advise_msg = _(
>> +   "The file '%s' contains a byte order mark (BOM). "
>> +   "Please use %.6s as working-tree-encoding.");
>> +   advise(advise_msg, path, enc->name);
>> +   if (conv_flags & CONV_WRITE_OBJECT)
>> +   die(error_msg, path, enc->name);
>> +   else
>> +   error(error_msg, path, enc->name);
> 
> What is the intended behavior in the non-die() case? As implemented,
> this is still going to attempt the conversion even though it threw an
> error. Should it be returning 0 here instead? Same question for the
> couple additional cases below.

Good question. My intention was to print an error and then let iconv try
the conversion anyways. I agree that this is bogus. Let's return in case
of an error right away.


>> +
>> +   } else if 

Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute

2018-02-24 Thread Eric Sunshine
On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
>
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
>
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> @@ -272,6 +272,84 @@ few exceptions.  Even though...
> +Please note that using the `working-tree-encoding` attribute may have a
> +number of pitfalls:
> +
> +- Alternative Git implementations (e.g. JGit or libgit2) and older Git
> +  versions (as of March 2018) do not support the `working-tree-encoding`
> +  attribute. If you decide to use the `working-tree-encoding` attribute
> +  in your repository, then it is strongly recommended to ensure that all
> +  clients working with the repository support it.
> +
> +  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
> +  `working-tree-encoding` enabled Git client, then `foo.proj` will be
> +  stored as UTF-8 internally. A client without `working-tree-encoding`
> +  support will checkout `foo.proj` as UTF-8 encoded file. This will
> +  typically cause trouble for the users of this file.

The above paragraph is giving an example of the scenario described in
the paragraph above it. It might, therefore, be clearer to start this
paragraph with "For example,...". Also, as an aid to non-Windows
developers, it might be helpful to introduce ".proj" files as
"Microsoft Visual Studio `*.proj` files...".

> diff --git a/convert.c b/convert.c
> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct 
> text_stat *stats,
> +static struct encoding {
> +   const char *name;
> +   struct encoding *next;
> +} *encoding, **encoding_tail;

Why does this struct need to be a linked-list since it contains only a
name? In fact, why do the struct and linked list exist at all? Back in
v1 when the struct held more members, it made sense to place them in a
collection so they could be re-used, but today it just seems an
unnecessary artifact which buys you nothing.

Is the plan that some future patch series will add members to the
struct? If not, then it seems that it should be retired altogether.

> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +struct strbuf *buf, struct encoding *enc, int 
> conv_flags)
> +{
> +   [...]
> +   if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> +   const char *error_msg = _(
> +   "BOM is prohibited in '%s' if encoded as %s");
> +   /*
> +* This advise is shown for UTF-??BE and UTF-??LE encodings.

s/advise/advice/

> +* We truncate the encoding name to 6 chars with %.6s to cut
> +* off the last two "byte order" characters.
> +*/
> +   const char *advise_msg = _(
> +   "The file '%s' contains a byte order mark (BOM). "
> +   "Please use %.6s as working-tree-encoding.");
> +   advise(advise_msg, path, enc->name);
> +   if (conv_flags & CONV_WRITE_OBJECT)
> +   die(error_msg, path, enc->name);
> +   else
> +   error(error_msg, path, enc->name);

What is the intended behavior in the non-die() case? As implemented,
this is still going to attempt the conversion even though it threw an
error. Should it be returning 0 here instead? Same question for the
couple additional cases below.

> +
> +   } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
> +   const char *error_msg = _(
> +   "BOM is required in '%s' if encoded as %s");
> +   const char *advise_msg = _(
> +   "The file '%s' is missing a byte order mark (BOM). "
> +   "Please use %sBE or %sLE (depending on the byte 
> order) "
> +   "as working-tree-encoding.");
> +   advise(advise_msg, path, enc->name, enc->name);
> +   if (conv_flags & CONV_WRITE_OBJECT)
> +   die(error_msg, path, enc->name);
> +   else
> +   error(error_msg, path, enc->name);
> +   }

For a function which presumably is agnostic about the working-tree
encoding (supporting whatever iconv does), it nevertheless has
unusually intimate knowledge about UTF BOMs, in general, and the

[PATCH v8 5/7] convert: add 'working-tree-encoding' attribute

2018-02-24 Thread lars . schneider
From: Lars Schneider 

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt  |  78 ++
 convert.c| 157 ++-
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 226 +++
 5 files changed, 462 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..eddeaee1f7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,84 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
+  `working-tree-encoding` enabled Git client, then `foo.proj` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.proj` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.proj`, then `bar.proj` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.proj' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+
+*.proj text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.proj' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+
+*.proj text working-tree-encoding=UTF-16LE eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+
+file foo.proj
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..d20c341b6d 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when