Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-04-15 Thread Lars Schneider

> On 05 Apr 2018, at 18:41, Torsten Bögershausen  wrote:
> 
> On 01.04.18 15:24, Lars Schneider wrote:
>>> TRUE or false are values, but just wrong ones.
>>> If this test is removed, the user will see "failed to encode "TRUE" to 
>>> "UTF-8",
>>> which should give enough information to fix it.
>> 
>> I see your point. However, I would like to stop the processing right
>> there for these invalid values. How about 
>> 
>>  error(_("true/false are no valid working-tree-encodings"));
>> 
>> I think that is the most straight forward/helpful error message
>> for the enduser (I consider the term "boolean" but dismissed it
>> as potentially confusing to folks not familiar with the term).
>> 
>> OK with you?
> 
> Yes.

Great!


> Another thing that came up recently, independent of your series:
> 
> What should happen if a user specifies "UTF-8" and the file
> has an UTF-8 encoded BOM ?
> I ask because I stumbled over such a file coming from a Windows
> which the java compiler under Linux didn't accept.
> 
> And because some tools love to put an UTF-8 encoded BOM
> into text files.
> 
> The clearest thing would be to extend the BOM check in 5/9
> to cover UTF-32, UTF-16 and UTF-8.
> 
> Are there any plans to do so?

If `working-tree-encoding` is not defined or defined as UTF-8,
then we would return from encode_to_git() early. That means we
would never run validate_encoding() which would check the BOM.

However, adding the UTF-8 BOM would still make sense. This way
Git could scream if a user set `working-tree-encoding` to UTF-16
but the file is really UTF-8 encoded.


> And thanks for the work.

Thanks :-)


- Lars

Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-04-05 Thread Torsten Bögershausen
On 01.04.18 15:24, Lars Schneider wrote:
>> TRUE or false are values, but just wrong ones.
>> If this test is removed, the user will see "failed to encode "TRUE" to 
>> "UTF-8",
>> which should give enough information to fix it.
> 
> I see your point. However, I would like to stop the processing right
> there for these invalid values. How about 
> 
>   error(_("true/false are no valid working-tree-encodings"));
> 
> I think that is the most straight forward/helpful error message
> for the enduser (I consider the term "boolean" but dismissed it
> as potentially confusing to folks not familiar with the term).
> 
> OK with you?

Yes.

Another thing that came up recently, independent of your series:

What should happen if a user specifies "UTF-8" and the file
has an UTF-8 encoded BOM ?
I ask because I stumbled over such a file coming from a Windows
which the java compiler under Linux didn't accept.

And because some tools love to put an UTF-8 encoded BOM
into text files.

The clearest thing would be to extend the BOM check in 5/9
to cover UTF-32, UTF-16 and UTF-8.

Are there any plans to do so?

And thanks for the work.


Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-04-01 Thread Lars Schneider

> On 18 Mar 2018, at 08:24, Torsten Bögershausen  wrote:
> 
> Some comments inline
> 
> On Fri, Mar 09, 2018 at 06:35:32PM +0100, lars.schnei...@autodesk.com wrote:
>> From: Lars Schneider 
>> 
>> Git recognizes files encoded with ASCII or one of its supersets (e.g.
>> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
>> interpreted as binary and consequently built-in Git text processing
>> tools (e.g. 'git diff') as well as most Git web front ends do not
>> visualize the content.
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
> 
> Minor comment:
> "Git converts the content"
> Everywhere else (?) "encodes or reencodes" is used.
> "Git reencodes the content" may be more consistent.

OK, will change.


>> 
>> +static const char *default_encoding = "UTF-8";
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> + struct strbuf *buf, const char *enc, int conv_flags)
>> +{
>> +char *dst;
>> +int dst_len;
>> +int die_on_error = conv_flags & CONV_WRITE_OBJECT;
>> +
>> +/*
>> + * No encoding is specified or there is nothing to encode.
>> + * Tell the caller that the content was not modified.
>> + */
>> +if (!enc || (src && !src_len))
>> +return 0;
> 
> (This may have been discussed before.
> As we checked (enc != NULL) I think we can add here:)
>   if (is_encoding_utf8(enc))
>   return 0;

This should be covered in git_path_check_encoding(),
introduced in v12:

/* Don't encode to the default encoding */
if (same_encoding(value, default_encoding))
return NULL;

In that function the encoding of a certain file is read from
the .gitattributes. If the encoding matches the compile-time
defined default encoding (= UTF-8), then the encoding is set
to NULL.


>> 
>> +
>> +static int encode_to_worktree(const char *path, const char *src, size_t 
>> src_len,
>> +  struct strbuf *buf, const char *enc)
>> +{
>> +char *dst;
>> +int dst_len;
>> +
>> +/*
>> + * No encoding is specified or there is nothing to encode.
>> + * Tell the caller that the content was not modified.
>> + */
>> +if (!enc || (src && !src_len))
>> +return 0;
> 
> Same as above:
>   if (is_encoding_utf8(enc))
>   return 0;
> 
>> +
>> +dst = reencode_string_len(src, src_len, enc, default_encoding,
>> +  _len);
>> +if (!dst) {
>> +error("failed to encode '%s' from %s to %s",
>> +path, default_encoding, enc);
>> +return 0;
>> +}
>> +
>> +strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> +return 1;
>> +}
>> +
>> static int crlf_to_git(const struct index_state *istate,
>> const char *path, const char *src, size_t len,
>> struct strbuf *buf,
>> @@ -978,6 +1051,25 @@ static int ident_to_worktree(const char *path, const 
>> char *src, size_t len,
>>  return 1;
>> }
>> 
>> +static const char *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +const char *value = check->value;
>> +
>> +if (ATTR_UNSET(value) || !strlen(value))
>> +return NULL;
>> +
> 
> 
>> +if (ATTR_TRUE(value) || ATTR_FALSE(value)) {
>> +error(_("working-tree-encoding attribute requires a value"));
>> +return NULL;
>> +}
> 
> TRUE or false are values, but just wrong ones.
> If this test is removed, the user will see "failed to encode "TRUE" to 
> "UTF-8",
> which should give enough information to fix it.

I see your point. However, I would like to stop the processing right
there for these invalid values. How about 

  error(_("true/false are no valid working-tree-encodings"));

I think that is the most straight forward/helpful error message
for the enduser (I consider the term "boolean" but dismissed it
as potentially confusing to folks not familiar with the term).

OK with you?

> 
>> +
>> +/* Don't encode to the default encoding */
>> +if (!strcasecmp(value, default_encoding))
>> +return NULL;
> Same as above ?:
>   if (is_encoding_utf8(value))
>   return 0;

Yes, that was fixed in v12 as mentioned above :-)

- Lars

Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-03-18 Thread Torsten Bögershausen
Some comments inline

On Fri, Mar 09, 2018 at 06:35:32PM +0100, lars.schnei...@autodesk.com wrote:
> From: Lars Schneider 
> 
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the

Minor comment:
"Git converts the content"
Everywhere else (?) "encodes or reencodes" is used.
"Git reencodes the content" may be more consistent.


[No comments on the .gitattributes]

>  
> diff --git a/convert.c b/convert.c
> index b976eb968c..aa59ecfe49 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -7,6 +7,7 @@
>  #include "sigchain.h"
>  #include "pkt-line.h"
>  #include "sub-process.h"
> +#include "utf8.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -265,6 +266,78 @@ static int will_convert_lf_to_crlf(size_t len, struct 
> text_stat *stats,
>  
>  }
>  
> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +  struct strbuf *buf, const char *enc, int conv_flags)
> +{
> + char *dst;
> + int dst_len;
> + int die_on_error = conv_flags & CONV_WRITE_OBJECT;
> +
> + /*
> +  * No encoding is specified or there is nothing to encode.
> +  * Tell the caller that the content was not modified.
> +  */
> + if (!enc || (src && !src_len))
> + return 0;

(This may have been discussed before.
 As we checked (enc != NULL) I think we can add here:)
if (is_encoding_utf8(enc))
return 0;

> +
> + /*
> +  * Looks like we got called from "would_convert_to_git()".
> +  * This means Git wants to know if it would encode (= modify!)
> +  * the content. Let's answer with "yes", since an encoding was
> +  * specified.
> +  */
> + if (!buf && !src)
> + return 1;
> +
> + dst = reencode_string_len(src, src_len, default_encoding, enc,
> +   _len);
> + if (!dst) {
> + /*
> +  * We could add the blob "as-is" to Git. However, on checkout
> +  * we would try to reencode to the original encoding. This
> +  * would fail and we would leave the user with a messed-up
> +  * working tree. Let's try to avoid this by screaming loud.
> +  */
> + const char* msg = _("failed to encode '%s' from %s to %s");
> + if (die_on_error)
> + die(msg, path, enc, default_encoding);
> + else {
> + error(msg, path, enc, default_encoding);
> + return 0;
> + }
> + }
> +
> + strbuf_attach(buf, dst, dst_len, dst_len + 1);
> + return 1;
> +}
> +
> +static int encode_to_worktree(const char *path, const char *src, size_t 
> src_len,
> +   struct strbuf *buf, const char *enc)
> +{
> + char *dst;
> + int dst_len;
> +
> + /*
> +  * No encoding is specified or there is nothing to encode.
> +  * Tell the caller that the content was not modified.
> +  */
> + if (!enc || (src && !src_len))
> + return 0;

 Same as above:
if (is_encoding_utf8(enc))
return 0;

> +
> + dst = reencode_string_len(src, src_len, enc, default_encoding,
> +   _len);
> + if (!dst) {
> + error("failed to encode '%s' from %s to %s",
> + path, default_encoding, enc);
> + return 0;
> + }
> +
> + strbuf_attach(buf, dst, dst_len, dst_len + 1);
> + return 1;
> +}
> +
>  static int crlf_to_git(const struct index_state *istate,
>  const char *path, const char *src, size_t len,
>  struct strbuf *buf,
> @@ -978,6 +1051,25 @@ static int ident_to_worktree(const char *path, const 
> char *src, size_t len,
>   return 1;
>  }
>  
> +static const char *git_path_check_encoding(struct attr_check_item *check)
> +{
> + const char *value = check->value;
> +
> + if (ATTR_UNSET(value) || !strlen(value))
> + return NULL;
> +


> + if (ATTR_TRUE(value) || ATTR_FALSE(value)) {
> + error(_("working-tree-encoding attribute requires a value"));
> + return NULL;
> + }

TRUE or false are values, but just wrong ones.
If this test is removed, the user will see "failed to encode "TRUE" to "UTF-8",
which should give enough information to fix it.

> +
> + /* Don't encode to the default encoding */
> + if (!strcasecmp(value, default_encoding))
> + return NULL;
 Same as above ?:

Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-03-15 Thread Lars Schneider

> On 09 Mar 2018, at 20:10, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> +static const char *default_encoding = "UTF-8";
>> +
>> ...
>> +static const char *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +const char *value = check->value;
>> +
>> +if (ATTR_UNSET(value) || !strlen(value))
>> +return NULL;
>> +
>> +if (ATTR_TRUE(value) || ATTR_FALSE(value)) {
>> +error(_("working-tree-encoding attribute requires a value"));
>> +return NULL;
>> +}
> 
> Hmph, so we decide to be loud but otherwise ignore an undefined
> configuration?  Shouldn't we rather die instead to avoid touching
> the user data in unexpected ways?

OK.


>> +
>> +/* Don't encode to the default encoding */
>> +if (!strcasecmp(value, default_encoding))
>> +return NULL;
> 
> Is this an optimization to avoid "recode one encoding to the same
> encoding" no-op overhead?

Correct.

>  We already have the optimization in the
> same spirit in may existing codepaths that has nothing to do with
> w-t-e, and I think we should share the code.  Two pieces of thought
> comes to mind.
> 
> One is a lot smaller in scale: Is same_encoding() sufficient for
> this callsite instead of strcasecmp()?

Yes!


> The other one is a lot bigger: Looking at all the existing callers
> of same_encoding() that call reencode_string() when it returns false,
> would it make sense to drop same_encoding() and move the optimization
> to reencode_string() instead?
> 
> I suspect that the answer to the smaller one is "yes, and even if
> not, it should be easy to enhance/extend same_encoding() to make it
> do what we want it to, and such a change will benefit even existing
> callers."  The answer to the larger one is likely "the optimization
> is not about skipping only reencode_string() call but other things
> are subtly different among callers of same_encoding(), so such a
> refactoring would not be all that useful."

I agree. reencode_string() would need to signal 3 cases:
1. reencode performed
2. reencode not necessary
3. reencode failed

We could model "reencode not necessary" as "char *in == char *return".
However, I think this should be tackled in a separate series.

Thanks
Lars


Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-03-09 Thread Junio C Hamano
lars.schnei...@autodesk.com writes:

> +static const char *default_encoding = "UTF-8";
> +
> ...
> +static const char *git_path_check_encoding(struct attr_check_item *check)
> +{
> + const char *value = check->value;
> +
> + if (ATTR_UNSET(value) || !strlen(value))
> + return NULL;
> +
> + if (ATTR_TRUE(value) || ATTR_FALSE(value)) {
> + error(_("working-tree-encoding attribute requires a value"));
> + return NULL;
> + }

Hmph, so we decide to be loud but otherwise ignore an undefined
configuration?  Shouldn't we rather die instead to avoid touching
the user data in unexpected ways?

> +
> + /* Don't encode to the default encoding */
> + if (!strcasecmp(value, default_encoding))
> + return NULL;

Is this an optimization to avoid "recode one encoding to the same
encoding" no-op overhead?  We already have the optimization in the
same spirit in may existing codepaths that has nothing to do with
w-t-e, and I think we should share the code.  Two pieces of thought
comes to mind.

One is a lot smaller in scale: Is same_encoding() sufficient for
this callsite instead of strcasecmp()?

The other one is a lot bigger: Looking at all the existing callers
of same_encoding() that call reencode_string() when it returns false,
would it make sense to drop same_encoding() and move the optimization
to reencode_string() instead?

I suspect that the answer to the smaller one is "yes, and even if
not, it should be easy to enhance/extend same_encoding() to make it
do what we want it to, and such a change will benefit even existing
callers."  The answer to the larger one is likely "the optimization
is not about skipping only reencode_string() call but other things
are subtly different among callers of same_encoding(), so such a
refactoring would not be all that useful."

The above still holds for the code after 10/10 touches this part.


[PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-03-09 Thread lars . schneider
From: Lars Schneider 

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt  |  80 ++
 convert.c| 114 ++-
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 144 +++
 5 files changed, 339 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..31a4f92840 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,86 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.ps1' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+
+*.ps1  text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.ps1' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+
+*.ps1  text working-tree-encoding=UTF-16LE eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+
+file foo.ps1
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..aa59ecfe49 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@