Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute

2018-01-23 Thread Junio C Hamano
Jeff King  writes:

> But with Coccinelle, it's a lot easier to apply the change tree-wide, and
> to convert topics in flight as they get merged. The maintainer still
> gets conflicts with topics-in-flight that touch converted areas, though.
> So I'd be curious to hear if Junio's opinion has changed at all.

There are two distinct kinds of cost on such a tree-wide change.
Conflicts with in-flight topic cannot be avoided other than truly
avoiding, i.e. refraining from touching the areas in flux, but it is
primarily what the maintainer does, and with help with rerere it can
be reasonably well automated ;-)

But the cost of reviewing could become a lot smaller when our tools
are trustworthy.  As long as we can be reasonably certain that the
tree-wide change patch does one thing it is intended to do and
nothing else (e.g. comes with mechanical reproduction recipe that
allows the patch to be independently audited), I do not have much
problem with such a clean-up.

The "avoid tree-wide change" rule still applies for things that
allows a lot of subjective judgment and discretion.  I do not know
of a good way to reduce reviewer costs on those kind of changes.

Thanks.


Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute

2018-01-23 Thread Jeff King
On Tue, Jan 23, 2018 at 11:25:58AM +0100, Simon Ruderich wrote:

> On Mon, Jan 22, 2018 at 07:54:01PM -0500, Jeff King wrote:
> > But anyway, that was a bit of a tangent. Certainly the smaller change is
> > just standardizing on sizeof(*foo), which I think most people agree on
> > at this point. It might be worth putting in CodingGuidelines.
> 
> Personally I prefer sizeof(*foo) which is a well non-idiom, used
> in many projects and IMHO easy to read and understand.

Me too.

> I've played a little with coccinelle and the following spatch
> seems to catch many occurrences of sizeof(struct ..) (the first
> hunk seems to expand multiple times causing conflicts in the
> generated patch). Cases like a->b = xcalloc() are not matched, I
> don't know enough coccinelle for that. If there's interest I
> could prepare patches, but it will create quite some code churn.

Yeah, I'm not sure what our current policy is there. Traditionally our
strategy was not to churn code, but to update old idioms as they were
touched. Especially if the change was not urgent, but mostly stylistic
(which this one is).

But with Coccinelle, it's a lot easier to apply the change tree-wide, and
to convert topics in flight as they get merged. The maintainer still
gets conflicts with topics-in-flight that touch converted areas, though.
So I'd be curious to hear if Junio's opinion has changed at all.

-Peff


Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute

2018-01-23 Thread Simon Ruderich
On Mon, Jan 22, 2018 at 07:54:01PM -0500, Jeff King wrote:
> But anyway, that was a bit of a tangent. Certainly the smaller change is
> just standardizing on sizeof(*foo), which I think most people agree on
> at this point. It might be worth putting in CodingGuidelines.

Personally I prefer sizeof(*foo) which is a well non-idiom, used
in many projects and IMHO easy to read and understand.

I've played a little with coccinelle and the following spatch
seems to catch many occurrences of sizeof(struct ..) (the first
hunk seems to expand multiple times causing conflicts in the
generated patch). Cases like a->b = xcalloc() are not matched, I
don't know enough coccinelle for that. If there's interest I
could prepare patches, but it will create quite some code churn.

Regards
Simon

@@
type T;
identifier x;
@@
- T *x = xmalloc(sizeof(T));
+ T *x = xmalloc(sizeof(*x));

@@
type T;
T *x;
@@
- x = xmalloc(sizeof(T));
+ x = xmalloc(sizeof(*x));


@@
type T;
identifier x;
expression n;
@@
- T *x = xcalloc(n, sizeof(T));
+ T *x = xcalloc(n, sizeof(*x));

@@
type T;
T *x;
expression n;
@@
- x = xcalloc(n, sizeof(T));
+ x = xcalloc(n, sizeof(*x));


@@
type T;
T x;
@@
- memset(&x, 0, sizeof(T));
+ memset(&x, 0, sizeof(x));

@@
type T;
T *x;
@@
- memset(x, 0, sizeof(T));
+ memset(x, 0, sizeof(*x));

-- 
+ privacy is necessary
+ using gnupg http://gnupg.org
+ public key id: 0x92FEFDB7E44C32F9


Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute

2018-01-23 Thread Simon Ruderich
On Mon, Jan 22, 2018 at 01:35:25PM +0100, Lars Schneider wrote:
>>> +   enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
>>
>> "aways" -> "always" and I think the comment should say why
>> uppercase is important.
>
> Would that be better?
>
>   /* Aways use upper case names to simplify subsequent string comparison. 
> */
>   enc->name = xstrdup_toupper(value);
>
> AFAIK uppercase and lowercase names are both valid. I just wanted to
> ensure that we use one consistent casing. That reads better in error messages
> and I don't need to check for the letter case in has_prohibited_utf_bom()
> and friends in utf8.c

Sounds good (minus the "Aways" typo Eric already noticed).

>> Micro-nit: For consistency with the previous test, remove the
>> empty line and comment (or just keep the files generated from the
>> "setup test repo" phase and don't explicitly delete them)?
>
> I would rather add a new line and a comment to the previous test
> to be consistent.
>
> I know we could leave the files but these lingering files could
> always surprise writers of future tests (at least they surprised
> me in other tests).

Sure, that sounds good. Just noticed the inconsistency and wanted
to mention it.

Regards
Simon
-- 
+ privacy is necessary
+ using gnupg http://gnupg.org
+ public key id: 0x92FEFDB7E44C32F9


Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute

2018-01-22 Thread Jeff King
On Mon, Jan 22, 2018 at 01:35:25PM +0100, Lars Schneider wrote:

> >> +  enc = xcalloc(1, sizeof(struct convert_driver));
> > 
> > I think this should be "sizeof(struct encoding)" but I prefer
> > "sizeof(*enc)" which prevents these kind of mistakes.
> 
> Great catch! Thank you!
> 
> Other code paths are at risk of this problem too. Consider this:
> 
> $ git grep 'sizeof(\*' | wc -l
>  303
> $ git grep 'sizeof(struct ' | wc -l
>  208
> 
> E.g. even in the same file (likely where I got the code from):
> https://github.com/git/git/blob/59c276cf4da0705064c32c9dba54baefa282ea55/convert.c#L780
> 
> @Junio: 
> Would you welcome a patch that replaces "struct foo" with "*foo"
> if applicable?

This is part of the reason we've been moving to helpers like
ALLOC_ARRAY(), which make it harder to get this wrong.

We don't have an ALLOC_OBJECT(), which is what you would want here. I'm
not sure if that is helpful or crossing the line of "you're obscuring it
to the point that people familiar with C have trouble reading the code".
The ALLOC_ARRAY() macros have been sort of an experiment there (I tend
to like them, but I also work with Git's code often enough that I am not
likely to be confused by our bespoke macros).

But anyway, that was a bit of a tangent. Certainly the smaller change is
just standardizing on sizeof(*foo), which I think most people agree on
at this point. It might be worth putting in CodingGuidelines.

-Peff


Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute

2018-01-22 Thread Lars Schneider

> On 21 Jan 2018, at 15:22, Simon Ruderich  wrote:
> 
> On Sat, Jan 20, 2018 at 04:24:17PM +0100, lars.schnei...@autodesk.com wrote:
>> +static struct encoding *git_path_check_encoding(struct attr_check_item 
>> *check)
>> +{
>> +const char *value = check->value;
>> +struct encoding *enc;
>> +
>> +if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
>> +!strlen(value))
>> +return NULL;
>> +
>> +for (enc = encoding; enc; enc = enc->next)
>> +if (!strcasecmp(value, enc->name))
>> +return enc;
>> +
>> +/* Don't encode to the default encoding */
>> +if (!strcasecmp(value, default_encoding))
>> +return NULL;
>> +
>> +enc = xcalloc(1, sizeof(struct convert_driver));
> 
> I think this should be "sizeof(struct encoding)" but I prefer
> "sizeof(*enc)" which prevents these kind of mistakes.

Great catch! Thank you!

Other code paths are at risk of this problem too. Consider this:

$ git grep 'sizeof(\*' | wc -l
 303
$ git grep 'sizeof(struct ' | wc -l
 208

E.g. even in the same file (likely where I got the code from):
https://github.com/git/git/blob/59c276cf4da0705064c32c9dba54baefa282ea55/convert.c#L780

@Junio: 
Would you welcome a patch that replaces "struct foo" with "*foo"
if applicable?


>> +enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
> 
> "aways" -> "always" and I think the comment should say why
> uppercase is important.

Would that be better?

/* Aways use upper case names to simplify subsequent string comparison. 
*/
enc->name = xstrdup_toupper(value);

AFAIK uppercase and lowercase names are both valid. I just wanted to
ensure that we use one consistent casing. That reads better in error messages
and I don't need to check for the letter case in has_prohibited_utf_bom()
and friends in utf8.c


>> +test_expect_success 'ensure UTF-8 is stored in Git' '
>> +git cat-file -p :test.utf16 >test.utf16.git &&
>> +test_cmp_bin test.utf8.raw test.utf16.git &&
>> +rm test.utf8.raw test.utf16.git
>> +'
>> +
>> +test_expect_success 're-encode to UTF-16 on checkout' '
>> +rm test.utf16 &&
>> +git checkout test.utf16 &&
>> +test_cmp_bin test.utf16.raw test.utf16 &&
>> +
>> +# cleanup
>> +rm test.utf16.raw
> 
> Micro-nit: For consistency with the previous test, remove the
> empty line and comment (or just keep the files generated from the
> "setup test repo" phase and don't explicitly delete them)?

I would rather add a new line and a comment to the previous test 
to be consistent.

I know we could leave the files but these lingering files could
always surprise writers of future tests (at least they surprised
me in other tests).


Thank you very much for the review,
Lars

Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute

2018-01-21 Thread Simon Ruderich
On Sat, Jan 20, 2018 at 04:24:17PM +0100, lars.schnei...@autodesk.com wrote:
> +static struct encoding *git_path_check_encoding(struct attr_check_item 
> *check)
> +{
> + const char *value = check->value;
> + struct encoding *enc;
> +
> + if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> + !strlen(value))
> + return NULL;
> +
> + for (enc = encoding; enc; enc = enc->next)
> + if (!strcasecmp(value, enc->name))
> + return enc;
> +
> + /* Don't encode to the default encoding */
> + if (!strcasecmp(value, default_encoding))
> + return NULL;
> +
> + enc = xcalloc(1, sizeof(struct convert_driver));

I think this should be "sizeof(struct encoding)" but I prefer
"sizeof(*enc)" which prevents these kind of mistakes.

> + enc->name = xstrdup_toupper(value);  /* aways use upper case names! */

"aways" -> "always" and I think the comment should say why
uppercase is important.

> +test_expect_success 'ensure UTF-8 is stored in Git' '
> + git cat-file -p :test.utf16 >test.utf16.git &&
> + test_cmp_bin test.utf8.raw test.utf16.git &&
> + rm test.utf8.raw test.utf16.git
> +'
> +
> +test_expect_success 're-encode to UTF-16 on checkout' '
> + rm test.utf16 &&
> + git checkout test.utf16 &&
> + test_cmp_bin test.utf16.raw test.utf16 &&
> +
> + # cleanup
> + rm test.utf16.raw

Micro-nit: For consistency with the previous test, remove the
empty line and comment (or just keep the files generated from the
"setup test repo" phase and don't explicitly delete them)?

Regards
Simon
-- 
+ privacy is necessary
+ using gnupg http://gnupg.org
+ public key id: 0x92FEFDB7E44C32F9


[PATCH v4 5/6] convert: add 'working-tree-encoding' attribute

2018-01-20 Thread lars . schneider
From: Lars Schneider 

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt  |  60 
 convert.c| 190 -
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 196 +++
 5 files changed, 447 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..a8dbf4be30 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,66 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure (called "the index"). On checkout the content is encoded
+back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `working-tree-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of January 2018).
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+
+*.txt  text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+
+*.txt  working-tree-encoding=UTF-16LE text eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..0c372069b1 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,147 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static struct encoding {
+   const char *name;
+   struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+struct strbuf *buf, struct encoding *enc, int 
conv_flags)
+{
+   char *dst;
+   int dst_len;
+
+   /*
+* No encoding is specified or there is nothing to encode.
+* Tell the caller that the content was not modified.
+*/
+   if (!enc || (src && !src_len))
+   return 0;
+
+   /*
+* Looks like we got called from "would_convert_to_git()".
+* This means Git wants to know if it would encode (= modify!)
+* the content. Let's answer with "yes", since an encoding was
+* specified.
+*/
+   if (!buf && !src)
+   return 1;
+
+   if (has