Re: [PATCH v3 0/7] convert: add support for different encodings
On Mon, Jan 08, 2018 at 03:38:48PM +0100, Lars Schneider wrote: [] > > Some comments: > > > > I would like to have the CRLF conversion a little bit more strict - > > many users tend to set core.autocrlf=true or write "* text=auto" > > in the .gitattributes. > > Reading all the effort about BOM markers and UTF-16LE, I think there > > should ne some effort to make the line endings round trip. > > Therefore I changed convert.c to demand that the "text" attribute > > is set to enable CRLF conversions. > > (If I had submitted the patch, I would have demanded > > "text eol=lf" or "text eol=crlf", but the test case t0028 indicates > > that there is a demand to produce line endings as configured in core.eol) > > But wouldn't that be inconvenient for the users? E.g. if I add a UTF-16 > file on Windows with CRLF then it would be nice if Git would automatically > convert the line endings to LF on Linux, no? > > IOW: Why should we handle text files that have a defined checkout-encoding > differently compared to UTF-8 encoded text files? Wouldn't that be unexpected > to the user? > > Thanks, > Lars The problem is, if user A has core.autocrlf=true and user B core.autocrlf=false. (The line endings don't show up as expected at user B) Having said that in all shortness, you convinced me: If text=auto, we care about line endings. If text, we care about line endings. If the .gitattributes don't say anything about text, we don't convert eol. (In other words: we don't look at core.autocrlf, when checkout-encoding is defined) A new branch is push to github/tboegi
Re: [PATCH v3 0/7] convert: add support for different encodings
> On 07 Jan 2018, at 10:38, Torsten Bögershausenwrote: > > On Sat, Jan 06, 2018 at 01:48:01AM +0100, lars.schnei...@autodesk.com wrote: >> From: Lars Schneider >> >> Hi, >> >> Patches 1-5 and 6 are helper functions and preparation. >> Patch 6 is the actual change. >> >> I am still torn between "checkout-encoding" and "working-tree-encoding" >> as attribute name. I am happy to hear arguments for/against one or the >> other. > > checkout-encoding is probably misleading, as it is even the checkin-encoding. Yeah, I start to think the same. > What is wrong with working-tree-encoding ? > I think the 2 "-". > > What was wrong with workingtree-encoding ? Yeah, the two dashes are a minor annoyance. However, consider this: $ git grep 'working tree' -- '*.txt' | wc -l 570 $ git grep 'working-tree' -- '*.txt' | wc -l 6 $ git grep 'workingtree' -- '*.txt' | wc -l 0 $ git grep 'working tree' -- po | wc -l 704 $ git grep 'working-tree' -- po | wc -l 0 $ git grep 'workingtree' -- po | wc -l 0 I think "working tree" is a pretty established term that endusers might be able to understand. Therefore, I would like to go with "working-tree-encoding" as it was written that way at least 6 times in the Git tree before. Would that work for you? > Or > workdir-encoding ? Although I like the shortness, the term "workdir" might already be occupied [1]. Could that cause confusion? [1] 4f01748d51 (contrib/workdir: add a simple script to create a working directory, 2007-03-27) >> >> * Removed unnecessary NUL assignment in xstrdup_tolower() (Torsten) >> >> * Set "git config core.eol lf" to made the test run on Windows (Dscho) >> >> * Made BOM arrays static (Ramsay) > > > Some comments: > > I would like to have the CRLF conversion a little bit more strict - > many users tend to set core.autocrlf=true or write "* text=auto" > in the .gitattributes. > Reading all the effort about BOM markers and UTF-16LE, I think there > should ne some effort to make the line endings round trip. > Therefore I changed convert.c to demand that the "text" attribute > is set to enable CRLF conversions. > (If I had submitted the patch, I would have demanded > "text eol=lf" or "text eol=crlf", but the test case t0028 indicates > that there is a demand to produce line endings as configured in core.eol) But wouldn't that be inconvenient for the users? E.g. if I add a UTF-16 file on Windows with CRLF then it would be nice if Git would automatically convert the line endings to LF on Linux, no? IOW: Why should we handle text files that have a defined checkout-encoding differently compared to UTF-8 encoded text files? Wouldn't that be unexpected to the user? Thanks, Lars > > Anyway, I rebased it onto git.git/master, changed the docu, and pushed it to > https://github.com/tboegi/git/tree/180107-0935-For-lars-schneider-encode-V3B > > Here is a inter-diff against your version: > > diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt > index 1bc03e69c..b8d9f91c8 100644 > --- a/Documentation/gitattributes.txt > +++ b/Documentation/gitattributes.txt > @@ -281,7 +281,7 @@ interpreted as binary and consequently built-in Git text > processing > tools (e.g. 'git diff') as well as most Git web front ends do not > visualize the content. > > -In these cases you can teach Git the encoding of a file in the working > +In these cases you can tell Git the encoding of a file in the working Oops. I meant to change that already. Thanks! > directory with the `checkout-encoding` attribute. If a file with this > attributes is added to Git, then Git reencodes the content from the > specified encoding to UTF-8 and stores the result in its internal data > @@ -308,17 +308,20 @@ Use the `checkout-encoding` attribute only if you > cannot store a file in > UTF-8 encoding and if you want Git to be able to process the content as > text. > > +Note that when `checkout-encoding` is defined, by default the line > +endings are not converted. `text=auto` and core.autocrlf are ignored. > +Set the `text` attribute to enable CRLF conversions. > + > Use the following attributes if your '*.txt' files are UTF-16 encoded > -with byte order mark (BOM) and you want Git to perform automatic line > -ending conversion based on your platform. > +with byte order mark (BOM). > > > -*.txttext checkout-encoding=UTF-16 > +*.txtcheckout-encoding=UTF-16 > > > Use the following attributes if your '*.txt' files are UTF-16 little > -endian encoded without BOM and you want Git to use Windows line endings > -in the working directory. > +endian encoded without BOM and you want Git to use LF in the repo and > +CRLF in the working directory. > > > *.txtcheckout-encoding=UTF-16LE text eol=CRLF > diff --git a/convert.c b/convert.c > index
Re: [PATCH v3 0/7] convert: add support for different encodings
On Sat, Jan 06, 2018 at 01:48:01AM +0100, lars.schnei...@autodesk.com wrote: > From: Lars Schneider> > Hi, > > Patches 1-5 and 6 are helper functions and preparation. > Patch 6 is the actual change. > > I am still torn between "checkout-encoding" and "working-tree-encoding" > as attribute name. I am happy to hear arguments for/against one or the > other. checkout-encoding is probably misleading, as it is even the checkin-encoding. What is wrong with working-tree-encoding ? I think the 2 "-". What was wrong with workingtree-encoding ? Or workdir-encoding ? > > Changes since v2: > > * Added Torsten's crlfsave refactoring patch (patch 5) > @Torsten: I tried to make the commit message more clean, added > some comments to and renamed conv_flags_eol to > global_conv_flags_eol. > > * Improved documentation and commit message (Torsten) Good, thanks. > > * Removed unnecessary NUL assignment in xstrdup_tolower() (Torsten) > > * Set "git config core.eol lf" to made the test run on Windows (Dscho) > > * Made BOM arrays static (Ramsay) Some comments: I would like to have the CRLF conversion a little bit more strict - many users tend to set core.autocrlf=true or write "* text=auto" in the .gitattributes. Reading all the effort about BOM markers and UTF-16LE, I think there should ne some effort to make the line endings round trip. Therefore I changed convert.c to demand that the "text" attribute is set to enable CRLF conversions. (If I had submitted the patch, I would have demanded "text eol=lf" or "text eol=crlf", but the test case t0028 indicates that there is a demand to produce line endings as configured in core.eol) Anyway, I rebased it onto git.git/master, changed the docu, and pushed it to https://github.com/tboegi/git/tree/180107-0935-For-lars-schneider-encode-V3B Here is a inter-diff against your version: diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index 1bc03e69c..b8d9f91c8 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -281,7 +281,7 @@ interpreted as binary and consequently built-in Git text processing tools (e.g. 'git diff') as well as most Git web front ends do not visualize the content. -In these cases you can teach Git the encoding of a file in the working +In these cases you can tell Git the encoding of a file in the working directory with the `checkout-encoding` attribute. If a file with this attributes is added to Git, then Git reencodes the content from the specified encoding to UTF-8 and stores the result in its internal data @@ -308,17 +308,20 @@ Use the `checkout-encoding` attribute only if you cannot store a file in UTF-8 encoding and if you want Git to be able to process the content as text. +Note that when `checkout-encoding` is defined, by default the line +endings are not converted. `text=auto` and core.autocrlf are ignored. +Set the `text` attribute to enable CRLF conversions. + Use the following attributes if your '*.txt' files are UTF-16 encoded -with byte order mark (BOM) and you want Git to perform automatic line -ending conversion based on your platform. +with byte order mark (BOM). -*.txt text checkout-encoding=UTF-16 +*.txt checkout-encoding=UTF-16 Use the following attributes if your '*.txt' files are UTF-16 little -endian encoded without BOM and you want Git to use Windows line endings -in the working directory. +endian encoded without BOM and you want Git to use LF in the repo and +CRLF in the working directory. *.txt checkout-encoding=UTF-16LE text eol=CRLF diff --git a/convert.c b/convert.c index 13f766d2a..1e29f515e 100644 --- a/convert.c +++ b/convert.c @@ -221,18 +221,27 @@ static void check_global_conv_flags_eol(const char *path, enum crlf_action crlf_ } } static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats, @@ -432,7 +441,7 @@ static int crlf_to_git(const struct index_state *istate, * cherry-pick. */ if ((!(conv_flags & CONV_EOL_RENORMALIZE)) && - has_cr_in_index(istate, path)) + has_crlf_in_index(istate, path)) convert_crlf_into_lf = 0; } if (((conv_flags & CONV_EOL_RNDTRP_WARN) || @@ -1214,9 +1223,28 @@ static void convert_attrs(struct conv_attrs *ca, const char *path) ca->crlf_action = git_path_check_crlf(ccheck + 0); ca->ident = git_path_check_ident(ccheck + 1); ca->drv = git_path_check_convert(ccheck + 2); + ca->checkout_encoding = git_path_check_encoding(ccheck + 5); if (ca->crlf_action != CRLF_BINARY) { enum eol eol_attr = git_path_check_eol(ccheck + 3); -
[PATCH v3 0/7] convert: add support for different encodings
From: Lars SchneiderHi, Patches 1-5 and 6 are helper functions and preparation. Patch 6 is the actual change. I am still torn between "checkout-encoding" and "working-tree-encoding" as attribute name. I am happy to hear arguments for/against one or the other. Changes since v2: * Added Torsten's crlfsave refactoring patch (patch 5) @Torsten: I tried to make the commit message more clean, added some comments to and renamed conv_flags_eol to global_conv_flags_eol. * Improved documentation and commit message (Torsten) * Removed unnecessary NUL assignment in xstrdup_tolower() (Torsten) * Set "git config core.eol lf" to made the test run on Windows (Dscho) * Made BOM arrays static (Ramsay) Thanks, Lars RFC: https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/ v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/ v2: https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/ Base Ref: master Web-Diff: https://github.com/larsxschneider/git/commit/f21a1841a4 Checkout: git fetch https://github.com/larsxschneider/git encoding-v3 && git checkout f21a1841a4 ### Interdiff (v2..v3): diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index 0039bd38c3..1bc03e69cb 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -285,11 +285,18 @@ In these cases you can teach Git the encoding of a file in the working directory with the `checkout-encoding` attribute. If a file with this attributes is added to Git, then Git reencodes the content from the specified encoding to UTF-8 and stores the result in its internal data -structure. On checkout the content is encoded back to the specified -encoding. +structure (called "the index"). On checkout the content is encoded +back to the specified encoding. -Please note that using the `checkout-encoding` attribute has a number -of drawbacks: +Please note that using the `checkout-encoding` attribute may have a +number of pitfalls: + +- Git clients that do not support the `checkout-encoding` attribute + will checkout the respective files UTF-8 encoded and not in the + expected encoding. Consequently, these files will appear different + which typically causes trouble. This is in particular the case for + older Git versions and alternative Git implementations such as JGit + or libgit2 (as of January 2018). - Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause errors as the conversion might not be round trip safe. @@ -297,12 +304,6 @@ of drawbacks: - Reencoding content requires resources that might slow down certain Git operations (e.g 'git checkout' or 'git add'). -- Git clients that do not support the `checkout-encoding` attribute or - the used encoding will checkout the respective files as UTF-8 encoded. - That means the content appears to be different which could cause - trouble. Affected clients are older Git versions and alternative Git - implementations such as JGit or libgit2 (as of January 2018). - Use the `checkout-encoding` attribute only if you cannot store a file in UTF-8 encoding and if you want Git to be able to process the content as text. diff --git a/apply.c b/apply.c index c4bd5cf1f2..f8b67bfee2 100644 --- a/apply.c +++ b/apply.c @@ -2263,8 +2263,8 @@ static void show_stats(struct apply_state *state, struct patch *patch) static int read_old_data(struct stat *st, struct patch *patch, const char *path, struct strbuf *buf) { - enum safe_crlf safe_crlf = patch->crlf_in_old ? - SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE; + int conv_flags = patch->crlf_in_old ? + CONV_EOL_KEEP_CRLF : CONV_EOL_RENORMALIZE; switch (st->st_mode & S_IFMT) { case S_IFLNK: if (strbuf_readlink(buf, path, st->st_size) < 0) @@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch, * should never look at the index when explicit crlf option * is given. */ - convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0); + convert_to_git(NULL, path, buf->buf, buf->len, buf, conv_flags); return 0; default: return -1; diff --git a/blame.c b/blame.c index 388b66897b..2893f3c103 100644 --- a/blame.c +++ b/blame.c @@ -229,7 +229,7 @@ static struct commit *fake_working_tree_commit(struct diff_options *opt, if (strbuf_read(, 0, 0) < 0) die_errno("failed to read from stdin"); } - convert_to_git(_index, path, buf.buf, buf.len, , 0, 0); + convert_to_git(_index, path, buf.buf, buf.len, , 0); origin->file.ptr = buf.buf; origin->file.size = buf.len; pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_oid.hash); diff --git