From: Lars Schneider <larsxschnei...@gmail.com>

Hi,

Patches 1-4 and 6 are preparation and helper functions.
Patches 5 and 7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is already
in master.

Changes since v6:

* use consistent casing for core.checkRoundtripEncoding (Junio)
* fix gibberish in commit message (Junio)
* improve documentation (Torsten)
* improve advice messages (Torsten)


Thanks,
Lars

  RFC: https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
   v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
   v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schnei...@autodesk.com/
   v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/
   v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/
   v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/2b94bec353
Checkout: git fetch https://github.com/larsxschneider/git encoding-v7 && git checkout 2b94bec353


### Interdiff (v6..v7):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index ea5a9509c6..10cb37795d 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -291,19 +291,20 @@ the content is reencoded back to the specified encoding.
 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:

-- Git clients that do not support the `working-tree-encoding` attribute
-  will checkout the respective files UTF-8 encoded and not in the
-  expected encoding. Consequently, these files will appear different
-  which typically causes trouble. This is in particular the case for
-  older Git versions and alternative Git implementations such as JGit
-  or libgit2 (as of February 2018).
+- Third party Git implementations that do not support the
+  `working-tree-encoding` attribute will checkout the respective files
+  UTF-8 encoded and not in the expected encoding. Consequently, these
+  files will appear different which typically causes trouble. This is
+  in particular the case for older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of February 2018).

 - Reencoding content to non-UTF encodings can cause errors as the
   conversion might not be UTF-8 round trip safe. If you suspect your
-  encoding to not be round trip safe, then add it to `core.checkRoundtripEncoding`
-  to make Git check the round trip encoding (see linkgit:git-config[1]).
-  SHIFT-JIS (Japanese character set) is known to have round trip issues
-  with UTF-8 and is checked by default.
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.

 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
@@ -327,7 +328,7 @@ explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.

 ------------------------
-*.proj                 working-tree-encoding=UTF-16LE text eol=CRLF
+*.proj                 text working-tree-encoding=UTF-16LE eol=CRLF
 ------------------------

 You can get a list of all available encodings on your platform with the
diff --git a/convert.c b/convert.c
index 71dffc7167..398cd9cf7b 100644
--- a/convert.c
+++ b/convert.c
@@ -352,29 +352,29 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,

        if (has_prohibited_utf_bom(enc->name, src, src_len)) {
                const char *error_msg = _(
-                       "BOM is prohibited for '%s' if encoded as %s");
+                       "BOM is prohibited in '%s' if encoded as %s");
+               /*
+                * This advise is shown for UTF-??BE and UTF-??LE encodings.
+                * We truncate the encoding name to 6 chars with %.6s to cut
+                * off the last two "byte order" characters.
+                */
                const char *advise_msg = _(
-                       "You told Git to treat '%s' as %s. A byte order mark "
-                       "(BOM) is prohibited with this encoding. Either use "
-                       "%.6s as working tree encoding or remove the BOM from the "
-                       "file.");
-
-               advise(advise_msg, path, enc->name, enc->name, enc->name);
+                       "The file '%s' contains a byte order mark (BOM). "
+                       "Please use %.6s as working-tree-encoding.");
+               advise(advise_msg, path, enc->name);
                if (conv_flags & CONV_WRITE_OBJECT)
                        die(error_msg, path, enc->name);
                else
                        error(error_msg, path, enc->name);

-
        } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
                const char *error_msg = _(
-                       "BOM is required for '%s' if encoded as %s");
+                       "BOM is required in '%s' if encoded as %s");
                const char *advise_msg = _(
-                       "You told Git to treat '%s' as %s. A byte order mark "
-                       "(BOM) is required with this encoding. Either use "
-                       "%sBE/%sLE as working tree encoding or add a BOM to the "
-                       "file.");
-               advise(advise_msg, path, enc->name, enc->name, enc->name);
+                       "The file '%s' is missing a byte order mark (BOM). "
+                       "Please use %sBE or %sLE (depending on the byte order) "
+                       "as working-tree-encoding.");
+               advise(advise_msg, path, enc->name, enc->name);
                if (conv_flags & CONV_WRITE_OBJECT)
                        die(error_msg, path, enc->name);
                else
@@ -405,7 +405,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
         * Unicode aims to be a superset of all other character encodings.
         * However, certain encodings (e.g. SHIFT-JIS) are known to have round
         * trip issues [2]. Check the round trip conversion for all encodings
-        * listed in core.checkRoundTripEncoding.
+        * listed in core.checkRoundtripEncoding.
         *
         * The round trip check is only performed if content is written to Git.
         * This ensures that no information is lost during conversion to/from
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 5dcdd5f899..e4717402a5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -221,10 +221,10 @@ test_expect_success 'check roundtrip encoding' '
        git reset &&

        # ... unless we overwrite the Git config!
-       test_config core.checkRoundTripEncoding "garbage" &&
+       test_config core.checkRoundtripEncoding "garbage" &&
        ! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
                grep "Checking roundtrip encoding for SHIFT-JIS" &&
-       test_unconfig core.checkRoundTripEncoding &&
+       test_unconfig core.checkRoundtripEncoding &&
        git reset &&

        # UTF-16 encoded files should not be round-trip checked by default...
@@ -233,14 +233,14 @@ test_expect_success 'check roundtrip encoding' '
        git reset &&

        # ... unless we tell Git to check it!
-       test_config_global core.checkRoundTripEncoding "UTF-16, UTF-32" &&
+       test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
        GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
                grep "Checking roundtrip encoding for UTF-16" &&
        git reset &&

        # ... unless we tell Git to check it!
        # (here we also check that the casing of the encoding is irrelevant)
-       test_config_global core.checkRoundTripEncoding "UTF-32, utf-16" &&
+       test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
        GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
                grep "Checking roundtrip encoding for UTF-16" &&
        git reset &&
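
For readers unfamiliar with the `%.6s` trick used in the new advice
message in convert.c: a precision on `%s` limits how many characters are
printed, so "UTF-16LE" becomes "UTF-16". The following is only a minimal
standalone sketch of that idea, not the code from this series. The helper
names (starts_with_utf16_bom, suggest_bom_encoding, format_bom_advice) are
hypothetical, and the BOM check is simplified to UTF-16; the real checks
live in has_prohibited_utf_bom() and is_missing_required_utf_bom().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): return nonzero if the buffer
 * starts with a UTF-16 byte order mark, i.e. FE FF (big endian) or
 * FF FE (little endian).
 */
static int starts_with_utf16_bom(const char *buf, size_t len)
{
	return len >= 2 &&
	       (!memcmp(buf, "\xFE\xFF", 2) || !memcmp(buf, "\xFF\xFE", 2));
}

/*
 * Hypothetical helper: copy at most the first six characters of the
 * encoding name into out, which drops the two trailing "byte order"
 * characters of names such as "UTF-16LE" or "UTF-32BE".
 */
static void suggest_bom_encoding(const char *enc, char *out, size_t out_len)
{
	assert(enc && out && out_len > 0);
	snprintf(out, out_len, "%.6s", enc);
}

/*
 * Hypothetical helper: build an advice string with the same wording as
 * the v7 message. Only two arguments are needed because the truncation
 * happens inside the format string.
 */
static int format_bom_advice(const char *path, const char *enc,
			     char *out, size_t out_len)
{
	assert(path && enc && out && out_len > 0);
	return snprintf(out, out_len,
			"The file '%s' contains a byte order mark (BOM). "
			"Please use %.6s as working-tree-encoding.",
			path, enc);
}
```

Calling suggest_bom_encoding() with "UTF-16LE" yields "UTF-16", and with
"UTF-32BE" it yields "UTF-32". This is why the advise() call in v7 takes
only `path` and `enc->name`: the format string does the truncation itself.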


### Patches

Lars Schneider (7):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  74 +++++++++++
 config.c                         |   5 +
 convert.c                        | 256 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 253 ++++++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  25 ++++
 12 files changed, 672 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.1
