Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
On Tue, Mar 06, 2018 at 03:37:16PM -0800, Junio C Hamano wrote:
> Lars Schneider writes:
>
> > After thinking about it I wonder if we should barf on "utf16" without
> > dash. Your Linux iconv would handle this correctly. My macOS iconv
> > would not. That means the repo would checkout correctly on your
> > machine but not on mine.
> >
> > What do you think?
>
> To be bluntly honest, I prefer not to have excess "sanity checks";
> there is no need to barf on utf16 in a project run by those who are
> without macOS friends, for example.  For that matter, while I do not
> hate it so much to reject it, I am not all that supportive of this
> "The consortium says without -LE or -BE suffix there must be BOM,
> and this lacks one, so barf loudly" step in this topic myself.

Losing or adding a BOM could render a file useless, depending on the
software that reads and processes it. The main reason for being critical
is to make sure that the material in the working tree does a safe
roundtrip when committed, pushed, pulled, checked out...

The best thing we can do is probably to follow the specification as
closely as possible. Having a clear specification (UTF-16LE/BE has no
BOM, UTF-16 has a BOM) would even open the chance to work around a buggy
iconv library, because Git can check the BOM itself. If, and only if, a
platform messes it up, we can make a (compile-time) workaround in Git
for this very platform.

Consistent behavior across all platforms is a good thing: sometimes I
have a network share mounted on Linux, macOS and Windows, and it doesn't
matter under which Git on which machine I check out or commit.

Oh, I see, there is a new patch coming...
Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
Lars Schneider writes:

> After thinking about it I wonder if we should barf on "utf16" without
> dash. Your Linux iconv would handle this correctly. My macOS iconv
> would not. That means the repo would checkout correctly on your machine
> but not on mine.
>
> What do you think?

To be bluntly honest, I prefer not to have excess "sanity checks";
there is no need to barf on utf16 in a project run by those who are
without macOS friends, for example.  For that matter, while I do not
hate it so much to reject it, I am not all that supportive of this
"The consortium says without -LE or -BE suffix there must be BOM,
and this lacks one, so barf loudly" step in this topic myself.
Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
> On 07 Mar 2018, at 00:07, Junio C Hamano wrote:
>
> Junio C Hamano writes:
>
>> Lars Schneider writes:
>>
>>>> Also "UTF16" or other spelling the platform may support but this
>>>> code fails to recognise will go unchecked.
>>>
>>> That is true. However, I would assume all iconv implementations use
>>> the same encoding names for UTF encodings, no? That means UTF16 would
>>> never be valid. Would you agree?
>>
>> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
>> I am not sure what you mean by "Would you agree?"  People can say in
>> their .gitattributes file that this path is in "UTF16" without dash
>> and that is what will be fed to this code, no?
>
> FWIW, "iconv -f utf8 -t utf16" seems not to complain, so I am
> reasonably sure that people expect downcased ones to work as well.

Sure! That's why I normalized it to upper case:
https://public-inbox.org/git/CAPig+cQE0pKs-AMvh4GndyCXBMnx=70jppdm6k4jjte-74f...@mail.gmail.com/

After thinking about it I wonder if we should barf on "utf16" without
dash. Your Linux iconv would handle this correctly. My macOS iconv would
not. That means the repo would checkout correctly on your machine but
not on mine.

What do you think?

- Lars
Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
Junio C Hamano writes:

> Lars Schneider writes:
>
>>> Also "UTF16" or other spelling
>>> the platform may support but this code fails to recognise will go
>>> unchecked.
>>
>> That is true. However, I would assume all iconv implementations use
>> the same encoding names for UTF encodings, no? That means UTF16 would
>> never be valid. Would you agree?
>
> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
> I am not sure what you mean by "Would you agree?"  People can say in
> their .gitattributes file that this path is in "UTF16" without dash
> and that is what will be fed to this code, no?

FWIW, "iconv -f utf8 -t utf16" seems not to complain, so I am
reasonably sure that people expect downcased ones to work as well.
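The observation above is easy to reproduce. The exact bytes and the list of advertised names are platform-dependent, which is precisely the portability concern of this subthread, so no particular output is claimed here:

```shell
# GNU iconv (Linux) accepts the lowercase, dashless spellings and, for
# the byte-order-ambiguous "utf16" target, typically emits a BOM first.
# Other iconv implementations (e.g. the one shipped with macOS) may
# accept or list a different set of names.
printf 'hi' | iconv -f utf8 -t utf16 | od -An -tx1

# Compare which UTF-16 spellings this platform's iconv advertises:
iconv -l | grep -i 'utf.\{0,1\}16'
```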
Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
> On 06 Mar 2018, at 23:53, Junio C Hamano wrote:
>
> Lars Schneider writes:
>
>>> Also "UTF16" or other spelling
>>> the platform may support but this code fails to recognise will go
>>> unchecked.
>>
>> That is true. However, I would assume all iconv implementations use
>> the same encoding names for UTF encodings, no? That means UTF16 would
>> never be valid. Would you agree?
>
> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
> I am not sure what you mean by "Would you agree?"  People can say in
> their .gitattributes file that this path is in "UTF16" without dash
> and that is what will be fed to this code, no?

On macOS I don't see UTF16... but I just checked on my Linux box and
there it is. Consequently, we need to check both versions: with and
without dash. I'll reroll ASAP (I'll try to do it first thing in the
morning).

Thanks,
Lars
Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
Lars Schneider writes:

>> Also "UTF16" or other spelling
>> the platform may support but this code fails to recognise will go
>> unchecked.
>
> That is true. However, I would assume all iconv implementations use the
> same encoding names for UTF encodings, no? That means UTF16 would never
> be valid. Would you agree?

After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
I am not sure what you mean by "Would you agree?"  People can say in
their .gitattributes file that this path is in "UTF16" without dash
and that is what will be fed to this code, no?
Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
> On 06 Mar 2018, at 21:50, Junio C Hamano wrote:
>
> lars.schnei...@autodesk.com writes:
>
>> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +	return (
>> +	   !strcmp(enc, "UTF-16") &&
>> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +	) || (
>> +	   !strcmp(enc, "UTF-32") &&
>> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +	);
>> +}
>
> These strcmp() calls seem inconsistent with the principle embodied
> by utf8.c::fallback_encoding(), i.e. "be lenient to what we accept",
> and make the interface uneven.  I am wondering if we also want to
> complain when the user gave us "utf16" and there is no byte order
> mark in the contents, for example?

Well, if I use stricmp() then I don't need to call and clean up
xstrdup_toupper() as discussed with Eric [1]. Is there a case
insensitive starts_with() method?

[1] https://public-inbox.org/git/CAPig+cQE0pKs-AMvh4GndyCXBMnx=70jppdm6k4jjte-74f...@mail.gmail.com/

> Also "UTF16" or other spelling
> the platform may support but this code fails to recognise will go
> unchecked.

That is true. However, I would assume all iconv implementations use the
same encoding names for UTF encodings, no? That means UTF16 would never
be valid. Would you agree?

- Lars
Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
lars.schnei...@autodesk.com writes:

> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +	return (
> +	   !strcmp(enc, "UTF-16") &&
> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +	) || (
> +	   !strcmp(enc, "UTF-32") &&
> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +	);
> +}

These strcmp() calls seem inconsistent with the principle embodied
by utf8.c::fallback_encoding(), i.e. "be lenient to what we accept",
and make the interface uneven.  I am wondering if we also want to
complain when the user gave us "utf16" and there is no byte order
mark in the contents, for example?  Also "UTF16" or other spelling
the platform may support but this code fails to recognise will go
unchecked.  Which actually may be a feature, not a bug, to be able
to bypass this check---I dunno.

The same comment applies to the previous step.
[PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
From: Lars Schneider

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required
BOM is missing.

The Unicode standard instructs to assume big-endian if there is no
BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
used in HTML5 recommends to assume little-endian to "deal with
deployed content" [3]. Strictly requiring a BOM seems to be the
safest option for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
    Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider
---
 utf8.c | 13 +++++++++++++
 utf8.h | 19 +++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 0db1db4519..cce654a64a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there is no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends to assume little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2