Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
> On 27 Feb 2018, at 06:17, Eric Sunshinewrote: > > On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneider > wrote: >>> On 25 Feb 2018, at 04:41, Eric Sunshine wrote: >>> Is this interpretation correct? When I read [1], I interpret it as >>> saying that no BOM _of any sort_ should be present when the encoding >>> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE. >> >> Correct! >> >>> This code, on the other hand, only checks for BOMs corresponding >>> to the declared size (16 or 32 bits). >> >> Hmm. Interesting thought. You are saying that my code won't complain if >> a document declared as UTF-16LE has a UTF32-LE BOM, correct? > > Well, not specifically that case since UTF-16LE BOM is a subset of UTF32-LE > BOM. Correct - bad example on my part! > My observation was more general in that [1] seems to say that there > should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE, UTF-32BE, > or UTF-32LE is declared. You are saying that a document declared as UTF-16LE must not start with feff (UTF-32BE BOM)? I interpreted that situation as a "feff" in the middle of a file and therefore the BOM should be treated as ZWNBSP as explained here: http://unicode.org/faq/utf_bom.html#bom6 Plus, if "_no_ BOM whatsoever" is allowed then wouldn't we need to check for UTF-1, UTF-7, and UTF-8 BOM's too? I dunno. >> I would say >> this is correct behavior in context of this function. This function assumes >> that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly >> declared with respect to its BOM in the .gitattributes. Would this >> comment make it more clear to you? >>/* >> * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16 >> * BOM must not be used [1]. The same applies for the UTF-32 >> equivalents. >> * The function returns true if this rule is violated. >> * >> * [1] http://unicode.org/faq/utf_bom.html#bom10 >> */ >> I think what you are referring to is a different class of error and >> would therefore warrant its own checker function. Would you agree? > > I don't understand to what different class of error you refer. The > FAQ[1] seems pretty clear to me in that if one of those declarations > is used explicitly, then there should be _no_ BOM, period. It doesn't > say anything about allowing a BOM for a differently-sized encoding (16 > vs 32). > > If I squint very hard, I _guess_ I can see how you interpret [1] with > the more narrow meaning of the restriction applying only to a BOM of > the same size as the declared encoding, though reading it that way > doesn't come easily to me. For me it is somewhat the other way around :-) Since I am not sure what is right, I decided to ask the Internet: https://stackoverflow.com/questions/49038872/is-a-utf-32be-bom-valid-in-an-utf-16le-declared-data-stream Let's see if someone has a good answer. - Lars
Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneiderwrote: >> On 25 Feb 2018, at 04:41, Eric Sunshine wrote: >> Is this interpretation correct? When I read [1], I interpret it as >> saying that no BOM _of any sort_ should be present when the encoding >> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE. > > Correct! > >> This code, on the other hand, only checks for BOMs corresponding >> to the declared size (16 or 32 bits). > > Hmm. Interesting thought. You are saying that my code won't complain if > a document declared as UTF-16LE has a UTF32-LE BOM, correct? Well, not specifically that case since UTF-16LE BOM is a subset of UTF32-LE BOM. My observation was more general in that [1] seems to say that there should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE is declared. > I would say > this is correct behavior in context of this function. This function assumes > that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly > declared with respect to its BOM in the .gitattributes. Would this > comment make it more clear to you? > /* > * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16 > * BOM must not be used [1]. The same applies for the UTF-32 > equivalents. > * The function returns true if this rule is violated. > * > * [1] http://unicode.org/faq/utf_bom.html#bom10 > */ > I think what you are referring to is a different class of error and > would therefore warrant its own checker function. Would you agree? I don't understand to what different class of error you refer. The FAQ[1] seems pretty clear to me in that if one of those declarations is used explicitly, then there should be _no_ BOM, period. It doesn't say anything about allowing a BOM for a differently-sized encoding (16 vs 32). If I squint very hard, I _guess_ I can see how you interpret [1] with the more narrow meaning of the restriction applying only to a BOM of the same size as the declared encoding, though reading it that way doesn't come easily to me. >> I suppose the intention of [1] is to detect a mismatch between the >> declared encoding and how the stream is actually encoded. The check >> implemented here will fail to detect a mismatch between, say, declared >> encoding UTF-16BE and actual encoding UTF-32BE. > > As stated above the intention is to detect wrong BOMs! I think we cannot > detect the "declared as UTF-16BE but actually UTF-32BE" error. > > Consider this: > > printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od > -c > 000 \0 t \0 e \0 s \0 t > 010 > > In the first step we "encode" the string to UTF-32BE and then we "decode" it > as > UTF-16BE. The result is valid although not correct. Does this make sense? I'm probably being dense, but I don't understand what this is trying to illustrate in relation to has_prohibited_utf_bom().
Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
> On 25 Feb 2018, at 04:41, Eric Sunshinewrote: > > On Sat, Feb 24, 2018 at 11:27 AM, wrote: >> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE >> or UTF-32LE a BOM must not be used [1]. The function returns true if >> this is the case. >> >> [1] http://unicode.org/faq/utf_bom.html#bom10 >> >> Signed-off-by: Lars Schneider >> --- >> diff --git a/utf8.c b/utf8.c >> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz, >> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len) >> +{ >> + return ( >> + (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) && >> + (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) || >> + has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom))) >> + ) || ( >> + (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) && >> + (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) || >> + has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom))) >> + ); >> +} > > Is this interpretation correct? When I read [1], I interpret it as > saying that no BOM _of any sort_ should be present when the encoding > is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE. Correct! > This > code, on the other hand, only checks for BOMs corresponding to the > declared size (16 or 32 bits). Hmm. Interesting thought. You are saying that my code won't complain if a document declared as UTF-16LE has a UTF32-LE BOM, correct? I would say this is correct behavior in context of this function. This function assumes that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly declared with respect to its BOM in the .gitattributes. Would this comment make it more clear to you? /* * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16 * BOM must not be used [1]. The same applies for the UTF-32 equivalents. * The function returns true if this rule is violated. * * [1] http://unicode.org/faq/utf_bom.html#bom10 */ I think what you are referring to is a different class of error and would therefore warrant its own checker function. Would you agree? > I suppose the intention of [1] is to detect a mismatch between the > declared encoding and how the stream is actually encoded. The check > implemented here will fail to detect a mismatch between, say, declared > encoding UTF-16BE and actual encoding UTF-32BE. As stated above the intention is to detect wrong BOMs! I think we cannot detect the "declared as UTF-16BE but actually UTF-32BE" error. Consider this: printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od -c 000 \0 t \0 e \0 s \0 t 010 In the first step we "encode" the string to UTF-32BE and then we "decode" it as UTF-16BE. The result is valid although not correct. Does this make sense? Thanks, Lars
Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
On Sat, Feb 24, 2018 at 11:27 AM,wrote: > Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE > or UTF-32LE a BOM must not be used [1]. The function returns true if > this is the case. > > [1] http://unicode.org/faq/utf_bom.html#bom10 > > Signed-off-by: Lars Schneider > --- > diff --git a/utf8.c b/utf8.c > @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz, > +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len) > +{ > + return ( > + (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) && > + (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) || > + has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom))) > + ) || ( > + (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) && > + (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) || > + has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom))) > + ); > +} Is this interpretation correct? When I read [1], I interpret it as saying that no BOM _of any sort_ should be present when the encoding is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE. This code, on the other hand, only checks for BOMs corresponding to the declared size (16 or 32 bits). I suppose the intention of [1] is to detect a mismatch between the declared encoding and how the stream is actually encoded. The check implemented here will fail to detect a mismatch between, say, declared encoding UTF-16BE and actual encoding UTF-32BE.
[PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
From: Lars SchneiderWhenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used [1]. The function returns true if this is the case. This function is used in a subsequent commit. [1] http://unicode.org/faq/utf_bom.html#bom10 Signed-off-by: Lars Schneider --- utf8.c | 24 utf8.h | 9 + 2 files changed, 33 insertions(+) diff --git a/utf8.c b/utf8.c index 2c27ce0137..914881cd1f 100644 --- a/utf8.c +++ b/utf8.c @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz, } #endif +static int has_bom_prefix(const char *data, size_t len, + const char *bom, size_t bom_len) +{ + return (len >= bom_len) && !memcmp(data, bom, bom_len); +} + +static const char utf16_be_bom[] = {0xFE, 0xFF}; +static const char utf16_le_bom[] = {0xFF, 0xFE}; +static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF}; +static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00}; + +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len) +{ + return ( + (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) && + (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) || + has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom))) + ) || ( + (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) && + (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) || + has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom))) + ); +} + /* * Returns first character length in bytes for multi-byte `text` according to * `encoding`. diff --git a/utf8.h b/utf8.h index 6bbcf31a83..4711429af9 100644 --- a/utf8.h +++ b/utf8.h @@ -70,4 +70,13 @@ typedef enum { void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width, const char *s); +/* + * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE + * or UTF-32LE a BOM must not be used [1]. The function returns true if + * this is the case. + * + * [1] http://unicode.org/faq/utf_bom.html#bom10 + */ +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len); + #endif -- 2.16.1