Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

2018-02-28 Thread Lars Schneider

> On 27 Feb 2018, at 06:17, Eric Sunshine  wrote:
> 
> On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneider
>  wrote:
>>> On 25 Feb 2018, at 04:41, Eric Sunshine  wrote:
>>> Is this interpretation correct? When I read [1], I interpret it as
>>> saying that no BOM _of any sort_ should be present when the encoding
>>> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
>> 
>> Correct!
>> 
>>> This code, on the other hand, only checks for BOMs corresponding
>>> to the declared size (16 or 32 bits).
>> 
>> Hmm. Interesting thought. You are saying that my code won't complain if
>> a document declared as UTF-16LE has a UTF32-LE BOM, correct?
> 
> Well, not specifically that case since UTF-16LE BOM is a subset of UTF32-LE 
> BOM.

Correct - bad example on my part!


> My observation was more general in that [1] seems to say that there
> should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE, UTF-32BE,
> or UTF-32LE is declared.

You are saying that a document declared as UTF-16LE must not start 
with feff (UTF-32BE BOM)? I interpreted that situation as a "feff"
in the middle of a file and therefore the BOM should be treated as
ZWNBSP as explained here: http://unicode.org/faq/utf_bom.html#bom6

Plus, if "_no_ BOM whatsoever" is allowed then wouldn't we need to check
for UTF-1, UTF-7, and UTF-8 BOM's too?

I dunno.


>> I would say
>> this is correct behavior in context of this function. This function assumes
>> that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly
>> declared with respect to its BOM in the .gitattributes. Would this
>> comment make it more clear to you?
>>/*
>> * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
>> * BOM must not be used [1]. The same applies for the UTF-32 
>> equivalents.
>> * The function returns true if this rule is violated.
>> *
>> * [1] http://unicode.org/faq/utf_bom.html#bom10
>> */
>> I think what you are referring to is a different class of error and
>> would therefore warrant its own checker function. Would you agree?
> 
> I don't understand to what different class of error you refer. The
> FAQ[1] seems pretty clear to me in that if one of those declarations
> is used explicitly, then there should be _no_ BOM, period. It doesn't
> say anything about allowing a BOM for a differently-sized encoding (16
> vs 32).
> 
> If I squint very hard, I _guess_ I can see how you interpret [1] with
> the more narrow meaning of the restriction applying only to a BOM of
> the same size as the declared encoding, though reading it that way
> doesn't come easily to me.

For me it is somewhat the other way around :-)
Since I am not sure what is right, I decided to ask the Internet:
https://stackoverflow.com/questions/49038872/is-a-utf-32be-bom-valid-in-an-utf-16le-declared-data-stream

Let's see if someone has a good answer.

- Lars



Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

2018-02-26 Thread Eric Sunshine
On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneider
 wrote:
>> On 25 Feb 2018, at 04:41, Eric Sunshine  wrote:
>> Is this interpretation correct? When I read [1], I interpret it as
>> saying that no BOM _of any sort_ should be present when the encoding
>> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
>
> Correct!
>
>> This code, on the other hand, only checks for BOMs corresponding
>> to the declared size (16 or 32 bits).
>
> Hmm. Interesting thought. You are saying that my code won't complain if
> a document declared as UTF-16LE has a UTF32-LE BOM, correct?

Well, not specifically that case since UTF-16LE BOM is a subset of UTF32-LE BOM.

My observation was more general in that [1] seems to say that there
should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE, UTF-32BE,
or UTF-32LE is declared.

> I would say
> this is correct behavior in context of this function. This function assumes
> that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly
> declared with respect to its BOM in the .gitattributes. Would this
> comment make it more clear to you?
> /*
>  * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
>  * BOM must not be used [1]. The same applies for the UTF-32 
> equivalents.
>  * The function returns true if this rule is violated.
>  *
>  * [1] http://unicode.org/faq/utf_bom.html#bom10
>  */
> I think what you are referring to is a different class of error and
> would therefore warrant its own checker function. Would you agree?

I don't understand to what different class of error you refer. The
FAQ[1] seems pretty clear to me in that if one of those declarations
is used explicitly, then there should be _no_ BOM, period. It doesn't
say anything about allowing a BOM for a differently-sized encoding (16
vs 32).

If I squint very hard, I _guess_ I can see how you interpret [1] with
the more narrow meaning of the restriction applying only to a BOM of
the same size as the declared encoding, though reading it that way
doesn't come easily to me.

>> I suppose the intention of [1] is to detect a mismatch between the
>> declared encoding and how the stream is actually encoded. The check
>> implemented here will fail to detect a mismatch between, say, declared
>> encoding UTF-16BE and actual encoding UTF-32BE.
>
> As stated above the intention is to detect wrong BOMs! I think we cannot
> detect the "declared as UTF-16BE but actually UTF-32BE" error.
>
> Consider this:
>
> printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od 
> -c
> 000   \0   t  \0   e  \0   s  \0   t
> 010
>
> In the first step we "encode" the string to UTF-32BE and then we "decode" it 
> as
> UTF-16BE. The result is valid although not correct. Does this make sense?

I'm probably being dense, but I don't understand what this is trying
to illustrate in relation to has_prohibited_utf_bom().


Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

2018-02-25 Thread Lars Schneider

> On 25 Feb 2018, at 04:41, Eric Sunshine  wrote:
> 
> On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
>> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
>> or UTF-32LE a BOM must not be used [1]. The function returns true if
>> this is the case.
>> 
>> [1] http://unicode.org/faq/utf_bom.html#bom10
>> 
>> Signed-off-by: Lars Schneider 
>> ---
>> diff --git a/utf8.c b/utf8.c
>> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
>> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +   return (
>> + (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
>> + (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +   ) || (
>> + (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
>> + (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +   );
>> +}
> 
> Is this interpretation correct? When I read [1], I interpret it as
> saying that no BOM _of any sort_ should be present when the encoding
> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.

Correct!

> This
> code, on the other hand, only checks for BOMs corresponding to the
> declared size (16 or 32 bits).

Hmm. Interesting thought. You are saying that my code won't complain if
a document declared as UTF-16LE has a UTF32-LE BOM, correct? I would say
this is correct behavior in context of this function. This function assumes
that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly
declared with respect to its BOM in the .gitattributes. Would this
comment make it more clear to you?

/*
 * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
 * BOM must not be used [1]. The same applies for the UTF-32 
equivalents.
 * The function returns true if this rule is violated.
 *
 * [1] http://unicode.org/faq/utf_bom.html#bom10
 */


I think what you are referring to is a different class of error and
would therefore warrant its own checker function. Would you agree?


> I suppose the intention of [1] is to detect a mismatch between the
> declared encoding and how the stream is actually encoded. The check
> implemented here will fail to detect a mismatch between, say, declared
> encoding UTF-16BE and actual encoding UTF-32BE.

As stated above the intention is to detect wrong BOMs! I think we cannot 
detect the "declared as UTF-16BE but actually UTF-32BE" error.

Consider this:

printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od -c
000   \0   t  \0   e  \0   s  \0   t
010

In the first step we "encode" the string to UTF-32BE and then we "decode" it as
UTF-16BE. The result is valid although not correct. Does this make sense?

Thanks,
Lars




Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

2018-02-24 Thread Eric Sunshine
On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
> or UTF-32LE a BOM must not be used [1]. The function returns true if
> this is the case.
>
> [1] http://unicode.org/faq/utf_bom.html#bom10
>
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/utf8.c b/utf8.c
> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +   return (
> + (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
> + (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +   ) || (
> + (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
> + (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +   );
> +}

Is this interpretation correct? When I read [1], I interpret it as
saying that no BOM _of any sort_ should be present when the encoding
is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE. This
code, on the other hand, only checks for BOMs corresponding to the
declared size (16 or 32 bits).

I suppose the intention of [1] is to detect a mismatch between the
declared encoding and how the stream is actually encoded. The check
implemented here will fail to detect a mismatch between, say, declared
encoding UTF-16BE and actual encoding UTF-32BE.


[PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

2018-02-24 Thread lars . schneider
From: Lars Schneider 

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider 
---
 utf8.c | 24 
 utf8.h |  9 +
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..914881cd1f 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+ const char *bom, size_t bom_len)
+{
+   return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+ (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+ (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+ (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+ (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int 
width,
   const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1