Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-01-30 Thread Junio C Hamano
Lars Schneider  writes:

> "false". Therefore, "is_missing_required_utf_bom()" might be 
> lengthy but should fit.

Thanks, that sounds a lot more understandable than the original ;-)


Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-01-30 Thread Lars Schneider

> On 30 Jan 2018, at 20:15, Junio C Hamano  wrote:
> 
> tbo...@web.de writes:
> 
>> From: Lars Schneider 
>> 
>> If the endianness is not defined in the encoding name, then let's
>> be strict and require a BOM to avoid any encoding confusion. The
>> has_missing_utf_bom() function returns true if a required BOM is
>> missing.
>> 
>> The Unicode standard instructs to assume big-endian if there is no BOM
>> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
>> in HTML5 recommends to assume little-endian to "deal with deployed
>> content" [3]. Strictly requiring a BOM seems to be the safest option
>> for content in Git.
> 
> I do not have strong opinion on encoding such policy-ish behaviour
> as our default, but am I alone to find that "has missing X" is a
> confusing name for a helper function?  "is missing X" (or "lacks
> X") is a bit more understandable, I guess.

That might be a German/English translation thing, but I think I get
your point. "has" implies there is something and "missing" implies
there is nothing :)

"is_missing_utf_bom()" might be even a bit unspecific as UTF-8
is usually missing a UTF BOM but the function would still return 
"false". Therefore, "is_missing_required_utf_bom()" might be 
lengthy but should fit.
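
To make the naming argument concrete, here is a minimal, self-contained sketch of the helper under the proposed name. The BOM byte arrays and the body of has_bom_prefix() are not shown in this patch, so the definitions below are assumptions for illustration only; the real ones in utf8.c may differ. The point is that "UTF-8" and endianness-qualified names such as "UTF-16LE" never count as missing a *required* BOM.

```c
#include <stddef.h>
#include <string.h>

/* Assumed BOM byte sequences (not part of this patch). */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};
static const char utf16_le_bom[] = {'\xFF', '\xFE'};
static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};

/* Assumed helper: true if `data` starts with the `bom_len` bytes of `bom`. */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
}

/*
 * Returns 1 only when the encoding name leaves the byte order open
 * ("UTF-16" or "UTF-32") and the data does not start with a BOM.
 */
int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
{
	return (
	   !strcmp(enc, "UTF-16") &&
	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
	) || (
	   !strcmp(enc, "UTF-32") &&
	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
	);
}
```

With these assumed definitions, BOM-less data labelled "UTF-16" yields 1, while the same data labelled "UTF-8" or "UTF-16LE" yields 0, which is why "required" in the name seems worth the extra length.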

OK for you?

- Lars


> 
>> +int has_missing_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +	return (
>> +	   !strcmp(enc, "UTF-16") &&
>> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +	) || (
>> +	   !strcmp(enc, "UTF-32") &&
>> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +	);
>> +}



Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-01-30 Thread Junio C Hamano
tbo...@web.de writes:

> From: Lars Schneider 
>
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> has_missing_utf_bom() function returns true if a required BOM is
> missing.
>
> The Unicode standard instructs to assume big-endian if there is no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.

I do not have strong opinion on encoding such policy-ish behaviour
as our default, but am I alone to find that "has missing X" is a
confusing name for a helper function?  "is missing X" (or "lacks
X") is a bit more understandable, I guess.

> +int has_missing_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +	return (
> +	   !strcmp(enc, "UTF-16") &&
> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +	) || (
> +	   !strcmp(enc, "UTF-32") &&
> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +	);
> +}


[PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-01-29 Thread tboegi
From: Lars Schneider 

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
has_missing_utf_bom() function returns true if a required BOM is
missing.

The Unicode standard instructs to assume big-endian if there is no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1..f033fec1c 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
);
 }
 
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af..26b5e9185 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty
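
The byte-order ambiguity that motivates the strict BOM requirement in the commit message above can be illustrated with a small sketch. The two helper names below are hypothetical and not part of this patch or of utf8.c; they only show that the same two bytes form different UTF-16 code units depending on the assumed endianness, so a BOM-less "UTF-16" file cannot be decoded reliably by guessing.

```c
#include <stdint.h>

/* Hypothetical helper: read one UTF-16 code unit as big-endian. */
static uint16_t utf16_unit_be(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: read one UTF-16 code unit as little-endian. */
static uint16_t utf16_unit_le(const unsigned char *p)
{
	return (uint16_t)((p[1] << 8) | p[0]);
}
```

For the bytes 0x00 0x41, the big-endian reading gives U+0041 ("A"), which is what the Unicode standard's default would produce, while the little-endian reading gives U+4100, which is what the WHATWG default would produce.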