I think you and I are on the same page here, Clemens? I abhor the BOM, but the 
question is whether or not SQLite will cater to the fact that the bigger names 
in the industry appear hell-bent on shoving it in users’ documents by default.

Given that ‘.import’ and ‘.mode csv’ are “user mode” commands, perhaps leeway 
can be shown in breaking with standards for the sake of compatibility and 
sanity?
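
To make the idea concrete, here is a minimal sketch in C of the kind of check
a BOM-tolerant import could perform. It is an illustration only, not code from
the SQLite shell: `utf8_bom_length` is a hypothetical helper name, and the
sketch assumes the importer already holds the first bytes of the file in a
buffer before it parses the first field.

```c
#include <stddef.h>

/* Hypothetical helper (not part of SQLite): return the number of leading
** bytes to skip if the buffer starts with the UTF-8 byte order mark
** <EF BB BF>, or 0 if it does not. A partial match (fewer than three
** bytes, or a mismatch) is treated as ordinary data. */
size_t utf8_bom_length(const unsigned char *buf, size_t n){
  if( n>=3 && buf[0]==0xEF && buf[1]==0xBB && buf[2]==0xBF ){
    return 3;
  }
  return 0;
}
```

A caller would advance its read position by the returned count once, at the
very start of the file, so that a U+FEFF appearing anywhere else in the data
is left untouched.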

Mahmoud 

From: Clemens Ladisch
Sent: Friday, June 23, 2017 2:25 AM
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] UTF8-BOM not disregarded in CSV import

Mahmoud Al-Qudsi wrote:
> with `.import ……`, SQLite3 includes a BOM (UTF-8) as part of the first
> column of the first record.

The Unicode Standard 9.0 says in section 3.10:
| When represented in UTF-8, the byte order mark turns into the byte
| sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream
| is neither required nor recommended by the Unicode Standard,

so you should not use it.

Treating this character as a zero width no-break space, and keeping it,
is a correct interpretation of the file.

> IMHO, this is of particular importance since the latest versions of MS
> Excel default to “UTF-8 CSV” which includes a BOM.

That's wrong:
| When converting between different encoding schemes, extreme care must
| be taken in handling any initial byte order marks. For example, if one
| converted a UTF-16 byte serialization with an initial byte order mark
| to a UTF-8 byte serialization, thereby converting the byte order mark
| to <EF BB BF> in the UTF-8 form, the <EF BB BF> would now be ambiguous
| as to its status as a byte order mark (from its source) or as an
| initial zero width no-break space. If the UTF-8 byte serialization
| were then converted to UTF-16BE and the initial <EF BB BF> were
| converted to <FE FF>, the interpretation of the U+FEFF character would
| have been modified by the conversion. This would be nonconformant
| behavior according to conformance clause C7, because the change
| between byte serializations would have resulted in modification of the
| interpretation of the text. This is one reason why the use of the
| initial byte sequence <EF BB BF> as a signature on UTF-8 byte
| sequences is not recommended by the Unicode Standard.

And Google Docs also thinks it would be a good idea to act against
this recommendation:
<https://productforums.google.com/forum/#!topic/docs/p_jCTwzuIqk>

> Would anyone be opposed to a patch to SQLite that disregarded a BOM
> when found during a csv import operation?

Well, being wrong doesn't mean that Microsoft or Google will change
their behaviour ...


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
