I think you and I are on the same page here, Clemens? I abhor the BOM, but the question is whether or not SQLite will cater to the fact that the bigger names in the industry appear hell-bent on shoving it in users’ documents by default.
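To make concrete what "disregarding" the BOM would amount to, here is a rough Python sketch. It is not the SQLite shell's code and not a proposed patch. The names `strip_utf8_bom` and `import_csv` are made up for illustration, and the sketch assumes the first CSV row holds column names. The only BOM-specific step is dropping a leading EF BB BF before parsing; without it the first column would be named "\ufeffid" rather than "id".

```python
import codecs
import csv
import io
import sqlite3


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading EF BB BF if present; leave all other bytes untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def import_csv(conn: sqlite3.Connection, table: str, data: bytes) -> list:
    """Hypothetical helper: create `table` from CSV bytes, ignoring a BOM.

    Assumes the first row is a header. Returns the column names used.
    """
    text = strip_utf8_bom(data).decode("utf-8")
    rows = list(csv.reader(io.StringIO(text, newline="")))
    header, body = rows[0], rows[1:]

    def quote(name: str) -> str:
        # Quote identifiers so unusual column names remain valid SQL.
        return '"' + name.replace('"', '""') + '"'

    cols = ", ".join(quote(name) for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {quote(table)} ({cols})")
    conn.executemany(f"INSERT INTO {quote(table)} VALUES ({marks})", body)
    return header


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + b"id,name\r\n1,a\r\n"
    conn = sqlite3.connect(":memory:")
    print(import_csv(conn, "t", sample))
    print(conn.execute("SELECT id, name FROM t").fetchall())
```

Only a BOM at the very start of the input is touched, so a U+FEFF anywhere else in the file is preserved as data.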
Given that ‘.import’ and ‘.mode csv’ are “user mode” commands, perhaps leeway can be shown in breaking with standards for the sake of compatibility and sanity?

Mahmoud

From: Clemens Ladisch
Sent: Friday, June 23, 2017 2:25 AM
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] UTF8-BOM not disregarded in CSV import

Mahmoud Al-Qudsi wrote:
> with `.import ……`, SQLite3 includes a BOM (UTF-8) as part of the first
> column of the first record.

The Unicode Standard 9.0 says in section 3.10:
| When represented in UTF-8, the byte order mark turns into the byte
| sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream
| is neither required nor recommended by the Unicode Standard,

so you should not use it. Treating this character as a zero width
no-break space, and keeping it, is a correct interpretation of the file.

> IMHO, this is of particular importance since the latest versions of MS
> Excel default to “UTF-8 CSV” which includes a BOM.

That's wrong:
| When converting between different encoding schemes, extreme care must
| be taken in handling any initial byte order marks. For example, if one
| converted a UTF-16 byte serialization with an initial byte order mark
| to a UTF-8 byte serialization, thereby converting the byte order mark
| to <EF BB BF> in the UTF-8 form, the <EF BB BF> would now be ambiguous
| as to its status as a byte order mark (from its source) or as an
| initial zero width no-break space. If the UTF-8 byte serialization
| were then converted to UTF-16BE and the initial <EF BB BF> were
| converted to <FE FF>, the interpretation of the U+FEFF character would
| have been modified by the conversion. This would be nonconformant
| behavior according to conformance clause C7, because the change
| between byte serializations would have resulted in modification of the
| interpretation of the text. This is one reason why the use of the
| initial byte sequence <EF BB BF> as a signature on UTF-8 byte
| sequences is not recommended by the Unicode Standard.

And Google Docs also thinks it would be a good idea to act against this
recommendation:
<https://productforums.google.com/forum/#!topic/docs/p_jCTwzuIqk>

> Would anyone be opposed to a patch to SQLite that disregarded a BOM
> when found during a csv import operation?

Well, being wrong doesn't mean that Microsoft or Google will change
their behaviour ...

Regards,
Clemens
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users