Re: [sqlite] UTF8-BOM not disregarded in CSV import

Eric Grange Mon, 26 Jun 2017 00:10:29 -0700

Alas, there is no end in sight to the pain caused by the Unicode decision
not to make the BOM compulsory for UTF-8.


Making it optional basically made every single text file ambiguous,
requiring non-trivial heuristics and implicit conventions instead, and
resulting in character corruption that users neither accept nor
understand.
Making it compulsory would have meant fixing pre-Unicode *nix command-line
utilities and C string code. Much pain, sure, but in retrospect it would
have been a much smarter choice, as everything could have been settled in
a matter of years.

But now, more than 20 years later, UTF-8 storage is still a mess, with no
end in sight :/
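
To make the ambiguity concrete, here is a minimal Python sketch (my own
illustration, nothing to do with SQLite's code): the same bytes decode
without error under two encodings, and only a signature removes the
guesswork. The sample string and the choice of cp1252 as the competing
legacy encoding are assumptions.

```python
# Illustrative only: the sample text and the cp1252 comparison are assumptions.
import codecs

text = "café"
raw = text.encode("utf-8")          # UTF-8 bytes, no BOM

# Without a signature a reader has to guess, and both decodes succeed.
as_utf8 = raw.decode("utf-8")       # the intended text
as_cp1252 = raw.decode("cp1252")    # silently corrupted, no error raised

# With a signature, the encoding is stated in the first three bytes.
signed = codecs.BOM_UTF8 + raw
has_signature = signed.startswith(codecs.BOM_UTF8)
```

The cp1252 decode is the kind of silent corruption users end up seeing:
no exception, just wrong characters.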


On Sun, Jun 25, 2017 at 9:16 PM, Cezary H. Noweta <c...@poczta.onet.pl>
wrote:

> Hello,
>
> On 2017-06-23 22:12, Mahmoud Al-Qudsi wrote:
>
>> I think you and I are on the same page here, Clemens? I abhor the
>> BOM, but the question is whether or not SQLite will cater to the fact
>> that the bigger names in the industry appear hell-bent on shoving it
>> in users’ documents by default.
>>
>> Given that ‘.import’ and ‘.mode csv’ are “user mode” commands,
>> perhaps leeway can be shown in breaking with standards for the sake
>> of compatibility and sanity?
>>
>
> IMHO, this is not a good way to show leeway. The Unicode Standard has
> enough bad things in it. There is no need to turn one of Unicode's good
> things into a bad one.
>
> Should SQLite disregard one <EF BB BF> sequence, or all <EF BB BF>
> sequences, or at most 2, 3, or 10 of them at the beginning of a file?
> Such a stream can be produced by a sequence of conversions done by a mix
> of conforming and ``breaking the standard for the sake of
> compatibility'' converters.
>
> To be clear: I understand your point very well - ``let's ignore an
> optional BOM at the beginning'', but I want to show that there is no
> limit to such thinking. Why just one optional BOM? You have not said
> what it should be compatible with. The next step is to ignore N BOMs for
> the sake of compatibility with breaking the standard for the sake of
> compatibility with breaking the standard for the sake of... lim = \infty.
> I cannot see any sanity here.
>
> The standard says: ``Only UTF-16/32 (not even UTF-16/32LE/BE) encoding
> forms can contain a BOM''. Let's conform to this.
>
> Certainly, there is no objection to extending an import's functionality
> in such a way that it ignores the initial 0xFEFF. However, an import
> should allow ZWNBSP as the first character, in its basic form, to be
> conforming to the standard.
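
A rough Python sketch of that opt-in behaviour, to show what "ignore the
initial 0xFEFF" could look like next to a strict default. This is not
SQLite's shell code; `import_csv` and its `skip_bom` parameter are
hypothetical names made up for illustration.

```python
# Hypothetical helper, not SQLite's implementation of .import.
import codecs
import csv
import io
from typing import List

def import_csv(data: bytes, skip_bom: bool = False) -> List[List[str]]:
    """Parse UTF-8 CSV bytes into rows.

    skip_bom=False: strict mode, a leading U+FEFF stays in the first field.
    skip_bom=True: exactly one leading <EF BB BF> is dropped, never more.
    """
    if skip_bom and data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))

sample = codecs.BOM_UTF8 + b"id,name\r\n1,Ann\r\n"
strict = import_csv(sample)                  # first header keeps U+FEFF
lenient = import_csv(sample, skip_bom=True)  # first header is plain "id"
doubled = import_csv(codecs.BOM_UTF8 + sample, skip_bom=True)  # second BOM kept as data
```

Dropping at most one signature, and only on request, is one possible
answer to the "why not N?" question: anything after the first three bytes
is treated as data.
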
>
> -- best regards
>
> Cezary H. Noweta
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
