At the bottom...

-----Original Message-----
From: Eric Grange
Sent: Monday, June 26, 2017 3:09 AM
To: SQLite mailing list
Subject: Re: [sqlite] UTF8-BOM not disregarded in CSV import

Alas, there is no end in sight to the pain caused by the Unicode decision
not to make the BOM compulsory for UTF-8.

Making it optional or unnecessary basically made every single text file
ambiguous, with non-trivial heuristics and implicit conventions required
instead, resulting in character corruption that is neither acceptable to
nor understood by users.
Making it compulsory would have meant fixing pre-Unicode *nix command-line
utilities and C string code. Much pain, sure, but in
retrospect this would have been a much smarter choice, as everything could
have been settled in a matter of years.

But now, more than 20 years later, UTF-8 storage is still a mess, with no
end in sight :/


On Sun, Jun 25, 2017 at 9:16 PM, Cezary H. Noweta <c...@poczta.onet.pl>
wrote:

Hello,

On 2017-06-23 22:12, Mahmoud Al-Qudsi wrote:

I think you and I are on the same page here, Clemens? I abhor the
BOM, but the question is whether or not SQLite will cater to the fact
that the bigger names in the industry appear hell-bent on shoving it
in users’ documents by default.


Given that ‘.import’ and ‘.mode csv’ are “user mode” commands,
perhaps leeway can be shown in breaking with standards for the sake
of compatibility and sanity?


IMHO, this is not a good way to show leeway. The Unicode Standard has
enough bad things in it already. There is no need to turn one of
Unicode's good things into a bad one.

Should SQLite disregard one <EF BB BF> sequence, or all <EF BB BF>
sequences, or at most 2, 3, or 10 of them at the beginning of a file? Such
a stream can be produced by a sequence of conversions done by a mix of
conforming and ``breaking the standard for the sake of compatibility''
converters.
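
The policy question above (strip one BOM, a fixed number, or all of them) can be made concrete with a small sketch. This is only an illustration of the choice being debated, not SQLite's behaviour; the function name `strip_leading_boms` and its `max_count` parameter are hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # the <EF BB BF> sequence discussed above


def strip_leading_boms(data: bytes, max_count=1) -> bytes:
    """Drop up to max_count leading UTF-8 BOMs (hypothetical helper).

    max_count=None means "drop every leading BOM", which is the
    open-ended policy the message above argues against.
    """
    removed = 0
    while data.startswith(UTF8_BOM) and (max_count is None or removed < max_count):
        data = data[len(UTF8_BOM):]
        removed += 1
    return data


# A file that went through two BOM-adding converters in a row.
doubled = UTF8_BOM + UTF8_BOM + b"id,name\n"

after_one = strip_leading_boms(doubled)        # one BOM is still in the data
after_all = strip_leading_boms(doubled, None)  # no BOM is left
```

Whatever limit is picked, some chain of converters can exceed it, which is the point being made about there being no natural place to stop.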

To be clear: I understand your point very well - ``let's ignore an optional
BOM at the beginning'', but I want to show that there is no limit to
such thinking. Why only one optional BOM? You have not pointed out what
this should be compatible with. The next step is to ignore N BOMs for the sake of
compatibility with breaking the standard for the sake of compatibility
with breaking the standard for the sake of... lim = \infty. I cannot see
any sanity here.

The standard says: ``Only UTF-16/32 (not even UTF-16/32LE/BE) encoding
forms can contain a BOM''. Let's conform to this.

Certainly, there is no objection to extending the import's functionality
in such a way that it ignores the initial 0xFEFF. However, an import
should allow ZWNBSP as the first character, in its basic form, to be
conforming to the standard.
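
A minimal sketch of that split, written in Python rather than against the SQLite shell: the default path keeps an initial U+FEFF as data, and an opt-in flag ignores it. The function `first_row` and its `ignore_bom` flag are hypothetical names for illustration; the `utf-8-sig` codec is Python's standard codec that drops a single leading BOM when decoding.

```python
import csv
import io


def first_row(raw: bytes, ignore_bom: bool = False):
    """Return the first CSV record of raw (hypothetical helper).

    ignore_bom=False: plain UTF-8, so an initial U+FEFF stays in the data.
    ignore_bom=True:  'utf-8-sig' drops a single leading BOM, if present.
    """
    codec = "utf-8-sig" if ignore_bom else "utf-8"
    text = raw.decode(codec)
    return next(csv.reader(io.StringIO(text, newline="")))


raw = b"\xef\xbb\xbfid,name\r\n1,x\r\n"

strict = first_row(raw)                    # first column name starts with U+FEFF
lenient = first_row(raw, ignore_bom=True)  # first column name is plain "id"
```

In the strict case the first column is named "\ufeffid", which is the kind of surprise users report after a CSV import.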

-- best regards

Cezary H. Noweta
_______________________________________________

I have made a decision to always include the BOM in all my text files, whether they are UTF-8, UTF-16 or UTF-32, little or big endian. I think all of us should, too. Just because the "Unicode Gurus" didn't think so does not mean they are right. I once had a doctor give me the wrong diagnosis: there were just too many symptoms that looked alike, and they chose one and went with it. The same thing happened with the Unicode Gurus; they never thought about the problems they would be causing today. Some applications do not place a BOM on UTF-8 or UTF-16 files, and then you have to go and find out which encoding it is before you can decode the file correctly. This can all be prevented by having a BOM.

Yes, I know I am repeating what everybody else is saying, but what I am also saying is: let us all use the BOM, and have every application we write welcome the BOM. One last thing: every application uses internal file information to tell whether it can read a file, whether this version of the application supports that version of the file, and so on. UTF-8, UTF-16 and UTF-32 files, little or big endian, should have a BOM. Thanks.
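
As a rough sketch of what "welcoming the BOM" could look like in an application, the following picks a decoder from the leading bytes and falls back to a default when no BOM is present. The names `sniff_bom` and `decode_with_bom` and the UTF-8 fallback are assumptions for illustration; the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the UTF-32 entries are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the BOM when present, else the assumed fallback."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or fallback)


sample = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")
text = decode_with_bom(sample)
```

Even this is a heuristic: a UTF-16 LE file whose first character after the BOM is U+0000 has the same leading bytes as a UTF-32 LE BOM.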

josé
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
