On 27 June 2017 at 18:42, Eric Grange <egra...@glscene.org> wrote:
> So while in theory all the scenarios you describe are interesting, in
> practice seeing an utf-8 BOM provides an extremely high likeliness that
> a file will indeed be utf-8. Not always, but a memory chip could also
> be hit by a cosmic ray.
>
> Conversely the absence of an utf-8 BOM means a high probability of
> "something undetermined": ANSI or BOMless utf-8, or something more
> oddball (in which I lump utf-16 btw)... and the need for heuristics to
> kick in.
I think we are largely in agreement here (esp. wrt utf-16 being an
oddball interchange format). It doesn't answer my question though, i.e.
what advantage the BOM tag provides compared to assuming utf-8 from the
outset.

Yes, if you see a utf-8 BOM you have immediate confidence that the data
is utf-8 encoded, but what have you lost if you start with [fake]
confidence and treat the data as utf-8 until proven otherwise? Either
the data is utf-8, or ASCII, or ANSI with no high-bit characters, and
everything works; or you find an invalid byte sequence, which gives you
high confidence that this is not actually utf-8 data. Granted, it
requires more than three bytes of lookahead, but we're going to be
using that data anyway. (There's a rough sketch of this approach in the
postscript below.)

I guess the one clear advantage I see of a utf-8 BOM is that it can
simplify some code, and reduce some duplicate work, when interfacing
with APIs which both require a text encoding specified up-front and
don't offer a convenient error path when decoding fails. But adding
"utf-8 with BOM" as yet another text encoding configuration to the
landscape seems like a high price to pay, and certainly not an overall
simplification.

> Outside of source code and Linux config files, BOMless utf-8 are
> certainly not the most frequent text files, ANSI and other various
> encodings dominate, because most non-ASCII text files were (are)
> produced under DOS or Windows, where notepad and friends use ANSI by
> default f.i.

Notepad barely counts as a text editor (newlines are always two bytes
long, yeah? :P), but I take your point that ANSI is common (especially
CP1251?). I've honestly never seen a utf-8 file *with* a BOM though, so
perhaps I've lived a sheltered life.

I'm not sure what you were going for here:

> the overwhelming majority of text content are likely to involve ASCII
> at the beginning (from various markups, think html, xml, json, source
> code... even csv

HTML's encoding is generally specified in the HTTP header or in <meta
http-equiv> metadata. XML's encoding must be declared on the first line
(unless the default utf-8 is used or a BOM is present). JSON's encoding
must be either utf-8, utf-16 or utf-32. Source code encoding is
generally defined by the language in question. In other words, these
formats mostly carry their own encoding information, so the leading
ASCII doesn't need to do that job (see the second sketch below, with
XML as the example).

> That may not be a desirable or happy situation, but that is the
> situation we have to deal with.

True, we're stuck with decisions of the past. I guess (and maybe I've
finally understood your position?) if a BOM had been mandated for _all_
utf-8 data from the outset, to clearly distinguish it from pre-existing
ANSI codepages, then I could see its value. Although I remain a little
repulsed by having those three little bytes at the front of all my
files to solve what is predominantly a transport issue ;)

-Rowan
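P.S. Since "treat the data as utf-8 until proven otherwise" is easy to
hand-wave, here is a minimal sketch in Python of what I mean. The
cp1252 fallback is only an example of mine; substitute whatever ANSI
codepage fits your locale (CP1251 etc.):

    import codecs

    def read_text(path, fallback="cp1252"):
        # Pure ASCII decodes unchanged; a utf-8 BOM (EF BB BF) is
        # stripped if present; the first invalid byte sequence demotes
        # the file to the fallback ANSI codepage.
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(codecs.BOM_UTF8):
            data = data[len(codecs.BOM_UTF8):]
        try:
            # Strict by default: raises on the first invalid sequence.
            return data.decode("utf-8")
        except UnicodeDecodeError:
            # High confidence this is not utf-8 after all. "replace"
            # because cp1252 leaves a handful of byte values undefined.
            return data.decode(fallback, "replace")

The worst case is scanning the whole file before deciding, which is the
"more than three bytes of lookahead" trade-off above, but as noted, we
were going to read that data anyway.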
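P.P.S. On the "markups declare themselves" point, XML is the tidy
example: the declaration, when present, must come first and is pure
ASCII, so it can be sniffed from the raw bytes before committing to a
decoder. A hypothetical helper, purely illustrative:

    import re

    def sniff_xml_encoding(data, default="utf-8"):
        # Matches the encoding= pseudo-attribute of an XML declaration,
        # e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
        m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']',
                     data)
        return m.group(1).decode("ascii") if m else default

    # sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?>')
    # -> 'ISO-8859-1'

(This ignores utf-16 XML, where you would first look at the leading
bytes for a BOM or a 00 3C / 3C 00 pattern, but it shows why a missing
BOM is not fatal for self-describing formats.)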