Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Eric Grange Tue, 27 Jun 2017 03:42:48 -0700

> In case 7 we have little choice but to invoke heuristics or defer to the
> user, yes?

Yes in theory, but "no" in the real world, or rather "not in any way that
matters".

In the real world, text files are heavily skewed towards 8 bit formats,
meaning just three cases dominate the debate:
- ASCII / ANSI
- utf-8 with BOM
- utf-8 without BOM

And further, the overwhelming majority of text content is likely to
involve ASCII at the beginning (from various markups:
think html, xml, json, source code... even csv, because of an explicit
separator specification or the first column name).

So while in theory all the scenarios you describe are interesting, in
practice seeing a utf-8 BOM provides an extremely
high likelihood that a file will indeed be utf-8. Not always, but a memory
chip could also be hit by a cosmic ray.

Conversely, the absence of a utf-8 BOM means a high probability of
"something undetermined": ANSI or BOMless utf-8,
or something more oddball (in which I lump utf-16 btw)... and the need for
heuristics to kick in.
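
As a rough sketch of that decision order (Python; the helper name
guess_encoding and the cp1252 fallback standing in for "ANSI" are my own
assumptions, not anything the SQLite shell does): check for the utf-8 BOM
first, and only fall back to heuristics when it is absent.

```python
import codecs


def guess_encoding(data: bytes, ansi_fallback: str = "cp1252") -> str:
    """Hypothetical helper: trust the utf-8 BOM first, heuristics second.

    utf-16 and other oddball encodings are deliberately not handled here.
    """
    # An explicit utf-8 BOM (bytes EF BB BF) settles the question.
    # "utf-8-sig" is Python's utf-8 codec that also strips the BOM.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"

    # No BOM: heuristics have to kick in.
    # Pure ASCII decodes identically under ANSI code pages and utf-8.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass

    # Strict utf-8 decoding rarely succeeds by accident on ANSI text.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed default; the real code page depends on the system.
        return ansi_fallback
```

Decoding with the returned name, e.g. data.decode(guess_encoding(data)),
also drops the BOM when one was present.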

Outside of source code and Linux config files, BOMless utf-8 files are
certainly not the most frequent text files: ANSI and
various other encodings dominate, because most non-ASCII text files were
(and are) produced under DOS or Windows,
where Notepad and friends use ANSI by default, for instance.

That may not be a desirable or happy situation, but that is the situation
we have to deal with.

It is also the reason why 20 years later the utf-8 BOM is still in use: it
is explicit and has a practical success rate higher
than any of the heuristics, while collisions of the BOM with the start of
actual ANSI (or other) text are unheard of.
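
To illustrate why such a collision is so unlikely, here is what an ANSI
file would have to start with in order to be mistaken for BOM-prefixed
utf-8 (a small Python illustration; picking cp1252 and latin-1 as
representative ANSI code pages is my assumption):

```python
import codecs

# The utf-8 BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Read as Windows-1252 or Latin-1, those bytes are the characters
# U+00EF, U+00BB, U+00BF (i with diaeresis, right guillemet, inverted
# question mark): an odd way for real text to begin.
as_cp1252 = bom.decode("cp1252")
as_latin1 = bom.decode("latin-1")

# ascii() keeps the output printable on any console.
print(ascii(as_cp1252), ascii(as_latin1))
```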

On Tue, Jun 27, 2017 at 10:34 AM, Robert Hairgrove <evorgri...@hispeed.ch>
wrote:

> On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> > The original issue was two of the largest companies in the world
> > output the
> > Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of
> > UTF-8
> > encoded text streams, and it would be friendly for the SQLite3 shell
> > to
> > skip it or use it for encoding identification in at least some cases.
>
> I would suggest adding a command-line argument to the shell indicating
> whether to ignore a BOM or not, possibly requiring specification of a
> certain encoding or list of encodings to consider.
>
> Certainly this should not be a requirement for the library per se, but
> a responsibility of the client to provide data in the proper encoding.
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>