> In case 7 we have little choice but to invoke heuristics or defer to the
> user, yes?

Yes in theory, but "no" in the real world, or rather "not in any way that matters".

In the real world, text files are heavily skewed towards 8-bit formats, meaning just three cases dominate the debate:
- ASCII / ANSI
- utf-8 with BOM
- utf-8 without BOM

And further, the overwhelming majority of text content is likely to have ASCII at the beginning (from various markups, think html, xml, json, source code... even csv, because of an explicit separator specification or the first column name).

So while in theory all the scenarios you describe are interesting, in practice seeing a utf-8 BOM makes it extremely likely that a file is indeed utf-8. Not always, but a memory chip could also be hit by a cosmic ray.

Conversely, the absence of a utf-8 BOM means a high probability of "something undetermined": ANSI or BOMless utf-8, or something more oddball (in which I lump utf-16, btw)... and the need for heuristics to kick in.

Outside of source code and Linux config files, BOMless utf-8 files are certainly not the most frequent text files. ANSI and various other encodings dominate, because most non-ASCII text files were (are) produced under DOS or Windows, where Notepad and friends use ANSI by default, for instance.

That may not be a desirable or happy situation, but that is the situation we have to deal with.

It is also the reason why, 20 years later, the utf-8 BOM is still in use: it is explicit and has a practical success rate higher than any of the heuristics, while collisions of the BOM with the start of actual ANSI (or other) text are unheard of.
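
To make the decision order above concrete, here is a minimal sketch in Python. It is only an illustration: the function name guess_text_encoding and the choice of cp1252 as the "ANSI" fallback are my assumptions, not anything taken from the SQLite shell. It trusts a utf-8 BOM when present, and otherwise falls back to a very crude heuristic (strict utf-8 validation, then ANSI).

```python
import codecs


def guess_text_encoding(data: bytes, ansi_fallback: str = "cp1252") -> str:
    """Return a best-guess codec name for the raw bytes of a text file.

    Hypothetical helper for illustration only; the fallback code page
    is an assumption and would depend on the system that wrote the file.
    """
    # An explicit utf-8 BOM (EF BB BF) is treated as conclusive.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec also strips the BOM on decode

    # No BOM: "something undetermined", so a heuristic has to kick in.
    # Crude heuristic: bytes that validate as strict utf-8 are assumed
    # to be utf-8 (pure ASCII trivially passes this check).
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid utf-8: assume the 8-bit "ANSI" code page.
        return ansi_fallback
    return "ascii" if data.isascii() else "utf-8"


if __name__ == "__main__":
    for sample in (
        b"\xef\xbb\xbfid,name\n",       # utf-8 with BOM
        b"id,name\n",                   # plain ASCII
        "id,pr\u00e9nom\n".encode("utf-8"),   # BOMless utf-8
        "id,pr\u00e9nom\n".encode("cp1252"),  # ANSI
    ):
        print(sample[:12], "->", guess_text_encoding(sample))
```

Note that only the first branch is certain in practice. The BOMless branch can still guess wrong, for example on ANSI text whose high bytes happen to form valid utf-8 sequences, or on a code page other than the assumed fallback.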

On Tue, Jun 27, 2017 at 10:34 AM, Robert Hairgrove <evorgri...@hispeed.ch> wrote:

> On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> > The original issue was two of the largest companies in the world output the
> > Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of UTF-8
> > encoded text streams, and it would be friendly for the SQLite3 shell to
> > skip it or use it for encoding identification in at least some cases.
>
> I would suggest adding a command-line argument to the shell indicating
> whether to ignore a BOM or not, possibly requiring specification of a
> certain encoding or list of encodings to consider.
>
> Certainly this should not be a requirement for the library per se, but
> a responsibility of the client to provide data in the proper encoding.
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users