On 27 June 2017 at 18:42, Eric Grange <egra...@glscene.org> wrote:
> So while in theory all the scenarios you describe are interesting, in
> practice seeing an utf-8 BOM provides an extremely high likeliness that
> a file will indeed be utf-8. Not always, but a memory chip could also
> be hit by a cosmic ray.
>
> Conversely the absence of an utf-8 BOM means a high probability of
> "something undetermined": ANSI or BOMless utf-8, or something more
> oddball (in which I lump utf-16 btw)... and the need for heuristics to
> kick in.
I think we are largely in agreement here (esp. wrt utf-16 being an
oddball interchange format). It doesn't answer my question though, i.e.
what advantage the BOM tag provides compared to assuming utf-8 from the
outset.

Yes, if you see a utf-8 BOM you have immediate confidence that the data
is utf-8 encoded, but what have you lost if you start with [fake]
confidence and treat the data as utf-8 until proven otherwise? Either
the data is utf-8, or ASCII, or ANSI with no high-bit characters, and
everything works; or you find an invalid byte sequence, which gives you
high confidence that this is not actually utf-8 data. Granted, it
requires more than three bytes of lookahead, but we're going to be
using that data anyway. (There's a rough sketch of this approach in the
postscript below.)

I guess the one clear advantage I see of a utf-8 BOM is that it can
simplify some code, and reduce some duplicate work, when interfacing
with APIs which both require a text encoding specified up-front and
don't offer a convenient error path when decoding fails. But adding
"utf-8 with BOM" as yet another text encoding configuration to the
landscape seems like a high price to pay, and certainly not an overall
simplification.

> Outside of source code and Linux config files, BOMless utf-8 are
> certainly not the most frequent text files, ANSI and other various
> encodings dominate, because most non-ASCII text files were (are)
> produced under DOS or Windows, where notepad and friends use ANSI by
> default f.i.

Notepad barely counts as a text editor (newlines are always two bytes
long, yeah? :P), but I take your point that ANSI is common (especially
CP1251?). I've honestly never seen a utf-8 file *with* a BOM though, so
perhaps I've lived a sheltered life.

I'm not sure what you were going for here:

> the overwhelming majority of text content are likely to involve ASCII
> at the beginning (from various markups, think html, xml, json, source
> code... even csv

HTML's encoding is generally specified in the HTTP header or in <meta
http-equiv> metadata. XML's encoding must be declared on the first line
(unless the default utf-8 is used or a BOM is present). JSON's encoding
must be either utf-8, utf-16 or utf-32. Source code encoding is
generally defined by the language in question. In other words, these
formats mostly carry their own encoding information, so the leading
ASCII doesn't need to do that job (see the second sketch below, with
XML as the example).

> That may not be a desirable or happy situation, but that is the
> situation we have to deal with.

True, we're stuck with decisions of the past. I guess (and maybe I've
finally understood your position?) if a BOM had been mandated for _all_
utf-8 data from the outset, to clearly distinguish it from pre-existing
ANSI codepages, then I could see its value. Although I remain a little
repulsed by having those three little bytes at the front of all my
files to solve what is predominantly a transport issue ;)

-Rowan
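P.S. Since "treat the data as utf-8 until proven otherwise" is easy to
hand-wave, here is a minimal sketch in Python of what I mean. The
cp1252 fallback is only an example of mine; substitute whatever ANSI
codepage fits your locale (CP1251 etc.):

    import codecs

    def read_text(path, fallback="cp1252"):
        # Pure ASCII decodes unchanged; a utf-8 BOM (EF BB BF) is
        # stripped if present; the first invalid byte sequence demotes
        # the file to the fallback ANSI codepage.
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(codecs.BOM_UTF8):
            data = data[len(codecs.BOM_UTF8):]
        try:
            # Strict by default: raises on the first invalid sequence.
            return data.decode("utf-8")
        except UnicodeDecodeError:
            # High confidence this is not utf-8 after all. "replace"
            # because cp1252 leaves a handful of byte values undefined.
            return data.decode(fallback, "replace")

The worst case is scanning the whole file before deciding, which is the
"more than three bytes of lookahead" trade-off above, but as noted, we
were going to read that data anyway.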
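P.P.S. On the "markups declare themselves" point, XML is the tidy
example: the declaration, when present, must come first and is pure
ASCII, so it can be sniffed from the raw bytes before committing to a
decoder. A hypothetical helper, purely illustrative:

    import re

    def sniff_xml_encoding(data, default="utf-8"):
        # Matches the encoding= pseudo-attribute of an XML declaration,
        # e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
        m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']',
                     data)
        return m.group(1).decode("ascii") if m else default

    # sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?>')
    # -> 'ISO-8859-1'

(This ignores utf-16 XML, where you would first look at the leading
bytes for a BOM or a 00 3C / 3C 00 pattern, but it shows why a missing
BOM is not fatal for self-describing formats.)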