Philippe Verdy <verd...@wanadoo.fr> wrote:

|2012/7/16 Steven Atreju <snatr...@googlemail.com>:
|> Fifteen years ago i think i would have put effort in including the
|> BOM after reading this, for complete correctness! I'm pretty sure
|> that i really would have done so.
|
|Fifteen years ago I would not have advocated it. Simply because
|support of UTF-8 was very poor (and there were even differences of
|interpretation between the ISO/IEC definition and the Unicode
|definition, notably in the conformance requirements).
|This is no longer the case.
|
|> So, given that this page ranks 3 when searching for «utf-8 bom»
|> from within Germany, i would 1) fix the «ecoding» typo and 2)
|> change this to be less «neutral». The answer to «Q.» is simply
|> «Yes. Software should be capable of stripping an encoded BOM in
|> UTF, because some softish Unicode processors fail to do so when
|> converting between different multioctet UTF schemes. Using a BOM
|> with UTF-8 is not recommended.»
|>
|> |> I know that, in Germany, many, many small libraries are being
|> |> closed because there is not enough money available to keep up
|> |> with the digital race, and even the greater ones *do* have
|> |> problems staying in touch!
|> |
|> |People like to complain about the BOM, but no libraries are shutting
|> |down because of it. "Keeping up with the digital race" isn't about
|> |handling two or three bytes at the beginning of a text file, in a
|> |way that has been defined for two decades.
|>
|> RFC 2279 doesn't mention the BOM.
|>
|> Looking at my 119,90.- German Mark Unicode 3.0 book, there is
|> indeed talk about the UTF-8 BOM. We have (2.7, page 28)
|> «Conformance to the Unicode Standard does not requires the use of
|> the BOM as such a signature» (typo taken plain; or is it no
|> typo?), and (13.6, page 324) «.. never any questions of byte order
|> with UTF-8 text, this sequence can serve as signature for ..
this
|> sequence of bytes will be extremely rare at the beginning of text
|> files in other encodings ... for example []Microsoft Windows[]».
|>
|> So this is fine. It seems UTF-16 and UTF-32 were never meant for
|> data exchange, and the BOM was really a byte order indicator for a
|> consumer that was aware of the encoding but not of the byte order.
|> And UTF-8 got an additional «wohooo - i'm Unicode text» signature
|> tag, though optional. I like the term «extremely rare» sooo much!!
|> :-)
|
|No need to rant. There is evidence that the role of the BOM in UTF-8
|has been to help the migration from legacy charsets to Unicode, to
|avoid mojibake. And this role is still important. As UTF-8 became
|prominent in interchange, and the need for migration from older
|encodings grew considerably, this small signature has helped in
|knowing which files were converted and which were not, even when
|there was no metadata (metadata is frequently dropped as soon as the
|resource is no longer on a web server, but stored in a file on a
|local filesystem).
|
|As there are still a lot of local resources using other encodings,
|the signature really helps in managing the local contents. And more
|and more applications will recognize this signature automatically to
|avoid using the default legacy encodings of the local system
|(something they still do in the absence of metadata and of the BOM):
|you no longer need to use a menu in apps to select the proper
|encoding (most often it is not available, or requires restarting the
|application or cancelling an ongoing transaction, and we still
|frequently have to manage the situation where resources using legacy
|local encodings and those in UTF-8 are mixed in the application).
|
|The BOM is thus extremely useful in a transition that will last
|several decades (or more), whenever a resource is not strictly bound
|to the 7-bit US-ASCII subset.
I disagree, disagree, disagree :).

|I am also convinced that even shell interpreters on Linux/Unix should
|recognize and accept the leading BOM before the hash/bang starting
|line (which is commonly used for filetype identification and runtime
|behavior), without claiming that they don't know what to do to run
|the file or which shell interpreter to use.

Please let it be as agnostic as it is.

|PHP itself should be allowed to use it as well (but unfortunately it
|still does not have the concept of tracking the effective encoding to
|parse its scripts simply).
|
|Yes this requires modifying the database of filetype signatures, but
|this kind of update has long been necessary for handling more and
|more filetypes (see for example the frequent updates and the growth
|of the "/etc/magic" database used by the Unix/Linux tool "file").

But i'm lucky that you mention this tool, since i've forgotten to do
so in my last post. It first appeared in 1973, is a standardized
POSIX application, and is part of all operating systems i currently
want to know of, including Mac OS X. It handles the UTF-8 BOM the
right way, possibly the only really right way. And here is how:

|looks_utf8_with_BOM(const unsigned char *buf, size_t nbytes,
|    unichar *ubuf, size_t *ulen)
|{
|        if (nbytes > 3 &&
|            buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf)
|                return file_looks_utf8(buf + 3, nbytes - 3, ubuf, ulen);
|        else
|                return -1;
|}

So, if there is a BOM, check the rest for normal UTF-8 text.
(Without knowing all the details of the file(1) internals, i think
the heuristic won't match *without* treating the BOM in a special
way.) Better that is.

Steven