"Doug Ewell" <d...@ewellic.org> wrote: |Steven Atreju wrote: | |> If Unicode *defines* that the so-called BOM is in fact a Unicode- |> indicating tag that MUST be present, | |But Unicode does not define that.

Nope. On http://unicode.org/faq/utf_bom.html I read:

  Q: Why do some of the UTFs have a BE or LE in their label, such as
  UTF-16LE?

So it seems to me that the Unicode Consortium takes care of newbies
and of people who work at a very high programming level, say, PHP,
Flash, JavaScript, or who do no programming at all. And:

  Q: Is the UTF-8 encoding scheme the same irrespective of whether
  the underlying processor is little endian or big endian?
  ...
  Where a BOM is used with UTF-8, it is only used as an ecoding
  signature to distinguish UTF-8 from other encodings — it has
  nothing to do with byte order.

Fifteen years ago I think I would have put effort into including the
BOM after reading this, for complete correctness! I'm pretty sure
that I really would have done so. So, given that this page ranks
third when searching for «utf-8 bom» from within Germany, I would
1) fix the «ecoding» typo and 2) change this to be less «neutral».
The answer to «Q.» is simply «Yes. Software should be capable of
stripping an encoded BOM in UTF, because some softish Unicode
processors fail to do so when converting between different
multioctet UTF schemes. Using BOM with UTF-8 is not recommended.»

|> I know that, in Germany, many, many small libraries become closed
|> because there is not enough money available to keep up with the
|> digital race, and even the greater *do* have problems to stay in
|> touch!
|
|People like to complain about the BOM, but no libraries are shutting
|down because of it. "Keeping up with the digital race" isn't about
|handling two or three bytes at the beginning of a text file, in a way
|that has been defined for two decades.

RFC 2279 doesn't mention the BOM. Looking at my 119,90.- German Mark
Unicode 3.0 book, there is indeed talk about the UTF-8 BOM.
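
For what it's worth, stripping such a signature is a matter of a few lines. A minimal sketch in Python, assuming the input is a byte string that may or may not begin with the UTF-8 encoded U+FEFF; the function name `strip_utf8_bom` is made up for illustration:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 encoded U+FEFF, if present."""
    # codecs.BOM_UTF8 is the three octets EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(strip_utf8_bom(b"\xef\xbb\xbfhello"))  # b'hello'
print(strip_utf8_bom(b"hello"))              # b'hello'
```

Python's built-in `utf-8-sig` codec does the same thing while decoding, so `data.decode("utf-8-sig")` is probably what one would use in practice.
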
We have (2.7, page 28) «Conformance to the Unicode Standard does not
requires the use of the BOM as such a signature» (typo taken plain; or
is it no typo?), and (13.6, page 324) «..never any questions of byte
order with UTF-8 text, this sequence can serve as signature for ..
this sequence of bytes will be extremely rare at the beginning of text
files in other encodings ... for example []Microsoft Windows[]».

So this is fine. It seems UTF-16 and UTF-32 were never meant for data
exchange, and the BOM really was a byte order indicator for a consumer
that was aware of the encoding but not of the byte order. And UTF-8
got an additional «wohooo - I'm Unicode text» signature tag, though an
optional one. I like the term «extremely rare» sooo much!! :-)

I now restart my «rant» UTF-8 filetype thread from the beginning.
I wonder: was the Unicode Consortium really so unconfident? Do I
really read «UTF-8 will drown in this evil mess of terroristic
charsets, so raise the torch of freedom in this unfriendly
environment!»? I have downloaded the 6.0 and 6.1 stuff as a PDF and
for free (:->.

If you know how to deal with UTF-8, you can deal with UTF-8. If you
don't, no signature will ever help you, no?! If you don't know the
charset of some text that comes from nowhere, i.e., with no container
format carrying meta-information and no filetype extension carrying
implicit meta-information, as is used on Mac OS and DOS, then UTF-8 is
still very easily identifiable by itself due to the way the algorithm
is designed. Isn't it??

Tear down the wall! Tear down the wall! Tear down the wall!

|It's about technologies and
|standards and platforms and formats that change incompatibly every few
|years.

That is of course true. But what to do with these myriads of
aggressive nerds that linger in these neon-lit four-square-meter
boxes, with their poignant hunger for penthouse windows and
four-cylinder Mercedes-Benz limousines? I'm asking you. I've seen
photos of standards committees in palm-covered bays (CSS2? DOM?
W3C anyway), and I've dropped my subscription to the regular IETF
discussion list because I can stand only so many dozens of threads
about dinner, hotel-room reservations, «laptop-compatible socket in
Paris?» and whatever else (the annual ladies' steakhouse meeting!).
So here you are. These people have deserved it, and no better.

Steven
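
P.S. On UTF-8 being identifiable by itself: lead and continuation octets follow a strict bit pattern, so text in a legacy 8-bit charset that contains non-ASCII characters rarely decodes cleanly as UTF-8. A rough sketch in Python, assuming that a strict decode is an acceptable test; `looks_like_utf8` is a made-up name, and this is a heuristic, not a proof:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("Grüße".encode("utf-8")))    # True
print(looks_like_utf8("Grüße".encode("latin-1")))  # False
```

Pure ASCII passes trivially, since it is valid UTF-8 anyway, and a few short Latin-1 sequences happen to form valid UTF-8 too, so the result is more trustworthy the longer the input.
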