Re: UTF-8 BOM (Re: Charset declaration in HTML)
Doug Ewell d...@ewellic.org wrote:

|Steven Atreju wrote:
|
| If Unicode *defines* that the so-called BOM is in fact a Unicode-
| indicating tag that MUST be present,
|
|But Unicode does not define that.

Nope. On http://unicode.org/faq/utf_bom.html i read:

  Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

So it seems to me that the Unicode Consortium takes care of newbies and of those people who work at a very high programming level, say, PHP, Flash, JavaScript, or even no programming at all. And:

  Q: Is the UTF-8 encoding scheme the same irrespective of whether the
  underlying processor is little endian or big endian?
  ... Where a BOM is used with UTF-8, it is only used as an ecoding
  signature to distinguish UTF-8 from other encodings — it has nothing
  to do with byte order.

Fifteen years ago i think i would have put effort into including the BOM after reading this, for complete correctness! I'm pretty sure that i really would have done so. So, given that this page ranks third when searching for «utf-8 bom» from within Germany, i would 1) fix the «ecoding» typo and 2) change this to be less «neutral». The answer to «Q.» is simply «Yes. Software should be capable of stripping an encoded BOM in UTF, because some softish Unicode processors fail to do so when converting between different multi-octet UTF schemes. Using a BOM with UTF-8 is not recommended.»

| I know that, in Germany, many, many small libraries become closed
| because there is not enough money available to keep up with the
| digital race, and even the greater *do* have problems to stay in
| touch!
|
|People like to complain about the BOM, but no libraries are shutting
|down because of it. Keeping up with the digital race isn't about
|handling two or three bytes at the beginning of a text file, in a way
|that has been defined for two decades.

RFC 2279 doesn't mention the BOM. Looking at my 119,90 German Mark Unicode 3.0 book, there is indeed talk about the UTF-8 BOM.
We have (2.7, page 28) «Conformance to the Unicode Standard does not requires the use of the BOM as such a signature» (typo taken plain; or is it no typo?), and (13.6, page 324) «..never any questions of byte order with UTF-8 text, this sequence can serve as signature for .. this sequence of bytes will be extremely rare at the beginning of text files in other encodings ... for example []Microsoft Windows[]». So this is fine.

It seems UTF-16 and UTF-32 were never meant for data exchange, and the BOM was really a byte order indicator for a consumer that was aware of the encoding but not of the byte order. And UTF-8 got an additional «wohooo - i'm Unicode text» signature tag, though optional. I like the term «extremely rare» sooo much!! :-)

I restart my «rant» UTF-8 filetype thread from the beginning now. I wonder: was the Unicode Consortium really so unconfident? Do i really read «UTF-8 will drown in this evil mess of terroristic charsets, so raise the torch of freedom in this unfriendly environment!»? I have downloaded the 6.0 and 6.1 stuff as a PDF and for free (:-.

If you know how to deal with UTF-8, you can deal with UTF-8. If you don't, no signature ever will help you, no?! If you don't know the charset of some text that comes from nowhere, i.e., no container format with meta-information, no filetype extension with implicit meta-information as is used on Mac OS and DOS, then UTF-8 is still very easily identifiable by itself, due to the way the algorithm is designed. Is it??

Tear down the wall! Tear down the wall! Tear down the wall!

|It's about technologies and
|standards and platforms and formats that change incompatibly every few
|years.

That is of course true. But what to do with these myriads of aggressive nerds that linger in these neon-enlightened four-square-meter boxes, with their poignant hunger for penthouse windows and four-cylinder Mercedes-Benz limousines? I'm asking you. I've seen photos of standards committees in palm-covered bays (CSS2? DOM?
W3M anyway); i've dropped my subscription to regular IETF discussion because i can stand only so many dozens of dinner, hotel-room reservation, laptop-compatible-socket-in-Paris? and whatever threads (the annual ladies steakhouse meeting!). So here you are. These people have deserved it, and no better.

Steven
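Steven's claim that UTF-8 is "easily identifiable by itself due to the way the algorithm is designed" can be sketched concretely: lead bytes and continuation bytes have disjoint bit patterns, so non-ASCII text in a legacy single-byte encoding almost never forms valid UTF-8 sequences. A minimal sketch in Python (the function name is mine, not from the thread):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8.

    Non-ASCII text in a legacy single-byte encoding (e.g. Latin-1)
    almost never forms valid UTF-8 multi-byte sequences, so a strict
    decode is already a strong heuristic -- no signature needed.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# "für" encoded as UTF-8 validates; the same text in Latin-1 does not,
# because the lone 0xFC byte is not a valid UTF-8 sequence.
print(looks_like_utf8("für".encode("utf-8")))    # True
print(looks_like_utf8("für".encode("latin-1")))  # False
```

The caveat, of course, is that pure-ASCII data validates as UTF-8 too (harmlessly, since ASCII is a UTF-8 subset), which is exactly why browser sniffing only becomes interesting once non-ASCII bytes appear.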
Re: UTF-8 BOM (Re: Charset declaration in HTML)
Steven Atreju, Mon, 16 Jul 2012 13:35:04 +0200:

| Doug Ewell d...@ewellic.org wrote: ... And:
|
|   Q: Is the UTF-8 encoding scheme the same irrespective of whether
|   the underlying processor is little endian or big endian?
|   ... Where a BOM is used with UTF-8, it is only used as an ecoding
|   signature to distinguish UTF-8 from other encodings — it has
|   nothing to do with byte order.
|
| Fifteen years ago i think i would have put effort in including the
| BOM after reading this, for complete correctness! I'm pretty sure
| that i really would have done so.

I believe that most people who are conscious about inserting the BOM do so because, without it, Web browsers (with Chrome as the exception, whenever the page contains non-ASCII characters, at least) are unlikely to sniff a UTF-8 encoded page as UTF-8 encoded. So it has nothing to do with complete correctness, but everything to do with complete safety.

| So, given that this page ranks 3 when searching for «utf-8 bom» from
| within Germany i would 1), fix the «ecoding» typo and 2) would change
| this to be less «neutral». The answer to «Q.» is simply «Yes.
| Software should be capable to strip an encoded BOM in UTF, because
| some softish Unicode processors fail to do so when converting in
| between different multioctet UTF schemes. Using BOM with UTF-8 is not
| recommended.»

The current text is much to be preferred. Also, you put the cart before the horse: you place tools over users.

There is one reason to use the UTF-8 BOM which that FAQ point doesn't mention, however: Chrome/Safari/WebKit plus IE treat a UTF-8 encoded text/html page with a BOM differently from a UTF-8 encoded text/html page without a BOM - even when the page is otherwise properly labelled as UTF-8. For the former, the user is not able to override the encoding manually. Whereas for a page without the BOM, the user can override the encoding/shoot themselves (and others) in the foot.
| And UTF-8 got an additional «wohooo - i'm Unicode text» signature
| tag, though optional. I like the term «extremely rare» sooo much!! :-)

What's the problem?

| If you know how to deal with UTF-8, you can deal with UTF-8. If you
| don't, no signature ever will help you, no?!

Do you mean that, instead of the wohoo, one should do more thorough sniffing? I have no insight into how reliable such non-BOM sniffing is. But I take it that it is much less secure than BOM sniffing. Hence it would be risky (?) to deny users the ability to override the encoding of a non-BOM-sniffed page. Which, bottom line, means that the BOM has an advantage.

| If you don't know the charset of some text, that comes from nowhere,
| i.e., no container format with meta-information, no filetype
| extension with implicit meta-information, as is used on Mac OS and
| DOS, then UTF-8 is still very easily identifiable by itself due to
| the way the algorithm is designed. Is it??

As I just said in a reply to Doug: of the Web browsers in current use, Chrome is the very best. This is, I think, because it, to a higher degree than the competition, assumes UTF-8 whenever it finds non-ASCII characters. Clearly, sniffing could improve, at least in the browser world. But is that also true for command line tools?

-- Leif H Silli
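The BOM sniffing discussed here is trivially reliable precisely because it only inspects the first few bytes of the stream. A sketch of such a signature check in Python (the table and function name are mine; the signatures are the standard serializations of U+FEFF):

```python
# Encoding signatures: U+FEFF as serialized by each encoding scheme.
# The UTF-32 patterns must be tested before their UTF-16 prefixes.
SIGNATURES = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if absent."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None

print(sniff_bom("\ufeffhello".encode("utf-8")))  # utf-8
print(sniff_bom(b"hello"))                        # None
```

Content sniffing of the kind Chrome performs (assuming UTF-8 when non-ASCII bytes validate as UTF-8) is heuristic by nature, which is the asymmetry Leif is pointing at.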
RE: pre-HTML5 and the BOM
Leif Halvard Silli xn dash dash mlform dash iua at xn dash dash mlform dash iua dot no wrote:

| So, in a way, the ZWNBSP - or any other non-ASCII character (it would
| in fact be better to use U+200B, to reserve the U+FEFF for its
| designated BOM purpose) could serve as a UTF-8 sniff character not
| only when it is the first character of the document, but also
| elsewhere in documents. And this already happens ...

My normal signature block includes a soft hyphen, U+00AD, which is C2 AD in UTF-8, for test purposes and as a hint that the message is UTF-8. The Web interface from which I'm sending this particular message may or may not preserve this character.

-- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
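Doug's soft-hyphen trick works because U+00AD serializes to the two-byte sequence C2 AD in UTF-8 (a valid multi-byte sequence a sniffer can latch onto), while in Latin-1 it is the single byte AD. A quick check of those byte-level facts:

```python
soft_hyphen = "\u00ad"  # SOFT HYPHEN, invisible in most renderings

# In UTF-8 it becomes the two bytes C2 AD -- a valid multi-byte
# sequence that hints the text is UTF-8:
assert soft_hyphen.encode("utf-8") == b"\xc2\xad"

# If a gateway re-encodes the message as Latin-1, the character
# collapses to the single byte AD and the hint is gone:
assert soft_hyphen.encode("latin-1") == b"\xad"

# Conversely, the UTF-8 bytes mis-read as Latin-1 become two
# characters: LATIN CAPITAL LETTER A WITH CIRCUMFLEX plus the hyphen.
assert b"\xc2\xad".decode("latin-1") == "\u00c2\u00ad"
print("soft hyphen round-trips as expected")
```

Being invisible, the character tests the pipeline without disturbing readers, which is presumably the point of putting it in a signature block.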
Re: pre-HTML5 and the BOM
Doug Ewell, Sat, 14 Jul 2012 15:14:10 -0600:

|| Philippe Verdy wrote: It would break if the only place where to
|| place a BOM is just the start of a file. But as I propose, we allow
|| BOMs to occur anywhere to specify which encoding to use to decode
|| what follows each one, even shell scripts would work [ snip ]
|
| U+FEFF is specifically defined as having the BOM semantic only when
| it appears at the beginning of the file or stream. Everywhere else,
| it can have only the ZWNBSP semantic.

True. That said: of the Web browsers in current use, Chrome is the very best (read: most aggressive) at UTF-8 sniffing. The others hardly sniff anything but the BOM. For example, if you make a UTF-8 encoded page which contains nothing but ASCII - except a U+FEFF character (or any other non-ASCII character) inside the class= attribute of e.g. the html element - then Chrome will sniff it as UTF-8 encoded, whereas IE, WebKit, Opera and Firefox will default to ISO-8858-1/Windows-1252.

So, in a way, the ZWNBSP - or any other non-ASCII character (it would in fact be better to use U+200B, to reserve U+FEFF for its designated BOM purpose) - could serve as a UTF-8 sniff character not only when it is the first character of the document, but also elsewhere in documents. And this already happens ...

(Maybe we see here a reflection of how Chrome is colored by its owner's role as a giant social media content producer/facilitator, whereas the other browser vendors are too much stuck in their back-compatibility mantra.)

-- Leif Halvard Silli
Copyleft
Recently, the Canadian symbols (marque de commerce - trade mark) and (marque déposée - registered trade mark) have been added to Unicode at U+1F16A and U+1F16B. Would it be possible to add the copyleft symbol in the neighbourhood? It looks like a reversed ©. Today, to type it, I use a reversed c with a combining enclosing circle, ↄ⃝, but that's only a loose approximation.
Re: pre-HTML5 and the BOM
On 14/07/12 23:14, Doug Ewell wrote:

| A related question, though, is why some people think the sky will
| fall if a text file contains loose zero-width no-break spaces. U+FEFF
| is the very model of a default ignorable code point.

I don't think the sky will fall, but there still are a few programming languages which, under some specific conditions, may produce an error when they meet a BOM.
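One concrete example of the kind of error mentioned above (Python's json module is my choice of illustration, not one named in the thread): a leading U+FEFF is not valid JSON, so a parser that does not strip it fails, while decoding the bytes with Python's "utf-8-sig" codec removes the signature before the parser ever sees it.

```python
import json

raw = b"\xef\xbb\xbf{\"a\": 1}"  # JSON text with a UTF-8 BOM prepended

# Decoding as plain UTF-8 keeps U+FEFF in the string, and the JSON
# parser rejects it:
try:
    json.loads(raw.decode("utf-8"))
except json.JSONDecodeError as exc:
    print("parse failed:", exc)

# The "utf-8-sig" codec strips a leading BOM if one is present:
print(json.loads(raw.decode("utf-8-sig")))  # {'a': 1}
```

The same pattern (strict parser meets default-ignorable code point) is behind most of the BOM breakage reported with scripting languages and data formats.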
Re: Copyleft
Ↄ⃝ may be a better approximation.

Leo

On Mon, Jul 16, 2012 at 10:47 AM, Jean-François Colson j...@colson.eu wrote:

| Recently, the Canadian symbols (marque de commerce) and (marque
| déposée) have been added to Unicode at U+1F16A and U+1F16B. Would it
| be possible to add the copyleft symbol in the neighbourhood? It looks
| like a reversed ©. Today, to type it, I use a reversed c with a
| combining enclosing circle ↄ⃝, but that's only a loose approximation.
RE: Copyleft
There was a discussion on this list around May 2000 regarding the so-called copyleft symbol. There were concerns that it was not really a symbol with legal standing, like © and ® and ™, but more of a logo, notably one worn on T-shirts by followers of a sort of social movement. Eventually it was more or less decided that the combinations with U+20DD were sufficient.

Obviously, with recent developments in the type of symbols that have been encoded, the objections expressed in 2000 might no longer apply.

—Doug
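The "combinations with U+20DD" that the 2000 discussion settled on are ordinary combining sequences, so they can be produced programmatically. A sketch, using the two reversed-c characters that appear in this thread as base characters:

```python
import unicodedata

# U+20DD COMBINING ENCLOSING CIRCLE drawn around a reversed c:
small = "\u2184\u20dd"    # U+2184 LATIN SMALL LETTER REVERSED C  -> ↄ⃝
capital = "\u2183\u20dd"  # U+2183 ROMAN NUMERAL REVERSED ONE HUNDRED -> Ↄ⃝

# Each approximation is a two-code-point sequence, base + enclosing mark:
assert len(small) == 2
assert unicodedata.combining("\u20dd") == 0  # enclosing marks have class 0
print(small, capital)
```

How well the circle actually encloses the base glyph depends entirely on font support for enclosing marks, which is why both messages call these only "approximations".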
Re: UTF-8 BOM (Re: Charset declaration in HTML)
Steven Atreju wrote:

|   Q: Is the UTF-8 encoding scheme the same irrespective of whether
|   the underlying processor is little endian or big endian?
|   ... Where a BOM is used with UTF-8, it is only used as an ecoding
|   signature to distinguish UTF-8 from other encodings — it has
|   nothing to do with byte order.
| ...
| So, given that this page ranks 3 when searching for «utf-8 bom» from
| within Germany i would 1), fix the «ecoding» typo and 2) would change
| this to be less «neutral». The answer to «Q.» is simply «Yes.
| Software should be capable to strip an encoded BOM in UTF, because
| some softish Unicode processors fail to do so when converting in
| between different multioctet UTF schemes. Using BOM with UTF-8 is not
| recommended.»

That's an answer to a different question. Yes, the UTF-8 encoding scheme is the same irrespective of whether the underlying processor is little-endian or big-endian. The FAQ question you quoted doesn't address whether a BOM is desirable for UTF-8. This is one reason I prefer the term "signature", or "U+FEFF", instead of "BOM" when talking about UTF-8.

| RFC 2279 doesn't note the BOM.

RFC 2279 was superseded by RFC 3629 almost nine years ago. RFC 3629 has a whole section (6) about the U+FEFF signature.

| Looking at my 119,90.- German Mark Unicode 3.0 book,

The Unicode 3.0 book was an excellent resource, but it was released almost 12 years ago. Some of it may not reflect the latest information or recommendations.

| there is indeed talk about the UTF-8 BOM. We have (2.7, page 28)
| «Conformance to the Unicode Standard does not requires the use of the
| BOM as such a signature» (typo taken plain; or is it no typo?), and
| (13.6, page 324) «..never any questions of byte order with UTF-8
| text, this sequence can serve as signature for .. this sequence of
| bytes will be extremely rare at the beginning of text files in other
| encodings ... for example []Microsoft Windows[]». So this is fine.
| It seems UTF-16 and UTF-32 were never meant for data exchange and the
| BOM was really a byte order indicator for a consumer that was aware
| of the encoding but not the byte order.

The part of 13.6 you quoted doesn't make any statement at all about UTF-16 or UTF-32. Back when Unicode was conceived, the 16-bit format was the only one envisioned for data exchange.

-- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
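The byte-order role Steven describes (a consumer that knows the encoding family but not the byte order) is exactly what a generic UTF-16 decoder implements: the producer writes a BOM, the consumer reads it to pick the order. A sketch using Python's codecs:

```python
text = "A\u2113"  # includes a non-ASCII character, U+2113 SCRIPT SMALL L

# The same text serialized in each byte order, without any BOM:
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
assert le != be  # the byte orders really do differ

# A consumer that knows only "UTF-16" relies on the leading BOM
# (FF FE for little-endian, FE FF for big-endian) to pick the order:
assert (b"\xff\xfe" + le).decode("utf-16") == text
assert (b"\xfe\xff" + be).decode("utf-16") == text

# The generic codec writes the BOM itself on encode, so a round trip
# through "utf-16" works without naming a byte order anywhere:
assert text.encode("utf-16").decode("utf-16") == text
print("UTF-16 BOM round trips in both byte orders")
```

This is also why the BE/LE-labelled schemes from the FAQ exist: "utf-16-le" and "utf-16-be" fix the byte order out of band, so no BOM is needed or written.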
Re: pre-HTML5 and the BOM
2012/7/16 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no:

| html element, then Chrome will sniff it as UTF-8 encoded. Whereas IE,
| Webkit, Opera, Firefox will default to ISO-8858-1/Windows-1252.

Actually ISO 885**9**-1. But we've also been told that, since the C1 controls are simply invalid for HTML, even if a site indicates ISO-8859-1 it will be interpreted as Windows-1252 (meaning there will remain a few unassigned byte values that are invalid, causing the HTML parser to try other encodings if they are found - but not UTF-8, which would be invalid there too and could as well raise exceptions). Most of these exceptions, however, will just be remapped to the U+FFFD replacement character.

The support of legacy encodings is now more restrictive in HTML5, which only supports UTF-8 and Windows-1252 plus a few other encodings (ASCII is now considered an alias of Windows-1252, also for compatibility reasons, even if strict US-ASCII resources could be interpreted without changes as UTF-8), and requires explicit encoding declarations (sniffing no longer works for anything other than UTF-8, with its leading BOM interpreted as a data signature and not as a character).
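The ISO-8859-1 vs Windows-1252 remapping Philippe describes is easy to see at the byte level: values in the 0x80-0x9F range are C1 control characters under ISO-8859-1 but mostly printable characters under Windows-1252, and a handful of positions are unassigned even there. A sketch (the sample bytes are my choice for illustration):

```python
data = b"\x93quoted\x94"  # 0x93/0x94 are curly quotes in Windows-1252

# Windows-1252 yields readable punctuation:
assert data.decode("windows-1252") == "\u201cquoted\u201d"

# Strict ISO-8859-1 decodes the same bytes to invisible C1 controls,
# which HTML treats as invalid -- hence the silent remapping:
assert data.decode("latin-1") == "\x93quoted\x94"

# A few positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned in
# Windows-1252; Python's codec raises on them, whereas an HTML parser
# would substitute something like U+FFFD instead of failing:
try:
    b"\x81".decode("windows-1252")
except UnicodeDecodeError:
    print("0x81 is unassigned in Windows-1252")
```

Those leftover unassigned values are the "few unassigned byte values that are invalid" the message refers to.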
Re: pre-HTML5 and the BOM
2012/7/15 David Starner prosfil...@gmail.com:

| /tmp $ echo -n a > file1
| /tmp $ echo b > file2
| /tmp $ cat file1 file2 > file3
| /tmp $ echo ab | diff -q - file3

Once again the problem is the /bin/cat tool, which is used for everything and is agnostic about preserving text semantics. Using another cat that is Unicode-aware would solve the problem. The same applies to diff, which is however designed to work only with text files and should be Unicode-aware by default. Maybe there should be a new standard in Unix for a /ubin/ directory for Unicode-aware tools, insertable into users' PATH environments if needed, allowing migration to newer standards.

| This is expected behavior, and with if statements is probably done by
| thousands of scripts. Add a hidden BOM at the start of file2 and this
| whole thing breaks, as diff is going to find them different. Again,
| diff is an ancient tool that deals with all sorts of text, quasi-text
| and binary matter, and frankly aBOMb is different from ab. If we're
| building a C file with Unix tools, if a char *c = "ab"; suddenly
| becomes char *c = "BOMab"; i don't know by what semantics you expect
| that to work the same. And the very model of a default ignorable code
| point is likely to be the very model of a bug that will hide in plain
| sight.
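David's cat/diff scenario can be reproduced without shell tools at all; the point is that concatenation works on bytes, so a "hidden" BOM in one input survives into the middle of the output, where it is no longer a signature but a plain ZWNBSP. A sketch of the same steps in Python:

```python
file1 = b"a"                         # echo -n a > file1
file2 = "\ufeffb\n".encode("utf-8")  # file2 saved with a hidden BOM
file3 = file1 + file2                # what /bin/cat would produce

# Byte-wise, "aBOMb" is different from "ab", just as diff reports:
assert file3 != b"ab\n"
assert file3 == b"a\xef\xbb\xbfb\n"

# Decoding does not make the problem go away: the U+FEFF is now in
# the middle of the text, where it counts as ZWNBSP, not as a BOM,
# so even a signature-stripping decoder keeps it:
decoded = file3.decode("utf-8-sig")
assert decoded == "a\ufeffb\n"
print(repr(decoded))  # the invisible character hides in plain sight
```

This is exactly the "bug that will hide in plain sight": the two texts render identically while comparing, hashing, and compiling differently.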