Re: BOM's at Beginning of Web Pages?
Roozbeh Pournader wrote, No, just let's recommend explicitly against BOM in UTF-8 instead of politely telling that it's OK to put a BOM only because somebody liked the idea and released some software doing that. Well, should we CC William? ;) I wish that the copy had gone to William rather than the Unicode List. The bit about P14 tags was intended to be an off-list jest. Ooops! It's been one of *those* weeks... Best regards, James Kass
Re: BOM's at Beginning of Web Pages?
Roozbeh Pournader wrote, According to the specs, it's illegal, and it doesn't hurt to fix it. So why shouldn't one? The lack of the BOM in the 'white space' section of the specs may just be an oversight. Since plain text files can have any kind of file extension, and the *.TXT extension historically covers many different code pages, some people do find the BOM helpful. It enables some editors to load a file correctly the first time, without the user having to reset the encoding manually and reload. You're right about the BOM being irrelevant to the browser, since the HTML encoding is supposed to be declared as mark-up in the HTML header. But, at least on Windows platforms, when the user (or author) views the source, the default editor (usually Notepad) seems to require that the BOM be present. Notepad also (AFAICT) automatically inserts the BOM when saving a file as UTF-8. The non-technical user may not even be aware of this. I've found the BOM handy, but could probably live without it on any of my web pages. Especially if it's going to display as a Euro symbol on some systems... Best regards, James Kass
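As a rough illustration of the editor behaviour James describes, the following Python sketch sniffs a leading BOM to choose an encoding before loading a file. The names `sniff_bom` and `load_text` and the `cp1252` fallback are hypothetical, chosen only for this example; Notepad and other editors may well use different heuristics.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes, fallback: str = "cp1252"):
    """Return (encoding, bom_length) guessed from a leading BOM.

    The fallback code page is an assumption made for this sketch.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return fallback, 0

def load_text(data: bytes) -> str:
    """Decode the bytes with the sniffed encoding, skipping the BOM itself."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding)
```

Without a signature, the loader has nothing to go on and falls back to a guess, which is the situation that forces a manual "reset the encoding and reload".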
Re: BOM's at Beginning of Web Pages?
At 19:10 -0800 2003-02-15, Michael (michka) Kaplan wrote: Of course if I had a penny for every byte that has been used discussing these three bytes sometimes found at the beginning of a UTF-8 document, I would not be working this weekend; I'd be somewhere really warm and sunny. My point was that its being used on the Unicode home page mucks up the home page display and so it needs to be deleted from that page. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: BOM's at Beginning of Web Pages?
Well, since the whole web could be full of such pages, fixing the browser would be a better long-term strategy. In the short term, the best tool for quick fixes to HTML pages *is* Notepad, which is what is being blamed for causing the problem. :-) Has anyone worked to be positive that this is the cause of the errant euro? With two simple UTF-8 encoded pages (one with and one without the BOM)? I still have a hard time seeing how a BOM can cause a euro in any way other than consulting fees. MichKa - Original Message - From: Michael Everson [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Sunday, February 16, 2003 11:20 AM Subject: Re: BOM's at Beginning of Web Pages?
Re: BOM's at Beginning of Web Pages? Mac IE's Euro
Has anyone worked to be positive that this is the cause of the errant euro? With two simple UTF-8 encoded pages (one with and one without the BOM)? I still have a hard time seeing how a BOM can cause a euro in any way other than consulting fees. Mac OS X IE 5.2 is the only browser that does this (display the UTF-8 bytes for U+FEFF as a Euro sign). It would indeed be interesting to know why. You can input U+FEFF all by itself in a document and open it with this browser and display a Euro. It's not exactly the same Euro as you get with U+20AC. Weaker, with an extra tail at the top and equal crossbars. Perhaps this indicates a mis-encoded font on the system? But why would no other browser use it? For anyone interested I've put a photo of the two (BOM on top) at: http://homepage.mac.com/thgewecke/bomeuro.jpg
Re: BOM's at Beginning of Web Pages?
The W3C validator compares the document contents with the DTD (i.e., validates it) but does not check for compatibility with the HTML specifications. That is, it does not do lint checking. So do not use the validator to prove or disprove that a document conforms to HTML syntax or specification. tex Roozbeh Pournader wrote: On Sun, 16 Feb 2003 [EMAIL PROTECTED] wrote: The W3C Mark Up Validation Service at: http://validator.w3.org/ ...validates a UTF-8 web page with a BOM as valid HTML 4.01, suggesting that the BOM is not at all illegal. Well, I found that, but mischievously tried to hide the fact ;) According to the specs, it's illegal, and it doesn't hurt to fix it. So why shouldn't one? roozbeh -- Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED] Xen Master http://www.i18nGuy.com XenCraft http://www.XenCraft.com Making e-Business Work Around the World
Re: Everson Mono
On Saturday, February 15, 2003, at 07:22 PM, [EMAIL PROTECTED] wrote: You could pick up the old TTFDUMP.EXE program from Microsoft Typography developer's web pages at http://www.microsoft.com/typography/creators.htm This utility can dump any or all of the tables in a TTF/OTF into a plain text file which is human-readable. Once the cmap table information has been dumped, you can import the text into your process and process it. (It only works on Plane Zero fonts.) And you can get ftxdumperfuser at Apple's site http://developer.apple.com/fonts, which works on Mac OS X and can handle the astral planes. == John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://www.tejat.net/
Re: BOM's at Beginning of Web Pages?
On Sun, 16 Feb 2003 [EMAIL PROTECTED] wrote: The lack of the BOM in the 'white space' section of the specs may just be an oversight. I like the idea. This looks practical to me: amending HTML 4 to consider this. Since plain text files can have any kind of file extension, and the *.TXT extension historically covers many different code pages, some people do find the BOM helpful. [...] And some people find it annoying and dangerous. A BOM-ed UTF-8 file breaks the Unix text file model to some degree. I can post a link if anyone's interested. I've found the BOM handy, but could probably live without it on any of my web pages. Especially if it's going to display as a Euro symbol on some systems... I'll call it irony. It breaks a certain version of MS Internet Explorer on the Mac, and I've also seen it break MS FrontPage 2000 on a Windows 2000 machine (FrontPage had not yet seen the UTF-8 declaration in the HTML file itself; it saw the three non-ASCII bytes and automatically treated the file as CP1252), the same machine that was used to edit the HTML as a text file (in Notepad, of course). roozbeh
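A hedged illustration of the two failure modes Roozbeh mentions. The specific cases are assumptions picked for this sketch (the link he offers is not reproduced here): tools that inspect the first bytes of a file, or that concatenate files, know nothing about the signature, and a tool that guesses CP1252 sees three ordinary Latin characters.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# A "#!" check on the first two bytes fails once a signature is prepended.
script = BOM + b"#!/bin/sh\necho hello\n"
has_shebang = script.startswith(b"#!")

# Concatenating two BOM-ed files keeps both signatures; the plain "utf-8"
# codec decodes each of them as a U+FEFF character inside the text.
part1 = BOM + b"first\n"
part2 = BOM + b"second\n"
joined = (part1 + part2).decode("utf-8")
feff_count = joined.count("\ufeff")

# The misreading described for FrontPage: the same bytes taken as CP1252.
as_cp1252 = BOM.decode("cp1252")
```

Here `has_shebang` is `False`, `feff_count` is 2, and `as_cp1252` is the three-character string U+00EF U+00BB U+00BF.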
Re: BOM's at Beginning of Web Pages? Mac IE's Euro
Tom Gewecke tom at bluesky dot org wrote: You can input U+FEFF all by itself in a document and open it with this browser and display a Euro. It's not exactly the same Euro as you get with U+20AC. Weaker, with an extra tail at the top and equal crossbars. Perhaps this indicates a mis-encoded font on the system? But why would no other browser use it? For anyone interested I've put a photo of the two (BOM on top) at: http://homepage.mac.com/thgewecke/bomeuro.jpg The first looks like Courier New, probably a standard font for plain-text files. A file containing nothing but U+FEFF would be identified as plain text. The second looks like Verdana, probably a standard font for HTML files. The mystery remains as to why U+FEFF (or the bytes 0xEF 0xBB 0xBF, however interpreted) would be displayed as a Euro sign. U+20AC EURO SIGN is mapped to 0xDB in most Mac character sets and 0x80 in most Windows code pages. -Doug Ewell Fullerton, California
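Doug's byte values can be checked with a short Python sketch. It relies on Python's built-in `mac_roman` and `cp1252` codecs as stand-ins for "most Mac character sets" and "most Windows code pages", which is an assumption of this example rather than anything stated in the thread.

```python
bom_bytes = "\ufeff".encode("utf-8")       # the UTF-8 signature
euro_mac = "\u20ac".encode("mac_roman")    # euro sign in Python's Mac Roman table
euro_win = "\u20ac".encode("cp1252")       # euro sign in Windows-1252

# Reinterpret the signature bytes under a few single-byte encodings and
# check whether any of those readings contains a euro sign.
readings = {
    name: bom_bytes.decode(name)
    for name in ("cp1252", "mac_roman", "latin-1")
}
euro_in_any = any("\u20ac" in text for text in readings.values())
```

With these codecs `euro_in_any` is `False`: neither 0xDB nor 0x80 occurs in EF BB BF, so a straightforward misreading of the signature under these code pages does not explain the Euro sign.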
Re: BOM's at Beginning of Web Pages? Mac IE's Euro
On Sun, 16 Feb 2003, Doug Ewell wrote: The mystery remains as to why U+FEFF (or the bytes 0xEF 0xBB 0xBF, however interpreted) would be displayed as a Euro sign. Autodetection as some other codepage? roozbeh
Re: BOM's at Beginning of Web Pages?
Roozbeh Pournader roozbeh at sharif dot edu wrote: Found it! It's forbidden to start a HTML 4.0 page with a UTF-8 BOM. Proof: ... That's all. So the only characters that are allowed in a HTML 4.0 web page before the HTML header, are U+0009, U+000A, U+000C, U+000D, U+0020, and U+200B. QED. I can't argue with the excellent gumshoe work Roozbeh did. But it does seem peculiar, as Michka observed, that ZWSP should be a legal white space character for this purpose but ZWNBSP should not; and as James noted, it may have been an oversight. (I would add to Michka's comment that it seems equally bizarre to allow U+000C FORM FEED at the start of an HTML file but not U+FEFF.) PS: UTF-16 is an exception to that, since the BOM is not part of the document and should be removed for processing. If this is true -- that U+FEFF is a kind of meta-character that doesn't really belong to the text per se -- then it should be equally true for UTF-8, whether its role is as a true Byte Order Mark (needed in UTF-16 and UTF-32 but not UTF-8) or as a signature (potentially useful in all Unicode CES's). Only in its evil-twin role as a zero-width no-break space is it truly part of the text, in which case the previous comments about white-space characters apply. Michael (michka) Kaplan michka at trigeminal dot com wrote: Rather than treating HTML like the SQL standard (lofty goals that no one company completely supports because it would be insane to do it!) they can bend to the actual usage out there and just move on, right? Michka is probably right that Notepad is one of the more popular HTML editors out there, but even though I'm sure he didn't mean it this way, I would prefer not to say anything that can be twisted into "the HTML specification should be changed to match the way Microsoft does things." That is bound to bring all the Microsoft haters out of the woodwork.
Rather, I would stress the inconsistency of allowing U+FEFF at the beginning of an HTML file encoded in UTF-16 but not in one encoded in the much more common UTF-8. Of course if I had a penny for every byte that has been used discussing these three bytes sometimes found at the beginning of a UTF-8 document, I would not be working this weekend; I'd be somewhere really warm and sunny. There is so much disagreement, confusion, and misunderstanding surrounding these three little bytes that I feel the discussion is completely warranted. (At least nobody can ever claim it's off topic!) Roozbeh responded: Well, that needs researching into what UTF-8 is in W3C and HTML 4.0 terms: ... RFC 2279. A copy can be found at http://www.ietf.org/rfc/rfc2279.txt, or any other place you like and search for FEFF, BOM, ZERO WIDTH NO-BREAK SPACE, or the sequence EF BB BF there. Nothing can be found. RFC 2279 defines and describes the technical structure of UTF-8. Usage issues surrounding U+FEFF as either a signature or a ZWNBSP would have been out of scope. Most Unicode and WG2 documents do not discuss the BOM either. Michka wrote back: If the problem was indeed due to a BOM then the answer *is* to fix the browser. Windows 2000 and XP have shipped onto a gazillion machines and a lot of people make quick spot changes to HTML pages in notepad. The BOM is here and any browser that cannot handle not showing either a BOM or a ZWNBSP can be classed as a dumb one. Certainly, Microsoft is in a position to fix their own browser to make it tolerant of the BOM. If they ship a quick and handy editor that prepends a BOM to UTF-8 text files (which I think is a good idea, for the reasons James cited), and if people are using that editor for HTML files encoded in UTF-8, then their browser should behave sensibly when handed an HTML file with a leading BOM. Messing up the layout at the top of a page is not sensible, and displaying a Euro sign is just plain weird.
But note that so far, all of the weirdness seems to be with IE 5.2 for Macintosh. I've never seen any of this with IE 5.5 or 6.0 for Windows. (Indeed, my Web pages all used to begin with BOMs and I never noticed a problem, but I removed the BOMs when Michael Everson told me they displayed badly on his Mac.) So it seems only the Mac version of IE needs fixing. I don't see anything wrong with IE allowing a BOM at the start of UTF-8-encoded HTML files, even if it is not expressly allowed by the HTML specification. Browser vendors have certainly gone farther than that to extend the standard in the past; remember Netscape's notorious blink element? But I also think the HTML Working Group should consider explicitly allowing the BOM at the start of HTML files encoded in UTF-8. (Note that it is explicitly allowed in XML.) -Doug Ewell Fullerton, California
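A tolerant consumer of the kind Doug argues for can be sketched in a few lines of Python. The standard `utf-8-sig` codec drops one leading signature if it is present. The helper names `decode_html` and `leading_junk` are hypothetical, and the allowed set is simply the list of characters Roozbeh gave earlier in the thread, not an independent reading of the HTML specification.

```python
# Characters Roozbeh listed as permitted before the HTML header.
ALLOWED_BEFORE_MARKUP = {
    "\u0009", "\u000a", "\u000c", "\u000d", "\u0020", "\u200b",
}

def decode_html(data: bytes) -> str:
    """Decode UTF-8 bytes, silently dropping a single leading BOM."""
    return data.decode("utf-8-sig")

def leading_junk(text: str) -> str:
    """Return the characters before the first '<' that are not in the allowed set."""
    head = text.split("<", 1)[0]
    return "".join(ch for ch in head if ch not in ALLOWED_BEFORE_MARKUP)
```

Decoding a BOM-ed page with plain `utf-8` leaves U+FEFF in front of the markup, so `leading_junk` reports it; decoding with `decode_html` removes it before the check runs, which treats the signature as not belonging to the text.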
Re: BOM's at Beginning of Web Pages? Mac IE's Euro
Roozbeh Pournader roozbeh at sharif dot edu wrote: The mystery remains as to why U+FEFF (or the bytes 0xEF 0xBB 0xBF, however interpreted) would be displayed as a Euro sign. Autodetection as some other codepage? The Unicode home page includes the following line, right where it should be, in the head section: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> Any User Agent that takes a page properly marked as UTF-8, as above, and still tries to autodetect a local code page, is badly misguided. How would it handle a real UTF-8-encoded euro sign (0xE2 0x82 0xAC)? -Doug Ewell Fullerton, California
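Doug's rhetorical question has a concrete answer for any given code page. The sketch below assumes, purely for illustration, that the "local code page" is Windows-1252; the thread does not say which code page Mac IE would pick.

```python
euro_utf8 = "\u20ac".encode("utf-8")    # the real UTF-8 euro sign
misread = euro_utf8.decode("cp1252")    # the same bytes read as Windows-1252
```

`euro_utf8` is the byte sequence E2 82 AC that Doug cites, and `misread` is a three-character string (U+00E2, U+201A, U+00AC) rather than a single euro sign.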
Re: BOM's at Beginning of Web Pages? Mac IE's Euro
On Sun, 16 Feb 2003, Doug Ewell wrote: The Unicode home page includes the following line, right where it should be, in the head section: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> Any User Agent that takes a page properly marked as UTF-8, as above, and still tries to autodetect a local code page, is badly misguided. How would it handle a real UTF-8-encoded euro sign (0xE2 0x82 0xAC)? AFAICR, there is supposed to be no single non-ASCII character before that meta tag. I really don't like to search the specs again, but I'm sure I saw it somewhere. The HTML renderer sees those characters and thinks the document has already started (since the html, head and body tags are not mandatory in HTML 4 Transitional). So it goes into autodetection mode. The same situation happens with MS FrontPage 2000 (but I've already explained that). roozbeh
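Roozbeh's explanation suggests a simple check an author could run on a page: are there any non-ASCII bytes before the charset declaration? This is a minimal sketch, assuming the declaration can be located with a plain byte search for `charset=`. The function name is made up for this example, and the rule itself is Roozbeh's recollection of the specification, not something verified here.

```python
def non_ascii_before_charset(data: bytes) -> bytes:
    """Return the non-ASCII bytes that occur before the first 'charset='.

    If no declaration is found, the whole input is scanned instead.
    """
    pos = data.lower().find(b"charset=")
    prefix = data if pos < 0 else data[:pos]
    return bytes(b for b in prefix if b >= 0x80)
```

For a page that begins with a UTF-8 signature, the function returns the three bytes EF BB BF; for the same page without the signature, it returns an empty byte string.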