Re: BOM's at Beginning of Web Pages?

2003-02-16 Thread jameskass
Roozbeh Pournader wrote,

 No, just let's recommend explicitly against BOM in UTF-8 instead of 
 politely telling that it's OK to put a BOM only because somebody liked the 
 idea and released some software doing that.
 
 Well, should we CC William? ;)

I wish that the copy had gone to William rather than the Unicode
List.  The bit about P14 tags was intended to be an off-list jest.  
Ooops!  It's been one of *those* weeks...

Best regards,

James Kass





Re: BOM's at Beginning of Web Pages?

2003-02-16 Thread jameskass
Roozbeh Pournader wrote,

 According to the specs, it's illegal, and it doesn't hurt to fix it. So 
 why shouldn't one?

The lack of the BOM in the 'white space' section of the specs may
just be an oversight.

Since plain text files can have any kind of file extension, and the
*.TXT extension historically covers many different code pages, some
people do find the BOM helpful.  It enables some of the editors to
correctly load a file the first time without having to manually
reset the encoding format and reload.
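
As a rough illustration of the kind of detection described above, an editor could look for a signature before choosing a decoder. This is only a minimal Python sketch: the name `sniff_encoding` and the `cp1252` fallback are assumptions made for the example, not the behaviour of any particular editor.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Return a codec name based on a leading BOM, or the fallback if none is found."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name
    return fallback
```

Without a signature the function can only guess, which is the situation a *.TXT file in an unknown code page puts the editor in.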

You're right about the BOM being irrelevant to the browser, since
the HTML encoding is supposed to be declared as mark-up in the
HTML header.  But, at least on Win platforms, when the user (or 
author) views the source, the default editor (usually Notepad) 
seems to require that the BOM be present.  NotePad also (AFAICT) 
automatically inserts the BOM when file-saving as UTF-8.  
The non-technical user may not even be aware of this.

I've found the BOM handy, but could probably live without it on
any of my web pages.  Especially if it's going to display as a
Euro symbol on some systems...

Best regards,

James Kass




Re: BOM's at Beginning of Web Pages?

2003-02-16 Thread Michael Everson
At 19:10 -0800 2003-02-15, Michael (michka) Kaplan wrote:


Of course if I had a penny for every byte that has been used 
discussing these three bytes sometimes found at the beginning of a 
UTF-8 document, I would not be working this weekend; I'd be 
somewhere really warm and sunny.

My point was that its being used on the Unicode home page mucks up 
the home page display and so it needs to be deleted from that page.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: BOM's at Beginning of Web Pages?

2003-02-16 Thread Michael (michka) Kaplan
Well, since the whole web could be full of such pages, fixing the
browser would be the better long-term strategy. In the short term,
the best tool for quick fixes to HTML pages *is* Notepad, which is
what is being blamed for causing the problem. :-)

Has anyone worked to be positive that this is the cause of the errant
euro? With two simple UTF-8 encoded pages (one with and one without the
BOM) ? I still have a hard time seeing how a BOM can cause a euro in
any way other than consulting fees.

MichKa








Re: BOM's at Beginning of Web Pages? Mac IE's Euro

2003-02-16 Thread Tom Gewecke

Has anyone worked to be positive that this is the cause of the errant
euro? With two simple UTF-8 encoded pages (one with and one without the
BOM) ? I still have a hard time seeing how a BOM can cause a euro in
any way other than consulting fees.

Mac OS X IE 5.2 is the only browser that does this (display the UTF-8 bytes
for U+FEFF as a Euro sign). It would indeed be interesting to know why.

You can input U+FEFF all by itself in a document and open it with this
browser and display a Euro. It's not exactly the same Euro as you get with
U+20AC.  Weaker, with an extra tail at the top and equal crossbars.
Perhaps this indicates a mis-encoded font on the system?  But why would no
other browser use it?  For anyone interested I've put a photo of the two
(BOM on top) at:

http://homepage.mac.com/thgewecke/bomeuro.jpg






Re: BOM's at Beginning of Web Pages?

2003-02-16 Thread Tex Texin
The W3C validator compares the document contents with the DTD (i.e., it
validates the document) but does not check for compatibility with the HTML
specifications. That is, it does not do lint checking.

So do not use the validator to prove or disprove that a document conforms to
HTML syntax or the specification.

tex

Roozbeh Pournader wrote:
 
 On Sun, 16 Feb 2003 [EMAIL PROTECTED] wrote:
 
  The W3C Mark Up Validation Service at:
  http://validator.w3.org/
 
  ...validates a UTF-8 web page with a BOM as valid HTML 4.01,
  suggesting that the BOM is not at all illegal.
 
 Well, I found that, but mischievously tried to hide the fact ;)
 
 According to the specs, it's illegal, and it doesn't hurt to fix it. So
 why shouldn't one?
 
 roozbeh

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Everson Mono

2003-02-16 Thread John H. Jenkins

On Saturday, February 15, 2003, at 07:22 PM, [EMAIL PROTECTED] wrote:



You could pick up the old TTFDUMP.EXE program from Microsoft Typography
developer's web pages at
http://www.microsoft.com/typography/creators.htm
This utility can dump any or all of the tables in a TTF/OTF into
a plain text file which is human-readable.  Once the cmap table
information has been dumped, you can import the text into your
process and process it.  (It only works on Plane Zero fonts.)



And you can get ftxdumperfuser at Apple's site 
http://developer.apple.com/fonts, which works on Mac OS X and can 
handle the astral planes.
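
For readers unsure what separates "Plane Zero" from the astral planes: the plane is simply the code point divided by 0x10000. A minimal Python sketch follows; the helper name `plane` is made up for illustration and has nothing to do with either dump tool.

```python
def plane(cp: int) -> int:
    """Return the Unicode plane (0 through 16) that contains a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return cp >> 16
```

A font whose cmap covers only code points with `plane(cp) == 0` is a Plane Zero font in the sense used above.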


==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.tejat.net/




Re: BOM's at Beginning of Web Pages?

2003-02-16 Thread Roozbeh Pournader
On Sun, 16 Feb 2003 [EMAIL PROTECTED] wrote:

 The lack of the BOM in the 'white space' section of the specs may
 just be an oversight.

I like the idea. Amending HTML 4 to take this into account looks practical
to me.

 Since plain text files can have any kind of file extension, and the
 *.TXT extension historically covers many different code pages, some
 people do find the BOM helpful. [...]

And some people find it annoying and dangerous. A BOM-ed UTF-8 file breaks
the Unix text file model to some degree. I can post a link if anyone's
interested.
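
Two commonly cited ways a signature gets in the way on Unix can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any specific tool: `has_shebang` and `concatenate` are hypothetical helpers that imitate a "#!" check and a byte-for-byte `cat`.

```python
BOM = b"\xef\xbb\xbf"


def has_shebang(data: bytes) -> bool:
    """Imitate the usual check: the file must begin with the two bytes '#!'."""
    return data.startswith(b"#!")


def concatenate(*parts: bytes) -> bytes:
    """Join files byte for byte, the way `cat a b` would."""
    return b"".join(parts)


script = b"#!/bin/sh\necho hello\n"
print(has_shebang(script))        # plain script
print(has_shebang(BOM + script))  # same script saved with a signature

# The second file's signature ends up in the middle of the joined text.
joined = concatenate(BOM + b"a\n", BOM + b"b\n")
print(repr(joined.decode("utf-8-sig")))
```

The second case is the subtler one: a decoder that strips a leading signature still leaves a stray U+FEFF wherever two signed files were joined.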

 I've found the BOM handy, but could probably live without it on
 any of my web pages.  Especially if it's going to display as a
 Euro symbol on some systems...

I'll call it irony. It breaks a certain version of MS Internet Explorer on
the Mac, and I've also seen it break MS FrontPage 2000 on a Windows 2000
machine (FrontPage had not yet seen the UTF-8 declaration in the HTML file
itself, saw the three non-ASCII bytes, and automatically treated the file
as CP1252), the same machine that was used to edit the HTML as a text file
(in Notepad, of course).
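
What a CP1252 reading of the signature looks like can be reproduced with Python's standard codecs. This is only a sketch of the misinterpretation; the sample page is invented, and nothing here claims to show how FrontPage itself decides.

```python
page = "\ufeff<html><head><title>Test</title></head></html>".encode("utf-8")

# The same bytes decoded two ways: as declared (UTF-8) and as guessed (CP1252).
as_utf8 = page.decode("utf-8")
as_cp1252 = page.decode("cp1252")

print(repr(as_utf8[:7]))    # U+FEFF followed by '<html>'
print(repr(as_cp1252[:9]))  # three stray accented/punctuation characters, then '<html>'
```

Under CP1252 the bytes EF BB BF become U+00EF, U+00BB and U+00BF, which then get treated as visible document text.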

roozbeh





Re: BOM's at Beginning of Web Pages? Mac IE's Euro

2003-02-16 Thread Doug Ewell
Tom Gewecke <tom at bluesky dot org> wrote:

 You can input U+FEFF all by itself in a document and open it with this
 browser and display a Euro. It's not exactly the same Euro as you get
 with U+20AC.  Weaker, with an extra tail at the top and equal
 crossbars.  Perhaps this indicates a mis-encoded font on the system?
 But why would no other browser use it?  For anyone interested I've put
 a photo of the two (BOM on top) at:

 http://homepage.mac.com/thgewecke/bomeuro.jpg

The first looks like Courier New, probably a standard font for
plain-text files.  A file containing nothing but U+FEFF would be
identified as plain text.

The second looks like Verdana, probably a standard font for HTML files.

The mystery remains as to why U+FEFF (or the bytes 0xEF 0xBB 0xBF,
however interpreted) would be displayed as a Euro sign.  U+20AC EURO
SIGN is mapped to 0xDB in most Mac character sets and 0x80 in most
Windows code pages.
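
The byte values mentioned here can be checked with Python's standard codecs. A minimal sketch, assuming `cp1252` and `mac_roman` are fair stand-ins for "most Windows code pages" and "most Mac character sets":

```python
EURO = "\u20ac"
BOM_BYTES = "\ufeff".encode("utf-8")  # the three signature bytes

# How the euro sign itself is encoded in each character set.
for codec in ("utf-8", "cp1252", "mac_roman"):
    print(codec, EURO.encode(codec).hex())

# Whether any of the three signature bytes decodes to a euro sign.
for codec in ("cp1252", "mac_roman"):
    print(codec, EURO in BOM_BYTES.decode(codec))
```

Neither legacy reading of EF BB BF produces a euro sign, so a plain code page mix-up of this kind would not by itself explain the glyph.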

-Doug Ewell
 Fullerton, California





Re: BOM's at Beginning of Web Pages? Mac IE's Euro

2003-02-16 Thread Roozbeh Pournader
On Sun, 16 Feb 2003, Doug Ewell wrote:

 The mystery remains as to why U+FEFF (or the bytes 0xEF 0xBB 0xBF,
 however interpreted) would be displayed as a Euro sign.

Autodetection as some other codepage?

roozbeh





Re: BOM's at Beginning of Web Pages?

2003-02-16 Thread Doug Ewell
Roozbeh Pournader <roozbeh at sharif dot edu> wrote:

 Found it! It's forbidden to start an HTML 4.0 page with a UTF-8 BOM.
 Proof:
 ...
 That's all. So the only characters that are allowed in an HTML 4.0 web
 page before the HTML header, are U+0009, U+000A, U+000C, U+000D,
 U+0020, and U+200B. QED.

I can't argue with the excellent gumshoe work Roozbeh did.  But it does
seem peculiar, as Michka observed, that ZWSP should be a legal white
space character for this purpose but ZWNBSP should not; and as James
noted, it may have been an oversight.  (I would add to Michka's comment
that it seems equally bizarre to allow U+000C FORM FEED at the start of
an HTML file but not U+FEFF.)
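
The rule as Roozbeh reads it can be written down as a small check. This Python sketch takes his list of six characters at face value; the set name and the helper `leading_run_is_white_space` are made up for the example and are not taken from the HTML specification.

```python
# The six characters allowed before the markup starts, per the reading quoted above.
HTML4_WHITE_SPACE = {"\u0009", "\u000a", "\u000c", "\u000d", "\u0020", "\u200b"}


def leading_run_is_white_space(text: str) -> bool:
    """True if everything before the first '<' is in the allowed set."""
    prefix = text.split("<", 1)[0]
    return all(ch in HTML4_WHITE_SPACE for ch in prefix)


print(leading_run_is_white_space("\u200b<html>"))  # ZWSP before the markup
print(leading_run_is_white_space("\u000c<html>"))  # FORM FEED before the markup
print(leading_run_is_white_space("\ufeff<html>"))  # ZWNBSP / BOM before the markup
```

The first two pass and the third does not, which is exactly the asymmetry being questioned.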

 PS: UTF-16 is an exception to that, since the BOM is not part of the
 document and should be removed for processing.

If this is true -- that U+FEFF is a kind of meta-character that doesn't
really belong to the text per se -- then it should be equally true for
UTF-8, whether its role is as a true Byte Order Mark (needed in UTF-16
and UTF-32 but not UTF-8) or as a signature (potentially useful in all
Unicode CES's).  Only in its evil-twin role as a zero-width no-break
space is it truly part of the text, in which case the previous
discussion of white-space characters applies.
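
Python's standard codecs happen to show the same asymmetry, which makes for a convenient illustration. This is a sketch of one library's behaviour, not a statement about what HTML or Unicode requires.

```python
text = "<html>"

utf16 = text.encode("utf-16")            # this codec writes a BOM
utf8_signed = text.encode("utf-8-sig")   # UTF-8 preceded by EF BB BF

print(repr(utf16.decode("utf-16")))           # BOM consumed by the decoder
print(repr(utf8_signed.decode("utf-8")))      # U+FEFF stays in the text
print(repr(utf8_signed.decode("utf-8-sig")))  # signature removed
```

The plain `utf-8` decoder treats the leading U+FEFF as a character of the text; only the separate `utf-8-sig` codec treats it as a signature.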

Michael (michka) Kaplan <michka at trigeminal dot com> wrote:

 Rather than treating HTML like the SQL standard (lofty goals that no
 one company completely supports because it would be insane to do it!)
 they can bend to the actual usage out there and just move on, right?

Michka is probably right that Notepad is one of the more popular HTML
editors out there, but even though I'm sure he didn't mean it this way,
I would prefer not to say anything that can be twisted into "the HTML
specification should be changed to match the way Microsoft does things."
That is bound to bring all the Microsoft haters out of the woodwork.
Rather, I would stress the inconsistency of allowing U+FEFF at the
beginning of an HTML file encoded in UTF-16 but not in one encoded in
the much more common UTF-8.

 Of course if I had a penny for every byte that has been used
 discussing these three bytes sometimes found at the beginning of a
 UTF-8 document, I would not be working this weekend; I'd be somewhere
 really warm and sunny.

There is so much disagreement, confusion, and misunderstanding
surrounding these three little bytes that I feel the discussion is
completely warranted.  (At least nobody can ever claim it's off topic!)

Roozbeh responded:

 Well, that needs researching into what UTF-8 is in W3C and HTML 4.0
 terms:
 ...
 RFC 2279. A copy can be found at
 http://www.ietf.org/rfc/rfc2279.txt, or any other place you like and
 search for FEFF, BOM, ZERO WIDTH NO-BREAK SPACE, or the sequence EF
 BB BF there. Nothing can be found.

RFC 2279 defines and describes the technical structure of UTF-8.  Usage
issues surrounding U+FEFF as either a signature or a ZWNBSP would have
been out of scope.  Most Unicode and WG2 documents do not discuss the
BOM either.
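
The technical structure in question is enough to derive the three bytes by hand. A minimal Python sketch of the three-byte pattern (1110xxxx 10xxxxxx 10xxxxxx) follows; the helper name `encode_three_byte` is invented here, and the sketch deliberately ignores other sequence lengths and the surrogate range.

```python
def encode_three_byte(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as a three-byte UTF-8 sequence."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("only the three-byte range is handled in this sketch")
    return bytes([
        0xE0 | (cp >> 12),          # leading byte carries the top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # continuation byte, middle 6 bits
        0x80 | (cp & 0x3F),         # continuation byte, low 6 bits
    ])


print(encode_three_byte(0xFEFF).hex())  # the signature bytes
print(encode_three_byte(0x20AC).hex())  # the euro sign
```

Nothing in the encoding itself singles out U+FEFF; it is encoded like any other character in that range.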

Michka wrote back:

 If the problem was indeed due to a BOM then the answer *is* to fix the
 browser. Windows 2000 and XP have shipped onto a gazillion machines
 and a lot of people make quick spot changes to HTML pages in notepad.
 The BOM is here and any browser that cannot handle not showing either
 a BOM or a ZWNBSP can be classed as a dumb one.

Certainly, Microsoft is in a position to fix their own browser to make
it tolerant of the BOM.  If they ship a quick and handy editor that
prepends a BOM to UTF-8 text files (which I think is a good idea, for
the reasons James cited), and if people are using that editor for HTML
files encoded in UTF-8, then their browser should behave sensibly when
handed an HTML file with a leading BOM.  Messing up the layout at the
top of a page is not sensible, and displaying a Euro sign is just plain
weird.
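
"Behaving sensibly" could be as simple as discarding one leading signature before the bytes reach the parser. The following Python sketch shows that idea only; `strip_utf8_signature` is a hypothetical helper and does not describe how any browser is actually implemented.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Drop one leading EF BB BF, if present; leave everything else untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

The same helper would serve an author who wants to clean existing pages rather than wait for browsers to change.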

But note that so far, all of the weirdness seems to be with IE 5.2 for
Macintosh.  I've never seen any of this with IE 5.5 or 6.0 for Windows.
(Indeed, my Web pages all used to begin with BOMs and I never noticed a
problem, but I removed the BOMs when Michael Everson told me they
displayed badly on his Mac.)  So it seems only the Mac version of IE
needs fixing.

I don't see anything wrong with IE allowing a BOM at the start of
UTF-8-encoded HTML files, even if it is not expressly allowed by the
HTML specification.  Browser vendors have certainly gone farther than
that to extend the standard in the past; remember Netscape's notorious
<blink> element?  But I also think the HTML Working Group should
consider explicitly allowing the BOM at the start of HTML files encoded
in UTF-8.  (Note that it is explicitly allowed in XML.)

-Doug Ewell
 Fullerton, California





Re: BOM's at Beginning of Web Pages? Mac IE's Euro

2003-02-16 Thread Doug Ewell
Roozbeh Pournader <roozbeh at sharif dot edu> wrote:

 The mystery remains as to why U+FEFF (or the bytes 0xEF 0xBB 0xBF,
 however interpreted) would be displayed as a Euro sign.

 Autodetection as some other codepage?

The Unicode home page includes the following line, right where it should
be, in the head section:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Any User Agent that takes a page properly marked as UTF-8, as above, and
still tries to autodetect a local code page, is badly misguided.  How
would it handle a real UTF-8-encoded euro sign (0xE2 0x82 0xAC)?
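
Because the declaration is pure ASCII, a user agent can find it by scanning the raw bytes before committing to any decoder, even when a signature precedes it. A rough Python sketch, with the regular expression, the 1024-byte limit and the name `declared_charset` all chosen for illustration rather than taken from any specification or browser:

```python
import re
from typing import Optional

# Matches charset=... inside a <meta ...> tag, quoted or not.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_charset(data: bytes, limit: int = 1024) -> Optional[str]:
    """Return the charset named in a meta declaration within the first `limit` bytes."""
    match = _META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None


page = (b"\xef\xbb\xbf<html><head>"
        b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        b"</head></html>")
print(declared_charset(page))
```

With the declared charset in hand, there is no need to fall back on guessing a local code page.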

-Doug Ewell
 Fullerton, California





Re: BOM's at Beginning of Web Pages? Mac IE's Euro

2003-02-16 Thread Roozbeh Pournader
On Sun, 16 Feb 2003, Doug Ewell wrote:

 The Unicode home page includes the following line, right where it should
 be, in the head section:
 
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 
 Any User Agent that takes a page properly marked as UTF-8, as above, and
 still tries to autodetect a local code page, is badly misguided.  How
 would it handle a real UTF-8-encoded euro sign (0xE2 0x82 0xAC)?

AFAICR, there is not supposed to be a single non-ASCII character before that
meta tag. I really don't want to search the specs again, but I'm sure I
saw it somewhere. The HTML renderer sees those characters and thinks the
document has already started (since the <html>, <head> and <body> tags
are not mandatory in HTML 4 Transitional). So it goes into autodetection
mode. The same thing happens with MS FrontPage 2000 (but I've already
explained that).

roozbeh