Re: BOM's at Beginning of Web Pages?

Roozbeh Pournader Sat, 15 Feb 2003 19:09:29 -0800

Found it! It's forbidden to start an HTML 4.0 page with a UTF-8 BOM. Proof:


1. Open the main page of Unicode. You can see that the HTML header says:

   <!doctype HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html>

So, we are talking about HTML 4.0 here. The reference for HTML 4.0 is:

   http://www.w3.org/TR/1998/REC-html40-19980424/

The section about HTML header is Section 7.1, Introduction to the 
structure of an HTML document:

   http://www.w3.org/TR/1998/REC-html40-19980424/struct/global.html#h-7.1

which mentions:

  "An HTML 4.0 document is composed of three parts:

      1. a line containing HTML version information,
      2. a declarative header section (delimited by the HEAD element),
      3. a body, which contains the document's actual content. The body 
         may be implemented by the BODY element or the FRAMESET element.

   White space (spaces, newlines, tabs, and comments) may appear before or 
   after each section. Sections 2 and 3 should be delimited by the HTML 
   element."

So "White space" is allowed before the line containing HTML version 
information. But what is white space? It is defined in Section 9.1, White 
space:

  "The document character set includes a wide variety of white space 
   characters. Many of these are typographic elements used in some 
   applications to produce particular visual spacing effects. In HTML, 
   only the following characters are defined as white space characters:

      * ASCII space (&#x0020;)
      * ASCII tab (&#x0009;)
      * ASCII form feed (&#x000C;)
      * Zero-width space (&#x200B;) 
   
   Line breaks are also white space characters."

So we need to know what a line break is! Well, Section 9.3.2 defines 
that:

  "A line break is defined to be a carriage return (&#x000D;), a line feed 
   (&#x000A;), or a carriage return/line feed pair."

That's all. So the only characters that are allowed in an HTML 4.0 web page
before the HTML header are U+0009, U+000A, U+000C, U+000D, U+0020, and
U+200B. U+FEFF is not among them. QED.
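
For anyone who wants to check a page against the list above, here is a
minimal sketch in Python. The names HTML4_WHITE_SPACE and
leading_non_white_space are made up for this illustration, and the check
only looks at the characters before the first "<" (so it stops at a leading
comment, which the quoted passage also permits):

```python
# Characters HTML 4.0 defines as white space (Sections 9.1 and 9.3.2):
# tab, line feed, form feed, carriage return, space, zero-width space.
HTML4_WHITE_SPACE = frozenset("\u0009\u000A\u000C\u000D\u0020\u200B")


def leading_non_white_space(text):
    """Return the characters before the first '<' that are not
    HTML 4.0 white space. An empty list means the prefix is fine."""
    prefix = text.split("<", 1)[0]
    return [c for c in prefix if c not in HTML4_WHITE_SPACE]


# A decoded page that still carries U+FEFF at the start (hypothetical input).
page_with_bom = "\uFEFF<!doctype HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\"><html>"
page_without_bom = " \r\n\t<!doctype HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\"><html>"

offending = leading_non_white_space(page_with_bom)
clean = leading_non_white_space(page_without_bom)
```

Here offending holds the single character U+FEFF and clean is empty.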

roozbeh

PS: UTF-16 is an exception to that, since the BOM is not part of the 
document and should be removed for processing.
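
The difference can be illustrated with Python's standard codecs. This is
only a sketch of how one decoder behaves, not a statement about what any
browser does: the "utf-16" codec consumes the BOM while decoding, whereas
the plain "utf-8" codec leaves U+FEFF in the decoded text.

```python
# The same tiny document in both encodings, each starting with a BOM.
utf8_page = b"\xef\xbb\xbf<html>"
utf16_page = "<html>".encode("utf-16")  # this codec prepends a BOM

utf8_text = utf8_page.decode("utf-8")     # U+FEFF stays as the first character
utf16_text = utf16_page.decode("utf-16")  # BOM is used for byte order, then dropped

utf8_keeps_bom = utf8_text.startswith("\ufeff")
utf16_keeps_bom = utf16_text.startswith("\ufeff")
```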
