Re: [whatwg] Valid Unicode
On Tue, 22 Apr 2008, Henri Sivonen wrote:
> On Apr 22, 2008, at 14:18, Ian Hickson wrote:
> > On Fri, 1 Dec 2006, Elliotte Harold wrote:
> > > 2. Are control characters allowed (probably yes, based on other parts of
> > > the spec).
> >
> > Not as raw characters. Control characters that aren't in U+80-U+9F are
> > allowed as entities.
> ...
> > > 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> > > 7. Are the noncharacters from the last two characters of each plane
> > > allowed (?)
> >
> > Not as raw characters but, for now, as entities yes.
>
> Why the distinction between raw characters and entities? Won't that just
> complicate things--serializers in particular?

This has now been fixed.

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
Re: [whatwg] Valid Unicode
On Apr 22, 2008, at 14:18, Ian Hickson wrote:
> On Fri, 1 Dec 2006, Elliotte Harold wrote:
> > 2. Are control characters allowed (probably yes, based on other parts of
> > the spec).
>
> Not as raw characters. Control characters that aren't in U+80-U+9F are
> allowed as entities.
...
> > 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> > 7. Are the noncharacters from the last two characters of each plane
> > allowed (?)
>
> Not as raw characters but, for now, as entities yes.

Why the distinction between raw characters and entities? Won't that just complicate things--serializers in particular?

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/
Re: [whatwg] Valid Unicode
On Fri, 1 Dec 2006, Elliotte Harold wrote:
>
> In 9.1.3 we see
>
> Text must consist of valid Unicode characters other than U+0000. Text should
> not contain control characters other than space characters.
>
> Later in 9.2.3.1 we find:
>
> If the number is not a valid Unicode character (e.g. if the number is higher
> than 1114111), or if the number is zero, then return a character token for the
> U+FFFD REPLACEMENT CHARACTER character instead.
>
> I do not think the Unicode spec defines the notion of a "valid Unicode
> character". (It does define a valid Unicode code unit sequence, but that's a
> little different. A code unit sequence generally consists of more than one
> character.) Thus I suggest we need to be more precise here about what is and
> is not a valid Unicode character.

The spec is much more precise now. Is it ok?

> In particular:
>
> 1. Are private use characters allowed?

Yes.

> 2. Are control characters allowed (probably yes, based on other parts of
> the spec).

Not as raw characters. Control characters that aren't in U+80-U+9F are allowed as entities.

> 3. Are surrogate characters allowed? (probably no)

No.

> 4. Are non-characters beyond 10FFFF allowed (no)

No.

> 5. Are reserved but currently undefined characters allowed (yes)

Yes.

> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> 7. Are the noncharacters from the last two characters of each plane
> allowed (?)

Not as raw characters but, for now, as entities yes.

On Sun, 3 Dec 2006, Henri Sivonen wrote:
> On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> >
> > It would not be wise for HTML5 to limit itself to the more constrained
> > character set of XML. In particular, the form feed character is
> > pretty popular,
> >
> > This is yet another case where "take HTML5, read it into a DOM, and
> > serialize it as XML, and voilà: you have valid XHTML" doesn't work.
>
> What I am advocating is making sure that *conforming* HTML5 documents
> can be serialized as XHTML5 without dataloss. This is important in order
> to be able to promise that an "XML tool chain" can be used for
> processing *conforming* HTML5 by sticking an HTML5 parser in front of
> the processing pipeline (for *non-browser* use cases like data mining,
> content management or conformance checking where scripts aren't executed
> nor CSS rendering performed). The motivation is to make processing HTML5
> in non-browser apps less expensive without giving an incentive for the
> solutions to violate the spec ad hoc on their own.
>
> For example, an "XML tool chain" is important enough for my conformance
> checking service that if at this point the assumption of *conforming*
> HTML5 being convertible to XHTML5 was broken in corner cases, I'd
> probably come up with ad hoc trickery for masking it instead of throwing
> away the tool chain. I'd prefer not having to do that and not having to
> explain to everyone else who finds an "XML tool chain" to be of value
> what tricks I needed to pull off to fake it.
>
> I am not suggesting that HTML5 browsers halt and catch fire upon finding
> a form feed. And it is obvious that lossless conversion of all possible
> non-conforming HTML5 documents to XML is impossible anyway, so making
> that a goal would not be worthwhile.
>
> But what legitimate and popular use would a form feed have in HTML5? Why
> can't we call it non-conforming? Are there use cases other than
> converting .txt RFCs to HTML with regexps without bothering to get rid
> of the form feeds?

I don't think that it would be valuable to make that use case raise errors.

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
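The character-reference rule quoted from 9.2.3.1 in the message above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the spec's algorithm: the function name `char_for_reference` is hypothetical, and replacing surrogates is an assumption drawn from the "No" answer to question 3 rather than from the quoted rule itself.

```python
REPLACEMENT_CHARACTER = "\uFFFD"
MAX_CODE_POINT = 0x10FFFF  # 1114111 in decimal


def char_for_reference(number: int) -> str:
    """Map the number of a numeric character reference to a character.

    Hypothetical helper for illustration only.
    """
    # Zero (or, defensively, a negative value) and anything above
    # U+10FFFF yield U+FFFD, as in the quoted rule.
    if number <= 0 or number > MAX_CODE_POINT:
        return REPLACEMENT_CHARACTER
    # Assumption: surrogate code points are replaced as well.
    if 0xD800 <= number <= 0xDFFF:
        return REPLACEMENT_CHARACTER
    return chr(number)
```

Under these assumptions a reference such as `&#0;` or `&#1114112;` would produce U+FFFD, while `&#65;` would produce "A".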
Re: [whatwg] Valid Unicode
On Dec 3, 2006, at 03:47, Sam Ruby wrote:
> > What I am advocating is making sure that *conforming* HTML5 documents can be serialized as XHTML5 without dataloss.
>
> Then you will also need to disallow newlines in attribute values.

I believe that is not the case. See the last line of the table at the end of section 3.3.3 in the XML 1.0 spec.
http://www.w3.org/TR/REC-xml/#AVNormalize

(Note that if some of this doesn't currently work in Gecko, Gecko has a bug. Expat does the XML-compliant thing but then nsExpatDriver runs whitespace normalization again, which is bogus. https://bugzilla.mozilla.org/show_bug.cgi?id=343870 It doesn't make sense to fix it until bug 18333 has landed.)

> In any case, I understand the desire; my read is that the WG's desire for backwards compatibility is higher. Limiting the character set to the allowable XML 1.1 character set should not be a problem for backwards compatibility purposes.

XML 1.1 doesn't really solve anything in this area. XML 1.1 is part of the problem. It creates incompatibility in corner cases without compelling benefits. The real XML that is known to work with any "XML tool chain" is XML 1.0.

I should point out that HTML5 proclaims non-conforming some things that no doubt exist on the Web and are far more common than form feeds. You can't even achieve any useful effect by including a form feed in HTML.

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/
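The attribute-value normalization point in the message above can be illustrated with Python's standard-library XML parser, which is built on Expat. This is only a sketch of the behaviour described in section 3.3.3 of XML 1.0: a literal newline in an attribute value is normalized to a space, while a newline written as a character reference survives. The helper name `parsed_title` and the sample markup are made up for the example.

```python
import xml.etree.ElementTree as ET


def parsed_title(xml_text: str) -> str:
    """Parse a one-element document and return its title attribute."""
    return ET.fromstring(xml_text).get("title")


# A literal newline inside the attribute value is normalized to a space.
literal = parsed_title('<a title="line one\nline two"/>')

# The same newline written as a character reference is preserved.
escaped = parsed_title('<a title="line one&#10;line two"/>')

print(repr(literal))
print(repr(escaped))
```

A serializer that wants newlines in attribute values to round-trip through XML would therefore write them as `&#10;` rather than as raw characters.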
Re: [whatwg] Valid Unicode
On 12/2/06, Henri Sivonen <[EMAIL PROTECTED]> wrote:
> On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> > It would not be wise for HTML5 to limit itself to the more constrained
> > character set of XML. In particular, the form feed character is
> > pretty popular,

BTW, I copied and pasted the wrong table. The characters I mentioned were discouraged (and include such things as Microsoft smart quotes mislabeled as iso-8859-1). The actual allowed set in XML 1.0 is as follows:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

For XML 1.1 the list is as follows:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

> > This is yet another case where "take HTML5, read it into a DOM, and
> > serialize it as XML, and voilà: you have valid XHTML" doesn't work.
>
> What I am advocating is making sure that *conforming* HTML5 documents can be serialized as XHTML5 without dataloss.

Then you will also need to disallow newlines in attribute values.

In any case, I understand the desire; my read is that the WG's desire for backwards compatibility is higher. Limiting the character set to the allowable XML 1.1 character set should not be a problem for backwards compatibility purposes.

- Sam Ruby
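The two allowed sets can be compared with a small range check. The following is a sketch only; it assumes the Char productions of XML 1.0 and XML 1.1 with the supplementary range written out as #x10000-#x10FFFF, and the function names are invented for the example.

```python
# Char production ranges, inclusive, as (low, high) code point pairs.
XML10_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]
XML11_CHAR_RANGES = [
    (0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]


def _in_ranges(code_point: int, ranges) -> bool:
    """Return True if code_point falls inside any inclusive range."""
    return any(low <= code_point <= high for low, high in ranges)


def is_xml10_char(code_point: int) -> bool:
    return _in_ranges(code_point, XML10_CHAR_RANGES)


def is_xml11_char(code_point: int) -> bool:
    return _in_ranges(code_point, XML11_CHAR_RANGES)


FORM_FEED = 0x0C
print(is_xml10_char(FORM_FEED), is_xml11_char(FORM_FEED))
```

With these ranges the form feed (U+000C) is outside the XML 1.0 set but inside the XML 1.1 set, which is the corner case the thread keeps returning to.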
Re: [whatwg] Valid Unicode
On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> It would not be wise for HTML5 to limit itself to the more constrained character set of XML. In particular, the form feed character is pretty popular,
>
> This is yet another case where "take HTML5, read it into a DOM, and serialize it as XML, and voilà: you have valid XHTML" doesn't work.

What I am advocating is making sure that *conforming* HTML5 documents can be serialized as XHTML5 without dataloss. This is important in order to be able to promise that an "XML tool chain" can be used for processing *conforming* HTML5 by sticking an HTML5 parser in front of the processing pipeline (for *non-browser* use cases like data mining, content management or conformance checking where scripts aren't executed nor CSS rendering performed). The motivation is to make processing HTML5 in non-browser apps less expensive without giving an incentive for the solutions to violate the spec ad hoc on their own.

For example, an "XML tool chain" is important enough for my conformance checking service that if at this point the assumption of *conforming* HTML5 being convertible to XHTML5 was broken in corner cases, I'd probably come up with ad hoc trickery for masking it instead of throwing away the tool chain. I'd prefer not having to do that and not having to explain to everyone else who finds an "XML tool chain" to be of value what tricks I needed to pull off to fake it.

I am not suggesting that HTML5 browsers halt and catch fire upon finding a form feed. And it is obvious that lossless conversion of all possible non-conforming HTML5 documents to XML is impossible anyway, so making that a goal would not be worthwhile.

But what legitimate and popular use would a form feed have in HTML5? Why can't we call it non-conforming? Are there use cases other than converting .txt RFCs to HTML with regexps without bothering to get rid of the form feeds?

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/
Re: [whatwg] Valid Unicode
On 12/1/06, Elliotte Harold <[EMAIL PROTECTED]> wrote:
> Henri Sivonen wrote:
> >> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> >> 7. Are the noncharacters from the last two characters of each plane
> >> allowed (?)
> >
> > I don't have particularly strong feelings here. Putting those characters
> > in HTML is a bad idea, but allowing them is not a problem for HTML5 to
> > XHTML5 conversion and they aren't a common problem like C1 controls.
>
> FFFE and FFFF are specifically forbidden by XML so they should probably be forbidden here too. I think the others are allowed.

Unicode (not XML) reserves U+D800 - U+DFFF as well as U+FFFE and U+FFFF.

XML 1.0 only allows the following characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].

It would not be wise for HTML5 to limit itself to the more constrained character set of XML. In particular, the form feed character is pretty popular,

This is yet another case where "take HTML5, read it into a DOM, and serialize it as XML, and voilà: you have valid XHTML" doesn't work.

> --
> Elliotte Rusty Harold [EMAIL PROTECTED]
> Java I/O 2nd Edition Just Published!
> http://www.cafeaulait.org/books/javaio2/
> http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/

- Sam Ruby
Re: [whatwg] Valid Unicode
On Dec 2, 2006, at 03:11, Elliotte Harold wrote:
> Henri Sivonen wrote:
> > > 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> > > 7. Are the noncharacters from the last two characters of each plane allowed (?)
> >
> > I don't have particularly strong feelings here. Putting those characters in HTML is a bad idea, but allowing them is not a problem for HTML5 to XHTML5 conversion and they aren't a common problem like C1 controls.
>
> FFFE and FFFF are specifically forbidden by XML so they should probably be forbidden here too. I think the others are allowed.

Right. Agreed. I thought you were only talking about astral planes in point #7.

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/
Re: [whatwg] Valid Unicode
Henri Sivonen wrote:
> Personally, I'd like to make non-conforming the control characters that XML 1.0 disallows (in order to keep conforming HTML5 documents convertible to XHTML5) as well as C1 controls (because they have no legitimate use in HTML but are a sign of a common bug).

Sounds reasonable.

--
Elliotte Rusty Harold [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/
Re: [whatwg] Valid Unicode
Henri Sivonen wrote:
> > 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> > 7. Are the noncharacters from the last two characters of each plane allowed (?)
>
> I don't have particularly strong feelings here. Putting those characters in HTML is a bad idea, but allowing them is not a problem for HTML5 to XHTML5 conversion and they aren't a common problem like C1 controls.

FFFE and FFFF are specifically forbidden by XML so they should probably be forbidden here too. I think the others are allowed.

--
Elliotte Rusty Harold [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/
Re: [whatwg] Valid Unicode
On Dec 1, 2006, at 14:38, Elliotte Harold wrote:
> 1. Are private use characters allowed?

I think the answer should be "Yes", because not allowing them could make people subvert Unicode and use e.g. Latin-1 code points for a different purpose with a bogus font. Also, not allowing them would be a violation of Charmod requirements for specs.

> 2. Are control characters allowed (probably yes, based on other parts of the spec).

Personally, I'd like to make non-conforming the control characters that XML 1.0 disallows (in order to keep conforming HTML5 documents convertible to XHTML5) as well as C1 controls (because they have no legitimate use in HTML but are a sign of a common bug).

> 3. Are surrogate characters allowed? (probably no)

Surrogates are an artifact of UTF-16. They have no place on the character level. So I'd say "No".

> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> 7. Are the noncharacters from the last two characters of each plane allowed (?)

I don't have particularly strong feelings here. Putting those characters in HTML is a bad idea, but allowing them is not a problem for HTML5 to XHTML5 conversion and they aren't a common problem like C1 controls.

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/
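A conformance checker following the proposal in the message above might scan text roughly like this. It is a minimal sketch, not any real checker's code: the function name `nonconforming_positions` is hypothetical, and the three tests encode one reading of the proposal (C0 controls that XML 1.0 disallows, the C1 range U+0080 to U+009F, and surrogates).

```python
def nonconforming_positions(text: str):
    """Return (index, code point) pairs flagged under the sketched proposal."""
    flagged = []
    for index, char in enumerate(text):
        code_point = ord(char)
        # C0 controls other than tab, line feed and carriage return
        # fall outside the XML 1.0 Char production.
        xml10_disallowed_control = code_point < 0x20 and code_point not in (0x9, 0xA, 0xD)
        # C1 controls, often a sign of Windows-1252 text mislabeled as ISO-8859-1.
        c1_control = 0x80 <= code_point <= 0x9F
        # Surrogates are a UTF-16 artifact, not characters.
        surrogate = 0xD800 <= code_point <= 0xDFFF
        if xml10_disallowed_control or c1_control or surrogate:
            flagged.append((index, code_point))
    return flagged


sample = "a\x0cb\x85c\td"
print(nonconforming_positions(sample))
```

In the sample string the form feed and the C1 control are flagged, while the tab is left alone.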
[whatwg] Valid Unicode
In 9.1.3 we see

  Text must consist of valid Unicode characters other than U+0000. Text should not contain control characters other than space characters.

Later in 9.2.3.1 we find:

  If the number is not a valid Unicode character (e.g. if the number is higher than 1114111), or if the number is zero, then return a character token for the U+FFFD REPLACEMENT CHARACTER character instead.

I do not think the Unicode spec defines the notion of a "valid Unicode character". (It does define a valid Unicode code unit sequence, but that's a little different. A code unit sequence generally consists of more than one character.) Thus I suggest we need to be more precise here about what is and is not a valid Unicode character. In particular:

1. Are private use characters allowed?
2. Are control characters allowed (probably yes, based on other parts of the spec).
3. Are surrogate characters allowed? (probably no)
4. Are non-characters beyond 10FFFF allowed (no)
5. Are reserved but currently undefined characters allowed (yes)
6. Are noncharacters U+FDD0..U+FDEF allowed (?)
7. Are the noncharacters from the last two characters of each plane allowed (?)

--
Elliotte Rusty Harold [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/
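The categories asked about in the list above can be told apart with simple range checks. The sketch below is illustrative only: the function name `classify` and its labels are invented, and question 5 (reserved but unassigned code points) is left out because it depends on the version of the Unicode Character Database in use.

```python
def classify(code_point: int) -> str:
    """Label a code point with the category the questions above ask about."""
    if code_point < 0 or code_point > 0x10FFFF:
        return "out-of-range"      # question 4: beyond U+10FFFF
    if 0xD800 <= code_point <= 0xDFFF:
        return "surrogate"         # question 3
    # Questions 6 and 7: U+FDD0..U+FDEF plus the last two code points
    # of every plane (U+xFFFE and U+xFFFF).
    if 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    # Question 1: the BMP private use area and planes 15 and 16.
    if (0xE000 <= code_point <= 0xF8FF
            or 0xF0000 <= code_point <= 0xFFFFD
            or 0x100000 <= code_point <= 0x10FFFD):
        return "private-use"
    # Question 2: C0 controls, DEL and C1 controls.
    if code_point <= 0x1F or 0x7F <= code_point <= 0x9F:
        return "control"
    return "other"


for cp in (0xE000, 0x0C, 0xD800, 0x110000, 0xFDD0, 0x1FFFF, 0x41):
    print(f"U+{cp:04X}: {classify(cp)}")
```

The noncharacter test runs before the private-use test so that U+FFFFE, U+FFFFF, U+10FFFE and U+10FFFF, which sit at the end of the private use planes, are reported as noncharacters.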