Re: [whatwg] Valid Unicode

2008-05-23 Thread Ian Hickson

On Tue, 22 Apr 2008, Henri Sivonen wrote:
> On Apr 22, 2008, at 14:18, Ian Hickson wrote:
> > On Fri, 1 Dec 2006, Elliotte Harold wrote:
> > > 2. Are control characters allowed (probably yes, based on other parts of
> > > the spec).
> > 
> > Not as raw characters. Control characters that aren't in U+80-U+9F are
> > allowed as entities.
> ...
> > > 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> > > 7. Are the noncharacters from the last two characters of each plane
> > > allowed (?)
> > 
> > Not as raw characters but, for now, as entities yes.
> 
> Why the distinction between raw characters and entities? Won't that just 
> complicate things--serializers in particular?

This has now been fixed.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Valid Unicode

2008-04-22 Thread Henri Sivonen


On Apr 22, 2008, at 14:18, Ian Hickson wrote:
> On Fri, 1 Dec 2006, Elliotte Harold wrote:
> > 2. Are control characters allowed (probably yes, based on other parts of
> > the spec).
>
> Not as raw characters. Control characters that aren't in U+80-U+9F are
> allowed as entities.

...

> > 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> > 7. Are the noncharacters from the last two characters of each plane
> > allowed (?)
>
> Not as raw characters but, for now, as entities yes.

Why the distinction between raw characters and entities? Won't that
just complicate things--serializers in particular?


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/

Re: [whatwg] Valid Unicode

2008-04-22 Thread Ian Hickson

On Fri, 1 Dec 2006, Elliotte Harold wrote:
>
> In 9.1.3 we see
> 
> Text must consist of valid Unicode characters other than U+0000. Text should
> not contain control characters other than space characters.
> 
> 
> Later in 9.2.3.1 we find:
> 
> If the number is not a valid Unicode character (e.g. if the number is higher
> than 1114111), or if the number is zero, then return a character token for the
> U+FFFD REPLACEMENT CHARACTER character instead.
> 
> 
> I do not think the Unicode spec defines the notion of a "valid Unicode
> character". (It does define a valid Unicode code unit sequence, but that's a
> little different. A code unit sequence generally consists of more than one
> character.) Thus I suggest we need to be more precise here about what is and
> is not a valid Unicode character.

The spec is much more precise now. Is it ok?


> In particular:
> 
> 1. Are private use characters allowed?

Yes.

> 2. Are control characters allowed (probably yes, based on other parts of 
> the spec).

Not as raw characters. Control characters that aren't in U+80-U+9F are 
allowed as entities.

> 3. Are surrogate characters allowed? (probably no)

No.

> 4. Are non-characters beyond 10FFFF allowed (no)

No.

> 5. Are reserved but currently undefined characters allowed (yes)

Yes.

> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> 7. Are the noncharacters from the last two characters of each plane 
> allowed (?)

Not as raw characters but, for now, as entities yes.


On Sun, 3 Dec 2006, Henri Sivonen wrote:
> On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> > 
> > It would not be wise for HTML5 to limit itself to the more constrained 
> > character set of XML.  In particular, the form feed character is 
> > pretty popular,
> > 
> > This is yet another case where "take HTML5, read it into a DOM, and 
> > serialize it as XML, and voilà: you have valid XHTML" doesn't work.
> 
> What I am advocating is making sure that *conforming* HTML5 documents 
> can be serialized as XHTML5 without dataloss. This is important in order 
> to be able to promise that an "XML tool chain" can be used for 
> processing *conforming* HTML5 by sticking an HTML5 parser in front of 
> the processing pipeline (for *non-browser* use cases like data mining, 
> content management or conformance checking where scripts aren't executed 
> nor CSS rendering performed). The motivation is to make processing HTML5 
> in non-browser apps less expensive without giving an incentive for the 
> solutions to violate the spec ad hoc on their own.
> 
> For example, an "XML tool chain" is important enough for my conformance 
> checking service that if at this point the assumption of *conforming* 
> HTML5 being convertible to XHTML5 was broken in corner cases, I'd 
> probably come up with ad hoc trickery for masking it instead of throwing 
> away the tool chain. I'd prefer not having to do that and not having to 
> explain to everyone else who finds an "XML tool chain" to be of value 
> what tricks I needed to pull off to fake it.
> 
> I am not suggesting that HTML5 browsers halt and catch fire upon finding 
> a form feed. And it is obvious that lossless conversion of all possible 
> non-conforming HTML5 documents to XML is impossible anyway, so making 
> that a goal would not be worthwhile.
> 
> But what legitimate and popular use would a form feed have in HTML5? Why 
> can't we call it non-conforming? Are there use cases other than 
> converting .txt RFCs to HTML with regexps without bothering to get rid 
> of the form feeds?

I don't think that it would be valuable to make that use case raise 
errors.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Valid Unicode

2006-12-03 Thread Henri Sivonen


On Dec 3, 2006, at 03:47, Sam Ruby wrote:
> > What I am advocating is making sure that *conforming* HTML5 documents
> > can be serialized as XHTML5 without dataloss.
>
> Then you will also need to disallow newlines in attribute values.

I believe that is not the case. See the last line of the table at the
end of section 3.3.3 in the XML 1.0 spec.

http://www.w3.org/TR/REC-xml/#AVNormalize

(Note that if some of this doesn't currently work in Gecko, Gecko has
a bug. Expat does the XML-compliant thing but then nsExpatDriver runs
whitespace normalization again, which is bogus.
https://bugzilla.mozilla.org/show_bug.cgi?id=343870 It doesn't make sense to
fix it until bug 18333 has landed.)
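
A minimal sketch, not from the thread, of the attribute-value normalization point above. It assumes Python's standard xml.etree.ElementTree module (which is backed by Expat); the variable names are illustrative only. A literal newline in an attribute value is normalized to a space, while a newline written as a character reference survives parsing, so newlines in attribute values can still round-trip through XML.

```python
import xml.etree.ElementTree as ET

# A literal newline inside the attribute value is normalized to a space.
literal = ET.fromstring('<a b="x\ny"/>').get("b")

# The same newline written as a character reference is preserved.
referenced = ET.fromstring('<a b="x&#10;y"/>').get("b")

# Serializing and re-parsing keeps the newline, because the serializer
# has to escape it rather than emit it raw.
element = ET.fromstring('<a b="x&#10;y"/>')
round_tripped = ET.fromstring(ET.tostring(element, encoding="unicode")).get("b")

print(repr(literal), repr(referenced), repr(round_tripped))
```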



> In any case, I understand the desire; my read is that the WG's desire
> for backwards compatibility is higher.  Limiting the character set to
> the allowable XML 1.1 character set should not be a problem for
> backwards compatibility purposes.

XML 1.1 doesn't really solve anything in this area. XML 1.1 is part
of the problem. It creates incompatibility in corner cases without
compelling benefits. The real XML that is known to work with any "XML
tool chain" is XML 1.0.

I should point out that HTML5 proclaims non-conforming some things
that no doubt exist on the Web and are far more common than form
feeds. You can't even achieve any useful effect by including a form
feed in HTML.


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/

Re: [whatwg] Valid Unicode

2006-12-02 Thread Sam Ruby

On 12/2/06, Henri Sivonen <[EMAIL PROTECTED]> wrote:
> On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> > It would not be wise for HTML5 to limit itself to the more constrained
> > character set of XML.  In particular, the form feed character is
> > pretty popular,

BTW, I copied and pasted the wrong table.  The characters I mentioned
were discouraged (and include such things as Microsoft smart quotes
mislabeled as iso-8859-1).  The actual allowed set in XML 1.0 is as
follows:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

For XML 1.1 the list is as follows:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
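
As a rough illustration (not part of the original message), the XML 1.0 set can be expressed as a code point check. The ranges below are taken from the Char production in the XML 1.0 specification; the function names are hypothetical and chosen only for this sketch.

```python
# Ranges from the XML 1.0 "Char" production.
XML10_CHAR_RANGES = (
    (0x9, 0xA),          # tab and line feed
    (0xD, 0xD),          # carriage return
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml10_char(cp: int) -> bool:
    """Return True if the code point may appear in an XML 1.0 document."""
    return any(lo <= cp <= hi for lo, hi in XML10_CHAR_RANGES)


def first_non_xml10_char(text: str):
    """Return the index of the first character XML 1.0 disallows, or None."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index
    return None
```

With this check, a form feed (U+000C) is reported as unserializable, while tab, line feed and carriage return pass.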

> > This is yet another case where "take HTML5, read it into a DOM, and
> > serialize it as XML, and voilà: you have valid XHTML" doesn't work.
>
> What I am advocating is making sure that *conforming* HTML5 documents
> can be serialized as XHTML5 without dataloss.

Then you will also need to disallow newlines in attribute values.

In any case, I understand the desire; my read is that the WG's desire
for backwards compatibility is higher.  Limiting the character set to
the allowable XML 1.1 character set should not be a problem for
backwards compatibility purposes.

- Sam Ruby

Re: [whatwg] Valid Unicode

2006-12-02 Thread Henri Sivonen


On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> It would not be wise for HTML5 to limit itself to the more constrained
> character set of XML.  In particular, the form feed character is
> pretty popular,
>
> This is yet another case where "take HTML5, read it into a DOM, and
> serialize it as XML, and voilà: you have valid XHTML" doesn't work.


What I am advocating is making sure that *conforming* HTML5 documents  
can be serialized as XHTML5 without dataloss. This is important in  
order to be able to promise that an "XML tool chain" can be used for  
processing *conforming* HTML5 by sticking an HTML5 parser in front of  
the processing pipeline (for *non-browser* use cases like data  
mining, content management or conformance checking where scripts  
aren't executed nor CSS rendering performed). The motivation is to  
make processing HTML5 in non-browser apps less expensive without  
giving an incentive for the solutions to violate the spec ad hoc on  
their own.


For example, an "XML tool chain" is important enough for my  
conformance checking service that if at this point the assumption of  
*conforming* HTML5 being convertible to XHTML5 was broken in corner  
cases, I'd probably come up with ad hoc trickery for masking it  
instead of throwing away the tool chain. I'd prefer not having to do  
that and not having to explain to everyone else who finds an "XML  
tool chain" to be of value what tricks I needed to pull off to fake it.


I am not suggesting that HTML5 browsers halt and catch fire upon  
finding a form feed. And it is obvious that lossless conversion of  
all possible non-conforming HTML5 documents to XML is impossible  
anyway, so making that a goal would not be worthwhile.


But what legitimate and popular use would a form feed have in HTML5?  
Why can't we call it non-conforming? Are there use cases other than  
converting .txt RFCs to HTML with regexps without bothering to get  
rid of the form feeds?
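
To make the form feed concern concrete, here is a small sketch, not from the thread, that feeds a form feed to an XML 1.0 parser. It assumes Python's standard xml.etree.ElementTree module; the helper name parses_as_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(document: str) -> bool:
    """Return True if the string is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A tab is an allowed XML 1.0 character.
print(parses_as_xml("<p>a\tb</p>"))
# A raw form feed (U+000C) is not, and neither is a reference to it.
print(parses_as_xml("<p>a\x0cb</p>"))
print(parses_as_xml("<p>a&#12;b</p>"))
```

This is why a document containing a form feed cannot be passed losslessly through an XML 1.0 tool chain: there is no way to write that character, raw or escaped.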


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/

Re: [whatwg] Valid Unicode

2006-12-02 Thread Sam Ruby

On 12/1/06, Elliotte Harold <[EMAIL PROTECTED]> wrote:
> Henri Sivonen wrote:
> >> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> >> 7. Are the noncharacters from the last two characters of each plane
> >> allowed (?)
> >
> > I don't have particularly strong feelings here. Putting those characters
> > in HTML is a bad idea, but allowing them is not a problem for HTML5 to
> > XHTML5 conversion and they aren't a common problem like C1 controls.
>
> FFFE and FFFF are specifically forbidden by XML so they should probably
> be forbidden here too. I think the others are allowed.

Unicode (not XML) reserves U+D800 – U+DFFF as well as U+FFFE and U+FFFF.

XML 1.0 only allows the following characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].

It would not be wise for HTML5 to limit itself to the more constrained
character set of XML.  In particular, the form feed character is
pretty popular,

This is yet another case where "take HTML5, read it into a DOM, and
serialize it as XML, and voilà: you have valid XHTML" doesn't work.

--
Elliotte Rusty Harold  [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/

- Sam Ruby

Re: [whatwg] Valid Unicode

2006-12-02 Thread Henri Sivonen


On Dec 2, 2006, at 03:11, Elliotte Harold wrote:
> Henri Sivonen wrote:
> > > 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> > > 7. Are the noncharacters from the last two characters of each
> > > plane allowed (?)
> >
> > I don't have particularly strong feelings here. Putting those
> > characters in HTML is a bad idea, but allowing them is not a
> > problem for HTML5 to XHTML5 conversion and they aren't a common
> > problem like C1 controls.
>
> FFFE and FFFF are specifically forbidden by XML so they should
> probably be forbidden here too. I think the others are allowed.

Right. Agreed. I thought you were only talking about astral planes in
point #7.


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/

Re: [whatwg] Valid Unicode

2006-12-01 Thread Elliotte Harold


Henri Sivonen wrote:
> Personally, I'd like to make non-conforming the control characters that
> XML 1.0 disallows (in order to keep conforming HTML5 documents
> convertible to XHTML5) as well as C1 controls (because they have no
> legitimate use in HTML but are a sign of a common bug).

Sounds reasonable.

--
Elliotte Rusty Harold  [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/

Re: [whatwg] Valid Unicode

2006-12-01 Thread Elliotte Harold


Henri Sivonen wrote:
> > 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> > 7. Are the noncharacters from the last two characters of each plane
> > allowed (?)
>
> I don't have particularly strong feelings here. Putting those characters
> in HTML is a bad idea, but allowing them is not a problem for HTML5 to
> XHTML5 conversion and they aren't a common problem like C1 controls.

FFFE and FFFF are specifically forbidden by XML so they should probably
be forbidden here too. I think the others are allowed.


--
Elliotte Rusty Harold  [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/

Re: [whatwg] Valid Unicode

2006-12-01 Thread Henri Sivonen


On Dec 1, 2006, at 14:38, Elliotte Harold wrote:
> 1. Are private use characters allowed?

I think the answer should be "Yes", because not allowing them could
make people subvert Unicode and use e.g. Latin-1 code points for a
different purpose with a bogus font. Also, not allowing them would be
a violation of Charmod requirements for specs.

> 2. Are control characters allowed (probably yes, based on other
> parts of the spec).

Personally, I'd like to make non-conforming the control characters
that XML 1.0 disallows (in order to keep conforming HTML5 documents
convertible to XHTML5) as well as C1 controls (because they have no
legitimate use in HTML but are a sign of a common bug).
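
A short sketch, not from the thread, of the kind of bug that C1 controls typically signal: Windows-1252 text (for example smart quotes) mislabeled as ISO-8859-1. The sample bytes and the helper name has_c1_controls are assumptions made for illustration.

```python
# Bytes 0x93 and 0x94 are curly quotes in Windows-1252.
raw = b"\x93quoted\x94"

# Decoded with the wrong label, they become C1 control characters.
as_latin1 = raw.decode("iso-8859-1")

# Decoded with the intended encoding, they are ordinary punctuation.
as_cp1252 = raw.decode("cp1252")


def has_c1_controls(text: str) -> bool:
    """Return True if the text contains any character in U+0080..U+009F."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)


print(has_c1_controls(as_latin1), has_c1_controls(as_cp1252))
```

A conformance checker that flags C1 controls would therefore catch the mislabeling rather than reject any intentional content.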



> 3. Are surrogate characters allowed? (probably no)

Surrogates are an artifact of UTF-16. They have no place on the
character level. So I'd say "No".
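
To illustrate the point (this sketch is not from the thread), surrogates only appear as UTF-16 code units when a code point above U+FFFF is encoded; a lone surrogate is not an encodable character. The function names are hypothetical, and U+1047E is just an arbitrary supplementary-plane example.

```python
def utf16_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary-plane code point into two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def can_encode_utf8(text: str) -> bool:
    """Return True if the string can be encoded as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


high, low = utf16_surrogate_pair(0x1047E)
print(hex(high), hex(low))
print(can_encode_utf8("\U0001047E"), can_encode_utf8("\ud800"))
```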



> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> 7. Are the noncharacters from the last two characters of each plane
> allowed (?)

I don't have particularly strong feelings here. Putting those
characters in HTML is a bad idea, but allowing them is not a problem
for HTML5 to XHTML5 conversion and they aren't a common problem like
C1 controls.


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/

[whatwg] Valid Unicode

2006-12-01 Thread Elliotte Harold


In 9.1.3 we see

Text must consist of valid Unicode characters other than U+0000. Text 
should not contain control characters other than space characters.



Later in 9.2.3.1 we find:

If the number is not a valid Unicode character (e.g. if the number is 
higher than 1114111), or if the number is zero, then return a character 
token for the U+FFFD REPLACEMENT CHARACTER character instead.
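
A minimal sketch, not from the thread, of the numeric character reference rule quoted above. The function name is hypothetical. The zero and upper-bound checks come straight from the quoted text (1114111 is 0x10FFFF); treating surrogate code points as invalid is an assumption, since the quoted text does not say what "valid" covers.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def char_for_numeric_reference(number: int) -> str:
    """Map the number from a numeric character reference to a character."""
    # From the quoted rule: zero and anything above 1114111 is replaced.
    if number == 0 or number > 0x10FFFF:
        return REPLACEMENT_CHARACTER
    # Assumption, not stated in the quoted text: surrogates are not valid.
    if 0xD800 <= number <= 0xDFFF:
        return REPLACEMENT_CHARACTER
    return chr(number)


print(char_for_numeric_reference(0x263A))
```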



I do not think the Unicode spec defines the notion of a "valid Unicode 
character". (It does define a valid Unicode code unit sequence, but 
that's a little different. A code unit sequence generally consists of 
more than one character.) Thus I suggest we need to be more precise here 
about what is and is not a valid Unicode character. In particular:



1. Are private use characters allowed?
2. Are control characters allowed (probably yes, based on other parts of 
the spec).

3. Are surrogate characters allowed? (probably no)
4. Are non-characters beyond 10FFFF allowed (no)
5. Are reserved but currently undefined characters allowed (yes)
6. Are noncharacters U+FDD0..U+FDEF allowed (?)
7. Are the noncharacters from the last two characters of each plane 
allowed (?)
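
The categories in the list above can be sketched as a code point classifier. This is an illustration only, not something from the thread: the function names and category labels are hypothetical, and the ranges are the ones defined by Unicode (surrogates, noncharacters, private use areas) plus the C0/C1 control ranges.

```python
def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF plus the last two code points of each plane.
    return 0xFDD0 <= cp <= 0xFDEF or (
        0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
    )


def is_private_use(cp: int) -> bool:
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def is_control(cp: int) -> bool:
    # C0 controls, DEL, and C1 controls.
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def classify(cp: int) -> str:
    """Return a label for the category a code point falls into."""
    if cp < 0 or cp > 0x10FFFF:
        return "out of range"
    if is_surrogate(cp):
        return "surrogate"
    if is_noncharacter(cp):
        return "noncharacter"
    if is_control(cp):
        return "control"
    if is_private_use(cp):
        return "private use"
    return "other"
```

For example, classify(0xFFFE) and classify(0x10FFFF) both report "noncharacter", which covers questions 6 and 7 with a single test.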



--
Elliotte Rusty Harold  [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/
