Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Ian Hickson
On Sun, 21 Dec 2008, Edward Z. Yang wrote: > > I suppose the big pivot point is "as if". A byte-wise implementation > would replace character globally with byte, and any U+ designation > with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not > the actual algorithm implementa

Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Geoffrey Sneddon
On 21 Dec 2008, at 16:35, Edward Z. Yang wrote: I suppose the big pivot point is "as if". A byte-wise implementation would replace character globally with byte, and any U+ designation with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not the actual algorithm implementation,

Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Philip Taylor
On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson wrote: > On Sat, 20 Dec 2008, Edward Z. Yang wrote: >> >> 1. Given an input stream that is known to be valid UTF-8, is it possible >> to implement the tokenization algorithm with byte-wise operations only? >> I think it's possible, since all of the char

Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Edward Z. Yang
Ian Hickson wrote: > Yes. (At least, that's the intent; if you find anything that contradicts > that, please let me know.) Great. I'll be sure to ping you if I find out otherwise. > Looking just at parsing, yes, probably... I suppose the big pivot point is "as if". A byte-wise implementation wo

Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Geoffrey Sneddon
On 21 Dec 2008, at 05:41, Ian Hickson wrote: 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the character matching parts of the algorithm map to characters

Re: [whatwg] Byte-wise tokenization algorithm

2008-12-20 Thread Ian Hickson
On Sat, 20 Dec 2008, Edward Z. Yang wrote: > > I am currently working on a PHP5 implementation of the HTML5 > specification. PHP has abysmal Unicode support, and implementing Unicode > streams in userspace may be unacceptably slow. Thus, my questions: > > 1. Given an input stream that is known t

[whatwg] Byte-wise tokenization algorithm

2008-12-20 Thread Edward Z. Yang
I am currently working on a PHP5 implementation of the HTML5 specification. PHP has abysmal Unicode support, and implementing Unicode streams in userspace may be unacceptably slow. Thus, my questions: 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokeniz
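
The question in this thread rests on a property of UTF-8: in a valid stream, every byte of a multi-byte sequence is 0x80 or higher, so an ASCII delimiter such as "<" (0x3C) can never appear inside an encoded non-ASCII character. The following is a minimal sketch of that idea only, written in Python rather than PHP and not taken from any message in the thread. The helper names `ascii_positions` and `utf8_bytes_for` are hypothetical, and the input is assumed to be already-validated UTF-8.

```python
# Illustrative sketch only: helper names are hypothetical and the input
# is assumed to be valid UTF-8 (no validation is performed here).

def ascii_positions(data: bytes, needle: int) -> list[int]:
    """Return the byte offsets at which the ASCII byte `needle` occurs.

    In valid UTF-8, lead and continuation bytes of multi-byte sequences
    are all >= 0x80, so a match on a byte < 0x80 is always a whole character.
    """
    if needle >= 0x80:
        raise ValueError("needle must be an ASCII byte (< 0x80)")
    return [i for i, b in enumerate(data) if b == needle]


def utf8_bytes_for(code_point: int) -> bytes:
    """Return the UTF-8 byte sequence for a 'U+XXXX' designation.

    A byte-wise implementation would compare against (or emit) these
    bytes wherever an algorithm names a non-ASCII code point.
    """
    return chr(code_point).encode("utf-8")


sample = "café <b>naïve</b> 日本語".encode("utf-8")

# Offsets of "<" found purely by byte comparison.
tag_opens = ascii_positions(sample, 0x3C)

# U+FFFD REPLACEMENT CHARACTER expressed as bytes.
replacement = utf8_bytes_for(0xFFFD)
```

Here `tag_opens` holds byte offsets, not character offsets, so any column or position reporting built on top of it would need a separate mapping back to characters.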