Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Adam Barth
On Mon, Mar 11, 2013 at 9:56 AM, Darin Adler wrote: > On Mar 11, 2013, at 9:54 AM, Adam Barth wrote: >> If you're about to reply complaining about the above, please save your >> complaints for another time. > > Huh? The last time we tried to talk about changing the design of the HTML parser on

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Michael Saboff
No complaints with the long term direction. I agree that it is a tall order to implement. - Michael On Mar 11, 2013, at 9:54 AM, Adam Barth wrote: > Oh, Ok. I misunderstood your original message to say that the project > as a whole had reached this conclusion, which certainly isn't the > cas

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Darin Adler
On Mar 11, 2013, at 9:54 AM, Adam Barth wrote: > As for the long-term direction of the HTML parser, my guess is that the > optimum design will be to deliver the network bytes to the parser directly on > the parser thread. Sounds right to me. > If you're about to reply complaining about the ab

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Adam Barth
Oh, Ok. I misunderstood your original message to say that the project as a whole had reached this conclusion, which certainly isn't the case, rather than that you personally had reached that conclusion. As for the long-term direction of the HTML parser, my guess is that the optimum design will be

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Michael Saboff
Maciej, *I* deemed using a character type template for the HTMLTokenizer as being unwieldy. Given there was the existing SegmentedString input abstraction, it made logical sense to put the 8/16 bit coding there. If I had moved the 8/16 logic into the tokenizer itself, we might have nee

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-09 Thread Maciej Stachowiak
On Mar 9, 2013, at 3:05 PM, Adam Barth wrote: > > In retrospect, I think what I was reacting to was msaboff's statement > that an unnamed group of people had decided that the HTML tokenizer > was too unwieldy to have a dedicated 8-bit path. In particular, it's > unclear to me who made that decisi

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-09 Thread Adam Barth
On Sat, Mar 9, 2013 at 12:48 PM, Luis de Bethencourt wrote: > On Mar 7, 2013 10:37 PM, "Brady Eidson" wrote: >> > On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff >> > wrote: >> >> The various tokenizers / lexers work various ways to handle LChar >> >> versus UChar input streams. Most of the othe

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-09 Thread Luis de Bethencourt
On Mar 7, 2013 10:37 PM, "Brady Eidson" wrote: > > > On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff wrote: > >> The various tokenizers / lexers work various ways to handle LChar versus UChar input streams. Most of the other tokenizers are templatized on input character type. In the case of HTML,

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Brady Eidson
> On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff wrote: >> The various tokenizers / lexers work various ways to handle LChar versus >> UChar input streams. Most of the other tokenizers are templatized on input >> character type. In the case of HTML, the tokenizer handles a UChar character >> a

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Adam Barth
Yes, I understand how the HTML tokenizer works. :) Adam On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff wrote: > The various tokenizers / lexers work various ways to handle LChar versus > UChar input streams. Most of the other tokenizers are templatized on input > character type. In the case

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Michael Saboff
The various tokenizers / lexers work in various ways to handle LChar versus UChar input streams. Most of the other tokenizers are templatized on input character type. In the case of HTML, the tokenizer handles a UChar character at a time. For 8 bit input streams, the zero extension of an LChar to
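
To make the two approaches in this message concrete, here is a rough sketch rather than WebKit code. `LChar` and `UChar` are stand-in typedefs, and `InputSpan`, `countTagOpensTemplated` and `countTagOpensWidened` are hypothetical names invented for illustration. The first function is templatized on the input character type; the second always consumes a 16-bit character, with 8-bit input zero-extended inside the input abstraction.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for WebKit's 8-bit and 16-bit character types (assumption).
using LChar = unsigned char;
using UChar = char16_t;

// Approach 1: the scanner is templatized on the input character type,
// so the compiler emits one specialization per width.
template<typename CharType>
std::size_t countTagOpensTemplated(const CharType* input, std::size_t length)
{
    assert(input || !length);
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (input[i] == '<')
            ++count;
    }
    return count;
}

// Approach 2: a hypothetical input abstraction that hides the width.
// Exactly one of data8 / data16 is expected to be set.
struct InputSpan {
    const LChar* data8 = nullptr;
    const UChar* data16 = nullptr;
    std::size_t length = 0;

    // Always hands out a 16-bit character; 8-bit input is zero-extended.
    UChar at(std::size_t i) const
    {
        assert(i < length);
        return data8 ? static_cast<UChar>(data8[i]) : data16[i];
    }
};

// The scanner itself only ever sees UChar, one character at a time.
std::size_t countTagOpensWidened(const InputSpan& input)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < input.length; ++i) {
        if (input.at(i) == u'<')
            ++count;
    }
    return count;
}
```

The trade-off this is meant to show: the template version avoids a per-character width check but duplicates the scanner, while the widened version keeps a single scanner and pays for a branch and a widening on every character.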

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Adam Barth
The HTMLTokenizer still works in UChars. There's likely some performance to be gained by moving it to an 8-bit character type. There's some trickiness involved because HTML entities can expand to characters outside of Latin-1. Also, it's unclear if we want two tokenizers (one that's 8 bits wide an
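
The entity problem mentioned here can be illustrated with a small sketch. This is not the HTMLTokenizer's actual buffer: `TokenBuffer` is a hypothetical class, assumed only for illustration. It stores characters in 8 bits while everything fits in Latin-1 and upconverts once, the first time a wider character arrives (for example, `&mdash;` decodes to U+2014).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical output buffer: 8-bit until a character above U+00FF shows up.
class TokenBuffer {
public:
    void append(char16_t c)
    {
        if (m_is8Bit && c <= 0xFF) {
            m_narrow.push_back(static_cast<char>(c));
            return;
        }
        if (m_is8Bit)
            upconvert();
        m_wide.push_back(c);
    }

    bool is8Bit() const { return m_is8Bit; }

    std::size_t length() const
    {
        return m_is8Bit ? m_narrow.size() : m_wide.size();
    }

    char16_t at(std::size_t i) const
    {
        assert(i < length());
        if (m_is8Bit)
            return static_cast<char16_t>(static_cast<unsigned char>(m_narrow[i]));
        return m_wide[i];
    }

private:
    // Zero-extend everything collected so far into the 16-bit store.
    void upconvert()
    {
        m_wide.reserve(m_narrow.size() + 1);
        for (char c : m_narrow)
            m_wide.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
        m_narrow.clear();
        m_is8Bit = false;
    }

    std::string m_narrow;
    std::u16string m_wide;
    bool m_is8Bit = true;
};
```

A single widening buffer like this is one possible answer to the "one tokenizer or two" question, at the cost of a width check on every append; whether that beats two dedicated tokenizers is exactly what the thread leaves open.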

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Darin Adler
No. I retract my question. Sounds like we already have it right! Thanks for setting me straight. Maybe some day we could make a non-copying code path that points directly at the data in the SharedBuffer, but I have no idea if that'd be beneficial. -- Darin On Mar 7, 2013

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Michael Saboff
There is an all-ASCII case in TextCodecUTF8::decode(). It should be keeping all ASCII data as 8 bit. TextCodecWindowsLatin1::decode() has not only an all-ASCII case, but it only upconverts to 16 bit in a couple of rare cases. Is there some other case you don't think we are handling? - Micha
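
For readers unfamiliar with the idea, an all-ASCII fast path can be sketched roughly as follows. This is not the code in `TextCodecUTF8::decode()`; `isAllASCII` and `tryDecodeAs8Bit` are hypothetical names, and the slow 16-bit path is deliberately left out. The point is only that pure-ASCII input can stay in an 8-bit string, and anything else is handed back to the caller for full decoding.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: true when no byte has its high bit set.
bool isAllASCII(const unsigned char* data, std::size_t length)
{
    assert(data || !length);
    unsigned char bits = 0;
    for (std::size_t i = 0; i < length; ++i)
        bits |= data[i]; // accumulate, then test the high bit once at the end
    return (bits & 0x80) == 0;
}

// Hypothetical fast path: returns the bytes as an 8-bit string when they are
// pure ASCII, or std::nullopt so the caller can fall back to a full decode
// into 16-bit characters (not shown here).
std::optional<std::string> tryDecodeAs8Bit(const unsigned char* data, std::size_t length)
{
    if (!isAllASCII(data, length))
        return std::nullopt;
    return std::string(reinterpret_cast<const char*>(data), length);
}
```

A caller would try `tryDecodeAs8Bit` first and run the general decoder only when it returns `std::nullopt`.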

[webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Darin Adler
Hi folks. Today, bytes that come in from the network get turned into UTF-16 by the decoding process. We then turn some of them back into Latin-1 during the parsing process. Should we make changes so there’s an 8-bit path? It might be as simple as writing code that has more of an all-ASCII speci