On Mon, Mar 11, 2013 at 9:56 AM, Darin Adler wrote:
> On Mar 11, 2013, at 9:54 AM, Adam Barth wrote:
>> If you're about to reply complaining about the above, please save your
>> complaints for another time.
>
> Huh?
The last time we tried to talk about changing the design of the HTML parser on …
No complaints with the long-term direction. I agree that it is a tall order to
implement.
- Michael
On Mar 11, 2013, at 9:54 AM, Adam Barth wrote:
> Oh, OK. I misunderstood your original message to say that the project
> as a whole had reached this conclusion, which certainly isn't the
> case, rather than that you personally had reached that conclusion.
On Mar 11, 2013, at 9:54 AM, Adam Barth wrote:
> As for the long-term direction of the HTML parser, my guess is that the
> optimum design will be to deliver the network bytes to the parser directly on
> the parser thread.
Sounds right to me.
> If you're about to reply complaining about the above, please save your
> complaints for another time.
Oh, OK. I misunderstood your original message to say that the project
as a whole had reached this conclusion, which certainly isn't the
case, rather than that you personally had reached that conclusion.
As for the long-term direction of the HTML parser, my guess is that
the optimum design will be to deliver the network bytes to the parser
directly on the parser thread.
Maciej,
*I* deemed using a character type template for the HTMLTokenizer unwieldy.
Given the existing SegmentedString input abstraction, it made logical sense to
put the 8/16-bit coding there. Had I moved the 8/16 logic into the tokenizer
itself, we might have needed …
On Mar 9, 2013, at 3:05 PM, Adam Barth wrote:
>
> In retrospect, I think what I was reacting to was msaboff's statement
> that an unnamed group of people had decided that the HTML tokenizer
> was too unwieldy to have a dedicated 8-bit path. In particular, it's
> unclear to me who made that decision.
On Sat, Mar 9, 2013 at 12:48 PM, Luis de Bethencourt wrote:
> On Mar 7, 2013 10:37 PM, "Brady Eidson" wrote:
>> > On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff wrote:
>> >> The various tokenizers / lexers work various ways to handle LChar
>> >> versus UChar input streams. Most of the other tokenizers are
>> >> templatized on input character type.
On Mar 7, 2013 10:37 PM, "Brady Eidson" wrote:
>
> > On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff wrote:
> >> The various tokenizers / lexers work various ways to handle LChar
> >> versus UChar input streams. Most of the other tokenizers are
> >> templatized on input character type. In the case of HTML, the
> >> tokenizer handles a UChar character at a time.
> On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff wrote:
>> The various tokenizers / lexers work various ways to handle LChar versus
>> UChar input streams. Most of the other tokenizers are templatized on input
>> character type. In the case of HTML, the tokenizer handles a UChar character
>> at a time.
Yes, I understand how the HTML tokenizer works. :)
Adam
On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff wrote:
> The various tokenizers / lexers work various ways to handle LChar versus
> UChar input streams. Most of the other tokenizers are templatized on input
> character type. In the case of HTML, the tokenizer handles a UChar
> character at a time.
The various tokenizers / lexers work various ways to handle LChar versus UChar
input streams. Most of the other tokenizers are templatized on input character
type. In the case of HTML, the tokenizer handles a UChar character at a time.
For 8-bit input streams, the zero extension of a LChar to a UChar is
essentially free.
The HTMLTokenizer still works in UChars. There's likely some
performance to be gained by moving it to an 8-bit character type.
There's some trickiness involved because HTML entities can expand to
characters outside of Latin-1. Also, it's unclear if we want two
tokenizers (one that's 8 bits wide and one that's 16 bits wide).
No. I retract my question. Sounds like we already have it right! Thanks for
setting me straight.
Maybe some day we could make a non-copying code path that points directly at
the data in the SharedBuffer, but I have no idea if that'd be beneficial.
-- Darin
Sent from my iPhone
There is an all-ASCII case in TextCodecUTF8::decode(). It should be keeping
all ASCII data as 8-bit. TextCodecWindowsLatin1::decode() not only has an
all-ASCII case, it up-converts to 16-bit only in a couple of rare cases.
Is there some other case you don't think we are handling?
- Michael
Hi folks.
Today, bytes that come in from the network get turned into UTF-16 by the
decoding process. We then turn some of them back into Latin-1 during the
parsing process. Should we make changes so there’s an 8-bit path? It might be
as simple as writing code that has more of an all-ASCII special case …