No complaints with the long-term direction. I agree that it is a tall order to implement.

- Michael

On Mar 11, 2013, at 9:54 AM, Adam Barth <aba...@webkit.org> wrote:

> Oh, Ok. I misunderstood your original message to say that the project
> as a whole had reached this conclusion, which certainly isn't the
> case, rather than that you personally had reached that conclusion.
>
> As for the long-term direction of the HTML parser, my guess is that
> the optimum design will be to deliver the network bytes to the parser
> directly on the parser thread. On the parser thread, we can merge
> charset decoding, input stream pre-processing, and tokenization to
> move directly from network bytes to CompactHTMLTokens. That approach
> removes a number of copies, 8-bit-to-16-bit, and 16-bit-to-8-bit
> conversions. Parsing directly into CompactHTMLTokens also means we
> won't have to do any copies or conversions at all for well-known
> strings (e.g., "div" and friends from HTMLNames).
>
> If you're about to reply complaining about the above, please save your
> complaints for another time. I realize that some parts of that design
> will be difficult or impossible to implement on some ports due to
> limitations on how they interact with their networking stack. In any
> case, I don't plan to implement that design anytime soon, and I'm sure
> we'll have plenty of time to discuss its merits in the future.
>
> Adam
>
>
> On Mon, Mar 11, 2013 at 8:56 AM, Michael Saboff <msab...@apple.com> wrote:
>> Maciej,
>>
>> *I* deemed using a character type template for the HTMLTokenizer as being
>> unwieldy. Given there was the existing SegmentedString input abstraction,
>> it made logical sense to put the 8/16 bit coding there. If I would have
>> moved the 8/16 logic into the tokenizer itself, we might have needed to do
>> 8->16 up conversions when a SegmentedString had mixed bit-ness in the
>> contained substrings. Even if that wasn't the case, the patch would have
>> been far larger and likely include tricky code for escapes.
>>
>> As I got into the middle of the 8-bit strings, I realized that not only
>> could I keep performance parity, but some of the techniques I came up with
>> offered good performance improvement. The HTMLTokenizer ended up being one
>> of those cases. This patch required a couple of reworks for performance
>> reasons and garnered a lot of discussion from various parts of the webkit
>> community. See https://bugs.webkit.org/show_bug.cgi?id=90321 for the trail.
>> Ryosuke noted that this patch was responsible for a 24% improvement in the
>> url-parser test in their bots (comment 47). My final performance results
>> are in comment 43 and show between 1 and 9% progression on the various HTML
>> parser tests.
>>
>> Adam, if you believe there is more work to be done in the HTMLTokenizer,
>> file a bug and cc me. I'm interested in hearing your thoughts.
>>
>> - Michael
>>
>> On Mar 9, 2013, at 4:24 PM, Maciej Stachowiak <m...@apple.com> wrote:
>>
>>
>> On Mar 9, 2013, at 3:05 PM, Adam Barth <aba...@webkit.org> wrote:
>>
>>
>> In retrospect, I think what I was reacting to was msaboff's statement
>> that an unnamed group of people had decided that the HTML tokenizer
>> was too unwieldy to have a dedicated 8-bit path. In particular, it's
>> unclear to me who made that decision. I certainly do not consider the
>> matter decided.
>>
>>
>> It would be good to find out who it was that said that (or more
>> specifically: "Using a character type template approach was deemed to be too
>> unwieldy for the HTML tokenizer.") so you can talk to them about it.
>>
>> Michael?
>>
>> Regards,
>> Maciej
>>
>> _______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev