Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
On Thu, Jan 10, 2013 at 9:19 PM, Maciej Stachowiak m...@apple.com wrote:
> On Jan 10, 2013, at 12:07 PM, Adam Barth aba...@webkit.org wrote:
>> On Thu, Jan 10, 2013 at 12:37 AM, Maciej Stachowiak m...@apple.com wrote:
>>> I presume from your other comments that the goal of this work is responsiveness, rather than page load speed as such. I'm excited about the potential to improve responsiveness during page loading.
>>
>> The goals are described in the first link Eric gave in his email: https://bugs.webkit.org/show_bug.cgi?id=106127#c0. Specifically:
>>
>> ---8<---
>> 1) Moving parsing off the main thread could make web pages more responsive because the main thread is available for handling input events and executing JavaScript.
>> 2) Moving parsing off the main thread could make web pages load more quickly because WebCore can do other work in parallel with parsing HTML (such as parsing CSS or attaching elements to the render tree).
>> ---8<---
>
> OK - what test (if any) will be used to test whether the page load speed goal is achieved?

All of them. :)

More seriously, Chromium runs a very large battery of performance tests continuously on a matrix of different platforms, including desktop and mobile. You can see one of the overview dashboards here: http://build.chromium.org/f/chromium/perf/dashboard/overview.html

The tests that are particularly relevant to this work are the various page load tests, both with and without simulated network delays. For iterative benchmarking, we plan to use Chromium's Telemetry framework: http://www.chromium.org/developers/telemetry. Specifically, I expect we'll work with the top_25 data set (http://src.chromium.org/viewvc/chrome/trunk/src/tools/perf/page_sets/top_25.json?view=markup), but we might use other data sets if there are particular areas we want to measure more carefully.

>>> One question: what tests are you planning to use to validate whether this approach achieves its goals of better responsiveness?
>> The tests we've run so far are also described in the first link Eric gave in his email: https://bugs.webkit.org/show_bug.cgi?id=106127. They suggest that there's a good deal of room for improvement in this area. After we have a working implementation, we'll likely re-run those experiments, and run other experiments to do an A/B comparison of the two approaches. As Filip points out, we'll likely end up with a hybrid of the two designs that's optimized for handling various workloads.
>
> I agree the test suggests there is room for improvement. From the description of how the test is run, I can think of two potential ways to improve how well it correlates with actual user-perceived responsiveness:
>
> (1) It seems to look at the max parsing pause time without considering whether any content is being shown that it's possible to interact with. If the longest pauses happen before meaningful content is visible, then reducing those pauses is unlikely to materially improve responsiveness, at least in models where web content processing happens in a separate process or thread from the UI. One possibility is to track the max parsing pause time starting from the first visually non-empty layout. That would better approximate how much actual user interaction is blocked. Consider, also, that pages might be parsing in the same process in another tab, or in a frame in the current tab.
>
> (2) It might be helpful to track max and average pause time from non-parsing sources, for the sake of comparison.

If you looked at the information Eric provided in his initial email, you might have noticed https://docs.google.com/spreadsheet/ccc?key=0AlC4tS7Ao1fIdGtJTWlSaUItQ1hYaDFDcWkzeVAxOGc#gid=0, which is precisely that.

> These might result in a more accurate assessment of the benefits.

>>> The reason I ask is that this sounds like a significant increase in complexity, so we should be very confident that there is a real and major benefit.
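Maciej's suggestion in (1), counting parser pauses toward the responsiveness metric only after the first visually non-empty layout, can be sketched as a toy metric. This is illustrative Python, not WebKit or Telemetry code; all names are made up:

```python
# Toy sketch of the proposed metric: parser pauses that happen before any
# content is visible cannot block user interaction, so only pauses after
# the first visually non-empty layout are recorded.

class PauseTracker:
    def __init__(self):
        self.first_paint_done = False
        self.max_pause_ms = 0
        self.pauses = []

    def on_first_visually_non_empty_layout(self):
        self.first_paint_done = True

    def on_parser_pause(self, duration_ms):
        # Pauses before any content is visible are ignored.
        if not self.first_paint_done:
            return
        self.pauses.append(duration_ms)
        self.max_pause_ms = max(self.max_pause_ms, duration_ms)

    def average_pause_ms(self):
        return sum(self.pauses) / len(self.pauses) if self.pauses else 0

tracker = PauseTracker()
tracker.on_parser_pause(300)                   # before first paint: ignored
tracker.on_first_visually_non_empty_layout()
tracker.on_parser_pause(50)
tracker.on_parser_pause(150)
print(tracker.max_pause_ms)                    # -> 150
print(tracker.average_pause_ms())              # -> 100.0
```

The same shape would work for Maciej's point (2): instrument non-parsing pause sources with a second tracker and compare the two distributions.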
>>> One thing I wonder about is how common it is to have enough of the page processed that the user could interact with it in principle, yet still have large parsing chunks remaining which would prevent that interaction from being smooth.
>>
>> If you're interested in reducing the complexity of the parser, I'd recommend removing the NEW_XML code. As previously discussed, that code creates significant complexity for zero benefit.
>
> Tu quoque fallacy. From your glib reply, I get the impression that you are not giving the complexity cost of multithreading due consideration. I hope that is not actually the case and I merely caught you at a bad moment or something.

I'm quite aware of the complexity of multithreaded code, having written a great deal of it for Chromium. One of the things I hope comes out of this project is a good example of how to do multithreaded processing in WebCore. Currently, every subsystem seems to roll its own threading abstractions, I think largely because there hasn't been a
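Since the thread keeps returning to a shared-nothing, message-passing style of threading, here is a toy sketch of the pattern in Python. This is illustrative only (WebCore's actual abstractions are C++ and far more involved): the tokenizer thread owns its own state and communicates with the "main thread" only by posting finished token batches to a queue, so no state is ever shared under a lock.

```python
# Toy illustration of shared-nothing message passing between a tokenizer
# thread and a main thread. Ownership of each token batch is transferred
# through the queue; neither side touches the other's state.
import queue
import threading

def tokenizer_thread(html_chunks, out_queue):
    for chunk in html_chunks:
        # Stand-in for real HTML tokenization: split into crude "tokens".
        tokens = tuple(chunk.replace("<", " <").replace(">", "> ").split())
        out_queue.put(tokens)          # hand the batch to the main thread
    out_queue.put(None)                # end-of-input marker

def main_thread(in_queue):
    dom = []
    while True:
        batch = in_queue.get()
        if batch is None:
            break
        dom.extend(batch)              # "tree building" stays on this thread
    return dom

q = queue.Queue()
t = threading.Thread(target=tokenizer_thread,
                     args=(["<p>hello", "world</p>"], q))
t.start()
dom = main_thread(q)
t.join()
print(dom)                             # -> ['<p>', 'hello', 'world', '</p>']
```

The appeal of this design, relative to fine-grained locking, is that the only synchronization point is the queue itself, which keeps the invariants easy to state and audit.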
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
Your comments here make me feel more positively towards this project. In particular, I'm happy that:

- There actually will be meaningful testing.
- You're prepared to abandon this approach if it doesn't meet its perf goals (presumably, at minimum, no regression to page load time or responsiveness while loading, and meaningful improvement to at least one of these).
- The shared-nothing message-passing approach to threading sounds likely to be a relatively less complex and fragile approach to threading than most others.

Thanks for following up. I have a comment on a tangential point that I'll split into another thread.

Cheers,
Maciej
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
Parsing, especially JS parsing, still takes a large amount of time during page loading. We tried to improve the preload scanner by moving it onto another thread, but there was no gain (except in some special cases). Synchronization between threads is surprisingly (ridiculously) costly; it is usually only worthwhile for tasks that need quite a few million instructions to execute (and tokenization takes far less in most cases). For smaller tasks, SIMD instruction sets can help, which is basically parallel execution on a single thread. Anyway, it is worth trying, but it is really challenging to make it work in practice. Good luck!

Regards,
Zoltan

> On Jan 9, 2013, at 10:04 PM, Ian Hickson i...@hixie.ch wrote:
>> On Wed, 9 Jan 2013, Eric Seidel wrote:
>>> The core goal is to reduce latency -- to free up the main thread for JavaScript and UI interaction -- which, as you correctly note, cannot be moved off of the main thread due to the single thread of execution model of the web.
>>
>> Parsing and (maybe to a lesser extent) compiling JS can be moved off the main thread, though, right? That's probably worth examining too, if it hasn't already been done.
>
> 100% agree. However, the same problem I brought up about tokenization applies here: a lot of JS functions are super cheap to parse and compile already, and the latency of doing so on the main thread is likely to be lower than the latency of chatting with another core. I suspect this could be alleviated by (1) aggressively pipelining the work, so that during page load or heavy JS use the compilation thread always has a non-empty queue of work to do, which means the latency of communication is paid only when the first compilation occurs; and (2) allowing the main thread to steal work from the compilation queue. I'm not sure how to make (2) work well.
>
> For parsing it's actually harder, since we rely heavily on the lazy parsing optimization: code is only parsed once we need it *right now* to run a function. For compilation, it's somewhat easier: the most expensive compilation step is the third-tier optimizing JIT, and we can delay this as long as we want, though the longer we delay it, the longer we spend running slower code.
>
> Hence, to make parsing concurrent, the main problem is figuring out how to do predictive parsing: have a concurrent thread start parsing something just before we need it. Without predictive parsing, making it concurrent would be a guaranteed loss, since the main thread would just be stuck waiting for the other thread to finish. To make optimized compiles concurrent without a regression, the main problem is ensuring that, in those cases where we believe the time taken to compile the function will be smaller than the time taken to wake the concurrent thread, we instead just compile it on the main thread right away. Though, if we could predict that a function was going to get hot in the future, we could speculatively tell a concurrent thread to compile it, knowing that it won't wake up and do so until exactly when we would have otherwise invoked the compiler on the main thread (that is, it'll wake up and start compiling once the main thread has executed the function enough times to get good profiling data).
>
> Anyway, you're absolutely right that this is an area that should be explored.
>
> -F
>
>> --
>> Ian Hickson
>> http://ln.hixie.ch/
>> Things that are impossible just take longer.

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
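Filip's point (2), letting the main thread steal a job it needs right now rather than wait for the compilation thread, can be sketched as a toy model. This is illustrative Python with made-up names, not JavaScriptCore code:

```python
# Toy model of a compilation queue with main-thread work stealing: a
# background thread drains queued jobs, but the main thread may remove
# and compile a job itself when it needs the result immediately.
import threading
from collections import deque

class CompileQueue:
    def __init__(self):
        self.jobs = deque()
        self.lock = threading.Lock()
        self.compiled = {}

    def enqueue(self, name, source):
        with self.lock:
            self.jobs.append((name, source))

    def _compile(self, name, source):
        self.compiled[name] = f"code({source})"   # stand-in for real JIT work

    def drain_on_background_thread(self):
        while True:
            with self.lock:
                if not self.jobs:
                    return
                name, source = self.jobs.popleft()
            self._compile(name, source)           # compile outside the lock

    def steal(self, wanted):
        # Main thread: compile `wanted` immediately if it is still queued.
        with self.lock:
            job = next((j for j in self.jobs if j[0] == wanted), None)
            if job is None:
                return self.compiled.get(wanted)
            self.jobs.remove(job)
        self._compile(*job)
        return self.compiled[wanted]

q = CompileQueue()
q.enqueue("f", "function f(){}")
q.enqueue("g", "function g(){}")
code_f = q.steal("f")                  # main thread needs f right now
worker = threading.Thread(target=q.drain_on_background_thread)
worker.start()                         # background thread compiles the rest
worker.join()
print(code_f, sorted(q.compiled))
```

The hard part Filip identifies is not the mechanism above but the policy: deciding when stealing (or compiling inline from the start) is cheaper than waking the worker at all.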
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
https://bugs.webkit.org/show_bug.cgi?id=63531

The work was done by Zoltan Horvath and Balazs Kelemen.

Regards,
Zoltan

Hi Zoltan,

I would be curious how you did the synchronization. I've had some luck reducing synchronization costs before. Was the patch ever uploaded anywhere?

-F
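Zoltan's observation that per-task synchronization dwarfs tiny tasks is easy to reproduce with a toy benchmark. Python timings say little about WebCore's absolute costs, but the shape of the comparison, inline execution versus a synchronous round trip through a worker thread per tiny task, is typically dramatic on any runtime:

```python
# Toy comparison: run a tiny task inline vs. shipping each task through
# queues to a worker thread and waiting for the result. The per-task
# queue/wakeup overhead vastly exceeds the work itself.
import queue
import threading
import time

TASKS = 2_000

def tiny_task(x):
    return x + 1                       # a few instructions of "work"

# 1) Inline on the "main thread".
start = time.perf_counter()
inline_result = sum(tiny_task(i) for i in range(TASKS))
inline_s = time.perf_counter() - start

# 2) One synchronous round trip per task through a worker thread.
in_q, out_q = queue.Queue(), queue.Queue()

def worker():
    while True:
        item = in_q.get()
        if item is None:
            return
        out_q.put(tiny_task(item))

t = threading.Thread(target=worker)
t.start()
start = time.perf_counter()
threaded_result = 0
for i in range(TASKS):
    in_q.put(i)
    threaded_result += out_q.get()     # block until the worker answers
threaded_s = time.perf_counter() - start
in_q.put(None)
t.join()

print(f"inline:   {inline_s:.4f}s")
print(f"threaded: {threaded_s:.4f}s")
```

Batching many tasks per message, as the tokenizer plan proposes, amortizes exactly this per-message overhead.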
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
I presume from your other comments that the goal of this work is responsiveness, rather than page load speed as such. I'm excited about the potential to improve responsiveness during page loading.

One question: what tests are you planning to use to validate whether this approach achieves its goals of better responsiveness? The reason I ask is that this sounds like a significant increase in complexity, so we should be very confident that there is a real and major benefit.

One thing I wonder about is how common it is to have enough of the page processed that the user could interact with it in principle, yet still have large parsing chunks remaining which would prevent that interaction from being smooth. Another thing I wonder about is whether yielding to the event loop more aggressively could achieve a similar benefit at a much lower complexity cost. Having a test to drive the work would allow us to answer these types of questions. (It may also be that the test data you cited would already answer these questions but I didn't sufficiently understand it; if so, further explanation would be appreciated.)

Regards,
Maciej

On Jan 9, 2013, at 6:00 PM, Eric Seidel e...@webkit.org wrote:

> We're planning to move parts of the HTML Parser off of the main thread: https://bugs.webkit.org/show_bug.cgi?id=106127
>
> This is driven by our testing showing that HTML parsing on mobile is slow and long (causing user-visible delays averaging 10 frames / 150 ms): https://bug-106127-attachments.webkit.org/attachment.cgi?id=182002 Complete data can be found at [1].
>
> Mozilla moved their parser onto a separate thread during their HTML5 parser re-write: https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading
>
> We plan to take a slightly simpler approach, moving only Tokenizing off of the main thread: https://docs.google.com/drawings/d/1hwYyvkT7HFLAtTX_7LQp2lxA6LkaEWkXONmjtGCQjK0/edit The left is our current design, the middle is a tokenizer-only design, and the right is more like Mozilla's threaded-parser design.
>
> Profiling shows Tokenizing accounts for about 10x the number of samples as TreeBuilding, including Antti's recent testing (0.5% vs. 3%): https://bugs.webkit.org/show_bug.cgi?id=106127#c10 If, after we do this, we measure and find ourselves still spending a lot of main-thread time parsing, we'll move the TreeBuilder too. :) (This work is a nicely separable subset of the larger work needed to move the TreeBuilder.)
>
> We welcome your thoughts and comments.
>
> 1. https://docs.google.com/spreadsheet/ccc?key=0AlC4tS7Ao1fIdGtJTWlSaUItQ1hYaDFDcWkzeVAxOGc#gid=0 (Epic thanks to Nat Duca for helping us collect that data.)
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
When loading web pages we are very frequently in a situation where we already have the source data (HTML text here, but the same applies to preloaded JavaScript, CSS, images, ...) and know we are likely to need it soon, but can't actually utilize it for an indeterminate time. This happens because pending external JS resources block the main parser (and pending CSS resources block JS execution) for web compatibility reasons. In this situation it makes sense to start processing the resources we have into forms that are faster to use when they are eventually needed (like a token stream here).

One thing we already do when the main parser gets blocked is preload scanning. We look through the unparsed HTML source we have and trigger loads for any resources found. It would be beneficial if this happened off the main thread. We could do it when new data arrives, in parallel with JS execution and other time-consuming engine work, potentially triggering resource loads earlier.

I think a good first step here would be to share the tokens between the preload scanner and the main parser and worry about the threading part afterwards. We often parse the HTML source more or less twice, so this is an unquestionable win.

antti

On Thu, Jan 10, 2013 at 7:41 AM, Filip Pizlo fpi...@apple.com wrote:

> I think your biggest challenge will be ensuring that the latency of shoving things to another core and then shoving them back will be smaller than the latency of processing those same things on the main thread. For small documents, I expect concurrent tokenization to be a pure regression, because the latency of waking up another thread to do just a small bit of work, plus the added cost of whatever synchronization operations will be needed to ensure safety, will involve more total work than just tokenizing locally.
>
> We certainly see this in the JSC parallel GC, and in line with traditional parallel GC design, we ensure that parallel threads only kick in when the main thread is unable to keep up with the work that it has created for itself. Do you have a vision for how to implement a similar self-throttling, where tokenizing continues on the main thread so long as it is cheap to do so?
>
> -Filip
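One shape an answer to Filip's self-throttling question could take is a threshold policy: tokenize inline while the pending input is small (so tiny documents never pay any thread latency), and offload only once the parser is clearly backed up. Below is a toy Python sketch; the threshold, the class, and the thread-per-chunk handoff are purely illustrative (a real design would keep a persistent worker and would not block waiting for it):

```python
# Toy threshold policy: small pending input is tokenized inline with no
# synchronization at all; large backlogs are handed to a worker thread.
import queue
import threading

OFFLOAD_THRESHOLD = 64 * 1024          # bytes of pending input; made up

class AdaptiveTokenizer:
    def __init__(self):
        self.tokens = []
        self.inline_bytes = 0
        self.offloaded_bytes = 0

    def _tokenize(self, chunk):
        return chunk.split()           # stand-in for HTML tokenization

    def feed(self, chunk, pending_bytes):
        if pending_bytes < OFFLOAD_THRESHOLD:
            # Cheap case: do it right here, no thread wakeup.
            self.inline_bytes += len(chunk)
            self.tokens.extend(self._tokenize(chunk))
        else:
            # Backed up: ship the chunk to a worker thread. (A real
            # implementation would pipeline rather than join per chunk.)
            self.offloaded_bytes += len(chunk)
            result_q = queue.Queue()
            t = threading.Thread(
                target=lambda: result_q.put(self._tokenize(chunk)))
            t.start()
            t.join()
            self.tokens.extend(result_q.get())

tok = AdaptiveTokenizer()
tok.feed("<p>small doc</p>", pending_bytes=1_000)          # stays inline
tok.feed("<div>big backlog</div>", pending_bytes=500_000)  # offloaded
print(tok.inline_bytes, tok.offloaded_bytes)
```

This mirrors the parallel-GC design Filip describes: helper threads only engage when the main thread has demonstrably fallen behind its own workload.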
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
The data Eric and Adam were using comes from a python library a few of us have been developing called telemetry. Its basically a bunch of python that lets us write performance tests against any browser that speaks the inspector websocket protocol. We're using it a lot of should we parallelize X questions, as well as regression-style have our changes to X stayed a win over time? They might have other ways in mind to obtain this data that is more webkit-y, but I figure a bit on how we got this far might be useful for this mailing list. Roughly, telemetry scripts connect up to a host and port where you've arranged to have an inspector websocket listening, e.g. $MY_PHONE_IP:9222, or google-chrome --remote-debugging-port=9222 telemetry --browser=$LOCALHOST:9222. Once that's established, we have communication with WebCore's InspectorAgent, and assuming we trust the agent, can do some pretty powerful stuff from there. The benchmark being discussed here [webkit_benchmark] navigates the browser from page to page, enabling inspector's TimelineAgent as it does in order to get performance data about the page load. We then postprocess that data stream into a human consumable csv and there is [some amount] of rejoicing. Assuming we trust inspector timeline [Pavel's done a number of fixes to help us trust it more!] this gets pretty clean results, pretty easily. A key challenge with telemetry has been getting stable runs on real world sites. The archive.org technqiues are cool, but they dont capture some of the big ones, like a logged-in gmail account. We've addressed this using tonyg and simonjam's http://code.google.com/p/web-page-replay/. If the browser under test supports web page replay [~= redirecting dns requests to the replay server instead of the real site], then you can get stable, repeatable runs against super complex real world sites --- its worked on every site we've tried so far. 
The core telemetry framework is here: http://src.chromium.org/chrome/trunk/src/tools/telemetry/ It's in the Chromium repo, but please don't hold that against it --- it's movable, given interest. The actual webkit benchmark is pretty simple, because most of the functionality comes from telemetry: https://codereview.chromium.org/11791043/ With the patch above landed, obtaining the benchmarking results that Eric got against chrome should be ~= getting a telemetry checkout and doing: ./run_multipage_benchmarks --browser=canary webkit_benchmark page_sets/top_25.json Or if you had an android with chrome on it: ./run_multipage_benchmarks --browser=android-chrome webkit_benchmark page_sets/top_25.json Anyway, I'll leave it to Eric/Adam to speak to how this maps back into the WebKit ecosystem. The use of the inspector protocol makes it a theoretical possibility on other ports, but I know some people get nervous (or run away angrily!) when they hear that we're using Inspector as a perf data source. :) - Nat On Thu, Jan 10, 2013 at 1:44 AM, Antti Koivisto koivi...@iki.fi wrote: When loading web pages we are very frequently in a situation where we already have the source data (HTML text here, but the same applies to preloaded JavaScript, CSS, images, ...) and know we are likely to need it soon, but can't actually utilize it for an indeterminate time. This happens because pending external JS resources block the main parser (and pending CSS resources block JS execution) for web compatibility reasons. In this situation it makes sense to start processing the resources we have into forms that are faster to use when they are eventually actually needed (like the token stream here). One thing we already do when the main parser gets blocked is preload scanning. We look through the unparsed HTML source we have and trigger loads for any resources found. It would be beneficial if this happened off the main thread. 
We could do it when new data arrives in parallel with JS execution and other time consuming engine work, potentially triggering resource loads earlier. I think a good first step here would be to share the tokens between the preload scanner and the main parser and worry about the threading part afterwards. We often parse the HTML source more or less twice so this is an unquestionable win. antti On Thu, Jan 10, 2013 at 7:41 AM, Filip Pizlo fpi...@apple.com wrote: I think your biggest challenge will be ensuring that the latency of shoving things to another core and then shoving them back will be smaller than the latency of processing those same things on the main thread. For small documents, I expect concurrent tokenization to be a pure regression because the latency of waking up another thread to do just a small bit of work, plus the added cost of whatever synchronization operations will be needed to ensure safety, will involve more total work than just tokenizing locally. We certainly see this in the JSC parallel GC, and in line with traditional parallel GC design, we ensure that parallel threads only
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
On Thu, Jan 10, 2013 at 8:37 AM, Maciej Stachowiak m...@apple.com wrote: The reason I ask is that this sounds like a significant increase in complexity, so we should be very confident that there is a real and major benefit. One thing I wonder about is how common it is to have enough of the page processed that the user could interact with it in principle, yet still have large parsing chunks remaining which would prevent that interaction from being smooth. Another thing I wonder about is whether yielding to the event loop more aggressively could achieve a similar benefit at a much lower complexity cost. I don't want to let this point of Maciej's slip away: on mobile we may have fewer cores than desktop, and we're paying a pretty high complexity burden for multiple threads already; some of Nat's awesome recent work in Chromium is too multithreaded for my comfort. I'd back-of-enveloped yielding during page layout and guessed it wasn't worthwhile, but do we know that yielding during parsing isn't? Tom
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
Thanks everyone for your feedback. Detailed responses inline. On Wed, Jan 9, 2013 at 9:41 PM, Filip Pizlo fpi...@apple.com wrote: I think your biggest challenge will be ensuring that the latency of shoving things to another core and then shoving them back will be smaller than the latency of processing those same things on the main thread. Yes. That's something we know we have to worry about. Given that we need to retain the ability to parse HTML on the main thread to handle document.write and innerHTML, we should be able to easily do A/B comparisons to make sure we understand any performance trade-offs that might arise. For small documents, I expect concurrent tokenization to be a pure regression because the latency of waking up another thread to do just a small bit of work, plus the added cost of whatever synchronization operations will be needed to ensure safety, will involve more total work than just tokenizing locally. Once we have the ability to tokenize on a background thread, we can examine cases like these and heuristically decide whether to use the background thread or not at runtime. As I wrote above, we'll need this ability anyway, so keeping the ability to optimize these cases shouldn't add any new constraints to the design. We certainly see this in the JSC parallel GC, and in line with traditional parallel GC design, we ensure that parallel threads only kick in when the main thread is unable to keep up with the work that it has created for itself. Do you have a vision for how to implement a similar self-throttling, where tokenizing continues on the main thread so long as it is cheap to do so? It's certainly something we can tune in the optimization phase. I don't think we need a particular vision to be able to do it. Given that we want to implement speculative parsing (to replace preload scanning---more on this below), we'll already have the ability to checkpoint and restore the tokenizer state across threads. 
Once you have that primitive, it's easy to decide whether to continue tokenization on the main thread or on a background thread. On Wed, Jan 9, 2013 at 10:04 PM, Ian Hickson i...@hixie.ch wrote: Parsing and (maybe to a lesser extent) compiling JS can be moved off the main thread, though, right? That's probably worth examining too, if it hasn't already been done. Yes, once we have the tokenizer running on a background thread, that opens up the possibility of parsing other sorts of data on the background thread as well. For example, when the tokenizer encounters an inline script block, you could imagine parsing the script on the background thread as well so that the main thread has less work to do. (You could also imagine making these optimizations without a background tokenizer, but the design constraints would be a bit different.) On Thu, Jan 10, 2013 at 12:11 AM, Zoltan Herczeg zherc...@webkit.org wrote: Parsing, especially JS parsing, still takes a large amount of time on page loading. We tried to improve the preload scanner by moving it into another thread, but there was no gain (except in some special cases). Synchronization between threads is surprisingly (ridiculously) costly, usually only worthwhile for those tasks which need quite a few million instructions to be executed (and tokenization takes far less in most cases). For smaller tasks, SIMD instruction sets can help, which is basically parallel execution on a single thread. Anyway it is worth trying, but it is really challenging to make it work in practice. Good luck! This is something we're worried about and will need to be careful about. In the design we're proposing, preload scanning is replaced by speculative parsing, so the overhead of the preload scanner is removed entirely. The way this works is as follows: When running on the background thread, the tokenizer produces a queue of PickledTokens. As these tokens are queued, we can scan them to kick off any preloads that we find. 
Whenever the tokenizer queues a token that creates a new insertion point (in the terminology of the HTML specification), the tokenizer checkpoints itself but continues tokenizing speculatively. (Notice that tokens produced in this situation are still scanned for preloads but might not ever actually result in DOM being constructed.) After the main thread has processed the token that created the insertion point, if no characters were inserted, the main thread continues processing PickledTokens that were created speculatively. If some characters were inserted, the main thread instead instructs the tokenizer to roll back to that checkpoint and continue tokenizing in a new state. In this case, the queue of speculative tokens is discarded. Notice that in the common case, we execute JavaScript and tokenize in parallel, something that's not possible with a main-thread tokenizer. Once the script is done executing, we expect it to be common to be able to resume tree building immediately as the
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
On Jan 10, 2013, at 12:07 PM, Adam Barth aba...@webkit.org wrote: On Thu, Jan 10, 2013 at 12:37 AM, Maciej Stachowiak m...@apple.com wrote: I presume from your other comments that the goal of this work is responsiveness, rather than page load speed as such. I'm excited about the potential to improve responsiveness during page loading. The goals are described in the first link Eric gave in his email: https://bugs.webkit.org/show_bug.cgi?id=106127#c0. Specifically: ---8--- 1) Moving parsing off the main thread could make web pages more responsive because the main thread is available for handling input events and executing JavaScript. 2) Moving parsing off the main thread could make web pages load more quickly because WebCore can do other work in parallel with parsing HTML (such as parsing CSS or attaching elements to the render tree). OK - what test (if any) will be used to test whether the page load speed goal is achieved? ---8--- One question: what tests are you planning to use to validate whether this approach achieves its goals of better responsiveness? The tests we've run so far are also described in the first link Eric gave in his email: https://bugs.webkit.org/show_bug.cgi?id=106127. They suggest that there's a good deal of room for improvement in this area. After we have a working implementation, we'll likely re-run those experiments and run other experiments to do an A/B comparison of the two approaches. As Filip points out, we'll likely end up with a hybrid of the two designs that's optimized for handling various work loads. I agree the test suggests there is room for improvement. From the description of how the test is run, I can think of two potential ways to improve how well it correlates with actual user-perceived responsiveness: (1) It seems to look at the max parsing pause time without considering whether there's any content being shown that it's possible to interact with. 
If the longest pauses happen before meaningful content is visible, then reducing those pauses is unlikely to actually materially improve responsiveness, at least in models where web content processing happens in a separate process or thread from the UI. One possibility is to track the max parsing pause time starting from the first visually non-empty layout. That would better approximate how much actual user interaction is blocked. (2) It might be helpful to track max and average pause time from non-parsing sources, for the sake of comparison. These might result in a more accurate assessment of the benefits. The reason I ask is that this sounds like a significant increase in complexity, so we should be very confident that there is a real and major benefit. One thing I wonder about is how common it is to have enough of the page processed that the user could interact with it in principle, yet still have large parsing chunks remaining which would prevent that interaction from being smooth. If you're interested in reducing the complexity of the parser, I'd recommend removing the NEW_XML code. As previously discussed, that code creates significant complexity for zero benefit. Tu quoque fallacy. From your glib reply, I get the impression that you are not giving the complexity cost of multithreading due consideration. I hope that is not actually the case and I merely caught you at a bad moment or something. (And also we agreed to a drop-dead date to remove the code, which has either passed or is very close.) Another thing I wonder about is whether yielding to the event loop more aggressively could achieve a similar benefit at a much lower complexity cost. Yielding to the event loop more could reduce the ParseHTML_max time, but it cannot reduce the ParseHTML time. Generally speaking, yielding to the event loop is a trade-off between throughput (i.e., page load time) and responsiveness. 
Moving work to a background thread should let us achieve a better trade-off between these quantities than we're likely to be able to achieve by tuning the yield parameter alone. I agree that is possible. But it also seems like the improvements that don't impose the complexity and hazards of multithreading in this area are worth trying first: things such as retuning yielding and replacing the preload scanner with (non-threaded) speculative pre-tokenizing, as suggested by Antti. That would let us better assess the benefits of the threading itself. Having a test to drive the work would allow us to answer these types of questions. (It may also be that the test data you cited would already answer these questions but I didn't sufficiently understand it; if so, further explanation would be appreciated.) If you're interested in building such a test, I would be interested in hearing the results. We don't plan to build such a test at this time. If you're actually planning to make a significant complexity-imposing architectural change
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
Adam, Thanks for your detailed reply. Seems like you guys have a pretty good plan in place. I hope this works and produces a performance improvement. That being said, this does look like a sufficiently complex work item that success is far from guaranteed. So to play devil's advocate, what is your plan for if this doesn't work out? I.e. are we talking about adding a bunch of threading support code in the optimistic hope that it makes things run fast, and then forgetting about it if it doesn't? Or are you prepared to rip out any complexity that got landed if this does not ultimately live up to its promise? Or is this going to be one giant patch that only lands if it works? I'm also trying to understand what would happen during the interim when this work is incomplete, we have thread-related goop in some critical paths, and we don't yet know if the WIP code is ever going to result in a speedup. And also, what will happen some time from now if that code is never successfully optimized to the point where it is worth enabling. I appreciate that this sort of question can be asked of any performance work, but in this particular case my gut tells me that this is going to result in significantly more complexity than the usual incremental performance work. So it's good to understand what plan B is. Probably a good answer to this sort of question would address some fears that people may have. If this work does lead to a performance win then probably everyone will be happy. But if it doesn't, then it would be great to have a plan of retreat. -Filip On Jan 10, 2013, at 12:07 PM, Adam Barth aba...@webkit.org wrote: Thanks everyone for your feedback. Detailed responses inline. On Wed, Jan 9, 2013 at 9:41 PM, Filip Pizlo fpi...@apple.com wrote: I think your biggest challenge will be ensuring that the latency of shoving things to another core and then shoving them back will be smaller than the latency of processing those same things on the main thread. Yes. 
[webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
We're planning to move parts of the HTML Parser off of the main thread: https://bugs.webkit.org/show_bug.cgi?id=106127 This is driven by our testing showing that HTML parsing on mobile is slow and happens in long chunks (causing user-visible delays averaging 10 frames / 150ms). https://bug-106127-attachments.webkit.org/attachment.cgi?id=182002 Complete data can be found at [1]. Mozilla moved their parser onto a separate thread during their HTML5 parser re-write: https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading We plan to take a slightly simpler approach, moving only Tokenizing off of the main thread: https://docs.google.com/drawings/d/1hwYyvkT7HFLAtTX_7LQp2lxA6LkaEWkXONmjtGCQjK0/edit The left is our current design, the middle is a tokenizer-only design, and the right is more like Mozilla's threaded-parser design. Profiling shows Tokenizing accounts for about 10x the number of samples as TreeBuilding, including Antti's recent testing (.5% vs. 3%): https://bugs.webkit.org/show_bug.cgi?id=106127#c10 If after we do this we measure and find ourselves still spending a lot of main-thread time parsing, we'll move the TreeBuilder too. :) (This work is a nicely separable sub-set of the larger work needed to move the TreeBuilder.) We welcome your thoughts and comments. 1. https://docs.google.com/spreadsheet/ccc?key=0AlC4tS7Ao1fIdGtJTWlSaUItQ1hYaDFDcWkzeVAxOGc#gid=0 (Epic thanks to Nat Duca for helping us collect that data.)
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
On Wed, Jan 9, 2013 at 6:00 PM, Eric Seidel e...@webkit.org wrote: We're planning to move parts of the HTML Parser off of the main thread: https://bugs.webkit.org/show_bug.cgi?id=106127 This is driven by our testing showing that HTML parsing on mobile is slow and happens in long chunks (causing user-visible delays averaging 10 frames / 150ms). https://bug-106127-attachments.webkit.org/attachment.cgi?id=182002 Complete data can be found at [1]. In case it's not clear from that link, the ParseHTML column is the total amount of time the web inspector attributes to HTML parsing when loading those URLs on a Nexus 7 using a top-of-tree build of Chromium's content_shell (similar to WebKitTestRunner). The HTML parser parses data a chunk at a time, which means the total time doesn't tell the whole story. The ParseHTML_max column shows the largest single block of time spent in the HTML parser, which is more of a measure of the main-thread jank caused by the parser. Antti has pointed out that the inspector isn't the best source of data. He measured total time using Instruments, and got numbers that are consistent (within a factor of 2) with the inspector measurements. (We were using different data sets, so we wouldn't expect perfect agreement even if we were measuring precisely the same thing.) Adam
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
How will we ensure thread safety? Even at just the tokenizing level, don't we use AtomicString? AtomicString isn't threadsafe wrt StringImpl IIRC, so this seems like it would add a world of hurt. I realise it's been a long time since I've worked on this, so it's completely possible that I'm not aware of the current behaviour. That aside, I question what the benefit of this will be. All those cases where we've started parsing html are intrinsically tied to the web's general single-thread-of-execution model, which implies that even if we do push parsing into a separate thread, we'll just end up with the ui thread blocked on the parsing thread, which doesn't seem hugely superior. What is the objective here? To improve performance, add parallelism, or reduce latency? --Oliver
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
On Wed, Jan 9, 2013 at 6:38 PM, Oliver Hunt oli...@apple.com wrote: How will we ensure thread safety? Even at just the tokenizing level don't we use AtomicString? AtomicString isn't threadsafe wrt StringImpl IIRC so this seems like it would add a world of hurt. AtomicString is already usable from other threads (http://trac.webkit.org/changeset/38094), but you are correct that this is the core concern! PickledToken (or whatever it's called) will have to be written very carefully in order to minimize/eliminate copies while still guaranteeing thread safety. The correct design and handling of PickledToken is the entire question of this whole endeavor. I realise it's been a long time since I've worked on this so it's completely possible that I'm not aware of the current behavior. That aside I question what the benefit of this will be. All those cases where we've started parsing html are intrinsically tied to the web's general single thread of execution model, which implies that even if we do push parsing into a separate thread we'll just end up with the ui thread blocked on the parsing thread which doesn't seem hugely superior. What is the objective here? To improve performance, add parallelism, or reduce latency? The core goal is to reduce latency -- to free up the main thread for JavaScript and UI interaction -- which, as you correctly note, cannot be moved off of the main thread due to the single-thread-of-execution model of the web. One could view the pre-load scanner as a layman's attempt at this type of tokenize-asynchronously approach. This model gets preload scanning for free, and can easily answer wkb.ug/90751's request for speculative tokenizing of the entire document. (We just have to save markers before every script token, since if the script uses document.write, any tokens after </script> become invalid.) I should also note that not all HTML parsing can be moved off of the main thread. innerHTML, for example, would still be done entirely on the main thread. 
I would imagine that when we land this on trunk, it would be behind a feature flag, and ports could opt in to the threaded-parsing path, as we must maintain the main-thread parsing ability for innerHTML anyway. 
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
On Wed, Jan 9, 2013 at 7:07 PM, Eric Seidel e...@webkit.org wrote: On Wed, Jan 9, 2013 at 6:38 PM, Oliver Hunt oli...@apple.com wrote: How will we ensure thread safety? Even at just the tokenizing level don't we use AtomicString? AtomicString isn't threadsafe wrt StringImpl IIRC so this seems like it would add a world of hurt. AtomicString is already usable from other threads (http://trac.webkit.org/changeset/38094), but you are correct that this is the core concern! PickledToken (or whatever it's called) will have to be written very carefully in order to minimize/eliminate copies, while still guaranteeing thread safety. The correct design and handling of PickledToken is the entire question of this whole endeavor. That is probably what you meant, but just in case... AtomicString can be used from different threads, but is not thread safe. You must make an isolatedCopy() for message passing if you keep a reference to the String in your thread. Not the end of the world, but something to be aware of :) Cheers, Benjamin
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
On Wed, Jan 9, 2013 at 7:35 PM, Benjamin Poulain benja...@webkit.org wrote: On Wed, Jan 9, 2013 at 7:07 PM, Eric Seidel e...@webkit.org wrote: On Wed, Jan 9, 2013 at 6:38 PM, Oliver Hunt oli...@apple.com wrote: How will we ensure thread safety? Even at just the tokenizing level don't we use AtomicString? AtomicString isn't threadsafe wrt StringImpl IIRC, so this seems like it should add a world of hurt. AtomicString is already usable from other threads (http://trac.webkit.org/changeset/38094), but you are correct this is the core concern! PickledToken (or whatever it's called) will have to be written very carefully in order to minimize/eliminate copies, while still guaranteeing thread safety. The correct design and handling of PickledToken is the entire question of this whole endeavor. That is probably what you meant, but just in case... AtomicString can be used from different threads, but is not thread safe. You must make an isolatedCopy() for message passing if you keep a reference to the String in your thread. Not the end of the world, but something to be aware of :)

Yeah, we're aware of this issue. We'll probably end up doing something slightly customized for this use case. For example, many of the AtomicStrings used in parsing are tag and attribute names that are known at compile time (e.g., div, href). When moving these strings back to the main thread, we need only the hash of the string and not the underlying characters in the string (because we know statically that the hash will exist in the main thread's atomic string table). It's tempting to optimize these things prematurely. We'll likely start with a simple approach that makes copies and then optimize away the copies over the development of the feature as indicated by profiles.

Adam

___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
I think your biggest challenge will be ensuring that the latency of shoving things to another core and then shoving them back will be smaller than the latency of processing those same things on the main thread. For small documents, I expect concurrent tokenization to be a pure regression because the latency of waking up another thread to do just a small bit of work, plus the added cost of whatever synchronization operations will be needed to ensure safety, will involve more total work than just tokenizing locally. We certainly see this in the JSC parallel GC, and in line with traditional parallel GC design, we ensure that parallel threads only kick in when the main thread is unable to keep up with the work that it has created for itself. Do you have a vision for how to implement a similar self-throttling, where tokenizing continues on the main thread so long as it is cheap to do so?

-Filip

On Jan 9, 2013, at 6:00 PM, Eric Seidel e...@webkit.org wrote: We're planning to move parts of the HTML Parser off of the main thread: https://bugs.webkit.org/show_bug.cgi?id=106127 This is driven by our testing showing that HTML parsing on mobile is slow and long (causing user-visible delays averaging 10 frames / 150ms): https://bug-106127-attachments.webkit.org/attachment.cgi?id=182002 Complete data can be found at [1]. Mozilla moved their parser onto a separate thread during their HTML5 parser re-write: https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading We plan to take a slightly simpler approach, moving only Tokenizing off of the main thread: https://docs.google.com/drawings/d/1hwYyvkT7HFLAtTX_7LQp2lxA6LkaEWkXONmjtGCQjK0/edit The left is our current design, the middle is a tokenizer-only design, and the right is more like Mozilla's threaded-parser design. Profiling shows Tokenizing accounts for about 10x the number of samples as TreeBuilding, including Antti's recent testing (0.5% vs. 3%): https://bugs.webkit.org/show_bug.cgi?id=106127#c10 If after we do this we measure and find ourselves still spending a lot of main-thread time parsing, we'll move the TreeBuilder too. :) (This work is a nicely separable subset of the larger work needed to move the TreeBuilder.) We welcome your thoughts and comments.

1. https://docs.google.com/spreadsheet/ccc?key=0AlC4tS7Ao1fIdGtJTWlSaUItQ1hYaDFDcWkzeVAxOGc#gid=0 (Epic thanks to Nat Duca for helping us collect that data.)

___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
On Wed, 9 Jan 2013, Eric Seidel wrote: The core goal is to reduce latency -- to free up the main thread for JavaScript and UI interaction -- which as you correctly note, cannot be moved off of the main thread due to the single thread of execution model of the web. Parsing and (maybe to a lesser extent) compiling JS can be moved off the main thread, though, right? That's probably worth examining too, if it hasn't already been done. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread
On Jan 9, 2013, at 10:04 PM, Ian Hickson i...@hixie.ch wrote: On Wed, 9 Jan 2013, Eric Seidel wrote: The core goal is to reduce latency -- to free up the main thread for JavaScript and UI interaction -- which as you correctly note, cannot be moved off of the main thread due to the single thread of execution model of the web. Parsing and (maybe to a lesser extent) compiling JS can be moved off the main thread, though, right? That's probably worth examining too, if it hasn't already been done.

100% agree. However, the same problem I brought up about tokenization applies here: a lot of JS functions are super cheap to parse and compile already, and the latency of doing so on the main thread is likely to be lower than the latency of chatting with another core. I suspect this could be alleviated by (1) aggressively pipelining the work, where during page load or during heavy JS use the compilation thread always has a non-empty queue of work to do; this will mean that the latency of communication is paid only when the first compilation occurs, and (2) allowing the main thread to steal work from the compilation queue. I'm not sure how to make (2) work well.

For parsing it's actually harder since we rely heavily on the lazy parsing optimization: code is only parsed once we need it *right now* to run a function. For compilation, it's somewhat easier: the most expensive compilation step is the third-tier optimizing JIT; we can delay this as long as we want, though the longer we delay it, the longer we spend running slower code. Hence, to make parsing concurrent, the main problem is figuring out how to do predictive parsing: have a concurrent thread start parsing something just before we need it. Without predictive parsing, making it concurrent would be a guaranteed loss since the main thread would just be stuck waiting for the thread to finish.
To make optimized compiles concurrent without a regression, the main problem is ensuring that in those cases where we believe that the time taken to compile the function will be smaller than the time taken to wake the concurrent thread, we will instead just compile it on the main thread right away. Though, if we could predict that a function was going to get hot in the future, we could speculatively tell a concurrent thread to compile it, fully knowing that it won't wake up and do so until exactly when we would have otherwise invoked the compiler on the main thread (that is, it'll wake up and start compiling it once the main thread has executed the function enough times to get good profiling data). Anyway, you're absolutely right that this is an area that should be explored.

-F

___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev