This document lives at (viewable publicly, but not editable except for specific people): https://docs.google.com/document/d/1TdDbjYFpFq4TH06vh-e-P8qoLcQ8SpNMIudKr3x-7ho/edit
I'm happy to respond to comments here, and if there are significant changes I'll add them to the doc and comment here. I have patches under development to allow us to measure this at low overhead and back-infer the delays.

** Measuring interactivity in a browser (especially during load) **

When loading a page, the page is frequently unresponsive to the user during the load, even after the page “appears” to have finished loading. This is a major problem for pages and for users, and it’s hard to measure. Interaction can be clicking on a link, scrolling, interacting with JS buttons on the page, etc.

Existing measurements:

The primary existing measurement is Time-To-Interactive (TTI), along with a subset of it, Time-To-First-Interactive (TTFI, now renamed First CPU Idle), which is currently implemented by Firefox. Another, indirect measurement is First Input Delay (FID), which can only be measured in the field, since it measures the delay when a user happens to first try to interact with the page.

TTI/TTFI have issues: the 50ms “Long Task” cliff (which means that stalls slightly under it won’t be seen as an issue, while runs over it will all be treated the same regardless of how bad the jank is); the arbitrary “max 2 network connections” in TTI; the arbitrary 5s window without a long task or 3+ network connections; etc. These metrics do have some value, but for more-complex pages or lower-end hardware what they tell you is hard to interpret.

TTI tries to capture not only jank, but also to determine indirectly when a page has loaded all the resources it needs to provide functionality to the user. Note that jank and a page having complete functionality are really two different things; TTI conflates them. Worse, TTI has varying definitions: Google’s definition uses 5s and 2 network requests, but Akamai uses very different parameters.
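As a minimal sketch of the “cliff” problem described above (all function names and values here are illustrative, not any spec): a count-of-long-tasks metric ignores a 49ms stall entirely, and scores a 51ms stall and a 500ms stall identically, while a duration-weighted sum distinguishes them.

```javascript
// Illustrative sketch of the 50ms "Long Task" cliff.
const LONG_TASK_MS = 50;

// Threshold metric: how many tasks crossed the cliff.
function countLongTasks(taskDurationsMs) {
  return taskDurationsMs.filter(d => d > LONG_TASK_MS).length;
}

// A duration-weighted alternative: total time spent beyond the threshold,
// which distinguishes "barely over" from "badly janky".
function excessBlockingMs(taskDurationsMs) {
  return taskDurationsMs
    .map(d => Math.max(0, d - LONG_TASK_MS))
    .reduce((a, b) => a + b, 0);
}

const runA = [49, 49, 49];   // never flagged, though ~150ms of main-thread work
const runB = [51];           // flagged once
const runC = [500];          // also flagged once, but far worse for the user
console.log(countLongTasks(runA), countLongTasks(runB), countLongTasks(runC)); // 0 1 1
console.log(excessBlockingMs(runB), excessBlockingMs(runC));                   // 1 450
```

The weighted variant is just one way to avoid the cliff; the point is that any single threshold hides both near-misses and severity.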
There are criticisms of TTI and how useful it is: see https://blog.dareboost.com/en/2019/05/measuring-interactivity-time-to-interactive/ - the section starting at “Common misconception about Time To Interactive”. Relative to this proposal, see the comment “What we’re missing is ‘a Speed Index of interactions’ where each Long Task would be weighted by its duration and the time at which it occurs, in relation to other metrics (e.g. rendering metrics) to infer the user’s frustration.” While this proposal doesn’t completely cover that suggestion, it gets much closer to the impact on the user during load.

Another useful datapoint is “rage clicks” versus TTFI/onLoad and visual completeness. People rage click (multiple clicks close together) when a page seems like it should be usable (partly or fully visually complete) but doesn’t respond (or give visual feedback) quickly to a click. See https://speakerdeck.com/bluesmoon/ux-and-performance-metrics-that-matter-a062d37f-e6c7-4b8a-8399-472ec76bb75e?slide=16 for some examples.

It’s not uncommon for pages to use timeouts to load lots of JS needed for the page to work, to load code to support less-used features, and to run various other bits of code off timeouts. This code all has to be parsed, and often does a fair bit of setup after loading. Even if this code never causes jank (defined for TTI as 50ms delays), it can delay a page becoming actually interactive and usable.

The 50ms “Long Task” value is not totally arbitrary; it’s based on human perception of an “immediate” reaction, and it presupposes that 50ms is below the perception cliff. Note, however, that a Long Task of 50ms may not correspond to an input-reaction delay of ~50ms; the actual delay may be considerably more, depending on how inputs are handled.
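A rage-click detector of the kind described above can be sketched as follows; the window size and click count are illustrative parameters, not values from any analytics product.

```javascript
// Hypothetical "rage click" detection: N or more clicks within a short
// window suggests the user clicked, saw no response, and clicked again.
const RAGE_WINDOW_MS = 1000;  // illustrative window
const RAGE_MIN_CLICKS = 3;    // illustrative click count

// clickTimesMs: sorted timestamps (ms) of clicks on roughly the same target.
function hasRageClicks(clickTimesMs) {
  for (let i = 0; i + RAGE_MIN_CLICKS - 1 < clickTimesMs.length; i++) {
    // If the Nth click after click i landed within the window, flag it.
    if (clickTimesMs[i + RAGE_MIN_CLICKS - 1] - clickTimesMs[i] <= RAGE_WINDOW_MS) {
      return true;
    }
  }
  return false;
}

console.log(hasRageClicks([100, 350, 600]));    // true: 3 clicks in 500ms
console.log(hasRageClicks([100, 2000, 4000]));  // false: clicks spread out
```

Correlating such events with visual completeness and TTFI is what makes them a responsiveness signal rather than just a UX curiosity.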
The prioritized User Input Event Queue in Firefox does attempt to ensure that the maximum delay seen is close to the Long Task value - however, some events have higher priority than input (vsync), and any events already in the input queue will delay new user input. If user input causes events to be added to the normal event queues (in order to respond to the input), those events will be delayed by all the events already in the queues.

Some types of JS script loads are only needed for unusual user interactions; typically loads of supporting JS scripts for such functionality are deferred. Parsing/processing such scripts could cause jank, but shouldn’t block most usage of the page.

Goals:

We want to understand and track both user-perceptible jank, and when a page is fully available for the user to interact with. We could also discriminate between a page being basically interactive (scrollable/clickable) and fully interactive (all scripts ready). The longer-term purpose of these measurements is to improve the user perception of the interactivity of the page once it looks visually complete, and to enable page authors to do likewise.

Options:
* Jank
* TTI/TTFI (or FCI)
  * See above for issues with it
  * 5-second window mostly ensures the page is actually done loading/initializing.
* MID (Median Input Delay - see my posting here)
  * Measures expected input delay during load
  * Exact definition of what value to derive from the data TBD (Mean, Median, 95%, Max, etc.)
  * Issues:
    * Since it’s calculating a (single) value-over-time, when you stop has a major impact on the final value. Starting time also has a major impact, since while the browser is loading the initial page data it’s likely highly interactive.
    * Stop on Page Ready?
    * Start on last byte delivered of main page? FCP?
    * Stop at the TTI/TTFI point?
    * Stop at Visually Complete (or 85%) plus some delta? Start at 10% visually complete? 30%? (And which “visually complete”?)
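One generic way to approximate “how long would a queued event wait right now” is to sample event-loop lag: schedule a timer for T ms and measure how much later than T it actually fires. This is a sketch under that assumption, not the Gecko implementation (which, as noted, uses a prioritized input event queue and back-infers delays).

```javascript
// Sample event-loop lag as a proxy for queued-event delay.
// intervalMs: sampling interval; onSample receives each lag measurement (ms).
function sampleEventLoopLag(intervalMs, onSample) {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    onSample(Math.max(0, now - expected)); // lag: how late we actually ran
    expected = now + intervalMs;
  }, intervalMs);
  return () => clearInterval(timer); // returns a stop function
}

// Usage: sample at ~33ms and keep the worst delay seen during a window.
let maxLagMs = 0;
const stop = sampleEventLoopLag(33, lag => { maxLagMs = Math.max(maxLagMs, lag); });
setTimeout(() => {
  stop();
  console.log("max event-loop lag seen:", maxLagMs, "ms");
}, 200);
```

A sampler like this only sees main-thread delay in its own context; it can't see higher-priority queues or compositor work, which is why the in-engine measurement is preferable.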
* Graphs of expected input delay
  * Gives a continuous readout of delay
  * Can work with the profiler counter display
  * No issue defining a stop-time, though you need to stop measuring sometime.
  * Issues:
    * Doesn’t provide a single number to compare
    * Can provide mean/avg/95%/etc. numbers, but then you have the “stopping” problem again
      * Could use visually complete
* Page Ready
  * Detect when the page becomes idle
    * Page going idle CPU-wise isn’t sufficient
    * Need to ignore housekeeping activity, like animations, polls, etc.
  * TTFI/FCI is meant to be this, more or less
  * All scripts of type <x> must be ready
  * [All short-duration timeouts during load have run.]
    * Likely unnecessary

In reality, we want to measure more than one thing here. We want to know when the page is ready, and we want to understand how janky the browser is during a page load (and later as well, though that’s a different question). Graphs are intuitive visually, but a graph of delay over a load is hard for automation to act on or alert on directly. We can track Max Delay easily, since that doesn’t have the stopping (or starting) problems. We could also track time-with-delay-worse-than-X, which is relatively free of the starting/stopping problem, though it may still be slightly impacted.

For example, we could have these sorts of results (hypothetical):
* Foo.com:
  * Onload: 10.2s
  * TTFI: 15.4s
  * FCP: 3.4s
  * Time-To-Page-Ready: 11.2s
  * Max-Input-Delay: 900ms
  * Time-Non-Responsive(50ms): 6.5s (over period start-to-TTFI)
  * Time-Non-Responsive(100ms): 4.3s
  * Time-Non-Responsive(250ms): 1.2s
  * Avg, Mean, 5%, 95% for Responsiveness (over what period?)

The most interesting of these to track may be Time-To-Page-Ready (TTPR), Max-Input-Delay, and perhaps Time-Non-Responsive(50ms).

Proposal:

We should track the input delay loading “real-world” pages such as those in the load tests in Raptor, and let the tests run until TTFI or until N (10?) seconds after the load event.
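As a concrete sketch of how the hypothetical Max-Input-Delay, Time-Non-Responsive(X), and percentile numbers above could be derived from a sampled delay series (all names, sample values, and the percentile method are illustrative assumptions):

```javascript
// samples: { timestampMs, delayMs } pairs taken at a fixed sampling interval.
function maxInputDelay(samples) {
  return Math.max(0, ...samples.map(s => s.delayMs));
}

// Time-Non-Responsive(X): total time whose sampled delay exceeded X ms,
// approximated as (count of samples over X) * (sampling interval).
function timeNonResponsiveMs(samples, thresholdMs, sampleIntervalMs) {
  return samples.filter(s => s.delayMs > thresholdMs).length * sampleIntervalMs;
}

// Simple nearest-rank percentile of the sampled delays.
function percentileDelay(samples, p) {
  const sorted = samples.map(s => s.delayMs).sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// Illustrative run: ~33ms sampling over part of a load.
const samples = [
  { timestampMs: 0,   delayMs: 5 },
  { timestampMs: 33,  delayMs: 120 },
  { timestampMs: 66,  delayMs: 900 },
  { timestampMs: 99,  delayMs: 60 },
  { timestampMs: 132, delayMs: 10 },
];
console.log(maxInputDelay(samples));                // 900
console.log(timeNonResponsiveMs(samples, 50, 33));  // 99 (three samples over 50ms)
console.log(timeNonResponsiveMs(samples, 250, 33)); // 33
console.log(percentileDelay(samples, 95));          // 900
```

Note that Max-Input-Delay and Time-Non-Responsive(X) need no start/end choice beyond the measurement window itself, whereas the percentile values change with where you start and stop, as discussed above.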
We should track the input delay as finely as we can (preferably 1ms), but no worse than once every ~33ms. This should be available via browsertime, the Gecko Profiler, and also in Raptor runs in automation. Since the delay graph is closely tied to the specific run, we’ll want to view the series in Raptor or browsertime, as well as get the raw numbers directly. Viewing them in the profiler (as a track) is especially useful, since they’re directly tied to the code and events running.

With this data, we can develop appropriate metrics that can then be alerted on or tracked. Ideally we can define one or a few alertable metrics from the data. Once we have graphed data, we can calculate what makes sense and how well it maps to reality. Initially we should try calculating several of the above metrics, and set up some datasets where we can play with the start and end points and what we calculate from the values. We can then start tracking some of these in Perfherder and see if they’re stable enough to use as alerts.

-- 
Randell Jesup, Mozilla Corp
remove "news" for personal email

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform