This document lives at (viewable publicly, but not editable except for specific people): https://docs.google.com/document/d/1TdDbjYFpFq4TH06vh-e-P8qoLcQ8SpNMIudKr3x-7ho/edit
I'm happy to respond to comments here, and if there are significant changes I'll add them to the doc and comment here. I have patches under development to allow us to measure this at low overhead and back-infer the delays.

** Measuring interactivity in a browser (especially during load) **

When loading a page, the page is frequently unresponsive to the user during the load, even after the page “appears” to have finished loading. This is a major problem for pages and for users, and it’s hard to measure. Interaction can be clicking on a link, scrolling, interacting with JS buttons on the page, etc.

Existing measurements:

The primary existing measurement is Time-To-Interactive (TTI), along with a subset of it, Time-To-First-Interactive (TTFI, now renamed First CPU Idle), which is currently implemented by Firefox. Another, indirect measurement is First Input Delay (FID), which can only be measured in the field, since it measures the delay when a user happens to first try to interact with the page.

TTI/TTFI have issues: the 50ms “Long Task” cliff (which means that stalls slightly under it won’t be seen as an issue, while runs over it will all be treated the same regardless of how bad the jank is); the arbitrary “max 2 network connections” in TTI; the arbitrary 5s window without a long task or 3+ network connections; etc. These metrics do have some value, but for more-complex pages or lower-end hardware what they tell you is hard to interpret.

TTI tries to capture not only jank, but also to determine indirectly when a page has loaded all the resources it needs to provide functionality to the user. Note that jank and a page having complete functionality are really two different things; TTI conflates them. Worse, TTI has varying definitions: Google’s definition uses 5s and 2 network requests, but Akamai uses very different parameters.
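As a minimal sketch of the “cliff” problem described above (all function names and values here are illustrative, not any spec): a count-of-long-tasks metric ignores a 49ms stall entirely, and scores a 51ms stall and a 500ms stall identically, while a duration-weighted sum distinguishes them.

```javascript
// Illustrative sketch of the 50ms "Long Task" cliff.
const LONG_TASK_MS = 50;

// Threshold metric: how many tasks crossed the cliff.
function countLongTasks(taskDurationsMs) {
  return taskDurationsMs.filter(d => d > LONG_TASK_MS).length;
}

// A duration-weighted alternative: total time spent beyond the threshold,
// which distinguishes "barely over" from "badly janky".
function excessBlockingMs(taskDurationsMs) {
  return taskDurationsMs
    .map(d => Math.max(0, d - LONG_TASK_MS))
    .reduce((a, b) => a + b, 0);
}

const runA = [49, 49, 49];   // never flagged, though ~150ms of main-thread work
const runB = [51];           // flagged once
const runC = [500];          // also flagged once, but far worse for the user
console.log(countLongTasks(runA), countLongTasks(runB), countLongTasks(runC)); // 0 1 1
console.log(excessBlockingMs(runB), excessBlockingMs(runC));                   // 1 450
```

The weighted variant is just one way to avoid the cliff; the point is that any single threshold hides both near-misses and severity.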
There are criticisms of TTI and how useful it is: see https://blog.dareboost.com/en/2019/05/measuring-interactivity-time-to-interactive/ - the section starting at “Common misconception about Time To Interactive”. Relative to this proposal, see the comment “What we’re missing is ‘a Speed Index of interactions’ where each Long Task would be weighted by its duration and the time at which it occurs, in relation to other metrics (e.g. rendering metrics) to infer the user’s frustration.” While this proposal doesn’t completely cover that suggestion, it gets much closer to the impact on the user during load.

Another useful datapoint is “rage clicks” versus TTFI/onLoad and visual completeness. People rage click (multiple clicks close together) when a page seems like it should be usable (partly or fully visually complete) but doesn’t respond (or give visual feedback) quickly to a click. See https://speakerdeck.com/bluesmoon/ux-and-performance-metrics-that-matter-a062d37f-e6c7-4b8a-8399-472ec76bb75e?slide=16 for some examples.

It’s not uncommon for pages to use timeouts to load lots of JS needed for the page to work, to load code to support less-used features, and to run various other bits of code off timeouts. This code all has to be parsed, and often does a fair bit of setup after loading. Even if this code never causes jank (defined for TTI as 50ms delays), it can delay a page becoming actually interactive and usable.

The 50ms “Long Task” value is not totally arbitrary; it’s based on human perception of an “immediate” reaction, and it presupposes that 50ms is below the perception cliff. Note, however, that a Long Task of 50ms may not correspond to an input-reaction delay of ~50ms; the actual delay may be considerably more, depending on how inputs are handled.
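A rage-click detector of the kind described above can be sketched as follows; the window size and click count are illustrative parameters, not values from any analytics product.

```javascript
// Hypothetical "rage click" detection: N or more clicks within a short
// window suggests the user clicked, saw no response, and clicked again.
const RAGE_WINDOW_MS = 1000;  // illustrative window
const RAGE_MIN_CLICKS = 3;    // illustrative click count

// clickTimesMs: sorted timestamps (ms) of clicks on roughly the same target.
function hasRageClicks(clickTimesMs) {
  for (let i = 0; i + RAGE_MIN_CLICKS - 1 < clickTimesMs.length; i++) {
    // If the Nth click after click i landed within the window, flag it.
    if (clickTimesMs[i + RAGE_MIN_CLICKS - 1] - clickTimesMs[i] <= RAGE_WINDOW_MS) {
      return true;
    }
  }
  return false;
}

console.log(hasRageClicks([100, 350, 600]));    // true: 3 clicks in 500ms
console.log(hasRageClicks([100, 2000, 4000]));  // false: clicks spread out
```

Correlating such events with visual completeness and TTFI is what makes them a responsiveness signal rather than just a UX curiosity.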
The prioritized User Input Event Queue in Firefox does attempt to ensure that the maximum delay seen is close to the Long Task value - however, some events have higher priority than input (vsync), and any events already in the input queue will delay new user input. If user input causes events to be added to the normal event queues (in order to respond to the input), those events will be delayed by all the events already in the queues.

Some types of JS script loads are only needed for unusual user interactions; typically loads of supporting JS scripts for such functionality are deferred. Parsing/processing such scripts could cause jank, but shouldn’t block most usage of the page.

Goals:

We want to understand and track both user-perceptible jank, and when a page is fully available for the user to interact with. We could also discriminate between a page being basically interactive (scrollable/clickable) and fully interactive (all scripts ready). The longer-term purpose of these measurements is to improve the user perception of the interactivity of the page once it looks visually complete, and to enable page authors to do likewise.

Options:
* Jank
* TTI/TTFI (or FCI)
  * See above for issues with it
  * 5-second window mostly ensures the page is actually done loading/initializing.
* MID (Median Input Delay - see my posting here)
  * Measures expected input delay during load
  * Exact definition of what value to derive from the data TBD (Mean, Median, 95%, Max, etc.)
  * Issues:
    * Since it’s calculating a (single) value-over-time, when you stop has a major impact on the final value. Starting time also has a major impact, since while the browser is loading the initial page data it’s likely highly interactive.
    * Stop on Page Ready?
    * Start on last byte delivered of main page? FCP?
    * Stop at the TTI/TTFI point?
    * Stop at Visually Complete (or 85%) plus some delta? Start at 10% visually complete? 30%? (And which “visually complete”?)
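One generic way to approximate “how long would a queued event wait right now” is to sample event-loop lag: schedule a timer for T ms and measure how much later than T it actually fires. This is a sketch under that assumption, not the Gecko implementation (which, as noted, uses a prioritized input event queue and back-infers delays).

```javascript
// Sample event-loop lag as a proxy for queued-event delay.
// intervalMs: sampling interval; onSample receives each lag measurement (ms).
function sampleEventLoopLag(intervalMs, onSample) {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    onSample(Math.max(0, now - expected)); // lag: how late we actually ran
    expected = now + intervalMs;
  }, intervalMs);
  return () => clearInterval(timer); // returns a stop function
}

// Usage: sample at ~33ms and keep the worst delay seen during a window.
let maxLagMs = 0;
const stop = sampleEventLoopLag(33, lag => { maxLagMs = Math.max(maxLagMs, lag); });
setTimeout(() => {
  stop();
  console.log("max event-loop lag seen:", maxLagMs, "ms");
}, 200);
```

A sampler like this only sees main-thread delay in its own context; it can't see higher-priority queues or compositor work, which is why the in-engine measurement is preferable.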
* Graphs of expected input delay
  * Gives a continuous readout of delay
  * Can work with the profiler counter display
  * No issue defining a stop-time, though you need to stop measuring sometime.
  * Issues:
    * Doesn’t provide a single number to compare
    * Can provide mean/avg/95%/etc. numbers, but then you have the “stopping” problem again
      * Could use visually complete
* Page Ready
  * Detect when the page becomes idle
    * Page going idle CPU-wise isn’t sufficient
    * Need to ignore housekeeping activity, like animations, polls, etc.
  * TTFI/FCI is meant to be this, more or less
  * All scripts of type <x> must be ready
  * [All short-duration timeouts during load have run.]
    * Likely unnecessary

In reality, we want to measure more than one thing here. We want to know when the page is ready, and we want to understand how janky the browser is during a page load (and later as well, though that’s a different question). Graphs are intuitive visually, but a graph of delay over a load is hard for automation to act on or alert on directly. We can track Max Delay easily, since that doesn’t have the stopping (or starting) problems. We could also track time-with-delay-worse-than-X, which is relatively free of the starting/stopping problem, though it may still be slightly impacted.

For example, we could have these sorts of results (hypothetical):
* Foo.com:
  * Onload: 10.2s
  * TTFI: 15.4s
  * FCP: 3.4s
  * Time-To-Page-Ready: 11.2s
  * Max-Input-Delay: 900ms
  * Time-Non-Responsive(50ms): 6.5s (over period start-to-TTFI)
  * Time-Non-Responsive(100ms): 4.3s
  * Time-Non-Responsive(250ms): 1.2s
  * Avg, Mean, 5%, 95% for Responsiveness (over what period?)

The most interesting of these to track may be Time-To-Page-Ready (TTPR), Max-Input-Delay, and perhaps Time-Non-Responsive(50ms).

Proposal:

We should track the input delay loading “real-world” pages such as those in the load tests in Raptor, and let the tests run until TTFI or until N (10?) seconds after the load event.
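As a concrete sketch of how the hypothetical Max-Input-Delay, Time-Non-Responsive(X), and percentile numbers above could be derived from a sampled delay series (all names, sample values, and the percentile method are illustrative assumptions):

```javascript
// samples: { timestampMs, delayMs } pairs taken at a fixed sampling interval.
function maxInputDelay(samples) {
  return Math.max(0, ...samples.map(s => s.delayMs));
}

// Time-Non-Responsive(X): total time whose sampled delay exceeded X ms,
// approximated as (count of samples over X) * (sampling interval).
function timeNonResponsiveMs(samples, thresholdMs, sampleIntervalMs) {
  return samples.filter(s => s.delayMs > thresholdMs).length * sampleIntervalMs;
}

// Simple nearest-rank percentile of the sampled delays.
function percentileDelay(samples, p) {
  const sorted = samples.map(s => s.delayMs).sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// Illustrative run: ~33ms sampling over part of a load.
const samples = [
  { timestampMs: 0,   delayMs: 5 },
  { timestampMs: 33,  delayMs: 120 },
  { timestampMs: 66,  delayMs: 900 },
  { timestampMs: 99,  delayMs: 60 },
  { timestampMs: 132, delayMs: 10 },
];
console.log(maxInputDelay(samples));                // 900
console.log(timeNonResponsiveMs(samples, 50, 33));  // 99 (three samples over 50ms)
console.log(timeNonResponsiveMs(samples, 250, 33)); // 33
console.log(percentileDelay(samples, 95));          // 900
```

Note that Max-Input-Delay and Time-Non-Responsive(X) need no start/end choice beyond the measurement window itself, whereas the percentile values change with where you start and stop, as discussed above.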
We should track the input delay as finely as we can (preferably 1ms), but no worse than once every ~33ms. This should be available via browsertime, the Gecko Profiler, and also in Raptor runs in automation. Since the delay graph is closely tied to the specific run, we’ll want to view the series in Raptor or browsertime, as well as get the raw numbers directly. Viewing them in the profiler (as a track) is especially useful, since they’re directly tied to the code and events running.

With this data, we can develop appropriate metrics that can then be alerted on or tracked. Ideally we can define one or a few alertable metrics from the data. Once we have graphed data, we can calculate what makes sense and how well it maps to reality. Initially we should try calculating several of the above metrics, and set up some datasets where we can play with the start and end points and what we calculate from the values. We can then start tracking some of these in Perfherder and see if they’re stable enough to use as alerts.

-- 
Randell Jesup, Mozilla Corp
remove "news" for personal email

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform