[dev-platform] Intent to Ship: New Socorro Backend (crash-stats changes)

Alexis Beingessner Thu, 09 Dec 2021 09:07:22 -0800

Intent to Ship: New Socorro Backend (crash-stats changes)

We are planning to enable the new rust-minidump
<https://github.com/luser/rust-minidump> backend for socorro (
crash-stats.mozilla.org) next week (Monday?). This implementation was
deployed to mozcrash.py
<https://searchfox.org/mozilla-central/source/testing/mozbase/mozcrash/mozcrash/mozcrash.py>
(the `try` crash analyzer) a couple weeks ago (Bug 1741205
<https://bugzilla.mozilla.org/show_bug.cgi?id=1741205>). It has also been
running on crash-stats staging <https://crash-stats.allizom.org/> for the
past few weeks, and seems to be working well.

It's very likely that some crash signatures will change. Hopefully not
many, but it's inevitable. If you think a signature (or anything else) is
notably worse now, please needinfo :Gankra on bugzilla or ping @gankra:
mozilla.org on element (she wrote and maintains the bulk of the new
implementation).

Detailed user documentation for the new implementation can be found here
<https://github.com/luser/rust-minidump/blob/master/minidump-stackwalk/README.md>
(including a section on using it with Firefox crashes
<https://github.com/luser/rust-minidump/blob/master/minidump-stackwalk/README.md#analyzing-firefox-minidumps>).
It's much easier to run locally!

Although we've put a fair amount of effort into making this a "behind the
scenes'' change you don't need to care about, there are definitely some
differences. At this point we're fairly confident that all of these changes
(except one, see exploitability) are either benign differences of opinion
or net improvements.

You can identify which implementation processed a crash report by checking the
new "Debug" tab in crash-stats
<https://crash-stats.mozilla.org/report/index/024bbbe9-b305-4b18-8a0a-f2aca0211208#tab-debug>
(needs protected data access). The new implementation will have a
"Stackwalker version" like:

minidump-stackwalk 0.9.5 (2021-12-08 c4617dc2)

while the old implementation will just say

stackwalker unknown

The rest of this message is a complete listing and discussion of these
changes (to the best of our knowledge). This will be primarily focused on
the raw JSON output, but those values often get verbatim reflected in the
crash-stats UI. If you are unfamiliar/uncertain about the meaning of a JSON
value discussed here, the new implementation also comes with a complete
schema that describes the semantics of each field
<https://github.com/luser/rust-minidump/blob/master/minidump-processor/json-schema.md>
!

All of these differences are "WONTFIX" -- we're happy with the new
behaviours! That said, they can be relitigated if anyone has a compelling
argument for why an old behaviour is better.

To help you read this document, we have marked each difference with an
indicator:

❌ is a regression

🚧 is a benign difference

✅ is an improvement

❌ exploitability

Let's get the one regression out of the way first. The "exploitability"
field is no longer populated. Although a minimal implementation was written
<https://github.com/luser/rust-minidump/pull/206>, we decided against
landing it because of the feedback we got when we asked how people use the
current exploitability implementation. Pretty unanimously our peers came
back with: "it sucks and we ignore it" (If you disagree, please let us know!
).

Exploitability is a feature where breakpad would combine all of the
information in the minidump to try to guess how likely it is that this
crash is something exploitable. For instance, if the crash was an OOM,
that's probably fine; if the crash was a null pointer deref, that's a
little worrying; and if you segfaulted the instruction pointer, that's a
huge concern.

Unfortunately the signal-to-noise ratio was too bad to make it actually
useful. Our peers frequently found that benign crashes were flagged as
concerning, so checking or monitoring "exploitability" was rarely useful.

If we ever decide to reimplement this functionality, a lot more work will
need to go into eliminating false alarms (e.g. properly noticing that a
crash was due to an assertion and not really an otherwise-very-sketchy
illegal instruction).

🚧 various stackwalking differences

Stackwalking is complicated. The two implementations largely use the same
algorithms and approaches, and will mostly come back with the same answer.
But there's so many little subtleties and interactions that it's impossible
(and undesirable) to keep the two the same.

If you do encounter something strange, the new implementation includes
debugging
tools for stackwalking
<https://github.com/luser/rust-minidump/blob/master/minidump-stackwalk/README.md#debugging-stackwalking>
(not run in production, but can be run locally on "suspicious" crash
reports, which socc-pair can help with <https://github.com/Gankra/socc-pair>
).

To the best of my reckoning the differences are generally either benign or
in the new implementation's favour, especially for the crashing thread,
which is the most important one. Other threads are more likely to be in a
zombie state that is harder (or impossible?) to stackwalk. I've definitely
seen a few places where the new implementation is doing an obviously better
job, but I don't have a good sense for how "common" or "significant" that
is.

Most differences occur when we don't have any good information informing
the stackwalk, and are basically making educated guesses (e.g. when
`trust=scan` or when the CFI is seemingly wrong). In these situations it's
very easy for either implementation to "hallucinate" extra frames or skip
over frames. When this happens it's hard to tell if one is doing a better
or worse job than the other. Garbage in, garbage out.

In these cases the new implementation is slightly more biased towards "keep
it simple". So for instance, we do not implement the cfi_scan hack
<https://github.com/luser/rust-minidump/issues/330>, because it produces
dubious results.

The new implementation also tries to be more consistent between platforms
when there isn't a clear reason for them to have divergent behaviour. So a
heuristic we apply on x86 is more likely to also be applied on x64 and ARM.
Some arbitrary inconsistency was however preserved in places where the
difference between the old implementation and new implementation seemed too
large. e.g. we generally tried to preserve any quirks that were observable
in breakpad's testsuite.

(Note: just like the old implementation, every platform's stackwalker is a
completely independent file, so keeping things consistent is an explicit
decision and not just an accidental artifact of excessive code reuse.)

🚧 hard errors are reported differently

The JSON schema contains a "status" field that the old implementation would
try to use to report an error (e.g. for empty/corrupt minidumps). This is a
bit unreliable, because things may be in a corrupted state, resulting in
corrupt JSON.

The new implementation prefers to produce no input if it encounters an
error, and prints an error message to stderr. That said, in the new
implementation this mostly only ever happens for empty/corrupt minidumps,
where there isn't anything interesting to be done.

The Socorro processor rule that runs minidump-stackwalk now captures stderr
and truncates it to the last 10 lines. It includes this data in the
"mdsw_stderr" field and it's visible in the Debug tab for people who have
protected data access.

🚧 null vs nothing

There are a bunch of fields that the old stackwalker doesn't emit at all if
there's no value, whereas the new implementation will always emit the field
with the value `null`. This is partially a quirk of the implementation, but
it also just makes it easier to tell if some data could be there but isn't.

mac_crash_info
-

lsb_release
-

crash_info.assertion
-

system_info.cpu_microcode_version
-

crashing_thread.last_error_value
-

crashing_thread.thread_name
-

crashing_thread.frames.N.line
-

crashing_thread.frames.N.file
-

crashing_thread.frames.N.module
-

crashing_thread.frames.N.module_offset
-

crashing_thread.frames.N.function_offset
-

crashing_thread.frames.N.function
-

threads.N.last_error_value
-

threads.N.thread_name
-

threads.N.frames.N.file
-

threads.N.frames.N.function
-

threads.N.frames.N.function_offset
-

threads.N.frames.N.line
-

threads.N.frames.N.module
-

threads.N.frames.N.module_offset
-

modules.N.cert_subject
-

modules.N.symbol_url
-

unloaded_modules.N.cert_subject

🚧 false vs nothing

Similar to null vs. nothing, there are a couple of fields where the old
stackwalker emits nothing, but rust minidump-stackwalk emits false.

modules.N.loaded_symbols
-

modules.N.corrupt_symbol

🚧 module version is null instead of empty string

Breakpad stackwalker emitted modules.N.version as empty string ("") and
Rust minidump-stackwalk emits a null.

Crash Stats doesn't do anything with this, but scripts that look at
processed crash data and depend on this will need to be updated.

🚧 modules have a different order

The old implementation liked to sort modules (executable, libraries, mapped
files like fonts) by address. The new implementation simply preserves the
order of the modules in the minidump.

(Due to this difference, main_module -- an index into the modules array --
may also change if a minidump is reprocessed, but it will still point to
the same module as before.)

🚧 bad module code_ids are preserved

The old implementation would replace a seemingly corrupt (too short)
modules.N.code_id value with the string "id" (no clue why). The new
implementation just reports whatever value it finds.

✅ module names have their case preserved

The old implementation would lower-case threads.N.module strings, the new
implementation preserves the casing of the module's name.

(e.g. instead of accessiblehandler.dll, you will see AccessibleHandler.dll)

✅ unloaded_modules are now correctly messier

The old implementation similarly sorted and deduplicated the contents of
"unloaded_modules". However this isn't appropriate, because
unloaded_modules is a log of unload events. Modules can (and do) show up
multiple times in unloaded_modules when they are loaded and unloaded
multiple times. These "repeat" entries may or may not have the same
address. The ordering of the log is also mostly chronological, so the
ordering has significance.

However (at least for Microsoft's native implementation of minidumps),
unloaded_modules is not a complete log, because they use a fixed-size
circular buffer to track unload events. If the buffer fills up, the oldest
entries will get overwritten by the newest ones. This circular buffer also
results in the "start" of the log not actually being the first element, and
there's no way to tell which element is first. Hence mostly chronological.

The new implementation faithfully returns the log as it appears in the
minidump.

(Unfortunately I don't actually know if the log is ascending or descending
in time at the moment)

🚧 legacy memory statistics are removed

Several fields have been deprecated and aren't emitted in the new
implementation because they have been obsoleted by our more robust memory
reporting infrastructure in the "Crash Annotations" tab (example
<https://crash-stats.mozilla.org/report/index/4aa1531d-45b1-4b82-866f-31d3f0211209#tab-annotations>
).

tiny_block_size
-

largest_free_vm_block
-

write_combine_size

🚧 legacy stack truncation fields are removed

The old implementation used to truncate long backtraces (presumably to
simplify the output). This ended up confusing people, so it was disabled
about a year ago. Since the new implementation never supported this
(mis)feature, it does not output the fields that were used to report on
when this happened:

crashing_thread.frames_truncated (was always false)
-

crashing_thread.total_frames (this stored the untruncated count)

Note that crashing_thread.frame_count is still present (and always has
been), and contains the number of frames. For the last year this has always
contained the exact same value as total_frames.

🚧 symbol debugging fields removed/changed

These were just things that maintainers of the old implementation added to
help debug their own work. We did not feel the need to track them in the
new implementation.

modules.N.symbol_disk_cache_hit
-

modules.N.symbols_fetch_time

The other module debug statistics (like symbols_corrupt) will not
necessarily have the same values, as they reflect internal implementation
details.

✅ symbols_url is now correct

modules.N.symbols_url specifies what url the symbols for this module were
downloaded from. In the old implementation this value would be wrong if the
symbols were found in the cache (it made a sloppy guess). The new
implementation stores this value in the cache so it will always be correct.

Consumers of this value hacked around the value being incorrect, but these
hacks can now be removed.

🚧 last_error_value may change

threads.N.last_error_value contains the GetLastError()
<https://docs.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-getlasterror>
value for each thread (which for unix people is similar to errno).
Interpreting error codes on windows is an imperfect art because the major
error types have overlapping values. For instance STATUS_WAIT_3 and
ERROR_PATH_NOT_FOUND are the exact same value, and we just have to pick one
arbitrarily.

The new implementation has a different order it tries to resolve errors in
(we think it's better), so some of these values may change if you reprocess
a minidump.

Perhaps in the future we should list all possibilities?

--
You received this message because you are subscribed to the Google Groups
"[email protected]" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/a/mozilla.org/d/msgid/dev-platform/CAHDr%2B1mAKGcb1kP63z_%2BXgba-LHUnu5PcSyC7RytXsNhJ1ho2A%40mail.gmail.com.

[dev-platform] Intent to Ship: New Socorro Backend (crash-stats changes)

Reply via email to