In crashkill we have been tracking crashes that occur in low-memory situations for a while. However, we are seeing a troubling uptick of issues in Firefox 23 and then 25. I believe that some people may not be able to use Firefox because of these bugs, and I think that we should be reacting more strongly to diagnose and solve these issues and get any fixes that already exist sent up the trains.

Followup to dev-platform, please.

= Data and Background =

Some anecdotal evidence:

Bug 930797 is a user who just upgraded to Firefox 25 and is seeing these crashes a lot. Bug 937290 is another user who just upgraded to Firefox 25 and is seeing a bunch of crashes, some of which are empty-dump and some of which are all over the place (maybe OOM crashes). See also the recent thread "How to track down why Firefox is crashing so much." in firefox-dev, where two additional users are reporting consistent issues (one Mac, one Windows).

Note that in many cases, the user hasn't actually run out of memory: they have plenty of physical memory and page file available. In most cases they also have enough available VM space! Often, however, this VM space is fragmented to the point where normal allocations (64k jemalloc heap blocks, or several-megabyte graphics or network buffers) cannot be made. Because of work done during the recent tree closure, we now have this measurement in about:memory (on Windows) as vsize-max-contiguous. It is also being computed for Windows crashes on crash-stats for clients that are new enough (win7+).
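
For anyone curious how a vsize-max-contiguous style number can be computed: the usual technique on Windows is to walk the process address space with VirtualQuery and record the largest MEM_FREE region. Here is a minimal sketch of that technique (this is not the actual about:memory reporter code, just an illustration of the measurement):

    // Sketch only: walk the process address space with VirtualQuery and
    // record the largest free region. The real about:memory reporter may
    // differ in detail; this just shows the measurement technique.
    #include <windows.h>
    #include <cstdio>

    static SIZE_T LargestContiguousFreeVM()
    {
      SYSTEM_INFO si;
      GetSystemInfo(&si);

      SIZE_T largest = 0;
      char* addr = static_cast<char*>(si.lpMinimumApplicationAddress);
      char* end = static_cast<char*>(si.lpMaximumApplicationAddress);

      MEMORY_BASIC_INFORMATION mbi;
      while (addr < end && VirtualQuery(addr, &mbi, sizeof(mbi)) != 0) {
        if (mbi.State == MEM_FREE && mbi.RegionSize > largest) {
          largest = mbi.RegionSize;
        }
        addr = static_cast<char*>(mbi.BaseAddress) + mbi.RegionSize;
      }
      return largest;
    }

    int main()
    {
      printf("largest contiguous free VM block: %llu MB\n",
             (unsigned long long)(LargestContiguousFreeVM() / (1024 * 1024)));
      return 0;
    }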

Unfortunately, when we are out of memory, crash reports often come back as empty minidumps (because the crash reporter has to allocate memory and/or VM space to create the minidump). We believe that most of the empty-minidump crashes present on crash-stats are in fact also out-of-memory crashes.

I've been creating reports about OOM crashes using crash-stats and found some startling data:
Looking just at the Windows crashes from last Friday (22-Nov):
* probably not OOM: 91565
* probably OOM: 57841
* unknown (not enough data because they are running an old version of Windows that doesn't report VM information in crash reports): 150874

The criteria for "probably OOM" are (a rough sketch of the test follows the list):
* Has an OOMAllocationSize annotation, meaning jemalloc aborted on an infallible allocation
* Has "ABORT: OOM" in the app notes, meaning XPCOM aborted in infallible string/hashtable/array code
* Has <50MB of contiguous free VM space
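
To make that concrete, here is a rough sketch of the test, treating any one of the three signals as sufficient to mark a crash as probably-OOM. The field names here are illustrative stand-ins for the crash report annotations; the authoritative logic is in the jydoop script linked below.

    // Illustrative sketch of the "probably OOM" classification. Field names
    // are stand-ins for the crash report annotations; see oom-classifier.py
    // (linked below) for the real logic.
    #include <cstdint>
    #include <string>

    struct CrashReport {
      bool hasOOMAllocationSize;    // jemalloc recorded a failed infallible allocation
      std::string appNotes;         // free-form app notes from the report
      uint64_t largestFreeVMBlock;  // largest contiguous free VM block, in bytes
    };

    static bool ProbablyOOM(const CrashReport& report)
    {
      const uint64_t kLowVMThreshold = 50ULL * 1024 * 1024;  // 50MB
      return report.hasOOMAllocationSize ||
             report.appNotes.find("ABORT: OOM") != std::string::npos ||
             report.largestFreeVMBlock < kLowVMThreshold;
    }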

This data seems to indicate that almost 40% of the Firefox crashes we can classify are due to OOM conditions.

Because one of the long-term possibilities discussed for solving this issue is releasing a 64-bit version of Firefox, I additionally broke down the "OOM" crashes into users running a 32-bit version of Windows and users running a 64-bit version of Windows:

* OOM on 64-bit Windows: 15744
* OOM on 32-bit Windows: 42097

I did this by checking the "TotalVirtualMemory" annotation in the crash report: if it reports 4G of TotalVirtualMemory, the user is running 64-bit Windows, and if it reports either 2G or 3G, the user is running 32-bit Windows. Since a win64 Firefox only helps users who are already on 64-bit Windows, and most of these OOM crashes are coming from 32-bit Windows, I do not expect that doing Firefox for win64 will help users who are already experiencing memory issues, although it may well help new users and users who are running memory-intensive applications such as games.
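
For reference, that bitness heuristic boils down to something like the following sketch (the names are illustrative, not taken from the actual script):

    // Sketch of the TotalVirtualMemory heuristic: a 32-bit Firefox only gets
    // a 4GB user address space when running on 64-bit Windows, and 2GB or
    // 3GB on 32-bit Windows.
    #include <cstdint>

    enum class WindowsBitness { Unknown, Win32, Win64 };

    static WindowsBitness ClassifyBitness(uint64_t totalVirtualMemoryBytes)
    {
      const uint64_t kGB = 1024ULL * 1024 * 1024;
      if (totalVirtualMemoryBytes >= 4 * kGB) {
        return WindowsBitness::Win64;  // 4GB address space => 64-bit Windows
      }
      if (totalVirtualMemoryBytes >= 2 * kGB) {
        return WindowsBitness::Win32;  // 2GB or 3GB => 32-bit Windows
      }
      return WindowsBitness::Unknown;
    }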

The script for this analysis is at https://github.com/mozilla/jydoop/blob/master/scripts/oom-classifier.py if you want to see exactly what it's doing.

= Next Steps =

As far as I can tell, there are several basic problems that we should be tackling. For now, I'm going to brainstorm some ideas and hope that people will react or take ownership of some of these items.

== Measurement ==

* Move minidump collection out of the Firefox process. This is something we've been talking about for a while but apparently never filed, so it's now filed as https://bugzilla.mozilla.org/show_bug.cgi?id=942873
* Develop a tool/instructions for users to profile the VM allocations in their Firefox process. We know that many of the existing VM problems are graphics-related, but we're not sure exactly who is making the allocations, whether they are leaks, cached textures, or something else, and whether the problem is in Firefox code, Windows code, or driver code. I know dmajor is working on some xperf logging for this, and we should probably try to expand that into something we can ask end users who are experiencing problems to run.
* The about:memory patches which add the contiguous-VM measurement should probably be uplifted to Fx26, along with any other measurement tools that would be valuable diagnostics.

== VM fragmentation ==

Bug 941837 identified a bad VM allocation pattern in our JS code which was causing 1MB VM fragmentation. Getting this patch uplifted seems important. But I know that several other things landed as a part of fixing the recent tree closure: has anyone identified whether any of the other patches here could be affecting release users and should be uplifted?

== Graphics Solutions ==

The issues reported in bug 930797 at least appear to be related to HTML5 <video> rendering. The STR aren't precise, but it seems that we should try to understand and fix the issue reported by that user. Disabling hardware acceleration does not appear to help.

Bas has a bunch of information in bug 859955 about degenerate behavior of graphics drivers: they often map textures into the Firefox process, and sometimes cache the latest N textures (N=200 in one test) no matter what the texture size is. I have a feeling that we need to do something here, but it's not clear what. Perhaps it's driver-specific workarounds, or blacklisting old driver versions, or working with driver vendors to have better behavior.

== Dealing with OOM crash sites ==

Currently we still have a fair number of call sites that crash, either via infallible allocation or by not checking for allocation failure, where the allocations are potentially large or huge. In general, infallible allocation should only be used for fixed-size quantities (C++ classes). Any array whose length is controlled by content, and any large buffer for graphics or networking data, should be allocated with a fallible allocator, null-checked, and the failure propagated up through the system.
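
As a reminder of the shape we want, here is a generic sketch using std::nothrow; in the tree this would use our fallible allocators and error codes, but the pattern is the same. (DecodeImage is a made-up example function, not an existing call site.)

    // Generic sketch of the fallible pattern for content-controlled sizes:
    // attempt the allocation, null-check it, and propagate failure to the
    // caller instead of aborting. In the actual tree this would use our
    // fallible allocators and nsresult-style error propagation.
    #include <cstddef>
    #include <cstdint>
    #include <new>

    bool DecodeImage(const uint8_t* data, size_t length,
                     uint8_t** outPixels, size_t pixelBytes)
    {
      // pixelBytes is derived from content (e.g. image dimensions), so it
      // can be huge; use a fallible allocation rather than an aborting one.
      uint8_t* pixels = new (std::nothrow) uint8_t[pixelBytes];
      if (!pixels) {
        return false;  // propagate OOM to the caller instead of crashing
      }

      (void)data;
      (void)length;  // decoding elided in this sketch

      *outPixels = pixels;
      return true;
    }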

I am working on generating some reports on existing crashes where OOMAllocationSize is variable, and also crash signatures that correlate highly with OOM conditions. We should fix these sites.

This is only a stopgap measure, because we see plenty of crashes where OOMAllocationSize is very small (as small as 56 bytes), but it will help keep the browser alive longer and also foil some trivial DoS attacks.

== Regression ranges ==

Some of the issues appear to be recently introduced in Firefox 25. We need to jump on regression ranges ASAP. I could really use help working with users such as those identified at the top of this message to see whether we can find regression ranges in nightly builds.

== Last-ditch UI ==

When contiguous VM starts getting low, we should probably warn the user and ask them to restart Firefox soon or risk crashing. I know that this sucks, but a warning before you crash at least gives you a chance to save things. I have filed this as https://bugzilla.mozilla.org/show_bug.cgi?id=942892

--BDS
