In crashkill we have been tracking crashes that occur in low-memory situations for a while. However, we are seeing a troubling uptick of issues in Firefox 23 and then 25. I believe that some people may not be able to use Firefox because of these bugs, and I think that we should be reacting more strongly to diagnose and solve these issues and get any fixes that already exist sent up the trains.

Followup to dev-platform, please.

= Data and Background =

Some anecdotal evidence:

Bug 930797 is a user who just upgraded to Firefox 25 and is seeing these crashes a lot. Bug 937290 is another user who just upgraded to Firefox 25 and is seeing a bunch of crashes, some of which are empty-dump and some of which are all over the place (maybe OOM crashes). See also the recent thread "How to track down why Firefox is crashing so much." in firefox-dev, where two additional users are reporting consistent issues (one Mac, one Windows).

Note that in many cases, the user hasn't actually run out of memory: they have plenty of physical memory and page file available. In most cases they also have enough available VM space! Often, however, this VM space is fragmented to the point where normal allocations (64k jemalloc heap blocks, or several-megabyte graphics or network buffers) cannot be made. Because of work done during the recent tree closure, we now have this measurement in about:memory (on Windows) as vsize-max-contiguous. It is also being computed for Windows crashes on crash-stats for clients that are new enough (win7+).
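
For anyone curious how a vsize-max-contiguous style number can be computed: the usual technique on Windows is to walk the process address space with VirtualQuery and record the largest MEM_FREE region. Here is a minimal sketch of that technique (this is not the actual about:memory reporter code, just an illustration of the measurement):

    // Sketch only: walk the process address space with VirtualQuery and
    // record the largest free region. The real about:memory reporter may
    // differ in detail; this just shows the measurement technique.
    #include <windows.h>
    #include <cstdio>

    static SIZE_T LargestContiguousFreeVM()
    {
      SYSTEM_INFO si;
      GetSystemInfo(&si);

      SIZE_T largest = 0;
      char* addr = static_cast<char*>(si.lpMinimumApplicationAddress);
      char* end = static_cast<char*>(si.lpMaximumApplicationAddress);

      MEMORY_BASIC_INFORMATION mbi;
      while (addr < end && VirtualQuery(addr, &mbi, sizeof(mbi)) != 0) {
        if (mbi.State == MEM_FREE && mbi.RegionSize > largest) {
          largest = mbi.RegionSize;
        }
        addr = static_cast<char*>(mbi.BaseAddress) + mbi.RegionSize;
      }
      return largest;
    }

    int main()
    {
      printf("largest contiguous free VM block: %llu MB\n",
             (unsigned long long)(LargestContiguousFreeVM() / (1024 * 1024)));
      return 0;
    }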

Unfortunately, when we are out of memory, crash reports often come back as empty minidumps (because the crash reporter has to allocate memory and/or VM space to create the minidump). We believe that most of the empty-minidump crashes present on crash-stats are in fact also out-of-memory crashes.

I've been creating reports about OOM crashes using crash-stats and found some startling data:
Looking just at the Windows crashes from last Friday (22-Nov):
* probably not OOM: 91565
* probably OOM: 57841
* unknown (not enough data because they are running an old version of Windows that doesn't report VM information in crash reports): 150874

The criteria for "probably OOM" are (a rough sketch of the test follows the list):
* Has an OOMAllocationSize annotation, meaning jemalloc aborted on an infallible allocation
* Has "ABORT: OOM" in the app notes, meaning XPCOM aborted in infallible string/hashtable/array code
* Has <50MB of contiguous free VM space
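
To make that concrete, here is a rough sketch of the test, treating any one of the three signals as sufficient to mark a crash as probably-OOM. The field names here are illustrative stand-ins for the crash report annotations; the authoritative logic is in the jydoop script linked below.

    // Illustrative sketch of the "probably OOM" classification. Field names
    // are stand-ins for the crash report annotations; see oom-classifier.py
    // (linked below) for the real logic.
    #include <cstdint>
    #include <string>

    struct CrashReport {
      bool hasOOMAllocationSize;    // jemalloc recorded a failed infallible allocation
      std::string appNotes;         // free-form app notes from the report
      uint64_t largestFreeVMBlock;  // largest contiguous free VM block, in bytes
    };

    static bool ProbablyOOM(const CrashReport& report)
    {
      const uint64_t kLowVMThreshold = 50ULL * 1024 * 1024;  // 50MB
      return report.hasOOMAllocationSize ||
             report.appNotes.find("ABORT: OOM") != std::string::npos ||
             report.largestFreeVMBlock < kLowVMThreshold;
    }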

This data seems to indicate that almost 40% of the Firefox crashes we can classify are due to OOM conditions.

Because one of the long-term possibilities discussed for solving this issue is releasing a 64-bit version of Firefox, I additionally broke down the "OOM" crashes into users running a 32-bit version of Windows and users running a 64-bit version of Windows:

* OOM on 64-bit Windows: 15744
* OOM on 32-bit Windows: 42097

I did this by checking the "TotalVirtualMemory" annotation in the crash report: if it reports 4G of TotalVirtualMemory, the user is running 64-bit Windows, and if it reports either 2G or 3G, the user is running 32-bit Windows. Since a win64 Firefox only helps users who are already on 64-bit Windows, and most of these OOM crashes are coming from 32-bit Windows, I do not expect that doing Firefox for win64 will help users who are already experiencing memory issues, although it may well help new users and users who are running memory-intensive applications such as games.
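
For reference, that bitness heuristic boils down to something like the following sketch (the names are illustrative, not taken from the actual script):

    // Sketch of the TotalVirtualMemory heuristic: a 32-bit Firefox only gets
    // a 4GB user address space when running on 64-bit Windows, and 2GB or
    // 3GB on 32-bit Windows.
    #include <cstdint>

    enum class WindowsBitness { Unknown, Win32, Win64 };

    static WindowsBitness ClassifyBitness(uint64_t totalVirtualMemoryBytes)
    {
      const uint64_t kGB = 1024ULL * 1024 * 1024;
      if (totalVirtualMemoryBytes >= 4 * kGB) {
        return WindowsBitness::Win64;  // 4GB address space => 64-bit Windows
      }
      if (totalVirtualMemoryBytes >= 2 * kGB) {
        return WindowsBitness::Win32;  // 2GB or 3GB => 32-bit Windows
      }
      return WindowsBitness::Unknown;
    }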

The script for this analysis is at https://github.com/mozilla/jydoop/blob/master/scripts/oom-classifier.py if you want to see exactly what it's doing.

= Next Steps =

As far as I can tell, there are several basic problems that we should be tackling. For now, I'm going to brainstorm some ideas and hope that people will react or take ownership of some of these items.

== Measurement ==

* Move minidump collection out of the Firefox process. This is something we've been talking about for a while but apparently never filed, so it's now filed as https://bugzilla.mozilla.org/show_bug.cgi?id=942873
* Develop a tool/instructions for users to profile the VM allocations in their Firefox process. We know that many of the existing VM problems are graphics-related, but we're not sure exactly who is making the allocations, whether they are leaks, cached textures, or something else, and whether the problem is in Firefox code, Windows code, or driver code. I know dmajor is working on some xperf logging for this, and we should probably try to expand that into something we can ask end users who are experiencing problems to run.
* The about:memory patches which add the contiguous-VM measurement should probably be uplifted to Fx26, along with any other measurement tools that would be valuable diagnostics.

== VM fragmentation ==

Bug 941837 identified a bad VM allocation pattern in our JS code which was causing 1MB VM fragmentation. Getting this patch uplifted seems important. But I know that several other things landed as a part of fixing the recent tree closure: has anyone identified whether any of the other patches here could be affecting release users and should be uplifted?

== Graphics Solutions ==

The issues reported in bug 930797 at least appear to be related to HTML5 <video> rendering. The STR aren't precise, but it seems that we should try to understand and fix the issue reported by that user. Disabling hardware acceleration does not appear to help.

Bas has a bunch of information in bug 859955 about degenerate behavior of graphics drivers: they often map textures into the Firefox process, and sometimes cache the latest N textures (N=200 in one test) no matter what the texture size is. I have a feeling that we need to do something here, but it's not clear what. Perhaps it's driver-specific workarounds, or blacklisting old driver versions, or working with driver vendors to have better behavior.

== Dealing with OOM crash sites ==

Currently we still have a fair number of call sites that crash, either via infallible allocation or by not checking for allocation failure, where the allocations are potentially large or huge. In general, infallible allocation should only be used for fixed-size quantities (C++ classes). Any array whose length is controlled by content, and any large buffer for graphics or networking data, should be allocated with a fallible allocator, null-checked, and the failure propagated up through the system.
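
As a reminder of the shape we want, here is a generic sketch using std::nothrow; in the tree this would use our fallible allocators and error codes, but the pattern is the same. (DecodeImage is a made-up example function, not an existing call site.)

    // Generic sketch of the fallible pattern for content-controlled sizes:
    // attempt the allocation, null-check it, and propagate failure to the
    // caller instead of aborting. In the actual tree this would use our
    // fallible allocators and nsresult-style error propagation.
    #include <cstddef>
    #include <cstdint>
    #include <new>

    bool DecodeImage(const uint8_t* data, size_t length,
                     uint8_t** outPixels, size_t pixelBytes)
    {
      // pixelBytes is derived from content (e.g. image dimensions), so it
      // can be huge; use a fallible allocation rather than an aborting one.
      uint8_t* pixels = new (std::nothrow) uint8_t[pixelBytes];
      if (!pixels) {
        return false;  // propagate OOM to the caller instead of crashing
      }

      (void)data;
      (void)length;  // decoding elided in this sketch

      *outPixels = pixels;
      return true;
    }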

I am working on generating some reports on existing crashes where OOMAllocationSize is variable, and also crash signatures that correlate highly with OOM conditions. We should fix these sites.

This is only a stopgap measure, because we see plenty of crashes where OOMAllocationSize is very small (as small as 56 bytes), but it will help keep the browser alive longer and also foil some trivial DoS attacks.

== Regression ranges ==

Some of the issues appear to be recently introduced in Firefox 25. We need to jump on regression ranges ASAP. I could really use help working with users such as those identified at the top of this message to see whether we can find regression ranges in nightly builds.

== Last-ditch UI ==

When contiguous VM starts getting low, we should probably warn the user and ask them to restart Firefox soon or risk crashing. I know that this sucks, but a warning before you crash at least gives you a chance to save things. I have filed this as https://bugzilla.mozilla.org/show_bug.cgi?id=942892

--BDS
