A tale of OOMs and thumbnails: the abstract

Richard Newman Mon, 04 Nov 2013 13:40:17 -0800

There are a few abstract observations that resulted from the concrete problems 
in my last message. They essentially boil down to: how can we protect ourselves 
from our own code? Or, phrased differently: how can I avoid introducing bugs in 
the future?


I am very keen to hear ideas on how to shift the odds in our favor. If we can 
gather some recommendations that we agree on, I'll write them up (and add them 
to the reviewer guidelines).


Firstly, we ended up with a bunch of OOMs on mid-to-high-end devices because we 
made some valid assumptions on smaller devices.

On an older RAZR device, with 248-pixel-wide thumbnails, jumping to the next 
power of two is fine. Arguably on *any* device, you don't want to take a 
screenshot smaller than 256 or 512px.

Even if it blows up one power too far, it's not 2.3MB. And the larger the 
screenshot, the prettier it is when scaled down. That decision was valid, right 
up until we crossed the 512px column width threshold.

I also heard opinions in IRC that this size didn't matter, because of course 
that was just the capture size, which any device can do for just a moment, 
right? That assumption wasn't borne out by the code: we store the full-size 
image (PNG-compressed) in the DB and extract it on launch, and we hold a 
reference to the full-size bitmap rather than scaling it down and throwing away 
the larger version.

We make decisions like this all the time: buffer sizes, batch sizes, preferred 
image dimensions, maximum cache thresholds. As far as I know, we don't have an 
institutionalized mechanism for forcing ourselves to be aware of these 
decisions. Imagine if we had a test that verified that our thumbnails were 
never more than a sanity threshold -- in 2010 that sanity threshold would have 
been maybe 500KB, and each time that test failed (e.g., when we first hit xhdpi 
devices), with a nice descriptive error message, we'd be forced to reassess 
whether we should raise the limit, change our approach, or do something else.

When you write or review code that contains an explicit or systemic limit, 
think about whether you want to add a test, or a runtime check, to call 
attention to a broken assumption. If nothing else, this will force you to think 
about the implicit assumptions you're making. Perhaps the right outcome is a 
telemetry probe that reports cache fullness, or cache resize events. Perhaps 
it's a test that just checks the assumption. At the very least it ought to be a 
comment.

(And I suppose the limit applies in the other direction, too: on 2015's 16GB 
handsets, is today's 512KB cache size a good tradeoff?)



Secondly, it doesn't seem like we're paying as much attention as we should to 
memory usage… else we'd have spotted this before on our daily-driver phones. I 
strongly encourage people to run Fennec with `monitor` attached. You can watch 
incremental memory allocations, and take heap snapshots and analyze them. Hell, 
I learn a lot just by watching the full adb log scroll by.

(See <https://github.com/rnewman/android-quiet-logs> if you want to do 
something noisier but more valuable than `fgrep Gecko`.)

I won't go so far as to say that this should be a pre-commit hook, or even a 
mandatory review step, but I know I'm going to keep this in mind for my next 
review or patch.

I filed a meta bug off which to hang bugs that result from memory analysis. 
Next time you're testing by hand, and you have a non-empty profile, try 
grabbing a memory snapshot and familiarizing yourself with the interface. Maybe 
you find that some feature is holding on to a reference and leaking memory… or 
maybe you watch the incremental allocator and wonder why it cost two megs to 
open a menu. File a bug!



Thirdly, our code architecture is difficult to understand (e.g., with favicon 
and thumbnail logic spanning tabs, listeners, the favicon system itself, the 
DB, bits of Gecko, …). That leads to incremental changes that orphan code: it's 
cargo-culted forward between revisions. It was enormously valuable to me to 
have pointers to bugs explaining *why* we did one thing and not another. 
Fortunately I have an obstreperous and questioning personality!

It also leaves a lot of opportunity for memory leaks -- a foo holds on to a 
bar, the bar holds on to a map, the map contains tasks, the tasks hold on to … 
. When you need six buffers open to figure out who owns what, you're in trouble.

I'd love for us to spend a little of our copious free time pushing towards 
simplicity -- pushing related logic together, reducing coupling, avoiding the 
temptation to just fling another runnable onto a background thread and mutate 
another instance's member variables. It'll pay off in the long run.



Thoughts?

-R
_______________________________________________
mobile-firefox-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/mobile-firefox-dev

A tale of OOMs and thumbnails: the abstract

Reply via email to