Summary:

Tried 3.11rc7 and was very happy with how it behaved in our testing.  Tried
this week's 3.12rc5 and was disappointed that it was a step backwards
for us.  The difference for us was the "low memory killer", which
was configured in the 3.11rc7 build but not in the 3.12rc5 build.
Details are below.  As a consequence I'm tagging this bug with both
"upstream 3.11rc7 fixes" and "upstream 3.12rc5 doesn't fix"!


Details:

I've now switched to a real hardware (Dell multicore) platform to make
sure no one has any doubts as to this kernel problem being an issue on
real hardware as well as my VM testbed.  I can now reproduce the hang
from the original bug description using either my 2GB VM or the
actual machine.

I first reproduced the hang with a more recent 3.2.0-45 kernel on this
64-bit Dell hardware and then tried both the mainline 3.11rc7 and this
week's 3.12rc5 kernels from the URL supplied above by Christopher.

The good news is that I was unable to reproduce a problem using the
3.11rc7 kernel and the system was extremely well-behaved!  That is,
despite running a very heavy load it remained responsive to new requests,
appeared to get more overall work accomplished compared to the 3.2 system
in the same time period, and showed minimal kswapd scan rates in the
"sar" records, with no direct allocation failure scans at all.
Naturally, the system was SIGKILL'ing off selected processes periodically
but this is the price I'd expect for running the memory-overloading
test I have here and in my real-world environment.  We much prefer
this behavior of individual processes being killed off, which can be
subsequently relaunched, rather than hanging or crashing the entire
system.  Especially since it appeared that the SIGKILLs in my tests
were *always* directed at processes that were actively doing the memory
consuming work, so they were good choices.

I note that the processes SIGKILL'ed off in the above 3.11rc7 system
were dispatched to their death by the "low memory killer" logic in the
lowmemorykiller.c code.  The standard kernel OOM killer rarely, if ever,
was invoked.  The 3.11rc7 kernel appears to have been built with the
CONFIG_ANDROID_LOW_MEMORY_KILLER=y setting which caused that low memory
killer code to be statically linked into the kernel and register its low
memory shrinker callback function which issued the appropriate SIGKILLs
under overloaded conditions.
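
A minimal sketch of how one might confirm which way a given kernel was
built, assuming the build's config file is available as text (on Ubuntu
it is normally installed as /boot/config-<release>).  The helper name
kernel_config_value is made up for illustration and is not part of any
kernel tooling.

```python
import re


def kernel_config_value(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...).

    Returns None when the option is absent or appears only as the
    commented-out form '# CONFIG_FOO is not set'.
    """
    # Anchor on the full option name so that CONFIG_ANDROID does not
    # accidentally match CONFIG_ANDROID_LOW_MEMORY_KILLER.
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option))
    for line in config_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            return match.group(1)
    return None
```

Reading /boot/config-<release> for each kernel and passing the text in
with "CONFIG_ANDROID_LOW_MEMORY_KILLER" would be expected to give "y"
for a build like the 3.11rc7 one described above and None for one like
the 3.12rc5 build.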

The bad news is that the more recent 3.12rc5 kernel I tried did NOT
have the above CONFIG_ANDROID_LOW_MEMORY_KILLER=y setting and instead
relied upon just the kernel OOM killer.  This 3.12rc5 system is behaving
similarly to when I turned off the 3.11rc7's "low memory killer" via
a /sys/module low memory minfree parameter.  That is, the 3.12rc5 (or
3.11rc7 with "low memory killer" disabled) system experienced:

 1) Much longer user response times, with wide variance
    External wget queries went from 1-5 seconds with the "low memory
    killer" enabled during the overloading tests to 2 *minutes* without
    that facility!

 2) High kswapd scans of .5M-1M/second in the "sar" reports
    With the low memory killer, kswapd scan rates never exceeded a few K/sec.

 3) Fairly high direct allocation failure scans as well (K/sec)

 4) Multiple processes critical to system functions were OOM'ed
    Management shell/terminal sessions that were idle, sshd, cron, etc.

 5) Even a panic in one test sequence
    "Kernel panic - not syncing: Out of memory and no killable processes..."

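For anyone wanting to cross-check the scan rates in items 2 and 3
without "sar", here is a rough sketch that derives them from two
snapshots of /proc/vmstat.  It assumes the page scan counters are named
with the prefixes pgscan_kswapd and pgscan_direct (per-zone suffixes on
older kernels), and the function names are illustrative only.

```python
def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def scan_rates(before, after, seconds):
    """Return kswapd and direct-reclaim page scan rates in pages/second."""
    def total(counters, prefix):
        # Sum every counter with the prefix (covers per-zone variants),
        # skipping pgscan_direct_throttle, which counts events, not pages.
        return sum(value for name, value in counters.items()
                   if name.startswith(prefix)
                   and name != "pgscan_direct_throttle")

    return {
        "kswapd": (total(after, "pgscan_kswapd")
                   - total(before, "pgscan_kswapd")) / float(seconds),
        "direct": (total(after, "pgscan_direct")
                   - total(before, "pgscan_direct")) / float(seconds),
    }
```

Taking one snapshot, sleeping for a known interval, and taking a second
gives rates that should be comparable to the figures quoted above.
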
The behavior of our test systems without the low memory killer
functionality is poor, with the system either crashing or providing
a poor (simulated) customer response.  Either is better than the 3.2
"hang" I've reported, but not by much for our production/response needs!

I understand that there are concerns about the "low memory killer"
killing off processes before even getting to use the allocated
swap space on a system.  I observed that as well, which for us was
fine.  But I appreciate that it may not be desirable to have the
"CONFIG_ANDROID_LOW_MEMORY_KILLER=y" option for all folks' use cases
as was done for the 3.11rc7 build.  But what about supplying that "low
memory killer" as an optionally loadable module by simply building with
"CONFIG_ANDROID_LOW_MEMORY_KILLER=m" in the kernel/distribution package?
That way, those of us who desire to not use any swap area and prefer a
more responsive system overall will have a simple way to load that module
distributed with the then-current Ubuntu kernel.  There are use cases
where it's better to shed load by killing off processes earlier rather than
degrade response time by using the swap area to preserve those processes.
The default would be to retain the current 3.12rc5 behavior: do NOT load
the low memory killer and in so doing experience the standard kernel OOM
handling.  The latter could be improved over time as a separate effort,
if needed.
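
As a rough illustration of what checking for the module might look like,
the sketch below tests whether the low memory killer is present and
converts its minfree thresholds to MiB.  The sysfs paths, the
comma-separated page-count format, and the 4 KiB page size are
assumptions on my part, and the helper names are hypothetical.

```python
import os

# Assumed locations; the directory should only exist when the low memory
# killer is built in or has been loaded as a module.
MODULE_DIR = "/sys/module/lowmemorykiller"
MINFREE_PATH = MODULE_DIR + "/parameters/minfree"


def lowmemorykiller_present(module_dir=MODULE_DIR):
    """Return True if the low memory killer appears under /sys/module."""
    return os.path.isdir(module_dir)


def minfree_thresholds_mib(minfree_text, page_size=4096):
    """Convert a comma-separated list of page counts into MiB values."""
    pages = [int(field) for field in minfree_text.strip().split(",")
             if field.strip()]
    return [count * page_size / (1024.0 * 1024.0) for count in pages]
```

With the module absent, lowmemorykiller_present() would return False and
the standard kernel OOM handling would apply, matching the proposed
default.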

We would consider the above minor loadable module configuration change as
a simple way to resolve this memory overloading issue to our satisfaction.
I look forward to hearing whether this can be done for some supported
version of an LTS Precise kernel, such as via a backport of an LTS 3.12
kernel perhaps.


** Tags added: kernel-bug-exists-upstream kernel-bug-exists-upstream-v3.12-rc5 
kernel-fixed-upstream kernel-fixed-upstream-v3.11-rc7

** Changed in: linux (Ubuntu)
       Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1154876

Title:
  3.2.0-38 and earlier systems hang with heavy memory usage

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs