On Thursday 08 November 2007 00:48, Frank van Maarseveen wrote: > On Wed, Nov 07, 2007 at 09:01:17AM +1100, Nick Piggin wrote: > > On Tuesday 06 November 2007 04:42, Frank van Maarseveen wrote: > > > For quite some time I'm seeing occasional lockups spread over 50 > > > different machines I'm maintaining. Symptom: a page allocation failure > > > with order:1, GFP_ATOMIC, while there is plenty of memory, as it seems > > > (lots of free pages, almost no swap used) followed by a lockup > > > (everything dead). I've collected all (12) crash cases which occurred > > > the last 10 weeks on 50 machines total (i.e. 1 crash every 41 weeks on > > > average). The kernel messages are summarized to show the interesting > > > part (IMO) they have in common. Over the years this has become the > > > crash cause #1 for stable kernels for me (fglrx doesn't count ;). > > > > > > One note: I suspect that reporting a GFP_ATOMIC allocation failure in > > > an network driver via that same driver (netconsole) may not be the > > > smartest thing to do and this could be responsible for the lockup > > > itself. However, the initial page allocation failure remains and I'm > > > not sure how to address that problem. > > > > It isn't unexpected. If an atomic allocation doesn't have enough memory, > > it kicks off kswapd to start freeing memory for it. However, it cannot > > wait for memory to become free (it's GFP_ATOMIC), so it has to return > > failure. GFP_ATOMIC allocation paths are designed so that the kernel can > > recover from this situation, and a subsequent allocation will have free > > memory. > > > > Probably in production kernels we should default to only reporting this > > when page reclaim is not making any progress. > > > > > I still think the issue is memory fragmentation but if so, it looks > > > a bit extreme to me: One system with 2GB of ram crashed after a day, > > > merely running a couple of TCP server programs. All systems have either > > > 1 or 2GB ram and at least 1G of (merely unused) swap. > > > > You can reduce the chances of it happening by increasing > > /proc/sys/vm/min_free_kbytes. > > It's 3807 everywhere by default here which means roughly 950 pages if I > understand correctly. However, the problem occurs with much more free > pages as it seems. "grep ' free:' messages*" on the netconsole logging > machine shows:
But it's an order-1 allocation, which may not be available due to fragmentation. Although you might have large amounts of memory free at a given point, fragmentation can be triggered earlier when free memory gets very low (because order-0 allocations may have taken up all of the free order-1 pages). Increasing it is known to help. Although you shouldn't crash due to allocation failures... it would be nice if you could connect a serial or vga console and see what's happening... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/