Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)
From: Nick Piggin <[EMAIL PROTECTED]>
Date: Thu, 8 Nov 2007 16:55:38 +1100

> But it's an order-1 allocation, which may not be available due to
> fragmentation.

If the user is using an MTU of only 1500 (and thus not using jumbo
frames), he may be being bitten by a bug, recently fixed by Herbert Xu,
that made us use order > 0 pages accidentally. His fix is below:

commit deea84b0ae3d26b41502ae0a39fe7fe134e703d0
Author: Herbert Xu <[EMAIL PROTECTED]>
Date:   Sun Oct 21 16:27:46 2007 -0700

    [NET]: Fix SKB_WITH_OVERHEAD calculation

    The calculation in SKB_WITH_OVERHEAD is incorrect in that it can
    cause an overflow across a page boundary, which is what it is meant
    to prevent.  In particular, the header length (X) should not be
    lumped together with skb_shared_info.  The latter needs to be
    aligned properly while the header has no choice but to sit in front
    of wherever the payload is.

    Therefore the correct calculation is to take away the aligned size
    of skb_shared_info, and then subtract the header length.  The
    resulting quantity L satisfies the following inequality:

        SKB_DATA_ALIGN(L + X) + sizeof(struct skb_shared_info) <= PAGE_SIZE

    This is the quantity used by alloc_skb to do the actual allocation.

    Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
    Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f93f22b..369f60a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,8 +41,7 @@
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
 				 ~(SMP_CACHE_BYTES - 1))
 #define SKB_WITH_OVERHEAD(X)	\
-	(((X) - sizeof(struct skb_shared_info)) & \
-	 ~(SMP_CACHE_BYTES - 1))
+	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
 #define SKB_MAX_ORDER(X, ORDER) \
 	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
 #define SKB_MAX_HEAD(X)	(SKB_MAX_ORDER((X), 0))
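To make the fix concrete, here is a minimal user-space sketch of the arithmetic (not kernel code). The constants are illustrative assumptions (SMP_CACHE_BYTES of 32, a 168-byte stand-in for sizeof(struct skb_shared_info), a 1-byte header), chosen only to show that the old macro, which aligned after subtracting the unaligned skb_shared_info size, could let alloc_skb's final request spill past PAGE_SIZE, while the fixed macro keeps it inside:

/* Sketch of the SKB_WITH_OVERHEAD bug in user space. The values below
 * are illustrative, not the real kernel ones. alloc_skb's request is
 * modeled as SKB_DATA_ALIGN(len) + SHINFO, per the commit message. */
#include <stdio.h>

#define PAGE_SIZE        4096UL
#define SMP_CACHE_BYTES  32UL
#define SHINFO           168UL  /* stand-in for sizeof(struct skb_shared_info) */

#define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & ~(SMP_CACHE_BYTES - 1))

/* Old, buggy: align after subtracting the unaligned shinfo size. */
#define OLD_WITH_OVERHEAD(X) (((X) - SHINFO) & ~(SMP_CACHE_BYTES - 1))
/* New, fixed: subtract the aligned shinfo size. */
#define NEW_WITH_OVERHEAD(X) ((X) - SKB_DATA_ALIGN(SHINFO))

int main(void)
{
	unsigned long hdr = 1;  /* any header length that isn't cache-aligned */
	unsigned long old_len = OLD_WITH_OVERHEAD(PAGE_SIZE - hdr);
	unsigned long new_len = NEW_WITH_OVERHEAD(PAGE_SIZE - hdr);
	unsigned long old_total = SKB_DATA_ALIGN(old_len + hdr) + SHINFO;
	unsigned long new_total = SKB_DATA_ALIGN(new_len + hdr) + SHINFO;

	printf("old macro: request %lu bytes (%s a %lu-byte page)\n",
	       old_total, old_total > PAGE_SIZE ? "overflows" : "fits in", PAGE_SIZE);
	printf("new macro: request %lu bytes (%s a %lu-byte page)\n",
	       new_total, new_total > PAGE_SIZE ? "overflows" : "fits in", PAGE_SIZE);
	return 0;
}

With these numbers the old macro requests 4104 bytes against a 4096-byte page, which is exactly the kind of accidental order > 0 allocation described above; the fixed macro requests 4072.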
Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)
On Thursday 08 November 2007 00:48, Frank van Maarseveen wrote:
> On Wed, Nov 07, 2007 at 09:01:17AM +1100, Nick Piggin wrote:
> > On Tuesday 06 November 2007 04:42, Frank van Maarseveen wrote:
> > > For quite some time I've been seeing occasional lockups, spread over
> > > 50 different machines I'm maintaining. Symptom: a page allocation
> > > failure with order:1, GFP_ATOMIC, while there is plenty of memory,
> > > as it seems (lots of free pages, almost no swap used), followed by a
> > > lockup (everything dead). I've collected all (12) crash cases which
> > > occurred in the last 10 weeks on 50 machines total (i.e. 1 crash
> > > every 41 weeks per machine on average). The kernel messages are
> > > summarized to show the interesting part (IMO) they have in common.
> > > Over the years this has become crash cause #1 for stable kernels for
> > > me (fglrx doesn't count ;).
> > >
> > > One note: I suspect that reporting a GFP_ATOMIC allocation failure
> > > in a network driver via that same driver (netconsole) may not be the
> > > smartest thing to do, and this could be responsible for the lockup
> > > itself. However, the initial page allocation failure remains and I'm
> > > not sure how to address that problem.
> >
> > It isn't unexpected. If an atomic allocation doesn't have enough
> > memory, it kicks off kswapd to start freeing memory for it. However,
> > it cannot wait for memory to become free (it's GFP_ATOMIC), so it has
> > to return failure. GFP_ATOMIC allocation paths are designed so that
> > the kernel can recover from this situation, and a subsequent
> > allocation will have free memory.
> >
> > Probably in production kernels we should default to only reporting
> > this when page reclaim is not making any progress.
> >
> > > I still think the issue is memory fragmentation but if so, it looks
> > > a bit extreme to me: one system with 2GB of RAM crashed after a day,
> > > merely running a couple of TCP server programs. All systems have
> > > either 1 or 2GB RAM and at least 1G of (merely unused) swap.
> >
> > You can reduce the chances of it happening by increasing
> > /proc/sys/vm/min_free_kbytes.
>
> It's 3807 everywhere by default here, which means roughly 950 pages if
> I understand correctly. However, the problem occurs with many more free
> pages, as it seems. "grep ' free:' messages*" on the netconsole logging
> machine shows:

But it's an order-1 allocation, which may not be available due to
fragmentation. Although you might have large amounts of memory free at a
given point, fragmentation can be triggered earlier, when free memory
gets very low (because order-0 allocations may have taken up all of the
free order-1 pages). Increasing min_free_kbytes is known to help.

Although you shouldn't crash due to allocation failures... it would be
nice if you could connect a serial or vga console and see what's
happening...
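Nick's scenario (plenty of free memory overall, but no free order-1 blocks) can be observed directly in /proc/buddyinfo, which lists each zone's free-block counts per order. Below is a small diagnostic sketch in plain user-space C; the only assumption is the usual "Node 0, zone Normal <count per order> ..." layout of that file:

/* Sketch: parse /proc/buddyinfo and flag zones that still have order-0
 * pages free but no blocks at order >= 1, i.e. exactly the state in
 * which an order-1 GFP_ATOMIC allocation fails with lots of memory
 * apparently "free". */
#include <stdio.h>

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/buddyinfo", "r");

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char node[32], zone[32];
		long order0 = 0, higher = 0, c;
		int order = 0, n = 0;
		const char *p = line;

		if (sscanf(p, "Node %31[^,], zone %31s%n", node, zone, &n) < 2)
			continue;
		p += n;
		while (sscanf(p, "%ld%n", &c, &n) == 1) {
			if (order == 0)
				order0 = c;
			else
				higher += c;
			p += n;
			order++;
		}
		if (order > 1 && higher == 0)
			printf("Node %s zone %s: %ld order-0 pages free, no higher-order blocks\n",
			       node, zone, order0);
	}
	fclose(f);
	return 0;
}

Run it during or shortly after a failure; a snapshot taken later may look healthy again, since the fragmentation Nick describes is transient.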
Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)
On Wed, Nov 07, 2007 at 02:56:45PM +0100, Frank van Maarseveen wrote:
> On Tue, Nov 06, 2007 at 05:13:50PM -0600, Robert Hancock wrote:
> > Frank van Maarseveen wrote:
> > > For quite some time I've been seeing occasional lockups, spread
> > > over 50 different machines I'm maintaining. Symptom: a page
> > > allocation failure with order:1, GFP_ATOMIC, while there is plenty
> > > of memory, as it seems (lots of free pages, almost no swap used),
> > > followed by a lockup (everything dead). I've collected all (12)
> > > crash cases which occurred in the last 10 weeks on 50 machines
> > > total (i.e. 1 crash every 41 weeks per machine on average). The
> > > kernel messages are summarized to show the interesting part (IMO)
> > > they have in common. Over the years this has become crash cause #1
> > > for stable kernels for me (fglrx doesn't count ;).
> > >
> > > One note: I suspect that reporting a GFP_ATOMIC allocation failure
> > > in a network driver via that same driver (netconsole) may not be
> > > the smartest thing to do, and this could be responsible for the
> > > lockup itself. However, the initial page allocation failure remains
> > > and I'm not sure how to address that problem.
> > >
> > > I still think the issue is memory fragmentation but if so, it looks
> > > a bit extreme to me: one system with 2GB of RAM crashed after a
> > > day, merely running a couple of TCP server programs. All systems
> > > have either 1 or 2GB RAM and at least 1G of (merely unused) swap.
> >
> > These are all order-1 allocations for received network packets that
> > need to be allocated out of low memory (assuming you're using a
> > 32-bit kernel), so it's quite possible for them to fail on occasion.
> > (Are you using jumbo frames?)
>
> I don't use jumbo frames.
>
> > That should not be causing a lockup though: the received packet
> > should just get dropped.
>
> Ok, packet loss is recoverable to some extent. When a system crashes I
> often see a couple of page allocation failures in the same second, all
> reported via netconsole.

[snip]

I've grepped for 'Normal free:', assuming it is the low memory you
mention, to see how it correlates. Of the 12 cases, 7 did crash and 5
recovered:

Nov 5 12:58:27 lokka Normal free:6444kB min:3736kB low:4668kB high:5604kB active:235196kB inactive:104336kB present:889680kB pages_scanned:44 all_unreclaimable? no
Nov 5 12:58:27 lokka Normal free:6444kB min:3736kB low:4668kB high:5604kB active:235196kB inactive:104336kB present:889680kB pages_scanned:44 all_unreclaimable? no
Nov 5 12:58:27 lokka Normal free:6444kB min:3736kB low:4668kB high:5604kB active:235196kB inactive:104336kB present:889680kB pages_scanned:44 all_unreclaimable? no
crash

Oct 29 11:48:07 somero Normal free:5412kB min:3736kB low:4668kB high:5604kB active:288068kB inactive:105708kB present:889680kB pages_scanned:0 all_unreclaimable? no
Oct 29 11:48:07 somero Normal free:6704kB min:3736kB low:4668kB high:5604kB active:287940kB inactive:105084kB present:889680kB pages_scanned:0 all_unreclaimable? no
Oct 29 11:48:08 somero Normal free:8332kB min:3736kB low:4668kB high:5604kB active:287760kB inactive:104240kB present:889680kB pages_scanned:54 all_unreclaimable? no
ok (more cases with increasing free memory not received via netconsole)

Oct 26 11:27:01 naantali Normal free:3976kB min:3736kB low:4668kB high:5604kB active:318568kB inactive:152928kB present:889680kB pages_scanned:0 all_unreclaimable? no
Oct 26 11:27:01 naantali Normal free:4408kB min:3736kB low:4668kB high:5604kB active:318256kB inactive:152856kB present:889680kB pages_scanned:0 all_unreclaimable? no
Oct 26 11:27:01 naantali Normal free:4408kB min:3736kB low:4668kB high:5604kB active:318256kB inactive:152856kB present:889680kB pages_scanned:0 all_unreclaimable? no
crash

Oct 12 14:56:44 koli Normal free:11628kB min:3736kB low:4668kB high:5604kB active:238112kB inactive:157232kB present:889680kB pages_scanned:0 all_unreclaimable? no
ok

Oct 1 08:51:58 salla Normal free:5496kB min:3736kB low:4668kB high:5604kB active:409500kB inactive:46388kB present:889680kB pages_scanned:137 all_unreclaimable? no
Oct 1 08:51:59 salla Normal free:7396kB min:3736kB low:4668kB high:5604kB active:408292kB inactive:46740kB present:889680kB pages_scanned:0 all_unreclaimable? no
crash

Sep 17 10:34:49 lokka Normal free:39756kB min:3736kB low:4668kB high:5604kB active:236916kB inactive:175624kB present:889680kB pages_scanned:0 all_unreclaimable? no
ok

Sep 17 10:48:48 karvio Normal free:11648kB min:3736kB low:4668kB high:5604kB active:424420kB inactive:45380kB present:889680kB pages_scanned:144 all_unreclaimable? no
Sep 17 10:48:48 karvio Normal free:11648kB min:3736kB low:4668kB high:5604kB active:424420kB inactive:45380kB present:889680kB pages_scanned:144 all_unreclaimable? no
crash

Sep 20 10:32:50 nivala Normal free:27276kB min:3736kB low:4668kB high:5604kB active:354084kB inactive:104152kB present:889680kB pages_scanned:260 all_unreclaimable? no
crash

Sep 3 09:46:11 lahti Normal free:26200kB min:3736kB low:4668kB high:5604kB
Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)
On Tue, Nov 06, 2007 at 05:13:50PM -0600, Robert Hancock wrote:
> Frank van Maarseveen wrote:
> > For quite some time I've been seeing occasional lockups, spread over
> > 50 different machines I'm maintaining. Symptom: a page allocation
> > failure with order:1, GFP_ATOMIC, while there is plenty of memory, as
> > it seems (lots of free pages, almost no swap used), followed by a
> > lockup (everything dead). I've collected all (12) crash cases which
> > occurred in the last 10 weeks on 50 machines total (i.e. 1 crash
> > every 41 weeks per machine on average). The kernel messages are
> > summarized to show the interesting part (IMO) they have in common.
> > Over the years this has become crash cause #1 for stable kernels for
> > me (fglrx doesn't count ;).
> >
> > One note: I suspect that reporting a GFP_ATOMIC allocation failure in
> > a network driver via that same driver (netconsole) may not be the
> > smartest thing to do, and this could be responsible for the lockup
> > itself. However, the initial page allocation failure remains and I'm
> > not sure how to address that problem.
> >
> > I still think the issue is memory fragmentation but if so, it looks a
> > bit extreme to me: one system with 2GB of RAM crashed after a day,
> > merely running a couple of TCP server programs. All systems have
> > either 1 or 2GB RAM and at least 1G of (merely unused) swap.
>
> These are all order-1 allocations for received network packets that
> need to be allocated out of low memory (assuming you're using a 32-bit
> kernel), so it's quite possible for them to fail on occasion. (Are you
> using jumbo frames?)

I don't use jumbo frames.

> That should not be causing a lockup though: the received packet should
> just get dropped.

Ok, packet loss is recoverable to some extent. When a system crashes I
often see a couple of page allocation failures in the same second, all
reported via netconsole. Here's the full log of such a case:

Oct 26 11:27:01 naantali kswapd0: page allocation failure. order:1, mode:0x4020
Oct 26 11:27:01 naantali [<c01054aa>] show_trace_log_lvl+0x1a/0x30
Oct 26 11:27:01 naantali [<c01054d2>] show_trace+0x12/0x20
Oct 26 11:27:01 naantali [<c01055f6>] dump_stack+0x16/0x20
Oct 26 11:27:01 naantali [<c0156ace>] __alloc_pages+0x27e/0x300
Oct 26 11:27:01 naantali [<c016e916>] allocate_slab+0x46/0x90
Oct 26 11:27:01 naantali [<c016e9d1>] new_slab+0x31/0x140
Oct 26 11:27:01 naantali [<c016efdc>] __slab_alloc+0xbc/0x180
Oct 26 11:27:01 naantali [<c0170354>] __kmalloc_track_caller+0x74/0x80
Oct 26 11:27:01 naantali [<c04b193d>] __alloc_skb+0x4d/0x110
Oct 26 11:27:01 naantali [<c050088e>] tcp_collapse+0x17e/0x3b0
Oct 26 11:27:01 naantali [<c0500bff>] tcp_prune_queue+0x7f/0x1c0
Oct 26 11:27:01 naantali [<c0500477>] tcp_data_queue+0x487/0x720
Oct 26 11:27:01 naantali [<c05015e0>] tcp_rcv_established+0x3a0/0x6e0
Oct 26 11:27:01 naantali [<c0509419>] tcp_v4_do_rcv+0xe9/0x100
Oct 26 11:27:01 naantali [<c0509c21>] tcp_v4_rcv+0x7f1/0x8d0
Oct 26 11:27:01 naantali [<c04ece8f>] ip_local_deliver+0xef/0x250
Oct 26 11:27:01 naantali [<c04ed444>] ip_rcv+0x264/0x560
Oct 26 11:27:01 naantali [<c04b8c2d>] netif_receive_skb+0x2ad/0x320
Oct 26 11:27:01 naantali [<c04b8d31>] process_backlog+0x91/0x120
Oct 26 11:27:01 naantali [<c04b8e4d>] net_rx_action+0x8d/0x170
Oct 26 11:27:01 naantali [<c0128de8>] __do_softirq+0x78/0x100
Oct 26 11:27:01 naantali [<c0128eac>] do_softirq+0x3c/0x40
Oct 26 11:27:01 naantali [<c0128f15>] irq_exit+0x45/0x50
Oct 26 11:27:01 naantali [<c0106d2f>] do_IRQ+0x4f/0xa0
Oct 26 11:27:01 naantali [<c01050b3>] common_interrupt+0x23/0x30
Oct 26 11:27:01 naantali [<c0575736>] _spin_unlock+0x16/0x20
Oct 26 11:27:01 naantali [<c0185052>] prune_dcache+0x142/0x1a0
Oct 26 11:27:01 naantali [<c01855ee>] shrink_dcache_memory+0x1e/0x50
Oct 26 11:27:01 naantali [<c015a4d9>] shrink_slab+0x139/0x1d0
Oct 26 11:27:01 naantali [<c015b980>] balance_pgdat+0x220/0x380
Oct 26 11:27:01 naantali [<c015bbb8>] kswapd+0xd8/0x140
Oct 26 11:27:01 naantali [<c0136eac>] kthread+0x5c/0xa0
Oct 26 11:27:01 naantali [<c0105337>] kernel_thread_helper+0x7/0x10
Oct 26 11:27:01 naantali ===
Oct 26 11:27:01 naantali Mem-info:
Oct 26 11:27:01 naantali DMA per-cpu:
Oct 26 11:27:01 naantali CPU0: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0
Oct 26 11:27:01 naantali Normal per-cpu:
Oct 26 11:27:01 naantali CPU0: Hot: hi: 186, btch: 31
Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)
On Wed, Nov 07, 2007 at 09:01:17AM +1100, Nick Piggin wrote:
> On Tuesday 06 November 2007 04:42, Frank van Maarseveen wrote:
> > For quite some time I've been seeing occasional lockups, spread over
> > 50 different machines I'm maintaining. Symptom: a page allocation
> > failure with order:1, GFP_ATOMIC, while there is plenty of memory, as
> > it seems (lots of free pages, almost no swap used), followed by a
> > lockup (everything dead). I've collected all (12) crash cases which
> > occurred in the last 10 weeks on 50 machines total (i.e. 1 crash
> > every 41 weeks per machine on average). The kernel messages are
> > summarized to show the interesting part (IMO) they have in common.
> > Over the years this has become crash cause #1 for stable kernels for
> > me (fglrx doesn't count ;).
> >
> > One note: I suspect that reporting a GFP_ATOMIC allocation failure in
> > a network driver via that same driver (netconsole) may not be the
> > smartest thing to do, and this could be responsible for the lockup
> > itself. However, the initial page allocation failure remains and I'm
> > not sure how to address that problem.
>
> It isn't unexpected. If an atomic allocation doesn't have enough
> memory, it kicks off kswapd to start freeing memory for it. However, it
> cannot wait for memory to become free (it's GFP_ATOMIC), so it has to
> return failure. GFP_ATOMIC allocation paths are designed so that the
> kernel can recover from this situation, and a subsequent allocation
> will have free memory.
>
> Probably in production kernels we should default to only reporting this
> when page reclaim is not making any progress.
>
> > I still think the issue is memory fragmentation but if so, it looks a
> > bit extreme to me: one system with 2GB of RAM crashed after a day,
> > merely running a couple of TCP server programs. All systems have
> > either 1 or 2GB RAM and at least 1G of (merely unused) swap.
>
> You can reduce the chances of it happening by increasing
> /proc/sys/vm/min_free_kbytes.

It's 3807 everywhere by default here, which means roughly 950 pages if I
understand correctly. However, the problem occurs with many more free
pages, as it seems. "grep ' free:' messages*" on the netconsole logging
machine shows:

messages:Nov 5 12:58:27 lokka free:89531 slab:122946 mapped:4421 pagetables:295 bounce:0
messages.0:Oct 29 11:48:07 somero free:113487 slab:109249 mapped:177942 pagetables:645 bounce:0
messages.1:Oct 26 11:27:01 naantali free:41073 slab:97335 mapped:2540 pagetables:149 bounce:0
messages.3:Oct 12 14:56:44 koli free:25694 slab:111842 mapped:16299 pagetables:794 bounce:0
messages.4:Oct 1 08:51:58 salla free:2964 slab:97675 mapped:279622 pagetables:650 bounce:0
messages.6:Sep 17 10:34:49 lokka free:170034 slab:95601 mapped:5548 pagetables:304 bounce:0
messages.6:Sep 17 10:48:48 karvio free:33597 slab:94909 mapped:247838 pagetables:695 bounce:0
messages.6:Sep 20 10:32:50 nivala free:80943 slab:93154 mapped:200040 pagetables:698 bounce:0
messages.8:Sep 3 09:46:11 lahti free:8195 slab:125438 mapped:2911 pagetables:192 bounce:0
messages.9:Aug 30 10:40:46 ropi free:61633 slab:90119 mapped:272634 pagetables:487 bounce:0
messages.9:Aug 30 10:46:58 ivalo free:41600 slab:88279 mapped:272705 pagetables:487 bounce:0
messages.9:Aug 31 16:30:02 lokka free:40661 slab:115006 mapped:21208 pagetables:493 bounce:0

So it is happening even with 170034 free pages (Sep 17 10:34:49), i.e.
640M free. In this particular case the machine didn't crash, but in the
majority of cases it does.

Here's the full log of the 170034 free pages case:

Sep 17 10:34:49 lokka kernel: ftxpd: page allocation failure. order:1, mode:0x4020
Sep 17 10:34:49 lokka kernel: [<c01054aa>] show_trace_log_lvl+0x1a/0x30
Sep 17 10:34:49 lokka kernel: [<c01054d2>] show_trace+0x12/0x20
Sep 17 10:34:49 lokka kernel: [<c01055f6>] dump_stack+0x16/0x20
Sep 17 10:34:49 lokka kernel: [<c0156ade>] __alloc_pages+0x27e/0x300
Sep 17 10:34:49 lokka kernel: [<c016e926>] allocate_slab+0x46/0x90
Sep 17 10:34:49 lokka kernel: [<c016e9e1>] new_slab+0x31/0x140
Sep 17 10:34:49 lokka kernel: [<c016efec>] __slab_alloc+0xbc/0x180
Sep 17 10:34:49 lokka kernel: [<c0170364>] __kmalloc_track_caller+0x74/0x80
Sep 17 10:34:49 lokka kernel: [<c04b1d8d>] __alloc_skb+0x4d/0x110
Sep 17 10:34:49 lokka kernel: [<c0500cce>] tcp_collapse+0x17e/0x3b0
Sep 17 10:34:49 lokka kernel: [<c050103f>] tcp_prune_queue+0x7f/0x1c0
Sep 17 10:34:49 lokka kernel: [<c05008b7>] tcp_data_queue+0x487/0x720
Sep 17 10:34:49 lokka kernel: [<c0501a20>] tcp_rcv_established+0x3a0/0x6e0
Sep 17 10:34:49 lokka kernel: [<c0509859>] tcp_v4_do_rcv+0xe9/0x100
Sep 17 10:34:49 lokka kernel: [<c050a061>] tcp_v4_rcv+0x7f1/0x8d0
Sep 17 10:34:49 lokka kernel: [<c04ed2bf>] ip_local_deliver+0xef/0x250
Sep 17 10:34:49 lokka kernel: [<c04ed874>] ip_rcv+0x264/0x560
Sep 17 10:34:49 lokka kernel: [<c04b905d>] netif_receive_skb+0x2ad/0x320
Sep 17 10:34:49 lokka kernel: [<c04b9161>] process_backlog+0x91/0x120
Sep 17 10:34:49 lokka kernel: [<c04b927d>] net_rx_action+0x8d/0x170
Sep 17 10:34:49 lokka kernel: [] __do_softirq+0x78/0x100
Sep 17 10:34:49 lokka kernel: [] do_softirq+0x3c/0x40
Sep 17 10:34:49 lokka
Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)
On Tuesday 06 November 2007 04:42, Frank van Maarseveen wrote:
> For quite some time I've been seeing occasional lockups, spread over 50
> different machines I'm maintaining. Symptom: a page allocation failure
> with order:1, GFP_ATOMIC, while there is plenty of memory, as it seems
> (lots of free pages, almost no swap used), followed by a lockup
> (everything dead). I've collected all (12) crash cases which occurred
> in the last 10 weeks on 50 machines total (i.e. 1 crash every 41 weeks
> per machine on average). The kernel messages are summarized to show the
> interesting part (IMO) they have in common. Over the years this has
> become crash cause #1 for stable kernels for me (fglrx doesn't count ;).
>
> One note: I suspect that reporting a GFP_ATOMIC allocation failure in a
> network driver via that same driver (netconsole) may not be the
> smartest thing to do, and this could be responsible for the lockup
> itself. However, the initial page allocation failure remains and I'm
> not sure how to address that problem.

It isn't unexpected. If an atomic allocation doesn't have enough memory,
it kicks off kswapd to start freeing memory for it. However, it cannot
wait for memory to become free (it's GFP_ATOMIC), so it has to return
failure. GFP_ATOMIC allocation paths are designed so that the kernel can
recover from this situation, and a subsequent allocation will have free
memory.

Probably in production kernels we should default to only reporting this
when page reclaim is not making any progress.

> I still think the issue is memory fragmentation but if so, it looks a
> bit extreme to me: one system with 2GB of RAM crashed after a day,
> merely running a couple of TCP server programs. All systems have either
> 1 or 2GB RAM and at least 1G of (merely unused) swap.

You can reduce the chances of it happening by increasing
/proc/sys/vm/min_free_kbytes.
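The contract Nick describes (the allocator may not sleep, so on shortage it only wakes the reclaim thread and returns failure, and the caller must cope) can be modeled with a short user-space toy. Everything below is illustrative; none of these names are kernel APIs:

/* Toy model of a GFP_ATOMIC-style allocation: never blocks, and on
 * shortage it only signals a reclaim thread (kswapd's role here) and
 * returns failure at once, leaving recovery to the caller. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  kick = PTHREAD_COND_INITIALIZER;
static long free_pages = 2;              /* start nearly exhausted */

static void *kswapd_like(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	for (;;) {
		pthread_cond_wait(&kick, &lock);  /* woken by a failed alloc */
		free_pages += 64;                 /* "reclaim" asynchronously */
		printf("reclaim: replenished, free_pages=%ld\n", free_pages);
	}
	return NULL;  /* unreached */
}

/* Callable from a context that cannot sleep: never waits for memory. */
static bool alloc_atomic(void)
{
	bool ok;
	pthread_mutex_lock(&lock);
	ok = free_pages > 0;
	if (ok)
		free_pages--;
	else
		pthread_cond_signal(&kick);  /* kick reclaim, don't wait */
	pthread_mutex_unlock(&lock);
	return ok;
}

int main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, kswapd_like, NULL);

	for (int i = 0; i < 8; i++) {
		if (alloc_atomic())
			printf("alloc %d ok\n", i);
		else
			printf("alloc %d failed: caller must recover (e.g. drop packet)\n", i);
		usleep(10000);  /* give reclaim a chance between attempts */
	}
	return 0;
}

The point of the model is the one made above: the failure itself is expected and recoverable, so a lockup afterwards points at something else (here, possibly netconsole reporting the failure through the failing driver).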
Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)
Frank van Maarseveen wrote:
> For quite some time I've been seeing occasional lockups, spread over 50
> different machines I'm maintaining. Symptom: a page allocation failure
> with order:1, GFP_ATOMIC, while there is plenty of memory, as it seems
> (lots of free pages, almost no swap used), followed by a lockup
> (everything dead). I've collected all (12) crash cases which occurred
> in the last 10 weeks on 50 machines total (i.e. 1 crash every 41 weeks
> per machine on average). The kernel messages are summarized to show the
> interesting part (IMO) they have in common. Over the years this has
> become crash cause #1 for stable kernels for me (fglrx doesn't count ;).
>
> One note: I suspect that reporting a GFP_ATOMIC allocation failure in a
> network driver via that same driver (netconsole) may not be the
> smartest thing to do, and this could be responsible for the lockup
> itself. However, the initial page allocation failure remains and I'm
> not sure how to address that problem.
>
> I still think the issue is memory fragmentation but if so, it looks a
> bit extreme to me: one system with 2GB of RAM crashed after a day,
> merely running a couple of TCP server programs. All systems have either
> 1 or 2GB RAM and at least 1G of (merely unused) swap.

These are all order-1 allocations for received network packets that need
to be allocated out of low memory (assuming you're using a 32-bit
kernel), so it's quite possible for them to fail on occasion. (Are you
using jumbo frames?)

That should not be causing a lockup though: the received packet should
just get dropped.

-- 
Robert Hancock
Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)
For quite some time I've been seeing occasional lockups, spread over 50
different machines I'm maintaining. Symptom: a page allocation failure
with order:1, GFP_ATOMIC, while there is plenty of memory, as it seems
(lots of free pages, almost no swap used), followed by a lockup
(everything dead). I've collected all (12) crash cases which occurred in
the last 10 weeks on 50 machines total (i.e. 1 crash every 41 weeks per
machine on average). The kernel messages are summarized to show the
interesting part (IMO) they have in common. Over the years this has
become crash cause #1 for stable kernels for me (fglrx doesn't count ;).

One note: I suspect that reporting a GFP_ATOMIC allocation failure in a
network driver via that same driver (netconsole) may not be the smartest
thing to do, and this could be responsible for the lockup itself.
However, the initial page allocation failure remains and I'm not sure
how to address that problem.

I still think the issue is memory fragmentation but if so, it looks a
bit extreme to me: one system with 2GB of RAM crashed after a day,
merely running a couple of TCP server programs. All systems have either
1 or 2GB RAM and at least 1G of (merely unused) swap.

2.6.22.10:
lokka kswapd0: page allocation failure. order:1, mode:0x4020
Nov 5 12:58:27 lokka __alloc_skb+0x4d/0x110
Nov 5 12:58:27 lokka __netdev_alloc_skb+0x23/0x50
Nov 5 12:58:27 lokka e1000_alloc_rx_buffers+0x127/0x320
Nov 5 12:58:27 lokka e1000_clean_rx_irq+0x299/0x510
Nov 5 12:58:27 lokka e1000_intr+0x80/0x140
Nov 5 12:58:27 lokka e1000_intr+0x80/0x140
Nov 5 12:58:27 lokka handle_IRQ_event+0x28/0x60
Nov 5 12:58:27 lokka handle_fasteoi_irq+0x6e/0xd0
Nov 5 12:58:27 lokka do_IRQ+0x4a/0xa0
...
Nov 5 12:58:27 lokka Active:139432 inactive:152030 dirty:4406 writeback:4 unstable:0
Nov 5 12:58:27 lokka free:89531 slab:122946 mapped:4421 pagetables:295 bounce:0
...
Nov 5 12:58:27 lokka Swap cache: add 16, delete 16, find 0/0, race 0+0
Nov 5 12:58:27 lokka Free swap = 1004004kB
Nov 5 12:58:27 lokka Total swap = 1004052kB
Nov 5 12:58:27 lokka Free swap: 1004004kB
Nov 5 12:58:27 lokka 523912 pages of RAM
Nov 5 12:58:27 lokka 294536 pages of HIGHMEM

2.6.21.6:
somero ftxpd: page allocation failure. order:1, mode:0x20
Oct 29 11:48:07 somero __alloc_skb+0x4c/0x110
Oct 29 11:48:07 somero __netdev_alloc_skb+0x1d/0x50
Oct 29 11:48:07 somero tg3_alloc_rx_skb+0x74/0x150
Oct 29 11:48:07 somero tg3_rx+0x35a/0x440
Oct 29 11:48:07 somero tg3_poll+0x77/0x1e0
Oct 29 11:48:07 somero net_rx_action+0x99/0x190
Oct 29 11:48:07 somero __do_softirq+0x80/0xf0
Oct 29 11:48:07 somero do_softirq+0x3c/0x40
Oct 29 11:48:07 somero irq_exit+0x45/0x50
Oct 29 11:48:07 somero do_IRQ+0x50/0xa0
...
Oct 29 11:48:07 somero Active:227052 inactive:48283 dirty:3423 writeback:0 unstable:0
Oct 29 11:48:07 somero free:113487 slab:109249 mapped:177942 pagetables:645 bounce:0
...
Oct 29 11:48:07 somero Swap cache: add 428601, delete 422625, find 84672794/84685901, race 0+7
Oct 29 11:48:07 somero Free swap = 1901644kB
Oct 29 11:48:07 somero Total swap = 2008116kB
Oct 29 11:48:07 somero Free swap: 1901644kB
Oct 29 11:48:07 somero 521862 pages of RAM
Oct 29 11:48:07 somero 292486 pages of HIGHMEM

2.6.22.10:
naantali kswapd0: page allocation failure. order:1, mode:0x4020
(uptime: one day, merely running a couple of TCP server programs)
Oct 26 11:27:01 naantali __alloc_skb+0x4d/0x110
Oct 26 11:27:01 naantali tcp_collapse+0x17e/0x3b0
Oct 26 11:27:01 naantali tcp_prune_queue+0x7f/0x1c0
Oct 26 11:27:01 naantali tcp_data_queue+0x487/0x720
Oct 26 11:27:01 naantali tcp_rcv_established+0x3a0/0x6e0
Oct 26 11:27:01 naantali tcp_v4_do_rcv+0xe9/0x100
Oct 26 11:27:01 naantali tcp_v4_rcv+0x7f1/0x8d0
Oct 26 11:27:01 naantali ip_local_deliver+0xef/0x250
Oct 26 11:27:01 naantali ip_rcv+0x264/0x560
Oct 26 11:27:01 naantali netif_receive_skb+0x2ad/0x320
Oct 26 11:27:01 naantali process_backlog+0x91/0x120
Oct 26 11:27:01 naantali net_rx_action+0x8d/0x170
Oct 26 11:27:01 naantali __do_softirq+0x78/0x100
Oct 26 11:27:01 naantali do_softirq+0x3c/0x40
Oct 26 11:27:01 naantali irq_exit+0x45/0x50
Oct 26 11:27:01 naantali do_IRQ+0x4f/0xa0
...
Oct 26 11:27:01 naantali Active:115770 inactive:257188 dirty:10243 writeback:1920 unstable:0
Oct 26 11:27:01 naantali free:41073 slab:97335 mapped:2540 pagetables:149 bounce:0
...
Oct 26 11:27:01 naantali Swap cache: add 90, delete 88, find 18/28, race 0+0
Oct 26 11:27:01 naantali Free swap = 2008048kB
Oct 26 11:27:01 naantali Total swap = 2008084kB
Oct 26 11:27:01 naantali Free swap: 2008048kB
Oct 26 11:27:01 naantali 524144 pages of RAM
Oct 26 11:27:01 naantali 294768 pages of HIGHMEM

2.6.21.6:
koli ftxpd: page allocation failure. order:1, mode:0x20
Oct 12 14:56:44 koli __alloc_skb+0x4c/0x110
Oct 12 14:56:44 koli __netdev_alloc_skb+0x1d/0x50
Oct 12 14:56:44 koli tg3_alloc_rx_skb+0x74/0x150
Oct 12 14:56:44 koli tg3_rx+0x35a/0x440
Oct 12 14:56:44 koli tg3_poll+0x77/0x1e0
Oct 12 14:56:44 koli