Hi Andrew On (31/01/08 10:49), AndrewL733 didst pronounce: > My setup is Mandriva 2007 64-bit: > > IntelS5000PSLSATA motherboard > Single Dual Core 3 Ghz Xeon > 2 GB ECC RAM > Myricom 10 GbE Network Adapter (using out of kernel 1.4.0 driver > compiled with "MYRI10GE_ALLOC_ORDER=2") > > > I have two identical systems doing data transfer via Myricom cards > connected directly to one another (in other words, not going through a > switch). > > No problems hammering Myricom with 350 MB/sec NFS traffic while running > 2.6.20.15 and 2.6.22.9 (kernel.org "plain vanilla" kernels I compiled > and run, also with updated Myricom drivers). However, I just compiled > and installed 2.6.24 and get the following errors as soon as I begin > doing NFS I/O over the Myricom card. Maybe I missed something in the > kernel configuration? Throughput when running 2.6.24 is about 40 percent > lower than when running 2.6.22.9-- surely due to these errors. >
It's possible. Is there a stream of these errors coming through constantly? > Jan 29 22:13:36 master kernel: nfsd: page allocation failure. order:2, > mode:0x4020 So, that is a ZONE_NORMAL high-order atomic allocation (GFP_COMPOUND as well which is a little surprise but should make no difference). Not exactly recommended but there are a few drivers that do it and I guess yours is one. > Jan 29 22:13:36 master kernel: Pid: 6586, comm: nfsd Not tainted 2.6.24 #1 > Jan 29 22:13:36 master kernel: > Jan 29 22:13:36 master kernel: Call Trace: > Jan 29 22:13:36 master kernel: <IRQ> [__alloc_pages+822/912] > __alloc_pages+0x336/0x390 > Jan 29 22:13:36 master kernel: <IRQ> [<ffffffff80279d66>] > __alloc_pages+0x336/0x390 > Jan 29 22:13:36 master kernel: [ip_local_deliver+37/112] > ip_local_deliver+0x25/0x70 > Jan 29 22:13:36 master kernel: [<ffffffff80417bb5>] > ip_local_deliver+0x25/0x70 > Jan 29 22:13:36 master kernel: [_end+130301422/2130457992] > :myri10ge:myri10ge_alloc_rx_pages+0x156/0x270 > Jan 29 22:13:36 master kernel: [<ffffffff88280866>] > :myri10ge:myri10ge_alloc_rx_pages+0x156/0x270 > Jan 29 22:13:36 master kernel: [_end+130325174/2130457992] > :myri10ge:myri10ge_poll+0x57e/0xae0 > Jan 29 22:13:36 master kernel: [<ffffffff8828652e>] > :myri10ge:myri10ge_poll+0x57e/0xae0 Coming from the driver in question so no suprise there > <SNIP> > Jan 29 22:13:36 master kernel: Mem-info: > Jan 29 22:13:36 master kernel: DMA per-cpu: > Jan 29 22:13:36 master kernel: CPU 0: Hot: hi: 0, btch: 1 usd: 0 > Cold: hi: 0, btch: 1 usd: 0 > Jan 29 22:13:36 master kernel: CPU 1: Hot: hi: 0, btch: 1 usd: 0 > Cold: hi: 0, btch: 1 usd: 0 > Jan 29 22:13:36 master kernel: DMA32 per-cpu: > Jan 29 22:13:36 master kernel: CPU 0: Hot: hi: 186, btch: 31 usd: 150 > Cold: hi: 62, btch: 15 usd: 62 > Jan 29 22:13:36 master kernel: CPU 1: Hot: hi: 186, btch: 31 usd: 105 > Cold: hi: 62, btch: 15 usd: 52 > Jan 29 22:13:36 master kernel: Active:41027 inactive:437314 dirty:29708 > writeback:0 unstable:0 > Jan 29 22:13:36 master kernel: free:3892 slab:19405 mapped:11793 > pagetables:1512 bounce:0 > Jan 29 22:13:36 master kernel: DMA free:8008kB min:28kB low:32kB high:40kB > active:0kB inactive:2384kB present:11100kB pages_scanned:0 all_unreclaimable? > no > Jan 29 22:13:36 master kernel: lowmem_reserve[]: 0 1998 1998 1998 > Jan 29 22:13:36 master kernel: DMA32 free:7560kB min:5704kB low:7128kB > high:8556kB active:164108kB inactive:1746872kB present:2046800kB > pages_scanned:32 all_unreclaimable? no > Jan 29 22:13:36 master kernel: lowmem_reserve[]: 0 0 0 0 No apparent memory pressure there. We are not even below the low watermark for order-0 allocations and the lowmem_reserve is not a factor for DMA32 allocations. > Jan 29 22:13:36 master kernel: DMA: 4*4kB 5*8kB 6*16kB 6*32kB 4*64kB 2*128kB > 4*256kB 4*512kB 0*1024kB 0*2048kB 1*4096kB = 8024kB > Jan 29 22:13:36 master kernel: DMA32: 1365*4kB 11*8kB 38*16kB 18*32kB 0*64kB > 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 7756kB There are sufficient numbers of order-2 and higher pages free there. It is similar for the subsequent failures. This does not appear to be a fragmentation problem at least so lets see what watermarks look like just for zone->low_pages which should be relatively relaxed. Zone watermarks in pages; Low: 1782 Free: 1890 classzone_reserve: 0 Adjusted Low calculation low = 1782 - 4 + 1 == 1779 ALLOC_HIGH for __GFP_HIGH so low -= (low / 2) == 890 ALLOC_HARDER for !__GFP_WAIT so low -= (low / 4) == 668 Free pages at each order DMA32: 1365*4kB 11*8kB 38*16kB 18*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 7756kB Watermark check order 0: is free_pages(1890) < low(668) + class_rsrv(0) -- no, ok order 1: is free_pages(525) < low(334) -- no, ok order 2: is free_pages(503) < low(167) -- no. ok Err, so it looks like we are not hitting the watermarks either. This does not make sense, the allocation should not have failed. I have to be making a mistake with the zone watermark calculation, can anyone spot it? Is this definetly 2.6.24 and otherwise unpatched? > Jan 29 22:13:36 master kernel: Swap cache: add 43, delete 43, find 0/0, race > 0+0 > Jan 29 22:13:36 master kernel: Free swap = 4088332kB > Jan 29 22:13:36 master kernel: Total swap = 4088500kB > Jan 29 22:13:36 master kernel: Free swap: 4088332kB > Jan 29 22:13:36 master kernel: 523264 pages of RAM > Jan 29 22:13:36 master kernel: 9510 reserved pages > Jan 29 22:13:36 master kernel: 492885 pages shared > Jan 29 22:13:36 master kernel: 0 pages swap cached > Jan 29 22:13:44 master nmbd[3438]: [2008/01/29 22:13:44, 0] > <SNIP> Few questions to try pinning down what is going wrong. 1. Is the stream of errors constant once they start? 2. If you increase /proc/sys/vm/min_free_kbytes to 16384 or 64738, do the allocation errors still occur? If the errors no longer occur, is the performance restored? 3. If you compile with MYRI10GE_ALLOC_ORDER=0 on the old and new kernels, do the allocations errors occur? If not, how does the throughput compare between kernel versions? (I am checking for changes in NFS/IO/writeback that causes a writeout performance regression. Such a slowdown may cause memory pressure later and then these allocation errors when watermarks are hit) 4. If you apply the patch below, do the allocation errors still occur? (The patch disables fragmentation avoidance. I don't think fragmentation is the problem here because ample memory is free and watermarks look ok) 5. Any chance you could test 2.6.23 in case the problem also happens there? Thanks diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-clean/include/linux/mmzone.h linux-2.6.24-disable-antifrag/include/linux/mmzone.h --- linux-2.6.24-clean/include/linux/mmzone.h 2008-01-24 22:58:37.000000000 +0000 +++ linux-2.6.24-disable-antifrag/include/linux/mmzone.h 2008-02-01 17:23:39.000000000 +0000 @@ -45,14 +45,11 @@ for (order = 0; order < MAX_ORDER; order++) \ for (type = 0; type < MIGRATE_TYPES; type++) -extern int page_group_by_mobility_disabled; +#define page_group_by_mobility_disabled (0) static inline int get_pageblock_migratetype(struct page *page) { - if (unlikely(page_group_by_mobility_disabled)) - return MIGRATE_UNMOVABLE; - - return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end); + return MIGRATE_UNMOVABLE; } struct free_area { diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-clean/mm/page_alloc.c linux-2.6.24-disable-antifrag/mm/page_alloc.c --- linux-2.6.24-clean/mm/page_alloc.c 2008-01-24 22:58:37.000000000 +0000 +++ linux-2.6.24-disable-antifrag/mm/page_alloc.c 2008-02-01 17:23:39.000000000 +0000 @@ -164,12 +164,9 @@ int nr_node_ids __read_mostly = MAX_NUMN EXPORT_SYMBOL(nr_node_ids); #endif -int page_group_by_mobility_disabled __read_mostly; - static void set_pageblock_migratetype(struct page *page, int migratetype) { - set_pageblock_flags_group(page, (unsigned long)migratetype, - PB_migrate, PB_migrate_end); + return; } #ifdef CONFIG_DEBUG_VM @@ -2354,17 +2351,6 @@ void build_all_zonelists(void) /* cpuset refresh routine should be here */ } vm_total_pages = nr_free_pagecache_pages(); - /* - * Disable grouping by mobility if the number of pages in the - * system is too low to allow the mechanism to work. It would be - * more accurate, but expensive to check per-zone. This check is - * made on memory-hotadd so a system can start with mobility - * disabled and enable it later - */ - if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES)) - page_group_by_mobility_disabled = 1; - else - page_group_by_mobility_disabled = 0; printk("Built %i zonelists in %s order, mobility grouping %s. " "Total pages: %ld\n", -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/