Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Fri, 9 Mar 2007, Mel Gorman wrote:

> The results without slub_debug were not good except for IA64. x86_64 and ppc64
> both blew up for a variety of reasons. The IA64 results were

Yuck, that is the dst issue that Adrian is also looking at. Likely an issue with slab merging and RCU frees.

> KernBench Comparison
>
>                      2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-slub    %diff
> User   CPU time            1084.64                1032.93           4.77%
> System CPU time              73.38                  63.14          13.95%
> Total  CPU time            1158.02                1096.07           5.35%
> Elapsed    time             307.00                 285.62           6.96%

Wow! The first indication that we are on the right track with this.

> AIM9 Comparison
>
> 2 page_test   2097119.26   3398259.27   1301140.01   62.04%  System Allocations & Pages/second

Wow! Must have all stayed within slab boundaries.

> 8 link_test     64776.04      7488.13    -57287.91  -88.44%  Link/Unlink Pairs/second

Crap. Maybe we straddled a slab boundary here?
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Fri, 9 Mar 2007, Mel Gorman wrote:

> I'm not sure what you mean by per-order queues. The buddy allocator already
> has per-order lists.

Somehow they do not seem to work right. SLAB (and now SLUB too) can avoid (or defer) fragmentation by keeping its own queues.
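For context, the "per-order lists" referred to above are the buddy allocator's free_area array: one free list per allocation order, so order-1 (8kb) blocks already have a dedicated queue. A simplified sketch of the idea (field names follow struct free_area in <linux/mmzone.h>, but details are elided):

	#include <linux/list.h>

	#define MAX_ORDER 11	/* orders 0..10: 4kb pages up to 4mb blocks */

	/* one of these per order, per zone */
	struct free_area {
		struct list_head free_list;	/* free blocks of this order */
		unsigned long	 nr_free;	/* number of blocks queued */
	};

	/* each zone embeds: struct free_area free_area[MAX_ORDER]; */

The question in the thread is therefore not whether such lists exist, but why allocating and freeing order-1 blocks through them fails under load while SLAB's private caching of higher order pages does not.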
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
> Note that I am amazed that the kernbench even worked.

The results without slub_debug were not good except for IA64. x86_64 and ppc64 both blew up for a variety of reasons. The IA64 results were

KernBench Comparison
--------------------
                     2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-slub    %diff
User   CPU time            1084.64                1032.93           4.77%
System CPU time              73.38                  63.14          13.95%
Total  CPU time            1158.02                1096.07           5.35%
Elapsed    time             307.00                 285.62           6.96%

AIM9 Comparison
---------------
                  2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-slub
1 creat-clo             425460.75    438809.64     13348.89    3.14%  File Creations and Closes/second
2 page_test            2097119.26   3398259.27   1301140.01   62.04%  System Allocations & Pages/second
3 brk_test             7008395.33   6728755.72   -279639.61   -3.99%  System Memory Allocations/second
4 jmp_test            12226295.31  12254966.21     28670.90    0.23%  Non-local gotos/second
5 signal_test          1271126.28   1235510.96    -35615.32   -2.80%  Signal Traps/second
6 exec_test                395.54       381.18       -14.36   -3.63%  Program Loads/second
7 fork_test              13218.23     13211.41        -6.82   -0.05%  Task Creations/second
8 link_test              64776.04      7488.13    -57287.91  -88.44%  Link/Unlink Pairs/second

An example console log from x86_64 is below. It's not particularly clear why it went blamo and I haven't had a chance all day to kick it around due to a variety of other hilarity floating around.

Linux version 2.6.21-rc2-mm2-autokern1 ([EMAIL PROTECTED]) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) #1 SMP Thu Mar 8 12:13:27 CST 2007
Command line: ro root=/dev/VolGroup00/LogVol00 rhgb console=tty0 console=ttyS1,19200 selinux=no autobench_args: root=30726124 ABAT:1173378546 loglevel=8
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009d400 (usable)
 BIOS-e820: 0009d400 - 000a (reserved)
 BIOS-e820: 000e - 0010 (reserved)
 BIOS-e820: 0010 - 3ffcddc0 (usable)
 BIOS-e820: 3ffcddc0 - 3ffd (ACPI data)
 BIOS-e820: 3ffd - 4000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 157) 0 entries of 3200 used
Entering add_active_range(0, 256, 262093) 1 entries of 3200 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP 000FDFC0, 0014 (r0 IBM )
ACPI: RSDT 3FFCFF80, 0034 (r1 IBMSERBLADE 1000 IBM 45444F43)
ACPI: FACP 3FFCFEC0, 0084 (r2 IBMSERBLADE 1000 IBM 45444F43)
ACPI: DSDT 3FFCDDC0, 1EA6 (r1 IBMSERBLADE 1000 INTL 2002025)
ACPI: FACS 3FFCFCC0, 0040
ACPI: APIC 3FFCFE00, 009C (r1 IBMSERBLADE 1000 IBM 45444F43)
ACPI: SRAT 3FFCFD40, 0098 (r1 IBMSERBLADE 1000 IBM 45444F43)
ACPI: HPET 3FFCFD00, 0038 (r1 IBMSERBLADE 1000 IBM 45444F43)
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 1 -> APIC 2 -> Node 1
SRAT: PXM 1 -> APIC 3 -> Node 1
SRAT: Node 0 PXM 0 0-4000
Entering add_active_range(0, 0, 157) 0 entries of 3200 used
Entering add_active_range(0, 256, 262093) 1 entries of 3200 used
NUMA: Using 63 for the hash shift.
Bootmem setup node 0 -3ffcd000
Node 0 memmap at 0x81003efcd000 size 16773952 first pfn 0x81003efcd000
sizeof(struct page) = 64
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1048576
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
  0:     0 ->    157
  0:   256 -> 262093
On node 0 totalpages: 261994
  DMA zone: 64 pages used for memmap
  DMA zone: 2017 pages reserved
  DMA zone: 1916 pages, LIFO batch:0
  DMA32 zone: 4031 pages used for memmap
  DMA32 zone: 253966 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap
ACPI: PM-Timer IO Port: 0x2208
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
Processor #2
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
Processor #3
ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])
ACPI: IOAPIC (id[0x0e] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 14, address 0xfec0, GSI 0-23
ACPI: IOAPIC (id[0x0d] address[0xfec1] gsi_base[24])
IOAPIC[1]: apic_id 13, address 0xfe
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Christoph Lameter wrote:

> Note that I am amazed that the kernbench even worked. On small machines

How small? The machines I am testing on aren't "big" but they aren't miserable either.

> I seem to be getting into trouble with order 1 allocations.

That in itself is pretty incredible. From what I see, allocations up to order 3 generally work unless they are atomic, even with the vanilla kernel. That said, it could be because slab is holding onto the high order pages for itself.

> SLAB seems to be able to avoid the situation by keeping higher order pages on
> a freelist and reduce the alloc/frees of higher order pages that the page
> allocator has to deal with. Maybe we need per order queues in the page
> allocator?

I'm not sure what you mean by per-order queues. The buddy allocator already has per-order lists.

> There must be something fundamentally wrong in the page allocator if the SLAB
> queues fix this issue. I was able to fix the issue in V5 by forcing SLUB to
> keep a minimum number of objects around regardless of the fit to a page order
> page. Pass through is deadly since the crappy page allocator cannot handle
> it. Higher order page allocation failures can be avoided by using kmalloc.
> Yuck! Hopefully your patches fix that fundamental problem.

One way to find out for sure.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Christoph Lameter wrote:

> On Thu, 8 Mar 2007, Mel Gorman wrote:
>
> > > Note that the 16kb page size has a major impact on SLUB performance. On
> > > IA64 slub will use only 1/4th the locking overhead as on 4kb platforms.
> >
> > It'll be interesting to see the kernbench tests then with debugging
> > disabled.
>
> You can get a similar effect on 4kb platforms by specifying slub_min_order=2
> on bootup. This means that we have to rely on your patches to allow higher
> order allocs to work reliably though.

It should work out because the way buddy always selects the minimum page size will tend to cluster the slab allocations together whether they are reclaimable or not. It's something I can investigate when slub has stabilised a bit.

However, in general, high order kernel allocations remain a bad idea. Depending on high order allocations that do not group could potentially lead to a situation where the movable areas are used more and more by kernel allocations. I cannot think of a workload that would actually break everything, but it's a possibility.

> The higher the order of slub the less locking overhead. So the better your
> patches deal with fragmentation the more we can reduce locking overhead in
> slub.

I can certainly kick it around a lot and see what happens. It's best that slub_min_order=2 remain an optional performance-enhancing switch though.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
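For anyone following along, both switches discussed here are ordinary kernel command line parameters, so trying this requires only an edited boot entry. An illustrative example (the kernel image path and root device are placeholders, not taken from the thread):

	# hypothetical yaboot/grub entry; only the last two parameters matter
	kernel /vmlinuz-2.6.21-rc2-mm2 ro root=/dev/sda6 slub_min_order=2 slub_debug

slub_min_order=2 forces every slab to be backed by an order-2 (16kb on 4kb-page systems) block, and slub_debug enables the runtime sanity checks used elsewhere in this thread.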
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
Note that I am amazed that the kernbench even worked. On small machines I seem to be getting into trouble with order 1 allocations.

SLAB seems to be able to avoid the situation by keeping higher order pages on a freelist and reduce the alloc/frees of higher order pages that the page allocator has to deal with. Maybe we need per order queues in the page allocator? There must be something fundamentally wrong in the page allocator if the SLAB queues fix this issue.

I was able to fix the issue in V5 by forcing SLUB to keep a minimum number of objects around regardless of the fit to a page order page. Pass through is deadly since the crappy page allocator cannot handle it. Higher order page allocation failures can be avoided by using kmalloc. Yuck! Hopefully your patches fix that fundamental problem.
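The V5 fix described above - keeping a minimum number of objects per slab even when a smaller order would "fit" - amounts to choosing the slab order from the object count rather than from packing efficiency alone. A minimal sketch of that calculation (a hypothetical helper for illustration, not the actual mm/slub.c code):

	#include <stddef.h>

	#define PAGE_SHIFT 12			/* assuming 4kb pages */
	#define PAGE_SIZE  (1UL << PAGE_SHIFT)
	#define MAX_ORDER  11

	/*
	 * Pick the smallest page order whose slab holds at least
	 * min_objects objects of the given size. Returns -1 if even a
	 * MAX_ORDER-1 block cannot hold that many objects.
	 */
	static int order_for_min_objects(size_t size, unsigned int min_objects)
	{
		int order;

		for (order = 0; order < MAX_ORDER; order++) {
			size_t slab_bytes = PAGE_SIZE << order;

			if (slab_bytes / size >= min_objects)
				return order;
		}
		return -1;
	}

With min_objects raised, small caches that would happily fit in an order-0 page are pushed to a higher order, which is exactly what reduces the pressure of order-0/order-1 alloc/free churn on the page allocator.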
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Mel Gorman wrote:

> > Note that the 16kb page size has a major
> > impact on SLUB performance. On IA64 slub will use only 1/4th the locking
> > overhead as on 4kb platforms.
>
> It'll be interesting to see the kernbench tests then with debugging
> disabled.

You can get a similar effect on 4kb platforms by specifying slub_min_order=2 on bootup. This means that we have to rely on your patches to allow higher order allocs to work reliably though.

The higher the order of slub the less locking overhead. So the better your patches deal with fragmentation the more we can reduce locking overhead in slub.
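The "1/4th the locking overhead" figure follows from simple arithmetic, assuming the list lock is taken roughly once per slab rather than once per object (a simplification of the real fast path). Taking 64-byte objects purely as an example:

	16kb slab:  16384 / 64 = 256 objects handed out per lock operation
	 4kb slab:   4096 / 64 =  64 objects handed out per lock operation

A 16kb page size therefore amortises each lock acquisition over four times as many objects, which is the same effect slub_min_order=2 buys on 4kb platforms.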
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Mel Gorman wrote:

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.

Lower bits must be clear, right? Looks like the pud was released and then reused for a 64 byte cache or so. This is likely a freelist pointer that slub put there after allocating the page for the 64 byte cache. Then we tried to use the pud.

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab c0756090
>     offset=240 flags=50c7 inuse=3 freelist=c50de0f0
>   Bytes b4 c50de0e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     Object c50de0f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     Object c50de100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     Object c50de110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     Object c50de120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>    Redzone c50de130: 00 00 00 00 00 00 00 00
> FreePointer c50de138:

Data overwritten after free or after slab was allocated. So this may be the same issue. The pud was zapped after it was freed, destroying the poison of another object in the 64 byte cache.

Hmmm.. Maybe I should put the pad checks before the object checks. That way we detect that the whole slab was corrupted and do not flag just a single object.
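To see why a stale pud write shows up this way, recall that SLUB links free objects by storing the freelist pointer inside the free object itself. A simplified sketch of that linkage (illustrative only; the real mm/slub.c helpers also take the cache and its configured pointer offset into account):

	/*
	 * Each free object begins with a pointer to the next free object.
	 * If anything writes to an object after it was freed - as the
	 * stale pud reference apparently did here - this pointer, or a
	 * neighbouring object's poison/redzone bytes, gets corrupted.
	 */
	static inline void *get_freepointer(void *object)
	{
		return *(void **)object;
	}

	static inline void set_freepointer(void *object, void *next_free)
	{
		*(void **)object = next_free;
	}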
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On (08/03/07 08:48), Christoph Lameter didst pronounce:
> On Thu, 8 Mar 2007, Mel Gorman wrote:
>
> > On x86_64, it completed successfully and looked reliable. There was a 5%
> > performance loss on kernbench and aim9 figures were way down. However, with
> > slub_debug enabled, I would expect that so it's not a fair comparison
> > performance wise. I'll rerun the tests without debug and see what it looks
> > like if you're interested and do not think it's too early to worry about
> > performance instead of clarity. This is what I have for bl6-13 (machine
> > appears on test.kernel.org so additional details are there).
>
> No, it's good to start worrying about performance now. There are still some
> performance issues to be ironed out in particular on NUMA. I am not sure
> f.e. how the reduction of partial lists affects performance.

Ok, I've sent off a bunch of tests - two of which are on NUMA (numaq and x86_64). It'll take them a long time to complete though as there is a lot of testing going on right now.

> > IA64 (machine not visible on TKO) curiously did not exhibit the same
> > problems on kernbench for Total CPU time which is very unexpected but you
> > can see the System CPU times. The AIM9 figures were a bit of an upset but
> > again, I blame slub_debug being enabled
>
> This was a single node box?

Yes, memory looks like this:

Zone PFN ranges:
  DMA          1024 ->   262144
  Normal     262144 ->   262144
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
  0:  1024 -> 30719
  0: 32768 -> 65413
  0: 65440 -> 65505
On node 0 totalpages: 62405
Node 0 memmap at 0xe1126000 size 3670016 first pfn 0xe1134000
  DMA zone: 220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 62185 pages, LIFO batch:7
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap

> Note that the 16kb page size has a major
> impact on SLUB performance. On IA64 slub will use only 1/4th the locking
> overhead as on 4kb platforms.

It'll be interesting to see the kernbench tests then with debugging disabled.

> > (as an aside, the success rates for high-order allocations are lower with
> > SLUB. Again, I blame slub_debug. I know that enabling SLAB_DEBUG has
> > similar effects because of red-zoning and the like)
>
> We have some additional patches here that reduce the max order for some
> allocs. I believe the task_struct gets to be an order 2 alloc with V4,
> which should make a difference for slab fragmentation.

> > Now, the bad news. This exploded on ppc64. It started going wrong early in
> > the boot process and got worse. I haven't looked closely as to why yet as
> > there is other stuff on my plate but I've included a console log that might
> > be some use to you. If you think you have a fix for it, feel free to send
> > it on and I'll give it a test.
>
> Hmmm... Looks like something is zapping an object. Try to rerun with
> a kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.

I've queued up a few tests. One completed as I wrote this and it didn't explode with SLAB_DEBUG set. Maybe the others will be different. I'll kick it around for a bit. It could be a real bug that slab is just not catching.

> > Brought up 4 CPUs
> > Node 0 CPUs: 0-3
> > mm/memory.c:111: bad pud c50e4480.
> > could not vmalloc 20971520 bytes for cache!
>
> Hmmm... a bad pud? I need to look at how the puds are managed on power.
>
> > migration_cost=0,1000
> > *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab
>
> An object was overwritten with zeros after it was freed.
> > RTAS daemon started
> > RTAS: event: 1, Type: Platform Error, Severity: 2
> > audit: initializing netlink socket (disabled)
> > audit(1173335571.256:1): initialized
> > Total HugeTLB memory allocated, 0
> > VFS: Disk quotas dquot_6.5.1
> > Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> > JFS: nTxBlock = 8192, nTxLock = 65536
> > SELinux: Registering netfilter hooks
> > io scheduler noop registered
> > io scheduler anticipatory registered (default)
> > io scheduler deadline registered
> > io scheduler cfq registered
> > pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> > rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> > rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> > vio_register_driver: driver hvc_console registering
> > [ cut here ]
> > Badness at mm/slub.c:1701
>
> Someone did a kmalloc(0, ...). Zero sized allocations are not flagged
> by SLAB but SLUB does.

I'll chase up what's happening here. It will be "reproducible" independent of SLUB by adding a similar check.

> > Call Trace:
> > [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> > [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> > [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
> > [C506B930] [C00046F4] program_check_common+0xf4/0x100
> > --- Exception: 700 at .get_slab+0xbc/0x18c
> >     LR = .__kmalloc+0x28/0x104
> > [C
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Mel Gorman wrote:

> On x86_64, it completed successfully and looked reliable. There was a 5%
> performance loss on kernbench and aim9 figures were way down. However, with
> slub_debug enabled, I would expect that so it's not a fair comparison
> performance wise. I'll rerun the tests without debug and see what it looks
> like if you're interested and do not think it's too early to worry about
> performance instead of clarity. This is what I have for bl6-13 (machine
> appears on test.kernel.org so additional details are there).

No, it's good to start worrying about performance now. There are still some performance issues to be ironed out in particular on NUMA. I am not sure f.e. how the reduction of partial lists affects performance.

> IA64 (machine not visible on TKO) curiously did not exhibit the same problems
> on kernbench for Total CPU time which is very unexpected but you can see the
> System CPU times. The AIM9 figures were a bit of an upset but again, I blame
> slub_debug being enabled

This was a single node box? Note that the 16kb page size has a major impact on SLUB performance. On IA64 slub will use only 1/4th the locking overhead as on 4kb platforms.

> (as an aside, the success rates for high-order allocations are lower with
> SLUB. Again, I blame slub_debug. I know that enabling SLAB_DEBUG has similar
> effects because of red-zoning and the like)

We have some additional patches here that reduce the max order for some allocs. I believe the task_struct gets to be an order 2 alloc with V4, which should make a difference for slab fragmentation.

> Now, the bad news. This exploded on ppc64. It started going wrong early in
> the boot process and got worse. I haven't looked closely as to why yet as
> there is other stuff on my plate but I've included a console log that might
> be some use to you. If you think you have a fix for it, feel free to send it
> on and I'll give it a test.

Hmmm... Looks like something is zapping an object. Try to rerun with a kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.
> could not vmalloc 20971520 bytes for cache!

Hmmm... a bad pud? I need to look at how the puds are managed on power.

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab

An object was overwritten with zeros after it was freed.

> RTAS daemon started
> RTAS: event: 1, Type: Platform Error, Severity: 2
> audit: initializing netlink socket (disabled)
> audit(1173335571.256:1): initialized
> Total HugeTLB memory allocated, 0
> VFS: Disk quotas dquot_6.5.1
> Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> JFS: nTxBlock = 8192, nTxLock = 65536
> SELinux: Registering netfilter hooks
> io scheduler noop registered
> io scheduler anticipatory registered (default)
> io scheduler deadline registered
> io scheduler cfq registered
> pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> vio_register_driver: driver hvc_console registering
> [ cut here ]
> Badness at mm/slub.c:1701

Someone did a kmalloc(0, ...). Zero sized allocations are not flagged by SLAB but SLUB does.
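A sketch of the kind of zero-size check being described - and of the "similar check" Mel suggests adding independently of SLUB - might look like the following (a hypothetical wrapper for illustration, not the actual mm/slub.c code):

	#include <linux/kernel.h>
	#include <linux/slab.h>

	/*
	 * Warn on zero-sized allocations the way SLUB does, so broken
	 * callers can be flushed out even when running on SLAB.
	 */
	static inline void *kmalloc_check_zero(size_t size, gfp_t flags)
	{
		WARN_ON(size == 0);	/* SLAB silently hands back memory here */
		return kmalloc(size, flags);
	}

The warning carries a stack trace, which is how the hvsi_init/tty_register_driver path below was identified as the zero-size caller.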
> Call Trace:
> [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
> [C506B930] [C00046F4] program_check_common+0xf4/0x100
> --- Exception: 700 at .get_slab+0xbc/0x18c
>     LR = .__kmalloc+0x28/0x104
> [C506BC20] [C506BCC0] 0xc506bcc0 (unreliable)
> [C506BCD0] [C00CE2EC] .__kmalloc+0x28/0x104
> [C506BD60] [C022E724] .tty_register_driver+0x5c/0x23c
> [C506BE10] [C0477910] .hvsi_init+0x154/0x1b4
> [C506BEC0] [C0451B7C] .init+0x1c4/0x2f8
> [C506BF90] [C00275D0] .kernel_thread+0x4c/0x68
> mm/memory.c:111: bad pud c5762900.
> mm/memory.c:111: bad pud c5762480.
> [ cut here ]
> kernel BUG at mm/mmap.c:1999!

More page table trouble.
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Tue, 6 Mar 2007, Christoph Lameter wrote:

> [PATCH] SLUB The unqueued slab allocator v4

Hi Christoph,

I shoved these patches through a few tests on x86, x86_64, ia64 and ppc64 last night to see how they got on. I enabled slub_debug to catch any surprises that may be creeping about. The results are mixed.

On x86_64, it completed successfully and looked reliable. There was a 5% performance loss on kernbench and aim9 figures were way down. However, with slub_debug enabled, I would expect that so it's not a fair comparison performance wise. I'll rerun the tests without debug and see what it looks like if you're interested and do not think it's too early to worry about performance instead of clarity. This is what I have for bl6-13 (machine appears on test.kernel.org so additional details are there).

KernBench Comparison
--------------------
                     2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-list-based   %diff
User   CPU time             84.32                  86.03               -2.03%
System CPU time             32.97                  38.21              -15.89%
Total  CPU time            117.29                 124.24               -5.93%
Elapsed    time             34.95                  37.31               -6.75%

AIM9 Comparison
---------------
                  2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-list-based
1 creat-clo             160706.55     62918.54    -97788.01  -60.85%  File Creations and Closes/second
2 page_test             190371.67    204050.99     13679.32    7.19%  System Allocations & Pages/second
3 brk_test             2320679.89   1923512.75   -397167.14  -17.11%  System Memory Allocations/second
4 jmp_test            16391869.38  16380353.27    -11516.11   -0.07%  Non-local gotos/second
5 signal_test           492234.63    235710.71   -256523.92  -52.11%  Signal Traps/second
6 exec_test                232.26       220.88       -11.38   -4.90%  Program Loads/second
7 fork_test               4514.25      3609.40      -904.85  -20.04%  Task Creations/second
8 link_test              53639.76     26925.91    -26713.85  -49.80%  Link/Unlink Pairs/second

IA64 (machine not visible on TKO) curiously did not exhibit the same problems on kernbench for Total CPU time which is very unexpected but you can see the System CPU times. The AIM9 figures were a bit of an upset but again, I blame slub_debug being enabled.

KernBench Comparison
--------------------
                     2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-list-based   %diff
User   CPU time           1084.64                1033.46                4.72%
System CPU time             73.38                  84.14              -14.66%
Total  CPU time           1158.02                1117.60                3.49%
Elapsed    time            307.00                 291.29                5.12%

AIM9 Comparison
---------------
                  2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-list-based
1 creat-clo             425460.75    137709.84   -287750.91  -67.63%  File Creations and Closes/second
2 page_test            2097119.26   2373083.49    275964.23   13.16%  System Allocations & Pages/second
3 brk_test             7008395.33   3787961.51  -3220433.82  -45.95%  System Memory Allocations/second
4 jmp_test            12226295.31  12254744.03     28448.72    0.23%  Non-local gotos/second
5 signal_test          1271126.28    334357.29   -936768.99  -73.70%  Signal Traps/second
6 exec_test                395.54       349.00       -46.54  -11.77%  Program Loads/second
7 fork_test              13218.23      8822.93     -4395.30  -33.25%  Task Creations/second
8 link_test              64776.04      7410.75    -57365.29  -88.56%  Link/Unlink Pairs/second

(as an aside, the success rates for high-order allocations are lower with SLUB. Again, I blame slub_debug. I know that enabling SLAB_DEBUG has similar effects because of red-zoning and the like)

Now, the bad news. This exploded on ppc64. It started going wrong early in the boot process and got worse. I haven't looked closely as to why yet as there is other stuff on my plate but I've included a console log that might be some use to you. If you think you have a fix for it, feel free to send it on and I'll give it a test.

Config file read, 1024 bytes
Welcome
Welcome to yaboot version 1.3.12
Enter "help" to get some basic usage information
boot: autobench
Please wait, loading kernel...
   Elf64 kernel loaded...
Loading ramdisk...
ramdisk loaded at 0240, size: 1648 Kbytes
OF stdout device is: /vdevice/[EMAIL PROTECTED]
Hypertas detected, assuming LPAR !
command line: ro console=hvc0 autobench_args: root=/dev/sda6 ABAT:1173335344 loglevel=8 slub_debug
memory layout at init: all
[SLUB 0/3] SLUB: The unqueued slab allocator V4
[PATCH] SLUB The unqueued slab allocator v4

V3->V4
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after all.
- More bug fixes and stabilization of diagnostic functions. This seems to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's idea)
- Add two new modifications (separate patches) to guarantee a minimum number of objects per slab and to pass through large allocations.

Note that SLUB will warn on zero sized allocations. SLAB just allocates some memory. So some traces from the usb subsystem etc. should be expected. There are very likely also issues remaining in SLUB.

V2->V3
- Debugging and diagnostic support. This is runtime enabled and not compile time enabled. Runtime debugging can be controlled via kernel boot options on an individual slab cache basis or globally.
- Slab Trace support (for individual slab caches).
- Resiliency support: If basic sanity checks are enabled (f.e. via the F boot option) then SLUB will do its best to perform diagnostics and then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including clash of SLUB's use of page flags with i386 arch use for pmds and pgds (which are managed as slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags

V1->V2
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better Slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated descriptions /proc/slabinfo output

This is a new slab allocator which was motivated by the complexity of the existing code in mm/slab.c. It attempts to address a variety of concerns with the existing implementation.

A. Management of object queues

A particular concern was the complex management of the numerous object queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for each allocating CPU and use objects from a slab directly instead of queueing them up.

B. Storage overhead of object queues

SLAB object queues exist per node, per CPU. The alien cache queue even has a queue array that contains a queue for each processor on each node. For very large systems the number of queues and the number of objects that may be caught in those queues grows exponentially. On our systems with 1k nodes/processors we have several gigabytes just tied up for storing references to objects for those queues. This does not include the objects that could be on those queues. One fears that the whole memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

SLAB has overhead at the beginning of each slab. This means that data cannot be naturally aligned at the beginning of a slab block. SLUB keeps all meta data in the corresponding page struct. Objects can be naturally aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte boundaries and can fit tightly into a 4k page with no bytes left over. SLAB cannot do this.

D. SLAB has a complex cache reaper

SLUB does not need a cache reaper for UP systems. On SMP systems the per CPU slab may be pushed back into the partial list but that operation is simple and does not require an iteration over a list of objects. SLAB expires per CPU, shared and alien object queues during cache reaping which may cause strange hold offs.

E. SLAB has complex NUMA policy layer support

SLUB pushes NUMA policy handling into the page allocator.
This means that allocation is coarser (SLUB does interleave on a page level) but that situation was also present before 2.6.13. SLAB's application of policies to individual slab objects allocated in SLAB is certainly a performance concern due to the frequent references to memory policies which may lead a sequence of objects to come from one node after another. SLUB will get a slab full of objects from one node and then will switch to the next.

F. Reduction of the size of partial slab lists

SLAB has per node partial lists. This means that over time a large number of partial slabs may accumulate on those lists. These can only be reused if allocations occur on specific nodes. SLUB has a global pool of partial slabs and will consume slabs from that pool to decrease fragmentation.

G. Tunables

SLAB has sophisticated tuning abilities for each slab cache. One can manipulate the queue sizes in detail. However, filling the queues still requires the use of the spin lock to check out slabs. SLUB has a global parameter (min_slab_order) for tuning. Increasing the minimum slab order can decrease the locking overhead. The bigger the slab order the less motion of pages between per CPU and partial lists occurs and the better SLUB will be scaling.

H. Slab merging

We often have slab caches with simila
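To make the queue-less design of section A concrete, the allocation fast path it implies takes the next object straight off the current CPU's slab freelist, touching no per-node or shared queues. The following is an illustration of that concept under simplified locking assumptions, not the actual mm/slub.c code:

	#include <stddef.h>

	/*
	 * Simplified per-CPU slab fast path. Each CPU owns one active
	 * slab; allocation just pops that slab's freelist. Only when the
	 * slab is exhausted does the allocator fall back to the partial
	 * pool (slow path, not shown).
	 */
	struct cpu_slab {
		void *freelist;		/* first free object in the active slab */
	};

	static void *slab_alloc_fast(struct cpu_slab *c)
	{
		void *object = c->freelist;

		if (!object)
			return NULL;	/* slow path: grab a partial or new slab */

		/* each free object stores a pointer to the next free one */
		c->freelist = *(void **)object;
		return object;
	}

This is also why slab order matters so much for locking (section G): the list lock is only involved when slabs move between the per CPU and partial lists, so the more objects a slab holds, the rarer those moves become.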