Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Christoph Lameter
On Fri, 9 Mar 2007, Mel Gorman wrote:

> The results without slub_debug were not good except for IA64. x86_64 and ppc64
> both blew up for a variety of reasons. The IA64 results were

Yuck, that is the dst issue that Adrian is also looking at. Likely an issue
with slab merging and RCU frees.
 
> KernBench Comparison
> --------------------
>                    2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-slub   %diff
> User   CPU time                 1084.64               1032.93   4.77%
> System CPU time                   73.38                 63.14  13.95%
> Total  CPU time                 1158.02               1096.07   5.35%
> Elapsed time                     307.00                285.62   6.96%

Wow! The first indication that we are on the right track with this.

> AIM9 Comparison
>  2 page_test   2097119.26  3398259.27  1301140.01  62.04% System Allocations & Pages/second

Wow! Must have all stayed within slab boundaries.

>  8 link_test     64776.04     7488.13   -57287.91 -88.44% Link/Unlink Pairs/second

Crap. Maybe we straddled a slab boundary here?


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Christoph Lameter
On Fri, 9 Mar 2007, Mel Gorman wrote:

> I'm not sure what you mean by per-order queues. The buddy allocator already
> has per-order lists.

Somehow they do not seem to work right. SLAB (and now SLUB too) can avoid 
(or defer) fragmentation by keeping its own queues.
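
For illustration, a minimal userspace model of those per-order lists as
the buddy allocator keeps them (all names here are invented for the
example; this is not the kernel's code):

#include <stddef.h>

#define MAX_ORDER 11            /* orders 0..10 with 4k base pages */

struct block {
        struct block *next;     /* free blocks of 2^order pages, linked */
};

struct free_area {
        struct block *free_list;
        unsigned long nr_free;
};

static struct free_area free_area[MAX_ORDER];

/* Satisfy an order-N request from the lowest populated order >= N. */
static struct block *alloc_block(unsigned int order)
{
        unsigned int o;

        for (o = order; o < MAX_ORDER; o++) {
                struct block *b = free_area[o].free_list;

                if (!b)
                        continue;
                free_area[o].free_list = b->next;
                free_area[o].nr_free--;
                /* the real allocator splits unused halves ("buddies")
                 * back onto the lower-order lists here */
                return b;
        }
        return NULL;            /* no contiguous block large enough */
}

int main(void)
{
        return alloc_block(1) == NULL;  /* lists are empty in this model */
}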


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Mel Gorman



> Note that I am amazed that the kernbench even worked.


The results without slub_debug were not good except for IA64. x86_64 and
ppc64 both blew up for a variety of reasons. The IA64 results were:


KernBench Comparison
--------------------
                   2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-slub   %diff
User   CPU time                 1084.64               1032.93   4.77%
System CPU time                   73.38                 63.14  13.95%
Total  CPU time                 1158.02               1096.07   5.35%
Elapsed time                     307.00                285.62   6.96%

AIM9 Comparison
---------------
                2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-slub
 1 creat-clo               425460.75             438809.64    13348.89    3.14% File Creations and Closes/second
 2 page_test              2097119.26            3398259.27  1301140.01   62.04% System Allocations & Pages/second
 3 brk_test               7008395.33            6728755.72  -279639.61   -3.99% System Memory Allocations/second
 4 jmp_test              12226295.31           12254966.21    28670.90    0.23% Non-local gotos/second
 5 signal_test            1271126.28            1235510.96   -35615.32   -2.80% Signal Traps/second
 6 exec_test                  395.54                381.18      -14.36   -3.63% Program Loads/second
 7 fork_test                13218.23              13211.41       -6.82   -0.05% Task Creations/second
 8 link_test                64776.04               7488.13   -57287.91  -88.44% Link/Unlink Pairs/second

An example console log from x86_64 is below. It's not particularly clear
why it went blamo, and I haven't had a chance all day to kick it around
for a bit due to a variety of other hilarity floating around.


Linux version 2.6.21-rc2-mm2-autokern1 ([EMAIL PROTECTED]) (gcc version 4.1.1 
20060525 (Red Hat 4.1.1-1)) #1 SMP Thu Mar 8 12:13:27 CST 2007
Command line: ro root=/dev/VolGroup00/LogVol00 rhgb console=tty0 
console=ttyS1,19200 selinux=no autobench_args: root=30726124 ABAT:1173378546 
loglevel=8
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009d400 (usable)
 BIOS-e820: 0009d400 - 000a (reserved)
 BIOS-e820: 000e - 0010 (reserved)
 BIOS-e820: 0010 - 3ffcddc0 (usable)
 BIOS-e820: 3ffcddc0 - 3ffd (ACPI data)
 BIOS-e820: 3ffd - 4000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 157) 0 entries of 3200 used
Entering add_active_range(0, 256, 262093) 1 entries of 3200 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP 000FDFC0, 0014 (r0 IBM   )
ACPI: RSDT 3FFCFF80, 0034 (r1 IBMSERBLADE 1000 IBM  45444F43)
ACPI: FACP 3FFCFEC0, 0084 (r2 IBMSERBLADE 1000 IBM  45444F43)
ACPI: DSDT 3FFCDDC0, 1EA6 (r1 IBMSERBLADE 1000 INTL  2002025)
ACPI: FACS 3FFCFCC0, 0040
ACPI: APIC 3FFCFE00, 009C (r1 IBMSERBLADE 1000 IBM  45444F43)
ACPI: SRAT 3FFCFD40, 0098 (r1 IBMSERBLADE 1000 IBM  45444F43)
ACPI: HPET 3FFCFD00, 0038 (r1 IBMSERBLADE 1000 IBM  45444F43)
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 1 -> APIC 2 -> Node 1
SRAT: PXM 1 -> APIC 3 -> Node 1
SRAT: Node 0 PXM 0 0-4000
Entering add_active_range(0, 0, 157) 0 entries of 3200 used
Entering add_active_range(0, 256, 262093) 1 entries of 3200 used
NUMA: Using 63 for the hash shift.
Bootmem setup node 0 -3ffcd000
Node 0 memmap at 0x81003efcd000 size 16773952 first pfn 0x81003efcd000
sizeof(struct page) = 64
Zone PFN ranges:
  DMA 0 -> 4096
  DMA324096 ->  1048576
  Normal1048576 ->  1048576
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
0:0 ->  157
0:  256 ->   262093
On node 0 totalpages: 261994
  DMA zone: 64 pages used for memmap
  DMA zone: 2017 pages reserved
  DMA zone: 1916 pages, LIFO batch:0
  DMA32 zone: 4031 pages used for memmap
  DMA32 zone: 253966 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap
ACPI: PM-Timer IO Port: 0x2208
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
Processor #2
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
Processor #3
ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])
ACPI: IOAPIC (id[0x0e] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 14, address 0xfec0, GSI 0-23
ACPI: IOAPIC (id[0x0d] address[0xfec1] gsi_base[24])
IOAPIC[1]: apic_id 13, address 0xfe

Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Mel Gorman

On Thu, 8 Mar 2007, Christoph Lameter wrote:

> Note that I am amazed that the kernbench even worked. On small machines

How small? The machines I am testing on aren't "big" but they aren't
miserable either.

> I seem to be getting into trouble with order 1 allocations.

That in itself is pretty incredible. From what I see, allocations up to
order 3 generally work unless they are atomic, even with the vanilla
kernel. That said, it could be because slab is holding onto the high
order pages for itself.

> SLAB seems to be
> able to avoid the situation by keeping higher order pages on a freelist
> and reduce the alloc/frees of higher order pages that the page allocator
> has to deal with. Maybe we need per order queues in the page allocator?

I'm not sure what you mean by per-order queues. The buddy allocator
already has per-order lists.

> There must be something fundamentally wrong in the page allocator if the
> SLAB queues fix this issue. I was able to fix the issue in V5 by forcing
> SLUB to keep a minimum number of objects around regardless of the fit to
> a page order page. Pass through is deadly since the crappy page allocator
> cannot handle it.
>
> Higher order page allocation failures can be avoided by using kmalloc.
> Yuck! Hopefully your patches fix that fundamental problem.

One way to find out for sure.

--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Mel Gorman

On Thu, 8 Mar 2007, Christoph Lameter wrote:

> On Thu, 8 Mar 2007, Mel Gorman wrote:
>
> > > Note that the 16kb page size has a major
> > > impact on SLUB performance. On IA64 slub will use only 1/4th the locking
> > > overhead as on 4kb platforms.
> > It'll be interesting to see the kernbench tests then with debugging
> > disabled.
>
> You can get a similar effect on 4kb platforms by specifying
> slub_min_order=2 on bootup.
> This means that we have to rely on your patches to allow higher order
> allocs to work reliably though.

It should work out, because the way buddy always selects the minimum
page size will tend to cluster the slab allocations together whether
they are reclaimable or not. It's something I can investigate when slub
has stabilised a bit.

However, in general, high order kernel allocations remain a bad idea.
Depending on high order allocations that do not group could potentially
lead to a situation where the movable areas are used more and more by
kernel allocations. I cannot think of a workload that would actually
break everything, but it's a possibility.

> The higher the order of slub the less
> locking overhead. So the better your patches deal with fragmentation the
> more we can reduce locking overhead in slub.

I can certainly kick it around a lot and see what happens. It's best
that slub_min_order=2 remains an optional performance enhancing switch
though.

--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
Note that I am amazed that the kernbench even worked. On small machines I
seem to be getting into trouble with order 1 allocations. SLAB seems to be
able to avoid the situation by keeping higher order pages on a freelist 
and reduce the alloc/frees of higher order pages that the page allocator
has to deal with. Maybe we need per order queues in the page allocator? 

There must be something fundamentally wrong in the page allocator if the 
SLAB queues fix this issue. I was able to fix the issue in V5 by forcing 
SLUB to keep a minimum number of objects around regardless of the fit to
a page order page. Pass through is deadly since the crappy page allocator 
cannot handle it.
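
For illustration, the kind of sizing rule this describes could look like
the sketch below: choose the smallest order whose slab holds at least
some minimum number of objects. The helper name and the minimum are
invented for the example; this is not the actual V5 code.

#include <stdio.h>

#define PAGE_SIZE   4096UL
#define MIN_OBJECTS 8           /* assumed floor, chosen for the example */

/* Smallest order where a 2^order-page slab holds MIN_OBJECTS objects. */
static unsigned int slab_order(unsigned long size)
{
        unsigned int order = 0;

        while ((PAGE_SIZE << order) / size < MIN_OBJECTS)
                order++;
        return order;
}

int main(void)
{
        /* 1024-byte objects: an order-1 (8k) slab holds 8 objects, so
         * order 1 is chosen instead of passing order-0 pages through. */
        printf("1024 -> order %u\n", slab_order(1024));
        printf("4096 -> order %u\n", slab_order(4096));
        return 0;
}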

Higher order page allocation failures can be avoided by using kmalloc. 
Yuck! Hopefully your patches fix that fundamental problem.



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
On Thu, 8 Mar 2007, Mel Gorman wrote:

> > Note that the 16kb page size has a major 
> > impact on SLUB performance. On IA64 slub will use only 1/4th the locking 
> > overhead as on 4kb platforms.
> It'll be interesting to see the kernbench tests then with debugging
> disabled.

You can get a similar effect on 4kb platforms by specifying slub_min_order=2 on 
bootup.
This means that we have to rely on your patches to allow higher order 
allocs to work reliably though. The higher the order of slub the less 
locking overhead. So the better your patches deal with fragmentation the 
more we can reduce locking overhead in slub.



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
On Thu, 8 Mar 2007, Mel Gorman wrote:

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.

Lower bits must be clear, right? Looks like the pud was released
and then reused for a 64 byte cache or so. This is likely a freelist 
pointer that slub put there after allocating the page for the 64 byte 
cache. Then we tried to use the pud.
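
For illustration, a userspace sketch of the mechanism suspected here
(invented code, not slub's): the freelist is threaded through the free
objects themselves, so the first word of each free object in a freshly
allocated slab page points at the next free object. A stale page-table
reference into such a page would then read one of those pointers as a
bad pud.

#include <stdio.h>
#include <stdlib.h>

/* Thread a freelist through a new slab's objects: the first word of
 * each free object points at the next one. */
static void *init_slab_freelist(void *page, size_t size, unsigned int count)
{
        char *p = page;
        unsigned int i;

        for (i = 0; i + 1 < count; i++)
                *(void **)(p + i * size) = p + (i + 1) * size;
        *(void **)(p + (count - 1) * size) = NULL;      /* last object */
        return page;                                    /* freelist head */
}

int main(void)
{
        void *page = malloc(4096);
        void *head = init_slab_freelist(page, 64, 4096 / 64);

        /* what a stale page-table pointer into this page would now see */
        printf("word at head: %p\n", *(void **)head);
        free(page);
        return 0;
}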

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab
> c0756090
> offset=240 flags=50c7 inuse=3 freelist=c50de0f0
>   Bytes b4 c50de0e0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Object c50de0f0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Object c50de100:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Object c50de110:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Object c50de120:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
>Redzone c50de130:  00 00 00 00 00 00 00 00
>  FreePointer c50de138: 

Data overwritten after free or after slab was allocated. So this may be 
the same issue. pud was zapped after it was freed destroying the poison 
of another object in the 64 byte cache.

Hmmm.. Maybe I should put the pad checks before the object checks. 
That way we detect that the whole slab was corrupted and do not flag just 
a single object.



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Mel Gorman
On (08/03/07 08:48), Christoph Lameter didst pronounce:
> On Thu, 8 Mar 2007, Mel Gorman wrote:
> 
> > On x86_64, it completed successfully and looked reliable. There was a 5%
> > performance loss on kernbench and aim9 figures were way down. However, with
> > slub_debug enabled, I would expect that so it's not a fair comparison
> > performance wise. I'll rerun the tests without debug and see what it looks
> > like if you're interested and do not think it's too early to worry about
> > performance instead of clarity. This is what I have for bl6-13 (machine
> > appears on test.kernel.org so additional details are there).
> 
> No, it's good to start worrying about performance now. There are still some
> performance issues to be ironed out in particular on NUMA. I am not sure
> f.e. how the reduction of partial lists affects performance.
> 

Ok, I've sent off a bunch of tests - two of which are on NUMA (numaq and
x86_64). It'll take them a long time to complete though as there is a
lot of testing going on right now.

> > IA64 (machine not visible on TKO) curiously did not exhibit the same 
> > problems
> > on kernbench for Total CPU time which is very unexpected but you can see the
> > System CPU times. The AIM9 figures were a bit of an upset but again, I blame
> > slub_debug being enabled
> 
> This was a single node box?

Yes, memory looks like this:

Zone PFN ranges:
  DMA  1024 ->   262144
  Normal 262144 ->   262144
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
0: 1024 ->30719
0:32768 ->65413
0:65440 ->65505
On node 0 totalpages: 62405
Node 0 memmap at 0xe1126000 size 3670016 first pfn 0xe1134000
  DMA zone: 220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 62185 pages, LIFO batch:7
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap

> Note that the 16kb page size has a major 
> impact on SLUB performance. On IA64 slub will use only 1/4th the locking 
> overhead as on 4kb platforms.
> 

It'll be interesting to see the kernbench tests then with debugging
disabled.

> > (as an aside, the success rates for high-order allocations are lower with
> > SLUB.
> > Again, I blame slub_debug. I know that enabling SLAB_DEBUG has similar 
> > effects
> > because of red-zoning and the like)
> 
> We have some additional patches here that reduce the max order for some 
> allocs. I believe the task_struct gets to be an order 2 alloc with V4,
> 

Should make a difference for slab fragmentation.

> > Now, the bad news. This exploded on ppc64. It started going wrong early in 
> > the
> > boot process and got worse. I haven't looked closely as to why yet as there 
> > is
> > other stuff on my plate but I've included a console log that might be some 
> > use
> > to you. If you think you have a fix for it, feel free to send it on and I'll
> > give it a test.
> 
> Hmmm... Looks like something is zapping an object. Try to rerun with 
> a kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.
> 

I've queued up a few tests. One completed as I wrote this and it didn't
explode with SLAB_DEBUG set. Maybe the others will be different. I'll
kick it around for a bit.

It could be a real bug that slab is just not catching.

> > Brought up 4 CPUs
> > Node 0 CPUs: 0-3
> > mm/memory.c:111: bad pud c50e4480.
> > could not vmalloc 20971520 bytes for cache!
> 
> Hmmm... a bad pud? I need to look at how the puds are managed on power.
> 
> > migration_cost=0,1000
> > *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab
> 
> An object was overwritten with zeros after it was freed.

> > RTAS daemon started
> > RTAS: event: 1, Type: Platform Error, Severity: 2
> > audit: initializing netlink socket (disabled)
> > audit(1173335571.256:1): initialized
> > Total HugeTLB memory allocated, 0
> > VFS: Disk quotas dquot_6.5.1
> > Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> > JFS: nTxBlock = 8192, nTxLock = 65536
> > SELinux:  Registering netfilter hooks
> > io scheduler noop registered
> > io scheduler anticipatory registered (default)
> > io scheduler deadline registered
> > io scheduler cfq registered
> > pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> > rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> > rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> > vio_register_driver: driver hvc_console registering
> > [ cut here ]
> > Badness at mm/slub.c:1701
> 
> Someone did a kmalloc(0, ...). SLAB does not flag zero sized allocations
> but SLUB does.
> 

I'll chase up what's happening here. It will be "reproducible" independent
of SLUB by adding a similar check.
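
For illustration, such a check could be as small as the hypothetical
wrapper below (a sketch, not an actual patch; WARN_ON produces the same
kind of "Badness at ..." trace that SLUB's mm/slub.c:1701 check does):

#include <linux/kernel.h>
#include <linux/slab.h>

/* Hypothetical wrapper: make SLAB complain about zero-sized
 * allocations the way SLUB already does. */
static inline void *kmalloc_warn_zero(size_t size, gfp_t flags)
{
        WARN_ON(size == 0);     /* emits the backtrace */
        return kmalloc(size, flags);
}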

> > Call Trace:
> > [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> > [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> > [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
> > [C

Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
On Thu, 8 Mar 2007, Mel Gorman wrote:

> On x86_64, it completed successfully and looked reliable. There was a 5%
> performance loss on kernbench and aim9 figures were way down. However, with
> slub_debug enabled, I would expect that so it's not a fair comparison
> performance wise. I'll rerun the tests without debug and see what it looks
> like if you're interested and do not think it's too early to worry about
> performance instead of clarity. This is what I have for bl6-13 (machine
> appears on test.kernel.org so additional details are there).

No, it's good to start worrying about performance now. There are still some
performance issues to be ironed out in particular on NUMA. I am not sure
f.e. how the reduction of partial lists affects performance.

> IA64 (machine not visible on TKO) curiously did not exhibit the same problems
> on kernbench for Total CPU time which is very unexpected but you can see the
> System CPU times. The AIM9 figures were a bit of an upset but again, I blame
> slub_debug being enabled

This was a single node box? Note that the 16kb page size has a major 
impact on SLUB performance. On IA64 slub will use only 1/4th the locking 
overhead as on 4kb platforms.

> (as an aside, the success rates for high-order allocations are lower with SLUB.
> Again, I blame slub_debug. I know that enabling SLAB_DEBUG has similar effects
> because of red-zoning and the like)

We have some additional patches here that reduce the max order for some 
allocs. I believe the task_struct gets to be an order 2 alloc with V4,

> Now, the bad news. This exploded on ppc64. It started going wrong early in the
> boot process and got worse. I haven't looked closely as to why yet as there is
> other stuff on my plate but I've included a console log that might be some use
> to you. If you think you have a fix for it, feel free to send it on and I'll
> give it a test.

Hmmm... Looks like something is zapping an object. Try to rerun with 
a kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.
> could not vmalloc 20971520 bytes for cache!

Hmmm... a bad pud? I need to look at how the puds are managed on power.

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab

An object was overwritten with zeros after it was freed.

> RTAS daemon started
> RTAS: event: 1, Type: Platform Error, Severity: 2
> audit: initializing netlink socket (disabled)
> audit(1173335571.256:1): initialized
> Total HugeTLB memory allocated, 0
> VFS: Disk quotas dquot_6.5.1
> Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> JFS: nTxBlock = 8192, nTxLock = 65536
> SELinux:  Registering netfilter hooks
> io scheduler noop registered
> io scheduler anticipatory registered (default)
> io scheduler deadline registered
> io scheduler cfq registered
> pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> vio_register_driver: driver hvc_console registering
> [ cut here ]
> Badness at mm/slub.c:1701

Someone did a kmalloc(0, ...). SLAB does not flag zero sized allocations
but SLUB does.

> Call Trace:
> [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
> [C506B930] [C00046F4] program_check_common+0xf4/0x100
> --- Exception: 700 at .get_slab+0xbc/0x18c
> LR = .__kmalloc+0x28/0x104
> [C506BC20] [C506BCC0] 0xc506bcc0 (unreliable)
> [C506BCD0] [C00CE2EC] .__kmalloc+0x28/0x104
> [C506BD60] [C022E724] .tty_register_driver+0x5c/0x23c
> [C506BE10] [C0477910] .hvsi_init+0x154/0x1b4
> [C506BEC0] [C0451B7C] .init+0x1c4/0x2f8
> [C506BF90] [C00275D0] .kernel_thread+0x4c/0x68
> mm/memory.c:111: bad pud c5762900.
> mm/memory.c:111: bad pud c5762480.
> [ cut here ]
> kernel BUG at mm/mmap.c:1999!

More page table trouble.



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Mel Gorman

On Tue, 6 Mar 2007, Christoph Lameter wrote:


> [PATCH] SLUB The unqueued slab allocator v4



Hi Christoph,

I shoved these patches through a few tests on x86, x86_64, ia64 and ppc64
last night to see how they got on. I enabled slub_debug to catch any
surprises that may be creeping about.


The results are mixed.

On x86_64, it completed successfully and looked reliable. There was a 5% 
performance loss on kernbench and aim9 figures were way down. However, 
with slub_debug enabled, I would expect that so it's not a fair comparison 
performance wise. I'll rerun the tests without debug and see what it looks 
like if you're interested and do not think it's too early to worry about 
performance instead of clarity. This is what I have for bl6-13 (machine 
appears on test.kernel.org so additional details are there).


KernBench Comparison
--------------------
                   2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-list-based    %diff
User   CPU time                   84.32                       86.03   -2.03%
System CPU time                   32.97                       38.21  -15.89%
Total  CPU time                  117.29                      124.24   -5.93%
Elapsed time                      34.95                       37.31   -6.75%

AIM9 Comparison
---------------
                2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-list-based
 1 creat-clo               160706.55                    62918.54   -97788.01  -60.85% File Creations and Closes/second
 2 page_test               190371.67                   204050.99    13679.32    7.19% System Allocations & Pages/second
 3 brk_test               2320679.89                  1923512.75  -397167.14  -17.11% System Memory Allocations/second
 4 jmp_test              16391869.38                 16380353.27   -11516.11   -0.07% Non-local gotos/second
 5 signal_test             492234.63                   235710.71  -256523.92  -52.11% Signal Traps/second
 6 exec_test                  232.26                      220.88      -11.38   -4.90% Program Loads/second
 7 fork_test                 4514.25                     3609.40     -904.85  -20.04% Task Creations/second
 8 link_test                53639.76                    26925.91   -26713.85  -49.80% Link/Unlink Pairs/second


IA64 (machine not visible on TKO) curiously did not exhibit the same
problems on kernbench for Total CPU time, which is very unexpected, but
you can see the System CPU times. The AIM9 figures were a bit of an upset
but again, I blame slub_debug being enabled.


KernBench Comparison
--------------------
                   2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-list-based    %diff
User   CPU time                 1084.64                     1033.46    4.72%
System CPU time                   73.38                       84.14  -14.66%
Total  CPU time                 1158.02                      1117.6    3.49%
Elapsed time                     307.00                      291.29    5.12%

AIM9 Comparison
---------------
                2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-list-based
 1 creat-clo               425460.75                   137709.84  -287750.91  -67.63% File Creations and Closes/second
 2 page_test              2097119.26                  2373083.49   275964.23   13.16% System Allocations & Pages/second
 3 brk_test               7008395.33                  3787961.51 -3220433.82  -45.95% System Memory Allocations/second
 4 jmp_test              12226295.31                 12254744.03    28448.72    0.23% Non-local gotos/second
 5 signal_test            1271126.28                   334357.29  -936768.99  -73.70% Signal Traps/second
 6 exec_test                  395.54                      349.00      -46.54  -11.77% Program Loads/second
 7 fork_test                13218.23                     8822.93    -4395.30  -33.25% Task Creations/second
 8 link_test                64776.04                     7410.75   -57365.29  -88.56% Link/Unlink Pairs/second

(as an aside, the success rates for high-order allocations are lower with
SLUB. Again, I blame slub_debug. I know that enabling SLAB_DEBUG has
similar effects because of red-zoning and the like)


Now, the bad news. This exploded on ppc64. It started going wrong early in 
the boot process and got worse. I haven't looked closely as to why yet as 
there is other stuff on my plate but I've included a console log that 
might be some use to you. If you think you have a fix for it, feel free to 
send it on and I'll give it a test.



Config file read, 1024 bytes
Welcome
Welcome to yaboot version 1.3.12
Enter "help" to get some basic usage information
boot: autobench
Please wait, loading kernel...
   Elf64 kernel loaded...
Loading ramdisk...
ramdisk loaded at 0240, size: 1648 Kbytes
OF stdout device is: /vdevice/[EMAIL PROTECTED]
Hypertas detected, assuming LPAR !
command line: ro console=hvc0 autobench_args: root=/dev/sda6 ABAT:1173335344 loglevel=8 slub_debug 
memory layout at init:

  all

[SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-06 Thread Christoph Lameter
[PATCH] SLUB The unqueued slab allocator v4

V3->V4
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after
  all.
- More bug fixes and stabilization of diagnostic functions. This seems
  to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's
  idea)
- Add two new modifications (separate patches) to guarantee
  a minimum number of objects per slab and to pass through large
  allocations.

Note that SLUB will warn on zero sized allocations. SLAB just allocates
some memory. So some traces from the usb subsystem etc should be expected.
There are very likely also issues remaining in SLUB.

V2->V3
- Debugging and diagnostic support. This is runtime enabled and not compile
  time enabled. Runtime debugging can be controlled via kernel boot options
  on an individual slab cache basis or globally.
- Slab Trace support (For individual slab caches).
- Resiliency support: If basic sanity checks are enabled (via F f.e.)
  (boot option) then SLUB will do the best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including clash of SLUBs use of page
  flags with i386 arch use for pmd and pgds (which are managed
  as slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags

V1->V2
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better Slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated descriptions /proc/slabinfo output

This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
   each allocating CPU and use objects from a slab directly instead of
   queueing them up.
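
   As an illustration of that fast path, a minimal userspace model (all
   names invented for the example; this is not the actual SLUB code):

   #include <stddef.h>

   #define NR_CPUS_MODEL 4                 /* invented for the example */

   /* Each CPU owns one active slab; allocation is a freelist pop with
    * no intermediate object queue. */
   struct slab {
           void *freelist;         /* next free object, stored in-object */
   };

   struct kmem_cache_model {
           struct slab *cpu_slab[NR_CPUS_MODEL];
   };

   static void *cache_alloc(struct kmem_cache_model *s, int cpu)
   {
           struct slab *slab = s->cpu_slab[cpu];

           if (slab && slab->freelist) {
                   void *object = slab->freelist;

                   slab->freelist = *(void **)object;      /* pop */
                   return object;
           }
           /* slow path: take a partial slab or a new page, not shown */
           return NULL;
   }

   int main(void)
   {
           struct kmem_cache_model cache = { { NULL } };

           return cache_alloc(&cache, 0) == NULL;  /* empty model */
   }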

B. Storage overhead of object queues

   SLAB Object queues exist per node, per CPU. The alien cache queue even
   has a queue array that contain a queue for each processor on each
   node. For very large systems the number of queues and the number of
   objects that may be caught in those queues grows exponentially. On our
   systems with 1k nodes / processors we have several gigabytes just tied up
   for storing references to objects for those queues. This does not include
   the objects that could be on those queues. One fears that the whole
   memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all meta data in the corresponding page_struct. Objects can be naturally
   aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
   boundaries and can fit tightly into a 4k page with no bytes left over.
   SLAB cannot do this.
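
   As a quick check of that arithmetic (assuming 4k pages and no
   per-slab header, per the above):

   #include <stdio.h>

   int main(void)
   {
           unsigned long page = 4096, object = 128;

           /* 4096 / 128 = 32 objects, 4096 % 128 = 0 bytes left over,
            * and every object starts on a 128-byte boundary. */
           printf("%lu objects/page, %lu bytes wasted\n",
                  page / object, page % object);
           return 0;
   }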

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems
   the per CPU slab may be pushed back into partial list but that
   operation is simple and does not require an iteration over a list
   of objects. SLAB expires per CPU, shared and alien object queues
   during cache reaping which may cause strange hold offs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator. This means that
   allocation is coarser (SLUB does interleave on a page level) but that
   situation was also present before 2.6.13. SLAB's application of
   policies to individual slab objects is certainly a performance concern
   due to the frequent references to memory policies, which may lead a
   sequence of objects to come from one node after another. SLUB will get
   a slab full of objects from one node and then will switch to the next.

F. Reduction of the size of partial slab lists

   SLAB has per node partial lists. This means that over time a large
   number of partial slabs may accumulate on those lists. These can
   only be reused if allocations occur on specific nodes. SLUB has a global
   pool of partial slabs and will consume slabs from that pool to
   decrease fragmentation.

G. Tunables

   SLAB has sophisticated tuning abilities for each slab cache. One can
   manipulate the queue sizes in detail. However, filling the queues still
   requires the use of the spin lock to check out slabs. SLUB has a global
   parameter (slub_min_order) for tuning. Increasing the minimum slab
   order can decrease the locking overhead. The bigger the slab order the
   less motions of pages between per CPU and partial lists occur and the
   better SLUB will scale.
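
   A back-of-envelope model of that claim, assuming roughly one lock
   acquisition per slab turnover (an assumption made for the example,
   not a measured figure):

   #include <stdio.h>

   int main(void)
   {
           unsigned long page = 4096, object = 256;
           unsigned int order;

           /* If a lock is taken about once per slab of objects, raising
            * the order divides the locks taken per allocation. */
           for (order = 0; order <= 2; order++) {
                   unsigned long per_slab = (page << order) / object;

                   printf("order %u: %lu objects/slab, ~1 lock per %lu allocs\n",
                          order, per_slab, per_slab);
           }
           return 0;
   }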

H. Slab merging

   We often have slab caches with simila