Re: SLUB performance regression vs SLAB
From: Peter Zijlstra <[EMAIL PROTECTED]>
Date: Fri, 05 Oct 2007 22:32:00 +0200

> Focus on the slab allocator usage, instrument it, record a trace,
> generate a statistical model that matches, and write a small
> program/kernel module that has the same allocation pattern. Then verify
> this statistical workload still shows the same performance difference.
>
> Easy: no
> Doable: yes

The other important bit is likely to generate a lot of DMA traffic, so that the L2 cache bandwidth also gets consumed on the bus side by the PCI controller invalidating both dirty and clean L2 cache lines as devices DMA to/from them. This will also exercise the memory controller, further contending with the CPU when SLAB touches cold data structures.
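One way to put that kind of sustained DMA load next to an allocator test is to stream O_DIRECT reads from a block device in a loop, so the controller keeps DMAing into memory while the allocations run. The sketch below only illustrates the idea; the device path and buffer size are placeholders, not part of any existing test.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t bufsz = 1 << 20;	/* 1MB per read, arbitrary */
	void *buf;
	int fd;

	/* O_DIRECT bypasses the page cache, so every read is a device DMA */
	fd = open("/dev/sdb", O_RDONLY | O_DIRECT);	/* placeholder device */
	if (fd < 0 || posix_memalign(&buf, 4096, bufsz) != 0) {
		perror("setup");
		return 1;
	}

	for (;;) {
		/* re-read the same region forever to keep the bus busy */
		if (pread(fd, buf, bufsz, 0) < 0) {
			perror("pread");
			return 1;
		}
	}
}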
Re: SLUB performance regression vs SLAB
On Thu, 2007-10-04 at 17:02 -0400, Chuck Ebbert wrote:
> On 10/04/2007 04:55 PM, David Miller wrote:
> >
> > Anything, I do mean anything, can be simulated using small test
> > programs.
>
> How do you simulate reading 100TB of data spread across 3000 disks,
> selecting 10% of it using some criterion, then sorting and summarizing
> the result?

Focus on the slab allocator usage, instrument it, record a trace, generate a statistical model that matches, and write a small program/kernel module that has the same allocation pattern. Then verify this statistical workload still shows the same performance difference.

Easy: no
Doable: yes
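To make the replay step concrete, a throwaway kernel module along the following lines could drive kmalloc()/kfree() from a recorded size distribution. This is only a sketch: the size table, batch size and module name are invented placeholders standing in for the statistical model, not code from any posted patch or benchmark.

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static const size_t sizes[] = { 64, 128, 192, 256, 512 };	/* fake histogram */
#define BATCH 128

static struct task_struct *tsk;

static int replay_thread(void *unused)
{
	void *objs[BATCH];
	int i;

	while (!kthread_should_stop()) {
		/* allocate a batch according to the recorded size mix... */
		for (i = 0; i < BATCH; i++)
			objs[i] = kmalloc(sizes[i % ARRAY_SIZE(sizes)], GFP_KERNEL);
		/* ...then free it, mimicking the measured object lifetimes */
		for (i = 0; i < BATCH; i++)
			kfree(objs[i]);		/* kfree(NULL) is a no-op */
		cond_resched();
	}
	return 0;
}

static int __init replay_init(void)
{
	tsk = kthread_run(replay_thread, NULL, "slab_replay");
	return IS_ERR(tsk) ? PTR_ERR(tsk) : 0;
}

static void __exit replay_exit(void)
{
	kthread_stop(tsk);
}

module_init(replay_init);
module_exit(replay_exit);
MODULE_LICENSE("GPL");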
Re: SLUB performance regression vs SLAB
Patch 2/2 SLUB: Allow foreign objects on the per cpu object lists.

In order to free objects we need to touch the page struct of the page that the object belongs to. If this occurs too frequently then we could generate a bouncing cacheline. We do not want that to occur too frequently.

We can avoid the page struct touching for per cpu objects. Now we extend that to allow a limited number of objects that are not part of the cpu slab. Allow up to 4 times the objects that fit into a page in the per cpu list.

If the objects are allocated before we need to free them then we have saved touching a page struct twice. The objects are presumably cache hot, so it is performance-wise good to recycle these locally.

Foreign objects are drained before deactivating cpu slabs and if too many objects accumulate.

For kmem_cache_free() this also has the beneficial effect of getting virt_to_page() operations eliminated or grouped together, which may help reduce the cache footprint and increase the speed of virt_to_page() lookups (they hopefully all come from the same pages). For kfree() we may have to do virt_to_page() in the worst case twice. Once grouped together.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/slub_def.h |    1 
 mm/slub.c                |   82 ++-
 2 files changed, 68 insertions(+), 15 deletions(-)

Index: linux-2.6.23-rc8-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc8-mm2.orig/include/linux/slub_def.h	2007-10-04 22:42:08.0 -0700
+++ linux-2.6.23-rc8-mm2/include/linux/slub_def.h	2007-10-04 22:43:19.0 -0700
@@ -16,6 +16,7 @@ struct kmem_cache_cpu {
 	struct page *page;
 	int node;
 	int remaining;
+	int drain_limit;
 	unsigned int offset;
 	unsigned int objsize;
 };

Index: linux-2.6.23-rc8-mm2/mm/slub.c
===================================================================
--- linux-2.6.23-rc8-mm2.orig/mm/slub.c	2007-10-04 22:42:08.0 -0700
+++ linux-2.6.23-rc8-mm2/mm/slub.c	2007-10-04 22:56:49.0 -0700
@@ -187,6 +187,12 @@ static inline void ClearSlabDebug(struct
  */
 #define MAX_PARTIAL 10
 
+/*
+ * How many times the number of objects per slab can accumulate on the
+ * per cpu objects list before we drain it.
+ */
+#define DRAIN_FACTOR 4
+
 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
 				SLAB_POISON | SLAB_STORE_USER)

@@ -1375,6 +1381,54 @@ static void unfreeze_slab(struct kmem_ca
 	}
 }
 
+static void __slab_free(struct kmem_cache *s, struct page *page,
+			void *x, void *addr, unsigned int offset);
+
+/*
+ * Drain freelist of objects foreign to the slab. Interrupts must be off.
+ *
+ * This is called
+ *
+ * 1. Before taking the slub lock when a cpu slab is to be deactivated.
+ *    Deactivation can only deal with native objects on the freelist.
+ *
+ * 2. If the number of objects in the per cpu structures grows beyond
+ *    3 times the objects that fit in a slab. In that case we need to throw
+ *    some objects away. Stripping the foreign objects does the job and
+ *    localizes any new allocations.
+ */
+static void drain_foreign(struct kmem_cache *s, struct kmem_cache_cpu *c, void *addr)
+{
+	void **freelist = c->freelist;
+
+	if (unlikely(c->node < 0)) {
+		/* Slow path user */
+		__slab_free(s, virt_to_head_page(freelist), freelist, addr, c->offset);
+		freelist = NULL;
+		c->remaining--;
+	}
+
+	if (!freelist)
+		return;
+
+	c->freelist = NULL;
+	c->remaining = 0;
+
+	while (freelist) {
+		void **object = freelist;
+		struct page *page = virt_to_head_page(freelist);
+
+		freelist = freelist[c->offset];
+		if (page == c->page) {
+			/* Local object. Keep for future allocations */
+			object[c->offset] = c->freelist;
+			c->freelist = object;
+			c->remaining++;
+		} else
+			__slab_free(s, page, object, NULL, c->offset);
+	}
+}
+
 /*
  * Remove the cpu slab
  */
@@ -1405,6 +1459,7 @@ static void deactivate_slab(struct kmem_
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
+	drain_foreign(s, c, NULL);
 	slab_lock(c->page);
 	deactivate_slab(s, c);
 }
@@ -1480,6 +1535,7 @@ static void *__slab_alloc(struct kmem_ca
 	if (!c->page)
 		goto new_slab;
 
+	drain_foreign(s, c, NULL);
 	slab_lock(c->page);
 	if (unlikely(!node_match(c, node)))
 		goto another_slab;
@@ -1553,6 +1609,7 @@ debug:
 	c->page->inuse++;
 	c->page->freelist = object[c->offset];
 	c->node = -1;
+	c->remaining = s-
Re: SLUB performance regression vs SLAB
On Fri, 5 Oct 2007, Jens Axboe wrote:

> It might not, it might. The point is trying to isolate the problem and
> making a simple test case that could be used to reproduce it, so that
> Christoph (or someone else) can easily fix it.

In case there is someone who wants to hack on it: Here is what I got so far for batching the frees. I will try to come up with a test next week if nothing else happens before:

Patch 1/2 on top of mm:

SLUB: Keep counter of remaining objects on the per cpu list

Add a counter to keep track of how many objects are on the per cpu list.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/slub_def.h |    1 +
 mm/slub.c                |    8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc8-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc8-mm2.orig/include/linux/slub_def.h	2007-10-04 22:41:58.0 -0700
+++ linux-2.6.23-rc8-mm2/include/linux/slub_def.h	2007-10-04 22:42:08.0 -0700
@@ -15,6 +15,7 @@ struct kmem_cache_cpu {
 	void **freelist;
 	struct page *page;
 	int node;
+	int remaining;
 	unsigned int offset;
 	unsigned int objsize;
 };

Index: linux-2.6.23-rc8-mm2/mm/slub.c
===================================================================
--- linux-2.6.23-rc8-mm2.orig/mm/slub.c	2007-10-04 22:41:58.0 -0700
+++ linux-2.6.23-rc8-mm2/mm/slub.c	2007-10-04 22:42:08.0 -0700
@@ -1386,12 +1386,13 @@ static void deactivate_slab(struct kmem_
 	 * because both freelists are empty. So this is unlikely
 	 * to occur.
 	 */
-	while (unlikely(c->freelist)) {
+	while (unlikely(c->remaining)) {
 		void **object;
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
 		c->freelist = c->freelist[c->offset];
+		c->remaining--;
 
 		/* And put onto the regular freelist */
 		object[c->offset] = page->freelist;
@@ -1491,6 +1492,7 @@ load_freelist:
 	object = c->page->freelist;
 	c->freelist = object[c->offset];
+	c->remaining = s->objects - c->page->inuse - 1;
 	c->page->inuse = s->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1574,13 +1576,14 @@ static void __always_inline *slab_alloc(
 
 	local_irq_save(flags);
 	c = get_cpu_slab(s, smp_processor_id());
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	if (unlikely(!c->remaining || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
 		object = c->freelist;
 		c->freelist = object[c->offset];
+		c->remaining--;
 	}
 	local_irq_restore(flags);
@@ -1686,6 +1689,7 @@ static void __always_inline slab_free(st
 
 	if (likely(page == c->page && c->node >= 0)) {
 		object[c->offset] = c->freelist;
 		c->freelist = object;
+		c->remaining++;
 	} else
 		__slab_free(s, page, x, addr, c->offset);
Re: SLUB performance regression vs SLAB
On Fri, 5 Oct 2007, Matthew Wilcox wrote:

> I vaguely remembered something called orasim, so I went looking for it.
> I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
> 2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
> seems to be a stillborn project. Is there anything else I should know
> about orasim? ;-)

Too bad. If that worked then I would have a load to work against.

I have a patch here that may address the issue for SMP (no NUMA for now) by batching all frees on the per cpu freelist and then dumping them in groups. But it is probably not wise to have you run your week-long tests on this one. It needs some more care first.
Re: SLUB performance regression vs SLAB
On Fri, Oct 05 2007, Andi Kleen wrote:
> Jens Axboe <[EMAIL PROTECTED]> writes:
> >
> > Writing a small test module to exercise slub/slab in various ways
> > (allocating from all cpus freeing from one, as described) should not be
> > too hard. Perhaps that would be enough to find this performance
> > discrepancy between slab and slub?
>
> You could simulate that by just sending packets using unix sockets
> between threads bound to different CPUs. Sending a packet allocates;
> receiving deallocates.

Sure, there are a host of ways to accomplish the same thing.

> But it's not clear that will really simulate the cache bounce
> environment of the database test. I don't think all passing of data
> between CPUs using slub objects is slow.

It might not, it might. The point is trying to isolate the problem and making a simple test case that could be used to reproduce it, so that Christoph (or someone else) can easily fix it.

--
Jens Axboe
Re: SLUB performance regression vs SLAB
On Fri, Oct 05 2007, Matthew Wilcox wrote:
> On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote:
> > I'd like to second David's emails here, this is a serious problem. Having
> > a reproducible test case lowers the barrier for getting the problem
> > fixed by orders of magnitude. It's the difference between the problem
> > getting fixed in a day or two and it potentially lingering for months,
> > because email ping-pong takes forever and "the test team has moved on to
> > other tests, we'll let you know the results of test foo in 3 weeks time
> > when we have a new slot on the box" just removing any developer
> > motivation to work on the issue.
>
> I vaguely remembered something called orasim, so I went looking for it.
> I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
> 2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
> seems to be a stillborn project. Is there anything else I should know
> about orasim? ;-)

I don't know much about orasim, except that internally we're trying to use fio for that instead. As far as I know, it was a project that was never feature complete (or completed altogether, for that matter).

--
Jens Axboe
Re: SLUB performance regression vs SLAB
On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote:
> I'd like to second David's emails here, this is a serious problem. Having
> a reproducible test case lowers the barrier for getting the problem
> fixed by orders of magnitude. It's the difference between the problem
> getting fixed in a day or two and it potentially lingering for months,
> because email ping-pong takes forever and "the test team has moved on to
> other tests, we'll let you know the results of test foo in 3 weeks time
> when we have a new slot on the box" just removing any developer
> motivation to work on the issue.

I vaguely remembered something called orasim, so I went looking for it. I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from 2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which seems to be a stillborn project. Is there anything else I should know about orasim? ;-)

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Re: SLUB performance regression vs SLAB
Jens Axboe <[EMAIL PROTECTED]> writes:
>
> Writing a small test module to exercise slub/slab in various ways
> (allocating from all cpus freeing from one, as described) should not be
> too hard. Perhaps that would be enough to find this performance
> discrepancy between slab and slub?

You could simulate that by just sending packets using unix sockets between threads bound to different CPUs. Sending a packet allocates; receiving deallocates.

But it's not clear that will really simulate the cache bounce environment of the database test. I don't think all passing of data between CPUs using slub objects is slow.

-Andi
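A rough userspace version of that suggestion could look like the sketch below: one thread pinned to CPU 0 writes small packets into an AF_UNIX socketpair and a thread pinned to CPU 1 reads them, so the kernel allocates the packet buffers on one CPU and frees them on the other. The CPU numbers, packet size and SOCK_DGRAM socketpair are arbitrary assumptions, not taken from an existing test.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int sv[2];

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *sender(void *arg)
{
	char buf[256] = { 0 };

	pin_to_cpu(0);			/* allocation side */
	for (;;)
		if (write(sv[0], buf, sizeof(buf)) < 0)
			break;
	return NULL;
}

static void *receiver(void *arg)
{
	char buf[256];

	pin_to_cpu(1);			/* free side */
	for (;;)
		if (read(sv[1], buf, sizeof(buf)) < 0)
			break;
	return NULL;
}

int main(void)
{
	pthread_t s, r;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv)) {
		perror("socketpair");
		return 1;
	}
	pthread_create(&s, NULL, sender, NULL);
	pthread_create(&r, NULL, receiver, NULL);
	pthread_join(s, NULL);		/* runs until interrupted */
	pthread_join(r, NULL);
	return 0;
}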
Re: SLUB performance regression vs SLAB
On Fri, Oct 05 2007, Pekka Enberg wrote:
> Hi,
>
> On 10/5/07, Jens Axboe <[EMAIL PROTECTED]> wrote:
> > I'd like to second David's emails here, this is a serious problem. Having
> > a reproducible test case lowers the barrier for getting the problem
> > fixed by orders of magnitude. It's the difference between the problem
> > getting fixed in a day or two and it potentially lingering for months,
> > because email ping-pong takes forever and "the test team has moved on to
> > other tests, we'll let you know the results of test foo in 3 weeks time
> > when we have a new slot on the box" just removing any developer
> > motivation to work on the issue.
>
> What I don't understand is why the people who _have_ access to the test
> case don't fix the problem themselves? Unlike slab, slub is not a pile of
> crap that only Christoph can hack on...

Often the people testing are only doing just that, testing. So they kindly offer to test any patches and so on, which usually takes forever because of the above limitations in response time, machine availability, etc.

Writing a small test module to exercise slub/slab in various ways (allocating from all cpus, freeing from one, as described) should not be too hard. Perhaps that would be enough to find this performance discrepancy between slab and slub?

--
Jens Axboe
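For illustration only, a minimal module of the kind described above might look like the sketch below: every CPU but one allocates objects and pushes them onto a shared list, and a single CPU pops and frees them all, so frees always miss the allocating CPU's slab. The object size, queue cap and thread names are invented for the sketch; this is not code from any posted patch.

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/cpumask.h>
#include <linux/sched.h>

struct obj {
	struct list_head link;
	char payload[192];		/* arbitrary object size */
};

static LIST_HEAD(pending);
static DEFINE_SPINLOCK(pending_lock);
static int pending_count;
static struct task_struct *threads[NR_CPUS];

/* Every CPU except 0 allocates and queues objects... */
static int producer(void *unused)
{
	while (!kthread_should_stop()) {
		struct obj *o = kmalloc(sizeof(*o), GFP_KERNEL);

		if (o) {
			spin_lock(&pending_lock);
			if (pending_count < 10000) {	/* crude cap on queue growth */
				list_add_tail(&o->link, &pending);
				pending_count++;
				o = NULL;
			}
			spin_unlock(&pending_lock);
			kfree(o);	/* only freed here if the queue was full */
		}
		cond_resched();
	}
	return 0;
}

/* ...and CPU 0 frees everything, so the free path always hits remote slabs. */
static int consumer(void *unused)
{
	while (!kthread_should_stop()) {
		struct obj *o = NULL;

		spin_lock(&pending_lock);
		if (!list_empty(&pending)) {
			o = list_first_entry(&pending, struct obj, link);
			list_del(&o->link);
			pending_count--;
		}
		spin_unlock(&pending_lock);
		kfree(o);
		cond_resched();
	}
	return 0;
}

static int __init crossfree_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		threads[cpu] = kthread_create(cpu ? producer : consumer,
					      NULL, "crossfree/%d", cpu);
		if (IS_ERR(threads[cpu])) {
			threads[cpu] = NULL;
			continue;
		}
		kthread_bind(threads[cpu], cpu);
		wake_up_process(threads[cpu]);
	}
	return 0;
}

static void __exit crossfree_exit(void)
{
	struct obj *o, *tmp;
	int cpu;

	for_each_online_cpu(cpu)
		if (threads[cpu])
			kthread_stop(threads[cpu]);

	list_for_each_entry_safe(o, tmp, &pending, link)
		kfree(o);
}

module_init(crossfree_init);
module_exit(crossfree_exit);
MODULE_LICENSE("GPL");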
Re: SLUB performance regression vs SLAB
Hi,

On 10/5/07, Jens Axboe <[EMAIL PROTECTED]> wrote:
> I'd like to second David's emails here, this is a serious problem. Having
> a reproducible test case lowers the barrier for getting the problem
> fixed by orders of magnitude. It's the difference between the problem
> getting fixed in a day or two and it potentially lingering for months,
> because email ping-pong takes forever and "the test team has moved on to
> other tests, we'll let you know the results of test foo in 3 weeks time
> when we have a new slot on the box" just removing any developer
> motivation to work on the issue.

What I don't understand is why the people who _have_ access to the test case don't fix the problem themselves? Unlike slab, slub is not a pile of crap that only Christoph can hack on...

Pekka
Re: SLUB performance regression vs SLAB
On Fri, Oct 05 2007, David Chinner wrote:
> On Thu, Oct 04, 2007 at 03:07:18PM -0700, David Miller wrote:
> > From: Chuck Ebbert <[EMAIL PROTECTED]>
> > Date: Thu, 04 Oct 2007 17:47:48 -0400
> >
> > > On 10/04/2007 05:11 PM, David Miller wrote:
> > > > From: Chuck Ebbert <[EMAIL PROTECTED]>
> > > > Date: Thu, 04 Oct 2007 17:02:17 -0400
> > > >
> > > >> How do you simulate reading 100TB of data spread across 3000 disks,
> > > >> selecting 10% of it using some criterion, then sorting and summarizing
> > > >> the result?
> > > >
> > > > You repeatedly read zeros from a smaller disk into the same amount of
> > > > memory, and sort that as if it were real data instead.
> > >
> > > You've just replaced 3000 concurrent streams of data with a single stream.
> > > That won't test the memory allocator's ability to allocate memory to many
> > > concurrent users very well.
> >
> > You've kindly removed my "thinking outside of the box" comment.
> >
> > The point was not that my specific suggestion would be perfect, but that
> > if you used your creativity and thought in similar directions you might find
> > a way to do it.
> >
> > People are too narrow minded when it comes to these things, and that's the
> > problem I want to address.
>
> And it's a good point, too, because often problems to one person are a
> no-brainer to someone else.
>
> Creating lots of "fake" disks is trivial to do, IMO. Use loopback on
> sparse files containing sparse filesystems, use ramdisks containing sparse
> files, or write a sparse dm target for sparse block device mapping,
> etc. I'm sure there's more than the few I just threw out...

Or use scsi_debug to fake drives/controllers; that works wonderfully as well for some things and involves the full IO stack.

I'd like to second David's emails here, this is a serious problem. Having a reproducible test case lowers the barrier for getting the problem fixed by orders of magnitude. It's the difference between the problem getting fixed in a day or two and it potentially lingering for months, because email ping-pong takes forever and "the test team has moved on to other tests, we'll let you know the results of test foo in 3 weeks time when we have a new slot on the box" just removes any developer motivation to work on the issue.

--
Jens Axboe
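As a small illustration of the loopback-on-a-sparse-file idea, the sketch below creates a 1TB sparse backing file and attaches it to a loop device with the LOOP_SET_FD ioctl. The file name, size and use of /dev/loop0 are placeholders; a real setup would repeat this for as many fake disks as needed.

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int backing, loop;

	/* sparse backing file: consumes almost no real disk space */
	backing = open("/tmp/fakedisk0.img", O_RDWR | O_CREAT, 0600);
	if (backing < 0 || ftruncate(backing, 1ULL << 40) != 0) {
		perror("backing file");
		return 1;
	}

	/* bind it to an unused loop device */
	loop = open("/dev/loop0", O_RDWR);
	if (loop < 0 || ioctl(loop, LOOP_SET_FD, backing) != 0) {
		perror("/dev/loop0");
		return 1;
	}

	printf("attached /tmp/fakedisk0.img to /dev/loop0\n");
	return 0;
}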
RE: SLUB performance regression vs SLAB
> On 10/04/2007 07:39 PM, David Schwartz wrote:
>
> > But this is just a preposterous position to put him in. If there's no
> > reproducible test case, then why should he care that one program he can't
> > even see works badly? If you care, you fix it.
>
> People have been trying for years to make reproducible test cases
> for huge and complex workloads. It doesn't work. The tests that do
> work take weeks to run and need to be carefully validated before
> they can be officially released. The open source community can and
> should be working on similar tests, but they will never be simple.

That's true, but irrelevant. Either the test can identify a problem that applies generally, or it's doing nothing but measuring how good the system is at doing the test. If the former, it should be possible to create a simple test case once you know from the complex test where the problem is. If the latter, who cares about a supposed regression?

It should be possible to identify exactly what portion of the test shows the regression the most and exactly what the system is doing during that moment. The test may be great at finding regressions, but once it finds them, they should be forever *found*.

Did you follow the recent incident when iperf found what seemed to be a significant CFS networking regression? The only way to identify that it was a quirk in what iperf was doing was by looking at exactly what iperf was doing. The only efficient way was to look at iperf's source and see that iperf's weird yielding meant it didn't replicate typical use cases like it was supposed to.

DS
Re: SLUB performance regression vs SLAB
On Thu, 4 Oct 2007 19:43:58 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> So there could still be page struct contention left if multiple
> processors frequently and simultaneously free to the same slab and
> that slab is not the per cpu slab of a cpu. That could be addressed
> by optimizing the object free handling further to not touch the page
> struct even if we miss the per cpu slab.
>
> That get_partial* is far up indicates contention on the list lock
> that should be addressable by either increasing the slab size or by
> changing the object free handling to batch in some form.
>
> This is an SMP system right? 2 cores with 4 cpus each? The main loop
> is always hitting on the same slabs? Which slabs would this be? Am I
> right in thinking that one process allocates objects and then lets
> multiple other processors do work and then the allocated object is
> freed from a cpu that did not allocate the object? If neighboring
> objects in one slab are allocated on one cpu and then are almost
> simultaneously freed from a set of different cpus then this may
> explain the situation.

- One of the characteristics of the application in use is the following: all cores submit IO (which means they allocate various scsi and block structures on all cpus), but only one will free it (the one the IRQ is bound to). So it's allocate-on-one, free-on-another at a high rate.

That is assuming this is the IO slab; that's a bit of an assumption obviously (it's one of the slab things that are hot, but it's a complex workload, there could be others).
Re: SLUB performance regression vs SLAB
I just spent some time looking at the functions that you see high in the list. The trouble is that I have to speculate and that I have nothing to verify my thoughts. If you could give me the hitlist for each of the 3 runs then this would help to check my thinking. I could be totally off here.

It seems that we miss the per cpu slab frequently on slab_free(), which leads to the calling of __slab_free(), which in turn needs to take a lock on the page (in the page struct). Typically the page lock is uncontended, which seems not to be the case here, otherwise it would not be that high up.

The per cpu patch in mm should reduce the contention on the page struct by not touching the page struct on alloc and on free. It does not seem to work all the way, though. slab_free() still has to touch the page struct if the free is not to the currently active cpu slab.

So there could still be page struct contention left if multiple processors frequently and simultaneously free to the same slab and that slab is not the per cpu slab of a cpu. That could be addressed by optimizing the object free handling further to not touch the page struct even if we miss the per cpu slab.

That get_partial* is far up indicates contention on the list lock that should be addressable by either increasing the slab size or by changing the object free handling to batch in some form.

This is an SMP system right? 2 cores with 4 cpus each? The main loop is always hitting on the same slabs? Which slabs would this be? Am I right in thinking that one process allocates objects and then lets multiple other processors do work and then the allocated object is freed from a cpu that did not allocate the object? If neighboring objects in one slab are allocated on one cpu and then are almost simultaneously freed from a set of different cpus then this may explain the situation.
Re: SLUB performance regression vs SLAB
On 10/04/2007 07:39 PM, David Schwartz wrote:

> But this is just a preposterous position to put him in. If there's no
> reproducible test case, then why should he care that one program he can't
> even see works badly? If you care, you fix it.

People have been trying for years to make reproducible test cases for huge and complex workloads. It doesn't work. The tests that do work take weeks to run and need to be carefully validated before they can be officially released. The open source community can and should be working on similar tests, but they will never be simple.
RE: SLUB performance regression vs SLAB
David Miller wrote:

> Using an unpublishable benchmark, whose results even cannot be
> published, really stretches the limits of "reasonable" don't you
> think?
>
> This "SLUB isn't ready yet" bullshit is just a shaman's dance which
> distracts attention away from the real problem, which is that a
> reproducible, publishable test case is not being provided to the
> developer so he can work on fixing the problem.
>
> I can tell you this thing would be fixed overnight if a proper test
> case had been provided by now.

I would just like to echo what you said, just a bit angrier. This is the same as someone asking him to fix a bug that they can only see with a binary-only kernel module. I think he's perfectly justified in simply responding "the bug is as likely to be in your code as mine".

Now, just because he's justified in doing that doesn't mean he should. I presume he has an honest desire to improve his own code, and if they've found a real problem, I'm sure he'd love to fix it. But this is just a preposterous position to put him in. If there's no reproducible test case, then why should he care that one program he can't even see works badly? If you care, you fix it.

Matthew Wilcox wrote:

> Yet here we stand. Christoph is aggressively trying to get slab removed
> from the tree. There is a testcase which shows slub performing worse
> than slab. It's not my fault I can't publish it. And just because I
> can't publish it doesn't mean it doesn't exist.

It means it may or may not exist. All we have is your word that slub is the problem. If I said I found a bug in the Linux kernel that caused it to panic but I could only reproduce it with the nVidia driver, I'd be laughed at.

It may even be that slub is better and your benchmark simply interprets this as worse. Without the details of your benchmark, we can't know. For example, I've seen benchmarks that (usually unintentionally) actually do a *variable* amount of work, and details of the implementation may result in the benchmark actually doing *more* work, so it taking longer does not mean it ran slower.

DS
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 03:07:18PM -0700, David Miller wrote:
> From: Chuck Ebbert <[EMAIL PROTECTED]>
> Date: Thu, 04 Oct 2007 17:47:48 -0400
>
> > On 10/04/2007 05:11 PM, David Miller wrote:
> > > From: Chuck Ebbert <[EMAIL PROTECTED]>
> > > Date: Thu, 04 Oct 2007 17:02:17 -0400
> > >
> > >> How do you simulate reading 100TB of data spread across 3000 disks,
> > >> selecting 10% of it using some criterion, then sorting and summarizing
> > >> the result?
> > >
> > > You repeatedly read zeros from a smaller disk into the same amount of
> > > memory, and sort that as if it were real data instead.
> >
> > You've just replaced 3000 concurrent streams of data with a single stream.
> > That won't test the memory allocator's ability to allocate memory to many
> > concurrent users very well.
>
> You've kindly removed my "thinking outside of the box" comment.
>
> The point was not that my specific suggestion would be perfect, but that
> if you used your creativity and thought in similar directions you might find
> a way to do it.
>
> People are too narrow minded when it comes to these things, and that's the
> problem I want to address.

And it's a good point, too, because often problems to one person are a no-brainer to someone else.

Creating lots of "fake" disks is trivial to do, IMO. Use loopback on sparse files containing sparse filesystems, use ramdisks containing sparse files, or write a sparse dm target for sparse block device mapping, etc. I'm sure there's more than the few I just threw out...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: SLUB performance regression vs SLAB
From: Chuck Ebbert <[EMAIL PROTECTED]>
Date: Thu, 04 Oct 2007 17:47:48 -0400

> On 10/04/2007 05:11 PM, David Miller wrote:
> > From: Chuck Ebbert <[EMAIL PROTECTED]>
> > Date: Thu, 04 Oct 2007 17:02:17 -0400
> >
> >> How do you simulate reading 100TB of data spread across 3000 disks,
> >> selecting 10% of it using some criterion, then sorting and
> >> summarizing the result?
> >
> > You repeatedly read zeros from a smaller disk into the same amount of
> > memory, and sort that as if it were real data instead.
>
> You've just replaced 3000 concurrent streams of data with a single
> stream. That won't test the memory allocator's ability to allocate
> memory to many concurrent users very well.

You've kindly removed my "thinking outside of the box" comment.

The point was not that my specific suggestion would be perfect, but that if you used your creativity and thought in similar directions you might find a way to do it.

People are too narrow minded when it comes to these things, and that's the problem I want to address.
Re: SLUB performance regression vs SLAB
On 10/04/2007 05:11 PM, David Miller wrote:
> From: Chuck Ebbert <[EMAIL PROTECTED]>
> Date: Thu, 04 Oct 2007 17:02:17 -0400
>
>> How do you simulate reading 100TB of data spread across 3000 disks,
>> selecting 10% of it using some criterion, then sorting and
>> summarizing the result?
>
> You repeatedly read zeros from a smaller disk into the same amount of
> memory, and sort that as if it were real data instead.

You've just replaced 3000 concurrent streams of data with a single stream. That won't test the memory allocator's ability to allocate memory to many concurrent users very well.
Re: SLUB performance regression vs SLAB
On Thu, 4 Oct 2007, Matthew Wilcox wrote:

> Yet here we stand. Christoph is aggressively trying to get slab removed
> from the tree. There is a testcase which shows slub performing worse
> than slab. It's not my fault I can't publish it. And just because I
> can't publish it doesn't mean it doesn't exist.
>
> Slab needs to not get removed until slub is as good a performer on this
> benchmark.

I agree with this. SLAB will stay until we have worked through all the performance issues.
Re: SLUB performance regression vs SLAB
From: Chuck Ebbert <[EMAIL PROTECTED]>
Date: Thu, 04 Oct 2007 17:02:17 -0400

> How do you simulate reading 100TB of data spread across 3000 disks,
> selecting 10% of it using some criterion, then sorting and
> summarizing the result?

You repeatedly read zeros from a smaller disk into the same amount of memory, and sort that as if it were real data instead.

You're not thinking outside of the box, and you need to do that to write good test cases and fix kernel bugs effectively.
Re: SLUB performance regression vs SLAB
On 10/04/2007 04:55 PM, David Miller wrote:
>
> Anything, I do mean anything, can be simulated using small test
> programs.

How do you simulate reading 100TB of data spread across 3000 disks, selecting 10% of it using some criterion, then sorting and summarizing the result?
Re: SLUB performance regression vs SLAB
From: Matthew Wilcox <[EMAIL PROTECTED]>
Date: Thu, 4 Oct 2007 14:58:12 -0600

> On Thu, Oct 04, 2007 at 01:48:34PM -0700, David Miller wrote:
> > There comes a point where it is the reporter's responsibility to help
> > the developer come up with a publishable test case the developer can
> > use to work on fixing the problem and help ensure it stays fixed.
>
> That's a lot of effort. Is it more effort than doing some remote
> debugging with Christoph? I don't know.

That's a good question and an excellent point.

I'm sure that, either way, Christoph will be more than willing to engage and assist.
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 01:55:37PM -0700, David Miller wrote:
> Anything, I do mean anything, can be simulated using small test
> programs. Pointing at a big fancy machine with lots of storage
> and disk is a passive aggressive way to avoid the real issues,
> in that nobody is putting forth the effort to try and come up
> with an at least publishable test case that Christoph can use to
> help you guys.
>
> If coming up with a reproducible and publishable test case is
> the difference between this getting fixed and it not getting
> fixed, are you going to invest the time to do that?

If that's what it takes, then yes. But I'm far from convinced that it's as easy to come up with a TPC benchmark simulator as you think. There have been efforts in the past (orasim, for example), but presumably Christoph has already tried these benchmarks.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 01:48:34PM -0700, David Miller wrote:
> There comes a point where it is the reporter's responsibility to help
> the developer come up with a publishable test case the developer can
> use to work on fixing the problem and help ensure it stays fixed.

That's a lot of effort. Is it more effort than doing some remote debugging with Christoph? I don't know.

> Using an unpublishable benchmark, whose results even cannot be
> published, really stretches the limits of "reasonable" don't you
> think?

Yet here we stand. Christoph is aggressively trying to get slab removed from the tree. There is a testcase which shows slub performing worse than slab. It's not my fault I can't publish it. And just because I can't publish it doesn't mean it doesn't exist.

Slab needs to not get removed until slub is as good a performer on this benchmark.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Re: SLUB performance regression vs SLAB
From: [EMAIL PROTECTED] (Matthew Wilcox)
Date: Thu, 4 Oct 2007 12:28:25 -0700

> On Thu, Oct 04, 2007 at 10:49:52AM -0700, Christoph Lameter wrote:
> > Finally: Is there some way that I can reproduce the tests on my machines?
>
> As usual for these kinds of setups ... take a two-CPU machine, 64GB
> of memory, half a dozen fibre channel adapters, about 3000 discs,
> a commercial database, a team of experts for three months worth of
> tuning ...

Anything, I do mean anything, can be simulated using small test programs. Pointing at a big fancy machine with lots of storage and disk is a passive aggressive way to avoid the real issues, in that nobody is putting forth the effort to try and come up with an at least publishable test case that Christoph can use to help you guys.

If coming up with a reproducible and publishable test case is the difference between this getting fixed and it not getting fixed, are you going to invest the time to do that?
Re: SLUB performance regression vs SLAB
From: Arjan van de Ven <[EMAIL PROTECTED]>
Date: Thu, 4 Oct 2007 10:50:46 -0700

> Ok, every time someone says anything not 100% positive about SLUB you
> come back with "but it's fixed in the next patch set"... *every time*.

I think this is partly Christoph subconsciously venting his frustration that he's never given a reproducible test case he can use to fix the problem.

There comes a point where it is the reporter's responsibility to help the developer come up with a publishable test case the developer can use to work on fixing the problem and help ensure it stays fixed.

Using an unpublishable benchmark, whose results even cannot be published, really stretches the limits of "reasonable", don't you think?

This "SLUB isn't ready yet" bullshit is just a shaman's dance which distracts attention away from the real problem, which is that a reproducible, publishable test case is not being provided to the developer so he can work on fixing the problem.

I can tell you this thing would be fixed overnight if a proper test case had been provided by now.
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 12:05:35PM -0700, Christoph Lameter wrote:
> > > Was the page allocator pass through patchset
> > > separately applied as I requested?
> >
> > I don't believe so. Suresh?
>
> If it was a git pull then the pass through was included and never taken
> out.

It was a git pull from the performance branch that you pointed out earlier:
http://git.kernel.org/?p=linux/kernel/git/christoph/slab.git;a=log;h=performance

and the config is based on the EL5 config with just SLUB turned on.
Re: SLUB performance regression vs SLAB
On Thu, 4 Oct 2007, Matthew Wilcox wrote:

> We have three runs, all with 2.6.23-rc3 plus the patches that Suresh
> applied from 20070922. The first run is with slab. The second run is
> with SLUB and the third run is SLUB plus the tuning parameters you
> recommended.

There was quite a bit of communication on tuning parameters. I guess we got some more confusion there, and multiple configuration settings that I wanted to be tested separately were merged. Setting slub_min_order to more than zero can certainly be detrimental to performance, since higher order page allocations can cause cacheline bouncing on zone locks.

Which patches? Does 20070922 refer to a pull on the slab git tree on the performance branch?

> I have a spreadsheet with Vtune data in it that was collected during
> each of these test runs, so we can see which functions are the hottest.
> I can grab that data and send it to you, if that's interesting.

Please do. Add the kernel .configs please. Is there any slab queue tuning going on at boot with the SLAB configuration? Include any tuning that was done to the kernel please.

> > Was the page allocator pass through patchset
> > separately applied as I requested?
>
> I don't believe so. Suresh?

If it was a git pull then the pass through was included and never taken out.

> I think for future tests, it would be easiest if you send me a git
> reference. That way we will all know precisely what is being tested.

Sure, we can do that.

> > Finally: Is there some way that I can reproduce the tests on my machines?
>
> As usual for these kinds of setups ... take a two-CPU machine, 64GB
> of memory, half a dozen fibre channel adapters, about 3000 discs,
> a commercial database, a team of experts for three months worth of
> tuning ...
>
> I don't know if anyone's tried to replicate a benchmark like this using
> Postgres. Would be nice if they have ...

Well, we have our own performance test department here at SGI. If we get them involved then we can add another 3 months until we get the test results confirmed ;-).

Seems that this is a small configuration. Why does it take that long? And the experts knew SLAB and not SLUB, right?

Let's look at all the data that you have and then see if this is enough to figure out what is wrong.
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 10:49:52AM -0700, Christoph Lameter wrote:
> I was not aware of that. Would it be possible for you to summarize all the
> test data that you have right now about SLUB vs. SLAB with the patches
> listed? Exactly what kernel version and what version of the per cpu
> patches were tested?

We have three runs, all with 2.6.23-rc3 plus the patches that Suresh applied from 20070922. The first run is with slab. The second run is with SLUB and the third run is SLUB plus the tuning parameters you recommended.

I have a spreadsheet with Vtune data in it that was collected during each of these test runs, so we can see which functions are the hottest. I can grab that data and send it to you, if that's interesting.

> Was the page allocator pass through patchset
> separately applied as I requested?

I don't believe so. Suresh?

I think for future tests, it would be easiest if you send me a git reference. That way we will all know precisely what is being tested.

> Finally: Is there some way that I can reproduce the tests on my machines?

As usual for these kinds of setups ... take a two-CPU machine, 64GB of memory, half a dozen fibre channel adapters, about 3000 discs, a commercial database, a team of experts for three months worth of tuning ...

I don't know if anyone's tried to replicate a benchmark like this using Postgres. Would be nice if they have ...
Re: SLUB performance regression vs SLAB
On Thu, 2007-10-04 at 10:50 -0700, Arjan van de Ven wrote:
> On Thu, 4 Oct 2007 10:38:15 -0700 (PDT)
> Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> > Yeah the fastpath vs. slow path is not the issue as Siddha and I
> > concluded earlier. Seems that we are mainly seeing cacheline bouncing
> > due to two cpus accessing meta data in the same page struct. The
> > patches in MM that are scheduled to be merged for .24 address
>
> Ok, every time someone says anything not 100% positive about SLUB you
> come back with "but it's fixed in the next patch set"... *every time*.
>
> To be honest, to me that sounds like SLUB isn't ready for prime time
> yet, or at least not ready to be the only one in town...
>
> The day that the answer is "the kernel.org slub is fixing all the
> issues" is when it's ready..

Arjan, to be honest, there has been some confusion over _what_ code has been tested, and with what results. And with Christoph not able to reproduce these results locally, it is very hard for him to fix it properly.
Re: SLUB performance regression vs SLAB
On Thu, 4 Oct 2007, Arjan van de Ven wrote:

> Ok, every time someone says anything not 100% positive about SLUB you
> come back with "but it's fixed in the next patch set"... *every time*.

All I ask is that people test the fixes that have been out there for the known issues. If there are remaining performance issues then let's figure them out and address them.
Re: SLUB performance regression vs SLAB
On Thu, 4 Oct 2007 10:38:15 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> Yeah the fastpath vs. slow path is not the issue as Siddha and I
> concluded earlier. Seems that we are mainly seeing cacheline bouncing
> due to two cpus accessing meta data in the same page struct. The
> patches in MM that are scheduled to be merged for .24 address

Ok, every time someone says anything not 100% positive about SLUB you come back with "but it's fixed in the next patch set"... *every time*.

To be honest, to me that sounds like SLUB isn't ready for prime time yet, or at least not ready to be the only one in town...

The day that the answer is "the kernel.org slub is fixing all the issues" is when it's ready..
Re: SLUB performance regression vs SLAB
On Thu, 4 Oct 2007, Matthew Wilcox wrote:

> > Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded
> > earlier. Seems that we are mainly seeing cacheline bouncing due to two
> > cpus accessing meta data in the same page struct. The patches in
> > MM that are scheduled to be merged for .24 address that issue. I
> > have repeatedly asked that these patches be tested. The patches were
> > posted months ago.
>
> I just checked with the guys who did the test. When I said -rc3, I
> mis-spoke; this is 2.6.23-rc3 *plus* the patches which Suresh agreed to
> test for you.

I was not aware of that. Would it be possible for you to summarize all the test data that you have right now about SLUB vs. SLAB with the patches listed? Exactly what kernel version and what version of the per cpu patches were tested? Was the page allocator pass through patchset separately applied as I requested?

Finally: Is there some way that I can reproduce the tests on my machines?
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 10:38:15AM -0700, Christoph Lameter wrote:
> On Thu, 4 Oct 2007, Matthew Wilcox wrote:
>
> > So, on "a well-known OLTP benchmark which prohibits publishing absolute
> > numbers" and on an x86-64 system (I don't think exactly which model
> > is important), we're seeing *6.51%* performance loss on slub vs slab.
> > This is with a 2.6.23-rc3 kernel. Tuning the boot parameters, as you've
> > asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
> > gets back 0.38% of that. It's still down 6.13% over slab.
>
> Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded
> earlier. Seems that we are mainly seeing cacheline bouncing due to two
> cpus accessing meta data in the same page struct. The patches in
> MM that are scheduled to be merged for .24 address that issue. I
> have repeatedly asked that these patches be tested. The patches were
> posted months ago.

I just checked with the guys who did the test. When I said -rc3, I mis-spoke; this is 2.6.23-rc3 *plus* the patches which Suresh agreed to test for you.
Re: SLUB performance regression vs SLAB
On Thu, 4 Oct 2007, Matthew Wilcox wrote:

> So, on "a well-known OLTP benchmark which prohibits publishing absolute
> numbers" and on an x86-64 system (I don't think exactly which model
> is important), we're seeing *6.51%* performance loss on slub vs slab.
> This is with a 2.6.23-rc3 kernel. Tuning the boot parameters, as you've
> asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
> gets back 0.38% of that. It's still down 6.13% over slab.

Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded earlier. Seems that we are mainly seeing cacheline bouncing due to two cpus accessing meta data in the same page struct. The patches in MM that are scheduled to be merged for .24 address that issue. I have repeatedly asked that these patches be tested. The patches were posted months ago.

> Now, where do we go next? I suspect that 2.6.23-rc9 has significant
> changes since -rc3, but I'd like to confirm that before kicking off
> another (expensive) run. Please tell me which kernels would be useful
> to test.

I thought Siddha has a test in the works with the per cpu structure patchset from MM? Could you sync up with Siddha?