Re: 2.6.22 -mm merge plans
On Thu, May 10, 2007 at 03:39:36PM -0400, Mathieu Desnoyers wrote:
> * Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> > On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > > kprobes does this kind of synchronization internally, so the marker
> > > > wrapper should probably as well.
> > >
> > > The problem appears on heavily loaded systems. Doing 50
> > > synchronize_sched() calls in a row can take up to a few seconds on a
> > > 4-way machine. This is why I prefer to do it in the module to which
> > > the callbacks belong.
> >
> > We recently had a discussion on a batch unregistration interface for
> > kprobes. I'm not very happy with having such different interfaces for
> > different kinds of probe registration.
>
> Ok, I've had a look at the kprobes batch registration mechanisms and...
> well, they do not look well suited for the markers. Adding supplementary
> data structures such as linked lists of probes does not look like a good
> match.
>
> However, I agree with you that providing a similar API is good.
>
> Therefore, here is my proposal:
>
> The goal is to do the synchronize just after we unregister the last
> probe handler provided by a given module. Since the unregistration
> functions iterate over every marker present in the kernel, we can keep a
> count of how many probes provided by the same module are still present.
> If we see that we unregistered the last probe pointing to this module,
> we issue a synchronize_sched().
>
> It adds no data structure and keeps the same order of complexity as what
> is already there; we only have to do 2 passes over the marker structures:
> the first one finds the module associated with the callback and the
> second disables the callbacks and keeps a count of the number of
> callbacks associated with the module.
>
> Mathieu
>
> P.S.: here is the code.
>
> Linux Kernel Markers - Architecture Independent code. Provide internal
> synchronize_sched() in batch.
> The goal is to do the synchronize just after we unregister the last
> probe handler provided by a given module. Since the unregistration
> functions iterate over every marker present in the kernel, we can keep a
> count of how many probes provided by the same module are still present.
> If we see that we unregistered the last probe pointing to this module,
> we issue a synchronize_sched().
>
> It adds no data structure and keeps the same order of complexity as what
> is already there; we only have to do 2 passes over the marker structures:
> the first one finds the module associated with the callback and the
> second disables the callbacks and keeps a count of the number of
> callbacks associated with the module.

Looks good to me; please incorporate this in the next round of the markers
patch series.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
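As an aside, Mathieu's two-pass counting scheme can be sketched in ordinary C. Everything below is illustrative only: a flat marker array, a trivial `struct module`, and a counter standing in for synchronize_sched() replace the real kernel objects, so this demonstrates the logic of the proposal, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct module { int id; };             /* stand-in for the kernel's module */

struct marker {
    const char *name;
    void (*call)(void);                /* current probe callback */
    struct module *owner;              /* module providing ->call */
    int enabled;
};

static int sync_count;                 /* counts synchronize_sched() calls */
static void synchronize_sched_stub(void) { sync_count++; }

static void empty_probe(void) { }

/* Disable every marker named @name.  Afterwards, if no other marker still
 * calls into the same module, issue one (stubbed) synchronize_sched(). */
static void marker_remove_probe(struct marker *m, size_t n, const char *name)
{
    struct module *probe_module = NULL;
    size_t i;
    int ref_count = 0;

    /* Pass 1: find the module that owns the callback being removed. */
    for (i = 0; i < n; i++)
        if (strcmp(m[i].name, name) == 0)
            probe_module = m[i].owner;

    /* Pass 2: disable matching markers; count callbacks of *other*
     * markers that still point into the same module. */
    for (i = 0; i < n; i++) {
        if (strcmp(m[i].name, name) == 0) {
            m[i].enabled = 0;
            m[i].call = empty_probe;
            m[i].owner = NULL;
            continue;
        }
        if (probe_module && m[i].owner == probe_module)
            ref_count++;
    }

    if (probe_module && ref_count == 0)
        synchronize_sched_stub();      /* last probe of that module gone */
}
```

Removing a probe whose module is still referenced by another marker costs nothing extra; only the removal of a module's last probe triggers the single synchronize.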
Re: 2.6.22 -mm merge plans
* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > kprobes does this kind of synchronization internally, so the marker
> > > wrapper should probably as well.
> >
> > The problem appears on heavily loaded systems. Doing 50
> > synchronize_sched() calls in a row can take up to a few seconds on a
> > 4-way machine. This is why I prefer to do it in the module to which
> > the callbacks belong.
>
> We recently had a discussion on a batch unregistration interface for
> kprobes. I'm not very happy with having such different interfaces for
> different kinds of probe registration.

Ok, I've had a look at the kprobes batch registration mechanisms and...
well, they do not look well suited for the markers. Adding supplementary
data structures such as linked lists of probes does not look like a good
match.

However, I agree with you that providing a similar API is good.

Therefore, here is my proposal:

The goal is to do the synchronize just after we unregister the last probe
handler provided by a given module. Since the unregistration functions
iterate over every marker present in the kernel, we can keep a count of how
many probes provided by the same module are still present. If we see that
we unregistered the last probe pointing to this module, we issue a
synchronize_sched().

It adds no data structure and keeps the same order of complexity as what is
already there; we only have to do 2 passes over the marker structures: the
first one finds the module associated with the callback and the second
disables the callbacks and keeps a count of the number of callbacks
associated with the module.

Mathieu

P.S.: here is the code.

Linux Kernel Markers - Architecture Independent code

Provide internal synchronize_sched() in batch. The goal is to do the
synchronize just after we unregister the last probe handler provided by a
given module.
Since the unregistration functions iterate over every marker present in the
kernel, we can keep a count of how many probes provided by the same module
are still present. If we see that we unregistered the last probe pointing
to this module, we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what is
already there; we only have to do 2 passes over the marker structures: the
first one finds the module associated with the callback and the second
disables the callbacks and keeps a count of the number of callbacks
associated with the module.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
---
 kernel/module.c | 62 ++--
 1 file changed, 52 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/kernel/module.c
===
--- linux-2.6-lttng.orig/kernel/module.c	2007-05-10 14:48:28.0 -0400
+++ linux-2.6-lttng/kernel/module.c	2007-05-10 15:38:27.0 -0400
@@ -404,8 +404,12 @@
 }

 /* Sets a range of markers to a disabled state : unset the enable bit and
- * provide the empty callback. */
+ * provide the empty callback.
+ * Keep a count of other markers connected to the same module as the one
+ * provided as parameter. */
 static int marker_remove_probe_range(const char *name,
+		struct module *probe_module,
+		int *ref_count,
 		const struct __mark_marker *begin,
 		const struct __mark_marker *end)
 {
@@ -413,12 +417,17 @@
 	int found = 0;

 	for (iter = begin; iter < end; iter++) {
-		if (strcmp(name, iter->mdata->name) == 0) {
-			marker_set_enable(iter->enable, 0,
-				iter->mdata->flags);
-			iter->mdata->call = __mark_empty_function;
-			found++;
+		if (strcmp(name, iter->mdata->name) != 0) {
+			if (probe_module)
+				if (__module_text_address(
+					(unsigned long)iter->mdata->call)
+						== probe_module)
+					(*ref_count)++;
+			continue;
 		}
+		marker_set_enable(iter->enable, 0, iter->mdata->flags);
+		iter->mdata->call = __mark_empty_function;
+		found++;
 	}
 	return found;
 }
@@ -450,6 +459,29 @@
 	return found;
 }

+/* Get the module to which the probe handler's text belongs.
+ * Called with module_mutex taken.
+ * Returns NULL if the probe handler is not in a module. */
+static struct module *__marker_get_probe_module(const char *name)
+{
+	struct module *mod;
+	const struct __mark_marker *iter;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		for (iter = mod->markers;
+			iter < mod->markers+mod->num_markers; iter++) {
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:

On Thu, 10 May 2007, Nick Piggin wrote:

The filesystem (or page cache) allows pages beyond i_size to come in
there? That wasn't a problem before, was it? But now it is?

The filesystem still doesn't, but if i_size is updated after the page is
returned, we can have a problem that was previously taken care of with
the truncate_count but now isn't.

But... I thought the page lock was now taking care of that in your
scheme? truncate_inode_pages has to wait for the page lock, then it
finds the page is mapped and... ahh, it finds the copiee page is not
mapped, so doesn't do its own little unmap_mapping_range, and the copied
page squeaks through. Drat.

I really think the truncate_count solution worked better, for truncation
anyway. There may be persuasive reasons you need the page lock for
invalidation: I gave up on trying to understand the required
behaviour(s) for invalidation. So, bring back (the original use of, not
my tree marker use of) truncate_count?

Hmm, you probably don't want to do that, because there was some pleasure
in removing the strange barriers associated with it. A second
unmap_mapping_range is just one line of code - but it sure feels like a
defeat to me, calling the whole exercise into question. (But then, you'd
be right to say my perfectionism made it impossible for me to come up
with any solution to the invalidation issues.)

Well we could bring back the truncate_count, but I think that sucks
because that's moving work into the page fault handler in order to avoid
a bit of work when truncating mapped files.

But that's a change we could have made at any time if we'd bothered,
it's not really the issue here.

I don't see how you could, because you need to increment truncate_count.

Though indeed we did so, I don't see that we needed to increment
truncate_count in that case (nobody could be coming through do_no_page
on that file, when there are no mappings of it).

Of course :P

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 10 May 2007, Nick Piggin wrote:
>
> > > The filesystem (or page cache) allows pages beyond i_size to come
> > > in there? That wasn't a problem before, was it? But now it is?
>
> The filesystem still doesn't, but if i_size is updated after the page
> is returned, we can have a problem that was previously taken care of
> with the truncate_count but now isn't.

But... I thought the page lock was now taking care of that in your
scheme? truncate_inode_pages has to wait for the page lock, then it
finds the page is mapped and... ahh, it finds the copiee page is not
mapped, so doesn't do its own little unmap_mapping_range, and the copied
page squeaks through. Drat.

I really think the truncate_count solution worked better, for truncation
anyway. There may be persuasive reasons you need the page lock for
invalidation: I gave up on trying to understand the required
behaviour(s) for invalidation. So, bring back (the original use of, not
my tree marker use of) truncate_count?

Hmm, you probably don't want to do that, because there was some pleasure
in removing the strange barriers associated with it. A second
unmap_mapping_range is just one line of code - but it sure feels like a
defeat to me, calling the whole exercise into question. (But then, you'd
be right to say my perfectionism made it impossible for me to come up
with any solution to the invalidation issues.)

> > Suspect you'd need a barrier of some kind between the i_size_write
> > and the mapping_mapped test?
>
> The unmap_mapping_range that runs after the truncate_inode_pages
> should run in the correct order, I believe.

Yes, if there's going to be that backup call, the first won't really
need a barrier.

> > But that's a change we could have made at any time if we'd bothered,
> > it's not really the issue here.
>
> I don't see how you could, because you need to increment
> truncate_count.

Though indeed we did so, I don't see that we needed to increment
truncate_count in that case (nobody could be coming through do_no_page
on that file, when there are no mappings of it).

> But I believe this is fixing the issue, even if it does so in a
> peripheral manner, because it avoids the added cost for unmapped
> files.

It's a small improvement to your common case, I agree.

Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:

On Wed, 9 May 2007, Nick Piggin wrote:

Hugh Dickins wrote:

On Wed, 2 May 2007, Nick Piggin wrote:

But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.

OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages after
the first unmap_mapping_range call, because of the i_size write. So if
we have no pages, there is nothing to 'cow' from.

I'd be delighted if you can remove those later unmap_mapping_ranges. As
I recall, the important thing for the copy pages is to be holding the
page lock (or whatever other serialization) on the copied page still
while the copy page is inserted into pagetable: that looks to be so in
your __do_fault.

Hmm, on second thoughts, I think I was right the first time, and do need
the unmap after the pages are truncated. With the lock_page code, after
the first unmap, we can get new ptes mapping pages, and subsequently
they can be COWed and then the original pte zapped before the truncate
loop checks it.

The filesystem (or page cache) allows pages beyond i_size to come in
there? That wasn't a problem before, was it? But now it is?

The filesystem still doesn't, but if i_size is updated after the page is
returned, we can have a problem that was previously taken care of with
the truncate_count but now isn't.

However, I wonder if we can't test mapping_mapped before the spinlock,
which would make most truncates cheaper?

Slightly cheaper, yes, though I doubt it'd be much in comparison with
actually doing any work in unmap_mapping_range or truncate_inode_pages.

But if we're supposing the common case for truncate is unmapped
mappings, then the main cost there will be the locking, which I'm trying
to avoid. Hopefully with this patch, most truncate workloads would get
faster, even though truncate of mapped files is going to be unavoidably
slower.

Suspect you'd need a barrier of some kind between the i_size_write and
the mapping_mapped test?

The unmap_mapping_range that runs after the truncate_inode_pages should
run in the correct order, I believe.

But that's a change we could have made at any time if we'd bothered,
it's not really the issue here.

I don't see how you could, because you need to increment truncate_count.

But I believe this is fixing the issue, even if it does so in a
peripheral manner, because it avoids the added cost for unmapped files.

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Wed, 9 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Wed, 2 May 2007, Nick Piggin wrote:
> >
> > > > But I'm pretty sure (to use your words!) regular truncate was not
> > > > racy before: I believe Andrea's sequence count was handling that
> > > > case fine, without a second unmap_mapping_range.
> > >
> > > OK, I think you're right. I _think_ it should also be OK with the
> > > lock_page version as well: we should not be able to have any pages
> > > after the first unmap_mapping_range call, because of the i_size
> > > write. So if we have no pages, there is nothing to 'cow' from.
> >
> > I'd be delighted if you can remove those later unmap_mapping_ranges.
> > As I recall, the important thing for the copy pages is to be holding
> > the page lock (or whatever other serialization) on the copied page
> > still while the copy page is inserted into pagetable: that looks
> > to be so in your __do_fault.
>
> Hmm, on second thoughts, I think I was right the first time, and do
> need the unmap after the pages are truncated. With the lock_page code,
> after the first unmap, we can get new ptes mapping pages, and
> subsequently they can be COWed and then the original pte zapped before
> the truncate loop checks it.

The filesystem (or page cache) allows pages beyond i_size to come in
there? That wasn't a problem before, was it? But now it is?

> However, I wonder if we can't test mapping_mapped before the spinlock,
> which would make most truncates cheaper?

Slightly cheaper, yes, though I doubt it'd be much in comparison with
actually doing any work in unmap_mapping_range or truncate_inode_pages.

Suspect you'd need a barrier of some kind between the i_size_write and
the mapping_mapped test? But that's a change we could have made at any
time if we'd bothered, it's not really the issue here.

Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:

On Wed, 2 May 2007, Nick Piggin wrote:

But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.

OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages after
the first unmap_mapping_range call, because of the i_size write. So if
we have no pages, there is nothing to 'cow' from.

I'd be delighted if you can remove those later unmap_mapping_ranges. As
I recall, the important thing for the copy pages is to be holding the
page lock (or whatever other serialization) on the copied page still
while the copy page is inserted into pagetable: that looks to be so in
your __do_fault.

Hmm, on second thoughts, I think I was right the first time, and do need
the unmap after the pages are truncated. With the lock_page code, after
the first unmap, we can get new ptes mapping pages, and subsequently
they can be COWed and then the original pte zapped before the truncate
loop checks it.

However, I wonder if we can't test mapping_mapped before the spinlock,
which would make most truncates cheaper?

--
SUSE Labs, Novell Inc.
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c	2007-04-24 15:02:51.0 +1000
+++ linux-2.6/mm/filemap.c	2007-05-09 17:30:47.0 +1000
@@ -2579,8 +2579,7 @@
 	if (rw == WRITE) {
 		write_len = iov_length(iov, nr_segs);
 		end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
-		if (mapping_mapped(mapping))
-			unmap_mapping_range(mapping, offset, write_len, 0);
+		unmap_mapping_range(mapping, offset, write_len, 0);
 	}

 	retval = filemap_write_and_wait(mapping);
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c	2007-05-09 17:25:28.0 +1000
+++ linux-2.6/mm/memory.c	2007-05-09 17:30:22.0 +1000
@@ -1956,6 +1956,9 @@
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

+	if (!mapping_mapped(mapping))
+		return;
+
 	/* Check for overflow. */
 	if (sizeof(holelen) > sizeof(hlen)) {
 		long long holeend =
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Fri, 2007-05-04 at 19:23 +1000, Nick Piggin wrote:

> These ops could also be put to use in bit spinlocks, buffer lock, and
> probably a few other places too.

Ok, the performance hit seems to be under control (especially with the
bigger benchmark showing actual improvements). There's a little bogon
with the PG_waiters bit that you already know about, but apart from that
it should be ok.

I must say I absolutely _LOVE_ the bitops with explicit _lock/_unlock
semantics. That should allow us to remove a whole bunch of dodgy
barriers and smp_mb__before_whatever_magic_crap() things we have all
over the place by providing precisely the expected semantics for bit
locks. Quite a few people have been trying to do bit locks, and I've
always been very worried by how easy it is to get the barriers wrong (or
too many barriers in the fast path) with these.

There are a couple of things we might want to think about regarding the
actual API to bit locks... the API you propose is simple, but it might
not fit some of the most exotic usage requirements, which typically are
related to manipulating other bits along with the lock bit. We might
just ignore them though. In the case of the page lock, that only hits
the slow path, and I would expect other usage scenarios to be similar.

Cheers,
Ben.
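For readers following along, here is roughly what such _lock/_unlock bitops look like when expressed with C11 atomics in userspace. This is an assumed sketch for illustration, not the kernel's implementation (which on powerpc uses larx/stcx. loops with isync/lwsync, as in Nick's patch); the names mirror the proposed API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* test_and_set_bit with acquire semantics: accesses after a successful
 * lock cannot be reordered before it. */
static inline bool test_and_set_bit_lock(int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;
    return (atomic_fetch_or_explicit(addr, mask,
                                     memory_order_acquire) & mask) != 0;
}

/* clear_bit with release semantics: accesses before the unlock cannot
 * be reordered after it. */
static inline void clear_bit_unlock(int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;
    atomic_fetch_and_explicit(addr, ~mask, memory_order_release);
}

/* A minimal bit spinlock built purely from the two primitives above;
 * no extra smp_mb__* style barriers are needed. */
static inline void bit_spin_lock(int nr, atomic_ulong *addr)
{
    while (test_and_set_bit_lock(nr, addr))
        ;                              /* spin until the bit was clear */
}

static inline void bit_spin_unlock(int nr, atomic_ulong *addr)
{
    clear_bit_unlock(nr, addr);
}
```

The point of the explicit semantics is exactly what Ben describes: the barrier placement is baked into the primitive, so callers cannot get it wrong or pay for barriers twice.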
Re: 2.6.22 -mm merge plans
On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
...
> git-unionfs.patch
>
> Does this have a future?

Yes! There are many active users who use our unioning functionality.

Namespace unification consists of several major parts:

1) Duplicate elimination: This can be handled in the VFS. However, it
would clutter up the VFS code with a lot of wrappers around key VFS
functions to select the appropriate dentry/inode/etc. object from the
underlying branch. (You also need to provide efficient and sane
readdir/seekdir semantics, which we do with our "On Disk Format"
support.)

2) Copyup: Having a unified namespace by itself isn't enough. You also
need copy-on-write functionality when the source file is on a read-only
branch. This makes unioning much more useful and is one of the main
attractions for unionfs users.

3) Whiteouts: Whiteouts are a key unioning construct. As was pointed out
at OLS 2006, they are a property of the union and _NOT_ of a branch.
Therefore, they should not be stored persistently on a branch but rather
in some "external" storage.

4) You also need unique and *persistent* inode numbers for network f/s
exports and other unix tools.

5) You need to provide dynamic branch management functionality: adding,
removing, and changing the mode of branches in an existing union.

We have had considerable experience with unioning file systems for years
now; we are currently working on the third generation of the code. All
of the above features, and more, are USED by users, and are NEEDED by
users.

We believe the right approach is the one we've taken, and it is the
least intrusive: a standalone (stackable) file system that doesn't
clutter the VFS, with some small and gradual changes to the VFS to
support stacking. As you may have noticed, we have been successfully
submitting VFS patches to make the VFS more stacking friendly (not just
for Unionfs, but also for eCryptfs, which has been in since 2.6.19).

The older Union Mounts, alas, try to put all that functionality into the
VFS. We recognize that some people think that union mounts at the VFS
level are the "elegant" approach, but we hope people will listen to us
and learn from our experience: unioning may seem simple in principle,
but it is difficult in practice. (See http://unionfs.fileystems.org/ for
a lot more info.)

So we don't think it is a viable long-term approach to have all of the
unioning functionality in the VFS, for two main reasons:

(1) If you want users to use VFS-level unioning functionality ala union
mounts, then you're going to have to implement *all* of the features we
have implemented; the VFS clutter and complexity that would result will
be very considerable, and we just don't think that it'd happen.

(2) Some may suggest a lightweight union mounts that only offers a
subset of the functionality, one that's suitable for placing in the VFS.
In that case, most unionfs users simply won't use it. You'd need union
mounts to provide ALL of the functionality that we have TODAY if you
want users to use it.

As far as we can see, the remaining stumbling block right now is cache
coherency between the layers. Whether you provide unioning as a
stackable f/s or shove it into the VFS, coherency will have to be
addressed. In our upcoming paper and talk at OLS '07, we plan to bring
up and discuss several ideas we've already explored on how to resolve
this incoherency. Our ideas range from complex graph-based pointer
management between objects of all sorts, to simple timestamp-based VFS
hooks. (We've been experimenting with several approaches, and so far
we're leaning toward the simple timestamp-based one, again in the
interest of keeping the VFS changes simple. We hope to have more results
to report by OLS time.)

Josef "Jeff" Sipek, on behalf of the Unionfs team.
Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
Sorry for the late response. I was on vacation last week, and I'm now
buried under a mountain of unread mail.

> > Mel's moveable-zone work.
>
> These patches are what creates ZONE_MOVABLE. The last 6 patches should
> be collapsed into a single patch:
>
> handle-kernelcore=-generic
>
> I believe Yasunori Goto is looking at these from the perspective of
> memory hot-remove and has caught a few bugs in the past. Goto-san may
> be able to comment on whether they have been reviewed recently.

Hmm, I don't think my review is enough. To be precise, I'm just one
user/tester of ZONE_MOVABLE. I have tried to make memory remove patches
on top of Mel-san's ZONE_MOVABLE patch, and the bugs are things that I
found in that work. (I'll post these patches in a few days.)

> The main complexity is in one function in patch one which determines
> where the PFN is in each node for ZONE_MOVABLE. Getting that right so
> that the requested amount of kernel memory is spread as evenly as
> possible is just not straight-forward.

From the memory-hotplug view, ZONE_MOVABLE should be aligned to the
section size, but MAX_ORDER alignment is enough for others.

Bye.

--
Yasunori Goto
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote:

Nick Piggin wrote:

Christoph Hellwig wrote:

Is that every fork/exec or just under certain circumstances? A 5%
regression on every fork/exec is not acceptable.

Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers
will be improved as well with that patch. Then if we have specific
lock/unlock bitops, I hope it should reduce that further.

OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though). These ops could also be
put to use in bit spinlocks, buffer lock, and probably a few other
places too.

2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0
+patch3  1.54-1.57   165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.

OK, with my new lock/unlock_page, dd if=large (bigger than RAM) sparse
file of=/dev/null with an experimentally optimal block size (32K) goes
from 626MB/s to 683MB/s on a 2-CPU G5 booted with maxcpus=1.

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote:

Christoph Hellwig wrote:

Is that every fork/exec or just under certain circumstances? A 5%
regression on every fork/exec is not acceptable.

Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers
will be improved as well with that patch. Then if we have specific
lock/unlock bitops, I hope it should reduce that further.

OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though). These ops could also be
put to use in bit spinlocks, buffer lock, and probably a few other
places too.

2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0
+patch3  1.54-1.57   165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.

--
SUSE Labs, Novell Inc.

Index: linux-2.6/include/asm-powerpc/bitops.h
===
--- linux-2.6.orig/include/asm-powerpc/bitops.h	2007-05-04 16:08:20.0 +1000
+++ linux-2.6/include/asm-powerpc/bitops.h	2007-05-04 16:14:39.0 +1000
@@ -87,6 +87,24 @@
 		: "cc" );
 }

+static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+{
+	unsigned long old;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+	LWSYNC_ON_SMP
+"1:"	PPC_LLARX "%0,0,%3	# clear_bit_unlock\n"
+	"andc	%0,%0,%2\n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%0,0,%3\n"
+	"bne-	1b"
+	: "=&r" (old), "+m" (*p)
+	: "r" (mask), "r" (p)
+	: "cc" );
+}
+
 static __inline__ void change_bit(int nr, volatile unsigned long *addr)
 {
 	unsigned long old;
@@ -126,6 +144,27 @@
 	return (old & mask) != 0;
 }

+static __inline__ int test_and_set_bit_lock(unsigned long nr,
+					volatile unsigned long *addr)
+{
+	unsigned long old, t;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+"1:"	PPC_LLARX "%0,0,%3		# test_and_set_bit_lock\n"
+	"or	%1,%0,%2 \n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%1,0,%3 \n"
+	"bne-	1b"
+	ISYNC_ON_SMP
+	: "=&r" (old), "=&r" (t)
+	: "r" (mask), "r" (p)
+	: "cc", "memory");
+
+	return (old & mask) != 0;
+}
+
 static __inline__ int test_and_clear_bit(unsigned long nr,
 					 volatile unsigned long *addr)
 {
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h	2007-05-04 16:14:36.0 +1000
+++ linux-2.6/include/linux/pagemap.h	2007-05-04 16:17:34.0 +1000
@@ -136,13 +136,18 @@
 extern void FASTCALL(__wait_on_page_locked(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));

+static inline int trylock_page(struct page *page)
+{
+	return (likely(!TestSetPageLocked_Lock(page)));
+}
+
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page(page);
 }

@@ -153,7 +158,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page_nosync(page);
 }
Index: linux-2.6/drivers/scsi/sg.c
===
--- linux-2.6.orig/drivers/scsi/sg.c	2007-04-12 14:35:08.0 +1000
+++ linux-2.6/drivers/scsi/sg.c	2007-05-04 16:23:27.0 +1000
@@ -1734,7 +1734,7 @@
 		 */
 		flush_dcache_page(pages[i]);
 		/* ?? Is locking needed? I don't think so */
-		/* if (TestSetPageLocked(pages[i]))
+		/* if (!trylock_page(pages[i]))
 		   goto out_unlock; */
 	}
Index: linux-2.6/fs/cifs/file.c
===
--- linux-2.6.orig/fs/cifs/file.c	2007-04-12 14:35:09.0 +1000
+++ linux-2.6/fs/cifs/file.c	2007-05-04 16:23:36.0 +1000
@@ -1229,7 +1229,7 @@
 			if (first < 0)
 				lock_page(page);
-			else if (TestSetPageLocked(page))
+			else if (!trylock_page(page))
 				break;

 			if (unlikely(page->mapping != mapping)) {
Index: linux-2.6/fs/jbd/commit.c
===
Re: 2.6.22 -mm merge plans -- vm bugfixes
Andrew Morton wrote:

On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

 void fastcall unlock_page(struct page *page)
 {
+	VM_BUG_ON(!PageLocked(page));
 	smp_mb__before_clear_bit();
-	if (!TestClearPageLocked(page))
-		BUG();
-	smp_mb__after_clear_bit();
-	wake_up_page(page, PG_locked);
+	ClearPageLocked(page);
+	if (unlikely(test_bit(PG_waiters, &page->flags))) {
+		clear_bit(PG_waiters, &page->flags);
+		wake_up_page(page, PG_locked);
+	}
 }

Why is that significantly faster than plain old wake_up_page(), which
tests waitqueue_active()?

Because it needs fewer barriers and doesn't touch a random hash
cacheline in the fastpath.

--
SUSE Labs, Novell Inc.
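The shape of the PG_waiters optimization can be modelled in a few lines of userspace C: the unlocker touches only the flags word unless a waiter has announced itself, so the common uncontended unlock never goes near the waitqueue hash. This is a deliberately simplified, single-threaded sketch with made-up names; the real kernel version needs the careful barrier placement discussed in this thread.

```c
#include <assert.h>
#include <stdatomic.h>

enum { PG_locked = 0, PG_waiters = 1 };

static int wakeups;                        /* counts wake_up_page() calls */
static void wake_up_page_stub(void) { wakeups++; }

static void lock_page_sketch(atomic_uint *flags)
{
    /* Fastpath try-lock.  On contention a real waiter would set
     * PG_waiters and sleep on the hashed waitqueue; here we only model
     * the flag, so this branch is never taken single-threaded. */
    while (atomic_fetch_or(flags, 1u << PG_locked) & (1u << PG_locked))
        atomic_fetch_or(flags, 1u << PG_waiters);
}

static void unlock_page_sketch(atomic_uint *flags)
{
    atomic_fetch_and(flags, ~(1u << PG_locked));   /* ClearPageLocked */
    if (atomic_load(flags) & (1u << PG_waiters)) { /* slowpath only */
        atomic_fetch_and(flags, ~(1u << PG_waiters));
        wake_up_page_stub();
    }
}
```

With no waiters, the unlock is one atomic op plus one test of the already-hot flags word, instead of an unconditional hash lookup and waitqueue_active() check.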
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Fri, 4 May 2007, Benjamin Herrenschmidt wrote:

> > The SLUB allocator relies on struct page fields first_page and slab,
> > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
> > be used for the lowest level of pagetable pages. This was obstructing
> > SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert
> > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd
> > want partpages, so continue to use kmem_caches for pmd, pud and pgd).
> > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM.
>
> Interesting... I'll have a look asap.

I would also recommend looking at removing the constructors for the
remaining slabs. A constructor requires that SLUB never touch the object
(the same situation as results from enabling debugging), so SLUB must
increase the object size in order to put the free pointer after the
object. In the case of an order-of-2 cache this has the particularly bad
effect of doubling the object size.

If the objects can be overwritten on free (no constructor), then we can
use the first word of the object as a free pointer on kfree, meaning we
use a hot cacheline, so no cache miss. On alloc we have already touched
the first cacheline, which also avoids a cacheline fetch there. This is
the optimal way of operation for SLUB.

Hmmm... We could add an option to allow the use of a constructor while
keeping the free pointer at the beginning of the object? Then we would
have to zap the first word on alloc. It would work like quicklists. Add
SLAB_FREEPOINTER_MAY_OVERLAP?
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 2007-05-03 at 22:04 +0100, Hugh Dickins wrote: > On Thu, 3 May 2007, Hugh Dickins wrote: > > > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > > as intended: maybe it just works some of the time. I'm not going > > to hazard a guess as to how to fix it up, will resume looking at > > the powerpc's quicklist potential later. > > Here's the patch I've been testing on G5, with 4k and with 64k pages, > with SLAB and with SLUB. But, though it doesn't crash, the pgd > kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity > for using highorder allocations where SLAB would stick to order 0: > under load, exec's mm_init gets page allocation failure on order 4 > - SLUB's calculate_order may need some retuning. (I'd expect it to > be going for order 3 actually, I'm not sure how order 4 comes about.) > > I don't know how offensive Ben and Paulus may find this patch: > the kmem_cache use was nicely done and this messes it up a little. > > > The SLUB allocator relies on struct page fields first_page and slab, > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then > be used for the lowest level of pagetable pages. This was obstructing > SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd > want partpages, so continue to use kmem_caches for pmd, pud and pgd). > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM. Interesting... I'll have a look asap. Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 3 May 2007, Christoph Lameter wrote: > > There are SLUB patches pending (not in rc7-mm2 as far as I can recall) > that reduce the default page order sizes to head off this issue. The > defaults were initially too large (and they still default to large > for testing if Mel's Antifrag work is detected to be active). Sounds good. > > - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], > > - GFP_KERNEL|__GFP_REPEAT); > > + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); > > __GFP_REPEAT is unusual here but this was carried over from the > kmem_cache_alloc it seems. Hmm... There is some variance on how we do this > between arches. Should we uniformly set or not set this flag? Not something to get into in this patch, but it did surprise me too. I believe __GFP_REPEAT should be avoided, and I don't see justification for it here (but need to remember not to do a blind virt_to_page on the result in some places if it might return NULL - which IIRC it actually can do even if __GFP_REPEAT, when chosen for OOM kill). But I've a suspicion it got put in there for some good reason I don't know about. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 3 May 2007, Hugh Dickins wrote: > On Thu, 3 May 2007, Hugh Dickins wrote: > > > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > > as intended: maybe it just works some of the time. I'm not going > > to hazard a guess as to how to fix it up, will resume looking at > > the powerpc's quicklist potential later. > > Here's the patch I've been testing on G5, with 4k and with 64k pages, > with SLAB and with SLUB. But, though it doesn't crash, the pgd > kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity > for using highorder allocations where SLAB would stick to order 0: > under load, exec's mm_init gets page allocation failure on order 4 > - SLUB's calculate_order may need some retuning. (I'd expect it to > be going for order 3 actually, I'm not sure how order 4 comes about.) There are SLUB patches pending (not in rc7-mm2 as far as I can recall) that reduce the default page order sizes to head off this issue. The defaults were initially too large (and they still default to large for testing if Mel's Antifrag work is detected to be active). > - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], > - GFP_KERNEL|__GFP_REPEAT); > + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); __GFP_REPEAT is unusual here but this was carried over from the kmem_cache_alloc it seems. Hmm... There is some variance on how we do this between arches. Should we uniformly set or not set this flag? 
[EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-ia64/*
include/asm-ia64/pgalloc.h:	return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h:	return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h:	return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h:	void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h:	return quicklist_alloc(0, GFP_KERNEL, NULL);
[EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc arch/i386/mm/*
arch/i386/mm/pgtable.c:	pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);
[EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-sparc64/*
include/asm-sparc64/pgalloc.h:	return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h:	return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h:	return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h:	void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
[EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-x86_64/*
include/asm-x86_64/pgalloc.h:	return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h:	return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h:	pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
include/asm-x86_64/pgalloc.h:	return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h:	void *p = (void *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 3 May 2007, Hugh Dickins wrote: > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > as intended: maybe it just works some of the time. I'm not going > to hazard a guess as to how to fix it up, will resume looking at > the powerpc's quicklist potential later. Here's the patch I've been testing on G5, with 4k and with 64k pages, with SLAB and with SLUB. But, though it doesn't crash, the pgd kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity for using highorder allocations where SLAB would stick to order 0: under load, exec's mm_init gets page allocation failure on order 4 - SLUB's calculate_order may need some retuning. (I'd expect it to be going for order 3 actually, I'm not sure how order 4 comes about.) I don't know how offensive Ben and Paulus may find this patch: the kmem_cache use was nicely done and this messes it up a little. The SLUB allocator relies on struct page fields first_page and slab, overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then be used for the lowest level of pagetable pages. This was obstructing SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd want partpages, so continue to use kmem_caches for pmd, pud and pgd). But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM. 
Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---

 arch/powerpc/Kconfig          |    4 
 arch/powerpc/mm/init_64.c     |   17 ++---
 include/asm-powerpc/pgalloc.h |   26 +++---
 3 files changed, 21 insertions(+), 26 deletions(-)

--- 2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-04-26 13:33:51.0 +0100
+++ linux/arch/powerpc/Kconfig	2007-05-03 20:45:12.0 +0100
@@ -31,6 +31,10 @@ config MMU
 	bool
 	default y
 
+config QUICKLIST
+	bool
+	default y
+
 config GENERIC_HARDIRQS
 	bool
 	default y
--- 2.6.21-rc7-mm2/arch/powerpc/mm/init_64.c	2007-04-26 13:33:51.0 +0100
+++ linux/arch/powerpc/mm/init_64.c	2007-05-03 20:45:12.0 +0100
@@ -146,21 +146,16 @@ static void zero_ctor(void *addr, struct
 	memset(addr, 0, kmem_cache_size(cache));
 }
 
-#ifdef CONFIG_PPC_64K_PAGES
-static const unsigned int pgtable_cache_size[3] = {
-	PTE_TABLE_SIZE, PMD_TABLE_SIZE, PGD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-	"pte_pmd_cache", "pmd_cache", "pgd_cache",
-};
-#else
 static const unsigned int pgtable_cache_size[2] = {
-	PTE_TABLE_SIZE, PMD_TABLE_SIZE
+	PGD_TABLE_SIZE, PMD_TABLE_SIZE
 };
 static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-	"pgd_pte_cache", "pud_pmd_cache",
-};
+#ifdef CONFIG_PPC_64K_PAGES
+	"pgd_cache", "pmd_cache",
+#else
+	"pgd_cache", "pud_pmd_cache",
 #endif /* CONFIG_PPC_64K_PAGES */
+};
 
 #ifdef CONFIG_HUGETLB_PAGE
 /* Hugepages need one extra cache, initialized in hugetlbpage.c.  We
--- 2.6.21-rc7-mm2/include/asm-powerpc/pgalloc.h	2007-02-04 18:44:54.0 +
+++ linux/include/asm-powerpc/pgalloc.h	2007-05-03 20:45:12.0 +0100
@@ -10,21 +10,15 @@
 #include
 #include
 #include
+#include
 
 extern struct kmem_cache *pgtable_cache[];
 
-#ifdef CONFIG_PPC_64K_PAGES
-#define PTE_CACHE_NUM	0
-#define PMD_CACHE_NUM	1
-#define PGD_CACHE_NUM	2
-#define HUGEPTE_CACHE_NUM 3
-#else
-#define PTE_CACHE_NUM	0
-#define PMD_CACHE_NUM	1
-#define PUD_CACHE_NUM	1
 #define PGD_CACHE_NUM	0
+#define PUD_CACHE_NUM	1
+#define PMD_CACHE_NUM	1
 #define HUGEPTE_CACHE_NUM 2
-#endif
+#define PTE_CACHE_NUM	3	/* from quicklist rather than kmem_cache */
 
 /*
  * This program is free software; you can redistribute it and/or
@@ -97,8 +91,7 @@ static inline void pmd_free(pmd_t *pmd)
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM],
-				GFP_KERNEL|__GFP_REPEAT);
+	return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
@@ -109,7 +102,7 @@ static inline struct page *pte_alloc_one
 
 static inline void pte_free_kernel(pte_t *pte)
 {
-	kmem_cache_free(pgtable_cache[PTE_CACHE_NUM], pte);
+	quicklist_free(0, NULL, pte);
}
 
 static inline void pte_free(struct page *ptepage)
@@ -136,7 +129,10 @@ static inline void pgtable_free(pgtable_
 	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
 	int cachenum = pgf.val & PGF_CACHENUM_MASK;
 
-	kmem_cache_free(pgtable_cache[cachenum], p);
+	if (cachenum == PTE_CACHE_NUM)
+		quicklist_free(0, NULL, p);
+	else
+		kmem_cache_free(pgtable_cache[cachenum], p);
 }
 
 extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
@
Re: 2.6.22 -mm merge plans
On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > kprobes does this kind of synchronization internally, so the marker
> > wrapper should probably as well.
>
> The problem appears on heavily loaded systems. Doing 50
> synchronize_sched() calls in a row can take up to a few seconds on a
> 4-way machine. This is why I prefer to do it in the module to which
> the callbacks belong.

We recently had a discussion on a batch unregistration interface for
kprobes. I'm not very happy with having such different interfaces for
different kinds of probe registrations.
Re: 2.6.22 -mm merge plans
Here is the reworked patch, except for one comment:

* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> > +void blk_probe_disconnect(void)
> > +{
> > +	uint8_t i;
> > +
> > +	for (i = 0; i < NUM_PROBES; i++) {
> > +		marker_remove_probe(probe_array[i].name);
> > +	}
> > +	synchronize_sched();	/* Wait for probes to finish */
>
> kprobes does this kind of synchronization internally, so the marker
> wrapper should probably as well.

The problem appears on heavily loaded systems. Doing 50
synchronize_sched() calls in a row can take up to a few seconds on a
4-way machine. This is why I prefer to do it in the module to which
the callbacks belong.

Here is the reviewed patch. It depends on a newer version of markers
I'll send to Andrew soon.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>

Index: linux-2.6-lttng/block/elevator.c
===
--- linux-2.6-lttng.orig/block/elevator.c	2007-05-03 12:27:12.0 -0400
+++ linux-2.6-lttng/block/elevator.c	2007-05-03 12:54:58.0 -0400
@@ -32,7 +32,7 @@
 #include
 #include
 #include
-#include
+#include
 #include
 #include
@@ -571,7 +571,7 @@
 	unsigned ordseq;
 	int unplug_it = 1;
 
-	blk_add_trace_rq(q, rq, BLK_TA_INSERT);
+	trace_mark(blk_request_insert, "%p %p", q, rq);
 
 	rq->q = q;
@@ -757,7 +757,7 @@
 		 * not be passed by new incoming requests
 		 */
 		rq->cmd_flags |= REQ_STARTED;
-		blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+		trace_mark(blk_request_issue, "%p %p", q, rq);
 	}
 
 	if (!q->boundary_rq || q->boundary_rq == rq) {
Index: linux-2.6-lttng/block/ll_rw_blk.c
===
--- linux-2.6-lttng.orig/block/ll_rw_blk.c	2007-05-03 12:27:12.0 -0400
+++ linux-2.6-lttng/block/ll_rw_blk.c	2007-05-03 12:54:58.0 -0400
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -1551,7 +1552,7 @@
 	if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
 		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
-		blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
+		trace_mark(blk_plug_device, "%p %p %d", q, NULL, 0);
 	}
 }
@@ -1617,7 +1618,7 @@
 	 * devices don't necessarily have an ->unplug_fn defined
 	 */
 	if (q->unplug_fn) {
-		blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+		trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);
 
 		q->unplug_fn(q);
@@ -1628,7 +1629,7 @@
 {
 	request_queue_t *q = container_of(work, request_queue_t, unplug_work);
 
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+	trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 			q->rq.count[READ] + q->rq.count[WRITE]);
 
 	q->unplug_fn(q);
@@ -1638,7 +1639,7 @@
 {
 	request_queue_t *q = (request_queue_t *)data;
 
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
+	trace_mark(blk_pdu_unplug_timer, "%p %p %d", q, NULL,
 			q->rq.count[READ] + q->rq.count[WRITE]);
 
 	kblockd_schedule_work(&q->unplug_work);
@@ -2148,7 +2149,7 @@
 	rq_init(q, rq);
 
-	blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
+	trace_mark(blk_get_request, "%p %p %d", q, bio, rw);
 out:
 	return rq;
 }
@@ -2178,7 +2179,7 @@
 	if (!rq) {
 		struct io_context *ioc;
 
-		blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+		trace_mark(blk_sleep_request, "%p %p %d", q, bio, rw);
 
 		__generic_unplug_device(q);
 		spin_unlock_irq(q->queue_lock);
@@ -2252,7 +2253,7 @@
  */
 void blk_requeue_request(request_queue_t *q, struct request *rq)
 {
-	blk_add_trace_rq(q, rq, BLK_TA_REQUEUE);
+	trace_mark(blk_requeue, "%p %p", q, rq);
 
 	if (blk_rq_tagged(rq))
 		blk_queue_end_tag(q, rq);
@@ -2937,7 +2938,7 @@
 		if (!ll_back_merge_fn(q, req, bio))
 			break;
 
-		blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);
+		trace_mark(blk_bio_backmerge, "%p %p", q, bio);
 
 		req->biotail->bi_next = bio;
 		req->biotail = bio;
@@ -2954,7 +2955,7 @@
 		if (!ll_front_merge_fn(q, req, bio))
 			break;
 
-		blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);
+		trace_mark(blk_bio_frontmerge, "%p %p", q, bio);
 
 		bio->bi_next = req->bio;
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> void fastcall unlock_page(struct page *page)
> {
> +	VM_BUG_ON(!PageLocked(page));
> 	smp_mb__before_clear_bit();
> -	if (!TestClearPageLocked(page))
> -		BUG();
> -	smp_mb__after_clear_bit();
> -	wake_up_page(page, PG_locked);
> +	ClearPageLocked(page);
> +	if (unlikely(test_bit(PG_waiters, &page->flags))) {
> +		clear_bit(PG_waiters, &page->flags);
> +		wake_up_page(page, PG_locked);
> +	}
> }

Why is that significantly faster than plain old wake_up_page(), which
tests waitqueue_active()?
Re: 2.6.22 -mm merge plans: slub
Hmmm... There are a gazillion configs to choose from. It works fine
with cell_defconfig. If I switch to 2 processors I can enable SLUB; if
I switch to 4 I cannot.

I saw some other config weirdness, like being unable to set SMP if SLOB
is enabled with some configs. This should not work and does not work,
but the menus then vanish and one can still configure lots of
processors while not having enabled SMP.

It works as far as I can tell... The rest is arch weirdness.
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007, William Lee Irwin III wrote:

> I've seen this crash in flush_old_exec() before. ISTR it being due to
> slub vs. pagetable alignment or something on that order.

From other discussion regarding SLAB: it may be necessary to set the
default alignment to 8 bytes on 32-bit powerpc, because it requires
that alignment for 64-bit values. SLUB will *not* disable debugging
like SLAB does if you do that.
Re: 2.6.22 -mm merge plans
On Thu, May 03, 2007 at 11:04:15AM -0400, Mathieu Desnoyers wrote:
> -	blk_add_trace_rq(q, rq, BLK_TA_INSERT);
> +	MARK(blk_request_insert, "%p %p", q, rq);

I don't really like the shouting MARK name very much. Can we have a
less-generic, less shouting name, e.g. trace_marker? The above would
then be:

	trace_mark(blk_request_insert, "%p %p", q, rq);

> +#define NUM_PROBES ARRAY_SIZE(probe_array)

Just get rid of this and use ARRAY_SIZE directly below.

> +int blk_probe_connect(void)
> +{
> +	int result;
> +	uint8_t i;

Just use an int for for loops. It's easy to read and probably faster on
most systems (if the compiler isn't smart enough, it promotes it to int
anyway during code generation).

> +void blk_probe_disconnect(void)
> +{
> +	uint8_t i;
> +
> +	for (i = 0; i < NUM_PROBES; i++) {
> +		marker_remove_probe(probe_array[i].name);
> +	}
> +	synchronize_sched();	/* Wait for probes to finish */

kprobes does this kind of synchronization internally, so the marker
wrapper should probably as well.

> +static int blk_probes_ref = 0;

No need to initialize this.

> /*
>  * Send out a notify message.
>  */
> @@ -229,6 +233,12 @@
> 	blk_remove_tree(bt->dir);
> 	free_percpu(bt->sequence);
> 	kfree(bt);
> +	mutex_lock(&blk_probe_mutex);
> +	if (blk_probes_ref == 1) {
> +		blk_probe_disconnect();
> +		blk_probes_ref--;
> +	}

	if (--blk_probes_ref == 0)
		blk_probe_disconnect();

would probably be a tad cleaner.

> +	if (!blk_probes_ref) {
> +		blk_probe_connect();
> +		blk_probes_ref++;
> +	}

Ditto here with a:

	if (!blk_probes_ref++)
		blk_probe_connect();

Also the "connect" in the name seems rather odd; what about arm/disarm
instead?

> static __init int blk_trace_init(void)
> {
> 	mutex_init(&blk_tree_mutex);
> +	mutex_init(&blk_probe_mutex);

Both should use DEFINE_MUTEX for compile-time initialization instead.

Also, it's probably better to put the trace points into blktrace.c;
that means all blktrace code can be static and self-contained. And we
can probably do some additional cleanups by simplifying things later
on.
Re: 2.6.22 -mm merge plans
* Andrew Morton ([EMAIL PROTECTED]) wrote:
> > Although some, like Christoph and myself, think that it would benefit
> > the kernel community to have a common infrastructure for more than just
> > markers (meaning a common serialization and buffering mechanism), it
> > does not change the fact that the markers, being in mainline, are
> > usable by projects through additional kernel modules.
> >
> > If we are looking at current "potential users" that are already in
> > mainline, we could change blktrace to make it use the markers.
>
> That'd be a useful demonstration.

Here is a proof of concept patch, for demonstration purposes, of moving
blktrace to the markers.

A few remarks: this patch has the positive effect of removing some code
from the block io tracing hot paths, minimizing the i-cache impact in a
system where the io tracing is compiled in but inactive.

It also moves the blk tracing code from a header (and therefore from the
body of the instrumented functions) to a separate C file. There, as soon
as one device has to be traced, every device has to fall into the
tracing function call. This is slower than the previous inline function,
which tested the condition quickly. If it becomes a show stopper, it
could be fixed by having the possibility to test a supplementary
condition, dependent on the marker context, at the marker site, just
after the enable/disable test.

It does not make the code smaller, since I left all the specialized
tracing functions for requests, bio, generic, remap, which would go away
once a generic infrastructure is in place to serialize the information
passed to the marker. This is mostly why I consider it a proof of
concept.

The patch, named "markers-port-blktrace-to-markers.patch", can be placed
after the marker patches in the 2.6.21-rc7-mm2 series.
Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>

Index: linux-2.6-lttng/block/elevator.c
===
--- linux-2.6-lttng.orig/block/elevator.c	2007-05-02 20:33:22.0 -0400
+++ linux-2.6-lttng/block/elevator.c	2007-05-02 20:33:49.0 -0400
@@ -32,7 +32,7 @@
 #include
 #include
 #include
-#include
+#include
 #include
 #include
@@ -571,7 +571,7 @@
 	unsigned ordseq;
 	int unplug_it = 1;
 
-	blk_add_trace_rq(q, rq, BLK_TA_INSERT);
+	MARK(blk_request_insert, "%p %p", q, rq);
 
 	rq->q = q;
@@ -757,7 +757,7 @@
 		 * not be passed by new incoming requests
 		 */
 		rq->cmd_flags |= REQ_STARTED;
-		blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+		MARK(blk_request_issue, "%p %p", q, rq);
 	}
 
 	if (!q->boundary_rq || q->boundary_rq == rq) {
Index: linux-2.6-lttng/block/ll_rw_blk.c
===
--- linux-2.6-lttng.orig/block/ll_rw_blk.c	2007-05-02 20:33:32.0 -0400
+++ linux-2.6-lttng/block/ll_rw_blk.c	2007-05-02 23:21:02.0 -0400
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -1551,7 +1552,7 @@
 	if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
 		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
-		blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
+		MARK(blk_plug_device, "%p %p %d", q, NULL, 0);
 	}
 }
@@ -1617,7 +1618,7 @@
 	 * devices don't necessarily have an ->unplug_fn defined
 	 */
 	if (q->unplug_fn) {
-		blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+		MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);
 
 		q->unplug_fn(q);
@@ -1628,7 +1629,7 @@
 {
 	request_queue_t *q = container_of(work, request_queue_t, unplug_work);
 
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+	MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 			q->rq.count[READ] + q->rq.count[WRITE]);
 
 	q->unplug_fn(q);
@@ -1638,7 +1639,7 @@
 {
 	request_queue_t *q = (request_queue_t *)data;
 
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
+	MARK(blk_pdu_unplug_timer, "%p %p %d", q, NULL,
 			q->rq.count[READ] + q->rq.count[WRITE]);
 
 	kblockd_schedule_work(&q->unplug_work);
@@ -2148,7 +2149,7 @@
 	rq_init(q, rq);
 
-	blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
+	MARK(blk_get_request, "%p %p %d", q, bio, rw);
 out:
 	return rq;
 }
@@ -2178,7 +2179,7 @@
 	if (!rq) {
 		struct io_context *ioc;
 
-		blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+		MARK(blk_sleep_request, "%p %p %d", q, bio, rw);
 
 		__generic_unplug_device(q);
 		spi
Re: 2.6.22 -mm merge plans
Hi Andi,

This plan makes sense. I will split the "patched-in enable/disable
flags" part into a separate piece (good idea!) and then submit the
LTTng core to Andrew. Christoph has a good point about wanting a usable
infrastructure to go in.

Regarding your plan, I must argue that blktrace is not a general
purpose tracing infrastructure, but one dedicated to block io tracing.
Therefore, it makes sense to bring in the generic infrastructure first
and then convert blktrace to it.

Mathieu

* Andi Kleen ([EMAIL PROTECTED]) wrote:
> > If we are looking at current "potential users" that are already in
> > mainline, we could change blktrace to make it use the markers.
>
> Ok, but do it step by step:
> - split out useful pieces like the "patched-in enable/disable flags"
>   and submit them separately with an example user or two
>   [I got a couple of candidates e.g. with some of the sysctls in VM or
>   networking]
> - post and merge that.
> - don't implement anything initially that is not needed by blktrace
> - post a minimal marker patch together with the blktrace
>   conversion for review again on linux-kernel
> - await review comments. This review would not cover the basic
>   need of markers, just the specific implementation.
> - then potentially incorporate review comments
> - then merge
> - later add features with individual review/discussion as new users in
>   the kernel are added.
>
> -Andi

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: 2.6.22 -mm merge plans
* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote:
> > My statement was probably not clear enough. The actual marker code is
> > useful as-is without any further kernel patching required: SystemTAP
> > is an example where they use external modules to load probes that can
> > connect either to markers or through kprobes. LTTng, in its current
> > state, has a mostly modular core that also uses the markers.
>
> That just means you have to load an enormous amount of external crap
> that replaces the missing kernel functionality. It's exactly the
> situation we want to avoid.

It makes sense to use -mm to hold the whole usable infrastructure
before submitting it to mainline. I will submit my core LTTng patches
to Andrew in the following weeks. There is no hurry, from the LTTng
perspective, to merge the markers sooner, although they could be useful
to other (external) projects meanwhile.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:

> On Thu, 3 May 2007, Nick Piggin wrote:
>
>>>> @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>>>>  {
>>>>  	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>>>>
>>>> +	set_bit(PG_waiters, &page->flags);
>>>> +	if (unlikely(!TestSetPageLocked(page))) {
>>>
>>> What happens if another cpu is coming through __lock_page at the
>>> same time, did its set_bit, now finds PageLocked, and so proceeds
>>> to the __wait_on_bit_lock? But this cpu now clears PG_waiters,
>>> so this task's unlock_page won't wake the other?
>>
>> You're right, we can't clear the bit here. Doubt it mattered much
>> anyway?
>
> Ah yes, that's a good easy answer. In fact, just remove this whole
> test and block (we already tried TestSetPageLocked outside just a
> short while ago, so this repeat won't often save anything).

Yeah, I was getting too clever for my own boots :)

I think the patch has merit though. It is unfortunate that it uses
another page flag, however it seemed to give quite a bit of speedup on
unlock_page (probably from both the barriers and an extra random
cacheline load (from the hash)). I guess it has to get good results
from more benchmarks...

>> BTW. I also forgot an smp_mb__after_clear_bit() before the
>> wake_up_page above... that barrier is in the slow path as well
>> though, so it shouldn't matter either.
>
> I vaguely wondered how such barriers had managed to dissolve away,
> but cranking my brain up to think about barriers takes far too long.

That barrier was one too many :) However I believe the fastpath barrier
can go away because the PG_locked operation is depending on the same
cacheline as PG_waiters.

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 3 May 2007, Nick Piggin wrote:
>>> @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>>>  {
>>>  	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>>>
>>> +	set_bit(PG_waiters, &page->flags);
>>> +	if (unlikely(!TestSetPageLocked(page))) {
>>
>> What happens if another cpu is coming through __lock_page at the
>> same time, did its set_bit, now finds PageLocked, and so proceeds
>> to the __wait_on_bit_lock? But this cpu now clears PG_waiters,
>> so this task's unlock_page won't wake the other?
>
> You're right, we can't clear the bit here. Doubt it mattered much
> anyway?

Ah yes, that's a good easy answer. In fact, just remove this whole test
and block (we already tried TestSetPageLocked outside just a short
while ago, so this repeat won't often save anything).

> BTW. I also forgot an smp_mb__after_clear_bit() before the
> wake_up_page above... that barrier is in the slow path as well though,
> so it shouldn't matter either.

I vaguely wondered how such barriers had managed to dissolve away, but
cranking my brain up to think about barriers takes far too long.

>>> +		clear_bit(PG_waiters, &page->flags);
>>> +		return;
>>> +	}
>>>  	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
>>>  				TASK_UNINTERRUPTIBLE);
>>>  }
Re: 2.6.22 -mm merge plans -- vm bugfixes
Christoph Hellwig wrote:
> On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:
> > The attached patch gets performance up a bit by avoiding some
> > barriers and some cachelines:
> >
> > G5
> >            pagefault   fork         exec
> > 2.6.21     1.49-1.51   164.6-170.8  741.8-760.3
> > +patch     1.71-1.73   175.2-180.8  780.5-794.2
> > +patch2    1.61-1.63   169.8-175.0  748.6-757.0
> >
> > So that brings the fork/exec hits down to much less than 5%, and
> > would likely speed up other things that lock the page, like write
> > or page reclaim.
>
> Is that every fork/exec or just under certain circumstances? A 5%
> regression on every fork/exec is not acceptable.

Well after patch2, G5 fork is 3% and exec is 1%; I'd say the P4 numbers will be improved as well with that patch. Then if we have specific lock/unlock bitops, I hope it should reduce that further.

The overhead that is there should just be coming from the extra overhead in the file backed fault handler. For noop fork/execs, I think that tends to be more pronounced; it is hard to see any difference on any non-micro benchmark.

The other thing is that I think there could be some cache effects happening -- for example the exec numbers on the 2nd line are disproportionately large.

It definitely isn't a good thing to drop performance anywhere though, so I'll keep looking for improvements.

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Thu, 3 May 2007, Nick Piggin wrote:

The problem is that lock/unlock_page is expensive on powerpc, and if we improve that, we improve more than just the fault handler... The attached patch gets performance up a bit by avoiding some barriers and some cachelines:

There's a strong whiff of raciness about this... but I could very easily be wrong.

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c	2007-05-02 15:00:26.0 +1000
+++ linux-2.6/mm/filemap.c	2007-05-03 08:34:32.0 +1000
@@ -532,11 +532,13 @@
  */
 void fastcall unlock_page(struct page *page)
 {
+	VM_BUG_ON(!PageLocked(page));
 	smp_mb__before_clear_bit();
-	if (!TestClearPageLocked(page))
-		BUG();
-	smp_mb__after_clear_bit();
-	wake_up_page(page, PG_locked);
+	ClearPageLocked(page);
+	if (unlikely(test_bit(PG_waiters, &page->flags))) {
+		clear_bit(PG_waiters, &page->flags);
+		wake_up_page(page, PG_locked);
+	}
 }
 EXPORT_SYMBOL(unlock_page);

@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+	set_bit(PG_waiters, &page->flags);
+	if (unlikely(!TestSetPageLocked(page))) {

What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other?

You're right, we can't clear the bit here. Doubt it mattered much anyway?

BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page above... that barrier is in the slow path as well though, so it shouldn't matter either.

+	clear_bit(PG_waiters, &page->flags);
+	return;
+	}
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
 			TASK_UNINTERRUPTIBLE);
 }

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 3 May 2007, Nick Piggin wrote:
>
> The problem is that lock/unlock_page is expensive on powerpc, and
> if we improve that, we improve more than just the fault handler...
>
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:

There's a strong whiff of raciness about this... but I could very easily be wrong.

> Index: linux-2.6/mm/filemap.c
> ===
> --- linux-2.6.orig/mm/filemap.c	2007-05-02 15:00:26.0 +1000
> +++ linux-2.6/mm/filemap.c	2007-05-03 08:34:32.0 +1000
> @@ -532,11 +532,13 @@
>   */
>  void fastcall unlock_page(struct page *page)
>  {
> +	VM_BUG_ON(!PageLocked(page));
>  	smp_mb__before_clear_bit();
> -	if (!TestClearPageLocked(page))
> -		BUG();
> -	smp_mb__after_clear_bit();
> -	wake_up_page(page, PG_locked);
> +	ClearPageLocked(page);
> +	if (unlikely(test_bit(PG_waiters, &page->flags))) {
> +		clear_bit(PG_waiters, &page->flags);
> +		wake_up_page(page, PG_locked);
> +	}
>  }
>  EXPORT_SYMBOL(unlock_page);
>
> @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>  {
>  	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>
> +	set_bit(PG_waiters, &page->flags);
> +	if (unlikely(!TestSetPageLocked(page))) {

What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other?

> +	clear_bit(PG_waiters, &page->flags);
> +	return;
> +	}
>  	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
>  			TASK_UNINTERRUPTIBLE);
>  }
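The protocol under discussion can be modelled in userspace with C11 atomics. This is only a sketch of the idea, not the kernel code: wait queues, sync_page and the explicit smp_mb__* barriers are all elided (seq_cst atomics stand in for them), and the `_model` names are invented. The slow path sets PG_waiters before retrying and never clears it, which is the fix the thread converges on for the race Hugh describes:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PG_locked  0
#define PG_waiters 1

static atomic_uint flags;   /* stands in for page->flags */
static int wakeups;         /* counts stand-in wake_up_page() calls */

/* true: the caller took the lock (bit was previously clear) */
static bool test_and_set_locked(void)
{
    unsigned old = atomic_fetch_or(&flags, 1u << PG_locked);
    return !(old & (1u << PG_locked));
}

static void lock_page_model(void)
{
    if (test_and_set_locked())
        return;                          /* uncontended fast path */
    /* Advertise the waiter *before* retrying, and never clear
     * PG_waiters on this path: another CPU may have set it for its
     * own sleep, which is exactly the lost-wakeup race above. */
    atomic_fetch_or(&flags, 1u << PG_waiters);
    while (!test_and_set_locked())
        ;   /* the kernel sleeps in __wait_on_bit_lock() instead */
}

static void unlock_page_model(void)
{
    atomic_fetch_and(&flags, ~(1u << PG_locked));
    if (atomic_load(&flags) & (1u << PG_waiters)) {
        atomic_fetch_and(&flags, ~(1u << PG_waiters));
        wakeups++;                       /* wake_up_page(page, PG_locked) */
    }
}
```

The point of the patch shows up in unlock_page_model(): the uncontended unlock is a plain clear-and-test, and the wakeup machinery only runs when PG_waiters was actually set.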
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:
>
> G5
>            pagefault   fork         exec
> 2.6.21     1.49-1.51   164.6-170.8  741.8-760.3
> +patch     1.71-1.73   175.2-180.8  780.5-794.2
> +patch2    1.61-1.63   169.8-175.0  748.6-757.0
>
> So that brings the fork/exec hits down to much less than 5%, and
> would likely speed up other things that lock the page, like write
> or page reclaim.

Is that every fork/exec or just under certain circumstances? A 5% regression on every fork/exec is not acceptable.
Re: 2.6.22 -mm merge plans
> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.

Ok, but do it step by step:

- split out useful pieces like the "patched in enable/disable flags" and submit them separately with an example user or two [I've got a couple of candidates, e.g. with some of the sysctls in VM or networking] - post and merge that.
- don't implement anything initially that is not needed by blktrace
- post a minimal marker patch together with the blktrace conversion for review again on linux-kernel
- await review comments. This review would not cover the basic need of markers, just the specific implementation.
- then potentially incorporate review comments
- then merge
- later add features with individual review/discussion as new users in the kernel are added.

-Andi
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 11:46:40PM +0200, Tilman Schmidt wrote:
> On 02.05.2007 19:49, Andi Kleen wrote:
> > And also I think when something is merged it should have some users in tree.
>
> Isn't that a circular dependency?

The normal mode of operation is to merge the initial users and the subsystem at the same time.

-Andi
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007, Andrew Morton wrote:
> On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > On Thu, 3 May 2007, Andrew Morton wrote:
> > > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > >
> > > > +config ARCH_USES_SLAB_PAGE_STRUCT
> > > > +	bool
> > > > +	default y
> > > > +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > > > +
> > >
> > > That all seems to work as intended.
> > >
> > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> > > machine early in boot.
> >
> > I thought that if that worked as intended, you wouldn't even
> > get the chance to choose SLUB=y? That was how it was working
> > for me (but I realize I didn't try more than make oldconfig).
> >
> > That sounds like what happens when SLUB's pagestruct use meets
> > SPLIT_PTLOCK's pagestruct use. Does your .config really show
> > CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?
>
> Nope.
>
> g5:/usr/src/25> grep SLUB .config
> CONFIG_SLUB=y
> g5:/usr/src/25> grep SLAB .config
> # CONFIG_SLAB is not set
> g5:/usr/src/25> grep CPUS .config
> CONFIG_NR_CPUS=8
> # CONFIG_CPUSETS is not set
> # CONFIG_IRQ_ALL_CPUS is not set
> CONFIG_SPLIT_PTLOCK_CPUS=4
>
> It's in http://userweb.kernel.org/~akpm/config-g5.txt

Seems we're all wrong in thinking Christoph's Kconfiggery worked as intended: maybe it just works some of the time. I'm not going to hazard a guess as to how to fix it up, will resume looking at the powerpc's quicklist potential later.

Hugh
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote:
> On Thu, 3 May 2007, Andrew Morton wrote:
> > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> >
> > > +config ARCH_USES_SLAB_PAGE_STRUCT
> > > +	bool
> > > +	default y
> > > +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > > +
> >
> > That all seems to work as intended.
> >
> > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> > machine early in boot.
>
> I thought that if that worked as intended, you wouldn't even
> get the chance to choose SLUB=y? That was how it was working
> for me (but I realize I didn't try more than make oldconfig).

Right. This can be tested on x86 without a cross-compiler:

ARCH=powerpc make mrproper
ARCH=powerpc make fooconfig

> > Too early for netconsole, no serial console. Wedges up uselessly with
> > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :(
> >
> > However I was able to glimpse some stuff as it flew past. Crash started in
> > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know
> > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> > card, perhaps.
>
> That sounds like what happens when SLUB's pagestruct use meets
> SPLIT_PTLOCK's pagestruct use. Does your .config really show
> CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?

Nope.

g5:/usr/src/25> grep SLUB .config
CONFIG_SLUB=y
g5:/usr/src/25> grep SLAB .config
# CONFIG_SLAB is not set
g5:/usr/src/25> grep CPUS .config
CONFIG_NR_CPUS=8
# CONFIG_CPUSETS is not set
# CONFIG_IRQ_ALL_CPUS is not set
CONFIG_SPLIT_PTLOCK_CPUS=4

It's in http://userweb.kernel.org/~akpm/config-g5.txt
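None of the following is from the thread: it is a throwaway POSIX-shell sketch (the `check_config` helper and file names are invented) that encodes the constraint Christoph's ARCH_USES_SLAB_PAGE_STRUCT entry is trying to express, applied directly to a generated .config. Running it against the values Andrew posted flags the bad combination his Kconfig check let through:

```shell
#!/bin/sh
# Hypothetical sanity check: with split ptlocks active
# (SPLIT_PTLOCK_CPUS <= NR_CPUS), SLUB must not be selected,
# since both want to reuse the pte page struct.
check_config() {
    slub=$(sed -n 's/^CONFIG_SLUB=//p' "$1")
    ncpus=$(sed -n 's/^CONFIG_NR_CPUS=//p' "$1")
    split=$(sed -n 's/^CONFIG_SPLIT_PTLOCK_CPUS=//p' "$1")
    if [ "$slub" = y ] && [ -n "$split" ] && [ -n "$ncpus" ] \
       && [ "$split" -le "$ncpus" ]; then
        echo "BAD: SLUB=y with split ptlocks ($split <= $ncpus)"
    else
        echo "OK"
    fi
}

# The values from Andrew's posted .config trip the check:
cat > /tmp/config-g5-excerpt <<'EOF'
CONFIG_SLUB=y
CONFIG_NR_CPUS=8
CONFIG_SPLIT_PTLOCK_CPUS=4
EOF
check_config /tmp/config-g5-excerpt
```

A real hook would exit nonzero instead of just printing; this version only reports, so it composes with `make *config` runs in a loop over defconfigs.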
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007, Andrew Morton wrote:
> On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> > +config ARCH_USES_SLAB_PAGE_STRUCT
> > +	bool
> > +	default y
> > +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > +
>
> That all seems to work as intended.
>
> However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> machine early in boot.

I thought that if that worked as intended, you wouldn't even get the chance to choose SLUB=y? That was how it was working for me (but I realize I didn't try more than make oldconfig).

> Too early for netconsole, no serial console. Wedges up uselessly with
> CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :(
>
> However I was able to glimpse some stuff as it flew past. Crash started in
> flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know
> how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> card, perhaps.

That sounds like what happens when SLUB's pagestruct use meets SPLIT_PTLOCK's pagestruct use. Does your .config really show CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?

Hugh
Re: 2.6.22 -mm merge plans: slub
On Thu, May 03, 2007 at 01:15:15AM -0700, Andrew Morton wrote:
> That all seems to work as intended.
> However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> machine early in boot.
> Too early for netconsole, no serial console. Wedges up uselessly with
> CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :(
> However I was able to glimpse some stuff as it flew past. Crash started in
> flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know
> how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> card, perhaps.

I've seen this crash in flush_old_exec() before. ISTR it being due to slub vs. pagetable alignment or something on that order.

-- wli
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Wed, 2 May 2007, Hugh Dickins wrote:
>
> > I presume the answer is just to extend your quicklist work to
> > powerpc's lowest level of pagetables. The only other architecture
> > which is using kmem_cache for them is arm26, which has
> > "#error SMP is not supported", so won't be giving this problem.
>
> In the meantime we would need something like this to disable SLUB in this
> particular configuration. Note that I have not tested this and the <= for
> the comparison with SPLIT_PTLOCK_CPUS may not work (Never seen such a
> construct in a Kconfig file but it is needed here).
>
>
> PowerPC: Disable SLUB for configurations in which slab page structs are modified
>
> PowerPC uses the slab allocator to manage the lowest level of the page table.
> In high cpu configurations we also use the page struct to split the page
> table lock. Disallow the selection of SLUB for that case.
>
> [Not tested: I am not familiar with powerpc build procedures etc]
>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
> ===
> --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig	2007-05-02 10:07:34.0 -0700
> +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-05-02 10:13:37.0 -0700
> @@ -117,6 +117,19 @@ config GENERIC_BUG
> 	default y
> 	depends on BUG
>
> +#
> +# Powerpc uses the slab allocator to manage its ptes and the
> +# page structs of ptes are used for splitting the page table
> +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
> +#
> +# In that special configuration the page structs of slabs are modified.
> +# This setting disables the selection of SLUB as a slab allocator.
> +#
> +config ARCH_USES_SLAB_PAGE_STRUCT
> +	bool
> +	default y
> +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> +

That all seems to work as intended.

However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the machine early in boot.

Too early for netconsole, no serial console. Wedges up uselessly with CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :(

However I was able to glimpse some stuff as it flew past. Crash started in flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial card, perhaps.
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 04:36:27PM -0400, Mathieu Desnoyers wrote:
> The idea is the following : either we integrate the infrastructure for
> instrumentation / data serialization / buffer management / extraction of
> data to user space in multiple different steps, which makes code review
> easier for you guys,

the staging area for that is -mm
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 01:53:36PM -0700, Andrew Morton wrote:
> In which case we have:
>
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
> atomich-complete-atomic_long-operations-in-asm-generic.patch
> atomich-i386-type-safety-fix.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
> atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
> local_t-architecture-independant-extension.patch
> local_t-alpha-extension.patch
> local_t-i386-extension.patch
> local_t-ia64-extension.patch
> local_t-mips-extension.patch
> local_t-parisc-cleanup.patch
> local_t-powerpc-extension.patch
> local_t-sparc64-cleanup.patch
> local_t-x86_64-extension.patch
>
> For 2.6.22
>
> linux-kernel-markers-kconfig-menus.patch
> linux-kernel-markers-architecture-independant-code.patch
> linux-kernel-markers-powerpc-optimization.patch
> linux-kernel-markers-i386-optimization.patch
> markers-add-instrumentation-markers-menus-to-avr32.patch
> linux-kernel-markers-non-optimized-architectures.patch
> markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
> linux-kernel-markers-documentation.patch
> #
> markers-define-the-linker-macro-extra_rwdata.patch
> markers-use-extra_rwdata-in-architectures.patch
> #
> some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
> no-longer-include-asm-kdebugh.patch

This is a plan I can fully agree with.
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote:
> My statement was probably not clear enough. The actual marker code is
> useful as-is without any further kernel patching required : SystemTAP is
> an example where they use external modules to load probes that can
> connect either to markers or through kprobes. LTTng, in its current state,
> has a mostly modular core that also uses the markers.

That just means you have to load an enormous amount of external crap that replaces the missing kernel functionality. It's exactly the situation we want to avoid.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Wed, 2 May 2007, Nick Piggin wrote:

[snip]

More on-topic, since you suggest doing more within vmtruncate_range than the filesystem: no, I'm afraid that's misdesigned, and I want to move almost all of it into the filesystem ->truncate_range. Because, if what vmtruncate_range is doing before it gets to the filesystem isn't to be just a waste of time, the filesystem needs to know what's going on in advance - just as notify_change warns the filesystem about a coming truncation. But easier than inventing some new notification is to move it all into the filesystem, with unmap_mapping_range+truncate_inode_pages_range its library helpers.

Well I would prefer it to follow the same pattern as regular truncate. I don't think it is misdesigned to call the filesystem _first_, but I think if you do that then the filesystem should call the vm to prepare / finish truncate, rather than open code calls to unmap itself.

But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.

OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from.

I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into pagetable: that looks to be so in your __do_fault.

Yeah, I think my thought process went wrong on those... I'll revisit.

But it is a shame, and leaves me wondering what you gained with the page lock there. One thing gained is ease of understanding, and if your later patches build an edifice upon the knowledge of holding that page lock while faulting, I've no wish to undermine that foundation.

It also fixes a bug, doesn't it? ;)

Well, I'd come to think that perhaps the bugs would be solved by that second unmap_mapping_range alone, so the pagelock changes just a misleading diversion. I'm not sure how I feel about that: calling unmap_mapping_range a second time feels such a cheat, but if (big if) it does solve the races, and the pagelock method is as expensive as your numbers now suggest...

Well aside from being terribly ugly, it means we can still drop the dirty bit where we'd otherwise rather not, so I don't think we can do that. I think there may be some way we can do this without taking the page lock, and I was going to look at it, but I think it is quite neat to just lock the page... I don't think performance is _that_ bad. On the P4 it is a couple of % on the microbenchmarks. The G5 is worse, but even then I don't think it is that bad. I'll try to improve that and get back to you.

The problem is that lock/unlock_page is expensive on powerpc, and if we improve that, we improve more than just the fault handler... The attached patch gets performance up a bit by avoiding some barriers and some cachelines:

G5
           pagefault   fork         exec
2.6.21     1.49-1.51   164.6-170.8  741.8-760.3
+patch     1.71-1.73   175.2-180.8  780.5-794.2
+patch2    1.61-1.63   169.8-175.0  748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and would likely speed up other things that lock the page, like write or page reclaim.

I think we could get further performance improvement by implementing arch specific bitops for lock/unlock operations, so we don't need to use things like smp_mb__before_clear_bit() if they aren't needed or full barriers in the test_and_set_bit().

--
SUSE Labs, Novell Inc.
Index: linux-2.6/include/linux/page-flags.h
===
--- linux-2.6.orig/include/linux/page-flags.h	2007-04-24 10:39:56.0 +1000
+++ linux-2.6/include/linux/page-flags.h	2007-05-03 08:38:53.0 +1000
@@ -91,6 +91,8 @@
 #define PG_nosave_free	18	/* Used for system suspend/resume */
 #define PG_buddy	19	/* Page is free, on buddy lists */

+#define PG_waiters	20	/* Page has PG_locked waiters */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked	PG_owner_priv_1 /* Used by some filesystems */

Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h	2007-04-24 10:39:56.0 +1000
+++ linux-2.6/include/linux/pagemap.h	2007-05-03 08:35:08.0 +1000
@@ -141,7 +141,7 @@
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (TestSetPageLocked(page))
+	if (unlikely(TestSetPageLocked(page)))
 		__lock_page(page);
 }

@@ -152,7 +152,7 @@
 static inline void lock_pag
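Nick's closing point about arch-specific bitops for lock/unlock can be sketched with C11 atomics: a test-and-set with only acquire ordering on the lock side, and a clear with only release ordering on the unlock side, instead of the full barriers the generic bitops imply. This is a userspace illustration under that assumption (the `_model` suffix marks the names as invented, not kernel interfaces), though mainline did later grow bitops along these lines:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Acquire-ordered test-and-set: enough for taking a bit lock.
 * Returns the old value of the bit (nonzero means contended). */
static bool test_and_set_bit_lock_model(atomic_uint *word, int bit)
{
    unsigned old = atomic_fetch_or_explicit(word, 1u << bit,
                                            memory_order_acquire);
    return old & (1u << bit);
}

/* Release-ordered clear: enough for dropping a bit lock, so no
 * smp_mb__before_clear_bit()-style full barrier is needed. */
static void clear_bit_unlock_model(atomic_uint *word, int bit)
{
    atomic_fetch_and_explicit(word, ~(1u << bit), memory_order_release);
}
```

The release on the clear guarantees that every store made inside the critical section is visible to the next CPU whose acquire-ordered test-and-set observes the bit clear, which is all a lock needs.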
Re: 2.6.22 -mm merge plans: mm-more-rmap-checking
Hugh Dickins wrote: On Wed, 2 May 2007, Nick Piggin wrote:

Yes, but IIRC I put that in because there was another check in SLES9 that I actually couldn't put in, but used this one instead because it also caught the bug we saw. ... This was actually a rare corruption that is also in 2.6.21, and as few rmap callsites as we have, it was never noticed until the SLES9 bug check was triggered.

You are being very mysterious. Please describe this bug (privately if you think it's exploitable), and let's work on the patch to fix it, rather than this "debug" patch.

It is exec-fix-remove_arg_zero.patch in Andrew's tree, it's exploitable in that it leaks memory, but it could also release corrupted pagetables into quicklists on those architectures that have them... Anyway, it quite likely would have gone unfixed for several more years if we didn't have the bug triggers in. Now you could argue that my patch obviously fixes all bugs in there (but I wouldn't :)), and being most complex of the few callsites, _now_ we can avoid the bug checks. However I'd prefer to keep them at least under CONFIG_DEBUG_VM.

Hmm, I didn't notice the do_swap_page change, rather just derived its safety by looking at the current state of the code (which I guess must have been post-do_swap_page change)... Your addition of page_add_new_anon_rmap clarified the situation too. Do you have a pointer to the patch, for my interest?

The patch which changed do_swap_page?

commit c475a8ab625d567eacf5e30ec35d6d8704558062
Author: Hugh Dickins <[EMAIL PROTECTED]>
Date: Tue Jun 21 17:15:12 2005 -0700

    [PATCH] can_share_swap_page: use page_mapcount

Yeah, this one, thanks. I'm just interested.

Or my intended PG_swapcache to PAGE_MAPPING_SWAP patch, which does assume PageLocked in page_add_anon_rmap?

Yes, I can send you its current unsplit state if you like (but have higher priorities before splitting and commenting it for posting).

I would like to see that too, but when you are ready :)

Thanks, Nick

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans
On Wed, 2 May 2007 19:11:04 -0400 Mathieu Desnoyers <[EMAIL PROTECTED]> wrote:
> > I didn't know that this was the plan.
> >
> > The problem I have with this is that once we've merged one part, we're
> > committed to merging the other parts even though we haven't seen them yet.
> >
> > What happens if there's a revolt over the next set of patches? Do we
> > remove the core markers patches again? We end up in a cant-go-forward,
> > cant-go-backward situation.
> >
> > I thought the existing code was useful as-is for several projects, without
> > requiring additional patching to core kernel. If such additional patching
> > _is_ needed to make the markers code useful then I agree that we should
> > continue to buffer the markers code in -mm until the
> > use-markers-for-something patches have been eyeballed.
>
> My statement was probably not clear enough. The actual marker code is
> useful as-is without any further kernel patching required : SystemTAP is
> an example where they use external modules to load probes that can
> connect either to markers or through kprobes. LTTng, in its current state,
> has a mostly modular core that also uses the markers.

OK, that's what I thought.

> Although some, like Christoph and myself, think that it would benefit
> the kernel community to have a common infrastructure for more than just
> markers (meaning common serialization and buffering mechanism), it does
> not change the fact that the markers, being in mainline, are usable by
> projects through additional kernel modules.
>
> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.

That'd be a useful demonstration.
Re: 2.6.22 -mm merge plans
* Andrew Morton ([EMAIL PROTECTED]) wrote: > On Wed, 2 May 2007 16:36:27 -0400 > Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > > > * Christoph Hellwig ([EMAIL PROTECTED]) wrote: > > > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote: > > > > > That doesn't constitute using it. > > > > > > > > Andi, there was a huge amount of discussion about all this in September > > > > last > > > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I > > > > believe, that the kernel should have a static marker infrastructure. > > > > > > Only when it's actually useable. A prerequisite for merging it is > > > having an actual trace transport infrastructure aswell as a few actually > > > useful tracing modules in the kernel tree. > > > > > > Let this count as a vote to merge the markers once we have the > > > infrastructure > > > above ready, it'll be very useful then. > > > > Hi Christoph, > > > > The idea is the following : either we integrate the infrastructure for > > instrumentation / data serialization / buffer management / extraction of > > data to user space in multiple different steps, which makes code review > > easier for you guys, or we bring the main pieces of the LTTng project > > altogether with the Linux Kernel Markers, which would result in a bigger > > change. > > > > Based on the premise that discussing about logically distinct pieces of > > infrastructure is easier and can be done more thoroughly when done > > separately, we decided to submit the markers first, with the other > > pieces planned in a near future. > > > > I agree that it would be very useful to have the full tracing stack > > available in the Linux kernel, but we inevitably face the argument : > > "this change is too big" if we submit all LTTng modules at once or > > the argument : "we want the whole tracing stack, not just part of it" > > if we don't. 
> > > > This is why we chose to push the tracing infrastructure chunk by chunk : > > to make code review and criticism more efficient. > > > > I didn't know that this was the plan. > > The problem I have with this is that once we've merged one part, we're > committed to merging the other parts even though we haven't seen them yet. > > What happens if there's a revolt over the next set of patches? Do we > remove the core markers patches again? We end up in a cant-go-forward, > cant-go-backward situation. > > I thought the existing code was useful as-is for several projects, without > requiring additional patching to core kernel. If such additional patching > _is_ needed to make the markers code useful then I agree that we should > continue to buffer the markers code in -mm until the > use-markers-for-something patches have been eyeballed. > My statement was probably not clear enough. The actual marker code is useful as-is without any further kernel patching required : SystemTAP is an example where they use external modules to load probes that can connect either to markers or through kprobes. LTTng, in its current state, has a mostly modular core that also uses the markers. Although some, like Christoph and myself, think that it would benefit to the kernel community to have a common infrastructure for more than just markers (meaning common serialization and buffering mechanism), it does not change the fact that the markers, being in mainline, are usable by projects through additional kernel modules. If we are looking at current "potential users" that are already in mainline, we could change blktrace to make it use the markers. 
Mathieu > In which case we have: > > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch > atomich-complete-atomic_long-operations-in-asm-generic.patch > atomich-i386-type-safety-fix.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch > atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch > local_t-architecture-independant-extension.patch > local_t-alpha-extension.patch > local_t-i386-extension.patch > local_t-ia64-extension.patch > local_t-mips-extension.patch > local_t-parisc-cleanup.patch > local_t-powerpc-extension.patch > local_t-sparc64-cleanup.patch > local_t-x86_64-extension.patch > > For 2.6.22 > > linux-kernel-markers-kconfig-menus.patch > linux-kernel-markers-architecture-independant-code.patch > linux-kernel-markers-powerpc-optimization.patch > linux-kernel-markers-i386-optimization.patch > markers-add-instrumentation-markers-menus-to-avr32.patch > linux-kernel-markers-non-optimized-architectures.patch > markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch > linux-kernel-markers-docume
Re: 2.6.22 -mm merge plans
On 02.05.2007 19:49, Andi Kleen wrote: > And also I think when something is merged it should have some users in tree. Isn't that a circular dependency? -- Tilman Schmidt E-Mail: [EMAIL PROTECTED] Bonn, Germany - Undetected errors are handled as if no error occurred. (IBM)
Re: 2.6.22 -mm merge plans
On Wed, 2 May 2007 16:36:27 -0400 Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > * Christoph Hellwig ([EMAIL PROTECTED]) wrote: > > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote: > > > > That doesn't constitute using it. > > > > > > Andi, there was a huge amount of discussion about all this in September > > > last > > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I > > > believe, that the kernel should have a static marker infrastructure. > > > > Only when it's actually usable. A prerequisite for merging it is > > having an actual trace transport infrastructure as well as a few actually > > useful tracing modules in the kernel tree. > > > > Let this count as a vote to merge the markers once we have the > > infrastructure > > above ready, it'll be very useful then. > > Hi Christoph, > > The idea is the following: either we integrate the infrastructure for > instrumentation / data serialization / buffer management / extraction of > data to user space in multiple distinct steps, which makes code review > easier for you guys, or we bring in the main pieces of the LTTng project > all at once, together with the Linux Kernel Markers, which would result in a > bigger change. > > Based on the premise that discussing logically distinct pieces of > infrastructure is easier and can be done more thoroughly when done > separately, we decided to submit the markers first, with the other > pieces planned in the near future. > > I agree that it would be very useful to have the full tracing stack > available in the Linux kernel, but we inevitably face the argument > "this change is too big" if we submit all LTTng modules at once, or > the argument "we want the whole tracing stack, not just part of it" > if we don't. > > This is why we chose to push the tracing infrastructure chunk by chunk: > to make code review and criticism more efficient. > I didn't know that this was the plan. 
The problem I have with this is that once we've merged one part, we're committed to merging the other parts even though we haven't seen them yet. What happens if there's a revolt over the next set of patches? Do we remove the core markers patches again? We end up in a can't-go-forward, can't-go-backward situation. I thought the existing code was useful as-is for several projects, without requiring additional patching to the core kernel. If such additional patching _is_ needed to make the markers code useful then I agree that we should continue to buffer the markers code in -mm until the use-markers-for-something patches have been eyeballed. In which case we have: atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch atomich-complete-atomic_long-operations-in-asm-generic.patch atomich-i386-type-safety-fix.patch atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch local_t-architecture-independant-extension.patch local_t-alpha-extension.patch local_t-i386-extension.patch local_t-ia64-extension.patch local_t-mips-extension.patch local_t-parisc-cleanup.patch local_t-powerpc-extension.patch local_t-sparc64-cleanup.patch local_t-x86_64-extension.patch For 2.6.22 linux-kernel-markers-kconfig-menus.patch linux-kernel-markers-architecture-independant-code.patch linux-kernel-markers-powerpc-optimization.patch linux-kernel-markers-i386-optimization.patch markers-add-instrumentation-markers-menus-to-avr32.patch linux-kernel-markers-non-optimized-architectures.patch markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch 
linux-kernel-markers-documentation.patch # markers-define-the-linker-macro-extra_rwdata.patch markers-use-extra_rwdata-in-architectures.patch # some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch no-longer-include-asm-kdebugh.patch Hold. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
* Christoph Hellwig ([EMAIL PROTECTED]) wrote: > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote: > > > That doesn't constitute using it. > > > > Andi, there was a huge amount of discussion about all this in September last > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I > > believe, that the kernel should have a static marker infrastructure. > > Only when it's actually usable. A prerequisite for merging it is > having an actual trace transport infrastructure as well as a few actually > useful tracing modules in the kernel tree. > > Let this count as a vote to merge the markers once we have the infrastructure > above ready, it'll be very useful then. Hi Christoph, The idea is the following: either we integrate the infrastructure for instrumentation / data serialization / buffer management / extraction of data to user space in multiple distinct steps, which makes code review easier for you guys, or we bring in the main pieces of the LTTng project all at once, together with the Linux Kernel Markers, which would result in a bigger change. Based on the premise that discussing logically distinct pieces of infrastructure is easier and can be done more thoroughly when done separately, we decided to submit the markers first, with the other pieces planned in the near future. I agree that it would be very useful to have the full tracing stack available in the Linux kernel, but we inevitably face the argument "this change is too big" if we submit all LTTng modules at once, or the argument "we want the whole tracing stack, not just part of it" if we don't. This is why we chose to push the tracing infrastructure chunk by chunk: to make code review and criticism more efficient. Regards, Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. 
Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Sam Ravnborg wrote: > To facilitate this do NOT introduce CONFIG_SLAB until we decide > that SLUB is the default. In this way we can make CONFIG_SLUB the default > and people will not continue with CONFIG_SLAB because they had it in their > config already. We already have CONFIG_SLAB. If you use your existing .config then you will stay with SLAB. > The point is to make sure that SLUB becomes the default for people who do > a make oldconfig (explicit or implicit). Hmmm... We can think about that when we actually want to make SLUB the default.
Re: 2.6.22 -mm merge plans: slub
On Wed, May 02, 2007 at 12:42:54PM -0700, Christoph Lameter wrote: > On Wed, 2 May 2007, Andrew Morton wrote: > > > > At some point I dream that SLUB could become the default but I thought > > > this would take at least 6 months or so. If you want to force this now then I > > > will certainly have some busy weeks ahead. > > > > s/dream/promise/ ;) > > > > Six months sounds reasonable - I was kind of hoping for less. Make it > > default-to-on in 2.6.23-rc1, see how it goes. > > Here is how I think the future could develop > > Cycle SLAB SLUB SLOB SLxB > > 2.6.22 API fixes Stabilization API fixes > > Major event: SLUB availability as experimental > > 2.6.23 API upgrades Perf. Valid. EOL > > Major events: SLUB performance validation. Switch off > experimental (could even be the default) > Slab allocators support targeted reclaim for at > least one slab cache (dentry?) > (vacate/move all objects in a slab) To facilitate this do NOT introduce CONFIG_SLAB until we decide that SLUB is the default. In this way we can make CONFIG_SLUB the default and people will not continue with CONFIG_SLAB because they had it in their .config already. Or just rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED or something. The point is to make sure that SLUB becomes the default for people who do a make oldconfig (explicit or implicit). Sam
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Pekka Enberg wrote: > And then there are patches such as kmemleak which would need to target > all three. Plus it doesn't really make sense for users to select > between three competing implementations. Please don't take away our > high hopes of getting rid of mm/slab.c, Christoph =) You too, Brutus ...
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Andrew Morton wrote: > > At some point I dream that SLUB could become the default but I thought > > this would take at least 6 months or so. If you want to force this now then I > > will certainly have some busy weeks ahead. > > s/dream/promise/ ;) > > Six months sounds reasonable - I was kind of hoping for less. Make it > default-to-on in 2.6.23-rc1, see how it goes.

Here is how I think the future could develop:

Cycle    SLAB           SLUB            SLOB        SLxB

2.6.22   API fixes      Stabilization   API fixes
         Major event: SLUB availability as experimental

2.6.23   API upgrades   Perf. Valid.    EOL
         Major events: SLUB performance validation. Switch off
         experimental (could even be the default).
         Slab allocators support targeted reclaim for at least
         one slab cache (dentry?) (vacate/move all objects in a slab).

2.6.24   Earliest EOL   Stable          -           Experiments
         Major events: SLUB stable. Stable targeted reclaim for all
         major reclaimable slabs. Maybe experiments with another new
         allocator?

2.6.25   EOL            default         -           ?
         Death of SLAB. SLUB default. Hopefully new ideas on the horizon.
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Pekka Enberg wrote: > On 5/2/07, Christoph Lameter <[EMAIL PROTECTED]> wrote: > > I am the one who has to maintain SLAB and SLUB it seems and I have been > > dealing with the trio SLAB, SLOB and SLUB for a while now. It's okay and it > > will be much easier once the cleanups are in. > > And then there are patches such as kmemleak which would need to target > all three. Plus it doesn't really make sense for users to select > between three competing implementations. Please don't take away our > high hopes of getting rid of mm/slab.c, Christoph =) SLUB supports kmemleak (actually it's quite improved). Switch debugging on and try cat /sys/slab/kmalloc-128/alloc_calls.
Re: 2.6.22 -mm merge plans: slub
On 5/2/07, Christoph Lameter <[EMAIL PROTECTED]> wrote: Owww... You throw my roadmap out of the window and may create too high expectations of SLUB. Me too! On 5/2/07, Christoph Lameter <[EMAIL PROTECTED]> wrote: I am the one who has to maintain SLAB and SLUB it seems and I have been dealing with the trio SLAB, SLOB and SLUB for a while now. It's okay and it will be much easier once the cleanups are in. And then there are patches such as kmemleak which would need to target all three. Plus it doesn't really make sense for users to select between three competing implementations. Please don't take away our high hopes of getting rid of mm/slab.c, Christoph =) Pekka
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007 10:03:50 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Wed, 2 May 2007, Hugh Dickins wrote: > > > But if Linus' tree is to be better than a warehouse to avoid > > awkward merges, I still think we want it to default to on for > > all the architectures, and for most if not all -rcs. > > At some point I dream that SLUB could become the default but I thought > this would take at least 6 months or so. If you want to force this now then I > will certainly have some busy weeks ahead. s/dream/promise/ ;) Six months sounds reasonable - I was kind of hoping for less. Make it default-to-on in 2.6.23-rc1, see how it goes.
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Andrew Morton wrote: > No, we don't want competing slab allocators, please. We should get slub > working well on all architectures, then remove slab completely. Having to > maintain both slab.c and slub.c would be awful. Owww... You throw my roadmap out of the window and may create too high expectations of SLUB. I am the one who has to maintain SLAB and SLUB it seems and I have been dealing with the trio SLAB, SLOB and SLUB for a while now. It's okay and it will be much easier once the cleanups are in.
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Siddha, Suresh B wrote: > I have been looking into "slub" recently to avoid some of the NUMA alien > cache issues that we were encountering on the regular slab. Yes, that is also our main concern. > I am having some stability issues with slub on an ia64 NUMA platform and > didn't have time to dig further. I am hoping to look into it soon > and share the data/findings with Christoph. There is at least one patch on top of 2.6.21-rc7-mm2 already in -mm that may be necessary for you.
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007 11:39:20 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Wed, 2 May 2007, Hugh Dickins wrote: > > > I'm astonished and impressed, both with Kconfig and your use of it: > > Thanks! > > > I'd much rather be testing a quicklist patch: > > I'd better give that a try. > > Great. But I certainly do not mind people using SLAB. I do not think that > one approach should be there for all. Choice is the way to have multiple > allocators compete. One reason that SLAB is so crusty is because it was > the only solution for so long. > No, we don't want competing slab allocators, please. We should get slub working well on all architectures, then remove slab completely. Having to maintain both slab.c and slub.c would be awful.
Re: 2.6.22 -mm merge plans: slub
On Wed, May 02, 2007 at 05:54:53AM -0700, Hugh Dickins wrote: > On Tue, 1 May 2007, Andrew Morton wrote: > > So on balance, given that we _do_ expect slub to have a future, I'm > > inclined to crash ahead with it. The worst that can happen will be a later > > rm mm/slub.c which would be pretty simple to do. > > Okay. And there's been no chorus to echo my concern. I have been looking into "slub" recently to avoid some of the NUMA alien cache issues that we were encountering on the regular slab. I am having some stability issues with slub on an ia64 NUMA platform and didn't have time to dig further. I am hoping to look into it soon and share the data/findings with Christoph. We also did a quick perf collection on x86_64 (at least we didn't hear any stability issues from our team on regular x86_64 SMP), which we will be sharing shortly. > But if Linus' tree is to be better than a warehouse to avoid > awkward merges, I still think we want it to default to on for > all the architectures, and for most if not all -rcs. I would not suggest defaulting it to on at this point. thanks, suresh
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Andrew Morton wrote: > > This is a sensitive piece of the kernel as you say and we had better allow the > > running of two allocators for some time to make sure that it behaves in all > > load situations. The design is fundamentally different so its performance > > characteristics may diverge significantly and perhaps there will be corner > > cases for each where they do the best job. > > eek. We'd need to fix those corner cases then. Our endgame > here really must be rm mm/slab.c. First we need to discover them, and I doubt that -mm covers much more than development loads. I hope we can get to a point where SLUB is the primary allocator soon, but I would expect various performance issues to show up. On the other hand: I am pretty sure that SLUB can replace SLOB completely, given SLOB's limitations and SLUB's more efficient use of space. SLOB needs 8 bytes of overhead; SLUB needs none. We may just have to #ifdef out the debugging support to make the code of similar size to SLOB too. SLOB is a general problem because its features are not compatible with SLAB's. For example, it does not support DESTROY_BY_RCU and does not do reclaim the right way, etc. SLUB may turn out to be the ideal embedded slab allocator.
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007 11:28:26 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Wed, 2 May 2007, Hugh Dickins wrote: > > > On Wed, 2 May 2007, Christoph Lameter wrote: > > > > > > But these are arch specific problems. We could use > > > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms. > > > > As a quick hack, sure. But every ARCH_USES_SLAB_PAGE_STRUCT > > diminishes the testing SLUB will get. If the idea is that we're > > going to support both SLAB and SLUB, some arches with one, some > > with another, some with either, for more than a single release, > > then I'm back to saying SLUB is being pushed in too early. > > I can understand people wanting pluggable schedulers, > > but pluggable slab allocators? > > This is a sensitive piece of the kernel as you say and we had better allow the > running of two allocators for some time to make sure that it behaves in all > load situations. The design is fundamentally different so its performance > characteristics may diverge significantly and perhaps there will be corner > cases for each where they do the best job. eek. We'd need to fix those corner cases then. Our endgame here really must be rm mm/slab.c.
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Hugh Dickins wrote: > I'm astonished and impressed, both with Kconfig and your use of it: Thanks! > I'd much rather be testing a quicklist patch: > I'd better give that a try. Great. But I certainly do not mind people using SLAB. I do not think that one approach should be there for all. Choice is the way to have multiple allocators compete. One reason that SLAB is so crusty is that it was the only solution for so long.
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Christoph Lameter wrote: > On Wed, 2 May 2007, Hugh Dickins wrote: > > > I presume the answer is just to extend your quicklist work to > > powerpc's lowest level of pagetables. The only other architecture > > which is using kmem_cache for them is arm26, which has > > "#error SMP is not supported", so won't be giving this problem. > > In the meantime we would need something like this to disable SLUB in this > particular configuration. Note that I have not tested this and the <= for > the comparison with SPLIT_PTLOCK_CPUS may not work (I have never seen such a > construct in a Kconfig file but it is needed here). I'm astonished and impressed, both with Kconfig and your use of it: that does seem to work. Though I don't dare go so far as to give the patch an ack, and don't like this way out at all. It needs a proper (quicklist) solution, and by the time that solution comes along, all the powerpc people will have CONFIG_SLAB=y in their .config, and "make oldconfig" will just perpetuate that status quo, instead of switching over to CONFIG_SLUB=y. I think. Unless we keep changing the config option names, or go through a phase with no option. I'd much rather be testing a quicklist patch: I'd better give that a try. Hugh > > > > PowerPC: Disable SLUB for configurations in which slab page structs are > modified > > PowerPC uses the slab allocator to manage the lowest level of the page table. > In high cpu configurations we also use the page struct to split the page > table lock. Disallow the selection of SLUB for that case. 
> > [Not tested: I am not familiar with powerpc build procedures etc] > > Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> > > Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig > === > --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig2007-05-02 > 10:07:34.0 -0700 > +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig 2007-05-02 10:13:37.0 > -0700 > @@ -117,6 +117,19 @@ config GENERIC_BUG > default y > depends on BUG > > +# > +# Powerpc uses the slab allocator to manage its ptes and the > +# page structs of ptes are used for splitting the page table > +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS. > +# > +# In that special configuration the page structs of slabs are modified. > +# This setting disables the selection of SLUB as a slab allocator. > +# > +config ARCH_USES_SLAB_PAGE_STRUCT > + bool > + default y > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > + > config DEFAULT_UIMAGE > bool > help
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Hugh Dickins wrote: > On Wed, 2 May 2007, Christoph Lameter wrote: > > > > But these are arch specific problems. We could use > > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms. > > As a quick hack, sure. But every ARCH_USES_SLAB_PAGE_STRUCT > diminishes the testing SLUB will get. If the idea is that we're > going to support both SLAB and SLUB, some arches with one, some > with another, some with either, for more than a single release, > then I'm back to saying SLUB is being pushed in too early. > I can understand people wanting pluggable schedulers, > but pluggable slab allocators? This is a sensitive piece of the kernel as you say and we had better allow the running of two allocators for some time to make sure that it behaves in all load situations. The design is fundamentally different so its performance characteristics may diverge significantly and perhaps there will be corner cases for each where they do the best job. I have already reworked the slab API to allow for an easy implementation of alternate slab allocators (released with 2.6.20), which only covered SLAB and SLOB. This is continuing the cleanup work and adding a third one.
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Christoph Lameter wrote: > > But these are arch specific problems. We could use > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms. As a quick hack, sure. But every ARCH_USES_SLAB_PAGE_STRUCT diminishes the testing SLUB will get. If the idea is that we're going to support both SLAB and SLUB, some arches with one, some with another, some with either, for more than a single release, then I'm back to saying SLUB is being pushed in too early. I can understand people wanting pluggable schedulers, but pluggable slab allocators? Hugh
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote: > On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <[EMAIL PROTECTED]> wrote: > > > > It is currently used as an instrumentation infrastructure for the LTTng > > > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in > > > WindRiver products. The SystemTAP project also plan to use this type of > > > infrastructure to trace sites hard to instrument. The Linux Kernel > > > Markers has the support of Frank C. Eigler, author of their current > > > marker alternative (which he wishes to drop in order to adopt the > > > markers infrastructure as soon as it hits mainline). > > > > All of the above don't use mainline kernels. > > That's because they have to add a markers patch! I meant they use very old kernels. Their experiences don't apply to mainline bitrottyness. > > That doesn't constitute using it. > > Andi, there was a huge amount of discussion about all this in September last > year (subjects: *markers* and *LTTng*). The outcome of all that was, I > believe, that the kernel should have a static marker infrastructure. I have no problem with that in principle; just some doubts about the current proposed implementation: in particular its complexity. And also I think when something is merged it should have some users in tree. -Andi
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote: > > That doesn't constitute using it. > > Andi, there was a huge amount of discussion about all this in September last > year (subjects: *markers* and *LTTng*). The outcome of all that was, I > believe, that the kernel should have a static marker infrastructure. Only when it's actually usable. A prerequisite for merging it is having an actual trace transport infrastructure as well as a few actually useful tracing modules in the kernel tree. Let this count as a vote to merge the markers once we have the infrastructure above ready, it'll be very useful then.
Re: 2.6.22 -mm merge plans
* Andi Kleen ([EMAIL PROTECTED]) wrote: > > It is currently used as an instrumentation infrastructure for the LTTng > > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in > > WindRiver products. The SystemTAP project also plans to use this type of > > infrastructure to trace sites hard to instrument. The Linux Kernel > > Markers has the support of Frank C. Eigler, author of their current > > marker alternative (which he wishes to drop in order to adopt the > > markers infrastructure as soon as it hits mainline). > > All of the above don't use mainline kernels. > That doesn't constitute using it. > I am afraid this argument does not hold: - These companies are not shipping their products with mainline kernels, to make sure things have time to stabilize. - They eventually get to the next version some time after it is not "head" anymore. They still want to benefit from the features of the newer versions. - All these companies would be really happy to have a marker infrastructure in mainline so they can stop applying a separate set of patches to provide this functionality. - Arguing that "they apply their set of patches anyway" goes against the advice I have received from Greg KH, which can be reworded as: please submit your patches to mainline instead of keeping your separate set of patches. See his various presentations about "mainlining" for more info about this. Because of these 4 arguments, I think that these companies can be considered users of, and contributors to, mainline kernels. > > Quoting Jim Keniston <[EMAIL PROTECTED]> : > > > > "kprobes remains a vital foundation for SystemTap. But markers are > > attractive as an alternate source of trace/debug info. Here's why: > > [...]" > > Talk is cheap. Do they have working code to use it? > LTTng has been using the markers for about 6 months now. SystemTAP is waiting on the "it hits mainline" signal before they switch from their STP_MARK() markers to this infrastructure. 
Give them a few days and they will proceed to the change. > > - Allow per-architecture optimized versions which remove the need for > > a d-cache based branch (patch a "load immediate" instruction > > instead). It minimizes the d-cache impact of the disabled markers. > > That's a good idea in general, but should be generalized (available > independently), not hidden in your subsystem. I know a couple of places > that could use this successfully. > I agree that an efficient hooking mechanism is useful to many; security hooks and instrumentation for tracing are at least two examples. What other usage scenario do you have in mind that could not fit in my marker infrastructure? I have tried to generalize this as much as possible, but if you see, within this, a piece of infrastructure that could be taken apart and used more widely, I will be happy to submit it separately to increase its usefulness. > > - Accept the cost of an unlikely branch at the marker site because the > > gcc compiler does not give the ability to put "nops" instead of a > > branch generated from C code. Keep this in mind for future > > per-architecture optimizations. > > See upcoming paravirt code for a way to do this. > I have looked at the paravirt code in Andrew's 2.6.21-rc7-mm2. A few reasons why I do not plan to use it: 1 - It requires specific arg setup for the calls to be crafted by hand, in assembly, for each and every number of parameters and each type, for each architecture. I use a variable argument list as a parameter to my marker to make sure that a single macro can be used for markup in a generic manner. Quoting http://lkml.org/lkml/2007/4/4/577 "+ * Unfortunately there's no way to get gcc to generate the args setup + * for the call, and then allow the call itself to be generated by an + * inline asm. Because of this, we must do the complete arg setup and + * return value handling from within these macros. This is fairly + * cumbersome."
2 - I also provide an architecture-independent "generic" version which does not depend on per-architecture assembly. From what I see, paravirt is only offered for i386 and x86_64. Are there any plans to support the other ~12 architectures? Does it offer an architecture-agnostic fallback in the cases where it is not implemented for a given architecture? 3 - It can only alter instructions "safely" in the UP case before the other CPUs are turned on. See my arch/i386/marker.c code patcher for XMC-safe instruction patching. Marker activation must be done at runtime, when the system is fully operational. Quoting 2.6.21 arch/i386/kernel/alternative.c "/* Replace instructions with better alternatives for this CPU type. This runs before SMP is initialized to avoid SMP problems with self modifying code. This implies that asymmetric systems where APs have less capabilities than the boot processor are not handled. Tough. Make sure you disable such features by hand. */ voi
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Hugh Dickins wrote: > I presume the answer is just to extend your quicklist work to > powerpc's lowest level of pagetables. The only other architecture > which is using kmem_cache for them is arm26, which has > "#error SMP is not supported", so won't be giving this problem. In the meantime we would need something like this to disable SLUB in this particular configuration. Note that I have not tested this and the <= for the comparison with SPLIT_PTLOCK_CPUS may not work (I have never seen such a construct in a Kconfig file, but it is needed here).

PowerPC: Disable SLUB for configurations in which slab page structs are modified

PowerPC uses the slab allocator to manage the lowest level of the page table. In high cpu configurations we also use the page struct to split the page table lock. Disallow the selection of SLUB for that case.

[Not tested: I am not familiar with powerpc build procedures etc]

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
===
--- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig	2007-05-02 10:07:34.0 -0700
+++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-05-02 10:13:37.0 -0700
@@ -117,6 +117,19 @@ config GENERIC_BUG
 	default y
 	depends on BUG
 
+#
+# Powerpc uses the slab allocator to manage its ptes and the
+# page structs of ptes are used for splitting the page table
+# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
+#
+# In that special configuration the page structs of slabs are modified.
+# This setting disables the selection of SLUB as a slab allocator.
+#
+config ARCH_USES_SLAB_PAGE_STRUCT
+	bool
+	default y
+	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
+
 config DEFAULT_UIMAGE
 	bool
 	help

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Hugh Dickins wrote: > But if Linus' tree is to be better than a warehouse to avoid > awkward merges, I still think we want it to default to on for > all the architectures, and for most if not all -rcs. At some point I dream that SLUB could become the default but I thought this would take at least 6 months or so. If we want to force this now then I will certainly have some busy weeks ahead. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007, Hugh Dickins wrote: > > Why would we need to go back to SLAB if we have not switched to SLUB? SLUB > > is marked experimental and not the default. > > I said above that I thought SLUB ought to be defaulted to on throughout > the -rcs: if we don't do that, we're not going to learn much from having > it in Linus' tree. I'd rather be careful with that. mm is enough for now. Why go to the extremes immediately. If it is an option then people can gradually start testing with it. > > The only problems that I am aware of is (or was) the issue with arches > > modifying page struct fields of slab pages that SLUB needs for its own > > operations. And I thought it was all fixed since the powerpc guys were > > quiet and the patch was in for i386. > > You're forgetting your unions in struct page: in the SPLIT_PTLOCK > case (NR_CPUS >= 4) the pagetable code is using spinlock_t ptl, > which overlays SLUB's first_page and slab pointers. Uhhh. Right. So SLUB won't work if the lowest page table block is managed via slabs. > I just tried rebuilding powerpc with the SPLIT_PTLOCK cutover > edited to 8 cpus instead, and then no crash. > > I presume the answer is just to extend your quicklist work to > powerpc's lowest level of pagetables. The only other architecture I am not sure how PowerPC's lowest-level pagetable pages work. If they are of PAGE_SIZE then this is no problem. > which is using kmem_cache for them is arm26, which has > "#error SMP is not supported", so won't be giving this problem. Ahh. Good. But these are arch specific problems. We could use ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <[EMAIL PROTECTED]> wrote: > > It is currently used as an instrumentation infrastructure for the LTTng > > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in > > WindRiver products. The SystemTAP project also plans to use this type of > > infrastructure to trace sites hard to instrument. The Linux Kernel > > Markers have the support of Frank C. Eigler, author of their current > > marker alternative (which he wishes to drop in order to adopt the > > markers infrastructure as soon as it hits mainline). > > All of the above don't use mainline kernels. That's because they have to add a markers patch! > That doesn't constitute using it. Andi, there was a huge amount of discussion about all this in September last year (subjects: *markers* and *LTTng*). The outcome of all that was, I believe, that the kernel should have a static marker infrastructure. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
Andi Kleen <[EMAIL PROTECTED]> writes: > [...] The SystemTAP project also plans to use this type of > > infrastructure to trace sites hard to instrument. The Linux Kernel > > Markers have the support of Frank C. Eigler, author of their current > > marker alternative [...] > > All of the above don't use mainline kernels. > That doesn't constitute using it. Systemtap does run on mainline kernels. > > "kprobes remains a vital foundation for SystemTap. But markers are > > attractive as an alternate source of trace/debug info. Here's why: > > [...]" > > Talk is cheap. Do they have working code to use it? [...] We had been waiting on the chicken & egg semaphore. LTTng had working code yesterday (months ago); systemtap will have it "tomorrow" (a week or few). - FChE - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
Chris Wright wrote: > * Greg KH ([EMAIL PROTECTED]) wrote: >> And is this really a problem? The whole goal of the -stable tree was to >> accommodate the users who relied on kernel.org kernels, and wanted >> bugfixes and security updates. It was not for new features or new >> hardware support. >> >> If people feel we should revisit this goal, then that's fine, and I have >> no objection to that. But until then, I think the rules that we have >> had in place for over the past 2 years should still remain in effect. > > I have to agree. I went back through my mbox and found vanishingly few > pci_id update patches. So it's not clear there's even a big issue. > Of course you didn't find many -- most people know that's not part of the -stable charter. If you asked for them you'd get them... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Wed, 2 May 2007, Nick Piggin wrote: > Hugh Dickins wrote: > > > > But I was quite disappointed when > > mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch > > appeared, putting double unmap_mapping_range calls in. Certainly > > you were wrong to take the one out, but a pity to end up with two. > > > > Your comment says/said: > > The nopage vs invalidate race fix patch did not take care of truncating > > private COW pages. Mind you, I'm pretty sure this was previously racy > > even for regular truncate, not to mention vmtruncate_range. > > > > vmtruncate_range (holepunch) was deficient I agree, and though we > > can now take out your second unmap_mapping_range there, that's only > > because I've slipped one into shmem_truncate_range. In due course it > > needs to be properly handled by noting the range in shmem inode info. > > > > (I think you couldn't take that approach, noting invalid range in > > ->mapping while invalidating, because NFS has/had some cases of > > invalidate_whatever without i_mutex?) > > Sorry, I didn't parse this? I meant that i_size is used to protect against truncation races, but we have no equivalent inval_start,inval_end in the struct inode or struct address_space, such as could be used for similar protection against races while invalidating. And that IIRC there are places where NFS was doing the invalidation without i_mutex: so there could be concurrent invalidations, so one inval_start,inval_end in the structure wouldn't be enough anyway. > But I wonder whether it is better to do > it in vmtruncate_range than the filesystem? Private COWed pages are > not really a filesystem "thing"... It wasn't the thought of private COWed pages which made me put a second unmap_mapping_range in shmem_truncate_range, it was its own internal file<->swap consistency which needed that (as a quick fix). 
The real fix would be to have a trunc_start,trunc_end or whatever in the shmem_inode_info (assuming it's not wanted in the common inode: might be if holepunch spreads, e.g. it's been mentioned with fallocate). Re private COWed pages and holepunch: Miklos and I agree that really it would be better for holepunch _not_ to remove them - but that's rather off-topic. More on-topic, since you suggest doing more within vmtruncate_range than the filesystem: no, I'm afraid that's misdesigned, and I want to move almost all of it into the filesystem ->truncate_range. Because, if what vmtruncate_range is doing before it gets to the filesystem isn't to be just a waste of time, the filesystem needs to know what's going on in advance - just as notify_change warns the filesystem about a coming truncation. But easier than inventing some new notification is to move it all into the filesystem, with unmap_mapping_range+truncate_inode_pages_range its library helpers. > > > But I'm pretty sure (to use your words!) regular truncate was not racy > > before: I believe Andrea's sequence count was handling that case fine, > > without a second unmap_mapping_range. > > OK, I think you're right. I _think_ it should also be OK with the > lock_page version as well: we should not be able to have any pages > after the first unmap_mapping_range call, because of the i_size > write. So if we have no pages, there is nothing to 'cow' from. I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into pagetable: that looks to be so in your __do_fault. > > But it is a shame, and leaves me wondering what you gained with the > > page lock there. 
> > > > One thing gained is ease of understanding, and if your later patches > > build an edifice upon the knowledge of holding that page lock while > > faulting, I've no wish to undermine that foundation. > > It also fixes a bug, doesn't it? ;) Well, I'd come to think that perhaps the bugs would be solved by that second unmap_mapping_range alone, so the pagelock changes are just a misleading diversion. I'm not sure how I feel about that: calling unmap_mapping_range a second time feels such a cheat, but if (big if) it does solve the races, and the pagelock method is as expensive as your numbers now suggest... Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: mm-more-rmap-checking
On Wed, 2 May 2007, Nick Piggin wrote: > > Yes, but IIRC I put that in because there was another check in > SLES9 that I actually couldn't put in, but used this one instead > because it also caught the bug we saw. >... > This was actually a rare corruption that is also in 2.6.21, and > as few rmap callsites as we have, it was never noticed until the > SLES9 bug check was triggered. You are being very mysterious. Please describe this bug (privately if you think it's exploitable), and let's work on the patch to fix it, rather than this "debug" patch. > Hmm, I didn't notice the do_swap_page change, rather just derived > its safety by looking at the current state of the code (which I > guess must have been post-do_swap_page change)... Your addition of page_add_new_anon_rmap clarified the situation too. > Do you have a pointer to the patch, for my interest? The patch which changed do_swap_page? commit c475a8ab625d567eacf5e30ec35d6d8704558062 Author: Hugh Dickins <[EMAIL PROTECTED]> Date: Tue Jun 21 17:15:12 2005 -0700 [PATCH] can_share_swap_page: use page_mapcount Or my intended PG_swapcache to PAGE_MAPPING_SWAP patch, which does assume PageLocked in page_add_anon_rmap? Yes, I can send you its current unsplit state if you like (but have higher priorities before splitting and commenting it for posting). Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Tue, 1 May 2007, Andrew Morton wrote: > > Given the current state and the current rate of development I'd expect slub > to have reached the level of completion which you're describing around -rc2 > or -rc3. I think we'd be pretty safe making that assumption. Its developer does show signs of being active! > > This is a bit unusual but there is of course some self-interest here: the > patch dependencies are getting awful and having this hanging around > out-of-tree will make 2.6.23 development harder for everyone. That is a very strong argument: a somewhat worrisome argument, but a very strong one. Maintaining your sanity is important. > > So on balance, given that we _do_ expect slub to have a future, I'm > inclined to crash ahead with it. The worst that can happen will be a later > rm mm/slub.c which would be pretty simple to do. Okay. And there's been no chorus to echo my concern. But if Linus' tree is to be better than a warehouse to avoid awkward merges, I still think we want it to default to on for all the architectures, and for most if not all -rcs. > > otoh I could do some frantic patch mangling and make it easier to carry > slub out-of-tree, but do we gain much from that? No, keep away from that. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Tue, 1 May 2007, Christoph Lameter wrote: > On Tue, 1 May 2007, Hugh Dickins wrote: > > > Yes, to me it does. If it could be defaulted to on throughout the > > -rcs, on every architecture, then I'd say that's "finishing work"; > > and we'd be safe knowing we could go back to slab in a hurry if > > needed. But it hasn't reached that stage yet, I think. > > Why would we need to go back to SLAB if we have not switched to SLUB? SLUB > is marked experimental and not the default. I said above that I thought SLUB ought to be defaulted to on throughout the -rcs: if we don't do that, we're not going to learn much from having it in Linus' tree. And perhaps that line which appends "PREEMPT " to an oops report ought to append "SLUB " too, for so long as there's a choice. > The only problems that I am aware of is (or was) the issue with arches > modifying page struct fields of slab pages that SLUB needs for its own > operations. And I thought it was all fixed since the powerpc guys were > quiet and the patch was in for i386. You're forgetting your unions in struct page: in the SPLIT_PTLOCK case (NR_CPUS >= 4) the pagetable code is using spinlock_t ptl, which overlays SLUB's first_page and slab pointers. I just tried rebuilding powerpc with the SPLIT_PTLOCK cutover edited to 8 cpus instead, and then no crash. I presume the answer is just to extend your quicklist work to powerpc's lowest level of pagetables. The only other architecture which is using kmem_cache for them is arm26, which has "#error SMP is not supported", so won't be giving this problem. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
> It is currently used as an instrumentation infrastructure for the LTTng > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in > WindRiver products. The SystemTAP project also plans to use this type of > infrastructure to trace sites hard to instrument. The Linux Kernel > Markers have the support of Frank C. Eigler, author of their current > marker alternative (which he wishes to drop in order to adopt the > markers infrastructure as soon as it hits mainline). All of the above don't use mainline kernels. That doesn't constitute using it. > Quoting Jim Keniston <[EMAIL PROTECTED]>: > > "kprobes remains a vital foundation for SystemTap. But markers are > attractive as an alternate source of trace/debug info. Here's why: > [...]" Talk is cheap. Do they have working code to use it? > - Allow per-architecture optimized versions which remove the need for > a d-cache based branch (patch a "load immediate" instruction > instead). It minimizes the d-cache impact of the disabled markers. That's a good idea in general, but should be generalized (available independently), not hidden in your subsystem. I know a couple of places that could use this successfully. > - Accept the cost of an unlikely branch at the marker site because the > gcc compiler does not give the ability to put "nops" instead of a > branch generated from C code. Keep this in mind for future > per-architecture optimizations. See upcoming paravirt code for a way to do this. > - Instrumentation of challenging kernel sites > - Instrumentation such as the one provided in the already existing > Lock dependency checker (lockdep) and instrumentation of trap > handlers implies being reentrant for such context. Therefore, the > implementation must be lock-free and update the state in an atomic > fashion (rcu-style). It must also let the programmer who describes > a marker site the ability to specify what is forbidden in the probe > that will be connected to the marker: can it generate a trap?
Can > it call lockdep (irq disable, take any type of lock), can it call > printk? This is why flags can be passed to the _MARK() marker, > while the MARK() marker has the default flags. Why can't you just generally forbid probes from doing all of this? It would greatly simplify your code, wouldn't it? Keep it simple please. > Please tell me if I forgot to explain the rationale behind some > implementation detail and I will be happy to explain in more depth. Having lots of flags to do things differently optionally normally turns on all the warning lights of early over-design. While Linux has this sometimes it is generally only in mature old subsystems. But when something is freshly merged it shouldn't be like this. That is because code tends to grow more complicated over its lifetime, and when it is already complicated at the beginning it will eventually fall over (you can study current slab as a poster child of this). -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 10:31:10AM +1000, Rusty Russell wrote: > On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote: > > Andrew Morton <[EMAIL PROTECTED]> writes: > > > Will merge the rustyvisor. > > > > IMHO the user code still doesn't belong into Documentation. > > Also it needs another review round I guess. And some beta testing by > > more people. > > Like any piece of code more review and more testing would be great. > (Your earlier review was particularly useful!). But it's not clear that > waiting for longer will achieve either. Not clear to me. Release a clear lguest patchkit with documentation on l-k several times and you'll probably get both reviewers and testers. Then confidence level will rise. > > Look at kvm's experience for the reverse case: it went in, then got > rewritten. They at least already had some user base at this point. > As for the code in Documentation, my initial attempts tried to get > around the need for a userspace part by putting everything in the kernel > module. It meant you could launch a guest by writing a string > to /dev/lguest (no real ABI burden there), but it's a worse solution > than some user code in the kernel tree 8( Just put it into a separate tarball. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote: Hugh Dickins wrote: On Tue, 1 May 2007, Nick Piggin wrote: There were concerns that we could do this more cheaply, but I think it is important to start with a base that is simple and more likely to be correct and build on that. My testing didn't show any obvious problems with performance. I don't see _problems_ with performance, but I do consistently see the same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency and page fault tests on SMP, several machines, just as I did last year. OK. I did run some tests at one stage which didn't show a regression on my P4, however I don't know that they were statistically significant. I'll try a couple more runs and post numbers. I didn't have enough time tonight to get means/stddev, etc, but the runs are pretty stable. Patch tested was just the lock page one. SMP kernel, tasks bound to 1 CPU:

P4 Xeon     pagefault   fork          exec
2.6.21      1.67-1.69   140.7-142.0   449.5-460.8
+patch      1.75-1.77   144.0-145.5   456.2-463.0

So it's taken on nearly 5% on pagefault, but looks like less than 2% on fork, so not as bad as your numbers (phew).

G5          pagefault   fork          exec
2.6.21      1.49-1.51   164.6-170.8   741.8-760.3
+patch      1.71-1.73   175.2-180.8   780.5-794.2

Bigger hit there. Page faults can be improved a tiny bit by not using a test and clear op in unlock_page (less barriers for the G5). I don't think that's really a blocker problem for a merge, but I wonder what we can do to improve it. Lockless pagecache shaves quite a bit of straight line find_get_page performance there. Going to a non-sleeping lock might be one way to go in the long term, but it would require quite a lot of restructuring. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Tue, 1 May 2007, Nick Piggin wrote: There were concerns that we could do this more cheaply, but I think it is important to start with a base that is simple and more likely to be correct and build on that. My testing didn't show any obvious problems with performance. I don't see _problems_ with performance, but I do consistently see the same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency and page fault tests on SMP, several machines, just as I did last year. OK. I did run some tests at one stage which didn't show a regression on my P4, however I don't know that they were statistically significant. I'll try a couple more runs and post numbers. I'm assuming this patch is the one responsible: at 2.6.20-rc4 time you posted a set of 10 and a set of 7 patches I tried in versus out; at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in versus out; with similar results. I did check the graphs on test.kernel.org, I couldn't see any bad behaviour there that correlated with this work; though each -mm has such a variety of new work in it, it's very hard to attribute. And nobody else has reported any regression from your patches. I'm inclined to write it off as poorer performance in some micro-benchmarks, against which we offset the improved understandability of holding the page lock over the file fault. But I was quite disappointed when mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch appeared, putting double unmap_mapping_range calls in. Certainly you were wrong to take the one out, but a pity to end up with two. Your comment says/said: The nopage vs invalidate race fix patch did not take care of truncating private COW pages. Mind you, I'm pretty sure this was previously racy even for regular truncate, not to mention vmtruncate_range. 
vmtruncate_range (holepunch) was deficient I agree, and though we can now take out your second unmap_mapping_range there, that's only because I've slipped one into shmem_truncate_range. In due course it needs to be properly handled by noting the range in shmem inode info. (I think you couldn't take that approach, noting invalid range in ->mapping while invalidating, because NFS has/had some cases of invalidate_whatever without i_mutex?) Sorry, I didn't parse this? But I wonder whether it is better to do it in vmtruncate_range than the filesystem? Private COWed pages are not really a filesystem "thing"... But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range. OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from. Well, I guess I've come to accept that, expensive as unmap_mapping_range may be, truncating files while they're mmap'ed is perverse behaviour: perhaps even deserving such punishment. But it is a shame, and leaves me wondering what you gained with the page lock there. One thing gained is ease of understanding, and if your later patches build an edifice upon the knowledge of holding that page lock while faulting, I've no wish to undermine that foundation. It also fixes a bug, doesn't it? ;) -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: mm-more-rmap-checking
Hugh Dickins wrote: On Mon, 30 Apr 2007, Andrew Morton wrote: ... mm-more-rmap-checking.patch ... Misc MM things. Will merge. Would Nick mind very much if I ask you to drop this one? You did CC me ages ago, but I've only just run across it. It's a small matter, but I'd prefer it dropped for now. I guess I would prefer it to go under CONFIG_DEBUG_VM. Speaking of which, it would be nice to be able to turn that on unconditionally in -rc1. Although I may have put a few too many things under it, so it might slow down too much... Re-introduce rmap verification patches that Hugh removed when he removed PG_map_lock. PG_map_lock actually isn't needed to synchronise access to anonymous pages, because PG_locked and PTL together already do. These checks were important in discovering and fixing a rare rmap corruption in SLES9. It introduces some silly checks which were never in mainline, nor so far as I can tell in SLES9: I'm thinking of those + BUG_ON(address < vma->vm_start || address >= vma->vm_end); Yes, but IIRC I put that in because there was another check in SLES9 that I actually couldn't put in, but used this one instead because it also caught the bug we saw. There are few callsites for these rmap functions, I don't think they need to be checking their arguments in that way. It also changes the inline page_dup_rmap (a single atomic increment) into a bugchecking out-of-line function: do we really want to slow down fork in that way, for 2.6.22 to fix a rare corruption in SLES9? This was actually a rare corruption that is also in 2.6.21, and as few rmap callsites as we have, it was never noticed until the SLES9 bug check was triggered. 
What I really like about the patch is Nick's observation that my /* else checking page index and mapping is racy */ is no longer true: a change we made to the do_swap_page sequence some while ago has indeed cured that raciness, and I'm happy to reintroduce the check on mapping and index in page_add_anon_rmap, and his BUG_ON(!PageLocked(page)) there (despite BUG_ONs falling out of fashion very recently). Hmm, I didn't notice the do_swap_page change, rather just derived its safety by looking at the current state of the code (which I guess must have been post-do_swap_page change)... Do you have a pointer to the patch, for my interest? -- SUSE Labs, Novell Inc.
Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
* Greg KH ([EMAIL PROTECTED]) wrote: > And is this really a problem? The whole goal of the -stable tree was to > accommodate the users who relied on kernel.org kernels, and wanted > bugfixes and security updates. It was not for new features or new > hardware support. > > If people feel we should revisit this goal, then that's fine, and I have > no objection to that. But until then, I think the rules that we have > had in place for over the past 2 years should still remain in effect. I have to agree. I went back through my mbox and found vanishingly few pci_id update patches. So it's not clear there's even a big issue. thanks, -chris
Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
On Tue, May 01, 2007 at 05:40:33PM +0100, Alan Cox wrote: > > > But distros can easily add the device id to their kernel if needed, it > > > isn't something that the -stable tree should be accepting. Otherwise, we > > > will be swamped with those types of patches... > > > > Oh sure, leave the distros swamped with them instead. :) > > > > And they all have to do it separately, meaning they don't stay in sync > > and they duplicate each other's work... > > Well they *don't* have to work that separately. They could set up some > shared tree which would look suspiciously like what Greg is doing but > with the ID updates ;) And is this really a problem? The whole goal of the -stable tree was to accommodate the users who relied on kernel.org kernels, and wanted bugfixes and security updates. It was not for new features or new hardware support. If people feel we should revisit this goal, then that's fine, and I have no objection to that. But until then, I think the rules that we have had in place for over the past 2 years should still remain in effect. thanks, greg k-h
Re: 2.6.22 -mm merge plans
On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote: > Andrew Morton <[EMAIL PROTECTED]> writes: > > Will merge the rustyvisor. > > IMHO the user code still doesn't belong into Documentation. > Also it needs another review round I guess. And some beta testing by > more people. Like any piece of code more review and more testing would be great. (Your earlier review was particularly useful!). But it's not clear that waiting for longer will achieve either. Look at kvm's experience for the reverse case: it went in, then got rewritten. As for the code in Documentation, my initial attempts tried to get around the need for a userspace part by putting everything in the kernel module. It meant you could launch a guest by writing a string to /dev/lguest (no real ABI burden there), but it's a worse solution than some user code in the kernel tree 8( Cheers, Rusty.
Re: 2.6.22 -mm merge plans
Hi Andi, * Andi Kleen ([EMAIL PROTECTED]) wrote: > Andrew Morton <[EMAIL PROTECTED]> writes: > > > > Static markers. Will merge. > There don't seem to be any users of this. How do you know it hasn't > already bitrotted? > See the detailed explanation at: http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc7/2.6.21-rc7-mm2/broken-out/linux-kernel-markers-kconfig-menus.patch Major points: It is currently used as an instrumentation infrastructure for the LTTng tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in WindRiver products. The SystemTAP project also plans to use this type of infrastructure to trace sites that are hard to instrument. The Linux Kernel Markers have the support of Frank C. Eigler, author of their current marker alternative (which he wishes to drop in order to adopt the markers infrastructure as soon as it hits mainline). Quoting Jim Keniston <[EMAIL PROTECTED]>: "kprobes remains a vital foundation for SystemTap. But markers are attractive as an alternate source of trace/debug info. Here's why: [...]" > It seems quite overcomplicated to me. Has the complexity been justified? > To summarize the document pointed to by the URL above, here are the key goals of the markers, along with the rationale behind the most important design choices: - Almost imperceptible impact on production machines when the markers are compiled in but "disabled". - Use a separate section to keep the data, to minimize d-cache thrashing. - Put the code (stack setup and function call) in unlikely branches of the if() condition to minimize i-cache impact. - Since it is required to allow instrumentation of variables within the body of a function, accept the impact on the compiler's optimizations and let it keep the variables "live" sometimes longer than required. It is up to the person who puts the marker in the code to choose a location that keeps this impact small.
- Allow per-architecture optimized versions which remove the need for a d-cache based branch (patch a "load immediate" instruction instead). This minimizes the d-cache impact of the disabled markers. - Accept the cost of an unlikely branch at the marker site, because the gcc compiler does not give us the ability to emit "nops" instead of a branch generated from C code. Keep this in mind for future per-architecture optimizations. - Instrumentation of challenging kernel sites: instrumentation such as that provided in the already existing lock dependency checker (lockdep), and instrumentation of trap handlers, implies being reentrant in such contexts. Therefore, the implementation must be lock-free and update the state in an atomic fashion (rcu-style). It must also give the programmer who describes a marker site the ability to specify what is forbidden in the probe that will be connected to the marker: can it generate a trap? Can it call lockdep (irq disable, take any type of lock)? Can it call printk? This is why flags can be passed to the _MARK() marker, while the MARK() marker has the default flags. Please tell me if I forgot to explain the rationale behind some implementation detail and I will be happy to explain in more depth. Regards, Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: 2.6.22 -mm merge plans: slub
On Tue, 1 May 2007 13:46:26 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Tue, 1 May 2007, Andrew Morton wrote: > > > otoh I could do some frantic patch mangling and make it easier to carry > > slub out-of-tree, but do we gain much from that? > > Then we may lose all the slab API cleanups? Yuck. I really do not want > to redo those No, I meant that I'd look at splitting those patches up into one-against-mainline and one-against-slub.
Re: 2.6.22 -mm merge plans: slub
On Tue, 1 May 2007, Hugh Dickins wrote: > Yes, to me it does. If it could be defaulted to on throughout the > -rcs, on every architecture, then I'd say that's "finishing work"; > and we'd be safe knowing we could go back to slab in a hurry if > needed. But it hasn't reached that stage yet, I think. Why would we need to go back to SLAB if we have not switched to SLUB? SLUB is marked experimental and not the default. The only problem that I am aware of is (or was) the issue with arches modifying page struct fields of slab pages that SLUB needs for its own operations. And I thought it was all fixed since the powerpc guys were quiet and the patch was in for i386.
Re: 2.6.22 -mm merge plans: slub
On Tue, 1 May 2007, Andrew Morton wrote: > otoh I could do some frantic patch mangling and make it easier to carry > slub out-of-tree, but do we gain much from that? Then we may lose all the slab API cleanups? Yuck. I really do not want to redo those
Re: 2.6.22 -mm merge plans: slub
On Tue, 1 May 2007 21:19:09 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote: > On Tue, 1 May 2007, Andrew Morton wrote: > > On Tue, 1 May 2007 19:10:29 +0100 (BST) > > Hugh Dickins <[EMAIL PROTECTED]> wrote: > > > > > > Most of the rest of slub. Will merge it all. > > > > > > Merging slub already? I'm surprised. > > > > My thinking here is "does slub have a future". > > I think the answer is "yes", > > I think I agree with that, > though it's a judgement I'd leave to you and others. > > > so we're reasonably safe getting it into mainline for the finishing > > work. The kernel.org kernel will still default to slab. > > > > Does that sound wrong? > > Yes, to me it does. If it could be defaulted to on throughout the > -rcs, on every architecture, then I'd say that's "finishing work"; > and we'd be safe knowing we could go back to slab in a hurry if > needed. But it hasn't reached that stage yet, I think. > Given the current state and the current rate of development I'd expect slub to have reached the level of completion which you're describing around -rc2 or -rc3. I think we'd be pretty safe making that assumption. This is a bit unusual but there is of course some self-interest here: the patch dependencies are getting awful and having this hanging around out-of-tree will make 2.6.23 development harder for everyone. So on balance, given that we _do_ expect slub to have a future, I'm inclined to crash ahead with it. The worst that can happen will be a later rm mm/slub.c which would be pretty simple to do. otoh I could do some frantic patch mangling and make it easier to carry slub out-of-tree, but do we gain much from that?
Re: 2.6.22 -mm merge plans: slub
On Tue, 1 May 2007, Andrew Morton wrote: > On Tue, 1 May 2007 19:10:29 +0100 (BST) > Hugh Dickins <[EMAIL PROTECTED]> wrote: > > > > Most of the rest of slub. Will merge it all. > > > > Merging slub already? I'm surprised. > > My thinking here is "does slub have a future". > I think the answer is "yes", I think I agree with that, though it's a judgement I'd leave to you and others. > so we're reasonably safe getting it into mainline for the finishing > work. The kernel.org kernel will still default to slab. > > Does that sound wrong? Yes, to me it does. If it could be defaulted to on throughout the -rcs, on every architecture, then I'd say that's "finishing work"; and we'd be safe knowing we could go back to slab in a hurry if needed. But it hasn't reached that stage yet, I think. Hugh
Re: 2.6.22 -mm merge plans: slub
On Tue, 1 May 2007 19:10:29 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote: > > Most of the rest of slub. Will merge it all. > > Merging slub already? I'm surprised. My thinking here is "does slub have a future". I think the answer is "yes", so we're reasonably safe getting it into mainline for the finishing work. The kernel.org kernel will still default to slab. Does that sound wrong?
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Tue, 1 May 2007, Nick Piggin wrote: > Andrew Morton wrote: > > > mm-simplify-filemap_nopage.patch > > mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch > > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch > > mm-merge-nopfn-into-fault.patch > > convert-hugetlbfs-to-use-vm_ops-fault.patch > > mm-remove-legacy-cruft.patch > > mm-debug-check-for-the-fault-vs-invalidate-race.patch > > > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch > > > Miscish MM changes. Will merge, dependent upon what still applies and works > > if the moveable-zone patches get stalled. > > These fix some bugs in the core vm, at least the former one we have > seen numerous people hitting in production... ... > > So, do you or anyone else have any problems with these patches going in > 2.6.22? I haven't had much feedback for a while, but I was under the > impression that people are more-or-less happy with them? > > mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch > > This patch fixes the core filemap_nopage vs invalidate_inode_pages2 > race by having filemap_nopage return a locked page to do_no_page, > and removes the fairly complex (and inadequate) truncate_count > synchronisation logic. > > There were concerns that we could do this more cheaply, but I think it > is important to start with a base that is simple and more likely to > be correct and build on that. My testing didn't show any obvious > problems with performance. I don't see _problems_ with performance, but I do consistently see the same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency and page fault tests on SMP, several machines, just as I did last year. I'm assuming this patch is the one responsible: at 2.6.20-rc4 time you posted a set of 10 and a set of 7 patches I tried in versus out; at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in versus out; with similar results. 
I did check the graphs on test.kernel.org, I couldn't see any bad behaviour there that correlated with this work; though each -mm has such a variety of new work in it, it's very hard to attribute. And nobody else has reported any regression from your patches. I'm inclined to write it off as poorer performance in some micro-benchmarks, against which we offset the improved understandability of holding the page lock over the file fault. But I was quite disappointed when mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch appeared, putting double unmap_mapping_range calls in. Certainly you were wrong to take the one out, but a pity to end up with two. Your comment says/said: The nopage vs invalidate race fix patch did not take care of truncating private COW pages. Mind you, I'm pretty sure this was previously racy even for regular truncate, not to mention vmtruncate_range. vmtruncate_range (holepunch) was deficient I agree, and though we can now take out your second unmap_mapping_range there, that's only because I've slipped one into shmem_truncate_range. In due course it needs to be properly handled by noting the range in shmem inode info. (I think you couldn't take that approach, noting invalid range in ->mapping while invalidating, because NFS has/had some cases of invalidate_whatever without i_mutex?) But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range. Well, I guess I've come to accept that, expensive as unmap_mapping_range may be, truncating files while they're mmap'ed is perverse behaviour: perhaps even deserving such punishment. But it is a shame, and leaves me wondering what you gained with the page lock there. One thing gained is ease of understanding, and if your later patches build an edifice upon the knowledge of holding that page lock while faulting, I've no wish to undermine that foundation.
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch > mm-merge-nopfn-into-fault.patch > etc. > > These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite) > into a single, unified interface. Although this strictly closes some > similar holes in nonlinear faults as well, they are very uncommon, so > I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see > any reason not to, but at least they don't fix major bugs). I don't have an opinion on these, but I think BenH and others were strongly in favour, with various people waiting upon them. Hugh
Re: 2.6.22 -mm merge plans: slub
On Tue, 1 May 2007, Hugh Dickins wrote: > > Most of the rest of slub. Will merge it all. > > Merging slub already? I'm surprised. That's a very key piece of > infrastructure, and I doubt it's had the exposure it needs yet. It's not the default. It's just an alternative like SLOB. It will take some time to test with various loads in order to see if it can really replace SLAB in all scenarios. > Just what has it been widely tested on so far? x86_64. Not many > of us have ia64, but I guess SGI people will have been trying it > on that. Not i386, that's excluded. There is an i386 patch pending and I have used it on i386 for a while. > Not powerpc - hmm, I thought that was known, but looking I see no > ARCH_USES_SLAB_PAGE_STRUCT there: just built and tried to run it up, > crashes in slab_free from pgtable_free_tlb from free_pte_range from > free_pgd_range from free_pgtables from unmap_region from do_munmap. > That's 2.6.21-rc7-mm2. Hmmm... True I have not spent any time with that platform. We can set ARCH_USES_SLAB_PAGE_STRUCT there to switch it off. SLUB is the default for mm so I am a bit surprised that this did not surface earlier. > I've nothing against slub in itself, though I'm wary of its > cache merging (more scope for one corrupting another) (and Yes but then SLUB has more diagnostics etc etc than SLAB to prevent any issues. In debug mode all slabs are separate. The merge feature is very stable these days and significantly reduces cache overhead problems that plague SLAB and require it to have a complex object expiration technique. As a result I was able to rip out all timers. SLUB has no cache reaper nor any timer. It's silent if not in use. > sometimes I think Christoph spent one life uglifying slab for > NUMA, then another life ripping that all out to make slub ;) SLAB has a certain paradigm of doing things (queues) and I had to work within that framework. It was a group effort.
SLUB is an answer to those complaints and a result of the lessons learned through years of some painful slab debugging. SLUB makes debugging extremely easy (and also the design is very simple and comprehensible). No rebuilding of the kernel. Just pop in a debug option on the command line which can even be targeted to a slab cache if we know that things break there.
Re: 2.6.22 -mm merge plans -- lumpy reclaim
On Tue, 01 May 2007 14:02:41 +0100 Andy Whitcroft <[EMAIL PROTECTED]> wrote: > I have some primitive stats patches which we have used for performance > testing. Perhaps those could be brought up to date to provide better > visibility into lumpy's operation. Again this would be a separate patch. Feel free to add new counters in /proc/vmstat - perhaps per-order success and fail rates? Monitoring the ratio between those would show how effective lumpiness is being, perhaps. It's always nice to see what's going on in there.
Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
On Tue, 1 May 2007 11:16:51 +0100 [EMAIL PROTECTED] (Mel Gorman) wrote: > OK, I did all the reorganisation which you recommended. > Ok. It is getting reviewed by Christoph and I'm going through the TODO items > it yielded. Andy has also been regularly reviewing them which is probably > why they have had fewer public errors than you might expect from something > like this. Great. I'm a bit behind on my linux-mm reading. > Christoph may like to comment more here. That would be helpful.
Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
On Tue, 1 May 2007, Christoph Lameter wrote: > On Tue, 1 May 2007, Mel Gorman wrote: > >>anti-fragmentation-switch-over-to-pfn_valid_within.patch >> >> These patches are the grouping pages by mobility patches. They get tested >> every time someone boots the machine from the perspective that they affect >> the page allocator. It is working to keep fragmentation problems to a >> minimum and being exercised. We have beaten it heavily here on tests >> with a variety of machines using the system that drives test.kernel.org >> for both functionality and performance testing. That covers x86, x86_64, >> ppc64 and occasionally IA64. Granted, there are corner-case machines out >> there or we'd never receive bug reports at all. >> >> They are currently being reviewed by Christoph Lameter. His feedback in >> the linux-mm thread "Antifrag patchset comments" has given me a TODO list >> which I'm currently working through. So far, there has been no fundamental >> mistake in my opinion and the additional work is logical extensions. > > I think we really urgently need a defragmentation solution in Linux in > order to support higher page allocations for various purposes. SLUB f.e. > would benefit from it and the large blocksize patches are not reasonable > without such a method. > I continue to maintain that anti-fragmentation is a pre-requisite for any defragmentation mechanism to be effective without trashing overall performance. If allocation success rates are low when everything possible has been reclaimed, as is the case without fragmentation avoidance, then defragmentation will not help unless the 1:1 phys:virt mapping is broken, which incurs its own considerable set of problems. > However, the current code is not up to the task. I did not see a clean > categorization of allocations nor a consistent handling of those. The > cleanup work that would have to be done throughout the kernel is not > there.
The choice of mobility marker to use in each case was deliberate (even if I have made mistakes but what else is review for?). The choice by default is UNMOVABLE as it's the safe choice even if it may be sub-optimal. The description of the mobility types may not be the clearest. For example, buffers were placed beside page cache in MOVABLE because they can both be reclaimed in the same fashion - I consider moving it to disk to be as "movable" as any other definition of the word but in your world movable always means page migration which has led to some confusion. They could have been separated out as MOVABLE and BUFFERS for a conceptually cleaner split but it did not seem necessary because the more types there are, the bigger the memory and performance footprint becomes. Additional flag groupings like GFP_BUFFERS could be defined that alias to MOVABLE if you felt it would make the code clearer but functionally, the behaviour remains the same. This is similar to your feedback on the treatment of GFP_TEMPORARY. There can be as many alias mobility types as you wish but if more "real" types are required, you can have as many as you want as long as NR_PAGEBLOCK_BITS is increased properly and allocflags_to_migratetype() is able to translate GFP flags to the appropriate mobility type. It increases the performance and memory footprint though.
> Mel said that he fixed up locations that showed up to > be a problem in testing. That is another issue: Too much focus on testing > instead of conceptual cleanness and clean code in the kernel. The patches started as a thought experiment of what "should work". They were then tested to find flaws in the model and the results were fed back in. How is that a disadvantage exactly? > It looks > like this is geared for a specific series of tests on specific platforms > and also to a particular allocation size (max order sized huge pages). > Some set of tests had to be chosen, and one combination was picked that was known to be particularly hostile to external fragmentation - i.e. large numbers of kernel cache allocations at the same time as page cache allocations. No one has suggested an alternative test that would be more suitable. The platforms used were x86, x86_64 and ppc64 which are not exactly insignificant platforms. At the time, I didn't have an IA64 machine and frankly the one I have now does not always boot so testing is not as thorough. Huge page sized pages were chosen because they were the hardest allocati