Re: 2.6.22 -mm merge plans

2007-05-13 Thread Christoph Hellwig
On Thu, May 10, 2007 at 03:39:36PM -0400, Mathieu Desnoyers wrote:
> * Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> > On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > > kprobes does this kind of synchronization internally, so the marker
> > > > wrapper should probabl aswell.
> > > > 
> > > 
> > > The problem appears on heavily loaded systems. Doing 50
> > > synchronize_sched() calls in a row can take up to a few seconds on a
> > > 4-way machine. This is why I prefer to do it in the module to which
> > > the callbacks belong.
> > 
> > We recently had a discussion on batch unreistration interface for
> > kprobes.  I'm not very happy with having so different interfaces for
> > different kind of probe registrations.
> > 
> 
> Ok, I've had a look at the kprobes batch registration mechanisms and..
> well, it does not look well suited for the markers. Adding
> supplementary data structures such as linked lists of probes does not
> look like a good match.
> 
> However, I agree with you that providing a similar API is good.
> 
> Therefore, here is my proposal :
> 
> The goal is to do the synchronize just after we unregister the last
> probe handler provided by a given module. Since the unregistration
> functions iterate on every marker present in the kernel, we can keep a
> count of how many probes provided by the same module are still present.
> If we see that we unregistered the last probe pointing to this module,
> we issue a synchronize_sched().
> 
> It adds no data structure and keeps the same order of complexity as what
> is already there, we only have to do 2 passes in the marker structures :
> the first one finds the module associated with the callback and the
> second disables the callbacks and keep a count of the number of
> callbacks associated with the module.
> 
> Mathieu
> 
> P.S.: here is the code.
> 
> 
> Linux Kernel Markers - Architecture Independant code Provide internal
> synchronize_sched() in batch.
> 
> The goal is to do the synchronize just after we unregister the last
> probe handler provided by a given module. Since the unregistration
> functions iterate on every marker present in the kernel, we can keep a
> count of how many probes provided by the same module are still present.
> If we see that we unregistered the last probe pointing to this module,
> we issue a synchronize_sched().
> 
> It adds no data structure and keeps the same order of complexity as what
> is already there, we only have to do 2 passes in the marker structures : 
> the first one finds the module associated with the callback and the 
> second disables the callbacks and keep a count of the number of
> callbacks associated with the module.

Looks good to me, please incorporate this is the next round of the
markers patch series.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-10 Thread Mathieu Desnoyers
* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > kprobes does this kind of synchronization internally, so the marker
> > > wrapper should probabl aswell.
> > > 
> > 
> > The problem appears on heavily loaded systems. Doing 50
> > synchronize_sched() calls in a row can take up to a few seconds on a
> > 4-way machine. This is why I prefer to do it in the module to which
> > the callbacks belong.
> 
> We recently had a discussion on batch unreistration interface for
> kprobes.  I'm not very happy with having so different interfaces for
> different kind of probe registrations.
> 

Ok, I've had a look at the kprobes batch registration mechanisms and..
well, it does not look well suited for the markers. Adding
supplementary data structures such as linked lists of probes does not
look like a good match.

However, I agree with you that providing a similar API is good.

Therefore, here is my proposal :

The goal is to do the synchronize just after we unregister the last
probe handler provided by a given module. Since the unregistration
functions iterate on every marker present in the kernel, we can keep a
count of how many probes provided by the same module are still present.
If we see that we unregistered the last probe pointing to this module,
we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what
is already there, we only have to do 2 passes in the marker structures :
the first one finds the module associated with the callback and the
second disables the callbacks and keep a count of the number of
callbacks associated with the module.

Mathieu

P.S.: here is the code.


Linux Kernel Markers - Architecture Independant code Provide internal
synchronize_sched() in batch.

The goal is to do the synchronize just after we unregister the last
probe handler provided by a given module. Since the unregistration
functions iterate on every marker present in the kernel, we can keep a
count of how many probes provided by the same module are still present.
If we see that we unregistered the last probe pointing to this module,
we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what
is already there, we only have to do 2 passes in the marker structures : 
the first one finds the module associated with the callback and the 
second disables the callbacks and keep a count of the number of
callbacks associated with the module.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
---
 kernel/module.c |   62 ++--
 1 file changed, 52 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/kernel/module.c
===
--- linux-2.6-lttng.orig/kernel/module.c2007-05-10 14:48:28.0 
-0400
+++ linux-2.6-lttng/kernel/module.c 2007-05-10 15:38:27.0 -0400
@@ -404,8 +404,12 @@
 }
 
 /* Sets a range of markers to a disabled state : unset the enable bit and
- * provide the empty callback. */
+ * provide the empty callback.
+ * Keep a count of other markers connected to the same module as the one
+ * provided as parameter. */
 static int marker_remove_probe_range(const char *name,
+   struct module *probe_module,
+   int *ref_count,
const struct __mark_marker *begin,
const struct __mark_marker *end)
 {
@@ -413,12 +417,17 @@
int found = 0;
 
for (iter = begin; iter < end; iter++) {
-   if (strcmp(name, iter->mdata->name) == 0) {
-   marker_set_enable(iter->enable, 0,
-   iter->mdata->flags);
-   iter->mdata->call = __mark_empty_function;
-   found++;
+   if (strcmp(name, iter->mdata->name) != 0) {
+   if (probe_module)
+   if (__module_text_address(
+   (unsigned long)iter->mdata->call)
+   == probe_module)
+   (*ref_count)++;
+   continue;
}
+   marker_set_enable(iter->enable, 0, iter->mdata->flags);
+   iter->mdata->call = __mark_empty_function;
+   found++;
}
return found;
 }
@@ -450,6 +459,29 @@
return found;
 }
 
+/* Get the module to which the probe handler's text belongs.
+ * Called with module_mutex taken.
+ * Returns NULL if the probe handler is not in a module. */
+static struct module *__marker_get_probe_module(const char *name)
+{
+   struct module *mod;
+   const struct __mark_marker *iter;
+
+   list_for_each_entry(mod, &modules, list) {
+   if (mod->taints)
+   continue;
+   for (iter = mod->markers;
+   iter < mod->markers+mod->num_markers; iter++) {

Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 10 May 2007, Nick Piggin wrote:


The filesystem (or page cache) allows pages beyond i_size to come
in there?  That wasn't a problem before, was it?  But now it is?


The filesystem still doesn't, but if i_size is updated after the page
is returned, we can have a problem that was previously taken care of
with the truncate_count but now isn't.



But... I thought the page lock was now taking care of that in your
scheme?  truncate_inode_pages has to wait for the page lock, then
it finds the page is mapped and... ahh, it finds the copiee page
is not mapped, so doesn't do its own little unmap_mapping_range,
and the copied page squeaks through.  Drat.

I really think the truncate_count solution worked better, for
truncation anyway.  There may be persuasive reasons you need the
page lock for invalidation: I gave up on trying to understand the
required behaviour(s) for invalidation.

So, bring back (the original use of, not my tree marker use of)
truncate_count?  Hmm, you probably don't want to do that, because
there was some pleasure in removing the strange barriers associated
with it.

A second unmap_mapping_range is just one line of code - but it sure
feels like a defeat to me, calling the whole exercise into question.
(But then, you'd be right to say my perfectionism made it impossible
for me to come up with any solution to the invalidation issues.)


Well we could bring back the truncate_count, but I think that sucks
because that's moving work into the page fault handler in order to
avoid a bit of work when truncating mapped files.



But that's a change we could have made at
any time if we'd bothered, it's not really the issue here.


I don't see how you could, because you need to increment truncate_count.



Though indeed we did so, I don't see that we needed to increment
truncate_count in that case (nobody could be coming through
do_no_page on that file, when there are no mappings of it).


Of course :P

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Hugh Dickins
On Thu, 10 May 2007, Nick Piggin wrote:
> > 
> > The filesystem (or page cache) allows pages beyond i_size to come
> > in there?  That wasn't a problem before, was it?  But now it is?
> 
> The filesystem still doesn't, but if i_size is updated after the page
> is returned, we can have a problem that was previously taken care of
> with the truncate_count but now isn't.

But... I thought the page lock was now taking care of that in your
scheme?  truncate_inode_pages has to wait for the page lock, then
it finds the page is mapped and... ahh, it finds the copiee page
is not mapped, so doesn't do its own little unmap_mapping_range,
and the copied page squeaks through.  Drat.

I really think the truncate_count solution worked better, for
truncation anyway.  There may be persuasive reasons you need the
page lock for invalidation: I gave up on trying to understand the
required behaviour(s) for invalidation.

So, bring back (the original use of, not my tree marker use of)
truncate_count?  Hmm, you probably don't want to do that, because
there was some pleasure in removing the strange barriers associated
with it.

A second unmap_mapping_range is just one line of code - but it sure
feels like a defeat to me, calling the whole exercise into question.
(But then, you'd be right to say my perfectionism made it impossible
for me to come up with any solution to the invalidation issues.)

> > Suspect you'd need a barrier of some kind between the i_size_write and
> > the mapping_mapped test?
> 
> The unmap_mapping_range that runs after the truncate_inode_pages should
> run in the correct order, I believe.

Yes, if there's going to be that backup call, the first won't really
need a barrier.

> > But that's a change we could have made at
> > any time if we'd bothered, it's not really the issue here.
> 
> I don't see how you could, because you need to increment truncate_count.

Though indeed we did so, I don't see that we needed to increment
truncate_count in that case (nobody could be coming through
do_no_page on that file, when there are no mappings of it).

> But I believe this is fixing the issue, even if it does so in a peripheral
> manner, because it avoids the added cost for unmapped files.

It's a small improvement to your common case, I agree.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 9 May 2007, Nick Piggin wrote:


Hugh Dickins wrote:


On Wed, 2 May 2007, Nick Piggin wrote:



But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.


OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.


I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.


Hmm, on second thoughts, I think I was right the first time, and do
need the unmap after the pages are truncated. With the lock_page code,
after the first unmap, we can get new ptes mapping pages, and
subsequently they can be COWed and then the original pte zapped before
the truncate loop checks it.



The filesystem (or page cache) allows pages beyond i_size to come
in there?  That wasn't a problem before, was it?  But now it is?


The filesystem still doesn't, but if i_size is updated after the page
is returned, we can have a problem that was previously taken care of
with the truncate_count but now isn't.


However, I wonder if we can't test mapping_mapped before the spinlock,
which would make most truncates cheaper?



Slightly cheaper, yes, though I doubt it'd be much in comparison with
actually doing any work in unmap_mapping_range or truncate_inode_pages.


But if we're supposing the common case for truncate is unmapped mappings,
then the main cost there will be the locking, which I'm trying to avoid.
Hopefully with this patch, most truncate workloads would get faster, even
though truncate mapped files is going to be unavoidably slower.



Suspect you'd need a barrier of some kind between the i_size_write and
the mapping_mapped test?


The unmap_mapping_range that runs after the truncate_inode_pages should
run in the correct order, I believe.


 But that's a change we could have made at
any time if we'd bothered, it's not really the issue here.


I don't see how you could, because you need to increment truncate_count.

But I believe this is fixing the issue, even if it does so in a peripheral
manner, because it avoids the added cost for unmapped files.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Hugh Dickins
On Wed, 9 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Wed, 2 May 2007, Nick Piggin wrote:
> 
> > > >But I'm pretty sure (to use your words!) regular truncate was not racy
> > > >before: I believe Andrea's sequence count was handling that case fine,
> > > >without a second unmap_mapping_range.
> > >
> > >OK, I think you're right. I _think_ it should also be OK with the
> > >lock_page version as well: we should not be able to have any pages
> > >after the first unmap_mapping_range call, because of the i_size
> > >write. So if we have no pages, there is nothing to 'cow' from.
> > 
> > I'd be delighted if you can remove those later unmap_mapping_ranges.
> > As I recall, the important thing for the copy pages is to be holding
> > the page lock (or whatever other serialization) on the copied page
> > still while the copy page is inserted into pagetable: that looks
> > to be so in your __do_fault.
> 
> Hmm, on second thoughts, I think I was right the first time, and do
> need the unmap after the pages are truncated. With the lock_page code,
> after the first unmap, we can get new ptes mapping pages, and
> subsequently they can be COWed and then the original pte zapped before
> the truncate loop checks it.

The filesystem (or page cache) allows pages beyond i_size to come
in there?  That wasn't a problem before, was it?  But now it is?

> 
> However, I wonder if we can't test mapping_mapped before the spinlock,
> which would make most truncates cheaper?

Slightly cheaper, yes, though I doubt it'd be much in comparison with
actually doing any work in unmap_mapping_range or truncate_inode_pages.
Suspect you'd need a barrier of some kind between the i_size_write and
the mapping_mapped test?  But that's a change we could have made at
any time if we'd bothered, it's not really the issue here.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 2 May 2007, Nick Piggin wrote:



But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.


OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.



I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.


Hmm, on second thoughts, I think I was right the first time, and do
need the unmap after the pages are truncated. With the lock_page code,
after the first unmap, we can get new ptes mapping pages, and
subsequently they can be COWed and then the original pte zapped before
the truncate loop checks it.

However, I wonder if we can't test mapping_mapped before the spinlock,
which would make most truncates cheaper?

--
SUSE Labs, Novell Inc.
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c 2007-04-24 15:02:51.0 +1000
+++ linux-2.6/mm/filemap.c  2007-05-09 17:30:47.0 +1000
@@ -2579,8 +2579,7 @@
if (rw == WRITE) {
write_len = iov_length(iov, nr_segs);
end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
-   if (mapping_mapped(mapping))
-   unmap_mapping_range(mapping, offset, write_len, 0);
+   unmap_mapping_range(mapping, offset, write_len, 0);
}
 
retval = filemap_write_and_wait(mapping);
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2007-05-09 17:25:28.0 +1000
+++ linux-2.6/mm/memory.c   2007-05-09 17:30:22.0 +1000
@@ -1956,6 +1956,9 @@
pgoff_t hba = holebegin >> PAGE_SHIFT;
pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
+   if (!mapping_mapped(mapping))
+   return;
+
/* Check for overflow. */
if (sizeof(holelen) > sizeof(hlen)) {
long long holeend =


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-07 Thread Benjamin Herrenschmidt
On Fri, 2007-05-04 at 19:23 +1000, Nick Piggin wrote:

> These ops could also be put to use in bit spinlocks, buffer lock, and
> probably a few other places too.

Ok, the performance hit seems to be under control (especially with the
bigger benchmark showing actual improvements).

There's a little bogon with the PG_waiters bit that you already know
about but appart from that it should be ok.

I must say I absolutely _LOVE_ the bitops with explicit _lock/_unlock
semantics. That should allow us to remove a whole bunch of dodgy
barriers and smp_mb__before_whatever_magic_crap() things we have all
over the place by providing precisely the expected semantics for bit
locks.

There are quite a few people who've been trying to do bit locks and I've
always been very worried by how easy it is to get the barriers wrong (or
too much barriers in the fast path) with these.

There are a couple of things we might want to think about regarding the
actual API to bit locks... the API you propose is simple, but it might
not fit some of the most exotic usage requirements, which typically are
related to manipulating other bits along with the lock bit.

We might just ignore them though. In the case of the page lock, it's
only hitting the slow path, and I would expect other usage scenarii to
be similar.

Cheers,
Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-07 Thread Josef Sipek
On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
...
>  git-unionfs.patch
>
> Does this have a future?

Yes!  There are many active users who use our unioning functionality.

Namespace unification consists of several major parts:

1) Duplicate elimination: This can be handled in the VFS.  However, it would
   clutter up the VFS code with a lot of wrappers around key VFS functions
   to select the appropriate dentry/inode/etc. object from the underlying
   branch.  (You also need to provide efficient and sane readdir/seekdir
   semantics which we do with our "On Disk Format" support.)

2) Copyup: Having a unified namespace by itself isn't enough.  You also need
   copy-on-write functionality when the source file is on a read-only
   branch.  This makes unioning much more useful and is one of the main
   attractions to unionfs users.

3) Whiteouts: Whiteouts are a key unioning construct.  As it was pointed out
   at OLS 2006, they are a properly of the union and _NOT_ a branch.
   Therefore, they should not be stored persistently on a branch but rather
   in some "external" storage.

4) You also need unique and *persistent* inode numbers for network f/s
   exports and other unix tools.

5) You need to provide dynamic branch management functionality: adding,
   removing, and changing the mode of branches in an existing union.

We have considerable experience in unioning file systems for years now; we
are currently working on the third generation of the code.  All of the above
features, and more, are USED by users, and are NEEDED by users.

We believe the right approach is the one we've taken, and is the least
intrusive: a standalone (stackable) file system that doesn't clutter the
VFS, with some small and gradual changes to the VFS to support stacking.  As
you may have noticed, we have been successfully submitting VFS patches to
make the VFS more stacking friendly (not just to Unionfs, but also to
eCryptfs which has been in since 2.6.19).

The older Union mounts, alas, try to put all that functionality into the
VFS.  We recognize that some people think that union mounts at the VFS level
is the "elegant" approach, but we hope people will listen to us and learn
from our experience: unioning may seem simple in principle, but it is
difficult in practice.  (See http://unionfs.fileystems.org/ for a lot more
info.)  So we don't think that is a viable long term approach to have all of
the unioning functionality in the VFS for two main reasons:

(1) If you want users to use a VFS-level unioning functionality ala
union-mounts, then you're going to have to implement *all* of the
features we have implemented; the VFS clutter and complexity that will
result will be very considerable, and we just don't think that it'd
happen.

(2) Some may suggest to have a lightweight union mounts that only offers a
subset of the functionality that's suitable for placing in the VFS.  In
that case, most unionfs users simply won't use it.  You'd need union
mounts to provide ALL of the functionality that we have TODAY, if you
want users to it.

As far as we can see the remaining stumbling block right now is cache
coherency between the layers.  Whether you provide unioning as a stackable
f/s or shoved into the VFS, coherency will have to be addressed.  In our
upcoming paper and talk at OLS'07, we plan to bring up and discuss several
ideas we've explored already on how to resolve this incoherency.  Our ideas
range from complex graph-based pointer management between objects of all
sorts, to simple timestamp-based VFS hooks.  (We've been experimenting with
several approaches and so far we're leaning toward the simple timestamp
based on, again in the interest of keeping the VFS changes simple.  We hope
to have more results to report by OLS time.)

Josef "Jeff" Sipek, on behalf of the Unionfs team.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: fragmentation avoidance Re: 2.6.22 -mm merge plans

2007-05-07 Thread Yasunori Goto

Sorry for late response. I went on a vacation in last week.
And I'm in the mountain of a ton of unread mail now

> > Mel's moveable-zone work.
> 
> These patches are what creates ZONE_MOVABLE. The last 6 patches should be
> collapsed into a single patch:
> 
>   handle-kernelcore=-generic
> 
> I believe Yasunori Goto is looking at these from the perspective of memory
> hot-remove and has caught a few bugs in the past. Goto-san may be able to
> comment on whether they have been reviewed recently.

Hmm, I don't think my review is enough.
To be precise, I'm just one user/tester of ZONE_MOVABLE.
I have tried to make memory remove patches with Mel-san's
ZONE_MOVABLE patch. And the bugs are things that I found in its work.
(I'll post these patches in a few days.)

> The main complexity is in one function in patch one which determines where
> the PFN is in each node for ZONE_MOVABLE. Getting that right so that the
> requested amount of kernel memory spread as evenly as possible is just
> not straight-forward.

>From memory-hotplug view, ZONE_MOVABLE should be aligned by section
size. But MAX_ORDER alignment is enough for others...

Bye.

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-04 Thread Nick Piggin

Nick Piggin wrote:

Nick Piggin wrote:


Christoph Hellwig wrote:




Is that every fork/exec or just under certain cicumstances?
A 5% regression on every fork/exec is not acceptable.




Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.



OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though).

These ops could also be put to use in bit spinlocks, buffer lock, and
probably a few other places too.

2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0
+patch3  1.54-1.57   165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.


OK, with my new lock/unlock_page, dd if=large (bigger than RAM) sparse
file of=/dev/null with an experimentally optimal block size (32K) goes
from 626MB/s to 683MB/s on 2 CPU G5 booted with maxcpus=1.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-04 Thread Nick Piggin

Nick Piggin wrote:

Christoph Hellwig wrote:



Is that every fork/exec or just under certain cicumstances?
A 5% regression on every fork/exec is not acceptable.



Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.


OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though).

These ops could also be put to use in bit spinlocks, buffer lock, and
probably a few other places too.

2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0
+patch3  1.54-1.57   165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.

--
SUSE Labs, Novell Inc.
Index: linux-2.6/include/asm-powerpc/bitops.h
===
--- linux-2.6.orig/include/asm-powerpc/bitops.h 2007-05-04 16:08:20.0 
+1000
+++ linux-2.6/include/asm-powerpc/bitops.h  2007-05-04 16:14:39.0 
+1000
@@ -87,6 +87,24 @@
: "cc" );
 }
 
+static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+{
+   unsigned long old;
+   unsigned long mask = BITOP_MASK(nr);
+   unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+   __asm__ __volatile__(
+   LWSYNC_ON_SMP
+"1:"   PPC_LLARX "%0,0,%3  # clear_bit_unlock\n"
+   "andc   %0,%0,%2\n"
+   PPC405_ERR77(0,%3)
+   PPC_STLCX "%0,0,%3\n"
+   "bne-   1b"
+   : "=&r" (old), "+m" (*p)
+   : "r" (mask), "r" (p)
+   : "cc" );
+}
+
 static __inline__ void change_bit(int nr, volatile unsigned long *addr)
 {
unsigned long old;
@@ -126,6 +144,27 @@
return (old & mask) != 0;
 }
 
+static __inline__ int test_and_set_bit_lock(unsigned long nr,
+  volatile unsigned long *addr)
+{
+   unsigned long old, t;
+   unsigned long mask = BITOP_MASK(nr);
+   unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+   __asm__ __volatile__(
+"1:"   PPC_LLARX "%0,0,%3  # test_and_set_bit_lock\n"
+   "or %1,%0,%2 \n"
+   PPC405_ERR77(0,%3)
+   PPC_STLCX "%1,0,%3 \n"
+   "bne-   1b"
+   ISYNC_ON_SMP
+   : "=&r" (old), "=&r" (t)
+   : "r" (mask), "r" (p)
+   : "cc", "memory");
+
+   return (old & mask) != 0;
+}
+
 static __inline__ int test_and_clear_bit(unsigned long nr,
 volatile unsigned long *addr)
 {
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h  2007-05-04 16:14:36.0 
+1000
+++ linux-2.6/include/linux/pagemap.h   2007-05-04 16:17:34.0 +1000
@@ -136,13 +136,18 @@
 extern void FASTCALL(__wait_on_page_locked(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
 
+static inline int trylock_page(struct page *page)
+{
+   return (likely(!TestSetPageLocked_Lock(page)));
+}
+
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
 static inline void lock_page(struct page *page)
 {
might_sleep();
-   if (unlikely(TestSetPageLocked(page)))
+   if (!trylock_page(page))
__lock_page(page);
 }
 
@@ -153,7 +158,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
might_sleep();
-   if (unlikely(TestSetPageLocked(page)))
+   if (!trylock_page(page))
__lock_page_nosync(page);
 }

Index: linux-2.6/drivers/scsi/sg.c
===
--- linux-2.6.orig/drivers/scsi/sg.c2007-04-12 14:35:08.0 +1000
+++ linux-2.6/drivers/scsi/sg.c 2007-05-04 16:23:27.0 +1000
@@ -1734,7 +1734,7 @@
  */
flush_dcache_page(pages[i]);
/* ?? Is locking needed? I don't think so */
-   /* if (TestSetPageLocked(pages[i]))
+   /* if (!trylock_page(pages[i]))
   goto out_unlock; */
 }
 
Index: linux-2.6/fs/cifs/file.c
===
--- linux-2.6.orig/fs/cifs/file.c   2007-04-12 14:35:09.0 +1000
+++ linux-2.6/fs/cifs/file.c2007-05-04 16:23:36.0 +1000
@@ -1229,7 +1229,7 @@
 
if (first < 0)
lock_page(page);
-   else if (TestSetPageLocked(page))
+   else if (!trylock_page(page))
break;
 
if (unlikely(page->mapping != mapping)) {
Index: linux-2.6/fs/jbd/commit.c
===

Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Andrew Morton wrote:

On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:



void fastcall unlock_page(struct page *page)
{
+   VM_BUG_ON(!PageLocked(page));
smp_mb__before_clear_bit();
-   if (!TestClearPageLocked(page))
-   BUG();
-	smp_mb__after_clear_bit(); 
-	wake_up_page(page, PG_locked);

+   ClearPageLocked(page);
+   if (unlikely(test_bit(PG_waiters, &page->flags))) {
+   clear_bit(PG_waiters, &page->flags);
+   wake_up_page(page, PG_locked);
+   }
}



Why is that significantly faster than plain old wake_up_page(), which
tests waitqueue_active()?


Because it needs fewer barriers and doesn't touch random a random hash
cacheline in the fastpath.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub on PowerPC

2007-05-03 Thread Christoph Lameter
On Fri, 4 May 2007, Benjamin Herrenschmidt wrote:

> > The SLUB allocator relies on struct page fields first_page and slab,
> > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
> > be used for the lowest level of pagetable pages.  This was obstructing
> > SLUB on PowerPC, which uses kmem_caches for its pagetables.  So convert
> > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd
> > want partpages, so continue to use kmem_caches for pmd, pud and pgd).
> > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM.
> 
> Interesting... I'll have a look asap.

I would also recommend looking at removing the constructors for the 
remaining slabs. A constructor requires that SLUB never touch the object 
(same situation as is resulting from enabling debugging). So it must 
increase the object size in order to put the free pointer after the 
object. In case of a order of 2 cache this has a particularly bad effect 
of doubling object size. If the objects can be overwritten on free (no 
constructor) then we can use the first word of the object as a freepointer 
on kfree. Meaning we can use a hot cacheline so no cache miss. On 
alloc we have already touched the first cacheline which also avoids a 
cacheline fetch there. This is the optimal way of operation for SLUB.

Hmmm We could add an option to allow the use of a constructor while
keeping the free pointer at the beginning of the object? Then we would 
have to zap the first word on alloc. Would work like quicklists.

Add SLAB_FREEPOINTER_MAY_OVERLAP?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub on PowerPC

2007-05-03 Thread Benjamin Herrenschmidt
On Thu, 2007-05-03 at 22:04 +0100, Hugh Dickins wrote:
> On Thu, 3 May 2007, Hugh Dickins wrote:
> > 
> > Seems we're all wrong in thinking Christoph's Kconfiggery worked
> > as intended: maybe it just works some of the time.  I'm not going
> > to hazard a guess as to how to fix it up, will resume looking at
> > the powerpc's quicklist potential later.
> 
> Here's the patch I've been testing on G5, with 4k and with 64k pages,
> with SLAB and with SLUB.  But, though it doesn't crash, the pgd
> kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity
> for using highorder allocations where SLAB would stick to order 0:
> under load, exec's mm_init gets page allocation failure on order 4
> - SLUB's calculate_order may need some retuning.  (I'd expect it to
> be going for order 3 actually, I'm not sure how order 4 comes about.)
> 
> I don't know how offensive Ben and Paulus may find this patch:
> the kmem_cache use was nicely done and this messes it up a little.
> 
> 
> The SLUB allocator relies on struct page fields first_page and slab,
> overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
> be used for the lowest level of pagetable pages.  This was obstructing
> SLUB on PowerPC, which uses kmem_caches for its pagetables.  So convert
> its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd
> want partpages, so continue to use kmem_caches for pmd, pud and pgd).
> But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM.

Interesting... I'll have a look asap.

Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub on PowerPC

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Christoph Lameter wrote:
> 
> There are SLUB patches pending (not in rc7-mm2 as far as I can recall) 
> that reduce the default page order sizes to head off this issue. The 
> defaults were initially too large (and they still default to large
> for testing if Mel's Antifrag work is detected to be active).

Sounds good.

> > -   return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM],
> > -   GFP_KERNEL|__GFP_REPEAT);
> > +   return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
> 
> __GFP_REPEAT is unusual here but this was carried over from the 
> kmem_cache_alloc it seems. Hmm... There is some variance on how we do this 
> between arches. Should we uniformly set or not set this flag?

Not something to get into in this patch, but it did surprise me too.
I believe __GFP_REPEAT should be avoided, and I don't see justification
for it here (but need to remember not to do a blind virt_to_page on the
result in some places if it might return NULL - which IIRC it actually
can do even if __GFP_REPEAT, when chosen for OOM kill).  But I've a
suspicion it got put in there for some good reason I don't know about.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub on PowerPC

2007-05-03 Thread Christoph Lameter
On Thu, 3 May 2007, Hugh Dickins wrote:

> On Thu, 3 May 2007, Hugh Dickins wrote:
> > 
> > Seems we're all wrong in thinking Christoph's Kconfiggery worked
> > as intended: maybe it just works some of the time.  I'm not going
> > to hazard a guess as to how to fix it up, will resume looking at
> > the powerpc's quicklist potential later.
> 
> Here's the patch I've been testing on G5, with 4k and with 64k pages,
> with SLAB and with SLUB.  But, though it doesn't crash, the pgd
> kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity
> for using highorder allocations where SLAB would stick to order 0:
> under load, exec's mm_init gets page allocation failure on order 4
> - SLUB's calculate_order may need some retuning.  (I'd expect it to
> be going for order 3 actually, I'm not sure how order 4 comes about.)

There are SLUB patches pending (not in rc7-mm2 as far as I can recall) 
that reduce the default page order sizes to head off this issue. The 
defaults were initially too large (and they still default to large
for testing if Mel's Antifrag work is detected to be active).

> - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM],
> - GFP_KERNEL|__GFP_REPEAT);
> + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);

__GFP_REPEAT is unusual here but this was carried over from the 
kmem_cache_alloc it seems. Hmm... There is some variance on how we do this 
between arches. Should we uniformly set or not set this flag?

[EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc 
include/asm-ia64/*
include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h: void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
[EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc 
arch/i386/mm/*
arch/i386/mm/pgtable.c: pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);
[EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc 
include/asm-sparc64/*
include/asm-sparc64/pgalloc.h:  return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h:  return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h:  return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h:  void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
[EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc 
include/asm-x86_64/* 
include/asm-x86_64/pgalloc.h:   return (pmd_t *)quicklist_alloc(QUICK_PT, 
GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h:   return (pud_t *)quicklist_alloc(QUICK_PT, 
GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h:   pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
include/asm-x86_64/pgalloc.h:   return (pte_t *)quicklist_alloc(QUICK_PT, 
GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h:   void *p = (void *)quicklist_alloc(QUICK_PT, 
GFP_KERNEL|__GFP_REPEAT, NULL);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub on PowerPC

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Hugh Dickins wrote:
> 
> Seems we're all wrong in thinking Christoph's Kconfiggery worked
> as intended: maybe it just works some of the time.  I'm not going
> to hazard a guess as to how to fix it up, will resume looking at
> the powerpc's quicklist potential later.

Here's the patch I've been testing on G5, with 4k and with 64k pages,
with SLAB and with SLUB.  But, though it doesn't crash, the pgd
kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity
for using highorder allocations where SLAB would stick to order 0:
under load, exec's mm_init gets page allocation failure on order 4
- SLUB's calculate_order may need some retuning.  (I'd expect it to
be going for order 3 actually, I'm not sure how order 4 comes about.)

I don't know how offensive Ben and Paulus may find this patch:
the kmem_cache use was nicely done and this messes it up a little.


The SLUB allocator relies on struct page fields first_page and slab,
overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
be used for the lowest level of pagetable pages.  This was obstructing
SLUB on PowerPC, which uses kmem_caches for its pagetables.  So convert
its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd
want partpages, so continue to use kmem_caches for pmd, pud and pgd).
But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---

 arch/powerpc/Kconfig  |4 
 arch/powerpc/mm/init_64.c |   17 ++---
 include/asm-powerpc/pgalloc.h |   26 +++---
 3 files changed, 21 insertions(+), 26 deletions(-)

--- 2.6.21-rc7-mm2/arch/powerpc/Kconfig 2007-04-26 13:33:51.0 +0100
+++ linux/arch/powerpc/Kconfig  2007-05-03 20:45:12.0 +0100
@@ -31,6 +31,10 @@ config MMU
bool
default y
 
+config QUICKLIST
+   bool
+   default y
+
 config GENERIC_HARDIRQS
bool
default y
--- 2.6.21-rc7-mm2/arch/powerpc/mm/init_64.c2007-04-26 13:33:51.0 
+0100
+++ linux/arch/powerpc/mm/init_64.c 2007-05-03 20:45:12.0 +0100
@@ -146,21 +146,16 @@ static void zero_ctor(void *addr, struct
memset(addr, 0, kmem_cache_size(cache));
 }
 
-#ifdef CONFIG_PPC_64K_PAGES
-static const unsigned int pgtable_cache_size[3] = {
-   PTE_TABLE_SIZE, PMD_TABLE_SIZE, PGD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-   "pte_pmd_cache", "pmd_cache", "pgd_cache",
-};
-#else
 static const unsigned int pgtable_cache_size[2] = {
-   PTE_TABLE_SIZE, PMD_TABLE_SIZE
+   PGD_TABLE_SIZE, PMD_TABLE_SIZE
 };
 static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-   "pgd_pte_cache", "pud_pmd_cache",
-};
+#ifdef CONFIG_PPC_64K_PAGES
+   "pgd_cache", "pmd_cache",
+#else
+   "pgd_cache", "pud_pmd_cache",
 #endif /* CONFIG_PPC_64K_PAGES */
+};
 
 #ifdef CONFIG_HUGETLB_PAGE
 /* Hugepages need one extra cache, initialized in hugetlbpage.c.  We
--- 2.6.21-rc7-mm2/include/asm-powerpc/pgalloc.h2007-02-04 
18:44:54.0 +
+++ linux/include/asm-powerpc/pgalloc.h 2007-05-03 20:45:12.0 +0100
@@ -10,21 +10,15 @@
 #include 
 #include 
 #include 
+#include 
 
 extern struct kmem_cache *pgtable_cache[];
 
-#ifdef CONFIG_PPC_64K_PAGES
-#define PTE_CACHE_NUM  0
-#define PMD_CACHE_NUM  1
-#define PGD_CACHE_NUM  2
-#define HUGEPTE_CACHE_NUM 3
-#else
-#define PTE_CACHE_NUM  0
-#define PMD_CACHE_NUM  1
-#define PUD_CACHE_NUM  1
 #define PGD_CACHE_NUM  0
+#define PUD_CACHE_NUM  1
+#define PMD_CACHE_NUM  1
 #define HUGEPTE_CACHE_NUM 2
-#endif
+#define PTE_CACHE_NUM  3   /* from quicklist rather than  kmem_cache */
 
 /*
  * This program is free software; you can redistribute it and/or
@@ -97,8 +91,7 @@ static inline void pmd_free(pmd_t *pmd)
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
  unsigned long address)
 {
-   return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM],
-   GFP_KERNEL|__GFP_REPEAT);
+   return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
@@ -109,7 +102,7 @@ static inline struct page *pte_alloc_one

 static inline void pte_free_kernel(pte_t *pte)
 {
-   kmem_cache_free(pgtable_cache[PTE_CACHE_NUM], pte);
+   quicklist_free(0, NULL, pte);
 }
 
 static inline void pte_free(struct page *ptepage)
@@ -136,7 +129,10 @@ static inline void pgtable_free(pgtable_
void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
int cachenum = pgf.val & PGF_CACHENUM_MASK;
 
-   kmem_cache_free(pgtable_cache[cachenum], p);
+   if (cachenum == PTE_CACHE_NUM)
+   quicklist_free(0, NULL, p);
+   else
+   kmem_cache_free(pgtable_cache[cachenum], p);
 }
 
 extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
@

Re: 2.6.22 -mm merge plans

2007-05-03 Thread Christoph Hellwig
On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > kprobes does this kind of synchronization internally, so the marker
> > wrapper should probabl aswell.
> > 
> 
> The problem appears on heavily loaded systems. Doing 50
> synchronize_sched() calls in a row can take up to a few seconds on a
> 4-way machine. This is why I prefer to do it in the module to which
> the callbacks belong.

We recently had a discussion on batch unreistration interface for
kprobes.  I'm not very happy with having so different interfaces for
different kind of probe registrations.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-03 Thread Mathieu Desnoyers
Here is the reworked patch, except a comment :

* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> > +void blk_probe_disconnect(void)
> > +{
> > +   uint8_t i;
> > +
> > +   for (i = 0; i < NUM_PROBES; i++) {
> > +   marker_remove_probe(probe_array[i].name);
> > +   }
> > +   synchronize_sched();/* Wait for probes to finish */
> 
> kprobes does this kind of synchronization internally, so the marker
> wrapper should probabl aswell.
> 

The problem appears on heavily loaded systems. Doing 50
synchronize_sched() calls in a row can take up to a few seconds on a
4-way machine. This is why I prefer to do it in the module to which
the callbacks belong.

Here is the reviewed patch. It depends on a newer version of markers
I'll send to Andrew soon.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>


Index: linux-2.6-lttng/block/elevator.c
===
--- linux-2.6-lttng.orig/block/elevator.c   2007-05-03 12:27:12.0 
-0400
+++ linux-2.6-lttng/block/elevator.c2007-05-03 12:54:58.0 -0400
@@ -32,7 +32,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 
 #include 
@@ -571,7 +571,7 @@
unsigned ordseq;
int unplug_it = 1;
 
-   blk_add_trace_rq(q, rq, BLK_TA_INSERT);
+   trace_mark(blk_request_insert, "%p %p", q, rq);
 
rq->q = q;
 
@@ -757,7 +757,7 @@
 * not be passed by new incoming requests
 */
rq->cmd_flags |= REQ_STARTED;
-   blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+   trace_mark(blk_request_issue, "%p %p", q, rq);
}
 
if (!q->boundary_rq || q->boundary_rq == rq) {
Index: linux-2.6-lttng/block/ll_rw_blk.c
===
--- linux-2.6-lttng.orig/block/ll_rw_blk.c  2007-05-03 12:27:12.0 
-0400
+++ linux-2.6-lttng/block/ll_rw_blk.c   2007-05-03 12:54:58.0 -0400
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -1551,7 +1552,7 @@
 
if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
-   blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
+   trace_mark(blk_plug_device, "%p %p %d", q, NULL, 0);
}
 }
 
@@ -1617,7 +1618,7 @@
 * devices don't necessarily have an ->unplug_fn defined
 */
if (q->unplug_fn) {
-   blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+   trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);
 
q->unplug_fn(q);
@@ -1628,7 +1629,7 @@
 {
request_queue_t *q = container_of(work, request_queue_t, unplug_work);
 
-   blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+   trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);
 
q->unplug_fn(q);
@@ -1638,7 +1639,7 @@
 {
request_queue_t *q = (request_queue_t *)data;
 
-   blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
+   trace_mark(blk_pdu_unplug_timer, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);
 
kblockd_schedule_work(&q->unplug_work);
@@ -2148,7 +2149,7 @@

rq_init(q, rq);
 
-   blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
+   trace_mark(blk_get_request, "%p %p %d", q, bio, rw);
 out:
return rq;
 }
@@ -2178,7 +2179,7 @@
if (!rq) {
struct io_context *ioc;
 
-   blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+   trace_mark(blk_sleep_request, "%p %p %d", q, bio, rw);
 
__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);
@@ -2252,7 +2253,7 @@
  */
 void blk_requeue_request(request_queue_t *q, struct request *rq)
 {
-   blk_add_trace_rq(q, rq, BLK_TA_REQUEUE);
+   trace_mark(blk_requeue, "%p %p", q, rq);
 
if (blk_rq_tagged(rq))
blk_queue_end_tag(q, rq);
@@ -2937,7 +2938,7 @@
if (!ll_back_merge_fn(q, req, bio))
break;
 
-   blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);
+   trace_mark(blk_bio_backmerge, "%p %p", q, bio);
 
req->biotail->bi_next = bio;
req->biotail = bio;
@@ -2954,7 +2955,7 @@
if (!ll_front_merge_fn(q, req, bio))
break;
 
-   blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);
+   trace_mark(blk_bio_frontmerge, "%p %p", q, bio);
 
bio->bi_next = req->bio;
  

Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Andrew Morton
On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

>  void fastcall unlock_page(struct page *page)
>  {
> + VM_BUG_ON(!PageLocked(page));
>   smp_mb__before_clear_bit();
> - if (!TestClearPageLocked(page))
> - BUG();
> - smp_mb__after_clear_bit(); 
> - wake_up_page(page, PG_locked);
> + ClearPageLocked(page);
> + if (unlikely(test_bit(PG_waiters, &page->flags))) {
> + clear_bit(PG_waiters, &page->flags);
> + wake_up_page(page, PG_locked);
> + }
>  }

Why is that significantly faster than plain old wake_up_page(), which
tests waitqueue_active()?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-03 Thread Christoph Lameter
H...There are a gazillion configs to choose from. It works fine with
cell_defconfig. If I switch to 2 processors I can enable SLUB if I switch 
to 4 I cannot.

I saw some other config weirdness like being unable to set SMP if SLOB is 
enabled with some configs. This should not work and does not work but 
the menus are then vanishing and one can still configure lots of 
processors while not having enabled SMP.

It works as far as I can tell... The rest is arch weirdness.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-03 Thread Christoph Lameter
On Thu, 3 May 2007, William Lee Irwin III wrote:

> I've seen this crash in flush_old_exec() before. ISTR it being due to
> slub vs. pagetable alignment or something on that order.

>From from other discussion regarding SLAB: It may be necessary for 
powerpc to set the default alignment to 8 bytes on 32 bit powerpc 
because it requires that alignemnt for 64 bit value. SLUB will *not* 
disable debugging like SLAB if you do that.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-03 Thread Christoph Hellwig
On Thu, May 03, 2007 at 11:04:15AM -0400, Mathieu Desnoyers wrote:
> - blk_add_trace_rq(q, rq, BLK_TA_INSERT);
> + MARK(blk_request_insert, "%p %p", q, rq);

I don't really like the shouting MARK name very much.  Can we have
a less-generic, less shouting name, e.g. trace_marker?  The aboe would then
be:

trace_mark(blk_request_insert, "%p %p", q, rq);

> +#define NUM_PROBES ARRAY_SIZE(probe_array)

just get rid of this and use ARRAY_SIZE diretly below.

> +int blk_probe_connect(void)
> +{
> + int result;
> + uint8_t i;

just use an int for for loops.  it's easy to read and probably faster
on most systems (it the compiler isn't smart enough and promotes it
to int anyway during code generation)

> +void blk_probe_disconnect(void)
> +{
> + uint8_t i;
> +
> + for (i = 0; i < NUM_PROBES; i++) {
> + marker_remove_probe(probe_array[i].name);
> + }
> + synchronize_sched();/* Wait for probes to finish */

kprobes does this kind of synchronization internally, so the marker
wrapper should probabl aswell.

> +static int blk_probes_ref = 0;

no need to initialize this.

>  /*
>   * Send out a notify message.
>   */
> @@ -229,6 +233,12 @@
>   blk_remove_tree(bt->dir);
>   free_percpu(bt->sequence);
>   kfree(bt);
> + mutex_lock(&blk_probe_mutex);
> + if (blk_probes_ref == 1) {
> + blk_probe_disconnect();
> + blk_probes_ref--;
> + }

if (--blk_probes_ref == 0)
blk_probe_disconnect();

would probably be a tad cleaner.

> + if (!blk_probes_ref) {
> + blk_probe_connect();
> + blk_probes_ref++;
> + }

Dito here with a:

if (!blk_probes_ref++)
blk_probe_connect();

also the connect in the name seems rather add, what about arm/disarm instead?

>  static __init int blk_trace_init(void)
>  {
>   mutex_init(&blk_tree_mutex);
> + mutex_init(&blk_probe_mutex);

both should use DEFINE_MUTEX for compile-time initialization isntead.


Also it's probably better to put the trace points into blktrace.c,
that means all blktrace code can be static and self-contained.  And we
can probably do some additional cleanups by simplifying things later on.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-03 Thread Mathieu Desnoyers
* Andrew Morton ([EMAIL PROTECTED]) wrote:
> > Although some, like Christoph and myself, think that it would benefit to
> > the kernel community to have a common infrastructure for more than just
> > markers (meaning common serialization and buffering mechanism), it does
> > not change the fact that the markers, being in mainline, are usable by
> > projects through additional kernel modules.
> > 
> > If we are looking at current "potential users" that are already in
> > mainline, we could change blktrace to make it use the markers.
> 
> That'd be a useful demonstration.

Here is a proof of concept patch, for demonstration purpose, of moving
blktrace to the markers.

A few remarks : this patch has the positive effect of removing some code
from the block io tracing hot paths, minimizing the i-cache impact in a
system where the io tracing is compiled in but inactive.

It also moves the blk tracing code from a header (and therefore from the
body of the instrumented functions) to a separate C file.

There, as soon as one device has to be traced, every devices have to
fall into the tracing function call. This is slower than the previous
inline function which tested the condition quickly. If it becomes a
show stopper, it could be fixed by having the possibility to test a
supplementary condition, dependant of the marker context, at the marker
site, just after the enable/disable test.

It does not make the code smaller, since I left all the specialized
tracing functions for requests, bio, generic, remap, which would go away
once a generic infrastructure is in place to serialize the information
passed to the marker. This is mostly why I consider it a proof a
concept.

Patch named "markers-port-blktrace-to-markers.patch", can be placed
after the marker patches in the 2.6.21-rc7-mm2 series.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>

Index: linux-2.6-lttng/block/elevator.c
===
--- linux-2.6-lttng.orig/block/elevator.c   2007-05-02 20:33:22.0 
-0400
+++ linux-2.6-lttng/block/elevator.c2007-05-02 20:33:49.0 -0400
@@ -32,7 +32,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 
 #include 
@@ -571,7 +571,7 @@
unsigned ordseq;
int unplug_it = 1;
 
-   blk_add_trace_rq(q, rq, BLK_TA_INSERT);
+   MARK(blk_request_insert, "%p %p", q, rq);
 
rq->q = q;
 
@@ -757,7 +757,7 @@
 * not be passed by new incoming requests
 */
rq->cmd_flags |= REQ_STARTED;
-   blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+   MARK(blk_request_issue, "%p %p", q, rq);
}
 
if (!q->boundary_rq || q->boundary_rq == rq) {
Index: linux-2.6-lttng/block/ll_rw_blk.c
===
--- linux-2.6-lttng.orig/block/ll_rw_blk.c  2007-05-02 20:33:32.0 
-0400
+++ linux-2.6-lttng/block/ll_rw_blk.c   2007-05-02 23:21:02.0 -0400
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -1551,7 +1552,7 @@
 
if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
-   blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
+   MARK(blk_plug_device, "%p %p %d", q, NULL, 0);
}
 }
 
@@ -1617,7 +1618,7 @@
 * devices don't necessarily have an ->unplug_fn defined
 */
if (q->unplug_fn) {
-   blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+   MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);
 
q->unplug_fn(q);
@@ -1628,7 +1629,7 @@
 {
request_queue_t *q = container_of(work, request_queue_t, unplug_work);
 
-   blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+   MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);
 
q->unplug_fn(q);
@@ -1638,7 +1639,7 @@
 {
request_queue_t *q = (request_queue_t *)data;
 
-   blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
+   MARK(blk_pdu_unplug_timer, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);
 
kblockd_schedule_work(&q->unplug_work);
@@ -2148,7 +2149,7 @@

rq_init(q, rq);
 
-   blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
+   MARK(blk_get_request, "%p %p %d", q, bio, rw);
 out:
return rq;
 }
@@ -2178,7 +2179,7 @@
if (!rq) {
struct io_context *ioc;
 
-   blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+   MARK(blk_sleep_request, "%p %p %d", q, bio, rw);
 
__generic_unplug_device(q);
spi

Re: 2.6.22 -mm merge plans

2007-05-03 Thread Mathieu Desnoyers
Hi Andi,

This plan makes sense. I will split the "patched in enabled/disable
flags" part into a separate piece (good idea!) and then submit the LTTng
core to Andrew. Christoph's has a good point about wanting a usable
infrastructure to go ini. Regarding your plan, I must argue that
blktrace is not a general purpose tracing infrastructure, but one
dedicated to block io tracing.  Therefore, it makes sense to bring in
the generic infrastructure first and then convert blktrace to it.

Mathieu

* Andi Kleen ([EMAIL PROTECTED]) wrote:
> > If we are looking at current "potential users" that are already in
> > mainline, we could change blktrace to make it use the markers.
> 
> Ok, but do it step by step:
> - split out useful pieces like the "patched in enable/disable flags" 
> and submit them separate with an example user or two 
> [I got a couple of candidates e.g. with some of the sysctls in VM or 
> networking] 
> - post and merge that.
> - don't implement anything initially that is not needed by blktrace
> - post a minimal marker patch together with the blktrace
> conversion for review again on linux-kernel
> - await review comments. This review would not cover the basic
> need of markers, just the specific implementation.
> - then potentially merge incorporate review comments
> - then merge
> - later add features with individual review/discussion as new users in the 
> kernel are added.
> 
> -Andi

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-03 Thread Mathieu Desnoyers
* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote:
> > My statement was probably not clear enough. The actual marker code is
> > useful as-is without any further kernel patching required : SystemTAP is
> > an example where they use external modules to load probes that can
> > connect either to markers or through kprobes. LTTng, in its current state,
> > has a mostly modular core that also uses the markers.
> 
> That just mean you have to load an enormous emount of exernal crap
> that replaces the missing kernel functionality.  It's exactly the
> situation we want to avoid.
> 

It makes sense to use -mm to hold the hole usable infrastructure before
submitting it to mainline. I will submit my core LTTng patches to Andrew
in the following weeks.  There is no hurry, in the LTTng perspective, to
merge the markers sooner, although they could be useful to other
(external) projects meanwhile.

Mathieu


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 3 May 2007, Nick Piggin wrote:


@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+   set_bit(PG_waiters, &page->flags);
+   if (unlikely(!TestSetPageLocked(page))) {


What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?


You're right, we can't clear the bit here. Doubt it mattered much anyway?



Ah yes, that's a good easy answer.  In fact, just remove this whole
test and block (we already tried TestSetPageLocked outside just a
short while ago, so this repeat won't often save anything).


Yeah, I was getting too clever for my own boots :)

I think the patch has merit though. Unfortunate that it uses another page
flag, however it seemed to have quite a bit speedup on unlock_page (probably
from both the barriers and an extra random cacheline load (from the hash)).

I guess it has to get good results from more benchmarks...



BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
above... that barrier is in the slow path as well though, so it shouldn't
matter either.



I vaguely wondered how such barriers had managed to dissolve away,
but cranking my brain up to think about barriers takes far too long.


That barrier was one too many :)

However I believe the fastpath barrier can go away because the PG_locked
operation is depending on the same cacheline as PG_waiters.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Nick Piggin wrote:
> >>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
> > > {
> > >  DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> > >
> > >+  set_bit(PG_waiters, &page->flags);
> > >+  if (unlikely(!TestSetPageLocked(page))) {
> > 
> > What happens if another cpu is coming through __lock_page at the
> > same time, did its set_bit, now finds PageLocked, and so proceeds
> > to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
> > so this task's unlock_page won't wake the other?
> 
> You're right, we can't clear the bit here. Doubt it mattered much anyway?

Ah yes, that's a good easy answer.  In fact, just remove this whole
test and block (we already tried TestSetPageLocked outside just a
short while ago, so this repeat won't often save anything).

> 
> BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
> above... that barrier is in the slow path as well though, so it shouldn't
> matter either.

I vaguely wondered how such barriers had managed to dissolve away,
but cranking my brain up to think about barriers takes far too long.

> > >+  clear_bit(PG_waiters, &page->flags);
> > >+  return;
> > >+  }
> > >  __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
> > >TASK_UNINTERRUPTIBLE);
> >> }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Christoph Hellwig wrote:

On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:


The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:

G5
pagefault   fork  exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and
would likely speed up other things that lock the page, like write
or page reclaim.



Is that every fork/exec or just under certain cicumstances?
A 5% regression on every fork/exec is not acceptable.


Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.

The overhead that is there should just be coming from the extra
overhead in the file backed fault handler. For noop fork/execs,
I think that tends to be more pronounced, it is hard to see any
difference on any non-micro benchmark.

The other thing is that I think there could be some cache effects
happening -- for example the exec numbers on the 2nd line are
disproportionately large.

It definitely isn't a good thing to drop performance anywhere
though, so I'll keep looking for improvements.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 3 May 2007, Nick Piggin wrote:


The problem is that lock/unlock_page is expensive on powerpc, and
if we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:



There's a strong whiff of raciness about this...
but I could very easily be wrong.



Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c 2007-05-02 15:00:26.0 +1000
+++ linux-2.6/mm/filemap.c  2007-05-03 08:34:32.0 +1000
@@ -532,11 +532,13 @@
 */
void fastcall unlock_page(struct page *page)
{
+   VM_BUG_ON(!PageLocked(page));
smp_mb__before_clear_bit();
-   if (!TestClearPageLocked(page))
-   BUG();
-	smp_mb__after_clear_bit(); 
-	wake_up_page(page, PG_locked);

+   ClearPageLocked(page);
+   if (unlikely(test_bit(PG_waiters, &page->flags))) {
+   clear_bit(PG_waiters, &page->flags);
+   wake_up_page(page, PG_locked);
+   }
}
EXPORT_SYMBOL(unlock_page);

@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+   set_bit(PG_waiters, &page->flags);
+   if (unlikely(!TestSetPageLocked(page))) {



What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?


You're right, we can't clear the bit here. Doubt it mattered much anyway?

BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
above... that barrier is in the slow path as well though, so it shouldn't
matter either.





+   clear_bit(PG_waiters, &page->flags);
+   return;
+   }
__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
TASK_UNINTERRUPTIBLE);
}






--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Nick Piggin wrote:
> 
> The problem is that lock/unlock_page is expensive on powerpc, and
> if we improve that, we improve more than just the fault handler...
> 
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:

There's a strong whiff of raciness about this...
but I could very easily be wrong.

> Index: linux-2.6/mm/filemap.c
> ===
> --- linux-2.6.orig/mm/filemap.c   2007-05-02 15:00:26.0 +1000
> +++ linux-2.6/mm/filemap.c2007-05-03 08:34:32.0 +1000
> @@ -532,11 +532,13 @@
>   */
>  void fastcall unlock_page(struct page *page)
>  {
> + VM_BUG_ON(!PageLocked(page));
>   smp_mb__before_clear_bit();
> - if (!TestClearPageLocked(page))
> - BUG();
> - smp_mb__after_clear_bit(); 
> - wake_up_page(page, PG_locked);
> + ClearPageLocked(page);
> + if (unlikely(test_bit(PG_waiters, &page->flags))) {
> + clear_bit(PG_waiters, &page->flags);
> + wake_up_page(page, PG_locked);
> + }
>  }
>  EXPORT_SYMBOL(unlock_page);
>  
> @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>  {
>   DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>  
> + set_bit(PG_waiters, &page->flags);
> + if (unlikely(!TestSetPageLocked(page))) {

What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?

> + clear_bit(PG_waiters, &page->flags);
> + return;
> + }
>   __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
>   TASK_UNINTERRUPTIBLE);
>  }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Christoph Hellwig
On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:
> 
> G5
>  pagefault   fork  exec
> 2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
> +patch   1.71-1.73   175.2-180.8   780.5-794.2
> +patch2  1.61-1.63   169.8-175.0   748.6-757.0
> 
> So that brings the fork/exec hits down to much less than 5%, and
> would likely speed up other things that lock the page, like write
> or page reclaim.

Is that every fork/exec or just under certain cicumstances?
A 5% regression on every fork/exec is not acceptable.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-03 Thread Andi Kleen
> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.

Ok, but do it step by step:
- split out useful pieces like the "patched in enable/disable flags" 
and submit them separate with an example user or two 
[I got a couple of candidates e.g. with some of the sysctls in VM or 
networking] 
- post and merge that.
- don't implement anything initially that is not needed by blktrace
- post a minimal marker patch together with the blktrace
conversion for review again on linux-kernel
- await review comments. This review would not cover the basic
need of markers, just the specific implementation.
- then potentially merge incorporate review comments
- then merge
- later add features with individual review/discussion as new users in the 
kernel are added.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-03 Thread Andi Kleen
On Wed, May 02, 2007 at 11:46:40PM +0200, Tilman Schmidt wrote:
> Am 02.05.2007 19:49 schrieb Andi Kleen:
> > And also I think when something is merged it should have some users in tree.
> 
> Isn't that a circular dependency?

The normal mode of operation is to merge the initial users and the 
subsystem at the same time.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Andrew Morton wrote:
> On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> 
> wrote:
> > On Thu, 3 May 2007, Andrew Morton wrote:
> > > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL 
> > > PROTECTED]> wrote:
> > > 
> > > > +config ARCH_USES_SLAB_PAGE_STRUCT
> > > > +   bool
> > > > +   default y
> > > > +   depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > > > +
> > > 
> > > That all seems to work as intended.
> > > 
> > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> > > machine early in boot.  
> > 
> > I thought that if that worked as intended, you wouldn't even
> > get the chance to choose SLUB=y?  That was how it was working
> > for me (but I realize I didn't try more than make oldconfig).
> > 
> > That sounds like what happens when SLUB's pagestruct use meets
> > SPLIT_PTLOCK's pagestruct use.  Does your .config really show
> > CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?
> 
> Nope.
> 
> g5:/usr/src/25> grep SLUB .config
> CONFIG_SLUB=y
> g5:/usr/src/25> grep SLAB .config
> # CONFIG_SLAB is not set
> g5:/usr/src/25> grep CPUS .config
> CONFIG_NR_CPUS=8
> # CONFIG_CPUSETS is not set
> # CONFIG_IRQ_ALL_CPUS is not set
> CONFIG_SPLIT_PTLOCK_CPUS=4
> 
> It's in http://userweb.kernel.org/~akpm/config-g5.txt

Seems we're all wrong in thinking Christoph's Kconfiggery worked
as intended: maybe it just works some of the time.  I'm not going
to hazard a guess as to how to fix it up, will resume looking at
the powerpc's quicklist potential later.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-03 Thread Andrew Morton
On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote:

> On Thu, 3 May 2007, Andrew Morton wrote:
> > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL 
> > PROTECTED]> wrote:
> > 
> > > +config ARCH_USES_SLAB_PAGE_STRUCT
> > > + bool
> > > + default y
> > > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > > +
> > 
> > That all seems to work as intended.
> > 
> > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> > machine early in boot.  
> 
> I thought that if that worked as intended, you wouldn't even
> get the chance to choose SLUB=y?  That was how it was working
> for me (but I realize I didn't try more than make oldconfig).

Right.  This can be tested on x86 without a cross-compiler:

ARCH=powerpc make mrproper
ARCH=powerpc make fooconfig

> > 
> > Too early for netconsole, no serial console.  Wedges up uselessly with
> > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> > CONFIG_XMON=y.  This is all fairly typical for a powerpc/G5 crash :(
> > 
> > However I was able to glimpse some stuff as it flew past.  Crash started in
> > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free.  I don't 
> > know
> > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> > card, perhaps.
> 
> That sounds like what happens when SLUB's pagestruct use meets
> SPLIT_PTLOCK's pagestruct use.  Does your .config really show
> CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?
> 

Nope.

g5:/usr/src/25> grep SLUB .config
CONFIG_SLUB=y
g5:/usr/src/25> grep SLAB .config
# CONFIG_SLAB is not set
g5:/usr/src/25> grep CPUS .config
CONFIG_NR_CPUS=8
# CONFIG_CPUSETS is not set
# CONFIG_IRQ_ALL_CPUS is not set
CONFIG_SPLIT_PTLOCK_CPUS=4

It's in http://userweb.kernel.org/~akpm/config-g5.txt

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Andrew Morton wrote:
> On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> 
> wrote:
> 
> > +config ARCH_USES_SLAB_PAGE_STRUCT
> > +   bool
> > +   default y
> > +   depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > +
> 
> That all seems to work as intended.
> 
> However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> machine early in boot.  

I thought that if that worked as intended, you wouldn't even
get the chance to choose SLUB=y?  That was how it was working
for me (but I realize I didn't try more than make oldconfig).

> 
> Too early for netconsole, no serial console.  Wedges up uselessly with
> CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> CONFIG_XMON=y.  This is all fairly typical for a powerpc/G5 crash :(
> 
> However I was able to glimpse some stuff as it flew past.  Crash started in
> flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free.  I don't know
> how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> card, perhaps.

That sounds like what happens when SLUB's pagestruct use meets
SPLIT_PTLOCK's pagestruct use.  Does your .config really show
CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-03 Thread William Lee Irwin III
On Thu, May 03, 2007 at 01:15:15AM -0700, Andrew Morton wrote:
> That all seems to work as intended.
> However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> machine early in boot.  
> Too early for netconsole, no serial console.  Wedges up uselessly with
> CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> CONFIG_XMON=y.  This is all fairly typical for a powerpc/G5 crash :(
> However I was able to glimpse some stuff as it flew past.  Crash started in
> flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free.  I don't know
> how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> card, perhaps.

I've seen this crash in flush_old_exec() before. ISTR it being due to
slub vs. pagetable alignment or something on that order.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-03 Thread Andrew Morton
On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > I presume the answer is just to extend your quicklist work to
> > powerpc's lowest level of pagetables.  The only other architecture
> > which is using kmem_cache for them is arm26, which has
> > "#error SMP is not supported", so won't be giving this problem.
> 
> In the meantime we would need something like this to disable SLUB in this 
> particular configuration. Note that I have not tested this and the <= for
> the comparision with SPLIT_PTLOCK_CPUS may not work (Never seen such a
> construct in a Kconfig file but it is needed here).
> 
> 
> 
> PowerPC: Disable SLUB for configurations in which slab page structs are 
> modified
> 
> PowerPC uses the slab allocator to manage the lowest level of the page table.
> In high cpu configurations we also use the page struct to split the page
> table lock. Disallow the selection of SLUB for that case.
> 
> [Not tested: I am not familiar with powerpc build procedures etc]
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
> ===
> --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig2007-05-02 
> 10:07:34.0 -0700
> +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig 2007-05-02 10:13:37.0 
> -0700
> @@ -117,6 +117,19 @@ config GENERIC_BUG
>   default y
>   depends on BUG
>  
> +#
> +# Powerpc uses the slab allocator to manage its ptes and the
> +# page structs of ptes are used for splitting the page table
> +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
> +#
> +# In that special configuration the page structs of slabs are modified.
> +# This setting disables the selection of SLUB as a slab allocator.
> +#
> +config ARCH_USES_SLAB_PAGE_STRUCT
> + bool
> + default y
> + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> +

That all seems to work as intended.

However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
machine early in boot.  

Too early for netconsole, no serial console.  Wedges up uselessly with
CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
CONFIG_XMON=y.  This is all fairly typical for a powerpc/G5 crash :(

However I was able to glimpse some stuff as it flew past.  Crash started in
flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free.  I don't know
how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
card, perhaps.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-03 Thread Christoph Hellwig
On Wed, May 02, 2007 at 04:36:27PM -0400, Mathieu Desnoyers wrote:
> The idea is the following : either we integrate the infrastructure for
> instrumentation / data serialization / buffer management / extraction of
> data to user space in multiple different steps, which makes code review
> easier for you guys,

the staging area for that is -mm

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-03 Thread Christoph Hellwig
On Wed, May 02, 2007 at 01:53:36PM -0700, Andrew Morton wrote:
> In which case we have:
> 
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
> atomich-complete-atomic_long-operations-in-asm-generic.patch
> atomich-i386-type-safety-fix.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
> atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
> local_t-architecture-independant-extension.patch
> local_t-alpha-extension.patch
> local_t-i386-extension.patch
> local_t-ia64-extension.patch
> local_t-mips-extension.patch
> local_t-parisc-cleanup.patch
> local_t-powerpc-extension.patch
> local_t-sparc64-cleanup.patch
> local_t-x86_64-extension.patch
> 
>   For 2.6.22
> 
> linux-kernel-markers-kconfig-menus.patch
> linux-kernel-markers-architecture-independant-code.patch
> linux-kernel-markers-powerpc-optimization.patch
> linux-kernel-markers-i386-optimization.patch
> markers-add-instrumentation-markers-menus-to-avr32.patch
> linux-kernel-markers-non-optimized-architectures.patch
> markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
> linux-kernel-markers-documentation.patch
> #
> markers-define-the-linker-macro-extra_rwdata.patch
> markers-use-extra_rwdata-in-architectures.patch
> #
> some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
> no-longer-include-asm-kdebugh.patch

This it a plan I can fully agree with.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-03 Thread Christoph Hellwig
On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote:
> My statement was probably not clear enough. The actual marker code is
> useful as-is without any further kernel patching required : SystemTAP is
> an example where they use external modules to load probes that can
> connect either to markers or through kprobes. LTTng, in its current state,
> has a mostly modular core that also uses the markers.

That just mean you have to load an enormous emount of exernal crap
that replaces the missing kernel functionality.  It's exactly the
situation we want to avoid.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-02 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 2 May 2007, Nick Piggin wrote:


[snip]


More on-topic, since you suggest doing more within vmtruncate_range
than the filesystem: no, I'm afraid that's misdesigned, and I want
to move almost all of it into the filesystem ->truncate_range.
Because, if what vmtruncate_range is doing before it gets to the
filesystem isn't to be just a waste of time, the filesystem needs
to know what's going on in advance - just as notify_change warns
the filesystem about a coming truncation.  But easier than inventing
some new notification is to move it all into the filesystem, with
unmap_mapping_range+truncate_inode_pages_range its library helpers.


Well I would prefer it to follow the same pattern as regular
truncate. I don't think it is misdesigned to call the filesystem
_first_, but I think if you do that then the filesystem should
call the vm to prepare / finish truncate, rather than open code
calls to unmap itself.



But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.


OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.



I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.


Yeah, I think my thought process went wrong on those... I'll
revisit.



But it is a shame, and leaves me wondering what you gained with the
page lock there.

One thing gained is ease of understanding, and if your later patches
build an edifice upon the knowledge of holding that page lock while
faulting, I've no wish to undermine that foundation.


It also fixes a bug, doesn't it? ;)



Well, I'd come to think that perhaps the bugs would be solved by
that second unmap_mapping_range alone, so the pagelock changes
just a misleading diversion.

I'm not sure how I feel about that: calling unmap_mapping_range a
second time feels such a cheat, but if (big if) it does solve the
races, and the pagelock method is as expensive as your numbers
now suggest...


Well aside from being terribly ugly, it means we can still drop
the dirty bit where we'd otherwise rather not, so I don't think
we can do that.

I think there may be some way we can do this without taking the
page lock, and I was going to look at it, but I think it is
quite neat to just lock the page...

I don't think performance is _that_ bad. On the P4 it is a couple
of % on the microbenchmarks. The G5 is worse, but even then I
don't think it is I'll try to improve that and get back to you.

The problem is that lock/unlock_page is expensive on powerpc, and
if we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:

G5
 pagefault   fork  exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and
would likely speed up other things that lock the page, like write
or page reclaim.

I think we could get further performance improvement by
implementing arch specific bitops for lock/unlock operations,
so we don't need to use things like smb_mb__before_clear_bit()
if they aren't needed or full barriers in the test_and_set_bit().

--
SUSE Labs, Novell Inc.

Index: linux-2.6/include/linux/page-flags.h
===
--- linux-2.6.orig/include/linux/page-flags.h   2007-04-24 10:39:56.0 
+1000
+++ linux-2.6/include/linux/page-flags.h2007-05-03 08:38:53.0 
+1000
@@ -91,6 +91,8 @@
 #define PG_nosave_free 18  /* Used for system suspend/resume */
 #define PG_buddy   19  /* Page is free, on buddy lists */
 
+#define PG_waiters 20  /* Page has PG_locked waiters */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked PG_owner_priv_1 /* Used by some filesystems */
 
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h  2007-04-24 10:39:56.0 
+1000
+++ linux-2.6/include/linux/pagemap.h   2007-05-03 08:35:08.0 +1000
@@ -141,7 +141,7 @@
 static inline void lock_page(struct page *page)
 {
might_sleep();
-   if (TestSetPageLocked(page))
+   if (unlikely(TestSetPageLocked(page)))
__lock_page(page);
 }
 
@@ -152,7 +152,7 @@
 static inline void lock_pag

Re: 2.6.22 -mm merge plans: mm-more-rmap-checking

2007-05-02 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 2 May 2007, Nick Piggin wrote:


Yes, but IIRC I put that in because there was another check in
SLES9 that I actually couldn't put in, but used this one instead
because it also caught the bug we saw.
... 
This was actually a rare corruption that is also in 2.6.21, and

as few rmap callsites as we have, it was never noticed until the
SLES9 bug check was triggered.



You are being very mysterious.  Please describe this bug (privately
if you think it's exploitable), and let's work on the patch to fix it,
rather than this "debug" patch.


It is exec-fix-remove_arg_zero.patch in Andrew's tree, it's exploitable
in that it leaks memory, but it could also release corrupted pagetables
into quicklists on those architectures that have them...

Anyway, it quite likely would have gone unfixed for several more years
if we didn't have the bug triggers in. Now you could argue that my
patch obviously fixes all bugs in there (but I wouldn't :)), and being
most complex of the few callsites, _now_ we can avoid the bug checks.
However I'd prefer to keep them at least under CONFIG_DEBUG_VM.



Hmm, I didn't notice the do_swap_page change, rather just derived
its safety by looking at the current state of the code (which I
guess must have been post-do_swap_page change)...



Your addition of page_add_new_anon_rmap clarified the situation too.



Do you have a pointer to the patch, for my interest?



The patch which changed do_swap_page?

commit c475a8ab625d567eacf5e30ec35d6d8704558062
Author: Hugh Dickins <[EMAIL PROTECTED]>
Date:   Tue Jun 21 17:15:12 2005 -0700
[PATCH] can_share_swap_page: use page_mapcount



Yeah, this one, thanks. I'm just interested.



Or my intended PG_swapcache to PAGE_MAPPING_SWAP patch,
which does assume PageLocked in page_add_anon_rmap?
Yes, I can send you its current unsplit state if you like
(but have higher priorities before splitting and commenting
it for posting).


I would like to see that too, but when you are ready :)

Thanks,
Nick

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Andrew Morton
On Wed, 2 May 2007 19:11:04 -0400
Mathieu Desnoyers <[EMAIL PROTECTED]> wrote:

> > I didn't know that this was the plan.
> > 
> > The problem I have with this is that once we've merged one part, we're
> > committed to merging the other parts even though we haven't seen them yet.
> > 
> > What happens if there's a revolt over the next set of patches?  Do we
> > remove the core markers patches again?  We end up in a cant-go-forward,
> > cant-go-backward situation.
> > 
> > I thought the existing code was useful as-is for several projects, without
> > requiring additional patching to core kernel.  If such additional patching
> > _is_ needed to make the markers code useful then I agree that we should
> > continue to buffer the markers code in -mm until the
> > use-markers-for-something patches have been eyeballed.
> > 
> 
> My statement was probably not clear enough. The actual marker code is
> useful as-is without any further kernel patching required : SystemTAP is
> an example where they use external modules to load probes that can
> connect either to markers or through kprobes. LTTng, in its current state,
> has a mostly modular core that also uses the markers.

OK, that's what I thought.

> Although some, like Christoph and myself, think that it would benefit to
> the kernel community to have a common infrastructure for more than just
> markers (meaning common serialization and buffering mechanism), it does
> not change the fact that the markers, being in mainline, are usable by
> projects through additional kernel modules.
> 
> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.

That'd be a useful demonstration.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Mathieu Desnoyers
* Andrew Morton ([EMAIL PROTECTED]) wrote:
> On Wed, 2 May 2007 16:36:27 -0400
> Mathieu Desnoyers <[EMAIL PROTECTED]> wrote:
> 
> > * Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> > > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > > > That doesn't constitute using it.
> > > > 
> > > > Andi, there was a huge amount of discussion about all this in September 
> > > > last
> > > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > > > believe, that the kernel should have a static marker infrastructure.
> > > 
> > > Only when it's actually useable.  A prerequisite for merging it is
> > > having an actual trace transport infrastructure aswell as a few actually
> > > useful tracing modules in the kernel tree.
> > > 
> > > Let this count as a vote to merge the markers once we have the 
> > > infrastructure
> > > above ready, it'll be very useful then.
> > 
> > Hi Christoph,
> > 
> > The idea is the following : either we integrate the infrastructure for
> > instrumentation / data serialization / buffer management / extraction of
> > data to user space in multiple different steps, which makes code review
> > easier for you guys, or we bring the main pieces of the LTTng project
> > altogether with the Linux Kernel Markers, which would result in a bigger
> > change.
> > 
> > Based on the premise that discussing about logically distinct pieces of
> > infrastructure is easier and can be done more thoroughly when done
> > separately, we decided to submit the markers first, with the other
> > pieces planned in a near future.
> > 
> > I agree that it would be very useful to have the full tracing stack
> > available in the Linux kernel, but we inevitably face the argument :
> > "this change is too big" if we submit all LTTng modules at once or
> > the argument : "we want the whole tracing stack, not just part of it"
> > if we don't.
> > 
> > This is why we chose to push the tracing infrastructure chunk by chunk :
> > to make code review and criticism more efficient.
> > 
> 
> I didn't know that this was the plan.
> 
> The problem I have with this is that once we've merged one part, we're
> committed to merging the other parts even though we haven't seen them yet.
> 
> What happens if there's a revolt over the next set of patches?  Do we
> remove the core markers patches again?  We end up in a cant-go-forward,
> cant-go-backward situation.
> 
> I thought the existing code was useful as-is for several projects, without
> requiring additional patching to core kernel.  If such additional patching
> _is_ needed to make the markers code useful then I agree that we should
> continue to buffer the markers code in -mm until the
> use-markers-for-something patches have been eyeballed.
> 

My statement was probably not clear enough. The actual marker code is
useful as-is without any further kernel patching required : SystemTAP is
an example where they use external modules to load probes that can
connect either to markers or through kprobes. LTTng, in its current state,
has a mostly modular core that also uses the markers.

Although some, like Christoph and myself, think that it would benefit to
the kernel community to have a common infrastructure for more than just
markers (meaning common serialization and buffering mechanism), it does
not change the fact that the markers, being in mainline, are usable by
projects through additional kernel modules.

If we are looking at current "potential users" that are already in
mainline, we could change blktrace to make it use the markers.

Mathieu


> In which case we have:
> 
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
> atomich-complete-atomic_long-operations-in-asm-generic.patch
> atomich-i386-type-safety-fix.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
> atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
> local_t-architecture-independant-extension.patch
> local_t-alpha-extension.patch
> local_t-i386-extension.patch
> local_t-ia64-extension.patch
> local_t-mips-extension.patch
> local_t-parisc-cleanup.patch
> local_t-powerpc-extension.patch
> local_t-sparc64-cleanup.patch
> local_t-x86_64-extension.patch
> 
>   For 2.6.22
> 
> linux-kernel-markers-kconfig-menus.patch
> linux-kernel-markers-architecture-independant-code.patch
> linux-kernel-markers-powerpc-optimization.patch
> linux-kernel-markers-i386-optimization.patch
> markers-add-instrumentation-markers-menus-to-avr32.patch
> linux-kernel-markers-non-optimized-architectures.patch
> markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
> linux-kernel-markers-docume

Re: 2.6.22 -mm merge plans

2007-05-02 Thread Tilman Schmidt
Am 02.05.2007 19:49 schrieb Andi Kleen:
> And also I think when something is merged it should have some users in tree.

Isn't that a circular dependency?

-- 
Tilman Schmidt  E-Mail: [EMAIL PROTECTED]
Bonn, Germany
- Undetected errors are handled as if no error occurred. (IBM) -



signature.asc
Description: OpenPGP digital signature


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Andrew Morton
On Wed, 2 May 2007 16:36:27 -0400
Mathieu Desnoyers <[EMAIL PROTECTED]> wrote:

> * Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > > That doesn't constitute using it.
> > > 
> > > Andi, there was a huge amount of discussion about all this in September 
> > > last
> > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > > believe, that the kernel should have a static marker infrastructure.
> > 
> > Only when it's actually useable.  A prerequisite for merging it is
> > having an actual trace transport infrastructure aswell as a few actually
> > useful tracing modules in the kernel tree.
> > 
> > Let this count as a vote to merge the markers once we have the 
> > infrastructure
> > above ready, it'll be very useful then.
> 
> Hi Christoph,
> 
> The idea is the following : either we integrate the infrastructure for
> instrumentation / data serialization / buffer management / extraction of
> data to user space in multiple different steps, which makes code review
> easier for you guys, or we bring the main pieces of the LTTng project
> altogether with the Linux Kernel Markers, which would result in a bigger
> change.
> 
> Based on the premise that discussing about logically distinct pieces of
> infrastructure is easier and can be done more thoroughly when done
> separately, we decided to submit the markers first, with the other
> pieces planned in a near future.
> 
> I agree that it would be very useful to have the full tracing stack
> available in the Linux kernel, but we inevitably face the argument :
> "this change is too big" if we submit all LTTng modules at once or
> the argument : "we want the whole tracing stack, not just part of it"
> if we don't.
> 
> This is why we chose to push the tracing infrastructure chunk by chunk :
> to make code review and criticism more efficient.
> 

I didn't know that this was the plan.

The problem I have with this is that once we've merged one part, we're
committed to merging the other parts even though we haven't seen them yet.

What happens if there's a revolt over the next set of patches?  Do we
remove the core markers patches again?  We end up in a cant-go-forward,
cant-go-backward situation.

I thought the existing code was useful as-is for several projects, without
requiring additional patching to core kernel.  If such additional patching
_is_ needed to make the markers code useful then I agree that we should
continue to buffer the markers code in -mm until the
use-markers-for-something patches have been eyeballed.

In which case we have:

atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
atomich-complete-atomic_long-operations-in-asm-generic.patch
atomich-i386-type-safety-fix.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
local_t-architecture-independant-extension.patch
local_t-alpha-extension.patch
local_t-i386-extension.patch
local_t-ia64-extension.patch
local_t-mips-extension.patch
local_t-parisc-cleanup.patch
local_t-powerpc-extension.patch
local_t-sparc64-cleanup.patch
local_t-x86_64-extension.patch

  For 2.6.22

linux-kernel-markers-kconfig-menus.patch
linux-kernel-markers-architecture-independant-code.patch
linux-kernel-markers-powerpc-optimization.patch
linux-kernel-markers-i386-optimization.patch
markers-add-instrumentation-markers-menus-to-avr32.patch
linux-kernel-markers-non-optimized-architectures.patch
markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
linux-kernel-markers-documentation.patch
#
markers-define-the-linker-macro-extra_rwdata.patch
markers-use-extra_rwdata-in-architectures.patch
#
some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
no-longer-include-asm-kdebugh.patch

  Hold.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Mathieu Desnoyers
* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > That doesn't constitute using it.
> > 
> > Andi, there was a huge amount of discussion about all this in September last
> > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > believe, that the kernel should have a static marker infrastructure.
> 
> Only when it's actually useable.  A prerequisite for merging it is
> having an actual trace transport infrastructure aswell as a few actually
> useful tracing modules in the kernel tree.
> 
> Let this count as a vote to merge the markers once we have the infrastructure
> above ready, it'll be very useful then.

Hi Christoph,

The idea is the following : either we integrate the infrastructure for
instrumentation / data serialization / buffer management / extraction of
data to user space in multiple different steps, which makes code review
easier for you guys, or we bring the main pieces of the LTTng project
altogether with the Linux Kernel Markers, which would result in a bigger
change.

Based on the premise that discussing about logically distinct pieces of
infrastructure is easier and can be done more thoroughly when done
separately, we decided to submit the markers first, with the other
pieces planned in a near future.

I agree that it would be very useful to have the full tracing stack
available in the Linux kernel, but we inevitably face the argument :
"this change is too big" if we submit all LTTng modules at once or
the argument : "we want the whole tracing stack, not just part of it"
if we don't.

This is why we chose to push the tracing infrastructure chunk by chunk :
to make code review and criticism more efficient.

Regards,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Sam Ravnborg wrote:

> To facilitate this do NOT introduce CONFIG_SLAB until we decide
> that SLUB are default. In this way we can make CONFIG_SLUB be default
> and people will not continue with CONFIG_SLAB because they had it in their
> config already.

We already have CONFIG_SLAB. If you use your existing .config then
you will stay with SLAB.

> The point is make sure that LSUB becomes default for people that does
> an make oldconfig (explicit or implicit).

H... We can think about that when we actually want to make SLUB the 
default.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Sam Ravnborg
On Wed, May 02, 2007 at 12:42:54PM -0700, Christoph Lameter wrote:
> On Wed, 2 May 2007, Andrew Morton wrote:
> 
> > > At some point I dream that SLUB could become the default but I thought 
> > > this would take at least 6 month or so. If want to force this now then I 
> > > will certainly have some busy weeks ahead.
> > 
> > s/dream/promise/ ;)
> > 
> > Six months sounds reasonable - I was kind of hoping for less.  Make it
> > default-to-on in 2.6.23-rc1, see how it goes.
> 
> Here is how I think the future could develop
> 
> Cycle SLABSLUBSLOBSLxB
> 
> 2.6.22API fixes   Stabilization   API fixes
> 
> Major event: SLUB availability as experimental
> 
> 2.6.23API upgradesPerf. Valid.EOL
> 
> Major events: SLUB performance validation. Switch off
>   experimental (could even be the default) 
>   Slab allocators support targeted reclaim for at
>   least one slab cache (dentry?)
>   (vacate/move all objects in a slab)

To facilitate this do NOT introduce CONFIG_SLAB until we decide
that SLUB are default. In this way we can make CONFIG_SLUB be default
and people will not continue with CONFIG_SLAB because they had it in their
.config already.
Or just rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED or something.

The point is make sure that LSUB becomes default for people that does
an make oldconfig (explicit or implicit).

Sam
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Pekka Enberg wrote:

> And then there's patches such as kmemleak which would need to target
> all three. Plus it doesn't really make sense for users to select
> between three competiting implementations. Please don't take away our
> high hopes of getting rid of mm/slab.c Christoph =)

You too, Brutus ...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Andrew Morton wrote:

> > At some point I dream that SLUB could become the default but I thought 
> > this would take at least 6 month or so. If want to force this now then I 
> > will certainly have some busy weeks ahead.
> 
> s/dream/promise/ ;)
> 
> Six months sounds reasonable - I was kind of hoping for less.  Make it
> default-to-on in 2.6.23-rc1, see how it goes.

Here is how I think the future could develop

Cycle   SLABSLUBSLOBSLxB

2.6.22  API fixes   Stabilization   API fixes

Major event: SLUB availability as experimental

2.6.23  API upgradesPerf. Valid.EOL

Major events: SLUB performance validation. Switch off
experimental (could even be the default) 
Slab allocators support targeted reclaim for at
least one slab cache (dentry?)
(vacate/move all objects in a slab)

2.6.24  Earliest EOLStable  -   Experiments

Major events: SLUB stable. Stable targeted reclaim
for all major reclaimable slabs.
Maybe experiments with another new allocator?

2.6.25  EOL default -   ?

Death of SLAB. SLUB default. Hopefully new ideas on the horizon.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Pekka Enberg wrote:

> On 5/2/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > I am the one who has to maintain SLAB and SLUB it seems and I have been
> > dealing with the trio SLAB, SLOB and SLUB for awhile now. Its okay and it
> > will be much easier once the cleanups are in.
> 
> And then there's patches such as kmemleak which would need to target
> all three. Plus it doesn't really make sense for users to select
> between three competiting implementations. Please don't take away our
> high hopes of getting rid of mm/slab.c Christoph =)

SLUB supports kmemleak (actually its quite improved). Switch debugging on 
and try

cat /sys/slab/kmalloc-128/alloc_calls.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Pekka Enberg

On 5/2/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:

Owww... You throw my roadmap out of the window and may create too
high expectations of SLUB.


Me too!

On 5/2/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:

I am the one who has to maintain SLAB and SLUB it seems and I have been
dealing with the trio SLAB, SLOB and SLUB for awhile now. Its okay and it
will be much easier once the cleanups are in.


And then there's patches such as kmemleak which would need to target
all three. Plus it doesn't really make sense for users to select
between three competiting implementations. Please don't take away our
high hopes of getting rid of mm/slab.c Christoph =)

 Pekka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Andrew Morton
On Wed, 2 May 2007 10:03:50 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > But if Linus' tree is to be better than a warehouse to avoid
> > awkward merges, I still think we want it to default to on for
> > all the architectures, and for most if not all -rcs.
> 
> At some point I dream that SLUB could become the default but I thought 
> this would take at least 6 month or so. If want to force this now then I 
> will certainly have some busy weeks ahead.

s/dream/promise/ ;)

Six months sounds reasonable - I was kind of hoping for less.  Make it
default-to-on in 2.6.23-rc1, see how it goes.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Andrew Morton wrote:

> n, we don't want competing slab allocators, please.  We should get slub
> working well on all architectures then remove slab completely.  Having to
> maintain both slab.c and slub.c would be awful.

Owww... You throw my roadmap out of the window and may create too 
high expectations of SLUB.

I am the one who has to maintain SLAB and SLUB it seems and I have been 
dealing with the trio SLAB, SLOB and SLUB for awhile now. Its okay and it 
will be much easier once the cleanups are in.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Siddha, Suresh B wrote:

> I have been looking into "slub" recently to avoid some of the NUMA alien
> cache issues that we were encountering on the regular slab.

Yes that is also our main concern.

> I am having some stability issues with slub on an ia64 NUMA platform and
> didn't have time to dig further. I am hoping to look into it soon
> and share the data/findings with  Christoph.

There is at least one patch on top of 2.6.21-rc7-mm2 already in mm that 
may be necessary for you.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Andrew Morton
On Wed, 2 May 2007 11:39:20 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > I'm astonished and impressed, both with Kconfig and your use of it:
> 
> Thanks!
> 
> > I'd much rather be testing a quicklist patch:
> > I'd better give that a try.
> 
> Great. But I certainly do not mind people use SLAB. I do not think that 
> one approach should be there for all. Choice is the way to have multiple 
> allocators compete. One reason that SLAB is so crusty is because it was 
> the only solution for so long.
> 

n, we don't want competing slab allocators, please.  We should get slub
working well on all architectures then remove slab completely.  Having to
maintain both slab.c and slub.c would be awful.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Siddha, Suresh B
On Wed, May 02, 2007 at 05:54:53AM -0700, Hugh Dickins wrote:
> On Tue, 1 May 2007, Andrew Morton wrote:
> > So on balance, given that we _do_ expect slub to have a future, I'm
> > inclined to crash ahead with it.  The worst that can happen will be a later
> > rm mm/slub.c which would be pretty simple to do.
> 
> Okay.  And there's been no chorus to echo my concern.

I have been looking into "slub" recently to avoid some of the NUMA alien
cache issues that we were encountering on the regular slab.

I am having some stability issues with slub on an ia64 NUMA platform and
didn't have time to dig further. I am hoping to look into it soon
and share the data/findings with  Christoph.

We also did a quick perf collection on x86_64(atleast didn't hear
any stability issues from our team on regular x86_64 SMP), that we will be
sharing shortly.

> But if Linus' tree is to be better than a warehouse to avoid
> awkward merges, I still think we want it to default to on for
> all the architectures, and for most if not all -rcs.

I will not suggest for default on at this point.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Andrew Morton wrote:

> > This is a sensitive piece of the kernel as you say and we better allow the 
> > running of two allocator for some time to make sure that it behaves in all 
> > load situations. The design is fundamentally different so its performance 
> > characteristics may diverge significantly and perhaps there will be corner 
> > cases for each where they do the best job.
> 
> eek.  We'd need to fix those corner cases then.  Our endgame
> here really must be rm mm/slab.c.

First we need to discover them and I doubt that mm covers much more than 
development loads. I hope we can get to a point where we have SLUB be 
the primarily allocator soon but I would expect various performance issues 
to show up.

On the other hand: I am pretty sure that SLUB can replace SLOB completely 
given SLOBs limitations and SLUBs more efficient use of space. SLOB needs 
8 bytes of overhead. SLUB needs none. We may just have to #ifdef out the 
debugging support to make the code be of similar size to SLOB too. SLOB is 
a general problem because its features are not compatible to SLAB. F.e. it 
does not support DESTROY_BY_RCU and does not do reclaim the right way etc 
etc. SLUB may turn out to be the ideal embedded slab allocator.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Andrew Morton
On Wed, 2 May 2007 11:28:26 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > On Wed, 2 May 2007, Christoph Lameter wrote:
> > > 
> > > But these are arch specific problems. We could use 
> > > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms.
> > 
> > As a quick hack, sure.  But every ARCH_USES_SLAB_PAGE_STRUCT
> > diminishes the testing SLUB will get.  If the idea is that we're
> > going to support both SLAB and SLUB, some arches with one, some
> > with another, some with either, for more than a single release,
> > then I'm back to saying SLUB is being pushed in too early.
> > I can understand people wanting pluggable schedulers,
> > but pluggable slab allocators?
> 
> This is a sensitive piece of the kernel as you say and we better allow the 
> running of two allocator for some time to make sure that it behaves in all 
> load situations. The design is fundamentally different so its performance 
> characteristics may diverge significantly and perhaps there will be corner 
> cases for each where they do the best job.

eek.  We'd need to fix those corner cases then.  Our endgame
here really must be rm mm/slab.c.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Hugh Dickins wrote:

> I'm astonished and impressed, both with Kconfig and your use of it:

Thanks!

> I'd much rather be testing a quicklist patch:
> I'd better give that a try.

Great. But I certainly do not mind people use SLAB. I do not think that 
one approach should be there for all. Choice is the way to have multiple 
allocators compete. One reason that SLAB is so crusty is because it was 
the only solution for so long.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Hugh Dickins
On Wed, 2 May 2007, Christoph Lameter wrote:
> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > I presume the answer is just to extend your quicklist work to
> > powerpc's lowest level of pagetables.  The only other architecture
> > which is using kmem_cache for them is arm26, which has
> > "#error SMP is not supported", so won't be giving this problem.
> 
> In the meantime we would need something like this to disable SLUB in this 
> particular configuration. Note that I have not tested this and the <= for
> the comparision with SPLIT_PTLOCK_CPUS may not work (Never seen such a
> construct in a Kconfig file but it is needed here).

I'm astonished and impressed, both with Kconfig and your use of it:
that does seem to work.  Though I don't dare go so far as to give
the patch an ack, and don't like this way out at all.  It needs a
proper (quicklist) solution, and by the time that solution comes
along, all the powerpc people will have CONFIG_SLAB=y in their
.config, and "make oldconfig" will just perpetuate that status quo,
instead of the switching over to CONFIG_SLUB=y.  I think.  Unless
we keep changing the config option names, or go through a phase
with no option.

I'd much rather be testing a quicklist patch:
I'd better give that a try.

Hugh

> 
> 
> 
> PowerPC: Disable SLUB for configurations in which slab page structs are 
> modified
> 
> PowerPC uses the slab allocator to manage the lowest level of the page table.
> In high cpu configurations we also use the page struct to split the page
> table lock. Disallow the selection of SLUB for that case.
> 
> [Not tested: I am not familiar with powerpc build procedures etc]
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
> ===
> --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig2007-05-02 
> 10:07:34.0 -0700
> +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig 2007-05-02 10:13:37.0 
> -0700
> @@ -117,6 +117,19 @@ config GENERIC_BUG
>   default y
>   depends on BUG
>  
> +#
> +# Powerpc uses the slab allocator to manage its ptes and the
> +# page structs of ptes are used for splitting the page table
> +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
> +#
> +# In that special configuration the page structs of slabs are modified.
> +# This setting disables the selection of SLUB as a slab allocator.
> +#
> +config ARCH_USES_SLAB_PAGE_STRUCT
> + bool
> + default y
> + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> +
>  config DEFAULT_UIMAGE
>   bool
>   help
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Hugh Dickins wrote:

> On Wed, 2 May 2007, Christoph Lameter wrote:
> > 
> > But these are arch specific problems. We could use 
> > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms.
> 
> As a quick hack, sure.  But every ARCH_USES_SLAB_PAGE_STRUCT
> diminishes the testing SLUB will get.  If the idea is that we're
> going to support both SLAB and SLUB, some arches with one, some
> with another, some with either, for more than a single release,
> then I'm back to saying SLUB is being pushed in too early.
> I can understand people wanting pluggable schedulers,
> but pluggable slab allocators?

This is a sensitive piece of the kernel as you say and we better allow the 
running of two allocator for some time to make sure that it behaves in all 
load situations. The design is fundamentally different so its performance 
characteristics may diverge significantly and perhaps there will be corner 
cases for each where they do the best job.

I have already reworked the slab API to allow for an easy implementation 
of alternate slab allocators (released with 2.6.20) which only covered 
SLAB and SLOB. This is continuing the cleanup work and adding a third one.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Hugh Dickins
On Wed, 2 May 2007, Christoph Lameter wrote:
> 
> But these are arch specific problems. We could use 
> ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms.

As a quick hack, sure.  But every ARCH_USES_SLAB_PAGE_STRUCT
diminishes the testing SLUB will get.  If the idea is that we're
going to support both SLAB and SLUB, some arches with one, some
with another, some with either, for more than a single release,
then I'm back to saying SLUB is being pushed in too early.
I can understand people wanting pluggable schedulers,
but pluggable slab allocators?

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Andi Kleen
On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <[EMAIL PROTECTED]> wrote:
> 
> > > It is currently used as an instrumentation infrastructure for the LTTng
> > > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > > WindRiver products.  The SystemTAP project also plan to use this type of
> > > infrastructure to trace sites hard to instrument. The Linux Kernel
> > > Markers has the support of Frank C. Eigler, author of their current
> > > marker alternative (which he wishes to drop in order to adopt the
> > > markers infrastructure as soon as it hits mainline).
> > 
> > All of the above don't use mainline kernels.
> 
> That's because they have to add a markers patch!

I meant they use very old kernels. Their experiences don't apply
to mainline bitrottyness.

> > That doesn't constitute using it.
> 
> Andi, there was a huge amount of discussion about all this in September last
> year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> believe, that the kernel should have a static marker infrastructure.

I have no problem with that in principle; just some doubts about
the current proposed implementation: in particular its complexity.
And also I think when something is merged it should have some users in tree.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Christoph Hellwig
On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > That doesn't constitute using it.
> 
> Andi, there was a huge amount of discussion about all this in September last
> year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> believe, that the kernel should have a static marker infrastructure.

Only when it's actually useable.  A prerequisite for merging it is
having an actual trace transport infrastructure aswell as a few actually
useful tracing modules in the kernel tree.

Let this count as a vote to merge the markers once we have the infrastructure
above ready, it'll be very useful then.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Mathieu Desnoyers

* Andi Kleen ([EMAIL PROTECTED]) wrote:
> > It is currently used as an instrumentation infrastructure for the LTTng
> > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > WindRiver products.  The SystemTAP project also plan to use this type of
> > infrastructure to trace sites hard to instrument. The Linux Kernel
> > Markers has the support of Frank C. Eigler, author of their current
> > marker alternative (which he wishes to drop in order to adopt the
> > markers infrastructure as soon as it hits mainline).
> 
> All of the above don't use mainline kernels.
> That doesn't constitute using it.
> 

I am afraid this argument does not hold :

- These companies are not shipping their products with mainline kernels
  to make sure things have time to stabilize.
- They eventually get to the next version some time after it is not
  "head" anymore. They still want to benefit from the features of the
  newer versions.
- All these companies would be really happy to have a marker
  infrastructure in mainline so they can stop applying a separate set of
  patches to provide this functionality.
- Arguing the fact that "they apply their set of patches anyway" goes
  against the advice I have received from Greg KH, which is can be
  reworded as : please submit your patches to mainline instead of
  keeping your separate set of patches. See his various presentations
  about "mainlining" for more info about this.

Because of these 4 arguments, I think that these companies can be
considered as users and contributors of/to mainline kernels.


> > Quoting Jim Keniston <[EMAIL PROTECTED]> :
> > 
> > "kprobes remains a vital foundation for SystemTap.  But markers are
> > attactive as an alternate source of trace/debug info.  Here's why:
> > [...]"
> 
> Talk is cheap. Do they have working code to use it?
> 

LTTng has been using the markers for about 6 months now. SystemTAP is
waiting on the "it hits mainline" signal before they switch from their
STP_MARK() markers to this infrastructure. Give them a few days and they
will proceed to the change.



> >   - Allow per-architecture optimized versions which removes the need for
> > a d-cache based branch (patch a "load immediate" instruction
> > instead). It minimized the d-cache impact of the disabled markers.
> 
> That's a good idea in general, but should be generalized (available
> independently), not hidden in your subsystem. I know a couple of places
> who could use this successfully.
> 

I agree that an efficient hooking mechanism is useful to manyr; listing
at least security hooks and instrumentation for tracing. What other
usage scenario do you have in mind that could not fit in my marker
infrastructure ? I have tried to generalize this as much as possible,
but if you see, within this, a piece of infrastructure that could be
taken apart and used more widely, I will be happy to submit it
separately to increase its usefulness.


> >   - Accept the cost of an unlikely branch at the marker site because the
> > gcc compiler does not give the ability to put "nops" instead of a
> > branch generated from C code. Keep this in mind for future
> > per-architecture optimizations.
> 
> See upcomming paravirt code for a way to do this.
> 

I have looked at the paravirt code in Andrew's 2.6.21-rc7-mm2. A few
reasons why I do not plan to use it :


1 - It requires specific arg setup for the calls to be crafted by hand,
in assembly, for each and every number of parameters and each types, for
each architecture. I use a variable argument list as a parameter to my
marker to make sure that a single macro can be used for markup in a
generic manner.

Quoting : http://lkml.org/lkml/2007/4/4/577
"+ * Unfortunately there's no way to get gcc to generate the args setup
+ * for the call, and then allow the call itself to be generated by an
+ * inline asm.  Because of this, we must do the complete arg setup and
+ * return value handling from within these macros.  This is fairly
+ * cumbersome."


2 - I also provide an architecture independent "generic" version which
does not depend on per-architecture assembly. From what I see, paravirt
is only offered for i386 and x86_64. Are there any plans to support the
other ~12 architectures ? Does it offer a architecture agnostic fallback
in the cases where it is not implemented for a given architecture ?


3 - It can only alter instructions "safely" in the UP case before the
other CPUs are turned on. See my arch/i386/marker.c code patcher for
XMC-safe instruction patching. Marker activation must be done at
runtime, when the system is fully operational.

Quoting 2.6.21 arch/i386/kernel/alternative.c
"/* Replace instructions with better alternatives for this CPU type.
   This runs before SMP is initialized to avoid SMP problems with
   self modifying code. This implies that assymetric systems where
   APs have less capabilities than the boot processor are not handled.
   Tough. Make sure you disable such features by hand. */

voi

Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Hugh Dickins wrote:

> I presume the answer is just to extend your quicklist work to
> powerpc's lowest level of pagetables.  The only other architecture
> which is using kmem_cache for them is arm26, which has
> "#error SMP is not supported", so won't be giving this problem.

In the meantime we would need something like this to disable SLUB in this 
particular configuration. Note that I have not tested this and the <= for
the comparision with SPLIT_PTLOCK_CPUS may not work (Never seen such a
construct in a Kconfig file but it is needed here).



PowerPC: Disable SLUB for configurations in which slab page structs are modified

PowerPC uses the slab allocator to manage the lowest level of the page table.
In high cpu configurations we also use the page struct to split the page
table lock. Disallow the selection of SLUB for that case.

[Not tested: I am not familiar with powerpc build procedures etc]

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
===
--- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig  2007-05-02 
10:07:34.0 -0700
+++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig   2007-05-02 10:13:37.0 
-0700
@@ -117,6 +117,19 @@ config GENERIC_BUG
default y
depends on BUG
 
+#
+# Powerpc uses the slab allocator to manage its ptes and the
+# page structs of ptes are used for splitting the page table
+# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
+#
+# In that special configuration the page structs of slabs are modified.
+# This setting disables the selection of SLUB as a slab allocator.
+#
+config ARCH_USES_SLAB_PAGE_STRUCT
+   bool
+   default y
+   depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
+
 config DEFAULT_UIMAGE
bool
help
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Hugh Dickins wrote:

> But if Linus' tree is to be better than a warehouse to avoid
> awkward merges, I still think we want it to default to on for
> all the architectures, and for most if not all -rcs.

At some point I dream that SLUB could become the default but I thought 
this would take at least 6 month or so. If want to force this now then I 
will certainly have some busy weeks ahead.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Christoph Lameter
On Wed, 2 May 2007, Hugh Dickins wrote:

> > Why would we need to go back to SLAB if we have not switched to SLUB? SLUB 
> > is marked experimental and not the default.
> 
> I said above that I thought SLUB ought to be defaulted to on throughout
> the -rcs: if we don't do that, we're not going to learn much from having
> it in Linus' tree.

I'd rather be careful with that. mm is enough for now. Why go to the 
extremes immediately. If it is an option then people can gradually start 
testing with it.
 
> > The only problems that I am aware of is(or was) the issue with arches 
> > modifying page struct fields of slab pages that SLUB needs for its own 
> > operations. And I thought it was all fixed since the powerpc guys were 
> > quiet and the patch was in for i386.
> 
> You're forgetting your unions in struct page: in the SPLIT_PTLOCK
> case (NR_CPUS >= 4) the pagetable code is using spinlock_t ptl,
> which overlays SLUB's first_page and slab pointers.

Uhhh Right. So SLUB wont work if the lowest page table block is 
managed via slabs.
 
> I just tried rebuilding powerpc with the SPLIT_PTLOCK cutover
> edited to 8 cpus instead, and then no crash.
> 
> I presume the answer is just to extend your quicklist work to
> powerpc's lowest level of pagetables.  The only other architecture

I am not sure how PowerPCs lower pagetable pages work. If they are of 
PAGE_SIZE then this is no problem.

> which is using kmem_cache for them is arm26, which has
> "#error SMP is not supported", so won't be giving this problem.

Ahh. Good.

But these are arch specific problems. We could use 
ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Andrew Morton
On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <[EMAIL PROTECTED]> wrote:

> > It is currently used as an instrumentation infrastructure for the LTTng
> > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > WindRiver products.  The SystemTAP project also plan to use this type of
> > infrastructure to trace sites hard to instrument. The Linux Kernel
> > Markers has the support of Frank C. Eigler, author of their current
> > marker alternative (which he wishes to drop in order to adopt the
> > markers infrastructure as soon as it hits mainline).
> 
> All of the above don't use mainline kernels.

That's because they have to add a markers patch!

> That doesn't constitute using it.

Andi, there was a huge amount of discussion about all this in September last
year (subjects: *markers* and *LTTng*). The outcome of all that was, I
believe, that the kernel should have a static marker infrastructure.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Frank Ch. Eigler

Andi Kleen <[EMAIL PROTECTED]> writes:

> [...] The SystemTAP project also plan to use this type of
> > infrastructure to trace sites hard to instrument. The Linux Kernel
> > Markers has the support of Frank C. Eigler, author of their current
> > marker alternative [...]
> 
> All of the above don't use mainline kernels.
> That doesn't constitute using it.

Systemtap does run on mainline kernels.

> > "kprobes remains a vital foundation for SystemTap.  But markers are
> > attactive as an alternate source of trace/debug info.  Here's why:
> > [...]"
> 
> Talk is cheap. Do they have working code to use it? [...]

We had been waiting on the chicken & egg semaphore.  LTTNG has working
code yesterday (months ago); systemtap will have it "tomorrow" (a week
or few).


- FChE
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)

2007-05-02 Thread Chuck Ebbert
Chris Wright wrote:
> * Greg KH ([EMAIL PROTECTED]) wrote:
>> And is this really a problem?  The whole goal of the -stable tree was to
>> accomidate the users who relied on kernel.org kernels, and wanted
>> bugfixes and security updates.  It was not for new features or new
>> hardware support.
>>
>> If people feel we should revisit this goal, then that's fine, and I have
>> no objection to that.  But until then, I think the rules that we have
>> had in place for over the past 2 years should still remain in affect.
> 
> I have to agree.  I went back through my mbox and found vanishingly few
> pci_id update patches.   So it's not clear there's even a big issue.
> 

Of course you didn't find many -- most people know that's not part of
the -stable charter. If you asked for them you'd get them...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-02 Thread Hugh Dickins
On Wed, 2 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > 
> > But I was quite disappointed when
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
> > appeared, putting double unmap_mapping_range calls in.  Certainly
> > you were wrong to take the one out, but a pity to end up with two.
> > 
> > Your comment says/said:
> > The nopage vs invalidate race fix patch did not take care of truncating
> > private COW pages. Mind you, I'm pretty sure this was previously racy
> > even for regular truncate, not to mention vmtruncate_range.
> > 
> > vmtruncate_range (holepunch) was deficient I agree, and though we
> > can now take out your second unmap_mapping_range there, that's only
> > because I've slipped one into shmem_truncate_range.  In due course it
> > needs to be properly handled by noting the range in shmem inode info.
> > 
> > (I think you couldn't take that approach, noting invalid range in
> > ->mapping while invalidating, because NFS has/had some cases of
> > invalidate_whatever without i_mutex?)
> 
> Sorry, I didn't parse this?

I meant that i_size is used to protect against truncation races, but
we have no equivalent inval_start,inval_end in the struct inode or
struct address_space, such as could be used for similar protection
against races while invalidating.

And that IIRC there are places where NFS was doing the invalidation
without i_mutex: so there could be concurrent invalidations, so one
inval_start,inval_end in the structure wouldn't be enough anyway.

> But I wonder whether it is better to do
> it in vmtruncate_range than the filesystem? Private COWed pages are
> not really a filesystem "thing"...

It wasn't the thought of private COWed pages which made me put a
second unmap_mapping_range in shmem_truncate_range, it was its own
internal file<->swap consistency which needed that (as a quick fix).
The real fix to be having a trunc_start,trunc_end or whatever in
the shmem_inode_info (assuming it's not wanted in the common inode:
might be if holepunch spreads e.g. it's been mentioned with fallocate).

Re private COWed pages and holepunch: Miklos and I agree that really
it would be better for holepunch _not_ to remove them - but that's
rather off-topic.

More on-topic, since you suggest doing more within vmtruncate_range
than the filesystem: no, I'm afraid that's misdesigned, and I want
to move almost all of it into the filesystem ->truncate_range.
Because, if what vmtruncate_range is doing before it gets to the
filesystem isn't to be just a waste of time, the filesystem needs
to know what's going on in advance - just as notify_change warns
the filesystem about a coming truncation.  But easier than inventing
some new notification is to move it all into the filesystem, with
unmap_mapping_range+truncate_inode_pages_range its library helpers.

> 
> > But I'm pretty sure (to use your words!) regular truncate was not racy
> > before: I believe Andrea's sequence count was handling that case fine,
> > without a second unmap_mapping_range.
> 
> OK, I think you're right. I _think_ it should also be OK with the
> lock_page version as well: we should not be able to have any pages
> after the first unmap_mapping_range call, because of the i_size
> write. So if we have no pages, there is nothing to 'cow' from.

I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.

> > But it is a shame, and leaves me wondering what you gained with the
> > page lock there.
> > 
> > One thing gained is ease of understanding, and if your later patches
> > build an edifice upon the knowledge of holding that page lock while
> > faulting, I've no wish to undermine that foundation.
> 
> It also fixes a bug, doesn't it? ;)

Well, I'd come to think that perhaps the bugs would be solved by
that second unmap_mapping_range alone, so the pagelock changes
just a misleading diversion.

I'm not sure how I feel about that: calling unmap_mapping_range a
second time feels such a cheat, but if (big if) it does solve the
races, and the pagelock method is as expensive as your numbers
now suggest...

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: mm-more-rmap-checking

2007-05-02 Thread Hugh Dickins
On Wed, 2 May 2007, Nick Piggin wrote:
> 
> Yes, but IIRC I put that in because there was another check in
> SLES9 that I actually couldn't put in, but used this one instead
> because it also caught the bug we saw.
>... 
> This was actually a rare corruption that is also in 2.6.21, and
> as few rmap callsites as we have, it was never noticed until the
> SLES9 bug check was triggered.

You are being very mysterious.  Please describe this bug (privately
if you think it's exploitable), and let's work on the patch to fix it,
rather than this "debug" patch.

> Hmm, I didn't notice the do_swap_page change, rather just derived
> its safety by looking at the current state of the code (which I
> guess must have been post-do_swap_page change)...

Your addition of page_add_new_anon_rmap clarified the situation too.

> Do you have a pointer to the patch, for my interest?

The patch which changed do_swap_page?

commit c475a8ab625d567eacf5e30ec35d6d8704558062
Author: Hugh Dickins <[EMAIL PROTECTED]>
Date:   Tue Jun 21 17:15:12 2005 -0700
[PATCH] can_share_swap_page: use page_mapcount

Or my intended PG_swapcache to PAGE_MAPPING_SWAP patch,
which does assume PageLocked in page_add_anon_rmap?
Yes, I can send you its current unsplit state if you like
(but have higher priorities before splitting and commenting
it for posting).

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Hugh Dickins
On Tue, 1 May 2007, Andrew Morton wrote:
> 
> Given the current state and the current rate of development I'd expect slub
> to have reached the level of completion which you're describing around -rc2
> or -rc3.  I think we'd be pretty safe making that assumption.

Its developer does show signs of being active!

> 
> This is a bit unusual but there is of course some self-interest here: the
> patch dependencies are getting awful and having this hanging around
> out-of-tree will make 2.6.23 development harder for everyone.

That is a very strong argument: a somewhat worrisome argument,
but a very strong one.  Maintaining your sanity is important.

> 
> So on balance, given that we _do_ expect slub to have a future, I'm
> inclined to crash ahead with it.  The worst that can happen will be a later
> rm mm/slub.c which would be pretty simple to do.

Okay.  And there's been no chorus to echo my concern.

But if Linus' tree is to be better than a warehouse to avoid
awkward merges, I still think we want it to default to on for
all the architectures, and for most if not all -rcs.

> 
> otoh I could do some frantic patch mangling and make it easier to carry
> slub out-of-tree, but do we gain much from that?

No, keep away from that.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-02 Thread Hugh Dickins
On Tue, 1 May 2007, Christoph Lameter wrote:
> On Tue, 1 May 2007, Hugh Dickins wrote:
> 
> > Yes, to me it does.  If it could be defaulted to on throughout the
> > -rcs, on every architecture, then I'd say that's "finishing work";
> > and we'd be safe knowing we could go back to slab in a hurry if
> > needed.  But it hasn't reached that stage yet, I think.
> 
> Why would we need to go back to SLAB if we have not switched to SLUB? SLUB 
> is marked experimental and not the default.

I said above that I thought SLUB ought to be defaulted to on throughout
the -rcs: if we don't do that, we're not going to learn much from having
it in Linus' tree.

And perhaps that line which appends "PREEMPT " to an oops report ought
to append "SLUB " too, for so long as there's a choice.

> The only problems that I am aware of is(or was) the issue with arches 
> modifying page struct fields of slab pages that SLUB needs for its own 
> operations. And I thought it was all fixed since the powerpc guys were 
> quiet and the patch was in for i386.

You're forgetting your unions in struct page: in the SPLIT_PTLOCK
case (NR_CPUS >= 4) the pagetable code is using spinlock_t ptl,
which overlays SLUB's first_page and slab pointers.

I just tried rebuilding powerpc with the SPLIT_PTLOCK cutover
edited to 8 cpus instead, and then no crash.

I presume the answer is just to extend your quicklist work to
powerpc's lowest level of pagetables.  The only other architecture
which is using kmem_cache for them is arm26, which has
"#error SMP is not supported", so won't be giving this problem.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Andi Kleen
> It is currently used as an instrumentation infrastructure for the LTTng
> tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> WindRiver products.  The SystemTAP project also plan to use this type of
> infrastructure to trace sites hard to instrument. The Linux Kernel
> Markers has the support of Frank C. Eigler, author of their current
> marker alternative (which he wishes to drop in order to adopt the
> markers infrastructure as soon as it hits mainline).

All of the above don't use mainline kernels.
That doesn't constitute using it.

> Quoting Jim Keniston <[EMAIL PROTECTED]> :
> 
> "kprobes remains a vital foundation for SystemTap.  But markers are
> attactive as an alternate source of trace/debug info.  Here's why:
> [...]"

Talk is cheap. Do they have working code to use it?

>   - Allow per-architecture optimized versions which removes the need for
> a d-cache based branch (patch a "load immediate" instruction
> instead). It minimized the d-cache impact of the disabled markers.

That's a good idea in general, but should be generalized (available
independently), not hidden in your subsystem. I know a couple of places
who could use this successfully.

>   - Accept the cost of an unlikely branch at the marker site because the
> gcc compiler does not give the ability to put "nops" instead of a
> branch generated from C code. Keep this in mind for future
> per-architecture optimizations.

See upcomming paravirt code for a way to do this.

> - Instrumentation of challenging kernel sites
>   - Instrumentation such as the one provided in the already existing
> Lock dependency checker (lockdep) and instrumentation of trap
> handlers implies being reentrant for such context. Therefore, the
> implementation must be lock-free and update the state in an atomic
> fashion (rcu-style). It must also let the programmer who describes
> a marker site the ability to specify what is forbidden in the probe
> that will be connected to the marker : can it generate a trap ? Can
> it call lockdep (irq disable, take any type of lock), can it call
> printk ? This is why flags can be passed to the _MARK() marker,
> while the MARK() marker has the default flags.

Why can't you just generally forbid probes from doing all of this?
It would greatly simplify your code, wouldn't it?

Keep it simple please.

> Please tell me if I forgot to explain the rationale behind some
> implementation detail and I will be happy to explain in more depth.

Having lots of flags to do things differently optionally normally
starts up all warning lights of early over design. While Linux
has this sometimes it is generally only in mature old subsystems.
But when something is freshly merged it shouldn't be like this.
That is because code tends to grow more complicated over its livetime
and when it is already complicated at the beginning it will eventually
fall over (you can study current slab as a poster child of this)

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-02 Thread Andi Kleen
On Wed, May 02, 2007 at 10:31:10AM +1000, Rusty Russell wrote:
> On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote:
> > Andrew Morton <[EMAIL PROTECTED]> writes:
> > > Will merge the rustyvisor.
> > 
> > IMHO the user code still doesn't belong into Documentation.
> > Also it needs another review round I guess. And some beta testing by
> > more people.
> 
> Like any piece of code more review and more testing would be great.
> (Your earlier review was particularly useful!).  But it's not clear that
> waiting for longer will achieve either.

Not clear to me. Release a clear lguest patchkit with documentation
on l-k several times and you'll probably get both reviewers and testers.
Then confidence level will rise.

> 
> Look at kvm's experience for the reverse case: it went in, then got
> rewritten.

They at least already had some user base at this point.

> As for the code in Documentation, my initial attempts tried to get
> around the need for a userspace part by putting everything in the kernel
> module.  It meant you could launch a guest by writing a string
> to /dev/lguest (no real ABI burden there), but it's a worse solution
> than some user code in the kernel tree 8(

Just put it into a separate tarball.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-02 Thread Nick Piggin

Nick Piggin wrote:

Hugh Dickins wrote:


On Tue, 1 May 2007, Nick Piggin wrote:




There were concerns that we could do this more cheaply, but I think it
is important to start with a base that is simple and more likely to
be correct and build on that. My testing didn't show any obvious
problems with performance.




I don't see _problems_ with performance, but I do consistently see the
same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
and page fault tests on SMP, several machines, just as I did last year.



OK. I did run some tests at one stage which didn't show a regression
on my P4, however I don't know that they were statistically significant.
I'll try a couple more runs and post numbers.


I didn't have enough time tonight to get means/stddev, etc, but the runs
are pretty stable.

Patch tested was just the lock page one.

SMP kernel, tasks bound to 1 CPU:

P4 Xeon
 pagefault   fork  exec
2.6.21   1.67-1.69   140.7-142.0   449.5-460.8
+patch   1.75-1.77   144.0-145.5   456.2-463.0

So it's taken on nearly 5% on pagefault, but looks like less than 2% on
fork, so not as bad as your numbers (phew).

G5
 pagefault   fork  exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2

Bigger hit there.

Page faults can be improved a tiny bit by not using a test and clear op
in unlock_page (less barriers for the G5).

I don't think that's really a blocker problem for a merge, but I wonder
what we can do to improve it. Lockless pagecache shaves quite a bit of
straight line find_get_page performance there.

Going to a non-sleeping lock might be one way to go in the long term, but
it would require quite a lot of restructuring.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-01 Thread Nick Piggin

Hugh Dickins wrote:

On Tue, 1 May 2007, Nick Piggin wrote:



There were concerns that we could do this more cheaply, but I think it
is important to start with a base that is simple and more likely to
be correct and build on that. My testing didn't show any obvious
problems with performance.



I don't see _problems_ with performance, but I do consistently see the
same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
and page fault tests on SMP, several machines, just as I did last year.


OK. I did run some tests at one stage which didn't show a regression
on my P4, however I don't know that they were statistically significant.
I'll try a couple more runs and post numbers.



I'm assuming this patch is the one responsible: at 2.6.20-rc4 time
you posted a set of 10 and a set of 7 patches I tried in versus out;
at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in
versus out; with similar results.

I did check the graphs on test.kernel.org, I couldn't see any bad
behaviour there that correlated with this work; though each -mm
has such a variety of new work in it, it's very hard to attribute.
And nobody else has reported any regression from your patches.

I'm inclined to write it off as poorer performance in some micro-
benchmarks, against which we offset the improved understandabilty
of holding the page lock over the file fault.

But I was quite disappointed when 
mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch

appeared, putting double unmap_mapping_range calls in.  Certainly
you were wrong to take the one out, but a pity to end up with two.

Your comment says/said:
The nopage vs invalidate race fix patch did not take care of truncating
private COW pages. Mind you, I'm pretty sure this was previously racy
even for regular truncate, not to mention vmtruncate_range.

vmtruncate_range (holepunch) was deficient I agree, and though we
can now take out your second unmap_mapping_range there, that's only
because I've slipped one into shmem_truncate_range.  In due course it
needs to be properly handled by noting the range in shmem inode info.

(I think you couldn't take that approach, noting invalid range in
->mapping while invalidating, because NFS has/had some cases of
invalidate_whatever without i_mutex?)


Sorry, I didn't parse this? But I wonder whether it is better to do
it in vmtruncate_range than the filesystem? Private COWed pages are
not really a filesystem "thing"...



But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.


OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.



Well, I guess I've come to accept that, expensive as unmap_mapping_range
may be, truncating files while they're mmap'ed is perverse behaviour:
perhaps even deserving such punishment.

But it is a shame, and leaves me wondering what you gained with the
page lock there.

One thing gained is ease of understanding, and if your later patches
build an edifice upon the knowledge of holding that page lock while
faulting, I've no wish to undermine that foundation.


It also fixes a bug, doesn't it? ;)

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: mm-more-rmap-checking

2007-05-01 Thread Nick Piggin

Hugh Dickins wrote:

On Mon, 30 Apr 2007, Andrew Morton wrote:

... 
mm-more-rmap-checking.patch

...

Misc MM things.  Will merge.



Would Nick mind very much if I ask you to drop this one?
You did CC me ages ago, but I've only just run across it.
It's a small matter, but I'd prefer it dropped for now.


I guess I would prefer it to go under CONFIG_DEBUG_VM. Speaking
of which, it would be nice to be able to turn that on unconditionally
in -rc1. Although I may have put a few too many things under it, so
it might slow down too much...



Re-introduce rmap verification patches that Hugh removed when he removed
PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
anonymous pages, because PG_locked and PTL together already do.

These checks were important in discovering and fixing a rare rmap corruption
in SLES9.



It introduces some silly checks which were never in mainline,
nor so far as I can tell in SLES9: I'm thinking of those
+   BUG_ON(address < vma->vm_start || address >= vma->vm_end);


Yes, but IIRC I put that in because there was another check in
SLES9 that I actually couldn't put in, but used this one instead
because it also caught the bug we saw.



There are few callsites for these rmap functions, I don't think
they need to be checking their arguments in that way.

It also changes the inline page_dup_rmap (a single atomic increment)
into a bugchecking out-of-line function: do we really want to slow
down fork in that way, for 2.6.22 to fix a rare corruption in SLES9?


This was actually a rare corruption that is also in 2.6.21, and
as few rmap callsites as we have, it was never noticed until the
SLES9 bug check was triggered.



What I really like about the patch is Nick's observation that my
/* else checking page index and mapping is racy */
is no longer true: a change we made to the do_swap_page sequence
some while ago has indeed cured that raciness, and I'm happy to
reintroduce the check on mapping and index in page_add_anon_rmap,
and his BUG_ON(!PageLocked(page)) there (despite BUG_ONs falling
out of fashion very recently).


Hmm, I didn't notice the do_swap_page change, rather just derived
its safety by looking at the current state of the code (which I
guess must have been post-do_swap_page change)...

Do you have a pointer to the patch, for my interest?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)

2007-05-01 Thread Chris Wright
* Greg KH ([EMAIL PROTECTED]) wrote:
> And is this really a problem?  The whole goal of the -stable tree was to
> accomidate the users who relied on kernel.org kernels, and wanted
> bugfixes and security updates.  It was not for new features or new
> hardware support.
> 
> If people feel we should revisit this goal, then that's fine, and I have
> no objection to that.  But until then, I think the rules that we have
> had in place for over the past 2 years should still remain in affect.

I have to agree.  I went back through my mbox and found vanishingly few
pci_id update patches.   So it's not clear there's even a big issue.

thanks,
-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)

2007-05-01 Thread Greg KH
On Tue, May 01, 2007 at 05:40:33PM +0100, Alan Cox wrote:
> > > But distros can easily add the device id to their kernel if needed, it
> > > isn't something that the -stable tree shoud be accepting.  Otherwise, we
> > > will be swamped with those types of patches...
> > > 
> > 
> > Oh sure, leave the distros swamped with them instead. :)
> > 
> > And they all have to do it separately, meaning they don't stay in sync
> > and they duplicate each other's work...
> 
> Well they *don't* have to work that separately. They could set up some
> shared tree which would look suspiciously like what Greg is doing but
> with the ID updates ;)

And is this really a problem?  The whole goal of the -stable tree was to
accomidate the users who relied on kernel.org kernels, and wanted
bugfixes and security updates.  It was not for new features or new
hardware support.

If people feel we should revisit this goal, then that's fine, and I have
no objection to that.  But until then, I think the rules that we have
had in place for over the past 2 years should still remain in affect.

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-01 Thread Rusty Russell
On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> > Will merge the rustyvisor.
> 
> IMHO the user code still doesn't belong into Documentation.
> Also it needs another review round I guess. And some beta testing by
> more people.

Like any piece of code more review and more testing would be great.
(Your earlier review was particularly useful!).  But it's not clear that
waiting for longer will achieve either.

Look at kvm's experience for the reverse case: it went in, then got
rewritten.

As for the code in Documentation, my initial attempts tried to get
around the need for a userspace part by putting everything in the kernel
module.  It meant you could launch a guest by writing a string
to /dev/lguest (no real ABI burden there), but it's a worse solution
than some user code in the kernel tree 8(

Cheers,
Rusty.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans

2007-05-01 Thread Mathieu Desnoyers
Hi Andi,

* Andi Kleen ([EMAIL PROTECTED]) wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> 
> > Static markers.  Will merge.
> There don't seem to be any users of this. How do you know it hasn't
> already bitrotted?
> 

See the detailed explanation at :
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc7/2.6.21-rc7-mm2/broken-out/linux-kernel-markers-kconfig-menus.patch

Major points :

It is currently used as an instrumentation infrastructure for the LTTng
tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
WindRiver products.  The SystemTAP project also plan to use this type of
infrastructure to trace sites hard to instrument. The Linux Kernel
Markers has the support of Frank C. Eigler, author of their current
marker alternative (which he wishes to drop in order to adopt the
markers infrastructure as soon as it hits mainline).

Quoting Jim Keniston <[EMAIL PROTECTED]> :

"kprobes remains a vital foundation for SystemTap.  But markers are
attactive as an alternate source of trace/debug info.  Here's why:
[...]"

> It seems quite overcomplicated to me. Has the complexity been justified?
> 

To summarize the document pointed at the URL above, where the full
the key goals of the markers, showing the rationale being the most
important design choices :
- Almost non perceivable impact on production machines when compiled in
  but markers are "disabled".
  - Use a separate section to keep the data to minimize d-cache
trashing.
  - Put the code (stack setup and function call) in unlikely branches of the
if() condition to minimize i-cache impact.
  - Since it is required to allow instrumentation of variables within
the body of a function, accept the impact on compiler's
optimizations and let it keep the variables "live" sometimes longer
than required. It is up to the person who puts the marker in the
code to choose the location that will have a small impact in this
aspect.
  - Allow per-architecture optimized versions which removes the need for
a d-cache based branch (patch a "load immediate" instruction
instead). It minimized the d-cache impact of the disabled markers.
  - Accept the cost of an unlikely branch at the marker site because the
gcc compiler does not give the ability to put "nops" instead of a
branch generated from C code. Keep this in mind for future
per-architecture optimizations.
- Instrumentation of challenging kernel sites
  - Instrumentation such as the one provided in the already existing
Lock dependency checker (lockdep) and instrumentation of trap
handlers implies being reentrant for such context. Therefore, the
implementation must be lock-free and update the state in an atomic
fashion (rcu-style). It must also let the programmer who describes
a marker site the ability to specify what is forbidden in the probe
that will be connected to the marker : can it generate a trap ? Can
it call lockdep (irq disable, take any type of lock), can it call
printk ? This is why flags can be passed to the _MARK() marker,
while the MARK() marker has the default flags.

Please tell me if I forgot to explain the rationale behind some
implementation detail and I will be happy to explain in more depth.

Regards,

Mathieu


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-01 Thread Andrew Morton
On Tue, 1 May 2007 13:46:26 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Tue, 1 May 2007, Andrew Morton wrote:
> 
> > otoh I could do some frantic patch mangling and make it easier to carry
> > slub out-of-tree, but do we gain much from that?
> 
> Then we may loose all the slab API cleanups? Yuck. I really do not want 
> redo those

No, I meant that I'd look at splitting those patches up into
one-against-mainline and one-against-slub.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-01 Thread Christoph Lameter
On Tue, 1 May 2007, Hugh Dickins wrote:

> Yes, to me it does.  If it could be defaulted to on throughout the
> -rcs, on every architecture, then I'd say that's "finishing work";
> and we'd be safe knowing we could go back to slab in a hurry if
> needed.  But it hasn't reached that stage yet, I think.

Why would we need to go back to SLAB if we have not switched to SLUB? SLUB 
is marked experimental and not the default.

The only problems that I am aware of is(or was) the issue with arches 
modifying page struct fields of slab pages that SLUB needs for its own 
operations. And I thought it was all fixed since the powerpc guys were 
quiet and the patch was in for i386.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-01 Thread Christoph Lameter
On Tue, 1 May 2007, Andrew Morton wrote:

> otoh I could do some frantic patch mangling and make it easier to carry
> slub out-of-tree, but do we gain much from that?

Then we may loose all the slab API cleanups? Yuck. I really do not want 
redo those

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-01 Thread Andrew Morton
On Tue, 1 May 2007 21:19:09 +0100 (BST)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> On Tue, 1 May 2007, Andrew Morton wrote:
> > On Tue, 1 May 2007 19:10:29 +0100 (BST)
> > Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > 
> > > > Most of the rest of slub.  Will merge it all.
> > > 
> > > Merging slub already?  I'm surprised.
> > 
> > My thinking here is "does slub have a future".
> > I think the answer is "yes",
> 
> I think I agree with that,
> though it's a judgement I'd leave to you and others.
> 
> > so we're reasonably safe getting it into mainline for the finishing
> > work.  The kernel.org kernel will still default to slab.
> > 
> > Does that sound wrong?
> 
> Yes, to me it does.  If it could be defaulted to on throughout the
> -rcs, on every architecture, then I'd say that's "finishing work";
> and we'd be safe knowing we could go back to slab in a hurry if
> needed.  But it hasn't reached that stage yet, I think.
> 

Given the current state and the current rate of development I'd expect slub
to have reached the level of completion which you're describing around -rc2
or -rc3.  I think we'd be pretty safe making that assumption.

This is a bit unusual but there is of course some self-interest here: the
patch dependencies are getting awful and having this hanging around
out-of-tree will make 2.6.23 development harder for everyone.

So on balance, given that we _do_ expect slub to have a future, I'm
inclined to crash ahead with it.  The worst that can happen will be a later
rm mm/slub.c which would be pretty simple to do.

otoh I could do some frantic patch mangling and make it easier to carry
slub out-of-tree, but do we gain much from that?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-01 Thread Hugh Dickins
On Tue, 1 May 2007, Andrew Morton wrote:
> On Tue, 1 May 2007 19:10:29 +0100 (BST)
> Hugh Dickins <[EMAIL PROTECTED]> wrote:
> 
> > > Most of the rest of slub.  Will merge it all.
> > 
> > Merging slub already?  I'm surprised.
> 
> My thinking here is "does slub have a future".
> I think the answer is "yes",

I think I agree with that,
though it's a judgement I'd leave to you and others.

> so we're reasonably safe getting it into mainline for the finishing
> work.  The kernel.org kernel will still default to slab.
> 
> Does that sound wrong?

Yes, to me it does.  If it could be defaulted to on throughout the
-rcs, on every architecture, then I'd say that's "finishing work";
and we'd be safe knowing we could go back to slab in a hurry if
needed.  But it hasn't reached that stage yet, I think.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-01 Thread Andrew Morton
On Tue, 1 May 2007 19:10:29 +0100 (BST)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> > Most of the rest of slub.  Will merge it all.
> 
> Merging slub already?  I'm surprised.

My thinking here is "does slub have a future".  I think the answer is
"yes", so we're reasonably safe getting it into mainline for the finishing
work.  The kernel.org kernel will still default to slab.

Does that sound wrong?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-01 Thread Hugh Dickins
On Tue, 1 May 2007, Nick Piggin wrote:
> Andrew Morton wrote:
> 
> >  mm-simplify-filemap_nopage.patch
> >  mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> >  mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> >  mm-merge-nopfn-into-fault.patch
> >  convert-hugetlbfs-to-use-vm_ops-fault.patch
> >  mm-remove-legacy-cruft.patch
> >  mm-debug-check-for-the-fault-vs-invalidate-race.patch
> 
> > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> 
> > Miscish MM changes.  Will merge, dependent upon what still applies and works
> > if the moveable-zone patches get stalled.
> 
> These fix some bugs in the core vm, at least the former one we have
> seen numerous people hitting in production...
...
> 
> So, do you or anyone else have any problems with these patches going in
> 2.6.22? I haven't had much feedback for a while, but I was under the
> impression that people are more-or-less happy with them?
> 
> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> 
> This patch fixes the core filemap_nopage vs invalidate_inode_pages2
> race by having filemap_nopage return a locked page to do_no_page,
> and removes the fairly complex (and inadequate) truncate_count
> synchronisation logic.
> 
> There were concerns that we could do this more cheaply, but I think it
> is important to start with a base that is simple and more likely to
> be correct and build on that. My testing didn't show any obvious
> problems with performance.

I don't see _problems_ with performance, but I do consistently see the
same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
and page fault tests on SMP, several machines, just as I did last year.

I'm assuming this patch is the one responsible: at 2.6.20-rc4 time
you posted a set of 10 and a set of 7 patches I tried in versus out;
at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in
versus out; with similar results.

I did check the graphs on test.kernel.org, I couldn't see any bad
behaviour there that correlated with this work; though each -mm
has such a variety of new work in it, it's very hard to attribute.
And nobody else has reported any regression from your patches.

I'm inclined to write it off as poorer performance in some micro-
benchmarks, against which we offset the improved understandabilty
of holding the page lock over the file fault.

But I was quite disappointed when 
mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
appeared, putting double unmap_mapping_range calls in.  Certainly
you were wrong to take the one out, but a pity to end up with two.

Your comment says/said:
The nopage vs invalidate race fix patch did not take care of truncating
private COW pages. Mind you, I'm pretty sure this was previously racy
even for regular truncate, not to mention vmtruncate_range.

vmtruncate_range (holepunch) was deficient I agree, and though we
can now take out your second unmap_mapping_range there, that's only
because I've slipped one into shmem_truncate_range.  In due course it
needs to be properly handled by noting the range in shmem inode info.

(I think you couldn't take that approach, noting invalid range in
->mapping while invalidating, because NFS has/had some cases of
invalidate_whatever without i_mutex?)

But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.

Well, I guess I've come to accept that, expensive as unmap_mapping_range
may be, truncating files while they're mmap'ed is perverse behaviour:
perhaps even deserving such punishment.

But it is a shame, and leaves me wondering what you gained with the
page lock there.

One thing gained is ease of understanding, and if your later patches
build an edifice upon the knowledge of holding that page lock while
faulting, I've no wish to undermine that foundation.

> 
> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> mm-merge-nopfn-into-fault.patch
> etc.
> 
> These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite)
> into a single, unified interface. Although this strictly closes some
> similar holes in nonlinear faults as well, they are very uncommon, so
> I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see
> any reason not to, but at least they don't fix major bugs).

I don't have an opinion on these, but I think BenH and others
were strongly in favour, with various people waiting upon them.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans: slub

2007-05-01 Thread Christoph Lameter
On Tue, 1 May 2007, Hugh Dickins wrote:

> > Most of the rest of slub.  Will merge it all.
> 
> Merging slub already?  I'm surprised.  That's a very key piece of
> infrastructure, and I doubt it's had the exposure it needs yet.

Its not the default. Its just an alternative like SLOB. It will take some 
time to test with various loads in order to see if it can really replace
SLAB in all scenarios.
 
> Just what has it been widely tested on so far?  x86_64.  Not many
> of us have ia64, but I guess SGI people will have been trying it
> on that.  Not i386, that's excluded.

There is an i386 patch pending and I have used it on i386 for a while.
 
> Not powerpc - hmm, I thought that was known, but looking I see no
> ARCH_USES_SLAB_PAGE_STRUCT there: just built and tried to run it up,
> crashes in slab_free from pgtable_free_tlb frpm free_pte_range from
> free_pgd_range from free_pgtables from unmap_region form do_munmap.
> That's 2.6.21-rc7-mm2.

Hmmm... True I have not spend any time with that platform. We can set 
ARCH_USES_SLAB_PAGE_STRUCT there to switch it off. SLUB is the default for 
mm so I am a bit surprised that this did not surface earlier.

> I've nothing against slub in itself, though I'm wary of its
> cache merging (more scope for one corrupting another) (and

Yes but then SLUB has more diagnostics etc etc than SLAB to prevent any 
issues. In debug mode all slabs are separate. The merge feature is very 
stable these days and significantly reduces cache overhead problems 
that plague SLAB and require it to have a complex object expiration 
technique. As a result I was able to rip out all timers. SLUB has no cache 
reaper nor any timer. Its silent if not in use.

> sometimes I think Christoph spent one life uglifying slab for
> NUMA, then another life ripping that all out to make slub ;)

SLAB has a certain paradigm of doing things (queues) and I had to work 
within that framework. It was a group effort. SLUB is an answer to those 
complaints and a result of the lessons learned through years of some 
painful slab debugging. SLUB makes debugging extremely easy (and also the 
design is very simple and comprehensible). No rebuilding of the kernel. 
Just pop in a debug option on the command line which can even be targeted 
to a slab cache if we know that things break there.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- lumpy reclaim

2007-05-01 Thread Andrew Morton
On Tue, 01 May 2007 14:02:41 +0100 Andy Whitcroft <[EMAIL PROTECTED]> wrote:

> I have some primitive stats patches which we have used performance
> testing.  Perhaps those could be brought up to date to provide better
> visibility into lumpy's operation.  Again this would be a separate patch.

Feel free to add new counters in /proc/vmstat - perhaps per-order
success and fail rates?  Monitoring the ratio between those would show
how effective lumpiness is being, perhaps.

It's always nice to see what's going on in there.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: fragmentation avoidance Re: 2.6.22 -mm merge plans

2007-05-01 Thread Andrew Morton
On Tue, 1 May 2007 11:16:51 +0100 [EMAIL PROTECTED] (Mel Gorman) wrote:

> 

OK, I did all the reorganisation which you recommended.

> Ok. It is getting reviewed by Christoph and I'm going through the TODO items
> it yielded. Andy has also been regularly reviewing them which is probably
> why they have had less public errors than you might expect from something
> like this.

Great.  I'm a bit behind on my linux-mm reading.

> Christoph may like to comment more here.

That would be helpful.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: fragmentation avoidance Re: 2.6.22 -mm merge plans

2007-05-01 Thread Mel Gorman
On Tue, 1 May 2007, Christoph Lameter wrote:

> On Tue, 1 May 2007, Mel Gorman wrote:
>
>>anti-fragmentation-switch-over-to-pfn_valid_within.patch
>>
>> These patches are the grouping pages by mobility patches. They get tested
>> every time someone boots the machine from the perspective that they affect
>> the page allocator. It is working to keep fragmentation problems to a
>> minimum and being exercised.  We have beaten it heavily here on tests
>> with a variety of machines using the system that drives test.kernel.org
>> for both functionality and performance testing. That covers x86, x86_64,
>> ppc64 and occasionally IA64. Granted, there are corner-case machines out
>> there or we'd never receive bug reports at all.
>>
>> They are currently being reviewed by Christoph Lameter. His feedback in
>> the linux-mm thread "Antifrag patchset comments" has given me a TODO list
>> which I'm currently working through. So far, there has been no fundamental
>> mistake in my opinion and the additional work is logical extensions.
>
> I think we really urgently need a defragmentation solution in Linux in
> order to support higher page allocations for various purposes. SLUB f.e.
> would benefit from it and the large blocksize patches are not reasonable
> without such a method.
>

I continue to maintain that anti-fragmentation is a pre-requisite for
any defragmentation mechanism to be effective without trashing overall
performance. If allocation success rates are low when everything possible
has been reclaimed as is the case without fragmentation avoidance, then
defragmentation will not help unless the the 1:1 phys:virt mappings is broken
which incurs its own considerable set of problems.

> However, the current code is not up to the task. I did not see a clean
> categorization of allocations nor a consistent handling of those. The
> cleanup work that would have to be done throughout the kernel is not
> there.

The choice of mobility marker to use in each case was deliberate (even if I
have made mistakes but what else is review for?). The choice by default is
UNMOVABLE as it's the safe choice even if may be sub-optimal.  The description
of the mobility types may not be the clearest. For example, buffers were
placed beside page cache in MOVABLE because they can both be reclaimed in
the same fashion - I consider moving it to disk to be as "movable" as any
other definition of the word but in your world movable always means page
migration which has led to some confusion. They could have been separated
out as MOVABLE and BUFFERS for a conceptually cleaner split but it did not
seem necessary because the more types there are, the bigger the memory and
performance footprint becomes. Additional flag groupings like GFP_BUFFERS
could be defined that alias to MOVABLE if you felt it would make the code
clearer but functionally, the behaviour remains the same. This is similar
to your feedback on the treatment of GFP_TEMPORARY.

There can be as many alias mobility types as you wish but if more "real"
types are required, you can have as you want as long as NR_PAGEBLOCK_BITS
is increased properly and allocflags_to_migratetype() is able to translate
GFP flags to the appropriate mobility type. It increases the performance
and memory footprint though.

> It is spotty. There seems to be a series of heuristic driving this
> thing (I have to agree with Nick there). The temporary allocations that
> were missed are just a few that I found. The review of the rest of the
> kernel was not done.

The review for temporary allocations was aimed at catching the most common
callers, not every single one of them because a full review of every caller
is a large undertaking.  If anything, it makes more sense to do a review of
all callers at the end when the core mechanism is finished. The default to
treat them as UNMOVABLE is sensible.

> Mel said that he fixed up locations that showed up to
> be a problem in testing. That is another issue: Too much focus on testing
> instead of conceptual cleanness and clean code in the kernel.

The patches started as a thought experiment of what "should work". They
were then tested to find flaws in the model and the results were fed back
in. How is that a disadvantage exactly?

> It looks
> like this is geared for a specific series of tests on specific platforms
> and also to a particular allocation size (max order sized huge pages).
>

Some series of tests had to be chosen and one combination was chosen
that was known to be particularly hostile to external fragmentation -
i.e. large numbers of kernel cache allocations at the same time as page
cache allocations. No one has suggested an alternative test that would be
more suitable. The platforms used were x86, x86_64 and ppc64 which are not
exactly insignificant platforms. At the time, I didn't have an IA64 machine and
franky the one I have now does not always boot so testing is not as thorough.

Huge page sized pages were chosen because they were the hardest allocati

  1   2   >