Re: 2.6.22 -mm merge plans
On Thu, May 10, 2007 at 03:39:36PM -0400, Mathieu Desnoyers wrote:
> * Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> > On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > > kprobes does this kind of synchronization internally, so the marker wrapper should probably as well.
> > >
> > > The problem appears on heavily loaded systems. Doing 50 synchronize_sched() calls in a row can take up to a few seconds on a 4-way machine. This is why I prefer to do it in the module to which the callbacks belong.
> >
> > We recently had a discussion on a batch unregistration interface for kprobes. I'm not very happy with having such different interfaces for different kinds of probe registration.
>
> Ok, I've had a look at the kprobes batch registration mechanisms and, well, they do not look well suited for the markers. Adding supplementary data structures such as linked lists of probes does not look like a good match.
>
> However, I agree with you that providing a similar API is good. Therefore, here is my proposal:
>
> The goal is to do the synchronize just after we unregister the last probe handler provided by a given module. Since the unregistration functions iterate over every marker present in the kernel, we can keep a count of how many probes provided by the same module are still present. If we see that we unregistered the last probe pointing to this module, we issue a synchronize_sched().
>
> It adds no data structure and keeps the same order of complexity as what is already there; we only have to do 2 passes over the marker structures: the first one finds the module associated with the callback and the second disables the callbacks and keeps a count of the number of callbacks associated with the module.
>
> Mathieu
>
> P.S.: here is the code.
>
> Linux Kernel Markers - Architecture Independent code. Provide internal synchronize_sched() in batch.
>
> The goal is to do the synchronize just after we unregister the last probe handler provided by a given module. Since the unregistration functions iterate over every marker present in the kernel, we can keep a count of how many probes provided by the same module are still present. If we see that we unregistered the last probe pointing to this module, we issue a synchronize_sched().
>
> It adds no data structure and keeps the same order of complexity as what is already there; we only have to do 2 passes over the marker structures: the first one finds the module associated with the callback and the second disables the callbacks and keeps a count of the number of callbacks associated with the module.

Looks good to me, please incorporate this in the next round of the markers patch series.
Re: 2.6.22 -mm merge plans
* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > kprobes does this kind of synchronization internally, so the marker wrapper should probably as well.
> >
> > The problem appears on heavily loaded systems. Doing 50 synchronize_sched() calls in a row can take up to a few seconds on a 4-way machine. This is why I prefer to do it in the module to which the callbacks belong.
>
> We recently had a discussion on a batch unregistration interface for kprobes. I'm not very happy with having such different interfaces for different kinds of probe registration.

Ok, I've had a look at the kprobes batch registration mechanisms and, well, they do not look well suited for the markers. Adding supplementary data structures such as linked lists of probes does not look like a good match.

However, I agree with you that providing a similar API is good. Therefore, here is my proposal:

The goal is to do the synchronize just after we unregister the last probe handler provided by a given module. Since the unregistration functions iterate over every marker present in the kernel, we can keep a count of how many probes provided by the same module are still present. If we see that we unregistered the last probe pointing to this module, we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what is already there; we only have to do 2 passes over the marker structures: the first one finds the module associated with the callback and the second disables the callbacks and keeps a count of the number of callbacks associated with the module.

Mathieu

P.S.: here is the code.

Linux Kernel Markers - Architecture Independent code. Provide internal synchronize_sched() in batch.

The goal is to do the synchronize just after we unregister the last probe handler provided by a given module. Since the unregistration functions iterate over every marker present in the kernel, we can keep a count of how many probes provided by the same module are still present. If we see that we unregistered the last probe pointing to this module, we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what is already there; we only have to do 2 passes over the marker structures: the first one finds the module associated with the callback and the second disables the callbacks and keeps a count of the number of callbacks associated with the module.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
---
 kernel/module.c |   62 ++--
 1 file changed, 52 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/kernel/module.c
===
--- linux-2.6-lttng.orig/kernel/module.c	2007-05-10 14:48:28.0 -0400
+++ linux-2.6-lttng/kernel/module.c	2007-05-10 15:38:27.0 -0400
@@ -404,8 +404,12 @@
 }
 
 /* Sets a range of markers to a disabled state : unset the enable bit and
- * provide the empty callback. */
+ * provide the empty callback.
+ * Keep a count of other markers connected to the same module as the one
+ * provided as parameter.
+ */
 static int marker_remove_probe_range(const char *name,
+		struct module *probe_module,
+		int *ref_count,
 		const struct __mark_marker *begin,
 		const struct __mark_marker *end)
 {
@@ -413,12 +417,17 @@
 	int found = 0;
 
 	for (iter = begin; iter < end; iter++) {
-		if (strcmp(name, iter->mdata->name) == 0) {
-			marker_set_enable(iter->enable, 0,
-				iter->mdata->flags);
-			iter->mdata->call = __mark_empty_function;
-			found++;
+		if (strcmp(name, iter->mdata->name) != 0) {
+			if (probe_module)
+				if (__module_text_address(
+					(unsigned long)iter->mdata->call)
+					== probe_module)
+					(*ref_count)++;
+			continue;
 		}
+		marker_set_enable(iter->enable, 0, iter->mdata->flags);
+		iter->mdata->call = __mark_empty_function;
+		found++;
 	}
 	return found;
 }
@@ -450,6 +459,29 @@
 	return found;
 }
 
+/* Get the module to which the probe handler's text belongs.
+ * Called with module_mutex taken.
+ * Returns NULL if the probe handler is not in a module. */
+static struct module *__marker_get_probe_module(const char *name)
+{
+	struct module *mod;
+	const struct __mark_marker *iter;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		for (iter = mod->markers;
+			iter < mod->markers+mod->num_markers; iter++) {
+			if (strcmp(name,
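[The patch above is cut off before the counting logic completes. For illustration only, here is a minimal, self-contained userspace sketch of the two-pass scheme the changelog describes. The structures, the explicit owner field, and synchronize_sched_stub() are invented stand-ins; the real patch derives ownership from __module_text_address() on the callback pointer and calls the kernel's synchronize_sched().]

/*
 * Hypothetical userspace sketch of the two-pass batched unregistration:
 * pass 1 finds the module owning the probe being removed, pass 2 disables
 * the matching markers and counts other probes still owned by that module.
 * Only when the count drops to zero is the (stubbed) RCU sync issued.
 */
#include <stdio.h>
#include <string.h>

struct module { const char *name; };

struct marker {
    const char *name;        /* marker name */
    void (*call)(void);      /* probe callback currently attached */
    struct module *owner;    /* module providing the callback (stand-in) */
    int enabled;
};

static void empty_probe(void) { }          /* stands in for __mark_empty_function */

static void synchronize_sched_stub(void)   /* stands in for synchronize_sched() */
{
    printf("synchronize_sched() issued once, after the last probe went away\n");
}

/* Pass 1: find the module owning the probe attached to marker "name". */
static struct module *probe_module_of(struct marker *m, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(m[i].name, name) == 0)
            return m[i].owner;
    return NULL;
}

/* Pass 2: disable matching markers, count other probes still owned by mod. */
static void remove_probe(struct marker *m, int n, const char *name)
{
    struct module *mod = probe_module_of(m, n, name);
    int ref_count = 0;

    for (int i = 0; i < n; i++) {
        if (strcmp(m[i].name, name) != 0) {
            if (mod && m[i].owner == mod)
                ref_count++;
            continue;
        }
        m[i].enabled = 0;
        m[i].call = empty_probe;
        m[i].owner = NULL;
    }
    if (mod && ref_count == 0)   /* last probe from this module: one RCU sync */
        synchronize_sched_stub();
}

int main(void)
{
    struct module lttng = { "lttng_probes" };
    struct marker markers[] = {
        { "sched_switch", empty_probe, &lttng, 1 },
        { "irq_entry",    empty_probe, &lttng, 1 },
    };

    remove_probe(markers, 2, "sched_switch"); /* no sync yet: irq_entry remains */
    remove_probe(markers, 2, "irq_entry");    /* last one: the sync happens here */
    return 0;
}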
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/10/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Huh? You already stated one version of it above, namely updatedb. But
>
> So a swapping problem with updatedb should be unusual and we'd like to see if we can fix it without resorting to prefetching.
>
> I know the theory behind swap prefetching, and I'm not saying it doesn't work, so I'll snip the rest of that.

updatedb is only part of the problem. The other part is that the kernel has an opportunity to preemptively return some of the evicted working set to RAM before I ask for it. No fancy use-once algorithm is going to address that, so your solution is provably incomplete for my problem. What's so hard to understand about that?
Re: swap-prefetch: 2.6.22 -mm merge plans
Ray Lee wrote:
>> Huh? You already stated one version of it above, namely updatedb. But

On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote:
> So a swapping problem with updatedb should be unusual and we'd like to see if we can fix it without resorting to prefetching.
> I know the theory behind swap prefetching, and I'm not saying it doesn't work, so I'll snip the rest of that.

I've not run updatedb in years, so I have no idea what it does to a modern kernel. It used to be an unholy terror of slab fragmentation and displacing user memory. The case of streaming kernel metadata IO is probably not quite as easy as streaming file IO.

Ray Lee wrote:
>> You said, effectively: "Use-once could be improved to deal with updatedb". I said I've been reading emails from Rik and others talking about that for four years now, and we're still talking about it. Were it merely updatedb, I'd say us userspace folk should step up and rewrite the damn thing to amortize its work. However, I and others feel it's only an example -- glaring, obviously -- of a more pervasive issue. A small issue, to be sure, but an issue nevertheless.

On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote:
> It isn't going to get fixed unless people complain about it. If you cover the use-once problem with swap prefetching, then it will never get fixed.

The policy people need to clean this up once and for all at some point. clameter's targeted reclaim bits for slub look like a plausible tactic, but are by no means comprehensive. Things need to attempt to eat their own tails before eating everyone else alive. Maybe we need to take hits on things such as badari's dd's to resolve the pathologies.

--
wli
Re: swap-prefetch: 2.6.22 -mm merge plans
Ray Lee wrote:
> On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Ray Lee wrote:
> > > On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > > You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.
> > >
> > > Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.
> >
> > I'd like to try helping. Tell me your problem.
>
> Huh? You already stated one version of it above, namely updatedb. But

So a swapping problem with updatedb should be unusual and we'd like to see if we can fix it without resorting to prefetching.

I know the theory behind swap prefetching, and I'm not saying it doesn't work, so I'll snip the rest of that.

> > What's wrong with the use-once we have? What improvements are you talking about?
>
> You said, effectively: "Use-once could be improved to deal with updatedb". I said I've been reading emails from Rik and others talking about that for four years now, and we're still talking about it. Were it merely updatedb, I'd say us userspace folk should step up and rewrite the damn thing to amortize its work. However, I and others feel it's only an example -- glaring, obviously -- of a more pervasive issue. A small issue, to be sure, but an issue nevertheless.

It isn't going to get fixed unless people complain about it. If you cover the use-once problem with swap prefetching, then it will never get fixed.

> > I don't think it is about energy or being mean, I'm just stating the issues I have with it.
>
> Nick, I in no way think you're being mean, and I'm sorry if I've given you that impression. However, if you're just stating the issues you have with it, then can I assume that you won't lobby against having this experiment merged?

Anybody is free to merge anything into their kernel. And if somebody asks for my issues with the swap prefetching patch, then I'll give them :)

--
SUSE Labs, Novell Inc.
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> Ray Lee wrote:
> > On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.
> >
> > Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.
>
> I'd like to try helping. Tell me your problem.

Huh? You already stated one version of it above, namely updatedb. But let's put this another way, shall we? A gedankenexperiment, if you will.

Say we have a perfect swap-out algorithm that can choose exactly what needs to be evicted to disk. ('Perfect', of course, is dependent upon one's metric, but let's go with "maximizes overall system utilization and minimizes IO wait time." Arbitrary, but hey.)

So, great, the right things got swapped out. Anything else that could have been chosen would have caused more overall IO wait. Yay us.

So what happens when those processes that triggered the swap-outs go away? (Firefox is closed, I stop hitting my local copy of a database, whatever.) Well, currently, nothing. What happens when I switch workspaces and try to use my email program? Swap-ins. Okay, so why didn't the system swap that stuff in preemptively? Why am I sitting there waiting for something that it could have already done in the background?

A new swap-out algorithm, be it use-once, Clock-Pro, or perfect foreknowledge, isn't going to change that issue. Swap prefetch does.

> > Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago.
>
> What's wrong with the use-once we have? What improvements are you talking about?

You said, effectively: "Use-once could be improved to deal with updatedb". I said I've been reading emails from Rik and others talking about that for four years now, and we're still talking about it. Were it merely updatedb, I'd say us userspace folk should step up and rewrite the damn thing to amortize its work. However, I and others feel it's only an example -- glaring, obviously -- of a more pervasive issue. A small issue, to be sure, but an issue nevertheless.

In general, I/others are talking about improving the desktop experience of running too much on a RAM-limited machine. (Which, in my case, is with a gig and a 2.2GHz processor.) Or restated: the desktop experience occasionally sucks for me, and I don't think I'm alone. There may be a heuristic, completely isolated from userspace (and so isn't an API the kernel has to support! -- if it doesn't work, we can rip it out again), that may mitigate the suckiness. Let's try it.

> > So enough with the stop energy, okay? You're better than that.
>
> I don't think it is about energy or being mean, I'm just stating the issues I have with it.

Nick, I in no way think you're being mean, and I'm sorry if I've given you that impression. However, if you're just stating the issues you have with it, then can I assume that you won't lobby against having this experiment merged?
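[The behaviour Ray is asking for amounts to an idle-time kernel thread that trickles swapped pages back into memory while keeping their swap slots, so they can be dropped again for free under pressure. The sketch below is a hypothetical userspace mock-up of that policy loop, not code from Con's swap prefetch patch; every helper in it is an invented stub.]

/*
 * Hypothetical daemon-style prefetch loop: only work when the system is
 * idle and there is spare RAM, and stop when nothing is left in swap.
 */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool system_is_idle(void)       { return true; }  /* stub: no CPU/IO load   */
static long free_pages_available(void) { return 1024; }  /* stub: spare RAM, pages */
static long pop_swapped_page(void)     { return -1; }    /* stub: next swapped-out page, -1 if none */

/* Stub: read the page back into memory but KEEP the swap copy, so it can
 * be discarded again without IO if memory pressure returns. */
static void readin_page(long swp) { printf("prefetched swap slot %ld\n", swp); }

int main(void)
{
    for (;;) {
        if (!system_is_idle() || free_pages_available() == 0) {
            sleep(5);               /* back off while the system is busy */
            continue;
        }
        long swp = pop_swapped_page();
        if (swp < 0)
            break;                  /* nothing left in swap to prefetch */
        readin_page(swp);
    }
    return 0;
}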
Re: swap-prefetch: 2.6.22 -mm merge plans
On Thursday 10 May 2007 13:48, Ray Lee wrote:
> On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.
>
> Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.
>
> Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago.
>
> So enough with the stop energy, okay? You're better than that.
>
> Con? He is right that the last feature to go in needs to work gracefully with what's there now. However, it's not unheard of for authors of other sections of code to help out with incompatibilities by answering politely phrased questions for guidance. Though the intersection of users between cpusets and desktop systems seems small indeed.

Let's just set the record straight. I actually discussed cpusets over a year ago in this nonsense and was told by sgi folk there was no need to get my head around cpusets and that honouring node placement should be enough, which, by the way, swap prefetch does. So I by no means ignored this; we just hit an impasse on just how much more featured it should be for the sake of a goddamn home desktop pc feature.

Anyway, why the hell am I resurrecting this thread? The code is declared dead already. Leave it be.

--
-ck
Re: swap-prefetch: 2.6.22 -mm merge plans
Ray Lee wrote:
> On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.
>
> Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.

I'd like to try helping. Tell me your problem.

> Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago.

What's wrong with the use-once we have? What improvements are you talking about?

> So enough with the stop energy, okay? You're better than that.

I don't think it is about energy or being mean, I'm just stating the issues I have with it.

--
SUSE Labs, Novell Inc.
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.

Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.

Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago.

So enough with the stop energy, okay? You're better than that.

Con? He is right that the last feature to go in needs to work gracefully with what's there now. However, it's not unheard of for authors of other sections of code to help out with incompatibilities by answering politely phrased questions for guidance. Though the intersection of users between cpusets and desktop systems seems small indeed.
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote:
> On Thursday 10 May 2007 10:05, Nick Piggin wrote:
> > I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet.
>
> No matter how you spin it, you're the gatekeeper.

If raising unaddressed issues means closing a gate, then OK. You can equally open it by answering them.

> > You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases.
>
> Not this again? Proof was there ages ago that it helped and no proof that it harmed could be found, yet you cunningly pretend it never existed. It's been done to death and I'm sick of this.

I said I know it can help. Do you know how many patches I have that help some workloads but are not merged? That's just the way it works.

What I have seen is that it helps the case where you force out a huge amount of swap. OK, that's nice -- the base case obviously works. You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.

> > Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged.
> >
> > As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution.
>
> That's as close to a 3 as I'm likely to get out of you.

If you're not willing to try making it work with existing code, among other things, then yes it will be difficult to get it merged. That's not going to change.

--
SUSE Labs, Novell Inc.
Re: swap-prefetch: 2.6.22 -mm merge plans
On Thursday 10 May 2007 10:05, Nick Piggin wrote:
> Con Kolivas wrote:
> > Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandary now.
> >
> > Do I:
> >
> > 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change
> >
> > 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code].
> >
> > or
> >
> > 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on.
> >
> > Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline, I think the choice is clear.
> >
> > Nick, since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please.
>
> I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet.

No matter how you spin it, you're the gatekeeper.

> You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases.

Not this again? Proof was there ages ago that it helped and no proof that it harmed could be found, yet you cunningly pretend it never existed. It's been done to death and I'm sick of this.

> Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged.
>
> As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution.

That's as close to a 3 as I'm likely to get out of you.

Andrew, you'll be relieved to know I would like you to throw swap prefetch and related patches into the bin. Thanks.

--
-ck
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote:
> Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandary now.
>
> Do I:
>
> 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change
>
> 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code].
>
> or
>
> 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on.
>
> Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline, I think the choice is clear.
>
> Nick, since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please.

I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet.

You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases.

Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged.

As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution.

--
SUSE Labs, Novell Inc.
Re: swap-prefetch: 2.6.22 -mm merge plans
On Saturday 05 May 2007 18:42, Con Kolivas wrote:
> On Friday 04 May 2007 22:10, Con Kolivas wrote:
> > On Friday 04 May 2007 18:52, Ingo Molnar wrote:
> > > agreed. Con, IIRC you wrote a testcase for this, right? Could you please send us the results of that testing?
> >
> > Yes, sorry, it's a crappy test app but works on 32bit. Timed with prefetch disabled and then enabled, swap prefetch saves ~5 seconds on average hardware on this one test case. I had many users try this and the results were between 2 and 10 seconds, but always showed a saving on this testcase. This effect easily occurs on printing a big picture, editing a large file, compressing an iso image or whatever in real world workloads. Smaller, but much more frequent effects of this over the course of a day obviously also occur and do add up.
>
> Here's a better swap prefetch tester. Instructions in file.
>
> Machine with 2GB ram and 2GB swapfile
>
> Prefetch disabled:
> ./sp_tester
> Timed portion 53397 milliseconds
>
> Enabled:
> ./sp_tester
> Timed portion 26351 milliseconds
>
> Note huge time difference.

Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandary now.

Do I:

1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change

2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code].

or

3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on.

Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline, I think the choice is clear.

Nick, since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please.

--
-ck

P.S. Ingo, thanks (and sorry) for your involvement here.
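[For readers without sp_tester: its idea can be approximated in userspace roughly as below. This is a guessed reconstruction, not Con's actual tester; the buffer sizes and the sleep interval are arbitrary assumptions and would need tuning to the machine's RAM.]

/*
 * Rough approximation of an sp_tester-style benchmark: establish a working
 * set, force it out to swap with a memory hog, free the hog, idle for a
 * while (a prefetching kernel can pull the working set back in during this
 * window), then time how long it takes to touch the working set again.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define MB          (1024UL * 1024UL)
#define WORKING_SET (512 * MB)   /* memory we care about ("the desktop apps") */
#define HOG_SIZE    (1536 * MB)  /* enough to push the working set into swap */

static void touch(char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 4096)  /* one write per page */
        buf[i] = 1;
}

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void)
{
    char *work = malloc(WORKING_SET);
    char *hog;
    double t0, t1;

    if (!work)
        return 1;
    touch(work, WORKING_SET);    /* establish the working set */

    hog = malloc(HOG_SIZE);      /* hog forces the working set out to swap */
    if (!hog)
        return 1;
    touch(hog, HOG_SIZE);
    free(hog);                   /* hog exits; RAM is now free again */

    sleep(600);                  /* idle window in which prefetch can run */

    t0 = now_ms();
    touch(work, WORKING_SET);    /* how long until the apps are usable again? */
    t1 = now_ms();
    printf("Timed portion %.0f milliseconds\n", t1 - t0);
    free(work);
    return 0;
}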
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Thu, 10 May 2007, Nick Piggin wrote:
> > > The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is?
> >
> > The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't.
>
> But... I thought the page lock was now taking care of that in your scheme? truncate_inode_pages has to wait for the page lock, then it finds the page is mapped and... ahh, it finds the copiee page is not mapped, so doesn't do its own little unmap_mapping_range, and the copied page squeaks through. Drat.
>
> I really think the truncate_count solution worked better, for truncation anyway. There may be persuasive reasons you need the page lock for invalidation: I gave up on trying to understand the required behaviour(s) for invalidation.
>
> So, bring back (the original use of, not my tree marker use of) truncate_count?

Hmm, you probably don't want to do that, because there was some pleasure in removing the strange barriers associated with it.

> A second unmap_mapping_range is just one line of code - but it sure feels like a defeat to me, calling the whole exercise into question. (But then, you'd be right to say my perfectionism made it impossible for me to come up with any solution to the invalidation issues.)

Well we could bring back the truncate_count, but I think that sucks because that's moving work into the page fault handler in order to avoid a bit of work when truncating mapped files.

> > > But that's a change we could have made at any time if we'd bothered, it's not really the issue here.
> >
> > I don't see how you could, because you need to increment truncate_count.
>
> Though indeed we did so, I don't see that we needed to increment truncate_count in that case (nobody could be coming through do_no_page on that file, when there are no mappings of it).

Of course :P

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 10 May 2007, Nick Piggin wrote:
> > The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is?
>
> The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't.

But... I thought the page lock was now taking care of that in your scheme? truncate_inode_pages has to wait for the page lock, then it finds the page is mapped and... ahh, it finds the copiee page is not mapped, so doesn't do its own little unmap_mapping_range, and the copied page squeaks through. Drat.

I really think the truncate_count solution worked better, for truncation anyway. There may be persuasive reasons you need the page lock for invalidation: I gave up on trying to understand the required behaviour(s) for invalidation.

So, bring back (the original use of, not my tree marker use of) truncate_count? A second unmap_mapping_range is just one line of code - but it sure feels like a defeat to me, calling the whole exercise into question. (But then, you'd be right to say my perfectionism made it impossible for me to come up with any solution to the invalidation issues.)

> > Suspect you'd need a barrier of some kind between the i_size_write and the mapping_mapped test?
>
> The unmap_mapping_range that runs after the truncate_inode_pages should run in the correct order, I believe.

Yes, if there's going to be that backup call, the first won't really need a barrier.

> > But that's a change we could have made at any time if we'd bothered, it's not really the issue here.
>
> I don't see how you could, because you need to increment truncate_count.

Though indeed we did so, I don't see that we needed to increment truncate_count in that case (nobody could be coming through do_no_page on that file, when there are no mappings of it).

> But I believe this is fixing the issue, even if it does so in a peripheral manner, because it avoids the added cost for unmapped files.

It's a small improvement to your common case, I agree.

Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Wed, 9 May 2007, Nick Piggin wrote:
> > Hugh Dickins wrote:
> > > On Wed, 2 May 2007, Nick Piggin wrote:
> > > > > But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.
> > > >
> > > > OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from.
> > >
> > > I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into pagetable: that looks to be so in your __do_fault.
> >
> > Hmm, on second thoughts, I think I was right the first time, and do need the unmap after the pages are truncated. With the lock_page code, after the first unmap, we can get new ptes mapping pages, and subsequently they can be COWed and then the original pte zapped before the truncate loop checks it.
>
> The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is?

The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't.

> > However, I wonder if we can't test mapping_mapped before the spinlock, which would make most truncates cheaper?
>
> Slightly cheaper, yes, though I doubt it'd be much in comparison with actually doing any work in unmap_mapping_range or truncate_inode_pages.

But if we're supposing the common case for truncate is unmapped mappings, then the main cost there will be the locking, which I'm trying to avoid. Hopefully with this patch, most truncate workloads would get faster, even though truncating mapped files is going to be unavoidably slower.

> Suspect you'd need a barrier of some kind between the i_size_write and the mapping_mapped test?

The unmap_mapping_range that runs after the truncate_inode_pages should run in the correct order, I believe.

> But that's a change we could have made at any time if we'd bothered, it's not really the issue here.

I don't see how you could, because you need to increment truncate_count. But I believe this is fixing the issue, even if it does so in a peripheral manner, because it avoids the added cost for unmapped files.

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Wed, 9 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Wed, 2 May 2007, Nick Piggin wrote:
> > > > But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.
> > >
> > > OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from.
> >
> > I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into pagetable: that looks to be so in your __do_fault.
>
> Hmm, on second thoughts, I think I was right the first time, and do need the unmap after the pages are truncated. With the lock_page code, after the first unmap, we can get new ptes mapping pages, and subsequently they can be COWed and then the original pte zapped before the truncate loop checks it.

The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is?

> However, I wonder if we can't test mapping_mapped before the spinlock, which would make most truncates cheaper?

Slightly cheaper, yes, though I doubt it'd be much in comparison with actually doing any work in unmap_mapping_range or truncate_inode_pages.

Suspect you'd need a barrier of some kind between the i_size_write and the mapping_mapped test? But that's a change we could have made at any time if we'd bothered, it's not really the issue here.

Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:
> > > But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.
> >
> > OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from.
>
> I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into pagetable: that looks to be so in your __do_fault.

Hmm, on second thoughts, I think I was right the first time, and do need the unmap after the pages are truncated. With the lock_page code, after the first unmap, we can get new ptes mapping pages, and subsequently they can be COWed and then the original pte zapped before the truncate loop checks it.

However, I wonder if we can't test mapping_mapped before the spinlock, which would make most truncates cheaper?

--
SUSE Labs, Novell Inc.

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c	2007-04-24 15:02:51.0 +1000
+++ linux-2.6/mm/filemap.c	2007-05-09 17:30:47.0 +1000
@@ -2579,8 +2579,7 @@
 	if (rw == WRITE) {
 		write_len = iov_length(iov, nr_segs);
 		end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
-		if (mapping_mapped(mapping))
-			unmap_mapping_range(mapping, offset, write_len, 0);
+		unmap_mapping_range(mapping, offset, write_len, 0);
 	}
 
 	retval = filemap_write_and_wait(mapping);

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c	2007-05-09 17:25:28.0 +1000
+++ linux-2.6/mm/memory.c	2007-05-09 17:30:22.0 +1000
@@ -1956,6 +1956,9 @@
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
+	if (!mapping_mapped(mapping))
+		return;
+
 	/* Check for overflow. */
 	if (sizeof(holelen) > sizeof(hlen)) {
 		long long holeend =
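[A toy illustration of what the patch above changes: the cheap mapping_mapped() test moves inside unmap_mapping_range() itself, so callers such as the direct-IO write path no longer need their own check. The types and counter below are invented stand-ins for the kernel's i_mmap trees; this is not kernel code.]

/*
 * Userspace mock: unmap_mapping_range() bails out early when nothing maps
 * the file, so the common unmapped-file case skips the expensive walk.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_mapping {
    int mapped_count;   /* stand-in for the i_mmap / i_mmap_nonlinear trees */
};

static bool mapping_mapped(struct toy_mapping *m)
{
    return m->mapped_count != 0;
}

static void unmap_mapping_range(struct toy_mapping *m, long start, long len)
{
    if (!mapping_mapped(m))   /* early return: unmapped files skip the */
        return;               /* tree walk and the locking entirely    */
    printf("walking mappings to zap ptes for [%ld, %ld)\n", start, start + len);
}

/* Caller, e.g. the direct-IO path: no mapping_mapped() test needed here any
 * more; the callee short-circuits for the common unmapped case. */
int main(void)
{
    struct toy_mapping unmapped = { .mapped_count = 0 };
    struct toy_mapping mapped   = { .mapped_count = 3 };

    unmap_mapping_range(&unmapped, 0, 4096);   /* cheap no-op */
    unmap_mapping_range(&mapped, 0, 4096);     /* does the real work */
    return 0;
}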
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 10 May 2007, Nick Piggin wrote: The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is? The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't. But... I thought the page lock was now taking care of that in your scheme? truncate_inode_pages has to wait for the page lock, then it finds the page is mapped and... ahh, it finds the copiee page is not mapped, so doesn't do its own little unmap_mapping_range, and the copied page squeaks through. Drat. I really think the truncate_count solution worked better, for truncation anyway. There may be persuasive reasons you need the page lock for invalidation: I gave up on trying to understand the required behaviour(s) for invalidation. So, bring back (the original use of, not my tree marker use of) truncate_count? Hmm, you probably don't want to do that, because there was some pleasure in removing the strange barriers associated with it. A second unmap_mapping_range is just one line of code - but it sure feels like a defeat to me, calling the whole exercise into question. (But then, you'd be right to say my perfectionism made it impossible for me to come up with any solution to the invalidation issues.) Suspect you'd need a barrier of some kind between the i_size_write and the mapping_mapped test? The unmap_mapping_range that runs after the truncate_inode_pages should run in the correct order, I believe. Yes, if there's going to be that backup call, the first won't really need a barrier. But that's a change we could have made at any time if we'd bothered, it's not really the issue here. I don't see how you could, because you need to increment truncate_count. Though indeed we did so, I don't see that we needed to increment truncate_count in that case (nobody could be coming through do_no_page on that file, when there are no mappings of it). But I believe this is fixing the issue, even if it does so in a peripheral manner, because it avoids the added cost for unmapped files. It's a small improvement to your common case, I agree. Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
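To spell out the interleaving being described here, the sketch below pieces the scenario together from the discussion above; it is an illustration only, not code or comments quoted from mm/:

/*
 * truncate side                          concurrent fault side
 * -------------                          ---------------------
 * i_size_write(inode, new_size);
 * unmap_mapping_range(mapping, ...);
 *                                        ->fault had already looked up the
 *                                        old page (before i_size moved) and
 *                                        now installs a pte for it
 *                                        write fault: the page is COWed, the
 *                                        anon copy is mapped and the file
 *                                        pte goes away
 * truncate_inode_pages(mapping, ...);
 *   lock_page(page);
 *   the page is no longer mapped, so
 *   no per-page unmap is issued for it
 *
 * Result: the COWed copy of data beyond the new i_size survives in the
 * address space (the "squeaks through" case above), which is why either a
 * second unmap_mapping_range() or the old truncate_count retry is still
 * needed.
 */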
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Thu, 10 May 2007, Nick Piggin wrote: The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is? The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't. But... I thought the page lock was now taking care of that in your scheme? truncate_inode_pages has to wait for the page lock, then it finds the page is mapped and... ahh, it finds the copiee page is not mapped, so doesn't do its own little unmap_mapping_range, and the copied page squeaks through. Drat. I really think the truncate_count solution worked better, for truncation anyway. There may be persuasive reasons you need the page lock for invalidation: I gave up on trying to understand the required behaviour(s) for invalidation. So, bring back (the original use of, not my tree marker use of) truncate_count? Hmm, you probably don't want to do that, because there was some pleasure in removing the strange barriers associated with it. A second unmap_mapping_range is just one line of code - but it sure feels like a defeat to me, calling the whole exercise into question. (But then, you'd be right to say my perfectionism made it impossible for me to come up with any solution to the invalidation issues.) Well we could bring back the truncate_count, but I think that sucks because that's moving work into the page fault handler in order to avoid a bit of work when truncating mapped files. But that's a change we could have made at any time if we'd bothered, it's not really the issue here. I don't see how you could, because you need to increment truncate_count. Though indeed we did so, I don't see that we needed to increment truncate_count in that case (nobody could be coming through do_no_page on that file, when there are no mappings of it). Of course :P -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
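For readers who do not remember the scheme being weighed here, the old truncate_count approach looked roughly like the sketch below. It is a simplified paraphrase of the 2.6-era do_no_page()/vmtruncate() logic rather than an exact quote, and it shows the retry work in the fault path that is being objected to:

/*
 * fault side (do_no_page, simplified):
 *
 *   retry:
 *     sequence = mapping->truncate_count;
 *     smp_rmb();
 *     new_page = vma->vm_ops->nopage(...);      // checks i_size
 *     ...
 *     spin_lock(&mm->page_table_lock);
 *     if (sequence != mapping->truncate_count) {
 *             spin_unlock(&mm->page_table_lock);
 *             page_cache_release(new_page);
 *             goto retry;                       // truncate ran under us
 *     }
 *     // safe to install the pte
 *
 * truncate side (vmtruncate, simplified):
 *
 *     i_size_write(inode, new_size);
 *     mapping->truncate_count++;
 *     smp_wmb();
 *     unmap_mapping_range(mapping, ...);
 *     truncate_inode_pages(mapping, new_size);
 */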
Re: swap-prefetch: 2.6.22 -mm merge plans
On Saturday 05 May 2007 18:42, Con Kolivas wrote: On Friday 04 May 2007 22:10, Con Kolivas wrote: On Friday 04 May 2007 18:52, Ingo Molnar wrote: agreed. Con, IIRC you wrote a testcase for this, right? Could you please send us the results of that testing? Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch disabled and then enabled swap prefetch saves ~5 seconds on average hardware on this one test case. I had many users try this and the results were between 2 and 10 seconds, but always showed a saving on this testcase. This effect easily occurs on printing a big picture, editing a large file, compressing an iso image or whatever in real world workloads. Smaller, but much more frequent effects of this over the course of a day obviously also occur and do add up. Here's a better swap prefetch tester. Instructions in file. Machine with 2GB ram and 2GB swapfile Prefetch disabled: ./sp_tester Timed portion 53397 milliseconds Enabled: ./sp_tester Timed portion 26351 milliseconds Note huge time difference. Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandry now. Do I: 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code]. or 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on. Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline I think the choice is clear. Nick since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please. -- -ck P.S. Ingo, thanks (and sorry) for your involvement here. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote: Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandry now. Do I: 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code]. or 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on. Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline I think the choice is clear. Nick since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please. I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet. You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases. Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged. As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On Thursday 10 May 2007 10:05, Nick Piggin wrote: Con Kolivas wrote: Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandry now. Do I: 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code]. or 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on. Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline I think the choice is clear. Nick since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please. I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet. No matter how you spin it, you're the gatekeeper. You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases. Not this again? Proof was there ages ago that it helped and no proof that it harmed could be found yet you cunningly pretend it never existed. It's been done to death and I'm sick of this. Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged. As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution. That's as close to a 3 as I'm likely to get out of you. Andrew you'll be relieved to know I would like you to throw swap prefetch and related patches into the bin. Thanks. -- -ck - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote: On Thursday 10 May 2007 10:05, Nick Piggin wrote: I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet. No matter how you spin it, you're the gatekeeper. If raising unaddressed issues means closing a gate, then OK. You can equally open it by answering them. You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases. Not this again? Proof was there ages ago that it helped and no proof that it harmed could be found yet you cunningly pretend it never existed. It's been done to death and I'm sick of this. I said I know it can help. Do you know how many patches I have that help some workloads but are not merged? That's just the way it works. What I have seen is it helps the case where you force out a huge amount of swap. OK, that's nice -- the base case obviously works. You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged. As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution. That's as close to a 3 as I'm likely to get out of you. If you're not willing to try making it work with existing code, among other things, then yes it will be difficult to get it merged. That's not going to change. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing a airy hope with nothing to back up when or if that work would ever occur. Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago. So enough with the stop energy, okay? You're better than that. Con? He is right about the last feature to go in needs to work gracefully with what's there now. However, it's not unheard of for authors of other sections of code to help out with incompatibilities by answering politely phrased questions for guidance. Though the intersection of users between cpusets and desktop systems seems small indeed. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Ray Lee wrote: On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing a airy hope with nothing to back up when or if that work would ever occur. I'd like to try helping. Tell me your problem. Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago. What's wrong with the use-once we have? What improvements are you talking about? So enough with the stop energy, okay? You're better than that. I don't think it is about energy or being mean, I'm just stating the issues I have with it. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On Thursday 10 May 2007 13:48, Ray Lee wrote: On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing a airy hope with nothing to back up when or if that work would ever occur. Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago. So enough with the stop energy, okay? You're better than that. Con? He is right about the last feature to go in needs to work gracefully with what's there now. However, it's not unheard of for authors of other sections of code to help out with incompatibilities by answering politely phrased questions for guidance. Though the intersection of users between cpusets and desktop systems seems small indeed. Let's just set the record straight. I actually discussed cpusets over a year ago in this nonsense and was told by sgi folk there was no need to get my head around cpusets and honouring node placement should be enough which, by the way, swap prefetch does. So I by no means ignored this; we just hit an impasse on just how much more featured it should be for the sake of a goddamn home desktop pc feature. Anyway why the hell am I resurrecting this thread? The code is declared dead already. Leave it be. -- -ck - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: Ray Lee wrote: On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing a airy hope with nothing to back up when or if that work would ever occur. I'd like to try helping. Tell me your problem. Huh? You already stated one version of it above, namely updatedb. But let's put this another way, shall we? A gedankenexperiment, if you will. Say we have a perfect swap-out algorithm that can choose exactly what needs to be evicted to disk. ('Perfect', of course, is dependent upon one's metric, but let's go with maximizes overall system utilization and minimizes IO wait time. Arbitrary, but hey.) So, great, the right things got swapped out. Anything else that could have been chosen would have caused more overall IO Wait. Yay us. So what happens when those processes that triggered the swap-outs go away? (Firefox is closed, I stop hitting my local copy of a database, whatever.) Well, currently, nothing. What happens when I switch workspaces and try to use my email program? Swap-ins. Okay, so why didn't the system swap that stuff in preemptively? Why am I sitting there waiting for something that it could have already done in the background? A new swap-out algorithm, be it use-once, Clock-Pro, or perfect foreknowledge isn't going to change that issue. Swap prefetch does. Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago. What's wrong with the use-once we have? What improvements are you talking about? You said, effectively: Use-once could be improved to deal with updatedb. I said I've been reading emails from Rik and others talking about that for four years now, and we're still talking about it. Were it merely updatedb, I'd say us userspace folk should step up and rewrite the damn thing to amortize its work. However, I and others feel it's only an example -- glaring, obviously -- of a more pervasive issue. A small issue, to be sure!, but an issue nevertheless. In general, I/others are talking about improving the desktop experience of running too much on a RAM limited machine. (Which, in my case, is with a gig and a 2.2GHz processor.) Or restated: the desktop experience occasionally sucks for me, and I don't think I'm alone. There may be a heuristic, completely isolated from userspace (and so isn't an API the kernel has to support! -- if it doesn't work, we can rip it out again), that may mitigate the suckiness. Let's try it. So enough with the stop energy, okay? You're better than that. I don't think it is about energy or being mean, I'm just stating the issues I have with it. Nick, I in no way think you're being mean, and I'm sorry if I've given you that impression. However, if you're just stating the issues you have with it, then can I assume that you won't lobby against having this experiment merged? 
- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
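For concreteness, the idle-time heuristic Ray describes can be sketched as below. Every identifier is invented for the illustration; this is not Con's kprefetchd code, just the shape of the idea:

#include <stdio.h>

/* All of these helpers are stand-ins invented for the sketch. */
static int system_is_idle(void)               { return 1; }
static int free_ram_above_watermark(void)     { return 0; }
static int have_swapped_entry(void)           { return 1; }
static unsigned long next_swapped_entry(void) { return 0; }
static void read_swap_page_into_cache(unsigned long entry) { (void)entry; }

/* One pass of the idle-time prefetch idea described above. */
static void prefetch_tick(void)
{
	if (!system_is_idle())          /* never compete with real work */
		return;

	/* read recently swapped-out pages back into the page cache, newest
	 * first, stopping as soon as free memory gets tight again */
	while (free_ram_above_watermark() && have_swapped_entry())
		read_swap_page_into_cache(next_swapped_entry());
}

int main(void)
{
	prefetch_tick();
	printf("prefetch pass done\n");
	return 0;
}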
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Fri, 2007-05-04 at 19:23 +1000, Nick Piggin wrote: > These ops could also be put to use in bit spinlocks, buffer lock, and > probably a few other places too. Ok, the performance hit seems to be under control (especially with the bigger benchmark showing actual improvements). There's a little bogon with the PG_waiters bit that you already know about but appart from that it should be ok. I must say I absolutely _LOVE_ the bitops with explicit _lock/_unlock semantics. That should allow us to remove a whole bunch of dodgy barriers and smp_mb__before_whatever_magic_crap() things we have all over the place by providing precisely the expected semantics for bit locks. There are quite a few people who've been trying to do bit locks and I've always been very worried by how easy it is to get the barriers wrong (or too much barriers in the fast path) with these. There are a couple of things we might want to think about regarding the actual API to bit locks... the API you propose is simple, but it might not fit some of the most exotic usage requirements, which typically are related to manipulating other bits along with the lock bit. We might just ignore them though. In the case of the page lock, it's only hitting the slow path, and I would expect other usage scenarii to be similar. Cheers, Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
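As an illustration of what the explicit lock/unlock bitops buy, a bit spinlock built on them needs no hand-rolled barriers in either path. The sketch assumes primitives along the lines proposed (a test_and_set_bit_lock() with acquire ordering and a clear_bit_unlock() with release ordering); it is an illustration, not a quote of the actual patch:

/* Assumed primitives: test_and_set_bit_lock() returns the old bit value and
 * orders like a lock acquire; clear_bit_unlock() orders like a lock release. */
static inline void example_bit_lock(int bitnum, unsigned long *addr)
{
	while (test_and_set_bit_lock(bitnum, addr))
		cpu_relax();	/* spin until the old bit value was clear */
}

static inline void example_bit_unlock(int bitnum, unsigned long *addr)
{
	/* no smp_mb__before_clear_bit() needed: release is part of the op */
	clear_bit_unlock(bitnum, addr);
}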
Re: 2.6.22 -mm merge plans
On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote: ... > git-unionfs.patch > > Does this have a future? Yes! There are many active users who use our unioning functionality. Namespace unification consists of several major parts: 1) Duplicate elimination: This can be handled in the VFS. However, it would clutter up the VFS code with a lot of wrappers around key VFS functions to select the appropriate dentry/inode/etc. object from the underlying branch. (You also need to provide efficient and sane readdir/seekdir semantics which we do with our "On Disk Format" support.) 2) Copyup: Having a unified namespace by itself isn't enough. You also need copy-on-write functionality when the source file is on a read-only branch. This makes unioning much more useful and is one of the main attractions to unionfs users. 3) Whiteouts: Whiteouts are a key unioning construct. As it was pointed out at OLS 2006, they are a properly of the union and _NOT_ a branch. Therefore, they should not be stored persistently on a branch but rather in some "external" storage. 4) You also need unique and *persistent* inode numbers for network f/s exports and other unix tools. 5) You need to provide dynamic branch management functionality: adding, removing, and changing the mode of branches in an existing union. We have considerable experience in unioning file systems for years now; we are currently working on the third generation of the code. All of the above features, and more, are USED by users, and are NEEDED by users. We believe the right approach is the one we've taken, and is the least intrusive: a standalone (stackable) file system that doesn't clutter the VFS, with some small and gradual changes to the VFS to support stacking. As you may have noticed, we have been successfully submitting VFS patches to make the VFS more stacking friendly (not just to Unionfs, but also to eCryptfs which has been in since 2.6.19). The older Union mounts, alas, try to put all that functionality into the VFS. We recognize that some people think that union mounts at the VFS level is the "elegant" approach, but we hope people will listen to us and learn from our experience: unioning may seem simple in principle, but it is difficult in practice. (See http://unionfs.fileystems.org/ for a lot more info.) So we don't think that is a viable long term approach to have all of the unioning functionality in the VFS for two main reasons: (1) If you want users to use a VFS-level unioning functionality ala union-mounts, then you're going to have to implement *all* of the features we have implemented; the VFS clutter and complexity that will result will be very considerable, and we just don't think that it'd happen. (2) Some may suggest to have a lightweight union mounts that only offers a subset of the functionality that's suitable for placing in the VFS. In that case, most unionfs users simply won't use it. You'd need union mounts to provide ALL of the functionality that we have TODAY, if you want users to it. As far as we can see the remaining stumbling block right now is cache coherency between the layers. Whether you provide unioning as a stackable f/s or shoved into the VFS, coherency will have to be addressed. In our upcoming paper and talk at OLS'07, we plan to bring up and discuss several ideas we've explored already on how to resolve this incoherency. Our ideas range from complex graph-based pointer management between objects of all sorts, to simple timestamp-based VFS hooks. 
(We've been experimenting with several approaches and so far we're leaning toward the simple timestamp-based one, again in the interest of keeping the VFS changes simple. We hope to have more results to report by OLS time.) Josef "Jeff" Sipek, on behalf of the Unionfs team. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
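To make the whiteout rule concrete: a union lookup scans branches top-down, and a whiteout entry in a higher branch hides the same name in every branch below it. The toy program below models that rule over string arrays; the ".wh." prefix follows unionfs's convention, but the branch contents, function names, and two-branch setup are invented for the example, and this is not the filesystem's actual lookup path:

#include <stdio.h>
#include <string.h>

/* Each branch is modelled as a NULL-terminated list of the names it holds. */
static const char *branch0[] = { ".wh.passwd", "motd", NULL };	/* top, rw */
static const char *branch1[] = { "passwd", "hosts", NULL };	/* bottom, ro */
static const char **branches[] = { branch0, branch1 };

static int branch_has(const char **branch, const char *name)
{
	for (int i = 0; branch[i]; i++)
		if (!strcmp(branch[i], name))
			return 1;
	return 0;
}

/* Return the branch index providing 'name', or -1 if absent or whited out. */
static int union_lookup(const char *name)
{
	char wh[256];

	snprintf(wh, sizeof(wh), ".wh.%s", name);
	for (int b = 0; b < 2; b++) {
		if (branch_has(branches[b], wh))
			return -1;	/* whiteout hides all lower branches */
		if (branch_has(branches[b], name))
			return b;
	}
	return -1;
}

int main(void)
{
	printf("motd   -> branch %d\n", union_lookup("motd"));	/* 0 */
	printf("hosts  -> branch %d\n", union_lookup("hosts"));	/* 1 */
	printf("passwd -> branch %d\n", union_lookup("passwd"));	/* -1: whited out */
	return 0;
}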
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote: On Friday 04 May 2007 18:52, Ingo Molnar wrote: agreed. Con, IIRC you wrote a testcase for this, right? Could you please send us the results of that testing? Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch disabled and then enabled swap prefetch saves ~5 seconds on average hardware on this one test case. I had many users try this and the results were between 2 and 10 seconds, but always showed a saving on this testcase. This effect easily occurs on printing a big picture, editing a large file, compressing an iso image or whatever in real world workloads. Smaller, but much more frequent effects of this over the course of a day obviously also occur and do add up. I'll try this when I get the scheduler stuff done, and also dig out the "resp1" stuff for "back when." I see the most recent datasets were comparing 2.5.43-mm2 responsiveness with 2.4.19-ck7, you know I always test your stuff ;-) Guess it might need a bit of polish for current hardware, I was testing on *small* machines, deliberately. -- Bill Davidsen <[EMAIL PROTECTED]> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Ingo Molnar wrote: * Nick Piggin <[EMAIL PROTECTED]> wrote: i'm wondering about swap-prefetch: Being able to config all these core heuristics changes is really not that much of a positive. The fact that we might _need_ to config something out, and double the configuration range isn't too pleasing. Well, to the desktop user this is a speculative performance feature that he is willing to potentially waste CPU and IO capacity, in expectation of better performance. On the conceptual level it is _precisely the same thing as regular file readahead_. (with the difference that to me swapahead seems to be quite a bit more intelligent than our current file readahead logic.) This feature has no API or ABI impact at all, it's a pure performance feature. (besides the trivial sysctl to turn it runtime on/off). Here were some of my concerns, and where our discussion got up to. [...snip...] i see no real problem here. We've had heuristics for a _long_ time in various areas of the code. Sometimes they work, sometimes they suck. the flow of this is really easy: distro looking for a feature edge turns it on and announces it, if the feature does not work out for users then user turns it off and complains to distro, if enough users complain then distro turns it off for next release, upstream forgets about this performance feature and eventually removes it once someone notices that it wouldnt even compile in the past 2 main releases. I see no problem here, we did that in the past too with performance features. The networking stack has literally dozens of such small tunable things which get experimented with, and whose defaults do get tuned carefully. Some of the knobs help bandwidth, some help latency. I haven't looked at this code since it first came out and didn't impress me, but I think it would be good to get the current version in. However, when you say "user turns it off" I hope you mean "in /proc/sys with a switch or knob" and not by expecting people to recompile and install a kernel. Then it might take a little memory but wouldn't do something undesirable. Note: I had no bad effect from the code, it just didn't feel faster. On a low memory machine it might help. Of course I have wanted to have a hard limit on memory used for i/o buffers, just to avoid swapping programs to make room for i/o, so to some extent I feel as if this is a fix for a problem we shouldn't have. -- Bill Davidsen <[EMAIL PROTECTED]> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
Sorry for the late response. I went on vacation last week and I'm buried under a mountain of unread mail now. > > Mel's moveable-zone work. > > These patches are what creates ZONE_MOVABLE. The last 6 patches should be > collapsed into a single patch: > > handle-kernelcore=-generic > > I believe Yasunori Goto is looking at these from the perspective of memory > hot-remove and has caught a few bugs in the past. Goto-san may be able to > comment on whether they have been reviewed recently. Hmm, I don't think my review is enough. To be precise, I'm just one user/tester of ZONE_MOVABLE. I have tried to make memory remove patches with Mel-san's ZONE_MOVABLE patch, and the bugs are things that I found in that work. (I'll post these patches in a few days.) > The main complexity is in one function in patch one which determines where > the PFN is in each node for ZONE_MOVABLE. Getting that right so that the > requested amount of kernel memory spread as evenly as possible is just > not straight-forward. From the memory-hotplug view, ZONE_MOVABLE should be aligned to the section size, but MAX_ORDER alignment is enough for other users... Bye. -- Yasunori Goto - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
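The alignment point can be illustrated with a toy calculation. Memory hot-remove operates on whole sparsemem sections, so a ZONE_MOVABLE boundary meant to be removable wants rounding to a section, while the page allocator alone only needs MAX_ORDER alignment. The granularities below are invented example values, not the kernel's per-arch constants:

#include <stdio.h>

/* example granularities, in pages -- not the kernel's actual values */
#define MAX_ORDER_NR_PAGES	(1UL << 10)	/* MAX_ORDER block: 4MB with 4k pages */
#define PAGES_PER_SECTION	(1UL << 15)	/* sparsemem section: 128MB with 4k pages */

static unsigned long align_up(unsigned long pfn, unsigned long align)
{
	return (pfn + align - 1) & ~(align - 1);
}

int main(void)
{
	unsigned long pfn = 123456;	/* some proposed ZONE_MOVABLE start */

	printf("MAX_ORDER-aligned start: %lu\n", align_up(pfn, MAX_ORDER_NR_PAGES));
	printf("section-aligned start:   %lu\n", align_up(pfn, PAGES_PER_SECTION));
	return 0;
}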
Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote: > Here's a better swap prefetch tester. Instructions in file. > > Machine with 2GB ram and 2GB swapfile > > Prefetch disabled: > ./sp_tester > Ram 2060352000 Swap 197342 > Total ram to be malloced: 3047062000 bytes > Starting first malloc of 1523531000 bytes > Starting 1st read of first malloc > Touching this much ram takes 809 milliseconds > Starting second malloc of 1523531000 bytes > Completed second malloc and free > Sleeping for 600 seconds > Important part - starting reread of first malloc > Completed read of first malloc > Timed portion 53397 milliseconds > > Enabled: > ./sp_tester > Ram 2060352000 Swap 197342 > Total ram to be malloced: 3047062000 bytes > Starting first malloc of 1523531000 bytes > Starting 1st read of first malloc > Touching this much ram takes 676 milliseconds > Starting second malloc of 1523531000 bytes > Completed second malloc and free > Sleeping for 600 seconds > Important part - starting reread of first malloc > Completed read of first malloc > Timed portion 26351 milliseconds > echo 1 > /proc/sys/vm/overcommit_memory swapoff -a swapon -a ./sp_tester Ram 1153644000 Swap 1004052000 Total ram to be malloced: 165567 bytes Starting first malloc of 827835000 bytes Starting 1st read of first malloc Touching this much ram takes 937 milliseconds Starting second malloc of 827835000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 15011 milliseconds echo 0 > /proc/sys/vm/overcommit_memory swapoff -a swapon -a ./sp_tester Ram 1153644000 Swap 1004052000 Total ram to be malloced: 165567 bytes Starting first malloc of 827835000 bytes Starting 1st read of first malloc Touching this much ram takes 1125 milliseconds Starting second malloc of 827835000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 14611 milliseconds - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans
2007/5/5, Con Kolivas <[EMAIL PROTECTED]>: [cut] Here's a better swap prefetch tester. Instructions in file. The system should be left totally inactive for the test duration (10m), right? Regards, ~ Antonio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On Friday 04 May 2007 22:10, Con Kolivas wrote: > On Friday 04 May 2007 18:52, Ingo Molnar wrote: > > agreed. Con, IIRC you wrote a testcase for this, right? Could you please > > send us the results of that testing? > > Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch > disabled and then enabled swap prefetch saves ~5 seconds on average > hardware on this one test case. I had many users try this and the results > were between 2 and 10 seconds, but always showed a saving on this testcase. > This effect easily occurs on printing a big picture, editing a large file, > compressing an iso image or whatever in real world workloads. Smaller, but > much more frequent effects of this over the course of a day obviously also > occur and do add up. Here's a better swap prefetch tester. Instructions in file. Machine with 2GB ram and 2GB swapfile Prefetch disabled: ./sp_tester Ram 2060352000 Swap 197342 Total ram to be malloced: 3047062000 bytes Starting first malloc of 1523531000 bytes Starting 1st read of first malloc Touching this much ram takes 809 milliseconds Starting second malloc of 1523531000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 53397 milliseconds Enabled: ./sp_tester Ram 2060352000 Swap 197342 Total ram to be malloced: 3047062000 bytes Starting first malloc of 1523531000 bytes Starting 1st read of first malloc Touching this much ram takes 676 milliseconds Starting second malloc of 1523531000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 26351 milliseconds Note huge time difference. -- -ck

/*
 * sp_tester.c
 * Build with: gcc -o sp_tester sp_tester.c -lrt -W -Wall -O2
 *
 * How to use:
 *	echo 1 > /proc/sys/vm/overcommit_memory
 *	swapoff -a
 *	swapon -a
 *	./sp_tester
 * then repeat with changed conditions eg
 *	echo 0 > /proc/sys/vm/swap_prefetch
 * Each test takes 10 minutes
 */

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

void fatal(const char *format, ...)
{
	va_list ap;

	if (format) {
		va_start(ap, format);
		vfprintf(stderr, format, ap);
		va_end(ap);
	}
	fprintf(stderr, "Fatal error - exiting\n");
	exit(1);
}

unsigned long ramsize, swapsize;

size_t get_ram(void)
{
	FILE *meminfo;
	char aux[256];

	if (!(meminfo = fopen("/proc/meminfo", "r")))
		fatal("fopen\n");
	while (!feof(meminfo) && !fscanf(meminfo, "MemTotal: %lu kB", &ramsize))
		fgets(aux, sizeof(aux), meminfo);
	while (!feof(meminfo) && !fscanf(meminfo, "SwapTotal: %lu kB", &swapsize))
		fgets(aux, sizeof(aux), meminfo);
	if (fclose(meminfo) == -1)
		fatal("fclose");
	ramsize *= 1000;
	swapsize *= 1000;
	printf("Ram %lu Swap %lu\n", ramsize, swapsize);
	return ramsize + (swapsize / 2);
}

unsigned long get_usecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		fatal("clock_gettime");
	return (myts->tv_sec * 1000000 + myts->tv_nsec / 1000);
}

int main(void)
{
	unsigned long current_time, time_diff;
	struct timespec myts;
	char *buf1, *buf2, *buf3, *buf4;
	size_t size = get_ram();
	int sleep_seconds = 600;

	if (size > ramsize / 2 * 3)
		size = ramsize / 2 * 3;
	printf("Total ram to be malloced: %u bytes\n", size);
	size /= 2;

	printf("Starting first malloc of %u bytes\n", size);
	buf1 = malloc(size);
	buf4 = malloc(1);
	if (buf1 == (char *)-1)
		fatal("Failed to malloc 1st buffer\n");
	memset(buf1, 0, size);

	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	printf("Starting 1st read of first malloc\n");
	current_time = get_usecs(&myts);
	time_diff = current_time - time_diff;
	printf("Touching this much ram takes %lu milliseconds\n", time_diff / 1000);

	printf("Starting second malloc of %u bytes\n", size);
	buf2 = malloc(size);
	if (buf2 == (char *)-1)
		fatal("Failed to malloc 2nd buffer\n");
	memset(buf2, 0, size);
	for (buf3 = buf2 + size; buf3 > buf2; buf3--)
		*buf4 = *buf3;
	free(buf2);
	printf("Completed second malloc and free\n");

	printf("Sleeping for %u seconds\n", sleep_seconds);
	sleep(sleep_seconds);

	printf("Important part - starting reread of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	time_diff = current_time - time_diff;
	printf("Completed read of first malloc\n");
	printf("Timed portion %lu milliseconds\n", time_diff / 1000);

	free(buf1);
	free(buf4);
	return 0;
}
Re: swap-prefetch: 2.6.22 -mm merge plans
On Friday 04 May 2007 18:52, Ingo Molnar wrote: > agreed. Con, IIRC you wrote a testcase for this, right? Could you please > send us the results of that testing? Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch disabled and then enabled swap prefetch saves ~5 seconds on average hardware on this one test case. I had many users try this and the results were between 2 and 10 seconds, but always showed a saving on this testcase. This effect easily occurs on printing a big picture, editing a large file, compressing an iso image or whatever in real world workloads. Smaller, but much more frequent effects of this over the course of a day obviously also occur and do add up. -- -ck

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

void fatal(const char *format, ...)
{
	va_list ap;

	if (format) {
		va_start(ap, format);
		vfprintf(stderr, format, ap);
		va_end(ap);
	}
	fprintf(stderr, "Fatal error - exiting\n");
	exit(1);
}

size_t get_ram(void)
{
	unsigned long ramsize;
	FILE *meminfo;
	char aux[256];

	if (!(meminfo = fopen("/proc/meminfo", "r")))
		fatal("fopen\n");
	while (!feof(meminfo) && !fscanf(meminfo, "MemTotal: %lu kB", &ramsize))
		fgets(aux, sizeof(aux), meminfo);
	if (fclose(meminfo) == -1)
		fatal("fclose");
	return ramsize * 1000;
}

unsigned long get_usecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		fatal("clock_gettime");
	return (myts->tv_sec * 1000000 + myts->tv_nsec / 1000);
}

int main(void)
{
	unsigned long current_time, time_diff;
	struct timespec myts;
	char *buf1, *buf2, *buf3, *buf4;
	size_t size, full_size = get_ram();
	int sleep_seconds = 600;

	size = full_size * 7 / 10;
	printf("Starting first malloc of %d bytes\n", size);
	buf1 = malloc(size);
	if (buf1 == (char *)-1)
		fatal("Failed to malloc 1st buffer\n");
	memset(buf1, 0, size);

	printf("Completed first malloc and starting second malloc of %d bytes\n", full_size);
	buf2 = malloc(full_size);
	if (buf2 == (char *)-1)
		fatal("Failed to malloc 2nd buffer\n");
	memset(buf2, 0, full_size);

	buf4 = malloc(1);
	for (buf3 = buf2 + full_size; buf3 > buf2; buf3--)
		*buf4 = *buf3;
	free(buf2);
	printf("Completed second malloc and free\n");

	printf("Sleeping for %d seconds\n", sleep_seconds);
	sleep(sleep_seconds);

	printf("Important part - starting read of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	free(buf4);
	free(buf1);
	printf("Completed read and freeing of first malloc\n");
	time_diff = current_time - time_diff;
	printf("Timed portion %lu microseconds\n", time_diff);
	return 0;
}
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote: Nick Piggin wrote: Christoph Hellwig wrote: Is that every fork/exec or just under certain cicumstances? A 5% regression on every fork/exec is not acceptable. Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers will be improved as well with that patch. Then if we have specific lock/unlock bitops, I hope it should reduce that further. OK, with the races and missing barriers fixed from the previous patch, plus the attached one added (+patch3), numbers are better again (I'm not sure if I have the ppc barriers correct though). These ops could also be put to use in bit spinlocks, buffer lock, and probably a few other places too. 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 +patch 1.71-1.73 175.2-180.8 780.5-794.2 +patch2 1.61-1.63 169.8-175.0 748.6-757.0 +patch3 1.54-1.57 165.6-170.9 748.5-757.5 So fault performance goes to under 5%, fork is in the noise, exec is still up 1%, but maybe that's noise or cache effects again. OK, with my new lock/unlock_page, dd if=large (bigger than RAM) sparse file of=/dev/null with an experimentally optimal block size (32K) goes from 626MB/s to 683MB/s on 2 CPU G5 booted with maxcpus=1. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote: Christoph Hellwig wrote: Is that every fork/exec or just under certain circumstances? A 5% regression on every fork/exec is not acceptable. Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers will be improved as well with that patch. Then if we have specific lock/unlock bitops, I hope it should reduce that further. OK, with the races and missing barriers fixed from the previous patch, plus the attached one added (+patch3), numbers are better again (I'm not sure if I have the ppc barriers correct though). These ops could also be put to use in bit spinlocks, buffer lock, and probably a few other places too.

2.6.21   1.49-1.51  164.6-170.8  741.8-760.3
+patch   1.71-1.73  175.2-180.8  780.5-794.2
+patch2  1.61-1.63  169.8-175.0  748.6-757.0
+patch3  1.54-1.57  165.6-170.9  748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is still up 1%, but maybe that's noise or cache effects again. -- SUSE Labs, Novell Inc.

Index: linux-2.6/include/asm-powerpc/bitops.h
===
--- linux-2.6.orig/include/asm-powerpc/bitops.h	2007-05-04 16:08:20.0 +1000
+++ linux-2.6/include/asm-powerpc/bitops.h	2007-05-04 16:14:39.0 +1000
@@ -87,6 +87,24 @@
 	: "cc" );
 }
 
+static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+{
+	unsigned long old;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+	LWSYNC_ON_SMP
+"1:"	PPC_LLARX "%0,0,%3	# clear_bit_unlock\n"
+	"andc	%0,%0,%2\n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%0,0,%3\n"
+	"bne-	1b"
+	: "=&r" (old), "+m" (*p)
+	: "r" (mask), "r" (p)
+	: "cc" );
+}
+
 static __inline__ void change_bit(int nr, volatile unsigned long *addr)
 {
 	unsigned long old;
@@ -126,6 +144,27 @@
 	return (old & mask) != 0;
 }
 
+static __inline__ int test_and_set_bit_lock(unsigned long nr,
+					volatile unsigned long *addr)
+{
+	unsigned long old, t;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+"1:"	PPC_LLARX "%0,0,%3		# test_and_set_bit_lock\n"
+	"or	%1,%0,%2 \n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%1,0,%3 \n"
+	"bne-	1b"
+	ISYNC_ON_SMP
+	: "=&r" (old), "=&r" (t)
+	: "r" (mask), "r" (p)
+	: "cc", "memory");
+
+	return (old & mask) != 0;
+}
+
 static __inline__ int test_and_clear_bit(unsigned long nr,
 					 volatile unsigned long *addr)
 {
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h	2007-05-04 16:14:36.0 +1000
+++ linux-2.6/include/linux/pagemap.h	2007-05-04 16:17:34.0 +1000
@@ -136,13 +136,18 @@
 extern void FASTCALL(__wait_on_page_locked(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
 
+static inline int trylock_page(struct page *page)
+{
+	return (likely(!TestSetPageLocked_Lock(page)));
+}
+
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page(page);
 }
 
@@ -153,7 +158,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page_nosync(page);
 }
 
Index: linux-2.6/drivers/scsi/sg.c
===
--- linux-2.6.orig/drivers/scsi/sg.c	2007-04-12 14:35:08.0 +1000
+++ linux-2.6/drivers/scsi/sg.c	2007-05-04 16:23:27.0 +1000
@@ -1734,7 +1734,7 @@
 		 */
 		flush_dcache_page(pages[i]);
 		/* ?? Is locking needed? I don't think so */
-		/* if (TestSetPageLocked(pages[i]))
+		/* if (!trylock_page(pages[i]))
 		   goto out_unlock; */
 	}
 
Index: linux-2.6/fs/cifs/file.c
===
--- linux-2.6.orig/fs/cifs/file.c	2007-04-12 14:35:09.0 +1000
+++ linux-2.6/fs/cifs/file.c	2007-05-04 16:23:36.0 +1000
@@ -1229,7 +1229,7 @@
 			if (first < 0)
 				lock_page(page);
-			else if (TestSetPageLocked(page))
+			else if (!trylock_page(page))
 				break;
 
 			if (unlikely(page->mapping != mapping)) {
Index: linux-2.6/fs/jbd/commit.c
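For illustration of the "bit spinlocks, buffer lock" remark above, here is a minimal sketch of a bit-granular lock built only on the acquire/release semantics of the two primitives this patch adds; the my_bit_spin_lock()/my_bit_spin_unlock() names are hypothetical and not part of the posted patch.

static inline void my_bit_spin_lock(int bitnum, unsigned long *addr)
{
	/* test_and_set_bit_lock() returns the old bit value and has
	 * acquire semantics when it wins, so spin until it returns 0 */
	while (test_and_set_bit_lock(bitnum, addr))
		cpu_relax();
}

static inline void my_bit_spin_unlock(int bitnum, unsigned long *addr)
{
	/* clear_bit_unlock() orders all stores inside the critical
	 * section before the bit is cleared (release semantics) */
	clear_bit_unlock(bitnum, addr);
}

The point of the pairing is that the barrier cost is confined to the lock-flavoured operations, so callers of the plain set_bit()/clear_bit() variants do not pay for it.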
Re: swap-prefetch: 2.6.22 -mm merge plans
Ingo Molnar wrote: * Nick Piggin <[EMAIL PROTECTED]> wrote: Here were some of my concerns, and where our discussion got up to. Yes. Perhaps it just doesn't help with the updatedb thing. Or maybe with normal system activity we get enough free pages to kick the thing off and running. Perhaps updatedb itself has a lot of rss, for example. Could be, but I don't know. I'd think it unlikely to allow _much_ swapin, if huge amounts of the desktop have been swapped out. But maybe... as I said, nobody seems to have a recipe for these things. can i take this one as a "no fundamental objection"? There are really only 2 maintainance options left: 1) either you can do it better or at least have a _very_ clearly described idea outlined about how to do it differently 2) or you should let others try it #1 you've not done for 2-3 years since swap-prefetch was waiting for integration so it's not an option at this stage anymore. Then you are pretty much obliged to do #2. ;-) The burden is not on me to get someone else's feature merged. If it can be shown to work well and people's concerns addressed, then anything will get merged. The reason Linux is so good is because of what we don't merge, figuratively speaking. I wanted to see some basic regression tests to show that it hasn't caused obvious problems, and some basic scenarios where it helps, so that we can analyse them. It is really simple, but I haven't got any since first asking. And note that I don't think I ever explicitly "nacked" anything, just voiced my concerns. If my concerns had been addressed, then I couldn't have stopped anybody from merging anything. 2) It is a _highly_ speculative operation, and in workloads where periods of low and high page usage with genuinely unused anonymous / tmpfs pages, it could waste power, memory bandwidth, bus bandwidth, disk bandwidth... Yes. I suspect that's a matter of waiting for the corner-case reporters to complain, then add more heuristics. Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch is happy to do a _lot_ of work for these things which we have already decided are least likely to be used again. i see no real problem here. We've had heuristics for a _long_ time in various areas of the code. Sometimes they work, sometimes they suck. So that's one of my issues with the code. If all you have to support a merge is anecodal evidence, then I find it interesting that you would easily discount something like this. 4) If this is helpful, wouldn't it be equally important for things like mapped file pages? Seems like half a solution. [...] (otoh the akpm usersapce implementation is swapoff -a;swapon -a) Perhaps. You may need a few indicators to see whether the system is idle... but OTOH, we've already got a lot of indicators for memory, disk usage, etc. So, maybe :) The time has passed for this. Let others play too. Please :-) Play with what? Prefetching mmaped file pages as well? Sure. I could be wrong, but IIRC there is no good way to know which cpuset to bring the page back into, (and I guess similarly it would be hard to know what container to account it to, if doing account-on-allocate). (i think cpusets are totally uninteresting in this context: nobody in their right mind is going to use swap-prefetch on a big NUMA box. Nor can i see any fundamental impediment to making this more cpuset-aware, just like other subsystems were made cpuset-aware, once the requests from actual users came in and people started getting interested in it.) OK, so make it more cpuset aware. 
This isn't a new issue, I raised it a long time ago. And trust me, it is a nightmare to just assume that nobody will use cpusets on a small box for example (AFAIK the resource control guys are looking at doing just that). All core VM features should play nicely with each other without *really* good reason. I think the "lack of testcase and numbers" is the only valid technical objection i've seen so far. Well you're entitled to your opinion too. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
* Nick Piggin <[EMAIL PROTECTED]> wrote: > > i'm wondering about swap-prefetch: > Being able to config all these core heuristics changes is really not > that much of a positive. The fact that we might _need_ to config > something out, and double the configuration range isn't too pleasing. Well, to the desktop user this is a speculative performance feature that he is willing to potentially waste CPU and IO capacity, in expectation of better performance. On the conceptual level it is _precisely the same thing as regular file readahead_. (with the difference that to me swapahead seems to be quite a bit more intelligent than our current file readahead logic.) This feature has no API or ABI impact at all, it's a pure performance feature. (besides the trivial sysctl to turn it runtime on/off). > Here were some of my concerns, and where our discussion got up to. > > Yes. Perhaps it just doesn't help with the updatedb thing. Or > > maybe with normal system activity we get enough free pages to kick > > the thing off and running. Perhaps updatedb itself has a lot of > > rss, for example. > > Could be, but I don't know. I'd think it unlikely to allow _much_ > swapin, if huge amounts of the desktop have been swapped out. But > maybe... as I said, nobody seems to have a recipe for these things. can i take this one as a "no fundamental objection"? There are really only 2 maintainance options left: 1) either you can do it better or at least have a _very_ clearly described idea outlined about how to do it differently 2) or you should let others try it #1 you've not done for 2-3 years since swap-prefetch was waiting for integration so it's not an option at this stage anymore. Then you are pretty much obliged to do #2. ;-) And let me be really blunt about this, there is no option #3 to say: "I have no real better idea, I have no code, I have no time, but hey, lets not merge this because it 'could in theory' be possible to do it better" =B-) really, we are likely be better off by risking the merge of _bad_ code (which in the swap-prefetch case is the exact opposite of the truth), than to let code stagnate. People are clearly unhappy about certain desktop aspects of swapping, and the only way out of that is to let more people hack that code. Merging code involves more people. It will cause 'noise' and could cause regressions, but at least in this case the only impact is 'performance' and the feature is trivial to disable. The maintainance drag outside of swap_prefetch.c is essentially _zero_. If the feature doesnt work it ends up on Con's desk. If it turns out to not work at all (despite years of testing and happy users) it still only ends up on Con's desk. A clear win/win scenario for you i think :-) > > Would be useful to see this claim substantiated with a real > > testcase, description of results and an explanation of how and why > > it worked. > > Yes... and then try to first improve regular page reclaim and use-once > handling. agreed. Con, IIRC you wrote a testcase for this, right? Could you please send us the results of that testing? > >>2) It is a _highly_ speculative operation, and in workloads where periods > >>of low and high page usage with genuinely unused anonymous / tmpfs > >>pages, it could waste power, memory bandwidth, bus bandwidth, disk > >>bandwidth... > > > > Yes. I suspect that's a matter of waiting for the corner-case > > reporters to complain, then add more heuristics. > > Ugh. Well it is a pretty fundamental problem. 
Basically swap-prefetch > is happy to do a _lot_ of work for these things which we have already > decided are least likely to be used again. i see no real problem here. We've had heuristics for a _long_ time in various areas of the code. Sometimes they work, sometimes they suck. the flow of this is really easy: distro looking for a feature edge turns it on and announces it, if the feature does not work out for users then user turns it off and complains to distro, if enough users complain then distro turns it off for next release, upstream forgets about this performance feature and eventually removes it once someone notices that it wouldnt even compile in the past 2 main releases. I see no problem here, we did that in the past too with performance features. The networking stack has literally dozens of such small tunable things which get experimented with, and whose defaults do get tuned carefully. Some of the knobs help bandwidth, some help latency. I do not even see any risk of "splitup of mindshare" - swap-prefetch is so clearly speculative that it's not really a different view about how to do swapping (which would split the tester base, etc.), it's simply a "do you want your system to speculate about the future or not" add-on decision. Every system has a pretty clear idea about that: desktops generally want to do it, clusters generally dont want to do it. > >>3) I haven't seen a single
Re: swap-prefetch: 2.6.22 -mm merge plans
Ingo Molnar wrote: * Andrew Morton <[EMAIL PROTECTED]> wrote: - If replying, please be sure to cc the appropriate individuals. Please also consider rewriting the Subject: to something appropriate. i'm wondering about swap-prefetch: Well I had some issues with it that I don't think were fully discussed, and Andrew prompted me to say something, but it went off list for a couple of posts (my fault, sorry). Posting it below with Andrew's permission... mm-implement-swap-prefetching.patch swap-prefetch-avoid-repeating-entry.patch add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch The swap-prefetch feature is relatively compact: 10 files changed, 745 insertions(+), 1 deletion(-) it is contained mostly to itself: mm/swap_prefetch.c| 581 i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up. I think this has been held back way too long. It's .config selectable and it is as ready for integration as it ever is going to be. So it's a win/win scenario. Being able to config all these core heuristics changes is really not that much of a positive. The fact that we might _need_ to config something out, and double the configuration range isn't too pleasing. Here were some of my concerns, and where our discussion got up to. Andrew Morton wrote: > On Fri, 04 May 2007 14:34:45 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > > >>Andrew Morton wrote: >> >>>istr you had issues with swap-prefetch? >>> >>>If so, now's a good time to reiterate them ;) >> >>1) It is said to help with the updatedb overnight problem, however it >>actually _doesn't_ prefetch swap when there are low free pages, which >>is how updatedb will leave the system. So this really puzzles me how >>it would work. However if updatedb is causing excessive swapout, I >>think we should try improving use-once algorithms first, for example. > > > Yes. Perhaps it just doesn't help with the updatedb thing. Or maybe with > normal system activity we get enough free pages to kick the thing off and > running. Perhaps updatedb itself has a lot of rss, for example. Could be, but I don't know. I'd think it unlikely to allow _much_ swapin, if huge amounts of the desktop have been swapped out. But maybe... as I said, nobody seems to have a recipe for these things. > Would be useful to see this claim substantiated with a real testcase, > description of results and an explanation of how and why it worked. Yes... and then try to first improve regular page reclaim and use-once handling. >>2) It is a _highly_ speculative operation, and in workloads where periods >>of low and high page usage with genuinely unused anonymous / tmpfs >>pages, it could waste power, memory bandwidth, bus bandwidth, disk >>bandwidth... > > > Yes. I suspect that's a matter of waiting for the corner-case reporters to > complain, then add more heuristics. Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch is happy to do a _lot_ of work for these things which we have already decided are least likely to be used again. >>3) I haven't seen a single set of numbers out of it. Feedback seems to >>have mostly come from people who > > > Yup. But can we come up with a testcase? It's hard. I guess it is hard firstly because swapping is quite random to start with. 
But I haven't even seen basic things like "make -jhuge swapstorm has no regressions". >>4) If this is helpful, wouldn't it be equally important for things like >>mapped file pages? Seems like half a solution. > > > True. > > Without thinking about it, I almost wonder if one could do a userspace > implementation with something which pokes around in /proc/pid/pagemap and > /proc/pid/kpagemap, perhaps with some additional interfaces added to > do a swapcache read. (Give userspace the ability to get at swapcache > via a regular fd?) > > (otoh the akpm usersapce implementation is swapoff -a;swapon -a) Perhaps. You may need a few indicators to see whether the system is idle... but OTOH, we've already got a lot of indicators for memory, disk usage, etc. So, maybe :) >>5) New one: it is possibly going to interact badly with MADV_FREE lazy >>freeing. The more complex we make page reclaim, the worse IMO. > > > That's a bit vague. What sort of problems do you envisage? Well MADV_FREE pages aren't technically free, are they? So it might be possible for a significant number of them to build up and prevent swap prefetch from working. Maybe. >>...) I had a few issues with implementation, like interaction with >>cpusets. Don't know if these are all fixed or not. I sort of gave >>up looking at it. > > > Ah yes, I remember some mention of cpusets.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Andrew Morton wrote: On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

 void fastcall unlock_page(struct page *page)
 {
+	VM_BUG_ON(!PageLocked(page));
 	smp_mb__before_clear_bit();
-	if (!TestClearPageLocked(page))
-		BUG();
-	smp_mb__after_clear_bit();
-	wake_up_page(page, PG_locked);
+	ClearPageLocked(page);
+	if (unlikely(test_bit(PG_waiters, &page->flags))) {
+		clear_bit(PG_waiters, &page->flags);
+		wake_up_page(page, PG_locked);
+	}
 }

Why is that significantly faster than plain old wake_up_page(), which tests waitqueue_active()? Because it needs fewer barriers and doesn't touch a random hash cacheline in the fastpath. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
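For readers following the PG_waiters trick: the fast path above only works if sleepers advertise themselves before sleeping. The snippet below is a rough sketch of that waiter side under that assumption, not code quoted from Nick's patch (example_lock_page_slow() is a hypothetical name, and the careful memory barriers the real patch needs are omitted).

static void example_lock_page_slow(struct page *page)
{
	for (;;) {
		/* advertise the waiter before retrying the lock bit, so
		 * unlock_page() knows it must go to the waitqueue hash */
		set_bit(PG_waiters, &page->flags);
		if (!TestSetPageLocked(page))
			break;		/* got the lock */
		/* sleep; unlock_page() sees PG_waiters and wakes us */
		wait_on_page_bit(page, PG_locked);
	}
}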
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Fri, 4 May 2007, Benjamin Herrenschmidt wrote: > > The SLUB allocator relies on struct page fields first_page and slab, > > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then > > be used for the lowest level of pagetable pages. This was obstructing > > SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert > > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd > > want partpages, so continue to use kmem_caches for pmd, pud and pgd). > > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM. > > Interesting... I'll have a look asap. I would also recommend looking at removing the constructors for the remaining slabs. A constructor requires that SLUB never touch the object (same situation as is resulting from enabling debugging). So it must increase the object size in order to put the free pointer after the object. In case of a order of 2 cache this has a particularly bad effect of doubling object size. If the objects can be overwritten on free (no constructor) then we can use the first word of the object as a freepointer on kfree. Meaning we can use a hot cacheline so no cache miss. On alloc we have already touched the first cacheline which also avoids a cacheline fetch there. This is the optimal way of operation for SLUB. Hmmm We could add an option to allow the use of a constructor while keeping the free pointer at the beginning of the object? Then we would have to zap the first word on alloc. Would work like quicklists. Add SLAB_FREEPOINTER_MAY_OVERLAP? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
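A simplified illustration of the layout tradeoff described above, not SLUB's actual code: without a constructor the freelist link can overlay the first word of a dead object, while with a constructor the object contents must be preserved, so the link is appended after the object. The helper name effective_object_size() is purely illustrative.

static inline unsigned long effective_object_size(unsigned long objsize,
						  int has_constructor)
{
	if (!has_constructor)
		return objsize;			/* free pointer overlays the object */
	return objsize + sizeof(void *);	/* free pointer appended after it */
}

For a power-of-two sized object the appended pointer can push the allocation into the next size class, which is where the "doubling" effect mentioned above comes from.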
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 2007-05-03 at 22:04 +0100, Hugh Dickins wrote: > On Thu, 3 May 2007, Hugh Dickins wrote: > > > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > > as intended: maybe it just works some of the time. I'm not going > > to hazard a guess as to how to fix it up, will resume looking at > > the powerpc's quicklist potential later. > > Here's the patch I've been testing on G5, with 4k and with 64k pages, > with SLAB and with SLUB. But, though it doesn't crash, the pgd > kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity > for using highorder allocations where SLAB would stick to order 0: > under load, exec's mm_init gets page allocation failure on order 4 > - SLUB's calculate_order may need some retuning. (I'd expect it to > be going for order 3 actually, I'm not sure how order 4 comes about.) > > I don't know how offensive Ben and Paulus may find this patch: > the kmem_cache use was nicely done and this messes it up a little. > > > The SLUB allocator relies on struct page fields first_page and slab, > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then > be used for the lowest level of pagetable pages. This was obstructing > SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd > want partpages, so continue to use kmem_caches for pmd, pud and pgd). > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM. Interesting... I'll have a look asap. Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 3 May 2007, Christoph Lameter wrote: > > There are SLUB patches pending (not in rc7-mm2 as far as I can recall) > that reduce the default page order sizes to head off this issue. The > defaults were initially too large (and they still default to large > for testing if Mel's Antifrag work is detected to be active). Sounds good. > > - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], > > - GFP_KERNEL|__GFP_REPEAT); > > + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); > > __GFP_REPEAT is unusual here but this was carried over from the > kmem_cache_alloc it seems. Hmm... There is some variance on how we do this > between arches. Should we uniformly set or not set this flag? Not something to get into in this patch, but it did surprise me too. I believe __GFP_REPEAT should be avoided, and I don't see justification for it here (but need to remember not to do a blind virt_to_page on the result in some places if it might return NULL - which IIRC it actually can do even if __GFP_REPEAT, when chosen for OOM kill). But I've a suspicion it got put in there for some good reason I don't know about. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
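To make that caveat concrete, here is a rough sketch -- an assumption about how such a conversion could look, not actual powerpc code, with invented helper names -- of a pte allocation path that drops __GFP_REPEAT and therefore checks for a NULL return instead of doing a blind virt_to_page():

#include <linux/mm.h>
#include <linux/quicklist.h>

/* Sketch only: may return NULL once __GFP_REPEAT is dropped. */
static inline pte_t *sketch_pte_alloc_one_kernel(struct mm_struct *mm,
						 unsigned long address)
{
	return quicklist_alloc(0, GFP_KERNEL, NULL);
}

/* Callers must not call virt_to_page() on a possibly-NULL result. */
static inline struct page *sketch_pte_alloc_one(struct mm_struct *mm,
						unsigned long address)
{
	pte_t *pte = sketch_pte_alloc_one_kernel(mm, address);

	return pte ? virt_to_page(pte) : NULL;
}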
Re: swap-prefetch: 2.6.22 -mm merge plans
On Friday 04 May 2007 01:54, Ingo Molnar wrote: > * Andrew Morton <[EMAIL PROTECTED]> wrote: > > - If replying, please be sure to cc the appropriate individuals. > > Please also consider rewriting the Subject: to something > > appropriate. > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback > i've seen so far was positive. Time to have this upstream and time for a > desktop-oriented distro to pick it up. > > I think this has been held back way too long. It's .config selectable > and it is as ready for integration as it ever is going to be. So it's a > win/win scenario. > > Acked-by: Ingo Molnar <[EMAIL PROTECTED]> Thank you very much for code review, ack and support! -- -ck - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 3 May 2007, Hugh Dickins wrote: > On Thu, 3 May 2007, Hugh Dickins wrote: > > > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > > as intended: maybe it just works some of the time. I'm not going > > to hazard a guess as to how to fix it up, will resume looking at > > the powerpc's quicklist potential later. > > Here's the patch I've been testing on G5, with 4k and with 64k pages, > with SLAB and with SLUB. But, though it doesn't crash, the pgd > kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity > for using highorder allocations where SLAB would stick to order 0: > under load, exec's mm_init gets page allocation failure on order 4 > - SLUB's calculate_order may need some retuning. (I'd expect it to > be going for order 3 actually, I'm not sure how order 4 comes about.) There are SLUB patches pending (not in rc7-mm2 as far as I can recall) that reduce the default page order sizes to head off this issue. The defaults were initially too large (and they still default to large for testing if Mel's Antifrag work is detected to be active). > - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], > - GFP_KERNEL|__GFP_REPEAT); > + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); __GFP_REPEAT is unusual here but this was carried over from the kmem_cache_alloc it seems. Hmm... There is some variance on how we do this between arches. Should we uniformly set or not set this flag? [EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-ia64/* include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-ia64/pgalloc.h: void *pg = quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); [EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc arch/i386/mm/* arch/i386/mm/pgtable.c: pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor); [EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-sparc64/* include/asm-sparc64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-sparc64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-sparc64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-sparc64/pgalloc.h: void *pg = quicklist_alloc(0, GFP_KERNEL, NULL); [EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-x86_64/* include/asm-x86_64/pgalloc.h: return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL); include/asm-x86_64/pgalloc.h: return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL); include/asm-x86_64/pgalloc.h: pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD, include/asm-x86_64/pgalloc.h: return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL); include/asm-x86_64/pgalloc.h: void *p = (void *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 3 May 2007, Hugh Dickins wrote: > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > as intended: maybe it just works some of the time. I'm not going > to hazard a guess as to how to fix it up, will resume looking at > the powerpc's quicklist potential later. Here's the patch I've been testing on G5, with 4k and with 64k pages, with SLAB and with SLUB. But, though it doesn't crash, the pgd kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity for using highorder allocations where SLAB would stick to order 0: under load, exec's mm_init gets page allocation failure on order 4 - SLUB's calculate_order may need some retuning. (I'd expect it to be going for order 3 actually, I'm not sure how order 4 comes about.) I don't know how offensive Ben and Paulus may find this patch: the kmem_cache use was nicely done and this messes it up a little. The SLUB allocator relies on struct page fields first_page and slab, overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then be used for the lowest level of pagetable pages. This was obstructing SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd want partpages, so continue to use kmem_caches for pmd, pud and pgd). But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM. Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- arch/powerpc/Kconfig |4 arch/powerpc/mm/init_64.c | 17 ++--- include/asm-powerpc/pgalloc.h | 26 +++--- 3 files changed, 21 insertions(+), 26 deletions(-) --- 2.6.21-rc7-mm2/arch/powerpc/Kconfig 2007-04-26 13:33:51.0 +0100 +++ linux/arch/powerpc/Kconfig 2007-05-03 20:45:12.0 +0100 @@ -31,6 +31,10 @@ config MMU bool default y +config QUICKLIST + bool + default y + config GENERIC_HARDIRQS bool default y --- 2.6.21-rc7-mm2/arch/powerpc/mm/init_64.c2007-04-26 13:33:51.0 +0100 +++ linux/arch/powerpc/mm/init_64.c 2007-05-03 20:45:12.0 +0100 @@ -146,21 +146,16 @@ static void zero_ctor(void *addr, struct memset(addr, 0, kmem_cache_size(cache)); } -#ifdef CONFIG_PPC_64K_PAGES -static const unsigned int pgtable_cache_size[3] = { - PTE_TABLE_SIZE, PMD_TABLE_SIZE, PGD_TABLE_SIZE -}; -static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = { - "pte_pmd_cache", "pmd_cache", "pgd_cache", -}; -#else static const unsigned int pgtable_cache_size[2] = { - PTE_TABLE_SIZE, PMD_TABLE_SIZE + PGD_TABLE_SIZE, PMD_TABLE_SIZE }; static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = { - "pgd_pte_cache", "pud_pmd_cache", -}; +#ifdef CONFIG_PPC_64K_PAGES + "pgd_cache", "pmd_cache", +#else + "pgd_cache", "pud_pmd_cache", #endif /* CONFIG_PPC_64K_PAGES */ +}; #ifdef CONFIG_HUGETLB_PAGE /* Hugepages need one extra cache, initialized in hugetlbpage.c. 
We --- 2.6.21-rc7-mm2/include/asm-powerpc/pgalloc.h2007-02-04 18:44:54.0 + +++ linux/include/asm-powerpc/pgalloc.h 2007-05-03 20:45:12.0 +0100 @@ -10,21 +10,15 @@ #include #include #include +#include extern struct kmem_cache *pgtable_cache[]; -#ifdef CONFIG_PPC_64K_PAGES -#define PTE_CACHE_NUM 0 -#define PMD_CACHE_NUM 1 -#define PGD_CACHE_NUM 2 -#define HUGEPTE_CACHE_NUM 3 -#else -#define PTE_CACHE_NUM 0 -#define PMD_CACHE_NUM 1 -#define PUD_CACHE_NUM 1 #define PGD_CACHE_NUM 0 +#define PUD_CACHE_NUM 1 +#define PMD_CACHE_NUM 1 #define HUGEPTE_CACHE_NUM 2 -#endif +#define PTE_CACHE_NUM 3 /* from quicklist rather than kmem_cache */ /* * This program is free software; you can redistribute it and/or @@ -97,8 +91,7 @@ static inline void pmd_free(pmd_t *pmd) static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], - GFP_KERNEL|__GFP_REPEAT); + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); } static inline struct page *pte_alloc_one(struct mm_struct *mm, @@ -109,7 +102,7 @@ static inline struct page *pte_alloc_one static inline void pte_free_kernel(pte_t *pte) { - kmem_cache_free(pgtable_cache[PTE_CACHE_NUM], pte); + quicklist_free(0, NULL, pte); } static inline void pte_free(struct page *ptepage) @@ -136,7 +129,10 @@ static inline void pgtable_free(pgtable_ void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK); int cachenum = pgf.val & PGF_CACHENUM_MASK; - kmem_cache_free(pgtable_cache[cachenum], p); + if (cachenum == PTE_CACHE_NUM) + quicklist_free(0, NULL, p); + else + kmem_cache_free(pgtable_cache[cachenum], p); } extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
Re: 2.6.22 -mm merge plans
On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote: > > kprobes does this kind of synchronization internally, so the marker > > wrapper should probably as well. > > > > The problem appears on heavily loaded systems. Doing 50 > synchronize_sched() calls in a row can take up to a few seconds on a > 4-way machine. This is why I prefer to do it in the module to which > the callbacks belong. We recently had a discussion on a batch unregistration interface for kprobes. I'm not very happy with having such different interfaces for different kinds of probe registration. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
Here is the reworked patch, except a comment : * Christoph Hellwig ([EMAIL PROTECTED]) wrote: > > +void blk_probe_disconnect(void) > > +{ > > + uint8_t i; > > + > > + for (i = 0; i < NUM_PROBES; i++) { > > + marker_remove_probe(probe_array[i].name); > > + } > > + synchronize_sched();/* Wait for probes to finish */ > > kprobes does this kind of synchronization internally, so the marker > wrapper should probabl aswell. > The problem appears on heavily loaded systems. Doing 50 synchronize_sched() calls in a row can take up to a few seconds on a 4-way machine. This is why I prefer to do it in the module to which the callbacks belong. Here is the reviewed patch. It depends on a newer version of markers I'll send to Andrew soon. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Index: linux-2.6-lttng/block/elevator.c === --- linux-2.6-lttng.orig/block/elevator.c 2007-05-03 12:27:12.0 -0400 +++ linux-2.6-lttng/block/elevator.c2007-05-03 12:54:58.0 -0400 @@ -32,7 +32,7 @@ #include #include #include -#include +#include #include #include @@ -571,7 +571,7 @@ unsigned ordseq; int unplug_it = 1; - blk_add_trace_rq(q, rq, BLK_TA_INSERT); + trace_mark(blk_request_insert, "%p %p", q, rq); rq->q = q; @@ -757,7 +757,7 @@ * not be passed by new incoming requests */ rq->cmd_flags |= REQ_STARTED; - blk_add_trace_rq(q, rq, BLK_TA_ISSUE); + trace_mark(blk_request_issue, "%p %p", q, rq); } if (!q->boundary_rq || q->boundary_rq == rq) { Index: linux-2.6-lttng/block/ll_rw_blk.c === --- linux-2.6-lttng.orig/block/ll_rw_blk.c 2007-05-03 12:27:12.0 -0400 +++ linux-2.6-lttng/block/ll_rw_blk.c 2007-05-03 12:54:58.0 -0400 @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -1551,7 +1552,7 @@ if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, >queue_flags)) { mod_timer(>unplug_timer, jiffies + q->unplug_delay); - blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); + trace_mark(blk_plug_device, "%p %p %d", q, NULL, 0); } } @@ -1617,7 +1618,7 @@ * devices don't necessarily have an ->unplug_fn defined */ if (q->unplug_fn) { - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1628,7 +1629,7 @@ { request_queue_t *q = container_of(work, request_queue_t, unplug_work); - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1638,7 +1639,7 @@ { request_queue_t *q = (request_queue_t *)data; - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, + trace_mark(blk_pdu_unplug_timer, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); kblockd_schedule_work(>unplug_work); @@ -2148,7 +2149,7 @@ rq_init(q, rq); - blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ); + trace_mark(blk_get_request, "%p %p %d", q, bio, rw); out: return rq; } @@ -2178,7 +2179,7 @@ if (!rq) { struct io_context *ioc; - blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ); + trace_mark(blk_sleep_request, "%p %p %d", q, bio, rw); __generic_unplug_device(q); spin_unlock_irq(q->queue_lock); @@ -2252,7 +2253,7 @@ */ void blk_requeue_request(request_queue_t *q, struct request *rq) { - blk_add_trace_rq(q, rq, BLK_TA_REQUEUE); + trace_mark(blk_requeue, "%p %p", q, rq); if (blk_rq_tagged(rq)) blk_queue_end_tag(q, rq); @@ -2937,7 +2938,7 @@ if (!ll_back_merge_fn(q, req, bio)) break; - blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE); + trace_mark(blk_bio_backmerge, "%p %p", q, bio); req->biotail->bi_next = bio; req->biotail = bio; @@ -2954,7 
+2955,7 @@ if (!ll_front_merge_fn(q, req, bio)) break; - blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE); + trace_mark(blk_bio_frontmerge, "%p %p", q, bio); bio->bi_next = req->bio;
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > void fastcall unlock_page(struct page *page) > { > + VM_BUG_ON(!PageLocked(page)); > smp_mb__before_clear_bit(); > - if (!TestClearPageLocked(page)) > - BUG(); > - smp_mb__after_clear_bit(); > - wake_up_page(page, PG_locked); > + ClearPageLocked(page); > + if (unlikely(test_bit(PG_waiters, &page->flags))) { > + clear_bit(PG_waiters, &page->flags); > + wake_up_page(page, PG_locked); > + } > } Why is that significantly faster than plain old wake_up_page(), which tests waitqueue_active()? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
Hmmm... There are a gazillion configs to choose from. It works fine with cell_defconfig. If I switch to 2 processors I can enable SLUB; if I switch to 4 I cannot. I saw some other config weirdness, like being unable to set SMP if SLOB is enabled with some configs. This should not work and does not work, but the menus then vanish and one can still configure lots of processors without having enabled SMP. It works as far as I can tell... The rest is arch weirdness. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007, William Lee Irwin III wrote: > I've seen this crash in flush_old_exec() before. ISTR it being due to > slub vs. pagetable alignment or something on that order. From other discussion regarding SLAB: It may be necessary to set the default alignment to 8 bytes on 32-bit powerpc because it requires that alignment for 64-bit values. SLUB will *not* disable debugging like SLAB does if you do that. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
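If the 8-byte default alignment route is taken, the usual way to express it is a per-arch minimum slab alignment rather than per-cache flags. The macro names below are shown only as a sketch and are an assumption about the mechanism of this era, not a tested powerpc patch:

/*
 * Sketch only: raising the arch minimum object alignment keeps 64-bit
 * fields naturally aligned on 32-bit powerpc, and unlike SLAB, SLUB is
 * said above to keep its debugging usable with this set.
 */
#ifndef ARCH_SLAB_MINALIGN
#define ARCH_SLAB_MINALIGN	8
#endif

#ifndef ARCH_KMALLOC_MINALIGN
#define ARCH_KMALLOC_MINALIGN	8
#endif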
Re: swap-prefetch: 2.6.22 -mm merge plans
On 03/05/07, Michal Piotrowski <[EMAIL PROTECTED]> wrote: Hi, On 03/05/07, Ingo Molnar <[EMAIL PROTECTED]> wrote: > > * Andrew Morton <[EMAIL PROTECTED]> wrote: > > > - If replying, please be sure to cc the appropriate individuals. > > Please also consider rewriting the Subject: to something > > appropriate. > > i'm wondering about swap-prefetch: > > mm-implement-swap-prefetching.patch > swap-prefetch-avoid-repeating-entry.patch > add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch > > The swap-prefetch feature is relatively compact: > >10 files changed, 745 insertions(+), 1 deletion(-) > > it is contained mostly to itself: > >mm/swap_prefetch.c| 581 > > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback > i've seen so far was positive. Time to have this upstream and time for a > desktop-oriented distro to pick it up. > > I think this has been held back way too long. It's .config selectable > and it is as ready for integration as it ever is going to be. So it's a > win/win scenario. I'm using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs) since 2.6.14-mm2 :) http://lkml.org/lkml/2005/11/11/260 Regards, Michal -- Michal K. K. Piotrowski Kernel Monkeys (http://kernel.wikidot.com/start) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Hi, On 03/05/07, Ingo Molnar <[EMAIL PROTECTED]> wrote: * Andrew Morton <[EMAIL PROTECTED]> wrote: > - If replying, please be sure to cc the appropriate individuals. > Please also consider rewriting the Subject: to something > appropriate. i'm wondering about swap-prefetch: mm-implement-swap-prefetching.patch swap-prefetch-avoid-repeating-entry.patch add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch The swap-prefetch feature is relatively compact: 10 files changed, 745 insertions(+), 1 deletion(-) it is contained mostly to itself: mm/swap_prefetch.c| 581 i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up. I think this has been held back way too long. It's .config selectable and it is as ready for integration as it ever is going to be. So it's a win/win scenario. I'm using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs) http://www.stardust.webpages.pl/files/tbf/euridica/2.6.17-mm1/mm-config and I don't recall _any_ problems. It's very stable for me. Acked-by: Ingo Molnar <[EMAIL PROTECTED]> Ingo Regards, Michal -- Michal K. K. Piotrowski Kernel Monkeys (http://kernel.wikidot.com/start) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
* Andrew Morton <[EMAIL PROTECTED]> wrote: > - If replying, please be sure to cc the appropriate individuals. > Please also consider rewriting the Subject: to something > appropriate. i'm wondering about swap-prefetch: mm-implement-swap-prefetching.patch swap-prefetch-avoid-repeating-entry.patch add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch The swap-prefetch feature is relatively compact: 10 files changed, 745 insertions(+), 1 deletion(-) it is contained mostly to itself: mm/swap_prefetch.c| 581 i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up. I think this has been held back way too long. It's .config selectable and it is as ready for integration as it ever is going to be. So it's a win/win scenario. Acked-by: Ingo Molnar <[EMAIL PROTECTED]> Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Thu, May 03, 2007 at 11:04:15AM -0400, Mathieu Desnoyers wrote: > - blk_add_trace_rq(q, rq, BLK_TA_INSERT); > + MARK(blk_request_insert, "%p %p", q, rq); I don't really like the shouting MARK name very much. Can we have a less-generic, less shouting name, e.g. trace_marker? The above would then be: trace_mark(blk_request_insert, "%p %p", q, rq); > +#define NUM_PROBES ARRAY_SIZE(probe_array) just get rid of this and use ARRAY_SIZE directly below. > +int blk_probe_connect(void) > +{ > + int result; > + uint8_t i; just use an int for for loops. It's easier to read and probably faster on most systems (if the compiler isn't smart enough it promotes it to int anyway during code generation). > +void blk_probe_disconnect(void) > +{ > + uint8_t i; > + > + for (i = 0; i < NUM_PROBES; i++) { > + marker_remove_probe(probe_array[i].name); > + } > + synchronize_sched();/* Wait for probes to finish */ kprobes does this kind of synchronization internally, so the marker wrapper should probably as well. > +static int blk_probes_ref = 0; no need to initialize this. > /* > * Send out a notify message. > */ > @@ -229,6 +233,12 @@ > blk_remove_tree(bt->dir); > free_percpu(bt->sequence); > kfree(bt); > + mutex_lock(&blk_probe_mutex); > + if (blk_probes_ref == 1) { > + blk_probe_disconnect(); > + blk_probes_ref--; > + } if (--blk_probes_ref == 0) blk_probe_disconnect(); would probably be a tad cleaner. > + if (!blk_probes_ref) { > + blk_probe_connect(); > + blk_probes_ref++; > + } Ditto here with a: if (!blk_probes_ref++) blk_probe_connect(); also the "connect" in the name seems rather odd; what about arm/disarm instead? > static __init int blk_trace_init(void) > { > mutex_init(&blk_tree_mutex); > + mutex_init(&blk_probe_mutex); both should use DEFINE_MUTEX for compile-time initialization instead. Also it's probably better to put the trace points into blktrace.c, that means all blktrace code can be static and self-contained. And we can probably do some additional cleanups by simplifying things later on. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
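Pulling those review comments together, the reference-counted wrapper might end up looking something like this sketch. The function names blk_probes_get()/blk_probes_put() are placeholders invented here; blk_probe_connect()/blk_probe_disconnect() are the ones from the patch under review, and the whole thing assumes the blktrace.c context:

static DEFINE_MUTEX(blk_probe_mutex);
static int blk_probes_ref;	/* implicitly zero, no initializer needed */

/* "arm": only the first user actually hooks the markers */
static int blk_probes_get(void)
{
	int ret = 0;

	mutex_lock(&blk_probe_mutex);
	if (!blk_probes_ref++)
		ret = blk_probe_connect();
	mutex_unlock(&blk_probe_mutex);
	return ret;
}

/* "disarm": the last user tears the probes down again */
static void blk_probes_put(void)
{
	mutex_lock(&blk_probe_mutex);
	if (--blk_probes_ref == 0)
		blk_probe_disconnect();
	mutex_unlock(&blk_probe_mutex);
}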
Re: 2.6.22 -mm merge plans
* Andrew Morton ([EMAIL PROTECTED]) wrote: > > Although some, like Christoph and myself, think that it would benefit to > > the kernel community to have a common infrastructure for more than just > > markers (meaning common serialization and buffering mechanism), it does > > not change the fact that the markers, being in mainline, are usable by > > projects through additional kernel modules. > > > > If we are looking at current "potential users" that are already in > > mainline, we could change blktrace to make it use the markers. > > That'd be a useful demonstration. Here is a proof of concept patch, for demonstration purposes, of moving blktrace to the markers. A few remarks : this patch has the positive effect of removing some code from the block io tracing hot paths, minimizing the i-cache impact in a system where the io tracing is compiled in but inactive. It also moves the blk tracing code from a header (and therefore from the body of the instrumented functions) to a separate C file. There, as soon as one device has to be traced, every device has to fall into the tracing function call. This is slower than the previous inline function which tested the condition quickly. If it becomes a show stopper, it could be fixed by having the possibility to test a supplementary condition, dependent on the marker context, at the marker site, just after the enable/disable test. It does not make the code smaller, since I left all the specialized tracing functions for requests, bio, generic, remap, which would go away once a generic infrastructure is in place to serialize the information passed to the marker. This is mostly why I consider it a proof of concept. The patch, named "markers-port-blktrace-to-markers.patch", can be placed after the marker patches in the 2.6.21-rc7-mm2 series. 
Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Index: linux-2.6-lttng/block/elevator.c === --- linux-2.6-lttng.orig/block/elevator.c 2007-05-02 20:33:22.0 -0400 +++ linux-2.6-lttng/block/elevator.c2007-05-02 20:33:49.0 -0400 @@ -32,7 +32,7 @@ #include #include #include -#include +#include #include #include @@ -571,7 +571,7 @@ unsigned ordseq; int unplug_it = 1; - blk_add_trace_rq(q, rq, BLK_TA_INSERT); + MARK(blk_request_insert, "%p %p", q, rq); rq->q = q; @@ -757,7 +757,7 @@ * not be passed by new incoming requests */ rq->cmd_flags |= REQ_STARTED; - blk_add_trace_rq(q, rq, BLK_TA_ISSUE); + MARK(blk_request_issue, "%p %p", q, rq); } if (!q->boundary_rq || q->boundary_rq == rq) { Index: linux-2.6-lttng/block/ll_rw_blk.c === --- linux-2.6-lttng.orig/block/ll_rw_blk.c 2007-05-02 20:33:32.0 -0400 +++ linux-2.6-lttng/block/ll_rw_blk.c 2007-05-02 23:21:02.0 -0400 @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -1551,7 +1552,7 @@ if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, >queue_flags)) { mod_timer(>unplug_timer, jiffies + q->unplug_delay); - blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); + MARK(blk_plug_device, "%p %p %d", q, NULL, 0); } } @@ -1617,7 +1618,7 @@ * devices don't necessarily have an ->unplug_fn defined */ if (q->unplug_fn) { - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1628,7 +1629,7 @@ { request_queue_t *q = container_of(work, request_queue_t, unplug_work); - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1638,7 +1639,7 @@ { request_queue_t *q = (request_queue_t *)data; - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, + MARK(blk_pdu_unplug_timer, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); kblockd_schedule_work(>unplug_work); @@ -2148,7 +2149,7 @@ rq_init(q, rq); - blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ); + MARK(blk_get_request, "%p %p %d", q, bio, rw); out: return rq; } @@ -2178,7 +2179,7 @@ if (!rq) { struct io_context *ioc; - blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ); + MARK(blk_sleep_request, "%p %p %d", q, bio, rw); __generic_unplug_device(q);
Re: 2.6.22 -mm merge plans
Hi Andi, This plan makes sense. I will split the "patched in enable/disable flags" part into a separate piece (good idea!) and then submit the LTTng core to Andrew. Christoph has a good point about wanting a usable infrastructure to go in. Regarding your plan, I must argue that blktrace is not a general purpose tracing infrastructure, but one dedicated to block io tracing. Therefore, it makes sense to bring in the generic infrastructure first and then convert blktrace to it. Mathieu * Andi Kleen ([EMAIL PROTECTED]) wrote: > > If we are looking at current "potential users" that are already in > > mainline, we could change blktrace to make it use the markers. > > Ok, but do it step by step: > - split out useful pieces like the "patched in enable/disable flags" > and submit them separate with an example user or two > [I got a couple of candidates e.g. with some of the sysctls in VM or > networking] > - post and merge that. > - don't implement anything initially that is not needed by blktrace > - post a minimal marker patch together with the blktrace > conversion for review again on linux-kernel > - await review comments. This review would not cover the basic > need of markers, just the specific implementation. > - then potentially merge incorporate review comments > - then merge > - later add features with individual review/discussion as new users in the > kernel are added. > > -Andi -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
* Christoph Hellwig ([EMAIL PROTECTED]) wrote: > On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote: > > My statement was probably not clear enough. The actual marker code is > > useful as-is without any further kernel patching required : SystemTAP is > > an example where they use external modules to load probes that can > > connect either to markers or through kprobes. LTTng, in its current state, > > has a mostly modular core that also uses the markers. > > That just means you have to load an enormous amount of external crap > that replaces the missing kernel functionality. It's exactly the > situation we want to avoid. > It makes sense to use -mm to hold the whole usable infrastructure before submitting it to mainline. I will submit my core LTTng patches to Andrew in the following weeks. There is no hurry, from the LTTng perspective, to merge the markers sooner, although they could be useful to other (external) projects meanwhile. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Thu, 3 May 2007, Nick Piggin wrote: @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + set_bit(PG_waiters, &page->flags); + if (unlikely(!TestSetPageLocked(page))) { What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other? You're right, we can't clear the bit here. Doubt it mattered much anyway? Ah yes, that's a good easy answer. In fact, just remove this whole test and block (we already tried TestSetPageLocked outside just a short while ago, so this repeat won't often save anything). Yeah, I was getting too clever for my own boots :) I think the patch has merit though. Unfortunate that it uses another page flag, however it seemed to have quite a bit of speedup on unlock_page (probably from both the barriers and an extra random cacheline load (from the hash)). I guess it has to get good results from more benchmarks... BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page above... that barrier is in the slow path as well though, so it shouldn't matter either. I vaguely wondered how such barriers had managed to dissolve away, but cranking my brain up to think about barriers takes far too long. That barrier was one too many :) However I believe the fastpath barrier can go away because the PG_locked operation depends on the same cacheline as PG_waiters. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
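For reference, the hashed lookup being avoided on the unlock fast path is roughly the following (paraphrased from mm/filemap.c of this era, details simplified): wake_up_page() first has to map the page to a wait queue through the zone's wait table, a likely cache miss even when nobody is waiting, whereas PG_waiters sits in page->flags, the word unlock_page is already writing.

#include <linux/mm.h>
#include <linux/hash.h>
#include <linux/wait.h>

/* Roughly what page_waitqueue() does: hash into the zone's wait table. */
static wait_queue_head_t *page_waitqueue_sketch(struct page *page)
{
	const struct zone *zone = page_zone(page);

	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}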
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 3 May 2007, Nick Piggin wrote: > >>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) > > > { > > > DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); > > > > > >+ set_bit(PG_waiters, &page->flags); > > >+ if (unlikely(!TestSetPageLocked(page))) { > > > > What happens if another cpu is coming through __lock_page at the > > same time, did its set_bit, now finds PageLocked, and so proceeds > > to the __wait_on_bit_lock? But this cpu now clears PG_waiters, > > so this task's unlock_page won't wake the other? > > You're right, we can't clear the bit here. Doubt it mattered much anyway? Ah yes, that's a good easy answer. In fact, just remove this whole test and block (we already tried TestSetPageLocked outside just a short while ago, so this repeat won't often save anything). > > BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page > above... that barrier is in the slow path as well though, so it shouldn't > matter either. I vaguely wondered how such barriers had managed to dissolve away, but cranking my brain up to think about barriers takes far too long. > > >+ clear_bit(PG_waiters, &page->flags); > > >+ return; > > >+ } > > > __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page, > > >TASK_UNINTERRUPTIBLE); > >> } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
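With the racy fast path dropped as suggested, the sketch of __lock_page reduces to the following (illustrative only, assuming Nick's PG_waiters patch as the base and the mm/filemap.c context for sync_page and page_waitqueue):

void fastcall __lock_page(struct page *page)
{
	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

	/*
	 * Record that someone is waiting, then sleep on the bit; there is
	 * no second TestSetPageLocked fast path left to race with unlock_page.
	 */
	set_bit(PG_waiters, &page->flags);
	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
							TASK_UNINTERRUPTIBLE);
}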
Re: 2.6.22 -mm merge plans -- vm bugfixes
Christoph Hellwig wrote: On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote: The attached patch gets performance up a bit by avoiding some barriers and some cachelines: G5 pagefault fork exec 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 +patch 1.71-1.73 175.2-180.8 780.5-794.2 +patch2 1.61-1.63 169.8-175.0 748.6-757.0 So that brings the fork/exec hits down to much less than 5%, and would likely speed up other things that lock the page, like write or page reclaim. Is that every fork/exec or just under certain cicumstances? A 5% regression on every fork/exec is not acceptable. Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers will be improved as well with that patch. Then if we have specific lock/unlock bitops, I hope it should reduce that further. The overhead that is there should just be coming from the extra overhead in the file backed fault handler. For noop fork/execs, I think that tends to be more pronounced, it is hard to see any difference on any non-micro benchmark. The other thing is that I think there could be some cache effects happening -- for example the exec numbers on the 2nd line are disproportionately large. It definitely isn't a good thing to drop performance anywhere though, so I'll keep looking for improvements. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Thu, 3 May 2007, Nick Piggin wrote: The problem is that lock/unlock_page is expensive on powerpc, and if we improve that, we improve more than just the fault handler... The attached patch gets performance up a bit by avoiding some barriers and some cachelines: There's a strong whiff of raciness about this... but I could very easily be wrong. Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2007-05-02 15:00:26.0 +1000 +++ linux-2.6/mm/filemap.c 2007-05-03 08:34:32.0 +1000 @@ -532,11 +532,13 @@ */ void fastcall unlock_page(struct page *page) { + VM_BUG_ON(!PageLocked(page)); smp_mb__before_clear_bit(); - if (!TestClearPageLocked(page)) - BUG(); - smp_mb__after_clear_bit(); - wake_up_page(page, PG_locked); + ClearPageLocked(page); + if (unlikely(test_bit(PG_waiters, >flags))) { + clear_bit(PG_waiters, >flags); + wake_up_page(page, PG_locked); + } } EXPORT_SYMBOL(unlock_page); @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) { DEFINE_WAIT_BIT(wait, >flags, PG_locked); + set_bit(PG_waiters, >flags); + if (unlikely(!TestSetPageLocked(page))) { What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other? You're right, we can't clear the bit here. Doubt it mattered much anyway? BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page above... that barrier is in the slow path as well though, so it shouldn't matter either. + clear_bit(PG_waiters, >flags); + return; + } __wait_on_bit_lock(page_waitqueue(page), , sync_page, TASK_UNINTERRUPTIBLE); } -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 3 May 2007, Nick Piggin wrote: > > The problem is that lock/unlock_page is expensive on powerpc, and > if we improve that, we improve more than just the fault handler... > > The attached patch gets performance up a bit by avoiding some > barriers and some cachelines: There's a strong whiff of raciness about this... but I could very easily be wrong. > Index: linux-2.6/mm/filemap.c > === > --- linux-2.6.orig/mm/filemap.c 2007-05-02 15:00:26.0 +1000 > +++ linux-2.6/mm/filemap.c2007-05-03 08:34:32.0 +1000 > @@ -532,11 +532,13 @@ > */ > void fastcall unlock_page(struct page *page) > { > + VM_BUG_ON(!PageLocked(page)); > smp_mb__before_clear_bit(); > - if (!TestClearPageLocked(page)) > - BUG(); > - smp_mb__after_clear_bit(); > - wake_up_page(page, PG_locked); > + ClearPageLocked(page); > + if (unlikely(test_bit(PG_waiters, >flags))) { > + clear_bit(PG_waiters, >flags); > + wake_up_page(page, PG_locked); > + } > } > EXPORT_SYMBOL(unlock_page); > > @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) > { > DEFINE_WAIT_BIT(wait, >flags, PG_locked); > > + set_bit(PG_waiters, >flags); > + if (unlikely(!TestSetPageLocked(page))) { What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other? > + clear_bit(PG_waiters, >flags); > + return; > + } > __wait_on_bit_lock(page_waitqueue(page), , sync_page, > TASK_UNINTERRUPTIBLE); > } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote: > The attached patch gets performance up a bit by avoiding some > barriers and some cachelines: > > G5 > pagefault fork exec > 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 > +patch 1.71-1.73 175.2-180.8 780.5-794.2 > +patch2 1.61-1.63 169.8-175.0 748.6-757.0 > > So that brings the fork/exec hits down to much less than 5%, and > would likely speed up other things that lock the page, like write > or page reclaim. Is that every fork/exec or just under certain cicumstances? A 5% regression on every fork/exec is not acceptable. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
> If we are looking at current "potential users" that are already in > mainline, we could change blktrace to make it use the markers. Ok, but do it step by step: - split out useful pieces like the "patched in enable/disable flags" and submit them separately with an example user or two [I got a couple of candidates e.g. with some of the sysctls in VM or networking] - post and merge that. - don't implement anything initially that is not needed by blktrace - post a minimal marker patch together with the blktrace conversion for review again on linux-kernel - await review comments. This review would not cover the basic need of markers, just the specific implementation. - then potentially incorporate review comments - then merge - later add features with individual review/discussion as new users in the kernel are added. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 11:46:40PM +0200, Tilman Schmidt wrote: > On 02.05.2007 19:49, Andi Kleen wrote: > > And also I think when something is merged it should have some users in tree. > Isn't that a circular dependency? The normal mode of operation is to merge the initial users and the subsystem at the same time. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007, Andrew Morton wrote: > On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> > wrote: > > On Thu, 3 May 2007, Andrew Morton wrote: > > > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL > > > PROTECTED]> wrote: > > > > > > > +config ARCH_USES_SLAB_PAGE_STRUCT > > > > + bool > > > > + default y > > > > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > > > > + > > > > > > That all seems to work as intended. > > > > > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > > > machine early in boot. > > > > I thought that if that worked as intended, you wouldn't even > > get the chance to choose SLUB=y? That was how it was working > > for me (but I realize I didn't try more than make oldconfig). > > > > That sounds like what happens when SLUB's pagestruct use meets > > SPLIT_PTLOCK's pagestruct use. Does your .config really show > > CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y? > > Nope. > > g5:/usr/src/25> grep SLUB .config > CONFIG_SLUB=y > g5:/usr/src/25> grep SLAB .config > # CONFIG_SLAB is not set > g5:/usr/src/25> grep CPUS .config > CONFIG_NR_CPUS=8 > # CONFIG_CPUSETS is not set > # CONFIG_IRQ_ALL_CPUS is not set > CONFIG_SPLIT_PTLOCK_CPUS=4 > > It's in http://userweb.kernel.org/~akpm/config-g5.txt Seems we're all wrong in thinking Christoph's Kconfiggery worked as intended: maybe it just works some of the time. I'm not going to hazard a guess as to how to fix it up, will resume looking at the powerpc's quicklist potential later. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote: > On Thu, 3 May 2007, Andrew Morton wrote: > > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL > > PROTECTED]> wrote: > > > > > +config ARCH_USES_SLAB_PAGE_STRUCT > > > + bool > > > + default y > > > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > > > + > > > > That all seems to work as intended. > > > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > > machine early in boot. > > I thought that if that worked as intended, you wouldn't even > get the chance to choose SLUB=y? That was how it was working > for me (but I realize I didn't try more than make oldconfig). Right. This can be tested on x86 without a cross-compiler: ARCH=powerpc make mrproper ARCH=powerpc make fooconfig > > > > Too early for netconsole, no serial console. Wedges up uselessly with > > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with > > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( > > > > However I was able to glimpse some stuff as it flew past. Crash started in > > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't > > know > > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial > > card, perhaps. > > That sounds like what happens when SLUB's pagestruct use meets > SPLIT_PTLOCK's pagestruct use. Does your .config really show > CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y? > Nope. g5:/usr/src/25> grep SLUB .config CONFIG_SLUB=y g5:/usr/src/25> grep SLAB .config # CONFIG_SLAB is not set g5:/usr/src/25> grep CPUS .config CONFIG_NR_CPUS=8 # CONFIG_CPUSETS is not set # CONFIG_IRQ_ALL_CPUS is not set CONFIG_SPLIT_PTLOCK_CPUS=4 It's in http://userweb.kernel.org/~akpm/config-g5.txt - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007, Andrew Morton wrote: > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> > wrote: > > > +config ARCH_USES_SLAB_PAGE_STRUCT > > + bool > > + default y > > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > > + > > That all seems to work as intended. > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > machine early in boot. I thought that if that worked as intended, you wouldn't even get the chance to choose SLUB=y? That was how it was working for me (but I realize I didn't try more than make oldconfig). > > Too early for netconsole, no serial console. Wedges up uselessly with > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( > > However I was able to glimpse some stuff as it flew past. Crash started in > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial > card, perhaps. That sounds like what happens when SLUB's pagestruct use meets SPLIT_PTLOCK's pagestruct use. Does your .config really show CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y? Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, May 03, 2007 at 01:15:15AM -0700, Andrew Morton wrote: > That all seems to work as intended. > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > machine early in boot. > Too early for netconsole, no serial console. Wedges up uselessly with > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( > However I was able to glimpse some stuff as it flew past. Crash started in > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial > card, perhaps. I've seen this crash in flush_old_exec() before. ISTR it being due to slub vs. pagetable alignment or something on that order. -- wli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Wed, 2 May 2007, Hugh Dickins wrote: > > > I presume the answer is just to extend your quicklist work to > > powerpc's lowest level of pagetables. The only other architecture > > which is using kmem_cache for them is arm26, which has > > "#error SMP is not supported", so won't be giving this problem. > > In the meantime we would need something like this to disable SLUB in this > particular configuration. Note that I have not tested this and the <= for > the comparision with SPLIT_PTLOCK_CPUS may not work (Never seen such a > construct in a Kconfig file but it is needed here). > > > > PowerPC: Disable SLUB for configurations in which slab page structs are > modified > > PowerPC uses the slab allocator to manage the lowest level of the page table. > In high cpu configurations we also use the page struct to split the page > table lock. Disallow the selection of SLUB for that case. > > [Not tested: I am not familiar with powerpc build procedures etc] > > Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> > > Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig > === > --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig2007-05-02 > 10:07:34.0 -0700 > +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig 2007-05-02 10:13:37.0 > -0700 > @@ -117,6 +117,19 @@ config GENERIC_BUG > default y > depends on BUG > > +# > +# Powerpc uses the slab allocator to manage its ptes and the > +# page structs of ptes are used for splitting the page table > +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS. > +# > +# In that special configuration the page structs of slabs are modified. > +# This setting disables the selection of SLUB as a slab allocator. > +# > +config ARCH_USES_SLAB_PAGE_STRUCT > + bool > + default y > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > + That all seems to work as intended. However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the machine early in boot. Too early for netconsole, no serial console. Wedges up uselessly with CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( However I was able to glimpse some stuff as it flew past. Crash started in flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial card, perhaps. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 04:36:27PM -0400, Mathieu Desnoyers wrote: > The idea is the following : either we integrate the infrastructure for > instrumentation / data serialization / buffer management / extraction of > data to user space in multiple different steps, which makes code review > easier for you guys, the staging area for that is -mm - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 01:53:36PM -0700, Andrew Morton wrote: > In which case we have: > > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch > atomich-complete-atomic_long-operations-in-asm-generic.patch > atomich-i386-type-safety-fix.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch > atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch > local_t-architecture-independant-extension.patch > local_t-alpha-extension.patch > local_t-i386-extension.patch > local_t-ia64-extension.patch > local_t-mips-extension.patch > local_t-parisc-cleanup.patch > local_t-powerpc-extension.patch > local_t-sparc64-cleanup.patch > local_t-x86_64-extension.patch > > For 2.6.22 > > linux-kernel-markers-kconfig-menus.patch > linux-kernel-markers-architecture-independant-code.patch > linux-kernel-markers-powerpc-optimization.patch > linux-kernel-markers-i386-optimization.patch > markers-add-instrumentation-markers-menus-to-avr32.patch > linux-kernel-markers-non-optimized-architectures.patch > markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch > linux-kernel-markers-documentation.patch > # > markers-define-the-linker-macro-extra_rwdata.patch > markers-use-extra_rwdata-in-architectures.patch > # > some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch > no-longer-include-asm-kdebugh.patch This is a plan I can fully agree with. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote: > My statement was probably not clear enough. The actual marker code is > useful as-is without any further kernel patching required : SystemTAP is > an example where they use external modules to load probes that can > connect either to markers or through kprobes. LTTng, in its current state, > has a mostly modular core that also uses the markers. That just means you have to load an enormous amount of external crap that replaces the missing kernel functionality. It's exactly the situation we want to avoid. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/