Re: 2.6.22 -mm merge plans
On Thu, May 10, 2007 at 03:39:36PM -0400, Mathieu Desnoyers wrote:
> * Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> > On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > > kprobes does this kind of synchronization internally, so the marker wrapper should probably as well.
> > >
> > > The problem appears on heavily loaded systems. Doing 50 synchronize_sched() calls in a row can take up to a few seconds on a 4-way machine. This is why I prefer to do it in the module to which the callbacks belong.
> >
> > We recently had a discussion on a batch unregistration interface for kprobes. I'm not very happy with having such different interfaces for different kinds of probe registration.
>
> Ok, I've had a look at the kprobes batch registration mechanisms and, well, they do not look well suited for the markers. Adding supplementary data structures such as linked lists of probes does not look like a good match.
>
> However, I agree with you that providing a similar API is good. Therefore, here is my proposal:
>
> The goal is to do the synchronize just after we unregister the last probe handler provided by a given module. Since the unregistration functions iterate over every marker present in the kernel, we can keep a count of how many probes provided by the same module are still present. If we see that we unregistered the last probe pointing to this module, we issue a synchronize_sched().
>
> It adds no data structure and keeps the same order of complexity as what is already there; we only have to do 2 passes over the marker structures: the first one finds the module associated with the callback and the second disables the callbacks and keeps a count of the number of callbacks associated with the module.
>
> Mathieu
>
> P.S.: here is the code.
>
> Linux Kernel Markers - Architecture Independent code. Provide internal synchronize_sched() in batch.
>
> The goal is to do the synchronize just after we unregister the last probe handler provided by a given module. Since the unregistration functions iterate over every marker present in the kernel, we can keep a count of how many probes provided by the same module are still present. If we see that we unregistered the last probe pointing to this module, we issue a synchronize_sched().
>
> It adds no data structure and keeps the same order of complexity as what is already there; we only have to do 2 passes over the marker structures: the first one finds the module associated with the callback and the second disables the callbacks and keeps a count of the number of callbacks associated with the module.

Looks good to me, please incorporate this in the next round of the markers patch series.
Re: 2.6.22 -mm merge plans
* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > kprobes does this kind of synchronization internally, so the marker wrapper should probably as well.
> >
> > The problem appears on heavily loaded systems. Doing 50 synchronize_sched() calls in a row can take up to a few seconds on a 4-way machine. This is why I prefer to do it in the module to which the callbacks belong.
>
> We recently had a discussion on a batch unregistration interface for kprobes. I'm not very happy with having such different interfaces for different kinds of probe registration.

Ok, I've had a look at the kprobes batch registration mechanisms and, well, they do not look well suited for the markers. Adding supplementary data structures such as linked lists of probes does not look like a good match.

However, I agree with you that providing a similar API is good. Therefore, here is my proposal:

The goal is to do the synchronize just after we unregister the last probe handler provided by a given module. Since the unregistration functions iterate over every marker present in the kernel, we can keep a count of how many probes provided by the same module are still present. If we see that we unregistered the last probe pointing to this module, we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what is already there; we only have to do 2 passes over the marker structures: the first one finds the module associated with the callback and the second disables the callbacks and keeps a count of the number of callbacks associated with the module.

Mathieu

P.S.: here is the code.

Linux Kernel Markers - Architecture Independent code. Provide internal synchronize_sched() in batch.

The goal is to do the synchronize just after we unregister the last probe handler provided by a given module. Since the unregistration functions iterate over every marker present in the kernel, we can keep a count of how many probes provided by the same module are still present. If we see that we unregistered the last probe pointing to this module, we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what is already there; we only have to do 2 passes over the marker structures: the first one finds the module associated with the callback and the second disables the callbacks and keeps a count of the number of callbacks associated with the module.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
---
 kernel/module.c |   62 ++--
 1 file changed, 52 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/kernel/module.c
===
--- linux-2.6-lttng.orig/kernel/module.c	2007-05-10 14:48:28.0 -0400
+++ linux-2.6-lttng/kernel/module.c	2007-05-10 15:38:27.0 -0400
@@ -404,8 +404,12 @@
 }
 
 /* Sets a range of markers to a disabled state : unset the enable bit and
- * provide the empty callback. */
+ * provide the empty callback.
+ * Keep a count of other markers connected to the same module as the one
+ * provided as parameter.
+ */
 static int marker_remove_probe_range(const char *name,
+		struct module *probe_module,
+		int *ref_count,
 		const struct __mark_marker *begin,
 		const struct __mark_marker *end)
 {
@@ -413,12 +417,17 @@
 	int found = 0;
 
 	for (iter = begin; iter < end; iter++) {
-		if (strcmp(name, iter->mdata->name) == 0) {
-			marker_set_enable(iter->enable, 0,
-				iter->mdata->flags);
-			iter->mdata->call = __mark_empty_function;
-			found++;
+		if (strcmp(name, iter->mdata->name) != 0) {
+			if (probe_module)
+				if (__module_text_address(
+					(unsigned long)iter->mdata->call)
+					== probe_module)
+					(*ref_count)++;
+			continue;
 		}
+		marker_set_enable(iter->enable, 0, iter->mdata->flags);
+		iter->mdata->call = __mark_empty_function;
+		found++;
 	}
 	return found;
 }
@@ -450,6 +459,29 @@
 	return found;
 }
 
+/* Get the module to which the probe handler's text belongs.
+ * Called with module_mutex taken.
+ * Returns NULL if the probe handler is not in a module. */
+static struct module *__marker_get_probe_module(const char *name)
+{
+	struct module *mod;
+	const struct __mark_marker *iter;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		for (iter = mod->markers;
+			iter < mod->markers+mod->num_markers; iter++) {
+			if (strcmp(name,
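[The patch above is cut off before the counting logic completes. For illustration only, here is a minimal, self-contained userspace sketch of the two-pass scheme the changelog describes. The structures, the explicit owner field, and synchronize_sched_stub() are invented stand-ins; the real patch derives ownership from __module_text_address() on the callback pointer and calls the kernel's synchronize_sched().]

/*
 * Hypothetical userspace sketch of the two-pass batched unregistration:
 * pass 1 finds the module owning the probe being removed, pass 2 disables
 * the matching markers and counts other probes still owned by that module.
 * Only when the count drops to zero is the (stubbed) RCU sync issued.
 */
#include <stdio.h>
#include <string.h>

struct module { const char *name; };

struct marker {
    const char *name;        /* marker name */
    void (*call)(void);      /* probe callback currently attached */
    struct module *owner;    /* module providing the callback (stand-in) */
    int enabled;
};

static void empty_probe(void) { }          /* stands in for __mark_empty_function */

static void synchronize_sched_stub(void)   /* stands in for synchronize_sched() */
{
    printf("synchronize_sched() issued once, after the last probe went away\n");
}

/* Pass 1: find the module owning the probe attached to marker "name". */
static struct module *probe_module_of(struct marker *m, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(m[i].name, name) == 0)
            return m[i].owner;
    return NULL;
}

/* Pass 2: disable matching markers, count other probes still owned by mod. */
static void remove_probe(struct marker *m, int n, const char *name)
{
    struct module *mod = probe_module_of(m, n, name);
    int ref_count = 0;

    for (int i = 0; i < n; i++) {
        if (strcmp(m[i].name, name) != 0) {
            if (mod && m[i].owner == mod)
                ref_count++;
            continue;
        }
        m[i].enabled = 0;
        m[i].call = empty_probe;
        m[i].owner = NULL;
    }
    if (mod && ref_count == 0)   /* last probe from this module: one RCU sync */
        synchronize_sched_stub();
}

int main(void)
{
    struct module lttng = { "lttng_probes" };
    struct marker markers[] = {
        { "sched_switch", empty_probe, &lttng, 1 },
        { "irq_entry",    empty_probe, &lttng, 1 },
    };

    remove_probe(markers, 2, "sched_switch"); /* no sync yet: irq_entry remains */
    remove_probe(markers, 2, "irq_entry");    /* last one: the sync happens here */
    return 0;
}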
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/10/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Huh? You already stated one version of it above, namely updatedb. But
>
> So a swapping problem with updatedb should be unusual and we'd like to see if we can fix it without resorting to prefetching.
>
> I know the theory behind swap prefetching, and I'm not saying it doesn't work, so I'll snip the rest of that.

updatedb is only part of the problem. The other part is that the kernel has an opportunity to preemptively return some of the evicted working set to RAM before I ask for it. No fancy use-once algorithm is going to address that, so your solution is provably incomplete for my problem. What's so hard to understand about that?
Re: swap-prefetch: 2.6.22 -mm merge plans
Ray Lee wrote:
>> Huh? You already stated one version of it above, namely updatedb. But

On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote:
> So a swapping problem with updatedb should be unusual and we'd like to see if we can fix it without resorting to prefetching.
> I know the theory behind swap prefetching, and I'm not saying it doesn't work, so I'll snip the rest of that.

I've not run updatedb in years, so I have no idea what it does to a modern kernel. It used to be an unholy terror of slab fragmentation and displacing user memory. The case of streaming kernel metadata IO is probably not quite as easy as streaming file IO.

Ray Lee wrote:
>> You said, effectively: "Use-once could be improved to deal with updatedb". I said I've been reading emails from Rik and others talking about that for four years now, and we're still talking about it. Were it merely updatedb, I'd say us userspace folk should step up and rewrite the damn thing to amortize its work. However, I and others feel it's only an example -- glaring, obviously -- of a more pervasive issue. A small issue, to be sure, but an issue nevertheless.

On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote:
> It isn't going to get fixed unless people complain about it. If you cover the use-once problem with swap prefetching, then it will never get fixed.

The policy people need to clean this up once and for all at some point. clameter's targeted reclaim bits for slub look like a plausible tactic, but are by no means comprehensive. Things need to attempt to eat their own tails before eating everyone else alive. Maybe we need to take hits on things such as badari's dd's to resolve the pathologies.

--
wli
Re: swap-prefetch: 2.6.22 -mm merge plans
Ray Lee wrote:
> On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Ray Lee wrote:
> > > On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > > You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.
> > >
> > > Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.
> >
> > I'd like to try helping. Tell me your problem.
>
> Huh? You already stated one version of it above, namely updatedb. But

So a swapping problem with updatedb should be unusual and we'd like to see if we can fix it without resorting to prefetching.

I know the theory behind swap prefetching, and I'm not saying it doesn't work, so I'll snip the rest of that.

> > What's wrong with the use-once we have? What improvements are you talking about?
>
> You said, effectively: "Use-once could be improved to deal with updatedb". I said I've been reading emails from Rik and others talking about that for four years now, and we're still talking about it. Were it merely updatedb, I'd say us userspace folk should step up and rewrite the damn thing to amortize its work. However, I and others feel it's only an example -- glaring, obviously -- of a more pervasive issue. A small issue, to be sure, but an issue nevertheless.

It isn't going to get fixed unless people complain about it. If you cover the use-once problem with swap prefetching, then it will never get fixed.

> > I don't think it is about energy or being mean, I'm just stating the issues I have with it.
>
> Nick, I in no way think you're being mean, and I'm sorry if I've given you that impression. However, if you're just stating the issues you have with it, then can I assume that you won't lobby against having this experiment merged?

Anybody is free to merge anything into their kernel. And if somebody asks for my issues with the swap prefetching patch, then I'll give them :)

--
SUSE Labs, Novell Inc.
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> Ray Lee wrote:
> > On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.
> >
> > Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.
>
> I'd like to try helping. Tell me your problem.

Huh? You already stated one version of it above, namely updatedb. But let's put this another way, shall we? A gedankenexperiment, if you will.

Say we have a perfect swap-out algorithm that can choose exactly what needs to be evicted to disk. ('Perfect', of course, is dependent upon one's metric, but let's go with "maximizes overall system utilization and minimizes IO wait time." Arbitrary, but hey.)

So, great, the right things got swapped out. Anything else that could have been chosen would have caused more overall IO wait. Yay us.

So what happens when those processes that triggered the swap-outs go away? (Firefox is closed, I stop hitting my local copy of a database, whatever.) Well, currently, nothing. What happens when I switch workspaces and try to use my email program? Swap-ins. Okay, so why didn't the system swap that stuff in preemptively? Why am I sitting there waiting for something that it could have already done in the background?

A new swap-out algorithm, be it use-once, Clock-Pro, or perfect foreknowledge, isn't going to change that issue. Swap prefetch does.

> > Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago.
>
> What's wrong with the use-once we have? What improvements are you talking about?

You said, effectively: "Use-once could be improved to deal with updatedb". I said I've been reading emails from Rik and others talking about that for four years now, and we're still talking about it. Were it merely updatedb, I'd say us userspace folk should step up and rewrite the damn thing to amortize its work. However, I and others feel it's only an example -- glaring, obviously -- of a more pervasive issue. A small issue, to be sure, but an issue nevertheless.

In general, I/others are talking about improving the desktop experience of running too much on a RAM-limited machine. (Which, in my case, is with a gig and a 2.2GHz processor.) Or restated: the desktop experience occasionally sucks for me, and I don't think I'm alone. There may be a heuristic, completely isolated from userspace (and so isn't an API the kernel has to support! -- if it doesn't work, we can rip it out again), that may mitigate the suckiness. Let's try it.

> > So enough with the stop energy, okay? You're better than that.
>
> I don't think it is about energy or being mean, I'm just stating the issues I have with it.

Nick, I in no way think you're being mean, and I'm sorry if I've given you that impression. However, if you're just stating the issues you have with it, then can I assume that you won't lobby against having this experiment merged?
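[The behaviour Ray is asking for amounts to an idle-time kernel thread that trickles swapped pages back into memory while keeping their swap slots, so they can be dropped again for free under pressure. The sketch below is a hypothetical userspace mock-up of that policy loop, not code from Con's swap prefetch patch; every helper in it is an invented stub.]

/*
 * Hypothetical daemon-style prefetch loop: only work when the system is
 * idle and there is spare RAM, and stop when nothing is left in swap.
 */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool system_is_idle(void)       { return true; }  /* stub: no CPU/IO load   */
static long free_pages_available(void) { return 1024; }  /* stub: spare RAM, pages */
static long pop_swapped_page(void)     { return -1; }    /* stub: next swapped-out page, -1 if none */

/* Stub: read the page back into memory but KEEP the swap copy, so it can
 * be discarded again without IO if memory pressure returns. */
static void readin_page(long swp) { printf("prefetched swap slot %ld\n", swp); }

int main(void)
{
    for (;;) {
        if (!system_is_idle() || free_pages_available() == 0) {
            sleep(5);               /* back off while the system is busy */
            continue;
        }
        long swp = pop_swapped_page();
        if (swp < 0)
            break;                  /* nothing left in swap to prefetch */
        readin_page(swp);
    }
    return 0;
}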
Re: swap-prefetch: 2.6.22 -mm merge plans
On Thursday 10 May 2007 13:48, Ray Lee wrote:
> On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.
>
> Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.
>
> Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago.
>
> So enough with the stop energy, okay? You're better than that.
>
> Con? He is right that the last feature to go in needs to work gracefully with what's there now. However, it's not unheard of for authors of other sections of code to help out with incompatibilities by answering politely phrased questions for guidance. Though the intersection of users between cpusets and desktop systems seems small indeed.

Let's just set the record straight. I actually discussed cpusets over a year ago in this nonsense and was told by sgi folk there was no need to get my head around cpusets and that honouring node placement should be enough, which, by the way, swap prefetch does. So I by no means ignored this; we just hit an impasse on just how much more featured it should be for the sake of a goddamn home desktop pc feature.

Anyway, why the hell am I resurrecting this thread? The code is declared dead already. Leave it be.

--
-ck
Re: swap-prefetch: 2.6.22 -mm merge plans
Ray Lee wrote:
> On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.
>
> Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.

I'd like to try helping. Tell me your problem.

> Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago.

What's wrong with the use-once we have? What improvements are you talking about?

> So enough with the stop energy, okay? You're better than that.

I don't think it is about energy or being mean, I'm just stating the issues I have with it.

--
SUSE Labs, Novell Inc.
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/9/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.

Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur.

Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago.

So enough with the stop energy, okay? You're better than that.

Con? He is right that the last feature to go in needs to work gracefully with what's there now. However, it's not unheard of for authors of other sections of code to help out with incompatibilities by answering politely phrased questions for guidance. Though the intersection of users between cpusets and desktop systems seems small indeed.
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote:
> On Thursday 10 May 2007 10:05, Nick Piggin wrote:
> > I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet.
>
> No matter how you spin it, you're the gatekeeper.

If raising unaddressed issues means closing a gate, then OK. You can equally open it by answering them.

> > You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases.
>
> Not this again? Proof was there ages ago that it helped and no proof that it harmed could be found, yet you cunningly pretend it never existed. It's been done to death and I'm sick of this.

I said I know it can help. Do you know how many patches I have that help some workloads but are not merged? That's just the way it works.

What I have seen is that it helps the case where you force out a huge amount of swap. OK, that's nice -- the base case obviously works. You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine.

> > Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged.
> >
> > As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution.
>
> That's as close to a 3 as I'm likely to get out of you.

If you're not willing to try making it work with existing code, among other things, then yes it will be difficult to get it merged. That's not going to change.

--
SUSE Labs, Novell Inc.
Re: swap-prefetch: 2.6.22 -mm merge plans
On Thursday 10 May 2007 10:05, Nick Piggin wrote:
> Con Kolivas wrote:
> > Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandary now.
> >
> > Do I:
> >
> > 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change
> >
> > 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code].
> >
> > or
> >
> > 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on.
> >
> > Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline, I think the choice is clear.
> >
> > Nick, since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please.
>
> I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet.

No matter how you spin it, you're the gatekeeper.

> You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases.

Not this again? Proof was there ages ago that it helped and no proof that it harmed could be found, yet you cunningly pretend it never existed. It's been done to death and I'm sick of this.

> Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged.
>
> As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution.

That's as close to a 3 as I'm likely to get out of you.

Andrew, you'll be relieved to know I would like you to throw swap prefetch and related patches into the bin. Thanks.

--
-ck
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote:
> Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandary now.
>
> Do I:
>
> 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change
>
> 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code].
>
> or
>
> 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on.
>
> Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline, I think the choice is clear.
>
> Nick, since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please.

I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet.

You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases.

Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged.

As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution.

--
SUSE Labs, Novell Inc.
Re: swap-prefetch: 2.6.22 -mm merge plans
On Saturday 05 May 2007 18:42, Con Kolivas wrote:
> On Friday 04 May 2007 22:10, Con Kolivas wrote:
> > On Friday 04 May 2007 18:52, Ingo Molnar wrote:
> > > agreed. Con, IIRC you wrote a testcase for this, right? Could you please send us the results of that testing?
> >
> > Yes, sorry, it's a crappy test app but works on 32bit. Timed with prefetch disabled and then enabled, swap prefetch saves ~5 seconds on average hardware on this one test case. I had many users try this and the results were between 2 and 10 seconds, but always showed a saving on this testcase. This effect easily occurs on printing a big picture, editing a large file, compressing an iso image or whatever in real world workloads. Smaller, but much more frequent effects of this over the course of a day obviously also occur and do add up.
>
> Here's a better swap prefetch tester. Instructions in file.
>
> Machine with 2GB ram and 2GB swapfile
>
> Prefetch disabled:
> ./sp_tester
> Timed portion 53397 milliseconds
>
> Enabled:
> ./sp_tester
> Timed portion 26351 milliseconds
>
> Note huge time difference.

Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandary now.

Do I:

1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change

2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code].

or

3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on.

Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline, I think the choice is clear.

Nick, since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please.

--
-ck

P.S. Ingo, thanks (and sorry) for your involvement here.
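[For readers without sp_tester: its idea can be approximated in userspace roughly as below. This is a guessed reconstruction, not Con's actual tester; the buffer sizes and the sleep interval are arbitrary assumptions and would need tuning to the machine's RAM.]

/*
 * Rough approximation of an sp_tester-style benchmark: establish a working
 * set, force it out to swap with a memory hog, free the hog, idle for a
 * while (a prefetching kernel can pull the working set back in during this
 * window), then time how long it takes to touch the working set again.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define MB          (1024UL * 1024UL)
#define WORKING_SET (512 * MB)   /* memory we care about ("the desktop apps") */
#define HOG_SIZE    (1536 * MB)  /* enough to push the working set into swap */

static void touch(char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 4096)  /* one write per page */
        buf[i] = 1;
}

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void)
{
    char *work = malloc(WORKING_SET);
    char *hog;
    double t0, t1;

    if (!work)
        return 1;
    touch(work, WORKING_SET);    /* establish the working set */

    hog = malloc(HOG_SIZE);      /* hog forces the working set out to swap */
    if (!hog)
        return 1;
    touch(hog, HOG_SIZE);
    free(hog);                   /* hog exits; RAM is now free again */

    sleep(600);                  /* idle window in which prefetch can run */

    t0 = now_ms();
    touch(work, WORKING_SET);    /* how long until the apps are usable again? */
    t1 = now_ms();
    printf("Timed portion %.0f milliseconds\n", t1 - t0);
    free(work);
    return 0;
}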
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Thu, 10 May 2007, Nick Piggin wrote:
> > > The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is?
> >
> > The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't.
>
> But... I thought the page lock was now taking care of that in your scheme? truncate_inode_pages has to wait for the page lock, then it finds the page is mapped and... ahh, it finds the copiee page is not mapped, so doesn't do its own little unmap_mapping_range, and the copied page squeaks through. Drat.
>
> I really think the truncate_count solution worked better, for truncation anyway. There may be persuasive reasons you need the page lock for invalidation: I gave up on trying to understand the required behaviour(s) for invalidation.
>
> So, bring back (the original use of, not my tree marker use of) truncate_count?

Hmm, you probably don't want to do that, because there was some pleasure in removing the strange barriers associated with it.

> A second unmap_mapping_range is just one line of code - but it sure feels like a defeat to me, calling the whole exercise into question. (But then, you'd be right to say my perfectionism made it impossible for me to come up with any solution to the invalidation issues.)

Well we could bring back the truncate_count, but I think that sucks because that's moving work into the page fault handler in order to avoid a bit of work when truncating mapped files.

> > > But that's a change we could have made at any time if we'd bothered, it's not really the issue here.
> >
> > I don't see how you could, because you need to increment truncate_count.
>
> Though indeed we did so, I don't see that we needed to increment truncate_count in that case (nobody could be coming through do_no_page on that file, when there are no mappings of it).

Of course :P

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 10 May 2007, Nick Piggin wrote:
> > The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is?
>
> The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't.

But... I thought the page lock was now taking care of that in your scheme? truncate_inode_pages has to wait for the page lock, then it finds the page is mapped and... ahh, it finds the copiee page is not mapped, so doesn't do its own little unmap_mapping_range, and the copied page squeaks through. Drat.

I really think the truncate_count solution worked better, for truncation anyway. There may be persuasive reasons you need the page lock for invalidation: I gave up on trying to understand the required behaviour(s) for invalidation.

So, bring back (the original use of, not my tree marker use of) truncate_count? A second unmap_mapping_range is just one line of code - but it sure feels like a defeat to me, calling the whole exercise into question. (But then, you'd be right to say my perfectionism made it impossible for me to come up with any solution to the invalidation issues.)

> > Suspect you'd need a barrier of some kind between the i_size_write and the mapping_mapped test?
>
> The unmap_mapping_range that runs after the truncate_inode_pages should run in the correct order, I believe.

Yes, if there's going to be that backup call, the first won't really need a barrier.

> > But that's a change we could have made at any time if we'd bothered, it's not really the issue here.
>
> I don't see how you could, because you need to increment truncate_count.

Though indeed we did so, I don't see that we needed to increment truncate_count in that case (nobody could be coming through do_no_page on that file, when there are no mappings of it).

> But I believe this is fixing the issue, even if it does so in a peripheral manner, because it avoids the added cost for unmapped files.

It's a small improvement to your common case, I agree.

Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Wed, 9 May 2007, Nick Piggin wrote:
> > Hugh Dickins wrote:
> > > On Wed, 2 May 2007, Nick Piggin wrote:
> > > > > But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.
> > > >
> > > > OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from.
> > >
> > > I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into pagetable: that looks to be so in your __do_fault.
> >
> > Hmm, on second thoughts, I think I was right the first time, and do need the unmap after the pages are truncated. With the lock_page code, after the first unmap, we can get new ptes mapping pages, and subsequently they can be COWed and then the original pte zapped before the truncate loop checks it.
>
> The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is?

The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't.

> > However, I wonder if we can't test mapping_mapped before the spinlock, which would make most truncates cheaper?
>
> Slightly cheaper, yes, though I doubt it'd be much in comparison with actually doing any work in unmap_mapping_range or truncate_inode_pages.

But if we're supposing the common case for truncate is unmapped mappings, then the main cost there will be the locking, which I'm trying to avoid. Hopefully with this patch, most truncate workloads would get faster, even though truncating mapped files is going to be unavoidably slower.

> Suspect you'd need a barrier of some kind between the i_size_write and the mapping_mapped test?

The unmap_mapping_range that runs after the truncate_inode_pages should run in the correct order, I believe.

> But that's a change we could have made at any time if we'd bothered, it's not really the issue here.

I don't see how you could, because you need to increment truncate_count. But I believe this is fixing the issue, even if it does so in a peripheral manner, because it avoids the added cost for unmapped files.

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Wed, 9 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Wed, 2 May 2007, Nick Piggin wrote:
> > > > But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.
> > >
> > > OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from.
> >
> > I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into pagetable: that looks to be so in your __do_fault.
>
> Hmm, on second thoughts, I think I was right the first time, and do need the unmap after the pages are truncated. With the lock_page code, after the first unmap, we can get new ptes mapping pages, and subsequently they can be COWed and then the original pte zapped before the truncate loop checks it.

The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is?

> However, I wonder if we can't test mapping_mapped before the spinlock, which would make most truncates cheaper?

Slightly cheaper, yes, though I doubt it'd be much in comparison with actually doing any work in unmap_mapping_range or truncate_inode_pages.

Suspect you'd need a barrier of some kind between the i_size_write and the mapping_mapped test? But that's a change we could have made at any time if we'd bothered, it's not really the issue here.

Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:
> > > But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.
> >
> > OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from.
>
> I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into pagetable: that looks to be so in your __do_fault.

Hmm, on second thoughts, I think I was right the first time, and do need the unmap after the pages are truncated. With the lock_page code, after the first unmap, we can get new ptes mapping pages, and subsequently they can be COWed and then the original pte zapped before the truncate loop checks it.

However, I wonder if we can't test mapping_mapped before the spinlock, which would make most truncates cheaper?

--
SUSE Labs, Novell Inc.

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c	2007-04-24 15:02:51.0 +1000
+++ linux-2.6/mm/filemap.c	2007-05-09 17:30:47.0 +1000
@@ -2579,8 +2579,7 @@
 	if (rw == WRITE) {
 		write_len = iov_length(iov, nr_segs);
 		end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
-		if (mapping_mapped(mapping))
-			unmap_mapping_range(mapping, offset, write_len, 0);
+		unmap_mapping_range(mapping, offset, write_len, 0);
 	}
 
 	retval = filemap_write_and_wait(mapping);

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c	2007-05-09 17:25:28.0 +1000
+++ linux-2.6/mm/memory.c	2007-05-09 17:30:22.0 +1000
@@ -1956,6 +1956,9 @@
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
+	if (!mapping_mapped(mapping))
+		return;
+
 	/* Check for overflow. */
 	if (sizeof(holelen) > sizeof(hlen)) {
 		long long holeend =
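[A toy illustration of what the patch above changes: the cheap mapping_mapped() test moves inside unmap_mapping_range() itself, so callers such as the direct-IO write path no longer need their own check. The types and counter below are invented stand-ins for the kernel's i_mmap trees; this is not kernel code.]

/*
 * Userspace mock: unmap_mapping_range() bails out early when nothing maps
 * the file, so the common unmapped-file case skips the expensive walk.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_mapping {
    int mapped_count;   /* stand-in for the i_mmap / i_mmap_nonlinear trees */
};

static bool mapping_mapped(struct toy_mapping *m)
{
    return m->mapped_count != 0;
}

static void unmap_mapping_range(struct toy_mapping *m, long start, long len)
{
    if (!mapping_mapped(m))   /* early return: unmapped files skip the */
        return;               /* tree walk and the locking entirely    */
    printf("walking mappings to zap ptes for [%ld, %ld)\n", start, start + len);
}

/* Caller, e.g. the direct-IO path: no mapping_mapped() test needed here any
 * more; the callee short-circuits for the common unmapped case. */
int main(void)
{
    struct toy_mapping unmapped = { .mapped_count = 0 };
    struct toy_mapping mapped   = { .mapped_count = 3 };

    unmap_mapping_range(&unmapped, 0, 4096);   /* cheap no-op */
    unmap_mapping_range(&mapped, 0, 4096);     /* does the real work */
    return 0;
}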
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 10 May 2007, Nick Piggin wrote: The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is? The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't. But... I thought the page lock was now taking care of that in your scheme? truncate_inode_pages has to wait for the page lock, then it finds the page is mapped and... ahh, it finds the copiee page is not mapped, so doesn't do its own little unmap_mapping_range, and the copied page squeaks through. Drat. I really think the truncate_count solution worked better, for truncation anyway. There may be persuasive reasons you need the page lock for invalidation: I gave up on trying to understand the required behaviour(s) for invalidation. So, bring back (the original use of, not my tree marker use of) truncate_count? Hmm, you probably don't want to do that, because there was some pleasure in removing the strange barriers associated with it. A second unmap_mapping_range is just one line of code - but it sure feels like a defeat to me, calling the whole exercise into question. (But then, you'd be right to say my perfectionism made it impossible for me to come up with any solution to the invalidation issues.) Suspect you'd need a barrier of some kind between the i_size_write and the mapping_mapped test? The unmap_mapping_range that runs after the truncate_inode_pages should run in the correct order, I believe. Yes, if there's going to be that backup call, the first won't really need a barrier. But that's a change we could have made at any time if we'd bothered, it's not really the issue here. I don't see how you could, because you need to increment truncate_count. Though indeed we did so, I don't see that we needed to increment truncate_count in that case (nobody could be coming through do_no_page on that file, when there are no mappings of it). But I believe this is fixing the issue, even if it does so in a peripheral manner, because it avoids the added cost for unmapped files. It's a small improvement to your common case, I agree. Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
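To spell out the interleaving being described here, the sketch below pieces the scenario together from the discussion above; it is an illustration only, not code or comments quoted from mm/:

/*
 * truncate side                          concurrent fault side
 * -------------                          ---------------------
 * i_size_write(inode, new_size);
 * unmap_mapping_range(mapping, ...);
 *                                        ->fault had already looked up the
 *                                        old page (before i_size moved) and
 *                                        now installs a pte for it
 *                                        write fault: the page is COWed, the
 *                                        anon copy is mapped and the file
 *                                        pte goes away
 * truncate_inode_pages(mapping, ...);
 *   lock_page(page);
 *   the page is no longer mapped, so
 *   no per-page unmap is issued for it
 *
 * Result: the COWed copy of data beyond the new i_size survives in the
 * address space (the "squeaks through" case above), which is why either a
 * second unmap_mapping_range() or the old truncate_count retry is still
 * needed.
 */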
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Thu, 10 May 2007, Nick Piggin wrote: The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is? The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't. But... I thought the page lock was now taking care of that in your scheme? truncate_inode_pages has to wait for the page lock, then it finds the page is mapped and... ahh, it finds the copiee page is not mapped, so doesn't do its own little unmap_mapping_range, and the copied page squeaks through. Drat. I really think the truncate_count solution worked better, for truncation anyway. There may be persuasive reasons you need the page lock for invalidation: I gave up on trying to understand the required behaviour(s) for invalidation. So, bring back (the original use of, not my tree marker use of) truncate_count? Hmm, you probably don't want to do that, because there was some pleasure in removing the strange barriers associated with it. A second unmap_mapping_range is just one line of code - but it sure feels like a defeat to me, calling the whole exercise into question. (But then, you'd be right to say my perfectionism made it impossible for me to come up with any solution to the invalidation issues.) Well we could bring back the truncate_count, but I think that sucks because that's moving work into the page fault handler in order to avoid a bit of work when truncating mapped files. But that's a change we could have made at any time if we'd bothered, it's not really the issue here. I don't see how you could, because you need to increment truncate_count. Though indeed we did so, I don't see that we needed to increment truncate_count in that case (nobody could be coming through do_no_page on that file, when there are no mappings of it). Of course :P -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
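For readers who do not remember the scheme being weighed here, the old truncate_count approach looked roughly like the sketch below. It is a simplified paraphrase of the 2.6-era do_no_page()/vmtruncate() logic rather than an exact quote, and it shows the retry work in the fault path that is being objected to:

/*
 * fault side (do_no_page, simplified):
 *
 *   retry:
 *     sequence = mapping->truncate_count;
 *     smp_rmb();
 *     new_page = vma->vm_ops->nopage(...);      // checks i_size
 *     ...
 *     spin_lock(&mm->page_table_lock);
 *     if (sequence != mapping->truncate_count) {
 *             spin_unlock(&mm->page_table_lock);
 *             page_cache_release(new_page);
 *             goto retry;                       // truncate ran under us
 *     }
 *     // safe to install the pte
 *
 * truncate side (vmtruncate, simplified):
 *
 *     i_size_write(inode, new_size);
 *     mapping->truncate_count++;
 *     smp_wmb();
 *     unmap_mapping_range(mapping, ...);
 *     truncate_inode_pages(mapping, new_size);
 */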
Re: swap-prefetch: 2.6.22 -mm merge plans
On Saturday 05 May 2007 18:42, Con Kolivas wrote: On Friday 04 May 2007 22:10, Con Kolivas wrote: On Friday 04 May 2007 18:52, Ingo Molnar wrote: agreed. Con, IIRC you wrote a testcase for this, right? Could you please send us the results of that testing? Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch disabled and then enabled swap prefetch saves ~5 seconds on average hardware on this one test case. I had many users try this and the results were between 2 and 10 seconds, but always showed a saving on this testcase. This effect easily occurs on printing a big picture, editing a large file, compressing an iso image or whatever in real world workloads. Smaller, but much more frequent effects of this over the course of a day obviously also occur and do add up. Here's a better swap prefetch tester. Instructions in file. Machine with 2GB ram and 2GB swapfile Prefetch disabled: ./sp_tester Timed portion 53397 milliseconds Enabled: ./sp_tester Timed portion 26351 milliseconds Note huge time difference. Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandry now. Do I: 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code]. or 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on. Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline I think the choice is clear. Nick since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please. -- -ck P.S. Ingo, thanks (and sorry) for your involvement here. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote: Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandry now. Do I: 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code]. or 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on. Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline I think the choice is clear. Nick since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please. I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet. You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases. Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged. As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On Thursday 10 May 2007 10:05, Nick Piggin wrote: Con Kolivas wrote: Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. So that leaves me with a quandry now. Do I: 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code]. or 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on. Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline I think the choice is clear. Nick since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please. I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet. No matter how you spin it, you're the gatekeeper. You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases. Not this again? Proof was there ages ago that it helped and no proof that it harmed could be found yet you cunningly pretend it never existed. It's been done to death and I'm sick of this. Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged. As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution. That's as close to a 3 as I'm likely to get out of you. Andrew you'll be relieved to know I would like you to throw swap prefetch and related patches into the bin. Thanks. -- -ck - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote: On Thursday 10 May 2007 10:05, Nick Piggin wrote: I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet. No matter how you spin it, you're the gatekeeper. If raising unaddressed issues means closing a gate, then OK. You can equally open it by answering them. You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases. Not this again? Proof was there ages ago that it helped and no proof that it harmed could be found yet you cunningly pretend it never existed. It's been done to death and I'm sick of this. I said I know it can help. Do you know how many patches I have that help some workloads but are not merged? That's just the way it works. What I have seen is it helps the case where you force out a huge amount of swap. OK, that's nice -- the base case obviously works. You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged. As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution. That's as close to a 3 as I'm likely to get out of you. If you're not willing to try making it work with existing code, among other things, then yes it will be difficult to get it merged. That's not going to change. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing a airy hope with nothing to back up when or if that work would ever occur. Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago. So enough with the stop energy, okay? You're better than that. Con? He is right about the last feature to go in needs to work gracefully with what's there now. However, it's not unheard of for authors of other sections of code to help out with incompatibilities by answering politely phrased questions for guidance. Though the intersection of users between cpusets and desktop systems seems small indeed. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Ray Lee wrote: On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing a airy hope with nothing to back up when or if that work would ever occur. I'd like to try helping. Tell me your problem. Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago. What's wrong with the use-once we have? What improvements are you talking about? So enough with the stop energy, okay? You're better than that. I don't think it is about energy or being mean, I'm just stating the issues I have with it. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On Thursday 10 May 2007 13:48, Ray Lee wrote: On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing a airy hope with nothing to back up when or if that work would ever occur. Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago. So enough with the stop energy, okay? You're better than that. Con? He is right about the last feature to go in needs to work gracefully with what's there now. However, it's not unheard of for authors of other sections of code to help out with incompatibilities by answering politely phrased questions for guidance. Though the intersection of users between cpusets and desktop systems seems small indeed. Let's just set the record straight. I actually discussed cpusets over a year ago in this nonsense and was told by sgi folk there was no need to get my head around cpusets and honouring node placement should be enough which, by the way, swap prefetch does. So I by no means ignored this; we just hit an impasse on just how much more featured it should be for the sake of a goddamn home desktop pc feature. Anyway why the hell am I resurrecting this thread? The code is declared dead already. Leave it be. -- -ck - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: Ray Lee wrote: On 5/9/07, Nick Piggin [EMAIL PROTECTED] wrote: You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing a airy hope with nothing to back up when or if that work would ever occur. I'd like to try helping. Tell me your problem. Huh? You already stated one version of it above, namely updatedb. But let's put this another way, shall we? A gedankenexperiment, if you will. Say we have a perfect swap-out algorithm that can choose exactly what needs to be evicted to disk. ('Perfect', of course, is dependent upon one's metric, but let's go with maximizes overall system utilization and minimizes IO wait time. Arbitrary, but hey.) So, great, the right things got swapped out. Anything else that could have been chosen would have caused more overall IO Wait. Yay us. So what happens when those processes that triggered the swap-outs go away? (Firefox is closed, I stop hitting my local copy of a database, whatever.) Well, currently, nothing. What happens when I switch workspaces and try to use my email program? Swap-ins. Okay, so why didn't the system swap that stuff in preemptively? Why am I sitting there waiting for something that it could have already done in the background? A new swap-out algorithm, be it use-once, Clock-Pro, or perfect foreknowledge isn't going to change that issue. Swap prefetch does. Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago. What's wrong with the use-once we have? What improvements are you talking about? You said, effectively: Use-once could be improved to deal with updatedb. I said I've been reading emails from Rik and others talking about that for four years now, and we're still talking about it. Were it merely updatedb, I'd say us userspace folk should step up and rewrite the damn thing to amortize its work. However, I and others feel it's only an example -- glaring, obviously -- of a more pervasive issue. A small issue, to be sure!, but an issue nevertheless. In general, I/others are talking about improving the desktop experience of running too much on a RAM limited machine. (Which, in my case, is with a gig and a 2.2GHz processor.) Or restated: the desktop experience occasionally sucks for me, and I don't think I'm alone. There may be a heuristic, completely isolated from userspace (and so isn't an API the kernel has to support! -- if it doesn't work, we can rip it out again), that may mitigate the suckiness. Let's try it. So enough with the stop energy, okay? You're better than that. I don't think it is about energy or being mean, I'm just stating the issues I have with it. Nick, I in no way think you're being mean, and I'm sorry if I've given you that impression. However, if you're just stating the issues you have with it, then can I assume that you won't lobby against having this experiment merged? 
- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
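For concreteness, the idle-time heuristic Ray describes can be sketched as below. Every identifier is invented for the illustration; this is not Con's kprefetchd code, just the shape of the idea:

#include <stdio.h>

/* All of these helpers are stand-ins invented for the sketch. */
static int system_is_idle(void)               { return 1; }
static int free_ram_above_watermark(void)     { return 0; }
static int have_swapped_entry(void)           { return 1; }
static unsigned long next_swapped_entry(void) { return 0; }
static void read_swap_page_into_cache(unsigned long entry) { (void)entry; }

/* One pass of the idle-time prefetch idea described above. */
static void prefetch_tick(void)
{
	if (!system_is_idle())          /* never compete with real work */
		return;

	/* read recently swapped-out pages back into the page cache, newest
	 * first, stopping as soon as free memory gets tight again */
	while (free_ram_above_watermark() && have_swapped_entry())
		read_swap_page_into_cache(next_swapped_entry());
}

int main(void)
{
	prefetch_tick();
	printf("prefetch pass done\n");
	return 0;
}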
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Fri, 2007-05-04 at 19:23 +1000, Nick Piggin wrote: > These ops could also be put to use in bit spinlocks, buffer lock, and > probably a few other places too. Ok, the performance hit seems to be under control (especially with the bigger benchmark showing actual improvements). There's a little bogon with the PG_waiters bit that you already know about but appart from that it should be ok. I must say I absolutely _LOVE_ the bitops with explicit _lock/_unlock semantics. That should allow us to remove a whole bunch of dodgy barriers and smp_mb__before_whatever_magic_crap() things we have all over the place by providing precisely the expected semantics for bit locks. There are quite a few people who've been trying to do bit locks and I've always been very worried by how easy it is to get the barriers wrong (or too much barriers in the fast path) with these. There are a couple of things we might want to think about regarding the actual API to bit locks... the API you propose is simple, but it might not fit some of the most exotic usage requirements, which typically are related to manipulating other bits along with the lock bit. We might just ignore them though. In the case of the page lock, it's only hitting the slow path, and I would expect other usage scenarii to be similar. Cheers, Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
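As an illustration of what the explicit lock/unlock bitops buy, a bit spinlock built on them needs no hand-rolled barriers in either path. The sketch assumes primitives along the lines proposed (a test_and_set_bit_lock() with acquire ordering and a clear_bit_unlock() with release ordering); it is an illustration, not a quote of the actual patch:

/* Assumed primitives: test_and_set_bit_lock() returns the old bit value and
 * orders like a lock acquire; clear_bit_unlock() orders like a lock release. */
static inline void example_bit_lock(int bitnum, unsigned long *addr)
{
	while (test_and_set_bit_lock(bitnum, addr))
		cpu_relax();	/* spin until the old bit value was clear */
}

static inline void example_bit_unlock(int bitnum, unsigned long *addr)
{
	/* no smp_mb__before_clear_bit() needed: release is part of the op */
	clear_bit_unlock(bitnum, addr);
}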
Re: 2.6.22 -mm merge plans
On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote: ... > git-unionfs.patch > > Does this have a future? Yes! There are many active users who use our unioning functionality. Namespace unification consists of several major parts: 1) Duplicate elimination: This can be handled in the VFS. However, it would clutter up the VFS code with a lot of wrappers around key VFS functions to select the appropriate dentry/inode/etc. object from the underlying branch. (You also need to provide efficient and sane readdir/seekdir semantics which we do with our "On Disk Format" support.) 2) Copyup: Having a unified namespace by itself isn't enough. You also need copy-on-write functionality when the source file is on a read-only branch. This makes unioning much more useful and is one of the main attractions to unionfs users. 3) Whiteouts: Whiteouts are a key unioning construct. As it was pointed out at OLS 2006, they are a properly of the union and _NOT_ a branch. Therefore, they should not be stored persistently on a branch but rather in some "external" storage. 4) You also need unique and *persistent* inode numbers for network f/s exports and other unix tools. 5) You need to provide dynamic branch management functionality: adding, removing, and changing the mode of branches in an existing union. We have considerable experience in unioning file systems for years now; we are currently working on the third generation of the code. All of the above features, and more, are USED by users, and are NEEDED by users. We believe the right approach is the one we've taken, and is the least intrusive: a standalone (stackable) file system that doesn't clutter the VFS, with some small and gradual changes to the VFS to support stacking. As you may have noticed, we have been successfully submitting VFS patches to make the VFS more stacking friendly (not just to Unionfs, but also to eCryptfs which has been in since 2.6.19). The older Union mounts, alas, try to put all that functionality into the VFS. We recognize that some people think that union mounts at the VFS level is the "elegant" approach, but we hope people will listen to us and learn from our experience: unioning may seem simple in principle, but it is difficult in practice. (See http://unionfs.fileystems.org/ for a lot more info.) So we don't think that is a viable long term approach to have all of the unioning functionality in the VFS for two main reasons: (1) If you want users to use a VFS-level unioning functionality ala union-mounts, then you're going to have to implement *all* of the features we have implemented; the VFS clutter and complexity that will result will be very considerable, and we just don't think that it'd happen. (2) Some may suggest to have a lightweight union mounts that only offers a subset of the functionality that's suitable for placing in the VFS. In that case, most unionfs users simply won't use it. You'd need union mounts to provide ALL of the functionality that we have TODAY, if you want users to it. As far as we can see the remaining stumbling block right now is cache coherency between the layers. Whether you provide unioning as a stackable f/s or shoved into the VFS, coherency will have to be addressed. In our upcoming paper and talk at OLS'07, we plan to bring up and discuss several ideas we've explored already on how to resolve this incoherency. Our ideas range from complex graph-based pointer management between objects of all sorts, to simple timestamp-based VFS hooks. 
(We've been experimenting with several approaches and so far we're leaning toward the simple timestamp-based one, again in the interest of keeping the VFS changes simple. We hope to have more results to report by OLS time.) Josef "Jeff" Sipek, on behalf of the Unionfs team. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
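To make the whiteout rule concrete: a union lookup scans branches top-down, and a whiteout entry in a higher branch hides the same name in every branch below it. The toy program below models that rule over string arrays; the ".wh." prefix follows unionfs's convention, but the branch contents, function names, and two-branch setup are invented for the example, and this is not the filesystem's actual lookup path:

#include <stdio.h>
#include <string.h>

/* Each branch is modelled as a NULL-terminated list of the names it holds. */
static const char *branch0[] = { ".wh.passwd", "motd", NULL };	/* top, rw */
static const char *branch1[] = { "passwd", "hosts", NULL };	/* bottom, ro */
static const char **branches[] = { branch0, branch1 };

static int branch_has(const char **branch, const char *name)
{
	for (int i = 0; branch[i]; i++)
		if (!strcmp(branch[i], name))
			return 1;
	return 0;
}

/* Return the branch index providing 'name', or -1 if absent or whited out. */
static int union_lookup(const char *name)
{
	char wh[256];

	snprintf(wh, sizeof(wh), ".wh.%s", name);
	for (int b = 0; b < 2; b++) {
		if (branch_has(branches[b], wh))
			return -1;	/* whiteout hides all lower branches */
		if (branch_has(branches[b], name))
			return b;
	}
	return -1;
}

int main(void)
{
	printf("motd   -> branch %d\n", union_lookup("motd"));	/* 0 */
	printf("hosts  -> branch %d\n", union_lookup("hosts"));	/* 1 */
	printf("passwd -> branch %d\n", union_lookup("passwd"));	/* -1: whited out */
	return 0;
}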
Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote: On Friday 04 May 2007 18:52, Ingo Molnar wrote: agreed. Con, IIRC you wrote a testcase for this, right? Could you please send us the results of that testing? Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch disabled and then enabled swap prefetch saves ~5 seconds on average hardware on this one test case. I had many users try this and the results were between 2 and 10 seconds, but always showed a saving on this testcase. This effect easily occurs on printing a big picture, editing a large file, compressing an iso image or whatever in real world workloads. Smaller, but much more frequent effects of this over the course of a day obviously also occur and do add up. I'll try this when I get the scheduler stuff done, and also dig out the "resp1" stuff for "back when." I see the most recent datasets were comparing 2.5.43-mm2 responsiveness with 2.4.19-ck7, you know I always test your stuff ;-) Guess it might need a bit of polish for current hardware, I was testing on *small* machines, deliberately. -- Bill Davidsen <[EMAIL PROTECTED]> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Ingo Molnar wrote: * Nick Piggin <[EMAIL PROTECTED]> wrote: i'm wondering about swap-prefetch: Being able to config all these core heuristics changes is really not that much of a positive. The fact that we might _need_ to config something out, and double the configuration range isn't too pleasing. Well, to the desktop user this is a speculative performance feature that he is willing to potentially waste CPU and IO capacity, in expectation of better performance. On the conceptual level it is _precisely the same thing as regular file readahead_. (with the difference that to me swapahead seems to be quite a bit more intelligent than our current file readahead logic.) This feature has no API or ABI impact at all, it's a pure performance feature. (besides the trivial sysctl to turn it runtime on/off). Here were some of my concerns, and where our discussion got up to. [...snip...] i see no real problem here. We've had heuristics for a _long_ time in various areas of the code. Sometimes they work, sometimes they suck. the flow of this is really easy: distro looking for a feature edge turns it on and announces it, if the feature does not work out for users then user turns it off and complains to distro, if enough users complain then distro turns it off for next release, upstream forgets about this performance feature and eventually removes it once someone notices that it wouldnt even compile in the past 2 main releases. I see no problem here, we did that in the past too with performance features. The networking stack has literally dozens of such small tunable things which get experimented with, and whose defaults do get tuned carefully. Some of the knobs help bandwidth, some help latency. I haven't looked at this code since it first came out and didn't impress me, but I think it would be good to get the current version in. However, when you say "user turns it off" I hope you mean "in /proc/sys with a switch or knob" and not by expecting people to recompile and install a kernel. Then it might take a little memory but wouldn't do something undesirable. Note: I had no bad effect from the code, it just didn't feel faster. On a low memory machine it might help. Of course I have wanted to have a hard limit on memory used for i/o buffers, just to avoid swapping programs to make room for i/o, so to some extent I feel as if this is a fix for a problem we shouldn't have. -- Bill Davidsen <[EMAIL PROTECTED]> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
Sorry for the late response. I went on vacation last week and I'm buried under a mountain of unread mail now. > > Mel's moveable-zone work. > > These patches are what creates ZONE_MOVABLE. The last 6 patches should be > collapsed into a single patch: > > handle-kernelcore=-generic > > I believe Yasunori Goto is looking at these from the perspective of memory > hot-remove and has caught a few bugs in the past. Goto-san may be able to > comment on whether they have been reviewed recently. Hmm, I don't think my review is enough. To be precise, I'm just one user/tester of ZONE_MOVABLE. I have tried to make memory remove patches with Mel-san's ZONE_MOVABLE patch, and the bugs are things that I found in that work. (I'll post these patches in a few days.) > The main complexity is in one function in patch one which determines where > the PFN is in each node for ZONE_MOVABLE. Getting that right so that the > requested amount of kernel memory spread as evenly as possible is just > not straight-forward. From the memory-hotplug view, ZONE_MOVABLE should be aligned to the section size, but MAX_ORDER alignment is enough for other users... Bye. -- Yasunori Goto - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
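The alignment point can be illustrated with a toy calculation. Memory hot-remove operates on whole sparsemem sections, so a ZONE_MOVABLE boundary meant to be removable wants rounding to a section, while the page allocator alone only needs MAX_ORDER alignment. The granularities below are invented example values, not the kernel's per-arch constants:

#include <stdio.h>

/* example granularities, in pages -- not the kernel's actual values */
#define MAX_ORDER_NR_PAGES	(1UL << 10)	/* MAX_ORDER block: 4MB with 4k pages */
#define PAGES_PER_SECTION	(1UL << 15)	/* sparsemem section: 128MB with 4k pages */

static unsigned long align_up(unsigned long pfn, unsigned long align)
{
	return (pfn + align - 1) & ~(align - 1);
}

int main(void)
{
	unsigned long pfn = 123456;	/* some proposed ZONE_MOVABLE start */

	printf("MAX_ORDER-aligned start: %lu\n", align_up(pfn, MAX_ORDER_NR_PAGES));
	printf("section-aligned start:   %lu\n", align_up(pfn, PAGES_PER_SECTION));
	return 0;
}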
Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans
Con Kolivas wrote: > Here's a better swap prefetch tester. Instructions in file. > > Machine with 2GB ram and 2GB swapfile > > Prefetch disabled: > ./sp_tester > Ram 2060352000 Swap 197342 > Total ram to be malloced: 3047062000 bytes > Starting first malloc of 1523531000 bytes > Starting 1st read of first malloc > Touching this much ram takes 809 milliseconds > Starting second malloc of 1523531000 bytes > Completed second malloc and free > Sleeping for 600 seconds > Important part - starting reread of first malloc > Completed read of first malloc > Timed portion 53397 milliseconds > > Enabled: > ./sp_tester > Ram 2060352000 Swap 197342 > Total ram to be malloced: 3047062000 bytes > Starting first malloc of 1523531000 bytes > Starting 1st read of first malloc > Touching this much ram takes 676 milliseconds > Starting second malloc of 1523531000 bytes > Completed second malloc and free > Sleeping for 600 seconds > Important part - starting reread of first malloc > Completed read of first malloc > Timed portion 26351 milliseconds > echo 1 > /proc/sys/vm/overcommit_memory swapoff -a swapon -a ./sp_tester Ram 1153644000 Swap 1004052000 Total ram to be malloced: 165567 bytes Starting first malloc of 827835000 bytes Starting 1st read of first malloc Touching this much ram takes 937 milliseconds Starting second malloc of 827835000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 15011 milliseconds echo 0 > /proc/sys/vm/overcommit_memory swapoff -a swapon -a ./sp_tester Ram 1153644000 Swap 1004052000 Total ram to be malloced: 165567 bytes Starting first malloc of 827835000 bytes Starting 1st read of first malloc Touching this much ram takes 1125 milliseconds Starting second malloc of 827835000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 14611 milliseconds - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans
2007/5/5, Con Kolivas <[EMAIL PROTECTED]>: [cut] Here's a better swap prefetch tester. Instructions in file. The system should be left totally inactive for the test duration (10m), right? Regards, ~ Antonio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
On Friday 04 May 2007 22:10, Con Kolivas wrote: > On Friday 04 May 2007 18:52, Ingo Molnar wrote: > > agreed. Con, IIRC you wrote a testcase for this, right? Could you please > > send us the results of that testing? > > Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch > disabled and then enabled swap prefetch saves ~5 seconds on average > hardware on this one test case. I had many users try this and the results > were between 2 and 10 seconds, but always showed a saving on this testcase. > This effect easily occurs on printing a big picture, editing a large file, > compressing an iso image or whatever in real world workloads. Smaller, but > much more frequent effects of this over the course of a day obviously also > occur and do add up. Here's a better swap prefetch tester. Instructions in file. Machine with 2GB ram and 2GB swapfile Prefetch disabled: ./sp_tester Ram 2060352000 Swap 197342 Total ram to be malloced: 3047062000 bytes Starting first malloc of 1523531000 bytes Starting 1st read of first malloc Touching this much ram takes 809 milliseconds Starting second malloc of 1523531000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 53397 milliseconds Enabled: ./sp_tester Ram 2060352000 Swap 197342 Total ram to be malloced: 3047062000 bytes Starting first malloc of 1523531000 bytes Starting 1st read of first malloc Touching this much ram takes 676 milliseconds Starting second malloc of 1523531000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 26351 milliseconds Note huge time difference. -- -ck

/*
 * sp_tester.c
 * Build with: gcc -o sp_tester sp_tester.c -lrt -W -Wall -O2
 *
 * How to use:
 *	echo 1 > /proc/sys/vm/overcommit_memory
 *	swapoff -a
 *	swapon -a
 *	./sp_tester
 * then repeat with changed conditions eg
 *	echo 0 > /proc/sys/vm/swap_prefetch
 * Each test takes 10 minutes
 */

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

void fatal(const char *format, ...)
{
	va_list ap;

	if (format) {
		va_start(ap, format);
		vfprintf(stderr, format, ap);
		va_end(ap);
	}
	fprintf(stderr, "Fatal error - exiting\n");
	exit(1);
}

unsigned long ramsize, swapsize;

size_t get_ram(void)
{
	FILE *meminfo;
	char aux[256];

	if (!(meminfo = fopen("/proc/meminfo", "r")))
		fatal("fopen\n");
	while (!feof(meminfo) && !fscanf(meminfo, "MemTotal: %lu kB", &ramsize))
		fgets(aux, sizeof(aux), meminfo);
	while (!feof(meminfo) && !fscanf(meminfo, "SwapTotal: %lu kB", &swapsize))
		fgets(aux, sizeof(aux), meminfo);
	if (fclose(meminfo) == -1)
		fatal("fclose");
	ramsize *= 1000;
	swapsize *= 1000;
	printf("Ram %lu Swap %lu\n", ramsize, swapsize);
	return ramsize + (swapsize / 2);
}

unsigned long get_usecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		fatal("clock_gettime");
	return (myts->tv_sec * 1000000 + myts->tv_nsec / 1000);
}

int main(void)
{
	unsigned long current_time, time_diff;
	struct timespec myts;
	char *buf1, *buf2, *buf3, *buf4;
	size_t size = get_ram();
	int sleep_seconds = 600;

	if (size > ramsize / 2 * 3)
		size = ramsize / 2 * 3;
	printf("Total ram to be malloced: %u bytes\n", size);
	size /= 2;

	printf("Starting first malloc of %u bytes\n", size);
	buf1 = malloc(size);
	buf4 = malloc(1);
	if (buf1 == (char *)-1)
		fatal("Failed to malloc 1st buffer\n");
	memset(buf1, 0, size);

	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	printf("Starting 1st read of first malloc\n");
	current_time = get_usecs(&myts);
	time_diff = current_time - time_diff;
	printf("Touching this much ram takes %lu milliseconds\n", time_diff / 1000);

	printf("Starting second malloc of %u bytes\n", size);
	buf2 = malloc(size);
	if (buf2 == (char *)-1)
		fatal("Failed to malloc 2nd buffer\n");
	memset(buf2, 0, size);
	for (buf3 = buf2 + size; buf3 > buf2; buf3--)
		*buf4 = *buf3;
	free(buf2);
	printf("Completed second malloc and free\n");

	printf("Sleeping for %u seconds\n", sleep_seconds);
	sleep(sleep_seconds);

	printf("Important part - starting reread of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	time_diff = current_time - time_diff;
	printf("Completed read of first malloc\n");
	printf("Timed portion %lu milliseconds\n", time_diff / 1000);

	free(buf1);
	free(buf4);
	return 0;
}
Re: swap-prefetch: 2.6.22 -mm merge plans
On Friday 04 May 2007 18:52, Ingo Molnar wrote: > agreed. Con, IIRC you wrote a testcase for this, right? Could you please > send us the results of that testing? Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch disabled and then enabled swap prefetch saves ~5 seconds on average hardware on this one test case. I had many users try this and the results were between 2 and 10 seconds, but always showed a saving on this testcase. This effect easily occurs on printing a big picture, editing a large file, compressing an iso image or whatever in real world workloads. Smaller, but much more frequent effects of this over the course of a day obviously also occur and do add up. -- -ck

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

void fatal(const char *format, ...)
{
	va_list ap;

	if (format) {
		va_start(ap, format);
		vfprintf(stderr, format, ap);
		va_end(ap);
	}
	fprintf(stderr, "Fatal error - exiting\n");
	exit(1);
}

size_t get_ram(void)
{
	unsigned long ramsize;
	FILE *meminfo;
	char aux[256];

	if (!(meminfo = fopen("/proc/meminfo", "r")))
		fatal("fopen\n");
	while (!feof(meminfo) && !fscanf(meminfo, "MemTotal: %lu kB", &ramsize))
		fgets(aux, sizeof(aux), meminfo);
	if (fclose(meminfo) == -1)
		fatal("fclose");
	return ramsize * 1000;
}

unsigned long get_usecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		fatal("clock_gettime");
	return (myts->tv_sec * 1000000 + myts->tv_nsec / 1000);
}

int main(void)
{
	unsigned long current_time, time_diff;
	struct timespec myts;
	char *buf1, *buf2, *buf3, *buf4;
	size_t size, full_size = get_ram();
	int sleep_seconds = 600;

	size = full_size * 7 / 10;
	printf("Starting first malloc of %d bytes\n", size);
	buf1 = malloc(size);
	if (buf1 == (char *)-1)
		fatal("Failed to malloc 1st buffer\n");
	memset(buf1, 0, size);

	printf("Completed first malloc and starting second malloc of %d bytes\n", full_size);
	buf2 = malloc(full_size);
	if (buf2 == (char *)-1)
		fatal("Failed to malloc 2nd buffer\n");
	memset(buf2, 0, full_size);

	buf4 = malloc(1);
	for (buf3 = buf2 + full_size; buf3 > buf2; buf3--)
		*buf4 = *buf3;
	free(buf2);
	printf("Completed second malloc and free\n");

	printf("Sleeping for %d seconds\n", sleep_seconds);
	sleep(sleep_seconds);

	printf("Important part - starting read of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	free(buf4);
	free(buf1);
	printf("Completed read and freeing of first malloc\n");
	time_diff = current_time - time_diff;
	printf("Timed portion %lu microseconds\n", time_diff);
	return 0;
}
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote: Nick Piggin wrote: Christoph Hellwig wrote: Is that every fork/exec or just under certain cicumstances? A 5% regression on every fork/exec is not acceptable. Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers will be improved as well with that patch. Then if we have specific lock/unlock bitops, I hope it should reduce that further. OK, with the races and missing barriers fixed from the previous patch, plus the attached one added (+patch3), numbers are better again (I'm not sure if I have the ppc barriers correct though). These ops could also be put to use in bit spinlocks, buffer lock, and probably a few other places too. 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 +patch 1.71-1.73 175.2-180.8 780.5-794.2 +patch2 1.61-1.63 169.8-175.0 748.6-757.0 +patch3 1.54-1.57 165.6-170.9 748.5-757.5 So fault performance goes to under 5%, fork is in the noise, exec is still up 1%, but maybe that's noise or cache effects again. OK, with my new lock/unlock_page, dd if=large (bigger than RAM) sparse file of=/dev/null with an experimentally optimal block size (32K) goes from 626MB/s to 683MB/s on 2 CPU G5 booted with maxcpus=1. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote: Christoph Hellwig wrote: Is that every fork/exec or just under certain circumstances? A 5% regression on every fork/exec is not acceptable. Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers will be improved as well with that patch. Then if we have specific lock/unlock bitops, I hope it should reduce that further. OK, with the races and missing barriers fixed from the previous patch, plus the attached one added (+patch3), numbers are better again (I'm not sure if I have the ppc barriers correct though). These ops could also be put to use in bit spinlocks, buffer lock, and probably a few other places too.

2.6.21   1.49-1.51  164.6-170.8  741.8-760.3
+patch   1.71-1.73  175.2-180.8  780.5-794.2
+patch2  1.61-1.63  169.8-175.0  748.6-757.0
+patch3  1.54-1.57  165.6-170.9  748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is still up 1%, but maybe that's noise or cache effects again. -- SUSE Labs, Novell Inc.

Index: linux-2.6/include/asm-powerpc/bitops.h
===
--- linux-2.6.orig/include/asm-powerpc/bitops.h	2007-05-04 16:08:20.0 +1000
+++ linux-2.6/include/asm-powerpc/bitops.h	2007-05-04 16:14:39.0 +1000
@@ -87,6 +87,24 @@
 	: "cc" );
 }
 
+static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+{
+	unsigned long old;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+	LWSYNC_ON_SMP
+"1:"	PPC_LLARX "%0,0,%3	# clear_bit_unlock\n"
+	"andc	%0,%0,%2\n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%0,0,%3\n"
+	"bne-	1b"
+	: "=&r" (old), "+m" (*p)
+	: "r" (mask), "r" (p)
+	: "cc" );
+}
+
 static __inline__ void change_bit(int nr, volatile unsigned long *addr)
 {
 	unsigned long old;
@@ -126,6 +144,27 @@
 	return (old & mask) != 0;
 }
 
+static __inline__ int test_and_set_bit_lock(unsigned long nr,
+					volatile unsigned long *addr)
+{
+	unsigned long old, t;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+"1:"	PPC_LLARX "%0,0,%3		# test_and_set_bit_lock\n"
+	"or	%1,%0,%2 \n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%1,0,%3 \n"
+	"bne-	1b"
+	ISYNC_ON_SMP
+	: "=&r" (old), "=&r" (t)
+	: "r" (mask), "r" (p)
+	: "cc", "memory");
+
+	return (old & mask) != 0;
+}
+
 static __inline__ int test_and_clear_bit(unsigned long nr,
 					 volatile unsigned long *addr)
 {
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h	2007-05-04 16:14:36.0 +1000
+++ linux-2.6/include/linux/pagemap.h	2007-05-04 16:17:34.0 +1000
@@ -136,13 +136,18 @@
 extern void FASTCALL(__wait_on_page_locked(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
 
+static inline int trylock_page(struct page *page)
+{
+	return (likely(!TestSetPageLocked_Lock(page)));
+}
+
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page(page);
 }
 
@@ -153,7 +158,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page_nosync(page);
 }
 
Index: linux-2.6/drivers/scsi/sg.c
===
--- linux-2.6.orig/drivers/scsi/sg.c	2007-04-12 14:35:08.0 +1000
+++ linux-2.6/drivers/scsi/sg.c	2007-05-04 16:23:27.0 +1000
@@ -1734,7 +1734,7 @@
 		 */
 		flush_dcache_page(pages[i]);
 		/* ?? Is locking needed? I don't think so */
-		/* if (TestSetPageLocked(pages[i]))
+		/* if (!trylock_page(pages[i]))
 		   goto out_unlock; */
 	}
 
Index: linux-2.6/fs/cifs/file.c
===
--- linux-2.6.orig/fs/cifs/file.c	2007-04-12 14:35:09.0 +1000
+++ linux-2.6/fs/cifs/file.c	2007-05-04 16:23:36.0 +1000
@@ -1229,7 +1229,7 @@
 			if (first < 0)
 				lock_page(page);
-			else if (TestSetPageLocked(page))
+			else if (!trylock_page(page))
 				break;
 
 			if (unlikely(page->mapping != mapping)) {
Index: linux-2.6/fs/jbd/commit.c
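For illustration of the "bit spinlocks, buffer lock" remark above, here is a minimal sketch of a bit-granular lock built only on the acquire/release semantics of the two primitives this patch adds; the my_bit_spin_lock()/my_bit_spin_unlock() names are hypothetical and not part of the posted patch.

static inline void my_bit_spin_lock(int bitnum, unsigned long *addr)
{
	/* test_and_set_bit_lock() returns the old bit value and has
	 * acquire semantics when it wins, so spin until it returns 0 */
	while (test_and_set_bit_lock(bitnum, addr))
		cpu_relax();
}

static inline void my_bit_spin_unlock(int bitnum, unsigned long *addr)
{
	/* clear_bit_unlock() orders all stores inside the critical
	 * section before the bit is cleared (release semantics) */
	clear_bit_unlock(bitnum, addr);
}

The point of the pairing is that the barrier cost is confined to the lock-flavoured operations, so callers of the plain set_bit()/clear_bit() variants do not pay for it.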
Re: swap-prefetch: 2.6.22 -mm merge plans
Ingo Molnar wrote: * Nick Piggin <[EMAIL PROTECTED]> wrote: Here were some of my concerns, and where our discussion got up to. Yes. Perhaps it just doesn't help with the updatedb thing. Or maybe with normal system activity we get enough free pages to kick the thing off and running. Perhaps updatedb itself has a lot of rss, for example. Could be, but I don't know. I'd think it unlikely to allow _much_ swapin, if huge amounts of the desktop have been swapped out. But maybe... as I said, nobody seems to have a recipe for these things. can i take this one as a "no fundamental objection"? There are really only 2 maintainance options left: 1) either you can do it better or at least have a _very_ clearly described idea outlined about how to do it differently 2) or you should let others try it #1 you've not done for 2-3 years since swap-prefetch was waiting for integration so it's not an option at this stage anymore. Then you are pretty much obliged to do #2. ;-) The burden is not on me to get someone else's feature merged. If it can be shown to work well and people's concerns addressed, then anything will get merged. The reason Linux is so good is because of what we don't merge, figuratively speaking. I wanted to see some basic regression tests to show that it hasn't caused obvious problems, and some basic scenarios where it helps, so that we can analyse them. It is really simple, but I haven't got any since first asking. And note that I don't think I ever explicitly "nacked" anything, just voiced my concerns. If my concerns had been addressed, then I couldn't have stopped anybody from merging anything. 2) It is a _highly_ speculative operation, and in workloads where periods of low and high page usage with genuinely unused anonymous / tmpfs pages, it could waste power, memory bandwidth, bus bandwidth, disk bandwidth... Yes. I suspect that's a matter of waiting for the corner-case reporters to complain, then add more heuristics. Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch is happy to do a _lot_ of work for these things which we have already decided are least likely to be used again. i see no real problem here. We've had heuristics for a _long_ time in various areas of the code. Sometimes they work, sometimes they suck. So that's one of my issues with the code. If all you have to support a merge is anecodal evidence, then I find it interesting that you would easily discount something like this. 4) If this is helpful, wouldn't it be equally important for things like mapped file pages? Seems like half a solution. [...] (otoh the akpm usersapce implementation is swapoff -a;swapon -a) Perhaps. You may need a few indicators to see whether the system is idle... but OTOH, we've already got a lot of indicators for memory, disk usage, etc. So, maybe :) The time has passed for this. Let others play too. Please :-) Play with what? Prefetching mmaped file pages as well? Sure. I could be wrong, but IIRC there is no good way to know which cpuset to bring the page back into, (and I guess similarly it would be hard to know what container to account it to, if doing account-on-allocate). (i think cpusets are totally uninteresting in this context: nobody in their right mind is going to use swap-prefetch on a big NUMA box. Nor can i see any fundamental impediment to making this more cpuset-aware, just like other subsystems were made cpuset-aware, once the requests from actual users came in and people started getting interested in it.) OK, so make it more cpuset aware. 
This isn't a new issue, I raised it a long time ago. And trust me, it is a nightmare to just assume that nobody will use cpusets on a small box for example (AFAIK the resource control guys are looking at doing just that). All core VM features should play nicely with each other without *really* good reason. I think the "lack of testcase and numbers" is the only valid technical objection i've seen so far. Well you're entitled to your opinion too. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
* Nick Piggin <[EMAIL PROTECTED]> wrote: > > i'm wondering about swap-prefetch: > Being able to config all these core heuristics changes is really not > that much of a positive. The fact that we might _need_ to config > something out, and double the configuration range isn't too pleasing. Well, to the desktop user this is a speculative performance feature that he is willing to potentially waste CPU and IO capacity, in expectation of better performance. On the conceptual level it is _precisely the same thing as regular file readahead_. (with the difference that to me swapahead seems to be quite a bit more intelligent than our current file readahead logic.) This feature has no API or ABI impact at all, it's a pure performance feature. (besides the trivial sysctl to turn it runtime on/off). > Here were some of my concerns, and where our discussion got up to. > > Yes. Perhaps it just doesn't help with the updatedb thing. Or > > maybe with normal system activity we get enough free pages to kick > > the thing off and running. Perhaps updatedb itself has a lot of > > rss, for example. > > Could be, but I don't know. I'd think it unlikely to allow _much_ > swapin, if huge amounts of the desktop have been swapped out. But > maybe... as I said, nobody seems to have a recipe for these things. can i take this one as a "no fundamental objection"? There are really only 2 maintainance options left: 1) either you can do it better or at least have a _very_ clearly described idea outlined about how to do it differently 2) or you should let others try it #1 you've not done for 2-3 years since swap-prefetch was waiting for integration so it's not an option at this stage anymore. Then you are pretty much obliged to do #2. ;-) And let me be really blunt about this, there is no option #3 to say: "I have no real better idea, I have no code, I have no time, but hey, lets not merge this because it 'could in theory' be possible to do it better" =B-) really, we are likely be better off by risking the merge of _bad_ code (which in the swap-prefetch case is the exact opposite of the truth), than to let code stagnate. People are clearly unhappy about certain desktop aspects of swapping, and the only way out of that is to let more people hack that code. Merging code involves more people. It will cause 'noise' and could cause regressions, but at least in this case the only impact is 'performance' and the feature is trivial to disable. The maintainance drag outside of swap_prefetch.c is essentially _zero_. If the feature doesnt work it ends up on Con's desk. If it turns out to not work at all (despite years of testing and happy users) it still only ends up on Con's desk. A clear win/win scenario for you i think :-) > > Would be useful to see this claim substantiated with a real > > testcase, description of results and an explanation of how and why > > it worked. > > Yes... and then try to first improve regular page reclaim and use-once > handling. agreed. Con, IIRC you wrote a testcase for this, right? Could you please send us the results of that testing? > >>2) It is a _highly_ speculative operation, and in workloads where periods > >>of low and high page usage with genuinely unused anonymous / tmpfs > >>pages, it could waste power, memory bandwidth, bus bandwidth, disk > >>bandwidth... > > > > Yes. I suspect that's a matter of waiting for the corner-case > > reporters to complain, then add more heuristics. > > Ugh. Well it is a pretty fundamental problem. 
Basically swap-prefetch > is happy to do a _lot_ of work for these things which we have already > decided are least likely to be used again. i see no real problem here. We've had heuristics for a _long_ time in various areas of the code. Sometimes they work, sometimes they suck. the flow of this is really easy: distro looking for a feature edge turns it on and announces it, if the feature does not work out for users then user turns it off and complains to distro, if enough users complain then distro turns it off for next release, upstream forgets about this performance feature and eventually removes it once someone notices that it wouldnt even compile in the past 2 main releases. I see no problem here, we did that in the past too with performance features. The networking stack has literally dozens of such small tunable things which get experimented with, and whose defaults do get tuned carefully. Some of the knobs help bandwidth, some help latency. I do not even see any risk of "splitup of mindshare" - swap-prefetch is so clearly speculative that it's not really a different view about how to do swapping (which would split the tester base, etc.), it's simply a "do you want your system to speculate about the future or not" add-on decision. Every system has a pretty clear idea about that: desktops generally want to do it, clusters generally dont want to do it. > >>3) I haven't seen a single
Re: swap-prefetch: 2.6.22 -mm merge plans
Ingo Molnar wrote: * Andrew Morton <[EMAIL PROTECTED]> wrote: - If replying, please be sure to cc the appropriate individuals. Please also consider rewriting the Subject: to something appropriate. i'm wondering about swap-prefetch: Well I had some issues with it that I don't think were fully discussed, and Andrew prompted me to say something, but it went off list for a couple of posts (my fault, sorry). Posting it below with Andrew's permission... mm-implement-swap-prefetching.patch swap-prefetch-avoid-repeating-entry.patch add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch The swap-prefetch feature is relatively compact: 10 files changed, 745 insertions(+), 1 deletion(-) it is contained mostly to itself: mm/swap_prefetch.c| 581 i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up. I think this has been held back way too long. It's .config selectable and it is as ready for integration as it ever is going to be. So it's a win/win scenario. Being able to config all these core heuristics changes is really not that much of a positive. The fact that we might _need_ to config something out, and double the configuration range isn't too pleasing. Here were some of my concerns, and where our discussion got up to. Andrew Morton wrote: > On Fri, 04 May 2007 14:34:45 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > > >>Andrew Morton wrote: >> >>>istr you had issues with swap-prefetch? >>> >>>If so, now's a good time to reiterate them ;) >> >>1) It is said to help with the updatedb overnight problem, however it >>actually _doesn't_ prefetch swap when there are low free pages, which >>is how updatedb will leave the system. So this really puzzles me how >>it would work. However if updatedb is causing excessive swapout, I >>think we should try improving use-once algorithms first, for example. > > > Yes. Perhaps it just doesn't help with the updatedb thing. Or maybe with > normal system activity we get enough free pages to kick the thing off and > running. Perhaps updatedb itself has a lot of rss, for example. Could be, but I don't know. I'd think it unlikely to allow _much_ swapin, if huge amounts of the desktop have been swapped out. But maybe... as I said, nobody seems to have a recipe for these things. > Would be useful to see this claim substantiated with a real testcase, > description of results and an explanation of how and why it worked. Yes... and then try to first improve regular page reclaim and use-once handling. >>2) It is a _highly_ speculative operation, and in workloads where periods >>of low and high page usage with genuinely unused anonymous / tmpfs >>pages, it could waste power, memory bandwidth, bus bandwidth, disk >>bandwidth... > > > Yes. I suspect that's a matter of waiting for the corner-case reporters to > complain, then add more heuristics. Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch is happy to do a _lot_ of work for these things which we have already decided are least likely to be used again. >>3) I haven't seen a single set of numbers out of it. Feedback seems to >>have mostly come from people who > > > Yup. But can we come up with a testcase? It's hard. I guess it is hard firstly because swapping is quite random to start with. 
But I haven't even seen basic things like "make -jhuge swapstorm has no regressions". >>4) If this is helpful, wouldn't it be equally important for things like >>mapped file pages? Seems like half a solution. > > > True. > > Without thinking about it, I almost wonder if one could do a userspace > implementation with something which pokes around in /proc/pid/pagemap and > /proc/pid/kpagemap, perhaps with some additional interfaces added to > do a swapcache read. (Give userspace the ability to get at swapcache > via a regular fd?) > > (otoh the akpm usersapce implementation is swapoff -a;swapon -a) Perhaps. You may need a few indicators to see whether the system is idle... but OTOH, we've already got a lot of indicators for memory, disk usage, etc. So, maybe :) >>5) New one: it is possibly going to interact badly with MADV_FREE lazy >>freeing. The more complex we make page reclaim, the worse IMO. > > > That's a bit vague. What sort of problems do you envisage? Well MADV_FREE pages aren't technically free, are they? So it might be possible for a significant number of them to build up and prevent swap prefetch from working. Maybe. >>...) I had a few issues with implementation, like interaction with >>cpusets. Don't know if these are all fixed or not. I sort of gave >>up looking at it. > > > Ah yes, I remember some mention of cpusets.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Andrew Morton wrote: On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

 void fastcall unlock_page(struct page *page)
 {
+	VM_BUG_ON(!PageLocked(page));
 	smp_mb__before_clear_bit();
-	if (!TestClearPageLocked(page))
-		BUG();
-	smp_mb__after_clear_bit();
-	wake_up_page(page, PG_locked);
+	ClearPageLocked(page);
+	if (unlikely(test_bit(PG_waiters, &page->flags))) {
+		clear_bit(PG_waiters, &page->flags);
+		wake_up_page(page, PG_locked);
+	}
 }

Why is that significantly faster than plain old wake_up_page(), which tests waitqueue_active()? Because it needs fewer barriers and doesn't touch a random hash cacheline in the fastpath. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
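For readers following the PG_waiters trick: the fast path above only works if sleepers advertise themselves before sleeping. The snippet below is a rough sketch of that waiter side under that assumption, not code quoted from Nick's patch (example_lock_page_slow() is a hypothetical name, and the careful memory barriers the real patch needs are omitted).

static void example_lock_page_slow(struct page *page)
{
	for (;;) {
		/* advertise the waiter before retrying the lock bit, so
		 * unlock_page() knows it must go to the waitqueue hash */
		set_bit(PG_waiters, &page->flags);
		if (!TestSetPageLocked(page))
			break;		/* got the lock */
		/* sleep; unlock_page() sees PG_waiters and wakes us */
		wait_on_page_bit(page, PG_locked);
	}
}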
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Fri, 4 May 2007, Benjamin Herrenschmidt wrote: > > The SLUB allocator relies on struct page fields first_page and slab, > > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then > > be used for the lowest level of pagetable pages. This was obstructing > > SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert > > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd > > want partpages, so continue to use kmem_caches for pmd, pud and pgd). > > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM. > > Interesting... I'll have a look asap. I would also recommend looking at removing the constructors for the remaining slabs. A constructor requires that SLUB never touch the object (same situation as is resulting from enabling debugging). So it must increase the object size in order to put the free pointer after the object. In case of a order of 2 cache this has a particularly bad effect of doubling object size. If the objects can be overwritten on free (no constructor) then we can use the first word of the object as a freepointer on kfree. Meaning we can use a hot cacheline so no cache miss. On alloc we have already touched the first cacheline which also avoids a cacheline fetch there. This is the optimal way of operation for SLUB. Hmmm We could add an option to allow the use of a constructor while keeping the free pointer at the beginning of the object? Then we would have to zap the first word on alloc. Would work like quicklists. Add SLAB_FREEPOINTER_MAY_OVERLAP? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
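A simplified illustration of the layout tradeoff described above, not SLUB's actual code: without a constructor the freelist link can overlay the first word of a dead object, while with a constructor the object contents must be preserved, so the link is appended after the object. The helper name effective_object_size() is purely illustrative.

static inline unsigned long effective_object_size(unsigned long objsize,
						  int has_constructor)
{
	if (!has_constructor)
		return objsize;			/* free pointer overlays the object */
	return objsize + sizeof(void *);	/* free pointer appended after it */
}

For a power-of-two sized object the appended pointer can push the allocation into the next size class, which is where the "doubling" effect mentioned above comes from.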
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 2007-05-03 at 22:04 +0100, Hugh Dickins wrote: > On Thu, 3 May 2007, Hugh Dickins wrote: > > > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > > as intended: maybe it just works some of the time. I'm not going > > to hazard a guess as to how to fix it up, will resume looking at > > the powerpc's quicklist potential later. > > Here's the patch I've been testing on G5, with 4k and with 64k pages, > with SLAB and with SLUB. But, though it doesn't crash, the pgd > kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity > for using highorder allocations where SLAB would stick to order 0: > under load, exec's mm_init gets page allocation failure on order 4 > - SLUB's calculate_order may need some retuning. (I'd expect it to > be going for order 3 actually, I'm not sure how order 4 comes about.) > > I don't know how offensive Ben and Paulus may find this patch: > the kmem_cache use was nicely done and this messes it up a little. > > > The SLUB allocator relies on struct page fields first_page and slab, > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then > be used for the lowest level of pagetable pages. This was obstructing > SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd > want partpages, so continue to use kmem_caches for pmd, pud and pgd). > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM. Interesting... I'll have a look asap. Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 3 May 2007, Christoph Lameter wrote: > > There are SLUB patches pending (not in rc7-mm2 as far as I can recall) > that reduce the default page order sizes to head off this issue. The > defaults were initially too large (and they still default to large > for testing if Mel's Antifrag work is detected to be active). Sounds good. > > - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], > > - GFP_KERNEL|__GFP_REPEAT); > > + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); > > __GFP_REPEAT is unusual here but this was carried over from the > kmem_cache_alloc it seems. Hmm... There is some variance on how we do this > between arches. Should we uniformly set or not set this flag? Not something to get into in this patch, but it did surprise me too. I believe __GFP_REPEAT should be avoided, and I don't see justification for it here (but need to remember not to do a blind virt_to_page on the result in some places if it might return NULL - which IIRC it actually can do even if __GFP_REPEAT, when chosen for OOM kill). But I've a suspicion it got put in there for some good reason I don't know about. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
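To make that caveat concrete, here is a rough sketch -- an assumption about how such a conversion could look, not actual powerpc code, with invented helper names -- of a pte allocation path that drops __GFP_REPEAT and therefore checks for a NULL return instead of doing a blind virt_to_page():

#include <linux/mm.h>
#include <linux/quicklist.h>

/* Sketch only: may return NULL once __GFP_REPEAT is dropped. */
static inline pte_t *sketch_pte_alloc_one_kernel(struct mm_struct *mm,
						 unsigned long address)
{
	return quicklist_alloc(0, GFP_KERNEL, NULL);
}

/* Callers must not call virt_to_page() on a possibly-NULL result. */
static inline struct page *sketch_pte_alloc_one(struct mm_struct *mm,
						unsigned long address)
{
	pte_t *pte = sketch_pte_alloc_one_kernel(mm, address);

	return pte ? virt_to_page(pte) : NULL;
}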
Re: swap-prefetch: 2.6.22 -mm merge plans
On Friday 04 May 2007 01:54, Ingo Molnar wrote: > * Andrew Morton <[EMAIL PROTECTED]> wrote: > > - If replying, please be sure to cc the appropriate individuals. > > Please also consider rewriting the Subject: to something > > appropriate. > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback > i've seen so far was positive. Time to have this upstream and time for a > desktop-oriented distro to pick it up. > > I think this has been held back way too long. It's .config selectable > and it is as ready for integration as it ever is going to be. So it's a > win/win scenario. > > Acked-by: Ingo Molnar <[EMAIL PROTECTED]> Thank you very much for code review, ack and support! -- -ck - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 3 May 2007, Hugh Dickins wrote: > On Thu, 3 May 2007, Hugh Dickins wrote: > > > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > > as intended: maybe it just works some of the time. I'm not going > > to hazard a guess as to how to fix it up, will resume looking at > > the powerpc's quicklist potential later. > > Here's the patch I've been testing on G5, with 4k and with 64k pages, > with SLAB and with SLUB. But, though it doesn't crash, the pgd > kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity > for using highorder allocations where SLAB would stick to order 0: > under load, exec's mm_init gets page allocation failure on order 4 > - SLUB's calculate_order may need some retuning. (I'd expect it to > be going for order 3 actually, I'm not sure how order 4 comes about.) There are SLUB patches pending (not in rc7-mm2 as far as I can recall) that reduce the default page order sizes to head off this issue. The defaults were initially too large (and they still default to large for testing if Mel's Antifrag work is detected to be active). > - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], > - GFP_KERNEL|__GFP_REPEAT); > + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); __GFP_REPEAT is unusual here but this was carried over from the kmem_cache_alloc it seems. Hmm... There is some variance on how we do this between arches. Should we uniformly set or not set this flag? [EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-ia64/* include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-ia64/pgalloc.h: void *pg = quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); [EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc arch/i386/mm/* arch/i386/mm/pgtable.c: pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor); [EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-sparc64/* include/asm-sparc64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-sparc64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-sparc64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL); include/asm-sparc64/pgalloc.h: void *pg = quicklist_alloc(0, GFP_KERNEL, NULL); [EMAIL PROTECTED]:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-x86_64/* include/asm-x86_64/pgalloc.h: return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL); include/asm-x86_64/pgalloc.h: return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL); include/asm-x86_64/pgalloc.h: pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD, include/asm-x86_64/pgalloc.h: return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL); include/asm-x86_64/pgalloc.h: void *p = (void *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub on PowerPC
On Thu, 3 May 2007, Hugh Dickins wrote: > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > as intended: maybe it just works some of the time. I'm not going > to hazard a guess as to how to fix it up, will resume looking at > the powerpc's quicklist potential later. Here's the patch I've been testing on G5, with 4k and with 64k pages, with SLAB and with SLUB. But, though it doesn't crash, the pgd kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity for using highorder allocations where SLAB would stick to order 0: under load, exec's mm_init gets page allocation failure on order 4 - SLUB's calculate_order may need some retuning. (I'd expect it to be going for order 3 actually, I'm not sure how order 4 comes about.) I don't know how offensive Ben and Paulus may find this patch: the kmem_cache use was nicely done and this messes it up a little. The SLUB allocator relies on struct page fields first_page and slab, overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then be used for the lowest level of pagetable pages. This was obstructing SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd want partpages, so continue to use kmem_caches for pmd, pud and pgd). But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM. Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- arch/powerpc/Kconfig |4 arch/powerpc/mm/init_64.c | 17 ++--- include/asm-powerpc/pgalloc.h | 26 +++--- 3 files changed, 21 insertions(+), 26 deletions(-) --- 2.6.21-rc7-mm2/arch/powerpc/Kconfig 2007-04-26 13:33:51.0 +0100 +++ linux/arch/powerpc/Kconfig 2007-05-03 20:45:12.0 +0100 @@ -31,6 +31,10 @@ config MMU bool default y +config QUICKLIST + bool + default y + config GENERIC_HARDIRQS bool default y --- 2.6.21-rc7-mm2/arch/powerpc/mm/init_64.c2007-04-26 13:33:51.0 +0100 +++ linux/arch/powerpc/mm/init_64.c 2007-05-03 20:45:12.0 +0100 @@ -146,21 +146,16 @@ static void zero_ctor(void *addr, struct memset(addr, 0, kmem_cache_size(cache)); } -#ifdef CONFIG_PPC_64K_PAGES -static const unsigned int pgtable_cache_size[3] = { - PTE_TABLE_SIZE, PMD_TABLE_SIZE, PGD_TABLE_SIZE -}; -static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = { - "pte_pmd_cache", "pmd_cache", "pgd_cache", -}; -#else static const unsigned int pgtable_cache_size[2] = { - PTE_TABLE_SIZE, PMD_TABLE_SIZE + PGD_TABLE_SIZE, PMD_TABLE_SIZE }; static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = { - "pgd_pte_cache", "pud_pmd_cache", -}; +#ifdef CONFIG_PPC_64K_PAGES + "pgd_cache", "pmd_cache", +#else + "pgd_cache", "pud_pmd_cache", #endif /* CONFIG_PPC_64K_PAGES */ +}; #ifdef CONFIG_HUGETLB_PAGE /* Hugepages need one extra cache, initialized in hugetlbpage.c. 
We --- 2.6.21-rc7-mm2/include/asm-powerpc/pgalloc.h2007-02-04 18:44:54.0 + +++ linux/include/asm-powerpc/pgalloc.h 2007-05-03 20:45:12.0 +0100 @@ -10,21 +10,15 @@ #include #include #include +#include extern struct kmem_cache *pgtable_cache[]; -#ifdef CONFIG_PPC_64K_PAGES -#define PTE_CACHE_NUM 0 -#define PMD_CACHE_NUM 1 -#define PGD_CACHE_NUM 2 -#define HUGEPTE_CACHE_NUM 3 -#else -#define PTE_CACHE_NUM 0 -#define PMD_CACHE_NUM 1 -#define PUD_CACHE_NUM 1 #define PGD_CACHE_NUM 0 +#define PUD_CACHE_NUM 1 +#define PMD_CACHE_NUM 1 #define HUGEPTE_CACHE_NUM 2 -#endif +#define PTE_CACHE_NUM 3 /* from quicklist rather than kmem_cache */ /* * This program is free software; you can redistribute it and/or @@ -97,8 +91,7 @@ static inline void pmd_free(pmd_t *pmd) static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], - GFP_KERNEL|__GFP_REPEAT); + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); } static inline struct page *pte_alloc_one(struct mm_struct *mm, @@ -109,7 +102,7 @@ static inline struct page *pte_alloc_one static inline void pte_free_kernel(pte_t *pte) { - kmem_cache_free(pgtable_cache[PTE_CACHE_NUM], pte); + quicklist_free(0, NULL, pte); } static inline void pte_free(struct page *ptepage) @@ -136,7 +129,10 @@ static inline void pgtable_free(pgtable_ void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK); int cachenum = pgf.val & PGF_CACHENUM_MASK; - kmem_cache_free(pgtable_cache[cachenum], p); + if (cachenum == PTE_CACHE_NUM) + quicklist_free(0, NULL, p); + else + kmem_cache_free(pgtable_cache[cachenum], p); } extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
Re: 2.6.22 -mm merge plans
On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote: > > kprobes does this kind of synchronization internally, so the marker > > wrapper should probably as well. > > > > The problem appears on heavily loaded systems. Doing 50 > synchronize_sched() calls in a row can take up to a few seconds on a > 4-way machine. This is why I prefer to do it in the module to which > the callbacks belong. We recently had a discussion on a batch unregistration interface for kprobes. I'm not very happy with having such different interfaces for different kinds of probe registration. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
Here is the reworked patch, except a comment : * Christoph Hellwig ([EMAIL PROTECTED]) wrote: > > +void blk_probe_disconnect(void) > > +{ > > + uint8_t i; > > + > > + for (i = 0; i < NUM_PROBES; i++) { > > + marker_remove_probe(probe_array[i].name); > > + } > > + synchronize_sched();/* Wait for probes to finish */ > > kprobes does this kind of synchronization internally, so the marker > wrapper should probabl aswell. > The problem appears on heavily loaded systems. Doing 50 synchronize_sched() calls in a row can take up to a few seconds on a 4-way machine. This is why I prefer to do it in the module to which the callbacks belong. Here is the reviewed patch. It depends on a newer version of markers I'll send to Andrew soon. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Index: linux-2.6-lttng/block/elevator.c === --- linux-2.6-lttng.orig/block/elevator.c 2007-05-03 12:27:12.0 -0400 +++ linux-2.6-lttng/block/elevator.c2007-05-03 12:54:58.0 -0400 @@ -32,7 +32,7 @@ #include #include #include -#include +#include #include #include @@ -571,7 +571,7 @@ unsigned ordseq; int unplug_it = 1; - blk_add_trace_rq(q, rq, BLK_TA_INSERT); + trace_mark(blk_request_insert, "%p %p", q, rq); rq->q = q; @@ -757,7 +757,7 @@ * not be passed by new incoming requests */ rq->cmd_flags |= REQ_STARTED; - blk_add_trace_rq(q, rq, BLK_TA_ISSUE); + trace_mark(blk_request_issue, "%p %p", q, rq); } if (!q->boundary_rq || q->boundary_rq == rq) { Index: linux-2.6-lttng/block/ll_rw_blk.c === --- linux-2.6-lttng.orig/block/ll_rw_blk.c 2007-05-03 12:27:12.0 -0400 +++ linux-2.6-lttng/block/ll_rw_blk.c 2007-05-03 12:54:58.0 -0400 @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -1551,7 +1552,7 @@ if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, >queue_flags)) { mod_timer(>unplug_timer, jiffies + q->unplug_delay); - blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); + trace_mark(blk_plug_device, "%p %p %d", q, NULL, 0); } } @@ -1617,7 +1618,7 @@ * devices don't necessarily have an ->unplug_fn defined */ if (q->unplug_fn) { - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1628,7 +1629,7 @@ { request_queue_t *q = container_of(work, request_queue_t, unplug_work); - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1638,7 +1639,7 @@ { request_queue_t *q = (request_queue_t *)data; - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, + trace_mark(blk_pdu_unplug_timer, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); kblockd_schedule_work(>unplug_work); @@ -2148,7 +2149,7 @@ rq_init(q, rq); - blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ); + trace_mark(blk_get_request, "%p %p %d", q, bio, rw); out: return rq; } @@ -2178,7 +2179,7 @@ if (!rq) { struct io_context *ioc; - blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ); + trace_mark(blk_sleep_request, "%p %p %d", q, bio, rw); __generic_unplug_device(q); spin_unlock_irq(q->queue_lock); @@ -2252,7 +2253,7 @@ */ void blk_requeue_request(request_queue_t *q, struct request *rq) { - blk_add_trace_rq(q, rq, BLK_TA_REQUEUE); + trace_mark(blk_requeue, "%p %p", q, rq); if (blk_rq_tagged(rq)) blk_queue_end_tag(q, rq); @@ -2937,7 +2938,7 @@ if (!ll_back_merge_fn(q, req, bio)) break; - blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE); + trace_mark(blk_bio_backmerge, "%p %p", q, bio); req->biotail->bi_next = bio; req->biotail = bio; @@ -2954,7 
+2955,7 @@ if (!ll_front_merge_fn(q, req, bio)) break; - blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE); + trace_mark(blk_bio_frontmerge, "%p %p", q, bio); bio->bi_next = req->bio;
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > void fastcall unlock_page(struct page *page) > { > + VM_BUG_ON(!PageLocked(page)); > smp_mb__before_clear_bit(); > - if (!TestClearPageLocked(page)) > - BUG(); > - smp_mb__after_clear_bit(); > - wake_up_page(page, PG_locked); > + ClearPageLocked(page); > + if (unlikely(test_bit(PG_waiters, &page->flags))) { > + clear_bit(PG_waiters, &page->flags); > + wake_up_page(page, PG_locked); > + } > } Why is that significantly faster than plain old wake_up_page(), which tests waitqueue_active()? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
Hmmm... There are a gazillion configs to choose from. It works fine with cell_defconfig. If I switch to 2 processors I can enable SLUB; if I switch to 4 I cannot. I saw some other config weirdness, like being unable to set SMP if SLOB is enabled with some configs. This should not work and does not work, but the menus then vanish and one can still configure lots of processors without having enabled SMP. It works as far as I can tell... The rest is arch weirdness. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007, William Lee Irwin III wrote: > I've seen this crash in flush_old_exec() before. ISTR it being due to > slub vs. pagetable alignment or something on that order. From other discussion regarding SLAB: It may be necessary to set the default alignment to 8 bytes on 32-bit powerpc because it requires that alignment for 64-bit values. SLUB will *not* disable debugging like SLAB does if you do that. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
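If the 8-byte default alignment route is taken, the usual way to express it is a per-arch minimum slab alignment rather than per-cache flags. The macro names below are shown only as a sketch and are an assumption about the mechanism of this era, not a tested powerpc patch:

/*
 * Sketch only: raising the arch minimum object alignment keeps 64-bit
 * fields naturally aligned on 32-bit powerpc, and unlike SLAB, SLUB is
 * said above to keep its debugging usable with this set.
 */
#ifndef ARCH_SLAB_MINALIGN
#define ARCH_SLAB_MINALIGN	8
#endif

#ifndef ARCH_KMALLOC_MINALIGN
#define ARCH_KMALLOC_MINALIGN	8
#endif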
Re: swap-prefetch: 2.6.22 -mm merge plans
On 03/05/07, Michal Piotrowski <[EMAIL PROTECTED]> wrote: Hi, On 03/05/07, Ingo Molnar <[EMAIL PROTECTED]> wrote: > > * Andrew Morton <[EMAIL PROTECTED]> wrote: > > > - If replying, please be sure to cc the appropriate individuals. > > Please also consider rewriting the Subject: to something > > appropriate. > > i'm wondering about swap-prefetch: > > mm-implement-swap-prefetching.patch > swap-prefetch-avoid-repeating-entry.patch > add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch > > The swap-prefetch feature is relatively compact: > >10 files changed, 745 insertions(+), 1 deletion(-) > > it is contained mostly to itself: > >mm/swap_prefetch.c| 581 > > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback > i've seen so far was positive. Time to have this upstream and time for a > desktop-oriented distro to pick it up. > > I think this has been held back way too long. It's .config selectable > and it is as ready for integration as it ever is going to be. So it's a > win/win scenario. I'm using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs) since 2.6.14-mm2 :) http://lkml.org/lkml/2005/11/11/260 Regards, Michal -- Michal K. K. Piotrowski Kernel Monkeys (http://kernel.wikidot.com/start) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
Hi, On 03/05/07, Ingo Molnar <[EMAIL PROTECTED]> wrote: * Andrew Morton <[EMAIL PROTECTED]> wrote: > - If replying, please be sure to cc the appropriate individuals. > Please also consider rewriting the Subject: to something > appropriate. i'm wondering about swap-prefetch: mm-implement-swap-prefetching.patch swap-prefetch-avoid-repeating-entry.patch add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch The swap-prefetch feature is relatively compact: 10 files changed, 745 insertions(+), 1 deletion(-) it is contained mostly to itself: mm/swap_prefetch.c| 581 i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up. I think this has been held back way too long. It's .config selectable and it is as ready for integration as it ever is going to be. So it's a win/win scenario. I'm using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs) http://www.stardust.webpages.pl/files/tbf/euridica/2.6.17-mm1/mm-config and I don't recall _any_ problems. It's very stable for me. Acked-by: Ingo Molnar <[EMAIL PROTECTED]> Ingo Regards, Michal -- Michal K. K. Piotrowski Kernel Monkeys (http://kernel.wikidot.com/start) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: swap-prefetch: 2.6.22 -mm merge plans
* Andrew Morton <[EMAIL PROTECTED]> wrote: > - If replying, please be sure to cc the appropriate individuals. > Please also consider rewriting the Subject: to something > appropriate. i'm wondering about swap-prefetch: mm-implement-swap-prefetching.patch swap-prefetch-avoid-repeating-entry.patch add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch The swap-prefetch feature is relatively compact: 10 files changed, 745 insertions(+), 1 deletion(-) it is contained mostly to itself: mm/swap_prefetch.c| 581 i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up. I think this has been held back way too long. It's .config selectable and it is as ready for integration as it ever is going to be. So it's a win/win scenario. Acked-by: Ingo Molnar <[EMAIL PROTECTED]> Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Thu, May 03, 2007 at 11:04:15AM -0400, Mathieu Desnoyers wrote: > - blk_add_trace_rq(q, rq, BLK_TA_INSERT); > + MARK(blk_request_insert, "%p %p", q, rq); I don't really like the shouting MARK name very much. Can we have a less-generic, less shouting name, e.g. trace_marker? The above would then be: trace_mark(blk_request_insert, "%p %p", q, rq); > +#define NUM_PROBES ARRAY_SIZE(probe_array) just get rid of this and use ARRAY_SIZE directly below. > +int blk_probe_connect(void) > +{ > + int result; > + uint8_t i; just use an int for for loops. It's easier to read and probably faster on most systems (if the compiler isn't smart enough it promotes it to int anyway during code generation). > +void blk_probe_disconnect(void) > +{ > + uint8_t i; > + > + for (i = 0; i < NUM_PROBES; i++) { > + marker_remove_probe(probe_array[i].name); > + } > + synchronize_sched();/* Wait for probes to finish */ kprobes does this kind of synchronization internally, so the marker wrapper should probably as well. > +static int blk_probes_ref = 0; no need to initialize this. > /* > * Send out a notify message. > */ > @@ -229,6 +233,12 @@ > blk_remove_tree(bt->dir); > free_percpu(bt->sequence); > kfree(bt); > + mutex_lock(&blk_probe_mutex); > + if (blk_probes_ref == 1) { > + blk_probe_disconnect(); > + blk_probes_ref--; > + } if (--blk_probes_ref == 0) blk_probe_disconnect(); would probably be a tad cleaner. > + if (!blk_probes_ref) { > + blk_probe_connect(); > + blk_probes_ref++; > + } Ditto here with a: if (!blk_probes_ref++) blk_probe_connect(); also the "connect" in the name seems rather odd; what about arm/disarm instead? > static __init int blk_trace_init(void) > { > mutex_init(&blk_tree_mutex); > + mutex_init(&blk_probe_mutex); both should use DEFINE_MUTEX for compile-time initialization instead. Also it's probably better to put the trace points into blktrace.c, that means all blktrace code can be static and self-contained. And we can probably do some additional cleanups by simplifying things later on. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
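Pulling those review comments together, the reference-counted wrapper might end up looking something like this sketch. The function names blk_probes_get()/blk_probes_put() are placeholders invented here; blk_probe_connect()/blk_probe_disconnect() are the ones from the patch under review, and the whole thing assumes the blktrace.c context:

static DEFINE_MUTEX(blk_probe_mutex);
static int blk_probes_ref;	/* implicitly zero, no initializer needed */

/* "arm": only the first user actually hooks the markers */
static int blk_probes_get(void)
{
	int ret = 0;

	mutex_lock(&blk_probe_mutex);
	if (!blk_probes_ref++)
		ret = blk_probe_connect();
	mutex_unlock(&blk_probe_mutex);
	return ret;
}

/* "disarm": the last user tears the probes down again */
static void blk_probes_put(void)
{
	mutex_lock(&blk_probe_mutex);
	if (--blk_probes_ref == 0)
		blk_probe_disconnect();
	mutex_unlock(&blk_probe_mutex);
}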
Re: 2.6.22 -mm merge plans
* Andrew Morton ([EMAIL PROTECTED]) wrote: > > Although some, like Christoph and myself, think that it would benefit to > > the kernel community to have a common infrastructure for more than just > > markers (meaning common serialization and buffering mechanism), it does > > not change the fact that the markers, being in mainline, are usable by > > projects through additional kernel modules. > > > > If we are looking at current "potential users" that are already in > > mainline, we could change blktrace to make it use the markers. > > That'd be a useful demonstration. Here is a proof of concept patch, for demonstration purposes, of moving blktrace to the markers. A few remarks : this patch has the positive effect of removing some code from the block io tracing hot paths, minimizing the i-cache impact in a system where the io tracing is compiled in but inactive. It also moves the blk tracing code from a header (and therefore from the body of the instrumented functions) to a separate C file. There, as soon as one device has to be traced, every device has to fall into the tracing function call. This is slower than the previous inline function which tested the condition quickly. If it becomes a show stopper, it could be fixed by having the possibility to test a supplementary condition, dependent on the marker context, at the marker site, just after the enable/disable test. It does not make the code smaller, since I left all the specialized tracing functions for requests, bio, generic, remap, which would go away once a generic infrastructure is in place to serialize the information passed to the marker. This is mostly why I consider it a proof of concept. The patch, named "markers-port-blktrace-to-markers.patch", can be placed after the marker patches in the 2.6.21-rc7-mm2 series. 
Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Index: linux-2.6-lttng/block/elevator.c === --- linux-2.6-lttng.orig/block/elevator.c 2007-05-02 20:33:22.0 -0400 +++ linux-2.6-lttng/block/elevator.c2007-05-02 20:33:49.0 -0400 @@ -32,7 +32,7 @@ #include #include #include -#include +#include #include #include @@ -571,7 +571,7 @@ unsigned ordseq; int unplug_it = 1; - blk_add_trace_rq(q, rq, BLK_TA_INSERT); + MARK(blk_request_insert, "%p %p", q, rq); rq->q = q; @@ -757,7 +757,7 @@ * not be passed by new incoming requests */ rq->cmd_flags |= REQ_STARTED; - blk_add_trace_rq(q, rq, BLK_TA_ISSUE); + MARK(blk_request_issue, "%p %p", q, rq); } if (!q->boundary_rq || q->boundary_rq == rq) { Index: linux-2.6-lttng/block/ll_rw_blk.c === --- linux-2.6-lttng.orig/block/ll_rw_blk.c 2007-05-02 20:33:32.0 -0400 +++ linux-2.6-lttng/block/ll_rw_blk.c 2007-05-02 23:21:02.0 -0400 @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -1551,7 +1552,7 @@ if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, >queue_flags)) { mod_timer(>unplug_timer, jiffies + q->unplug_delay); - blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); + MARK(blk_plug_device, "%p %p %d", q, NULL, 0); } } @@ -1617,7 +1618,7 @@ * devices don't necessarily have an ->unplug_fn defined */ if (q->unplug_fn) { - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1628,7 +1629,7 @@ { request_queue_t *q = container_of(work, request_queue_t, unplug_work); - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1638,7 +1639,7 @@ { request_queue_t *q = (request_queue_t *)data; - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, + MARK(blk_pdu_unplug_timer, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); kblockd_schedule_work(>unplug_work); @@ -2148,7 +2149,7 @@ rq_init(q, rq); - blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ); + MARK(blk_get_request, "%p %p %d", q, bio, rw); out: return rq; } @@ -2178,7 +2179,7 @@ if (!rq) { struct io_context *ioc; - blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ); + MARK(blk_sleep_request, "%p %p %d", q, bio, rw); __generic_unplug_device(q);
Re: 2.6.22 -mm merge plans
Hi Andi, This plan makes sense. I will split the "patched in enable/disable flags" part into a separate piece (good idea!) and then submit the LTTng core to Andrew. Christoph has a good point about wanting a usable infrastructure to go in. Regarding your plan, I must argue that blktrace is not a general purpose tracing infrastructure, but one dedicated to block io tracing. Therefore, it makes sense to bring in the generic infrastructure first and then convert blktrace to it. Mathieu * Andi Kleen ([EMAIL PROTECTED]) wrote: > > If we are looking at current "potential users" that are already in > > mainline, we could change blktrace to make it use the markers. > > Ok, but do it step by step: > - split out useful pieces like the "patched in enable/disable flags" > and submit them separate with an example user or two > [I got a couple of candidates e.g. with some of the sysctls in VM or > networking] > - post and merge that. > - don't implement anything initially that is not needed by blktrace > - post a minimal marker patch together with the blktrace > conversion for review again on linux-kernel > - await review comments. This review would not cover the basic > need of markers, just the specific implementation. > - then potentially merge incorporate review comments > - then merge > - later add features with individual review/discussion as new users in the > kernel are added. > > -Andi -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
* Christoph Hellwig ([EMAIL PROTECTED]) wrote: > On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote: > > My statement was probably not clear enough. The actual marker code is > > useful as-is without any further kernel patching required : SystemTAP is > > an example where they use external modules to load probes that can > > connect either to markers or through kprobes. LTTng, in its current state, > > has a mostly modular core that also uses the markers. > > That just means you have to load an enormous amount of external crap > that replaces the missing kernel functionality. It's exactly the > situation we want to avoid. > It makes sense to use -mm to hold the whole usable infrastructure before submitting it to mainline. I will submit my core LTTng patches to Andrew in the following weeks. There is no hurry, from the LTTng perspective, to merge the markers sooner, although they could be useful to other (external) projects meanwhile. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Thu, 3 May 2007, Nick Piggin wrote: @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + set_bit(PG_waiters, &page->flags); + if (unlikely(!TestSetPageLocked(page))) { What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other? You're right, we can't clear the bit here. Doubt it mattered much anyway? Ah yes, that's a good easy answer. In fact, just remove this whole test and block (we already tried TestSetPageLocked outside just a short while ago, so this repeat won't often save anything). Yeah, I was getting too clever for my own boots :) I think the patch has merit though. Unfortunate that it uses another page flag, however it seemed to have quite a bit of speedup on unlock_page (probably from both the barriers and an extra random cacheline load (from the hash)). I guess it has to get good results from more benchmarks... BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page above... that barrier is in the slow path as well though, so it shouldn't matter either. I vaguely wondered how such barriers had managed to dissolve away, but cranking my brain up to think about barriers takes far too long. That barrier was one too many :) However I believe the fastpath barrier can go away because the PG_locked operation depends on the same cacheline as PG_waiters. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
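For reference, the hashed lookup being avoided on the unlock fast path is roughly the following (paraphrased from mm/filemap.c of this era, details simplified): wake_up_page() first has to map the page to a wait queue through the zone's wait table, a likely cache miss even when nobody is waiting, whereas PG_waiters sits in page->flags, the word unlock_page is already writing.

#include <linux/mm.h>
#include <linux/hash.h>
#include <linux/wait.h>

/* Roughly what page_waitqueue() does: hash into the zone's wait table. */
static wait_queue_head_t *page_waitqueue_sketch(struct page *page)
{
	const struct zone *zone = page_zone(page);

	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}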
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 3 May 2007, Nick Piggin wrote: > >>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) > > > { > > > DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); > > > > > >+ set_bit(PG_waiters, &page->flags); > > >+ if (unlikely(!TestSetPageLocked(page))) { > > > > What happens if another cpu is coming through __lock_page at the > > same time, did its set_bit, now finds PageLocked, and so proceeds > > to the __wait_on_bit_lock? But this cpu now clears PG_waiters, > > so this task's unlock_page won't wake the other? > > You're right, we can't clear the bit here. Doubt it mattered much anyway? Ah yes, that's a good easy answer. In fact, just remove this whole test and block (we already tried TestSetPageLocked outside just a short while ago, so this repeat won't often save anything). > > BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page > above... that barrier is in the slow path as well though, so it shouldn't > matter either. I vaguely wondered how such barriers had managed to dissolve away, but cranking my brain up to think about barriers takes far too long. > > >+ clear_bit(PG_waiters, &page->flags); > > >+ return; > > >+ } > > > __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page, > > >TASK_UNINTERRUPTIBLE); > >> } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
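With the racy fast path dropped as suggested, the sketch of __lock_page reduces to the following (illustrative only, assuming Nick's PG_waiters patch as the base and the mm/filemap.c context for sync_page and page_waitqueue):

void fastcall __lock_page(struct page *page)
{
	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

	/*
	 * Record that someone is waiting, then sleep on the bit; there is
	 * no second TestSetPageLocked fast path left to race with unlock_page.
	 */
	set_bit(PG_waiters, &page->flags);
	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
							TASK_UNINTERRUPTIBLE);
}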
Re: 2.6.22 -mm merge plans -- vm bugfixes
Christoph Hellwig wrote: On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote: The attached patch gets performance up a bit by avoiding some barriers and some cachelines: G5 pagefault fork exec 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 +patch 1.71-1.73 175.2-180.8 780.5-794.2 +patch2 1.61-1.63 169.8-175.0 748.6-757.0 So that brings the fork/exec hits down to much less than 5%, and would likely speed up other things that lock the page, like write or page reclaim. Is that every fork/exec or just under certain cicumstances? A 5% regression on every fork/exec is not acceptable. Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers will be improved as well with that patch. Then if we have specific lock/unlock bitops, I hope it should reduce that further. The overhead that is there should just be coming from the extra overhead in the file backed fault handler. For noop fork/execs, I think that tends to be more pronounced, it is hard to see any difference on any non-micro benchmark. The other thing is that I think there could be some cache effects happening -- for example the exec numbers on the 2nd line are disproportionately large. It definitely isn't a good thing to drop performance anywhere though, so I'll keep looking for improvements. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Thu, 3 May 2007, Nick Piggin wrote: The problem is that lock/unlock_page is expensive on powerpc, and if we improve that, we improve more than just the fault handler... The attached patch gets performance up a bit by avoiding some barriers and some cachelines: There's a strong whiff of raciness about this... but I could very easily be wrong. Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2007-05-02 15:00:26.0 +1000 +++ linux-2.6/mm/filemap.c 2007-05-03 08:34:32.0 +1000 @@ -532,11 +532,13 @@ */ void fastcall unlock_page(struct page *page) { + VM_BUG_ON(!PageLocked(page)); smp_mb__before_clear_bit(); - if (!TestClearPageLocked(page)) - BUG(); - smp_mb__after_clear_bit(); - wake_up_page(page, PG_locked); + ClearPageLocked(page); + if (unlikely(test_bit(PG_waiters, >flags))) { + clear_bit(PG_waiters, >flags); + wake_up_page(page, PG_locked); + } } EXPORT_SYMBOL(unlock_page); @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) { DEFINE_WAIT_BIT(wait, >flags, PG_locked); + set_bit(PG_waiters, >flags); + if (unlikely(!TestSetPageLocked(page))) { What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other? You're right, we can't clear the bit here. Doubt it mattered much anyway? BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page above... that barrier is in the slow path as well though, so it shouldn't matter either. + clear_bit(PG_waiters, >flags); + return; + } __wait_on_bit_lock(page_waitqueue(page), , sync_page, TASK_UNINTERRUPTIBLE); } -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 3 May 2007, Nick Piggin wrote: > > The problem is that lock/unlock_page is expensive on powerpc, and > if we improve that, we improve more than just the fault handler... > > The attached patch gets performance up a bit by avoiding some > barriers and some cachelines: There's a strong whiff of raciness about this... but I could very easily be wrong. > Index: linux-2.6/mm/filemap.c > === > --- linux-2.6.orig/mm/filemap.c 2007-05-02 15:00:26.0 +1000 > +++ linux-2.6/mm/filemap.c2007-05-03 08:34:32.0 +1000 > @@ -532,11 +532,13 @@ > */ > void fastcall unlock_page(struct page *page) > { > + VM_BUG_ON(!PageLocked(page)); > smp_mb__before_clear_bit(); > - if (!TestClearPageLocked(page)) > - BUG(); > - smp_mb__after_clear_bit(); > - wake_up_page(page, PG_locked); > + ClearPageLocked(page); > + if (unlikely(test_bit(PG_waiters, >flags))) { > + clear_bit(PG_waiters, >flags); > + wake_up_page(page, PG_locked); > + } > } > EXPORT_SYMBOL(unlock_page); > > @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) > { > DEFINE_WAIT_BIT(wait, >flags, PG_locked); > > + set_bit(PG_waiters, >flags); > + if (unlikely(!TestSetPageLocked(page))) { What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other? > + clear_bit(PG_waiters, >flags); > + return; > + } > __wait_on_bit_lock(page_waitqueue(page), , sync_page, > TASK_UNINTERRUPTIBLE); > } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote: > The attached patch gets performance up a bit by avoiding some > barriers and some cachelines: > > G5 > pagefault fork exec > 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 > +patch 1.71-1.73 175.2-180.8 780.5-794.2 > +patch2 1.61-1.63 169.8-175.0 748.6-757.0 > > So that brings the fork/exec hits down to much less than 5%, and > would likely speed up other things that lock the page, like write > or page reclaim. Is that every fork/exec or just under certain cicumstances? A 5% regression on every fork/exec is not acceptable. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
> If we are looking at current "potential users" that are already in > mainline, we could change blktrace to make it use the markers. Ok, but do it step by step: - split out useful pieces like the "patched in enable/disable flags" and submit them separately with an example user or two [I got a couple of candidates e.g. with some of the sysctls in VM or networking] - post and merge that. - don't implement anything initially that is not needed by blktrace - post a minimal marker patch together with the blktrace conversion for review again on linux-kernel - await review comments. This review would not cover the basic need of markers, just the specific implementation. - then potentially incorporate review comments - then merge - later add features with individual review/discussion as new users in the kernel are added. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 11:46:40PM +0200, Tilman Schmidt wrote: > On 02.05.2007 19:49, Andi Kleen wrote: > > And also I think when something is merged it should have some users in tree. > Isn't that a circular dependency? The normal mode of operation is to merge the initial users and the subsystem at the same time. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007, Andrew Morton wrote: > On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> > wrote: > > On Thu, 3 May 2007, Andrew Morton wrote: > > > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL > > > PROTECTED]> wrote: > > > > > > > +config ARCH_USES_SLAB_PAGE_STRUCT > > > > + bool > > > > + default y > > > > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > > > > + > > > > > > That all seems to work as intended. > > > > > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > > > machine early in boot. > > > > I thought that if that worked as intended, you wouldn't even > > get the chance to choose SLUB=y? That was how it was working > > for me (but I realize I didn't try more than make oldconfig). > > > > That sounds like what happens when SLUB's pagestruct use meets > > SPLIT_PTLOCK's pagestruct use. Does your .config really show > > CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y? > > Nope. > > g5:/usr/src/25> grep SLUB .config > CONFIG_SLUB=y > g5:/usr/src/25> grep SLAB .config > # CONFIG_SLAB is not set > g5:/usr/src/25> grep CPUS .config > CONFIG_NR_CPUS=8 > # CONFIG_CPUSETS is not set > # CONFIG_IRQ_ALL_CPUS is not set > CONFIG_SPLIT_PTLOCK_CPUS=4 > > It's in http://userweb.kernel.org/~akpm/config-g5.txt Seems we're all wrong in thinking Christoph's Kconfiggery worked as intended: maybe it just works some of the time. I'm not going to hazard a guess as to how to fix it up, will resume looking at the powerpc's quicklist potential later. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote: > On Thu, 3 May 2007, Andrew Morton wrote: > > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL > > PROTECTED]> wrote: > > > > > +config ARCH_USES_SLAB_PAGE_STRUCT > > > + bool > > > + default y > > > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > > > + > > > > That all seems to work as intended. > > > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > > machine early in boot. > > I thought that if that worked as intended, you wouldn't even > get the chance to choose SLUB=y? That was how it was working > for me (but I realize I didn't try more than make oldconfig). Right. This can be tested on x86 without a cross-compiler: ARCH=powerpc make mrproper ARCH=powerpc make fooconfig > > > > Too early for netconsole, no serial console. Wedges up uselessly with > > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with > > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( > > > > However I was able to glimpse some stuff as it flew past. Crash started in > > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't > > know > > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial > > card, perhaps. > > That sounds like what happens when SLUB's pagestruct use meets > SPLIT_PTLOCK's pagestruct use. Does your .config really show > CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y? > Nope. g5:/usr/src/25> grep SLUB .config CONFIG_SLUB=y g5:/usr/src/25> grep SLAB .config # CONFIG_SLAB is not set g5:/usr/src/25> grep CPUS .config CONFIG_NR_CPUS=8 # CONFIG_CPUSETS is not set # CONFIG_IRQ_ALL_CPUS is not set CONFIG_SPLIT_PTLOCK_CPUS=4 It's in http://userweb.kernel.org/~akpm/config-g5.txt - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, 3 May 2007, Andrew Morton wrote: > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> > wrote: > > > +config ARCH_USES_SLAB_PAGE_STRUCT > > + bool > > + default y > > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > > + > > That all seems to work as intended. > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > machine early in boot. I thought that if that worked as intended, you wouldn't even get the chance to choose SLUB=y? That was how it was working for me (but I realize I didn't try more than make oldconfig). > > Too early for netconsole, no serial console. Wedges up uselessly with > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( > > However I was able to glimpse some stuff as it flew past. Crash started in > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial > card, perhaps. That sounds like what happens when SLUB's pagestruct use meets SPLIT_PTLOCK's pagestruct use. Does your .config really show CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y? Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Thu, May 03, 2007 at 01:15:15AM -0700, Andrew Morton wrote: > That all seems to work as intended. > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > machine early in boot. > Too early for netconsole, no serial console. Wedges up uselessly with > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( > However I was able to glimpse some stuff as it flew past. Crash started in > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial > card, perhaps. I've seen this crash in flush_old_exec() before. ISTR it being due to slub vs. pagetable alignment or something on that order. -- wli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans: slub
On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Wed, 2 May 2007, Hugh Dickins wrote: > > > I presume the answer is just to extend your quicklist work to > > powerpc's lowest level of pagetables. The only other architecture > > which is using kmem_cache for them is arm26, which has > > "#error SMP is not supported", so won't be giving this problem. > > In the meantime we would need something like this to disable SLUB in this > particular configuration. Note that I have not tested this and the <= for > the comparision with SPLIT_PTLOCK_CPUS may not work (Never seen such a > construct in a Kconfig file but it is needed here). > > > > PowerPC: Disable SLUB for configurations in which slab page structs are > modified > > PowerPC uses the slab allocator to manage the lowest level of the page table. > In high cpu configurations we also use the page struct to split the page > table lock. Disallow the selection of SLUB for that case. > > [Not tested: I am not familiar with powerpc build procedures etc] > > Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> > > Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig > === > --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig2007-05-02 > 10:07:34.0 -0700 > +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig 2007-05-02 10:13:37.0 > -0700 > @@ -117,6 +117,19 @@ config GENERIC_BUG > default y > depends on BUG > > +# > +# Powerpc uses the slab allocator to manage its ptes and the > +# page structs of ptes are used for splitting the page table > +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS. > +# > +# In that special configuration the page structs of slabs are modified. > +# This setting disables the selection of SLUB as a slab allocator. > +# > +config ARCH_USES_SLAB_PAGE_STRUCT > + bool > + default y > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > + That all seems to work as intended. However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the machine early in boot. Too early for netconsole, no serial console. Wedges up uselessly with CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( However I was able to glimpse some stuff as it flew past. Crash started in flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial card, perhaps. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 04:36:27PM -0400, Mathieu Desnoyers wrote: > The idea is the following : either we integrate the infrastructure for > instrumentation / data serialization / buffer management / extraction of > data to user space in multiple different steps, which makes code review > easier for you guys, the staging area for that is -mm - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 01:53:36PM -0700, Andrew Morton wrote: > In which case we have: > > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch > atomich-complete-atomic_long-operations-in-asm-generic.patch > atomich-i386-type-safety-fix.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch > atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch > local_t-architecture-independant-extension.patch > local_t-alpha-extension.patch > local_t-i386-extension.patch > local_t-ia64-extension.patch > local_t-mips-extension.patch > local_t-parisc-cleanup.patch > local_t-powerpc-extension.patch > local_t-sparc64-cleanup.patch > local_t-x86_64-extension.patch > > For 2.6.22 > > linux-kernel-markers-kconfig-menus.patch > linux-kernel-markers-architecture-independant-code.patch > linux-kernel-markers-powerpc-optimization.patch > linux-kernel-markers-i386-optimization.patch > markers-add-instrumentation-markers-menus-to-avr32.patch > linux-kernel-markers-non-optimized-architectures.patch > markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch > linux-kernel-markers-documentation.patch > # > markers-define-the-linker-macro-extra_rwdata.patch > markers-use-extra_rwdata-in-architectures.patch > # > some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch > no-longer-include-asm-kdebugh.patch This is a plan I can fully agree with. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans
On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote: > My statement was probably not clear enough. The actual marker code is > useful as-is without any further kernel patching required : SystemTAP is > an example where they use external modules to load probes that can > connect either to markers or through kprobes. LTTng, in its current state, > has a mostly modular core that also uses the markers. That just means you have to load an enormous amount of external crap that replaces the missing kernel functionality. It's exactly the situation we want to avoid. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/