Re: [PATCH] vmpressure: implement strict mode
On Tue, 2 Jul 2013 21:47:03 +0200 Pavel Machek wrote: > On Tue 2013-07-02 11:06:28, Luiz Capitulino wrote: > > On Mon, 1 Jul 2013 10:51:03 +0200 > > Pavel Machek wrote: > > > > > Hi! > > > > > > > diff --git a/Documentation/cgroups/memory.txt > > > > b/Documentation/cgroups/memory.txt > > > > index ddf4f93..3c589cf 100644 > > > > --- a/Documentation/cgroups/memory.txt > > > > +++ b/Documentation/cgroups/memory.txt > > > > @@ -807,12 +807,14 @@ register a notification, an application must: > > > > > > > > - create an eventfd using eventfd(2); > > > > - open memory.pressure_level; > > > > -- write string like " " > > > > +- write string like " > > > > [strict]" > > > >to cgroup.event_control. > > > > > > > > > > This is.. pretty strange interface. Would it be cleaner to do ioctl()? > > > New syscall? > > > > Are you referring to my new mode or to the whole thing? > > Well. The interface was already very strange and you made it even > worse. The existing interface is the cgroup's notification mechanism, I think discussing it is a bit out of scope for my extension. Now, regarding my extension itself and the current vmpressure API, I believe that delivering all events to user-space (ie. w/o any filtering in the kernel) is a better solution. Point is whether we can do it with the current vmpressure API (which is cgroup based) or whether we should move to something else. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Tue, 2 Jul 2013 10:24:09 -0700 Anton Vorontsov wrote: > > Honestly, what Andrew suggested is the best design for me: apps > > are notified on all events but the event name is sent to the application. > > I am fine with this approach (or any other, I'm really indifferent to the > API itself -- read/netlink/notification per file/whatever for the > payload), That's a very good thing because we've managed to agree on something :) I'm also indifferent to the API, as long as we have 100% of the policy in user-space. To me this means we do absolutely no filtering in the kernel, which in turn means user-space gets all the events. Of course, we need the event name as a payload. Do we agree this solves all use-cases we have discussed so far? > except that you still have the similar problem: > > read() oldread() new > -- >"low" "low" >"low" "foo" -- the app does not know what does this mean >"med" "bar" -- ditto It can just ignore it, have a special handling, log it, fail or whatever. That's the good of having the policy in user-space. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vmpressure: implement strict mode
On Mon, 1 Jul 2013 10:51:03 +0200 Pavel Machek wrote: > Hi! > > > diff --git a/Documentation/cgroups/memory.txt > > b/Documentation/cgroups/memory.txt > > index ddf4f93..3c589cf 100644 > > --- a/Documentation/cgroups/memory.txt > > +++ b/Documentation/cgroups/memory.txt > > @@ -807,12 +807,14 @@ register a notification, an application must: > > > > - create an eventfd using eventfd(2); > > - open memory.pressure_level; > > -- write string like " " > > +- write string like " > > [strict]" > >to cgroup.event_control. > > > > This is.. pretty strange interface. Would it be cleaner to do ioctl()? > New syscall? Are you referring to my new mode or to the whole thing? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 17:56:37 -0700 Anton Vorontsov wrote: > On Fri, Jun 28, 2013 at 03:44:02PM -0400, Luiz Capitulino wrote: > > > Why can't you use poll() and demultiplex the events? Check if there is an > > > event in the crit fd, and if there is, then just ignore all the rest. > > > > This may be a valid workaround for current kernels, but application > > behavior will be different among kernels with a different number of > > events. > > This is not a workaround, this is how poll works, and this is kinda > expected... I think this is a workaround because it's tailored to my specific use-case and to a specific kernel version, as: 1. Applications registering for lower levels (eg. low) are still unable to tell which level actually caused them to be notified, as lower levels are triggered along with higher levels 2. Considering the interface can be extended, how can new applications work on backwards mode? Say, we add ultra-critical on 3.12 and I update my application to work on it, how will my application work on 3.11? Hint: Try and error is an horribly bad approach. 3. I also don't believe we have good forward compatibility with the current API, as adding new events will cause existing ones to be triggered more often, so I'd expect app behavior to vary among kernels with a different number of events. Honestly, what Andrew suggested is the best design for me: apps are notified on all events but the event name is sent to the application. This is pretty simple and solves all the problems we've discussed so far. Why can't we just do it? > But not that I had this plan in mind when I was designing the > current scheme... :) Hehe :) > > Say, we events on top of critical. Then crit fd will now be > > notified for cases where it didn't use to on older kernels. > > I'm not sure I am following here... but thinking about it more, I guess > the extra read() will be needed anyway (to reset the counter). I hope I have explained this more clearly above. > > > > However, it *is* possible to make non-strict work on strict if we make > > > > strict default _and_ make reads on memory.pressure_level return > > > > available events. Just do this on app initialization: > > > > > > > > for each event in memory.pressure_level; do > > > > /* register eventfd to be notified on "event" */ > > > > done > > > > > > This scheme registers "all" events. > > > > Yes, because I thought that's the user-case that matters for activity > > manager :) > > Some activity managers use only low levels (Android), some might use only > medium levels (simple load-balancing). > > Being able to register only "all" does not make sense to me. Well, you can skip events if you want. > > > Here is more complicated case: > > > > > > Old kernels, pressure_level reads: > > > > > > low, med, crit > > > > > > The app just wants to listen for med level. > > > > > > New kernels, pressure_level reads: > > > > > > low, FOO, med, BAR, crit > > > > > > How would application decide which of FOO and BAR are ex-med levels? > > > > What you meant by ex-med? > > The scale is continuous and non-overlapping. If you add some other level, > you effectively "shrinking" other levels, so the ex-med in the list above > might correspond to "FOO, med" or "med, BAR" or "FOO, med, BAR", and that > is exactly the problem. Just return the events in order? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Tue, 2 Jul 2013 10:24:09 -0700 Anton Vorontsov an...@enomsg.org wrote: Honestly, what Andrew suggested is the best design for me: apps are notified on all events but the event name is sent to the application. I am fine with this approach (or any other, I'm really indifferent to the API itself -- read/netlink/notification per file/whatever for the payload), That's a very good thing because we've managed to agree on something :) I'm also indifferent to the API, as long as we have 100% of the policy in user-space. To me this means we do absolutely no filtering in the kernel, which in turn means user-space gets all the events. Of course, we need the event name as a payload. Do we agree this solves all use-cases we have discussed so far? except that you still have the similar problem: read() oldread() new -- low low low foo -- the app does not know what does this mean med bar -- ditto It can just ignore it, have a special handling, log it, fail or whatever. That's the good of having the policy in user-space. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vmpressure: implement strict mode
On Tue, 2 Jul 2013 21:47:03 +0200 Pavel Machek pa...@ucw.cz wrote: On Tue 2013-07-02 11:06:28, Luiz Capitulino wrote: On Mon, 1 Jul 2013 10:51:03 +0200 Pavel Machek pa...@ucw.cz wrote: Hi! diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index ddf4f93..3c589cf 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -807,12 +807,14 @@ register a notification, an application must: - create an eventfd using eventfd(2); - open memory.pressure_level; -- write string like event_fd fd of memory.pressure_level level +- write string like event_fd fd of memory.pressure_level level [strict] to cgroup.event_control. This is.. pretty strange interface. Would it be cleaner to do ioctl()? New syscall? Are you referring to my new mode or to the whole thing? Well. The interface was already very strange and you made it even worse. The existing interface is the cgroup's notification mechanism, I think discussing it is a bit out of scope for my extension. Now, regarding my extension itself and the current vmpressure API, I believe that delivering all events to user-space (ie. w/o any filtering in the kernel) is a better solution. Point is whether we can do it with the current vmpressure API (which is cgroup based) or whether we should move to something else. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 17:56:37 -0700 Anton Vorontsov an...@enomsg.org wrote: On Fri, Jun 28, 2013 at 03:44:02PM -0400, Luiz Capitulino wrote: Why can't you use poll() and demultiplex the events? Check if there is an event in the crit fd, and if there is, then just ignore all the rest. This may be a valid workaround for current kernels, but application behavior will be different among kernels with a different number of events. This is not a workaround, this is how poll works, and this is kinda expected... I think this is a workaround because it's tailored to my specific use-case and to a specific kernel version, as: 1. Applications registering for lower levels (eg. low) are still unable to tell which level actually caused them to be notified, as lower levels are triggered along with higher levels 2. Considering the interface can be extended, how can new applications work on backwards mode? Say, we add ultra-critical on 3.12 and I update my application to work on it, how will my application work on 3.11? Hint: Try and error is an horribly bad approach. 3. I also don't believe we have good forward compatibility with the current API, as adding new events will cause existing ones to be triggered more often, so I'd expect app behavior to vary among kernels with a different number of events. Honestly, what Andrew suggested is the best design for me: apps are notified on all events but the event name is sent to the application. This is pretty simple and solves all the problems we've discussed so far. Why can't we just do it? But not that I had this plan in mind when I was designing the current scheme... :) Hehe :) Say, we events on top of critical. Then crit fd will now be notified for cases where it didn't use to on older kernels. I'm not sure I am following here... but thinking about it more, I guess the extra read() will be needed anyway (to reset the counter). I hope I have explained this more clearly above. However, it *is* possible to make non-strict work on strict if we make strict default _and_ make reads on memory.pressure_level return available events. Just do this on app initialization: for each event in memory.pressure_level; do /* register eventfd to be notified on event */ done This scheme registers all events. Yes, because I thought that's the user-case that matters for activity manager :) Some activity managers use only low levels (Android), some might use only medium levels (simple load-balancing). Being able to register only all does not make sense to me. Well, you can skip events if you want. Here is more complicated case: Old kernels, pressure_level reads: low, med, crit The app just wants to listen for med level. New kernels, pressure_level reads: low, FOO, med, BAR, crit How would application decide which of FOO and BAR are ex-med levels? What you meant by ex-med? The scale is continuous and non-overlapping. If you add some other level, you effectively shrinking other levels, so the ex-med in the list above might correspond to FOO, med or med, BAR or FOO, med, BAR, and that is exactly the problem. Just return the events in order? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vmpressure: implement strict mode
On Mon, 1 Jul 2013 10:51:03 +0200 Pavel Machek pa...@ucw.cz wrote: Hi! diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index ddf4f93..3c589cf 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -807,12 +807,14 @@ register a notification, an application must: - create an eventfd using eventfd(2); - open memory.pressure_level; -- write string like event_fd fd of memory.pressure_level level +- write string like event_fd fd of memory.pressure_level level [strict] to cgroup.event_control. This is.. pretty strange interface. Would it be cleaner to do ioctl()? New syscall? Are you referring to my new mode or to the whole thing? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 11:55:47 -0700 Anton Vorontsov wrote: > On Fri, Jun 28, 2013 at 02:45:07PM -0400, Luiz Capitulino wrote: > > On Fri, 28 Jun 2013 10:09:17 -0700 > > Anton Vorontsov wrote: > > > > > So, I would now argue that the current scheme is perfectly OK and can do > > > everything you can do with the "strict" one, > > > > I forgot commenting this bit. This is not true, because I don't want a > > low fd to be notified on critical level. The current interface just > > can't do that. > > Why can't you use poll() and demultiplex the events? Check if there is an > event in the crit fd, and if there is, then just ignore all the rest. This may be a valid workaround for current kernels, but application behavior will be different among kernels with a different number of events. Say, we events on top of critical. Then crit fd will now be notified for cases where it didn't use to on older kernels. > > However, it *is* possible to make non-strict work on strict if we make > > strict default _and_ make reads on memory.pressure_level return > > available events. Just do this on app initialization: > > > > for each event in memory.pressure_level; do > > /* register eventfd to be notified on "event" */ > > done > > This scheme registers "all" events. Yes, because I thought that's the user-case that matters for activity manager :) > Here is more complicated case: > > Old kernels, pressure_level reads: > > low, med, crit > > The app just wants to listen for med level. > > New kernels, pressure_level reads: > > low, FOO, med, BAR, crit > > How would application decide which of FOO and BAR are ex-med levels? What you meant by ex-med? Let's not over-design. I agree that allowing the API to be extended is a good thing, but we shouldn't give complex meaning to events, otherwise we're better with a numeric scale instead. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 11:45:47 -0700 Anton Vorontsov wrote: > On Fri, Jun 28, 2013 at 02:25:58PM -0400, Luiz Capitulino wrote: > > > > > That's how it's expected to work, because on strict mode you're > > > > > notified > > > > > for the level you registered for. So apps registering for critical, > > > > > will > > > > > still be notified on critical just like before. > > > > > > > > Suppose you introduce a new level, and the system hits this level. > > > > Before, > > > > the app would receive at least some notification for the given memory > > > > load > > > > (i.e. one of the old levels), with the new level introduced in the > > > > kernel, > > > > the app will receive no events at all. > > > > That's not true. If an app registered for critical it will still get > > critical notification when the system is at the critical level. Just as it > > always did. No new events will change this. > > > > With today's semantics though, new events will change when current events > > are triggered. So each new extension will cause applications to have > > different behaviors, in different kernel versions. This looks quite > > undesirable to me. > > I'll try to explain it again. > > Old behaviour: > > low -> event > x <- but the system is at this unnamed level, between low and med > med > crit > > > We add a level: > > low > low-med <- system at this state, we send an event, but the old app does >not know about it, so it won't receive *any* notifications. (In > older kernels it would receive low level notification > med > crit > > You really don't see a problem here? I do get what you're saying. We disagree it's a problem. In my mind the best API is to get what you registered for. Nothing more, nothing less. Now, there might be ways around it (being it a problem or not). I was also considering this: > 3. Never change the levels (how can we know?) If we fail at determining levels (I honestly think current levels are all we need), we can add a new interface later. Also, what I said in the last email should work, which is to make memory.pressure_level return supported levels, so an application can register for all available levels. This way it will never miss a level. I also think this matches having the mechanism in the kernel and policy in user-space. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 10:09:17 -0700 Anton Vorontsov wrote: > So, I would now argue that the current scheme is perfectly OK and can do > everything you can do with the "strict" one, I forgot commenting this bit. This is not true, because I don't want a low fd to be notified on critical level. The current interface just can't do that. However, it *is* possible to make non-strict work on strict if we make strict default _and_ make reads on memory.pressure_level return available events. Just do this on app initialization: for each event in memory.pressure_level; do /* register eventfd to be notified on "event" */ done Then eventfd will always be notified, no matter the event. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 10:09:17 -0700 Anton Vorontsov wrote: > On Fri, Jun 28, 2013 at 09:57:22AM -0700, Anton Vorontsov wrote: > > On Fri, Jun 28, 2013 at 10:00:27AM -0400, Luiz Capitulino wrote: > > > On Thu, 27 Jun 2013 22:07:12 -0700 > > > Anton Vorontsov wrote: > > > > > > > On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote: > > > > > ... we can add the strict mode and deprecate the > > > > > "filtering" -- basically we'll implement the idea of requiring that > > > > > userspace registers a separate fd for each level. > > > > > > > > Btw, assuming that more levels can be added, there will be a problem: > > > > imagine that an app hooked up onto low, med, crit levels in "strict" > > > > mode... then once we add a new level, the app will start missing the new > > > > level events. > > > > > > That's how it's expected to work, because on strict mode you're notified > > > for the level you registered for. So apps registering for critical, will > > > still be notified on critical just like before. > > > > Suppose you introduce a new level, and the system hits this level. Before, > > the app would receive at least some notification for the given memory load > > (i.e. one of the old levels), with the new level introduced in the kernel, > > the app will receive no events at all. That's not true. If an app registered for critical it will still get critical notification when the system is at the critical level. Just as it always did. No new events will change this. With today's semantics though, new events will change when current events are triggered. So each new extension will cause applications to have different behaviors, in different kernel versions. This looks quite undesirable to me. > > This makes a serious behavioural > > change in the app (read: it'll break it). How? Strict mode semantics is simple: you get what you registered for. No matter the kernel version, no matter how many events we add on top of existing ones. How can this brake applications? > Btw, why exactly you need the strict mode? I'm implementing automatic guest-resize for KVM guests. What I'd like to do is to inflate the balloon by different values depending on the host pressure. Say, 1MB on low; 16MB on medium and 128MB on critical. The actual values are meaningless btw, as this is going to be set by the user. So, saying that 1MB on low is too little to be concerned about is not an valid argument, as the user can set this to 1GB. > Why 'medium' won't work for the > load-balancing? I need precision. If the system is at 'medium' level, then I'll do medium stuff. If it gets into critical, then I'll do critical stuff. I don't want to do critical stuff on all levels, or critical on medium. This is really messy. Just give me what I asked for, please. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Thu, 27 Jun 2013 22:07:12 -0700 Anton Vorontsov wrote: > On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote: > > ... we can add the strict mode and deprecate the > > "filtering" -- basically we'll implement the idea of requiring that > > userspace registers a separate fd for each level. > > Btw, assuming that more levels can be added, there will be a problem: > imagine that an app hooked up onto low, med, crit levels in "strict" > mode... then once we add a new level, the app will start missing the new > level events. That's how it's expected to work, because on strict mode you're notified for the level you registered for. So apps registering for critical, will still be notified on critical just like before. > In the old scheme it is not a problem because of the >= condition. I think the problem actually lies with the current interface, because if an app registers for critical and we add a new level after critical then this app will now be notified on critical *and* the new level. The app's algorithm might not be prepared to deal with that. > With a proper versioning this won't be a problem for a new scheme too. I don't think there's a problem to be solved here. Strict mode does allow forward compatibility. For good backward compatibility we can make memory.pressure_level return supported levels on read. This way applications can check if the level they are interested in exist and then fallback or exit with a good error message if they don't. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Thu, 27 Jun 2013 21:34:11 -0700 Anton Vorontsov wrote: > On Thu, Jun 27, 2013 at 06:13:53PM -0700, Andrew Morton wrote: > > On Thu, 27 Jun 2013 17:58:53 -0700 Anton Vorontsov wrote: > > > Current frequency is 1/(2MB). Suppose we ended up scanning the whole > > > memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much* > > > to me... But for what it worth, I am against adding read() to the > > > interface -- just because we can avoid the unnecessary switch into the > > > kernel. > > > > What was it they said about premature optimization? > > > > I think I'd rather do nothing than add a mode hack (already!). > > > > The information Luiz wants is already available with the existing > > interface, so why not just use it until there is a real demonstrated > > problem? > > > > But all this does point at the fact that the chosen interface was not a > > good one. And it's happening so soon :( A far better interface would > > be to do away with this level filtering stuff in the kernel altogether. > > OK, I am convinced that modes might be not necessary, but I see no big > problem in current situation, we can add the strict mode and deprecate the > "filtering" -- basically we'll implement the idea of requiring that > userspace registers a separate fd for each level. Agreed this is a good solution. > As one of the ways to change the interface, we can do the strict mode by > writing levels in uppercase, and warn_once on lowercase levels, describing > that the old behaviour will go away. Once (if ever) we remove the old > behaviour, the apps trying the old-style lowercase levels will fail > gracefully with EINVAL. Why don't we just break it? There's no non-development kernel released with this interface yet. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Thu, 27 Jun 2013 21:34:11 -0700 Anton Vorontsov an...@enomsg.org wrote: On Thu, Jun 27, 2013 at 06:13:53PM -0700, Andrew Morton wrote: On Thu, 27 Jun 2013 17:58:53 -0700 Anton Vorontsov an...@enomsg.org wrote: Current frequency is 1/(2MB). Suppose we ended up scanning the whole memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much* to me... But for what it worth, I am against adding read() to the interface -- just because we can avoid the unnecessary switch into the kernel. What was it they said about premature optimization? I think I'd rather do nothing than add a mode hack (already!). The information Luiz wants is already available with the existing interface, so why not just use it until there is a real demonstrated problem? But all this does point at the fact that the chosen interface was not a good one. And it's happening so soon :( A far better interface would be to do away with this level filtering stuff in the kernel altogether. OK, I am convinced that modes might be not necessary, but I see no big problem in current situation, we can add the strict mode and deprecate the filtering -- basically we'll implement the idea of requiring that userspace registers a separate fd for each level. Agreed this is a good solution. As one of the ways to change the interface, we can do the strict mode by writing levels in uppercase, and warn_once on lowercase levels, describing that the old behaviour will go away. Once (if ever) we remove the old behaviour, the apps trying the old-style lowercase levels will fail gracefully with EINVAL. Why don't we just break it? There's no non-development kernel released with this interface yet. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Thu, 27 Jun 2013 22:07:12 -0700 Anton Vorontsov an...@enomsg.org wrote: On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote: ... we can add the strict mode and deprecate the filtering -- basically we'll implement the idea of requiring that userspace registers a separate fd for each level. Btw, assuming that more levels can be added, there will be a problem: imagine that an app hooked up onto low, med, crit levels in strict mode... then once we add a new level, the app will start missing the new level events. That's how it's expected to work, because on strict mode you're notified for the level you registered for. So apps registering for critical, will still be notified on critical just like before. In the old scheme it is not a problem because of the = condition. I think the problem actually lies with the current interface, because if an app registers for critical and we add a new level after critical then this app will now be notified on critical *and* the new level. The app's algorithm might not be prepared to deal with that. With a proper versioning this won't be a problem for a new scheme too. I don't think there's a problem to be solved here. Strict mode does allow forward compatibility. For good backward compatibility we can make memory.pressure_level return supported levels on read. This way applications can check if the level they are interested in exist and then fallback or exit with a good error message if they don't. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 10:09:17 -0700 Anton Vorontsov an...@enomsg.org wrote: On Fri, Jun 28, 2013 at 09:57:22AM -0700, Anton Vorontsov wrote: On Fri, Jun 28, 2013 at 10:00:27AM -0400, Luiz Capitulino wrote: On Thu, 27 Jun 2013 22:07:12 -0700 Anton Vorontsov an...@enomsg.org wrote: On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote: ... we can add the strict mode and deprecate the filtering -- basically we'll implement the idea of requiring that userspace registers a separate fd for each level. Btw, assuming that more levels can be added, there will be a problem: imagine that an app hooked up onto low, med, crit levels in strict mode... then once we add a new level, the app will start missing the new level events. That's how it's expected to work, because on strict mode you're notified for the level you registered for. So apps registering for critical, will still be notified on critical just like before. Suppose you introduce a new level, and the system hits this level. Before, the app would receive at least some notification for the given memory load (i.e. one of the old levels), with the new level introduced in the kernel, the app will receive no events at all. That's not true. If an app registered for critical it will still get critical notification when the system is at the critical level. Just as it always did. No new events will change this. With today's semantics though, new events will change when current events are triggered. So each new extension will cause applications to have different behaviors, in different kernel versions. This looks quite undesirable to me. This makes a serious behavioural change in the app (read: it'll break it). How? Strict mode semantics is simple: you get what you registered for. No matter the kernel version, no matter how many events we add on top of existing ones. How can this brake applications? Btw, why exactly you need the strict mode? I'm implementing automatic guest-resize for KVM guests. What I'd like to do is to inflate the balloon by different values depending on the host pressure. Say, 1MB on low; 16MB on medium and 128MB on critical. The actual values are meaningless btw, as this is going to be set by the user. So, saying that 1MB on low is too little to be concerned about is not an valid argument, as the user can set this to 1GB. Why 'medium' won't work for the load-balancing? I need precision. If the system is at 'medium' level, then I'll do medium stuff. If it gets into critical, then I'll do critical stuff. I don't want to do critical stuff on all levels, or critical on medium. This is really messy. Just give me what I asked for, please. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 10:09:17 -0700 Anton Vorontsov an...@enomsg.org wrote: So, I would now argue that the current scheme is perfectly OK and can do everything you can do with the strict one, I forgot commenting this bit. This is not true, because I don't want a low fd to be notified on critical level. The current interface just can't do that. However, it *is* possible to make non-strict work on strict if we make strict default _and_ make reads on memory.pressure_level return available events. Just do this on app initialization: for each event in memory.pressure_level; do /* register eventfd to be notified on event */ done Then eventfd will always be notified, no matter the event. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 11:45:47 -0700 Anton Vorontsov an...@enomsg.org wrote: On Fri, Jun 28, 2013 at 02:25:58PM -0400, Luiz Capitulino wrote: That's how it's expected to work, because on strict mode you're notified for the level you registered for. So apps registering for critical, will still be notified on critical just like before. Suppose you introduce a new level, and the system hits this level. Before, the app would receive at least some notification for the given memory load (i.e. one of the old levels), with the new level introduced in the kernel, the app will receive no events at all. That's not true. If an app registered for critical it will still get critical notification when the system is at the critical level. Just as it always did. No new events will change this. With today's semantics though, new events will change when current events are triggered. So each new extension will cause applications to have different behaviors, in different kernel versions. This looks quite undesirable to me. I'll try to explain it again. Old behaviour: low - event x - but the system is at this unnamed level, between low and med med crit We add a level: low low-med - system at this state, we send an event, but the old app does not know about it, so it won't receive *any* notifications. (In older kernels it would receive low level notification med crit You really don't see a problem here? I do get what you're saying. We disagree it's a problem. In my mind the best API is to get what you registered for. Nothing more, nothing less. Now, there might be ways around it (being it a problem or not). I was also considering this: 3. Never change the levels (how can we know?) If we fail at determining levels (I honestly think current levels are all we need), we can add a new interface later. Also, what I said in the last email should work, which is to make memory.pressure_level return supported levels, so an application can register for all available levels. This way it will never miss a level. I also think this matches having the mechanism in the kernel and policy in user-space. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Fri, 28 Jun 2013 11:55:47 -0700 Anton Vorontsov an...@enomsg.org wrote: On Fri, Jun 28, 2013 at 02:45:07PM -0400, Luiz Capitulino wrote: On Fri, 28 Jun 2013 10:09:17 -0700 Anton Vorontsov an...@enomsg.org wrote: So, I would now argue that the current scheme is perfectly OK and can do everything you can do with the strict one, I forgot commenting this bit. This is not true, because I don't want a low fd to be notified on critical level. The current interface just can't do that. Why can't you use poll() and demultiplex the events? Check if there is an event in the crit fd, and if there is, then just ignore all the rest. This may be a valid workaround for current kernels, but application behavior will be different among kernels with a different number of events. Say, we events on top of critical. Then crit fd will now be notified for cases where it didn't use to on older kernels. However, it *is* possible to make non-strict work on strict if we make strict default _and_ make reads on memory.pressure_level return available events. Just do this on app initialization: for each event in memory.pressure_level; do /* register eventfd to be notified on event */ done This scheme registers all events. Yes, because I thought that's the user-case that matters for activity manager :) Here is more complicated case: Old kernels, pressure_level reads: low, med, crit The app just wants to listen for med level. New kernels, pressure_level reads: low, FOO, med, BAR, crit How would application decide which of FOO and BAR are ex-med levels? What you meant by ex-med? Let's not over-design. I agree that allowing the API to be extended is a good thing, but we shouldn't give complex meaning to events, otherwise we're better with a numeric scale instead. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] vmpressure: implement strict mode
On Thu, 27 Jun 2013 11:26:16 +0200 Michal Hocko wrote: > On Wed 26-06-13 23:17:12, Luiz Capitulino wrote: > > Currently, an eventfd is notified for the level it's registered for > > _plus_ higher levels. > > > > This is a problem if an application wants to implement different > > actions for different levels. For example, an application might want > > to release 10% of its cache on level low, 50% on medium and 100% on > > critical. To do this, an application has to register a different > > eventfd for each pressure level. However, fd low is always going to > > be notified and and all fds are going to be notified on level critical. > > > > Strict mode solves this problem by strictly notifiying an eventfd > > for the pressure level it registered for. This new mode is optional, > > by default we still notify eventfds on higher levels too. > > > > Signed-off-by: Luiz Capitulino > > Two nits bellow but it looks good in general to me. You can add my > Reviewed-by: Michal Hocko Thanks. > I still think that edge triggering makes some sense but that one might > be rebased on top of this patch. We should still figure out whether the > edge triggering is the right approach for the use case Hyunhee Kim wants > it for so the strict mode should go first IMO. I also think that edge triggering makes sense. > > --- > > > > o v2 > > > > - Improve documentation > > - Use a bit to store mode instead of a bool > > - Minor changelog changes > > > > Documentation/cgroups/memory.txt | 26 ++ > > mm/vmpressure.c | 26 -- > > 2 files changed, 46 insertions(+), 6 deletions(-) > > > > diff --git a/Documentation/cgroups/memory.txt > > b/Documentation/cgroups/memory.txt > > index ddf4f93..412872b 100644 > > --- a/Documentation/cgroups/memory.txt > > +++ b/Documentation/cgroups/memory.txt > > @@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they > > can to help the > > system. It might be too late to consult with vmstat or any other > > statistics, so it's advisable to take an immediate action. > > > > +Applications can also choose between two notification modes when > > +registering an eventfd for memory pressure events: > > + > > +When in "non-strict" mode, an eventfd is notified for the specific level > > +it's registered for and higher levels. For example, an eventfd registered > > +for low level is also going to be notified on medium and critical levels. > > +This mode makes sense for applications interested on monitoring reclaim > > +activity or implementing simple load-balacing logic. The non-strict mode > > +is the default notification mode. > > + > > +When in "strict" mode, an eventfd is strictly notified for the pressure > > +level it's registered for. For example, an eventfd registered for the low > > +level event is not going to be notified when memory pressure gets into > > +medium or critical levels. This allows for more complex logic based on > > +the actual pressure level the system is experiencing. > > It would be also fair to mention that there is no guarantee that lower > levels are signaled before higher so nobody should rely on seeing LOW > before MEDIUM or CRITICAL. I think this is implied. Actually, as an user of this interface I didn't expect this to happen until I read the code. > > > + > > The events are propagated upward until the event is handled, i.e. the > > events are not pass-through. Here is what this means: for example you have > > three cgroups: A->B->C. Now you set up an event listener on cgroups A, B > > @@ -807,12 +823,14 @@ register a notification, an application must: > > > > - create an eventfd using eventfd(2); > > - open memory.pressure_level; > > -- write string like " " > > +- write string like " > > [strict]" > >to cgroup.event_control. > > > > -Application will be notified through eventfd when memory pressure is at > > -the specific level (or higher). Read/write operations to > > -memory.pressure_level are no implemented. > > +Applications will be notified through eventfd when memory pressure is at > > +the specific level or higher. If strict is passed, then applications > > +will only be notified when memory pressure reaches the specified level. > > + > > +Read/write operations to memory.pressure_level are no implemented. > > > > Test: > > > > diff --git a/mm/vmpressure.c b/mm/vmpressure.c > > index 736a60
Re: [PATCH v2] vmpressure: implement strict mode
On Thu, 27 Jun 2013 11:26:16 +0200 Michal Hocko mho...@suse.cz wrote: On Wed 26-06-13 23:17:12, Luiz Capitulino wrote: Currently, an eventfd is notified for the level it's registered for _plus_ higher levels. This is a problem if an application wants to implement different actions for different levels. For example, an application might want to release 10% of its cache on level low, 50% on medium and 100% on critical. To do this, an application has to register a different eventfd for each pressure level. However, fd low is always going to be notified and and all fds are going to be notified on level critical. Strict mode solves this problem by strictly notifiying an eventfd for the pressure level it registered for. This new mode is optional, by default we still notify eventfds on higher levels too. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com Two nits bellow but it looks good in general to me. You can add my Reviewed-by: Michal Hocko mho...@suse.cz Thanks. I still think that edge triggering makes some sense but that one might be rebased on top of this patch. We should still figure out whether the edge triggering is the right approach for the use case Hyunhee Kim wants it for so the strict mode should go first IMO. I also think that edge triggering makes sense. --- o v2 - Improve documentation - Use a bit to store mode instead of a bool - Minor changelog changes Documentation/cgroups/memory.txt | 26 ++ mm/vmpressure.c | 26 -- 2 files changed, 46 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index ddf4f93..412872b 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they can to help the system. It might be too late to consult with vmstat or any other statistics, so it's advisable to take an immediate action. +Applications can also choose between two notification modes when +registering an eventfd for memory pressure events: + +When in non-strict mode, an eventfd is notified for the specific level +it's registered for and higher levels. For example, an eventfd registered +for low level is also going to be notified on medium and critical levels. +This mode makes sense for applications interested on monitoring reclaim +activity or implementing simple load-balacing logic. The non-strict mode +is the default notification mode. + +When in strict mode, an eventfd is strictly notified for the pressure +level it's registered for. For example, an eventfd registered for the low +level event is not going to be notified when memory pressure gets into +medium or critical levels. This allows for more complex logic based on +the actual pressure level the system is experiencing. It would be also fair to mention that there is no guarantee that lower levels are signaled before higher so nobody should rely on seeing LOW before MEDIUM or CRITICAL. I think this is implied. Actually, as an user of this interface I didn't expect this to happen until I read the code. + The events are propagated upward until the event is handled, i.e. the events are not pass-through. Here is what this means: for example you have three cgroups: A-B-C. Now you set up an event listener on cgroups A, B @@ -807,12 +823,14 @@ register a notification, an application must: - create an eventfd using eventfd(2); - open memory.pressure_level; -- write string like event_fd fd of memory.pressure_level level +- write string like event_fd fd of memory.pressure_level level [strict] to cgroup.event_control. -Application will be notified through eventfd when memory pressure is at -the specific level (or higher). Read/write operations to -memory.pressure_level are no implemented. +Applications will be notified through eventfd when memory pressure is at +the specific level or higher. If strict is passed, then applications +will only be notified when memory pressure reaches the specified level. + +Read/write operations to memory.pressure_level are no implemented. Test: diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 736a601..ba5c17e 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -138,8 +138,16 @@ struct vmpressure_event { struct eventfd_ctx *efd; enum vmpressure_levels level; struct list_head node; + unsigned int mode; You would fill up a hole between level and node if you move it up on 64b. Doesn't matter much but why not do it... You want me to respin? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http
[PATCH v2] vmpressure: implement strict mode
Currently, an eventfd is notified for the level it's registered for _plus_ higher levels. This is a problem if an application wants to implement different actions for different levels. For example, an application might want to release 10% of its cache on level low, 50% on medium and 100% on critical. To do this, an application has to register a different eventfd for each pressure level. However, fd low is always going to be notified and and all fds are going to be notified on level critical. Strict mode solves this problem by strictly notifiying an eventfd for the pressure level it registered for. This new mode is optional, by default we still notify eventfds on higher levels too. Signed-off-by: Luiz Capitulino --- o v2 - Improve documentation - Use a bit to store mode instead of a bool - Minor changelog changes Documentation/cgroups/memory.txt | 26 ++ mm/vmpressure.c | 26 -- 2 files changed, 46 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index ddf4f93..412872b 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they can to help the system. It might be too late to consult with vmstat or any other statistics, so it's advisable to take an immediate action. +Applications can also choose between two notification modes when +registering an eventfd for memory pressure events: + +When in "non-strict" mode, an eventfd is notified for the specific level +it's registered for and higher levels. For example, an eventfd registered +for low level is also going to be notified on medium and critical levels. +This mode makes sense for applications interested on monitoring reclaim +activity or implementing simple load-balacing logic. The non-strict mode +is the default notification mode. + +When in "strict" mode, an eventfd is strictly notified for the pressure +level it's registered for. For example, an eventfd registered for the low +level event is not going to be notified when memory pressure gets into +medium or critical levels. This allows for more complex logic based on +the actual pressure level the system is experiencing. + The events are propagated upward until the event is handled, i.e. the events are not pass-through. Here is what this means: for example you have three cgroups: A->B->C. Now you set up an event listener on cgroups A, B @@ -807,12 +823,14 @@ register a notification, an application must: - create an eventfd using eventfd(2); - open memory.pressure_level; -- write string like " " +- write string like " [strict]" to cgroup.event_control. -Application will be notified through eventfd when memory pressure is at -the specific level (or higher). Read/write operations to -memory.pressure_level are no implemented. +Applications will be notified through eventfd when memory pressure is at +the specific level or higher. If strict is passed, then applications +will only be notified when memory pressure reaches the specified level. + +Read/write operations to memory.pressure_level are no implemented. Test: diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 736a601..ba5c17e 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -138,8 +138,16 @@ struct vmpressure_event { struct eventfd_ctx *efd; enum vmpressure_levels level; struct list_head node; + unsigned int mode; }; +#define VMPRESSURE_MODE_STRICT 1 + +static inline bool vmpressure_mode_is_strict(const struct vmpressure_event *ev) +{ + return ev->mode & VMPRESSURE_MODE_STRICT; +} + static bool vmpressure_event(struct vmpressure *vmpr, unsigned long scanned, unsigned long reclaimed) { @@ -153,6 +161,9 @@ static bool vmpressure_event(struct vmpressure *vmpr, list_for_each_entry(ev, >events, node) { if (level >= ev->level) { + /* strict mode ensures level == ev->level */ + if (vmpressure_mode_is_strict(ev) && level != ev->level) + continue; eventfd_signal(ev->efd, 1); signalled = true; } @@ -292,7 +303,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio) * infrastructure, so that the notifications will be delivered to the * @eventfd. The @args parameter is a string that denotes pressure level * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or - * "critical"). + * "critical") and optionally a different operating mode (i.e. "strict") * * This function should not be used directly, just pass it to (struct * cftype).register_event, and then cgroup core will handle everything
Re: [PATCH] vmpressure: implement strict mode
On Wed, 26 Jun 2013 10:08:27 +0200 Michal Hocko wrote: > On Tue 25-06-13 17:51:29, Luiz Capitulino wrote: > > Currently, applications are notified for the level they registered for > > _plus_ higher levels. > > > > This is a problem if the application wants to implement different > > actions for different levels. For example, an application might want > > to release 10% of its cache on level low, 50% on medium and 100% on > > critical. To do this, the application has to register a different fd > > for each event. However, fd low is always going to be notified and > > and all fds are going to be notified on level critical. > > OK, I am not user of this interface but I thought that the application > would take an action of the highest level it gets notification. But yes > this might get clumsy to implement. > > > Strict mode solves this problem by strictly notifiying the event > > an fd has registered for. It's optional. By default we still notify > > on higher levels. > > OK, makes some sense to me and it should work with the proposed edge > trigerring as well. > > > Signed-off-by: Luiz Capitulino > > --- > > > > PS: I'm following the discussion on the event storm problem, but I believe > > strict mode is orthogonal to what has been suggested (although the > > patches conflict) > > > > Documentation/cgroups/memory.txt | 10 ++ > > mm/vmpressure.c | 19 +-- > > 2 files changed, 23 insertions(+), 6 deletions(-) > > > > diff --git a/Documentation/cgroups/memory.txt > > b/Documentation/cgroups/memory.txt > > index ddf4f93..3c589cf 100644 > > --- a/Documentation/cgroups/memory.txt > > +++ b/Documentation/cgroups/memory.txt > > @@ -807,12 +807,14 @@ register a notification, an application must: > > > > - create an eventfd using eventfd(2); > > - open memory.pressure_level; > > -- write string like " " > > +- write string like " > > [strict]" > >to cgroup.event_control. > > > > -Application will be notified through eventfd when memory pressure is at > > -the specific level (or higher). Read/write operations to > > -memory.pressure_level are no implemented. > > +Applications will be notified through eventfd when memory pressure is at > > +the specific level or higher. If strict is passed, then applications > > +will only be notified when memory pressure reaches the specified level. > > It would be good to describe when is strick and when the default > appropriate. Yeah, Anton asked for the same thing. Will do it. > > + > > +Read/write operations to memory.pressure_level are no implemented. > > > > Test: > > > > diff --git a/mm/vmpressure.c b/mm/vmpressure.c > > index 736a601..6289ede 100644 > > --- a/mm/vmpressure.c > > +++ b/mm/vmpressure.c > > @@ -137,6 +137,7 @@ static enum vmpressure_levels > > vmpressure_calc_level(unsigned long scanned, > > struct vmpressure_event { > > struct eventfd_ctx *efd; > > enum vmpressure_levels level; > > + bool strict_mode; > > Any reason to not using a flag for this? If there are any other modes to > come them we would end up with zilions of bools which is not very nice. Good point, I'll change it. > > > struct list_head node; > > }; > > > > @@ -153,6 +154,9 @@ static bool vmpressure_event(struct vmpressure *vmpr, > > > > list_for_each_entry(ev, >events, node) { > > if (level >= ev->level) { > > + /* strict mode ensures level == ev->level */ > > + if (ev->strict_mode && level != ev->level) > > + continue; > > eventfd_signal(ev->efd, 1); > > signalled = true; > > } > > @@ -292,7 +296,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup > > *memcg, int prio) > > * infrastructure, so that the notifications will be delivered to the > > * @eventfd. The @args parameter is a string that denotes pressure level > > * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or > > - * "critical"). > > + * "critical") and optionally a different operating mode (i.e. "strict") > > * > > * This function should not be used directly, just pass it to (struct > > * cftype).register_event, and then cgroup core will handle everything by > > @@ -303,22 +307,33 @@ int vmpressure_register_event(struct cgroup *cg
Re: [PATCH] vmpressure: implement strict mode
On Wed, 26 Jun 2013 17:20:40 +0900 Minchan Kim wrote: > Hello Michal, > > On Wed, Jun 26, 2013 at 09:59:21AM +0200, Michal Hocko wrote: > > On Wed 26-06-13 16:50:51, Minchan Kim wrote: > > > On Tue, Jun 25, 2013 at 05:51:29PM -0400, Luiz Capitulino wrote: > > > > Currently, applications are notified for the level they registered for > > > > _plus_ higher levels. > > > > > > > > This is a problem if the application wants to implement different > > > > actions for different levels. For example, an application might want > > > > to release 10% of its cache on level low, 50% on medium and 100% on > > > > critical. To do this, the application has to register a different fd > > > > for each event. However, fd low is always going to be notified and > > > > and all fds are going to be notified on level critical. > > > > > > > > Strict mode solves this problem by strictly notifiying the event > > > > an fd has registered for. It's optional. By default we still notify > > > > on higher levels. > > > > > > > > Signed-off-by: Luiz Capitulino > > > Acked-by: Minchan Kim > > > > > > Shouldn't we make this default? > > > > The interface is not there for long but still, changing it is always > > quite tricky. And the users who care can be modified really easily so I > > would stick with the original default. > > Yeb, I am not strong against to stick old at a moment but at least, > this patch makes more sense to me so I'd like to know why we didn't do it > from the beginning. Surely, Anton has a answer. That's exactly my thinking too: I think strict mode should be the default mode, and the current mode should be optional. But it's not a big deal. I've discussed this issue with Anton some weeks ago, and iirc (Anton, please correct/clarify where appropriate) the conclusion was that the current schema makes sense for apps monitoring reclaim activity, as they can hook on low only. Hmm. Something just crossed my mind. Maybe we should have two notification schemas: o memory.pressure_level: implements strict mode (this patch) o memory.reclaim_activity: apps are notified whenever there's reclaim activity As for changing applications, it's better to get some breakage while we're in -rc than regret the API later. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vmpressure: implement strict mode
On Tue, 25 Jun 2013 21:03:31 -0700 Anton Vorontsov wrote: > On Tue, Jun 25, 2013 at 05:51:29PM -0400, Luiz Capitulino wrote: > > Currently, applications are notified for the level they registered for > > _plus_ higher levels. > > > > This is a problem if the application wants to implement different > > actions for different levels. For example, an application might want > > to release 10% of its cache on level low, 50% on medium and 100% on > > critical. To do this, the application has to register a different fd > > for each event. However, fd low is always going to be notified and > > and all fds are going to be notified on level critical. > > > > Strict mode solves this problem by strictly notifiying the event > > an fd has registered for. It's optional. By default we still notify > > on higher levels. > > > > Signed-off-by: Luiz Capitulino > > In the documentation I would add more information about why exactly the > strict mode makes sense. > > For example, the non-strict fd listener hooked onto the low level makes > sense for apps that just monitor reclaiming activity (like current Android > Activity Manager), hooking onto 'medium' non-strict mode makes sense for > simple load-balancing logic, and the new strict mode is for the cases when > an application wants to implement some fancy logic as it makes a decision > based on a concrete level. OK, I'll respin. But you said it all already, so I'll base my text on on what you wrote. > Otherwise, it looks good. > > Acked-by: Anton Vorontsov Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vmpressure: implement strict mode
On Tue, 25 Jun 2013 21:03:31 -0700 Anton Vorontsov an...@enomsg.org wrote: On Tue, Jun 25, 2013 at 05:51:29PM -0400, Luiz Capitulino wrote: Currently, applications are notified for the level they registered for _plus_ higher levels. This is a problem if the application wants to implement different actions for different levels. For example, an application might want to release 10% of its cache on level low, 50% on medium and 100% on critical. To do this, the application has to register a different fd for each event. However, fd low is always going to be notified and and all fds are going to be notified on level critical. Strict mode solves this problem by strictly notifiying the event an fd has registered for. It's optional. By default we still notify on higher levels. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com In the documentation I would add more information about why exactly the strict mode makes sense. For example, the non-strict fd listener hooked onto the low level makes sense for apps that just monitor reclaiming activity (like current Android Activity Manager), hooking onto 'medium' non-strict mode makes sense for simple load-balancing logic, and the new strict mode is for the cases when an application wants to implement some fancy logic as it makes a decision based on a concrete level. OK, I'll respin. But you said it all already, so I'll base my text on on what you wrote. Otherwise, it looks good. Acked-by: Anton Vorontsov an...@enomsg.org Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vmpressure: implement strict mode
On Wed, 26 Jun 2013 17:20:40 +0900 Minchan Kim minc...@kernel.org wrote: Hello Michal, On Wed, Jun 26, 2013 at 09:59:21AM +0200, Michal Hocko wrote: On Wed 26-06-13 16:50:51, Minchan Kim wrote: On Tue, Jun 25, 2013 at 05:51:29PM -0400, Luiz Capitulino wrote: Currently, applications are notified for the level they registered for _plus_ higher levels. This is a problem if the application wants to implement different actions for different levels. For example, an application might want to release 10% of its cache on level low, 50% on medium and 100% on critical. To do this, the application has to register a different fd for each event. However, fd low is always going to be notified and and all fds are going to be notified on level critical. Strict mode solves this problem by strictly notifiying the event an fd has registered for. It's optional. By default we still notify on higher levels. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com Acked-by: Minchan Kim minc...@kernel.org Shouldn't we make this default? The interface is not there for long but still, changing it is always quite tricky. And the users who care can be modified really easily so I would stick with the original default. Yeb, I am not strong against to stick old at a moment but at least, this patch makes more sense to me so I'd like to know why we didn't do it from the beginning. Surely, Anton has a answer. That's exactly my thinking too: I think strict mode should be the default mode, and the current mode should be optional. But it's not a big deal. I've discussed this issue with Anton some weeks ago, and iirc (Anton, please correct/clarify where appropriate) the conclusion was that the current schema makes sense for apps monitoring reclaim activity, as they can hook on low only. Hmm. Something just crossed my mind. Maybe we should have two notification schemas: o memory.pressure_level: implements strict mode (this patch) o memory.reclaim_activity: apps are notified whenever there's reclaim activity As for changing applications, it's better to get some breakage while we're in -rc than regret the API later. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vmpressure: implement strict mode
On Wed, 26 Jun 2013 10:08:27 +0200 Michal Hocko mho...@suse.cz wrote: On Tue 25-06-13 17:51:29, Luiz Capitulino wrote: Currently, applications are notified for the level they registered for _plus_ higher levels. This is a problem if the application wants to implement different actions for different levels. For example, an application might want to release 10% of its cache on level low, 50% on medium and 100% on critical. To do this, the application has to register a different fd for each event. However, fd low is always going to be notified and and all fds are going to be notified on level critical. OK, I am not user of this interface but I thought that the application would take an action of the highest level it gets notification. But yes this might get clumsy to implement. Strict mode solves this problem by strictly notifiying the event an fd has registered for. It's optional. By default we still notify on higher levels. OK, makes some sense to me and it should work with the proposed edge trigerring as well. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- PS: I'm following the discussion on the event storm problem, but I believe strict mode is orthogonal to what has been suggested (although the patches conflict) Documentation/cgroups/memory.txt | 10 ++ mm/vmpressure.c | 19 +-- 2 files changed, 23 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index ddf4f93..3c589cf 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -807,12 +807,14 @@ register a notification, an application must: - create an eventfd using eventfd(2); - open memory.pressure_level; -- write string like event_fd fd of memory.pressure_level level +- write string like event_fd fd of memory.pressure_level level [strict] to cgroup.event_control. -Application will be notified through eventfd when memory pressure is at -the specific level (or higher). Read/write operations to -memory.pressure_level are no implemented. +Applications will be notified through eventfd when memory pressure is at +the specific level or higher. If strict is passed, then applications +will only be notified when memory pressure reaches the specified level. It would be good to describe when is strick and when the default appropriate. Yeah, Anton asked for the same thing. Will do it. + +Read/write operations to memory.pressure_level are no implemented. Test: diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 736a601..6289ede 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -137,6 +137,7 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned, struct vmpressure_event { struct eventfd_ctx *efd; enum vmpressure_levels level; + bool strict_mode; Any reason to not using a flag for this? If there are any other modes to come them we would end up with zilions of bools which is not very nice. Good point, I'll change it. struct list_head node; }; @@ -153,6 +154,9 @@ static bool vmpressure_event(struct vmpressure *vmpr, list_for_each_entry(ev, vmpr-events, node) { if (level = ev-level) { + /* strict mode ensures level == ev-level */ + if (ev-strict_mode level != ev-level) + continue; eventfd_signal(ev-efd, 1); signalled = true; } @@ -292,7 +296,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio) * infrastructure, so that the notifications will be delivered to the * @eventfd. The @args parameter is a string that denotes pressure level * threshold (one of vmpressure_str_levels, i.e. low, medium, or - * critical). + * critical) and optionally a different operating mode (i.e. strict) * * This function should not be used directly, just pass it to (struct * cftype).register_event, and then cgroup core will handle everything by @@ -303,22 +307,33 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft, { struct vmpressure *vmpr = cg_to_vmpressure(cg); struct vmpressure_event *ev; + bool smode = false; + const char *p; int level; for (level = 0; level VMPRESSURE_NUM_LEVELS; level++) { - if (!strcmp(vmpressure_str_levels[level], args)) + p = vmpressure_str_levels[level]; + if (!strncmp(p, args, strlen(p))) break; } if (level = VMPRESSURE_NUM_LEVELS) return -EINVAL; + p = strchr(args, ' '); + if (p) { + if (strncmp(++p, strict, 6)) + return -EINVAL; + smode = true; + } + ev = kzalloc(sizeof(*ev
[PATCH v2] vmpressure: implement strict mode
Currently, an eventfd is notified for the level it's registered for _plus_ higher levels. This is a problem if an application wants to implement different actions for different levels. For example, an application might want to release 10% of its cache on level low, 50% on medium and 100% on critical. To do this, an application has to register a different eventfd for each pressure level. However, fd low is always going to be notified and and all fds are going to be notified on level critical. Strict mode solves this problem by strictly notifiying an eventfd for the pressure level it registered for. This new mode is optional, by default we still notify eventfds on higher levels too. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- o v2 - Improve documentation - Use a bit to store mode instead of a bool - Minor changelog changes Documentation/cgroups/memory.txt | 26 ++ mm/vmpressure.c | 26 -- 2 files changed, 46 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index ddf4f93..412872b 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they can to help the system. It might be too late to consult with vmstat or any other statistics, so it's advisable to take an immediate action. +Applications can also choose between two notification modes when +registering an eventfd for memory pressure events: + +When in non-strict mode, an eventfd is notified for the specific level +it's registered for and higher levels. For example, an eventfd registered +for low level is also going to be notified on medium and critical levels. +This mode makes sense for applications interested on monitoring reclaim +activity or implementing simple load-balacing logic. The non-strict mode +is the default notification mode. + +When in strict mode, an eventfd is strictly notified for the pressure +level it's registered for. For example, an eventfd registered for the low +level event is not going to be notified when memory pressure gets into +medium or critical levels. This allows for more complex logic based on +the actual pressure level the system is experiencing. + The events are propagated upward until the event is handled, i.e. the events are not pass-through. Here is what this means: for example you have three cgroups: A-B-C. Now you set up an event listener on cgroups A, B @@ -807,12 +823,14 @@ register a notification, an application must: - create an eventfd using eventfd(2); - open memory.pressure_level; -- write string like event_fd fd of memory.pressure_level level +- write string like event_fd fd of memory.pressure_level level [strict] to cgroup.event_control. -Application will be notified through eventfd when memory pressure is at -the specific level (or higher). Read/write operations to -memory.pressure_level are no implemented. +Applications will be notified through eventfd when memory pressure is at +the specific level or higher. If strict is passed, then applications +will only be notified when memory pressure reaches the specified level. + +Read/write operations to memory.pressure_level are no implemented. Test: diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 736a601..ba5c17e 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -138,8 +138,16 @@ struct vmpressure_event { struct eventfd_ctx *efd; enum vmpressure_levels level; struct list_head node; + unsigned int mode; }; +#define VMPRESSURE_MODE_STRICT 1 + +static inline bool vmpressure_mode_is_strict(const struct vmpressure_event *ev) +{ + return ev-mode VMPRESSURE_MODE_STRICT; +} + static bool vmpressure_event(struct vmpressure *vmpr, unsigned long scanned, unsigned long reclaimed) { @@ -153,6 +161,9 @@ static bool vmpressure_event(struct vmpressure *vmpr, list_for_each_entry(ev, vmpr-events, node) { if (level = ev-level) { + /* strict mode ensures level == ev-level */ + if (vmpressure_mode_is_strict(ev) level != ev-level) + continue; eventfd_signal(ev-efd, 1); signalled = true; } @@ -292,7 +303,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio) * infrastructure, so that the notifications will be delivered to the * @eventfd. The @args parameter is a string that denotes pressure level * threshold (one of vmpressure_str_levels, i.e. low, medium, or - * critical). + * critical) and optionally a different operating mode (i.e. strict) * * This function should not be used directly, just pass it to (struct * cftype).register_event, and then cgroup core will handle everything by @@ -303,22 +314,33 @@ int vmpressure_register_event(struct cgroup
Re: [PATCH] vmpressure: implement strict mode
On Wed, 26 Jun 2013 10:12:15 +0900 Hyunhee Kim wrote: > Please see "[PATCH v3] memcg: event control at vmpressure". mail > thread. (and also the thread I sent last Saturday.) > There was discussion on this mode not sending lower events when "level > != ev->level". The new argument this patch adds should be orthogonal to what has been discussed and suggested in that thread. The patches conflict though, but it's just a matter of rebasing if both are ACKed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] vmpressure: implement strict mode
Currently, applications are notified for the level they registered for _plus_ higher levels. This is a problem if the application wants to implement different actions for different levels. For example, an application might want to release 10% of its cache on level low, 50% on medium and 100% on critical. To do this, the application has to register a different fd for each event. However, fd low is always going to be notified and and all fds are going to be notified on level critical. Strict mode solves this problem by strictly notifiying the event an fd has registered for. It's optional. By default we still notify on higher levels. Signed-off-by: Luiz Capitulino --- PS: I'm following the discussion on the event storm problem, but I believe strict mode is orthogonal to what has been suggested (although the patches conflict) Documentation/cgroups/memory.txt | 10 ++ mm/vmpressure.c | 19 +-- 2 files changed, 23 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index ddf4f93..3c589cf 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -807,12 +807,14 @@ register a notification, an application must: - create an eventfd using eventfd(2); - open memory.pressure_level; -- write string like " " +- write string like " [strict]" to cgroup.event_control. -Application will be notified through eventfd when memory pressure is at -the specific level (or higher). Read/write operations to -memory.pressure_level are no implemented. +Applications will be notified through eventfd when memory pressure is at +the specific level or higher. If strict is passed, then applications +will only be notified when memory pressure reaches the specified level. + +Read/write operations to memory.pressure_level are no implemented. Test: diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 736a601..6289ede 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -137,6 +137,7 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned, struct vmpressure_event { struct eventfd_ctx *efd; enum vmpressure_levels level; + bool strict_mode; struct list_head node; }; @@ -153,6 +154,9 @@ static bool vmpressure_event(struct vmpressure *vmpr, list_for_each_entry(ev, >events, node) { if (level >= ev->level) { + /* strict mode ensures level == ev->level */ + if (ev->strict_mode && level != ev->level) + continue; eventfd_signal(ev->efd, 1); signalled = true; } @@ -292,7 +296,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio) * infrastructure, so that the notifications will be delivered to the * @eventfd. The @args parameter is a string that denotes pressure level * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or - * "critical"). + * "critical") and optionally a different operating mode (i.e. "strict") * * This function should not be used directly, just pass it to (struct * cftype).register_event, and then cgroup core will handle everything by @@ -303,22 +307,33 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft, { struct vmpressure *vmpr = cg_to_vmpressure(cg); struct vmpressure_event *ev; + bool smode = false; + const char *p; int level; for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) { - if (!strcmp(vmpressure_str_levels[level], args)) + p = vmpressure_str_levels[level]; + if (!strncmp(p, args, strlen(p))) break; } if (level >= VMPRESSURE_NUM_LEVELS) return -EINVAL; + p = strchr(args, ' '); + if (p) { + if (strncmp(++p, "strict", 6)) + return -EINVAL; + smode = true; + } + ev = kzalloc(sizeof(*ev), GFP_KERNEL); if (!ev) return -ENOMEM; ev->efd = eventfd; ev->level = level; + ev->strict_mode = smode; mutex_lock(>events_lock); list_add(>node, >events); -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] vmpressure: implement strict mode
Currently, applications are notified for the level they registered for _plus_ higher levels. This is a problem if the application wants to implement different actions for different levels. For example, an application might want to release 10% of its cache on level low, 50% on medium and 100% on critical. To do this, the application has to register a different fd for each event. However, fd low is always going to be notified and and all fds are going to be notified on level critical. Strict mode solves this problem by strictly notifiying the event an fd has registered for. It's optional. By default we still notify on higher levels. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- PS: I'm following the discussion on the event storm problem, but I believe strict mode is orthogonal to what has been suggested (although the patches conflict) Documentation/cgroups/memory.txt | 10 ++ mm/vmpressure.c | 19 +-- 2 files changed, 23 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index ddf4f93..3c589cf 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -807,12 +807,14 @@ register a notification, an application must: - create an eventfd using eventfd(2); - open memory.pressure_level; -- write string like event_fd fd of memory.pressure_level level +- write string like event_fd fd of memory.pressure_level level [strict] to cgroup.event_control. -Application will be notified through eventfd when memory pressure is at -the specific level (or higher). Read/write operations to -memory.pressure_level are no implemented. +Applications will be notified through eventfd when memory pressure is at +the specific level or higher. If strict is passed, then applications +will only be notified when memory pressure reaches the specified level. + +Read/write operations to memory.pressure_level are no implemented. Test: diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 736a601..6289ede 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -137,6 +137,7 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned, struct vmpressure_event { struct eventfd_ctx *efd; enum vmpressure_levels level; + bool strict_mode; struct list_head node; }; @@ -153,6 +154,9 @@ static bool vmpressure_event(struct vmpressure *vmpr, list_for_each_entry(ev, vmpr-events, node) { if (level = ev-level) { + /* strict mode ensures level == ev-level */ + if (ev-strict_mode level != ev-level) + continue; eventfd_signal(ev-efd, 1); signalled = true; } @@ -292,7 +296,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio) * infrastructure, so that the notifications will be delivered to the * @eventfd. The @args parameter is a string that denotes pressure level * threshold (one of vmpressure_str_levels, i.e. low, medium, or - * critical). + * critical) and optionally a different operating mode (i.e. strict) * * This function should not be used directly, just pass it to (struct * cftype).register_event, and then cgroup core will handle everything by @@ -303,22 +307,33 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft, { struct vmpressure *vmpr = cg_to_vmpressure(cg); struct vmpressure_event *ev; + bool smode = false; + const char *p; int level; for (level = 0; level VMPRESSURE_NUM_LEVELS; level++) { - if (!strcmp(vmpressure_str_levels[level], args)) + p = vmpressure_str_levels[level]; + if (!strncmp(p, args, strlen(p))) break; } if (level = VMPRESSURE_NUM_LEVELS) return -EINVAL; + p = strchr(args, ' '); + if (p) { + if (strncmp(++p, strict, 6)) + return -EINVAL; + smode = true; + } + ev = kzalloc(sizeof(*ev), GFP_KERNEL); if (!ev) return -ENOMEM; ev-efd = eventfd; ev-level = level; + ev-strict_mode = smode; mutex_lock(vmpr-events_lock); list_add(ev-node, vmpr-events); -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vmpressure: implement strict mode
On Wed, 26 Jun 2013 10:12:15 +0900 Hyunhee Kim hyunhee@samsung.com wrote: Please see [PATCH v3] memcg: event control at vmpressure. mail thread. (and also the thread I sent last Saturday.) There was discussion on this mode not sending lower events when level != ev-level. The new argument this patch adds should be orthogonal to what has been discussed and suggested in that thread. The patches conflict though, but it's just a matter of rebasing if both are ACKed. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
On Wed, 5 Jun 2013 21:18:37 -0400 Luiz Capitulino wrote: > The balloon_page_dequeue() function can return NULL. If it does for > the first page being freed, then leak_balloon() will create a > scatter list with len=0. Which in turn seems to generate an invalid > virtio request. > > I didn't get this in practice, I found it by code review. On the other > hand, such an invalid virtio request will cause errors in QEMU and > fill_balloon() also performs the same check implemented by this commit. > > Signed-off-by: Luiz Capitulino > Acked-by: Rafael Aquini Andrew, can you pick this one? > --- > > o v2 > > - Improve changelog > > drivers/virtio/virtio_balloon.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index bd3ae32..71af7b5 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, > size_t num) >* virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); >* is true, we *have* to do it in this order >*/ > - tell_host(vb, vb->deflate_vq); > + if (vb->num_pfns != 0) > + tell_host(vb, vb->deflate_vq); > mutex_unlock(>balloon_lock); > release_pages_by_pfn(vb->pfns, vb->num_pfns); > } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
On Wed, 5 Jun 2013 21:18:37 -0400 Luiz Capitulino lcapitul...@redhat.com wrote: The balloon_page_dequeue() function can return NULL. If it does for the first page being freed, then leak_balloon() will create a scatter list with len=0. Which in turn seems to generate an invalid virtio request. I didn't get this in practice, I found it by code review. On the other hand, such an invalid virtio request will cause errors in QEMU and fill_balloon() also performs the same check implemented by this commit. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com Acked-by: Rafael Aquini aqu...@redhat.com Andrew, can you pick this one? --- o v2 - Improve changelog drivers/virtio/virtio_balloon.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..71af7b5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - tell_host(vb, vb-deflate_vq); + if (vb-num_pfns != 0) + tell_host(vb, vb-deflate_vq); mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
On Thu, 6 Jun 2013 11:13:58 -0300 Rafael Aquini wrote: > On Wed, Jun 05, 2013 at 09:18:37PM -0400, Luiz Capitulino wrote: > > The balloon_page_dequeue() function can return NULL. If it does for > > the first page being freed, then leak_balloon() will create a > > scatter list with len=0. Which in turn seems to generate an invalid > > virtio request. > > > > I didn't get this in practice, I found it by code review. On the other > > hand, such an invalid virtio request will cause errors in QEMU and > > fill_balloon() also performs the same check implemented by this commit. > > > > Signed-off-by: Luiz Capitulino > > Acked-by: Rafael Aquini > > --- > > > > o v2 > > > > - Improve changelog > > > > drivers/virtio/virtio_balloon.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/virtio/virtio_balloon.c > > b/drivers/virtio/virtio_balloon.c > > index bd3ae32..71af7b5 100644 > > --- a/drivers/virtio/virtio_balloon.c > > +++ b/drivers/virtio/virtio_balloon.c > > @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, > > size_t num) > > * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); > > * is true, we *have* to do it in this order > > */ > > - tell_host(vb, vb->deflate_vq); > > Luiz, sorry for not being clearer before. I was referring to add a commentary > on > code, to explain in short words why we should not get rid of this check point. Oh. > > + if (vb->num_pfns != 0) > > + tell_host(vb, vb->deflate_vq); > > mutex_unlock(>balloon_lock); > > If the comment is regarded as unnecessary, then just ignore my suggestion. I'm > OK with your patch. :) IMHO, the code is clear enough. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
On Thu, 6 Jun 2013 11:13:58 -0300 Rafael Aquini aqu...@redhat.com wrote: On Wed, Jun 05, 2013 at 09:18:37PM -0400, Luiz Capitulino wrote: The balloon_page_dequeue() function can return NULL. If it does for the first page being freed, then leak_balloon() will create a scatter list with len=0. Which in turn seems to generate an invalid virtio request. I didn't get this in practice, I found it by code review. On the other hand, such an invalid virtio request will cause errors in QEMU and fill_balloon() also performs the same check implemented by this commit. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com Acked-by: Rafael Aquini aqu...@redhat.com --- o v2 - Improve changelog drivers/virtio/virtio_balloon.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..71af7b5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - tell_host(vb, vb-deflate_vq); Luiz, sorry for not being clearer before. I was referring to add a commentary on code, to explain in short words why we should not get rid of this check point. Oh. + if (vb-num_pfns != 0) + tell_host(vb, vb-deflate_vq); mutex_unlock(vb-balloon_lock); If the comment is regarded as unnecessary, then just ignore my suggestion. I'm OK with your patch. :) IMHO, the code is clear enough. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
The balloon_page_dequeue() function can return NULL. If it does for the first page being freed, then leak_balloon() will create a scatter list with len=0. Which in turn seems to generate an invalid virtio request. I didn't get this in practice, I found it by code review. On the other hand, such an invalid virtio request will cause errors in QEMU and fill_balloon() also performs the same check implemented by this commit. Signed-off-by: Luiz Capitulino Acked-by: Rafael Aquini --- o v2 - Improve changelog drivers/virtio/virtio_balloon.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..71af7b5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - tell_host(vb, vb->deflate_vq); + if (vb->num_pfns != 0) + tell_host(vb, vb->deflate_vq); mutex_unlock(>balloon_lock); release_pages_by_pfn(vb->pfns, vb->num_pfns); } -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
On Wed, 5 Jun 2013 18:24:49 -0300 Rafael Aquini wrote: > On Wed, Jun 05, 2013 at 05:10:31PM -0400, Luiz Capitulino wrote: > > The balloon_page_dequeue() function can return NULL. If it does for > > the first page being freed, then leak_balloon() will create a > > scatter list with len=0. Which in turn seems to generate an invalid > > virtio request. > > > > Signed-off-by: Luiz Capitulino > > --- > > > > PS: I didn't get this in practice. I found it by code review. On the other > > hand, automatic-ballooning was able to put such invalid requests in > > the virtqueue and QEMU would explode... > > > > Nice catch! The patch looks sane and replicates the check done at > fill_balloon(). I think we also could use this P.S. as a commentary > to let others aware of this scenario. Thanks Luiz! Want me to respin? > Acked-by: Rafael Aquini Thanks for your review! > > > > PPS: Very lightly tested > > > > drivers/virtio/virtio_balloon.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/virtio/virtio_balloon.c > > b/drivers/virtio/virtio_balloon.c > > index bd3ae32..71af7b5 100644 > > --- a/drivers/virtio/virtio_balloon.c > > +++ b/drivers/virtio/virtio_balloon.c > > @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, > > size_t num) > > * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); > > * is true, we *have* to do it in this order > > */ > > - tell_host(vb, vb->deflate_vq); > > + if (vb->num_pfns != 0) > > + tell_host(vb, vb->deflate_vq); > > mutex_unlock(>balloon_lock); > > release_pages_by_pfn(vb->pfns, vb->num_pfns); > > } > > -- > > 1.8.1.4 > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
The balloon_page_dequeue() function can return NULL. If it does for the first page being freed, then leak_balloon() will create a scatter list with len=0. Which in turn seems to generate an invalid virtio request. Signed-off-by: Luiz Capitulino --- PS: I didn't get this in practice. I found it by code review. On the other hand, automatic-ballooning was able to put such invalid requests in the virtqueue and QEMU would explode... PPS: Very lightly tested drivers/virtio/virtio_balloon.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..71af7b5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - tell_host(vb, vb->deflate_vq); + if (vb->num_pfns != 0) + tell_host(vb, vb->deflate_vq); mutex_unlock(>balloon_lock); release_pages_by_pfn(vb->pfns, vb->num_pfns); } -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
The balloon_page_dequeue() function can return NULL. If it does for the first page being freed, then leak_balloon() will create a scatter list with len=0. Which in turn seems to generate an invalid virtio request. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- PS: I didn't get this in practice. I found it by code review. On the other hand, automatic-ballooning was able to put such invalid requests in the virtqueue and QEMU would explode... PPS: Very lightly tested drivers/virtio/virtio_balloon.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..71af7b5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - tell_host(vb, vb-deflate_vq); + if (vb-num_pfns != 0) + tell_host(vb, vb-deflate_vq); mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); } -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
On Wed, 5 Jun 2013 18:24:49 -0300 Rafael Aquini aqu...@redhat.com wrote: On Wed, Jun 05, 2013 at 05:10:31PM -0400, Luiz Capitulino wrote: The balloon_page_dequeue() function can return NULL. If it does for the first page being freed, then leak_balloon() will create a scatter list with len=0. Which in turn seems to generate an invalid virtio request. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- PS: I didn't get this in practice. I found it by code review. On the other hand, automatic-ballooning was able to put such invalid requests in the virtqueue and QEMU would explode... Nice catch! The patch looks sane and replicates the check done at fill_balloon(). I think we also could use this P.S. as a commentary to let others aware of this scenario. Thanks Luiz! Want me to respin? Acked-by: Rafael Aquini aqu...@redhat.com Thanks for your review! PPS: Very lightly tested drivers/virtio/virtio_balloon.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..71af7b5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - tell_host(vb, vb-deflate_vq); + if (vb-num_pfns != 0) + tell_host(vb, vb-deflate_vq); mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); } -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated
The balloon_page_dequeue() function can return NULL. If it does for the first page being freed, then leak_balloon() will create a scatter list with len=0. Which in turn seems to generate an invalid virtio request. I didn't get this in practice, I found it by code review. On the other hand, such an invalid virtio request will cause errors in QEMU and fill_balloon() also performs the same check implemented by this commit. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com Acked-by: Rafael Aquini aqu...@redhat.com --- o v2 - Improve changelog drivers/virtio/virtio_balloon.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..71af7b5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - tell_host(vb, vb-deflate_vq); + if (vb-num_pfns != 0) + tell_host(vb, vb-deflate_vq); mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); } -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v2 0/2] virtio_balloon: auto-ballooning support
On Thu, 16 May 2013 16:56:34 -0400 Sasha Levin wrote: > On 05/09/2013 10:53 AM, Luiz Capitulino wrote: > > Hi, > > > > This series is a respin of automatic ballooning support I started > > working on last year. Patch 2/2 contains all relevant technical > > details and performance measurements results. > > > > This is in RFC state because it's a work in progress. > > Hi Luiz, > > Is there a virtio spec patch I could use to get it implemented on > kvmtool? Not yet, this will come with v1. But I got some homework to do before posting it (more perf tests). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v2 0/2] virtio_balloon: auto-ballooning support
On Thu, 16 May 2013 16:56:34 -0400 Sasha Levin sasha.le...@oracle.com wrote: On 05/09/2013 10:53 AM, Luiz Capitulino wrote: Hi, This series is a respin of automatic ballooning support I started working on last year. Patch 2/2 contains all relevant technical details and performance measurements results. This is in RFC state because it's a work in progress. Hi Luiz, Is there a virtio spec patch I could use to get it implemented on kvmtool? Not yet, this will come with v1. But I got some homework to do before posting it (more perf tests). -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] package: Makefile: unbreak binrpm-pkg target
Commit 6501320311664e11ad5489fedb17ea1f817c2617 dropped the rpm spec as a prerequisite for the binrpm-pkg target but forgot to update $< usage, which causes the rule to break. This commit fixes that by replacing $< with the spec name. Signed-off-by: Luiz Capitulino --- scripts/package/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/package/Makefile b/scripts/package/Makefile index 84a4060..a4f31c9 100644 --- a/scripts/package/Makefile +++ b/scripts/package/Makefile @@ -63,7 +63,7 @@ binrpm-pkg: FORCE mv -f $(objtree)/.tmp_version $(objtree)/.version $(RPM) $(RPMOPTS) --define "_builddir $(objtree)" --target \ - $(UTS_MACHINE) -bb $< + $(UTS_MACHINE) -bb $(objtree)/binkernel.spec rm binkernel.spec # Deb target -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] package: Makefile: unbreak binrpm-pkg target
Commit 6501320311664e11ad5489fedb17ea1f817c2617 dropped the rpm spec as a prerequisite for the binrpm-pkg target but forgot to update $ usage, which causes the rule to break. This commit fixes that by replacing $ with the spec name. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- scripts/package/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/package/Makefile b/scripts/package/Makefile index 84a4060..a4f31c9 100644 --- a/scripts/package/Makefile +++ b/scripts/package/Makefile @@ -63,7 +63,7 @@ binrpm-pkg: FORCE mv -f $(objtree)/.tmp_version $(objtree)/.version $(RPM) $(RPMOPTS) --define _builddir $(objtree) --target \ - $(UTS_MACHINE) -bb $ + $(UTS_MACHINE) -bb $(objtree)/binkernel.spec rm binkernel.spec # Deb target -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Mon, 13 May 2013 22:02:50 +0300 "Michael S. Tsirkin" wrote: > On Mon, May 13, 2013 at 02:25:11PM -0400, Luiz Capitulino wrote: > > On Mon, 13 May 2013 11:34:41 -0300 > > Rafael Aquini wrote: > > > > > You're right, and the host's member is used to communicate the configured > > > size > > > to guest's balloon device, however, by not changing it when the shrinker > > > causes > > > the balloon to deflate will make the balloon thread to be woken up again > > > in order to chase the balloon size target again, won't it? Check > > > > I don't see the balloon thread waking up after the shrinker executes in my > > testing. Maybe this is so because it will only wake up when QEMU notifies > > a config change. > > Well that's also a problem. > Need some mechanism to re-inflate balloon In this implemention this is done by QEMU, have you looked at the QEMU patch yet? https://lists.gnu.org/archive/html/qemu-devel/2013-05/msg01295.html > when guest memory pressure is down. In this implementation we do this when there's pressure in the host. I expect things to balance over time, and this seems to be what I'm observing in my testing, but of course we need more testing. > virtio fs mechanism worth a look? Can you elaborate? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Mon, 13 May 2013 11:34:41 -0300 Rafael Aquini wrote: > You're right, and the host's member is used to communicate the configured size > to guest's balloon device, however, by not changing it when the shrinker > causes > the balloon to deflate will make the balloon thread to be woken up again > in order to chase the balloon size target again, won't it? Check I don't see the balloon thread waking up after the shrinker executes in my testing. Maybe this is so because it will only wake up when QEMU notifies a config change. But anyway, I'll think how to improve this as suggested by Michael too, as I seem to be changing num_pages' semantics according to the virtio spec. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Sun, 12 May 2013 21:49:34 +0300 "Michael S. Tsirkin" wrote: > On Sun, May 12, 2013 at 12:36:09PM -0400, Rik van Riel wrote: > > On 05/12/2013 10:30 AM, Michael S. Tsirkin wrote: > > >On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote: > > >>Automatic ballooning consists of dynamically adjusting the guest's > > >>balloon according to memory pressure in the host and in the guest. > > >> > > >>This commit implements the guest side of automatic balloning, which > > >>basically consists of registering a shrinker callback with the kernel, > > >>which will try to deflate the guest's balloon by the amount of pages > > >>being requested. The shrinker callback is only registered if the host > > >>supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. > > > > > >OK so we have a new feature bit, such that: > > >- if AUTO_BALLOON is set in host, guest can leak a > > > page from a balloon at any time > > > > > >questions left unanswered > > >- what meaning does num_pages have now? > > > > "as large as we could go" > > I see. This is the reverse of it's current meaning. > I would suggest a new field instead of overriding > the existing one. I'll write a spec patch as you suggested on irc and will decide what to do from there. > > >- when will the balloon be re-inflated? > > > > I believe the qemu changes Luiz wrote address that side, > > with qemu-kvm getting notifications from the host kernel > > when there is memory pressure, and shrinking the guest > > in reaction to that notification. > > But it's the guest memory pressure we care about: > > - host asks balloon to inflate > later > - guest asks balloon to deflate > > with this patch guest takes priority, > balloon deflates. So we should only inflate > if guest does not need the memory. Inflate will actually fail if the guest doesn't have memory to fill the balloon. But in any case, and as you said elsewhere in this thread, inflate is not new and could be even done by mngt. So I don't think this should be changed in this patch. > > >I'd like to see a spec patch addressing these questions. > > > > > >Would we ever want to mix the two types of ballooning? > > >If yes possibly when we put a page in the balloon we > > >might want to give host a hint "this page might be > > >leaked again soon". > > > > It might not be the same page, and the host really does > > not care which page it is. > > Whether we care depends on what we do with the page. > For example, in the future we might care which numa node is > used. > > > The automatic inflation happens when the host needs to > > free up memory. > > This can be done today by management, with no need to > change qemu. So automatic inflate, IMHO does not need > a feature flag. It's the automatic deflate in guest that's new. Makes sense. > > >>Automatic inflate is performed by the host. > > >> > > >>Here are some numbers. The test-case is to run 35 VMs (1G of RAM each) > > >>in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap. > > >>SWAP IN and SWAP OUT correspond to the number of pages swapped in and > > >>swapped out, respectively. > > >> > > >>Auto-ballooning disabled: > > >> > > >>RUN TIME(s) SWAP IN SWAP OUT > > >> > > >>1634 930980 1588522 > > >>2610 627422 1362174 > > >>3649 1079847 1616367 > > >>4543 953289 1635379 > > >>5642 913237 1514000 > > >> > > >>Auto-ballooning enabled: > > >> > > >>RUN TIME(s) SWAP IN SWAP OUT > > >> > > >>1629 901 12537 > > >>2624 981 18506 > > >>3626 573 9085 > > >>4631 2250 42534 > > >>5627 1610 20808 > > > > > >So what exactly happens here? > > >Much less swap in/out activity, but no gain > > >in total runtime. Doesn't this show there's > > >a bottleneck somewhere? Could be a problem in > > >the implementation? > > > > It could also be an issue with the benchmark chosen, > > which may not have swap as its bottleneck at any point. > > > > However, the reduced swapping is still very promising! > > Isn't this a direct result of inflating the balloon? > E.g. just inflating the balloon without > the shrinker
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Sun, 12 May 2013 21:49:34 +0300 Michael S. Tsirkin m...@redhat.com wrote: On Sun, May 12, 2013 at 12:36:09PM -0400, Rik van Riel wrote: On 05/12/2013 10:30 AM, Michael S. Tsirkin wrote: On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote: Automatic ballooning consists of dynamically adjusting the guest's balloon according to memory pressure in the host and in the guest. This commit implements the guest side of automatic balloning, which basically consists of registering a shrinker callback with the kernel, which will try to deflate the guest's balloon by the amount of pages being requested. The shrinker callback is only registered if the host supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. OK so we have a new feature bit, such that: - if AUTO_BALLOON is set in host, guest can leak a page from a balloon at any time questions left unanswered - what meaning does num_pages have now? as large as we could go I see. This is the reverse of it's current meaning. I would suggest a new field instead of overriding the existing one. I'll write a spec patch as you suggested on irc and will decide what to do from there. - when will the balloon be re-inflated? I believe the qemu changes Luiz wrote address that side, with qemu-kvm getting notifications from the host kernel when there is memory pressure, and shrinking the guest in reaction to that notification. But it's the guest memory pressure we care about: - host asks balloon to inflate later - guest asks balloon to deflate with this patch guest takes priority, balloon deflates. So we should only inflate if guest does not need the memory. Inflate will actually fail if the guest doesn't have memory to fill the balloon. But in any case, and as you said elsewhere in this thread, inflate is not new and could be even done by mngt. So I don't think this should be changed in this patch. I'd like to see a spec patch addressing these questions. Would we ever want to mix the two types of ballooning? If yes possibly when we put a page in the balloon we might want to give host a hint this page might be leaked again soon. It might not be the same page, and the host really does not care which page it is. Whether we care depends on what we do with the page. For example, in the future we might care which numa node is used. The automatic inflation happens when the host needs to free up memory. This can be done today by management, with no need to change qemu. So automatic inflate, IMHO does not need a feature flag. It's the automatic deflate in guest that's new. Makes sense. Automatic inflate is performed by the host. Here are some numbers. The test-case is to run 35 VMs (1G of RAM each) in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap. SWAP IN and SWAP OUT correspond to the number of pages swapped in and swapped out, respectively. Auto-ballooning disabled: RUN TIME(s) SWAP IN SWAP OUT 1634 930980 1588522 2610 627422 1362174 3649 1079847 1616367 4543 953289 1635379 5642 913237 1514000 Auto-ballooning enabled: RUN TIME(s) SWAP IN SWAP OUT 1629 901 12537 2624 981 18506 3626 573 9085 4631 2250 42534 5627 1610 20808 So what exactly happens here? Much less swap in/out activity, but no gain in total runtime. Doesn't this show there's a bottleneck somewhere? Could be a problem in the implementation? It could also be an issue with the benchmark chosen, which may not have swap as its bottleneck at any point. However, the reduced swapping is still very promising! Isn't this a direct result of inflating the balloon? E.g. just inflating the balloon without the shrinker will make us avoid swap in host. What we would want to check is whether shrinking works as expected, and whether we need to speed up shrinking. As I say above, inflating the balloon is easy. A good benchmark would show how we can deflate and re-inflate it efficiently with demand. I'm going to measure this. Also, what happened with the balloon? Did we end up with balloon completely inflated? deflated? In my test-case VMs are started with 1G. After the test, almost all of them have between 200-300MB. One question to consider: possibly if we are going to reuse the page in the balloon soon, we might want to bypass notify before use for it? Maybe that will help speed things up. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- drivers/virtio/virtio_balloon.c | 55 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 56 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 9d5fe2b..f9dcae8 100644
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Mon, 13 May 2013 11:34:41 -0300 Rafael Aquini aqu...@redhat.com wrote: You're right, and the host's member is used to communicate the configured size to guest's balloon device, however, by not changing it when the shrinker causes the balloon to deflate will make the balloon thread to be woken up again in order to chase the balloon size target again, won't it? Check I don't see the balloon thread waking up after the shrinker executes in my testing. Maybe this is so because it will only wake up when QEMU notifies a config change. But anyway, I'll think how to improve this as suggested by Michael too, as I seem to be changing num_pages' semantics according to the virtio spec. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Mon, 13 May 2013 22:02:50 +0300 Michael S. Tsirkin m...@redhat.com wrote: On Mon, May 13, 2013 at 02:25:11PM -0400, Luiz Capitulino wrote: On Mon, 13 May 2013 11:34:41 -0300 Rafael Aquini aqu...@redhat.com wrote: You're right, and the host's member is used to communicate the configured size to guest's balloon device, however, by not changing it when the shrinker causes the balloon to deflate will make the balloon thread to be woken up again in order to chase the balloon size target again, won't it? Check I don't see the balloon thread waking up after the shrinker executes in my testing. Maybe this is so because it will only wake up when QEMU notifies a config change. Well that's also a problem. Need some mechanism to re-inflate balloon In this implemention this is done by QEMU, have you looked at the QEMU patch yet? https://lists.gnu.org/archive/html/qemu-devel/2013-05/msg01295.html when guest memory pressure is down. In this implementation we do this when there's pressure in the host. I expect things to balance over time, and this seems to be what I'm observing in my testing, but of course we need more testing. virtio fs mechanism worth a look? Can you elaborate? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Fri, 10 May 2013 09:20:46 -0400 Luiz Capitulino wrote: > On Thu, 9 May 2013 18:15:19 -0300 > Rafael Aquini wrote: > > > On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote: > > > Automatic ballooning consists of dynamically adjusting the guest's > > > balloon according to memory pressure in the host and in the guest. > > > > > > This commit implements the guest side of automatic balloning, which > > > basically consists of registering a shrinker callback with the kernel, > > > which will try to deflate the guest's balloon by the amount of pages > > > being requested. The shrinker callback is only registered if the host > > > supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. > > > > > > Automatic inflate is performed by the host. > > > > > > Here are some numbers. The test-case is to run 35 VMs (1G of RAM each) > > > in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap. > > > SWAP IN and SWAP OUT correspond to the number of pages swapped in and > > > swapped out, respectively. > > > > > > Auto-ballooning disabled: > > > > > > RUN TIME(s) SWAP IN SWAP OUT > > > > > > 1634 930980 1588522 > > > 2610 627422 1362174 > > > 3649 1079847 1616367 > > > 4543 953289 1635379 > > > 5642 913237 1514000 > > > > > > Auto-ballooning enabled: > > > > > > RUN TIME(s) SWAP IN SWAP OUT > > > > > > 1629 901 12537 > > > 2624 981 18506 > > > 3626 573 9085 > > > 4631 2250 42534 > > > 5627 1610 20808 > > > > > > Signed-off-by: Luiz Capitulino > > > --- > > > > Nice work Luiz! Just allow me a silly question, though. > > I have 100% more chances of committing sillynesses than you, so please > go ahead. > > > Since your shrinker > > doesn't change the balloon target size, > > Which target size are you referring to? The one in the host (member num_pages > of VirtIOBalloon in QEMU)? > > If it the one in the host, then my understanding is that that member is only > used to communicate the new balloon target to the guest. The guest driver > will only read it when told (by the host) to do so, and when it does the > target value will be correct. > > Am I right? > > > as soon as the shrink round finishes the > > balloon will re-inflate again, won't it? Doesn't this cause a sort of > > "balloon > > thrashing" scenario, if both guest and host are suffering from memory > > pressure? Forgot to say that I didn't observe this in my testing. But I'll try harder as soon as we clarify which target size we're talking about. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Thu, 9 May 2013 18:15:19 -0300 Rafael Aquini wrote: > On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote: > > Automatic ballooning consists of dynamically adjusting the guest's > > balloon according to memory pressure in the host and in the guest. > > > > This commit implements the guest side of automatic balloning, which > > basically consists of registering a shrinker callback with the kernel, > > which will try to deflate the guest's balloon by the amount of pages > > being requested. The shrinker callback is only registered if the host > > supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. > > > > Automatic inflate is performed by the host. > > > > Here are some numbers. The test-case is to run 35 VMs (1G of RAM each) > > in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap. > > SWAP IN and SWAP OUT correspond to the number of pages swapped in and > > swapped out, respectively. > > > > Auto-ballooning disabled: > > > > RUN TIME(s) SWAP IN SWAP OUT > > > > 1634 930980 1588522 > > 2610 627422 1362174 > > 3649 1079847 1616367 > > 4543 953289 1635379 > > 5642 913237 1514000 > > > > Auto-ballooning enabled: > > > > RUN TIME(s) SWAP IN SWAP OUT > > > > 1629 901 12537 > > 2624 981 18506 > > 3626 573 9085 > > 4631 2250 42534 > > 5627 1610 20808 > > > > Signed-off-by: Luiz Capitulino > > --- > > Nice work Luiz! Just allow me a silly question, though. I have 100% more chances of committing sillynesses than you, so please go ahead. > Since your shrinker > doesn't change the balloon target size, Which target size are you referring to? The one in the host (member num_pages of VirtIOBalloon in QEMU)? If it the one in the host, then my understanding is that that member is only used to communicate the new balloon target to the guest. The guest driver will only read it when told (by the host) to do so, and when it does the target value will be correct. Am I right? > as soon as the shrink round finishes the > balloon will re-inflate again, won't it? Doesn't this cause a sort of "balloon > thrashing" scenario, if both guest and host are suffering from memory > pressure? > > > The rest I have for the moment, are only nitpicks :) > > > > drivers/virtio/virtio_balloon.c | 55 > > + > > include/uapi/linux/virtio_balloon.h | 1 + > > 2 files changed, 56 insertions(+) > > > > diff --git a/drivers/virtio/virtio_balloon.c > > b/drivers/virtio/virtio_balloon.c > > index 9d5fe2b..f9dcae8 100644 > > --- a/drivers/virtio/virtio_balloon.c > > +++ b/drivers/virtio/virtio_balloon.c > > @@ -71,6 +71,9 @@ struct virtio_balloon > > /* Memory statistics */ > > int need_stats_update; > > struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; > > + > > + /* Memory shrinker */ > > + struct shrinker shrinker; > > }; > > > > static struct virtio_device_id id_table[] = { > > @@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page) > > pfns[i] = page_to_balloon_pfn(page) + i; > > } > > > > +/* This function should be called with vb->balloon_mutex held */ > > static void fill_balloon(struct virtio_balloon *vb, size_t num) > > { > > struct balloon_dev_info *vb_dev_info = vb->vb_dev_info; > > @@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], > > unsigned int num) > > } > > } > > > > +/* This function should be called with vb->balloon_mutex held */ > > static void leak_balloon(struct virtio_balloon *vb, size_t num) > > { > > struct page *page; > > @@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon > > *vb) > > , sizeof(actual)); > > } > > > > +static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb) > > +{ > > + return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE; > > +} > > + > > +static int balloon_shrinker(struct shrinker *shrinker,struct > > shrink_control *sc) > > +{ > > + unsigned int nr_pages, new_target; > > + struct virtio_balloon *vb; > > + > > + vb = container_of(shrinker, struct virtio_balloon, shrinker); > > + if (!mutex_trylock(>balloon_lock)) { > > + return -1; > > + } > > + > &
Re: [RFC 1/2] virtio_balloon: move balloon_lock mutex to callers
On Thu, 9 May 2013 18:03:09 -0300 Rafael Aquini wrote: > On Thu, May 09, 2013 at 10:53:48AM -0400, Luiz Capitulino wrote: > > This commit moves the balloon_lock mutex out of the fill_balloon() > > and leak_balloon() functions to their callers. > > > > The reason for this change is that the next commit will introduce > > a shrinker callback for the balloon driver, which will also call > > leak_balloon() but will require different locking semantics. > > > > Signed-off-by: Luiz Capitulino > > --- > > drivers/virtio/virtio_balloon.c | 8 > > 1 file changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/virtio/virtio_balloon.c > > b/drivers/virtio/virtio_balloon.c > > index bd3ae32..9d5fe2b 100644 > > --- a/drivers/virtio/virtio_balloon.c > > +++ b/drivers/virtio/virtio_balloon.c > > @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, > > size_t num) > > /* We can only do one array worth at a time. */ > > num = min(num, ARRAY_SIZE(vb->pfns)); > > > > - mutex_lock(>balloon_lock); > > for (vb->num_pfns = 0; vb->num_pfns < num; > > vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { > > struct page *page = balloon_page_enqueue(vb_dev_info); > > @@ -154,7 +153,6 @@ static void fill_balloon(struct virtio_balloon *vb, > > size_t num) > > /* Did we get any? */ > > if (vb->num_pfns != 0) > > tell_host(vb, vb->inflate_vq); > > - mutex_unlock(>balloon_lock); > > } > > > > static void release_pages_by_pfn(const u32 pfns[], unsigned int num) > > @@ -176,7 +174,6 @@ static void leak_balloon(struct virtio_balloon *vb, > > size_t num) > > /* We can only do one array worth at a time. */ > > num = min(num, ARRAY_SIZE(vb->pfns)); > > > > - mutex_lock(>balloon_lock); > > for (vb->num_pfns = 0; vb->num_pfns < num; > > vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { > > page = balloon_page_dequeue(vb_dev_info); > > @@ -192,7 +189,6 @@ static void leak_balloon(struct virtio_balloon *vb, > > size_t num) > > * is true, we *have* to do it in this order > > */ > > tell_host(vb, vb->deflate_vq); > > - mutex_unlock(>balloon_lock); > > release_pages_by_pfn(vb->pfns, vb->num_pfns); > > } > > > > @@ -305,11 +301,13 @@ static int balloon(void *_vballoon) > > || freezing(current)); > > if (vb->need_stats_update) > > stats_handle_request(vb); > > + mutex_lock(>balloon_lock); > > if (diff > 0) > > fill_balloon(vb, diff); > > else if (diff < 0) > > leak_balloon(vb, -diff); > > update_balloon_size(vb); > > + mutex_unlock(>balloon_lock); > > } > > return 0; > > } > > @@ -490,9 +488,11 @@ out: > > static void remove_common(struct virtio_balloon *vb) > > { > > /* There might be pages left in the balloon: free them. */ > > + mutex_lock(>balloon_lock); > > while (vb->num_pages) > > leak_balloon(vb, vb->num_pages); > > update_balloon_size(vb); > > + mutex_unlock(>balloon_lock); > > I think you will need to introduce the same change as above to > virtballoon_restore() Thanks Rafael, I've fixed it in my tree. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 1/2] virtio_balloon: move balloon_lock mutex to callers
On Thu, 9 May 2013 18:03:09 -0300 Rafael Aquini aqu...@redhat.com wrote: On Thu, May 09, 2013 at 10:53:48AM -0400, Luiz Capitulino wrote: This commit moves the balloon_lock mutex out of the fill_balloon() and leak_balloon() functions to their callers. The reason for this change is that the next commit will introduce a shrinker callback for the balloon driver, which will also call leak_balloon() but will require different locking semantics. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- drivers/virtio/virtio_balloon.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..9d5fe2b 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); - mutex_lock(vb-balloon_lock); for (vb-num_pfns = 0; vb-num_pfns num; vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { struct page *page = balloon_page_enqueue(vb_dev_info); @@ -154,7 +153,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* Did we get any? */ if (vb-num_pfns != 0) tell_host(vb, vb-inflate_vq); - mutex_unlock(vb-balloon_lock); } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -176,7 +174,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); - mutex_lock(vb-balloon_lock); for (vb-num_pfns = 0; vb-num_pfns num; vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { page = balloon_page_dequeue(vb_dev_info); @@ -192,7 +189,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * is true, we *have* to do it in this order */ tell_host(vb, vb-deflate_vq); - mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); } @@ -305,11 +301,13 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb-need_stats_update) stats_handle_request(vb); + mutex_lock(vb-balloon_lock); if (diff 0) fill_balloon(vb, diff); else if (diff 0) leak_balloon(vb, -diff); update_balloon_size(vb); + mutex_unlock(vb-balloon_lock); } return 0; } @@ -490,9 +488,11 @@ out: static void remove_common(struct virtio_balloon *vb) { /* There might be pages left in the balloon: free them. */ + mutex_lock(vb-balloon_lock); while (vb-num_pages) leak_balloon(vb, vb-num_pages); update_balloon_size(vb); + mutex_unlock(vb-balloon_lock); I think you will need to introduce the same change as above to virtballoon_restore() Thanks Rafael, I've fixed it in my tree. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Thu, 9 May 2013 18:15:19 -0300 Rafael Aquini aqu...@redhat.com wrote: On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote: Automatic ballooning consists of dynamically adjusting the guest's balloon according to memory pressure in the host and in the guest. This commit implements the guest side of automatic balloning, which basically consists of registering a shrinker callback with the kernel, which will try to deflate the guest's balloon by the amount of pages being requested. The shrinker callback is only registered if the host supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. Automatic inflate is performed by the host. Here are some numbers. The test-case is to run 35 VMs (1G of RAM each) in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap. SWAP IN and SWAP OUT correspond to the number of pages swapped in and swapped out, respectively. Auto-ballooning disabled: RUN TIME(s) SWAP IN SWAP OUT 1634 930980 1588522 2610 627422 1362174 3649 1079847 1616367 4543 953289 1635379 5642 913237 1514000 Auto-ballooning enabled: RUN TIME(s) SWAP IN SWAP OUT 1629 901 12537 2624 981 18506 3626 573 9085 4631 2250 42534 5627 1610 20808 Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- Nice work Luiz! Just allow me a silly question, though. I have 100% more chances of committing sillynesses than you, so please go ahead. Since your shrinker doesn't change the balloon target size, Which target size are you referring to? The one in the host (member num_pages of VirtIOBalloon in QEMU)? If it the one in the host, then my understanding is that that member is only used to communicate the new balloon target to the guest. The guest driver will only read it when told (by the host) to do so, and when it does the target value will be correct. Am I right? as soon as the shrink round finishes the balloon will re-inflate again, won't it? Doesn't this cause a sort of balloon thrashing scenario, if both guest and host are suffering from memory pressure? The rest I have for the moment, are only nitpicks :) drivers/virtio/virtio_balloon.c | 55 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 56 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 9d5fe2b..f9dcae8 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -71,6 +71,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* Memory shrinker */ + struct shrinker shrinker; }; static struct virtio_device_id id_table[] = { @@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page) pfns[i] = page_to_balloon_pfn(page) + i; } +/* This function should be called with vb-balloon_mutex held */ static void fill_balloon(struct virtio_balloon *vb, size_t num) { struct balloon_dev_info *vb_dev_info = vb-vb_dev_info; @@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num) } } +/* This function should be called with vb-balloon_mutex held */ static void leak_balloon(struct virtio_balloon *vb, size_t num) { struct page *page; @@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon *vb) actual, sizeof(actual)); } +static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb) +{ + return vb-num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE; +} + +static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control *sc) +{ + unsigned int nr_pages, new_target; + struct virtio_balloon *vb; + + vb = container_of(shrinker, struct virtio_balloon, shrinker); + if (!mutex_trylock(vb-balloon_lock)) { + return -1; + } + + nr_pages = balloon_get_nr_pages(vb); + if (!sc-nr_to_scan || !nr_pages) { + goto out; + } + + /* +* If the current balloon size is greater than the number of +* pages being reclaimed by the kernel, deflate only the needed +* amount. Otherwise deflate everything we have. +*/ + new_target = 0; + if (nr_pages sc-nr_to_scan) { + new_target = nr_pages - sc-nr_to_scan; + } + CodingStyle: you don't need the curly-braces for all these single staments above Oh, this comes from QEMU coding style. Fixed. + leak_balloon(vb, new_target); + update_balloon_size(vb); + nr_pages = balloon_get_nr_pages(vb); + +out: + mutex_unlock(vb-balloon_lock); + return nr_pages; +} + static int balloon(void
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On Fri, 10 May 2013 09:20:46 -0400 Luiz Capitulino lcapitul...@redhat.com wrote: On Thu, 9 May 2013 18:15:19 -0300 Rafael Aquini aqu...@redhat.com wrote: On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote: Automatic ballooning consists of dynamically adjusting the guest's balloon according to memory pressure in the host and in the guest. This commit implements the guest side of automatic balloning, which basically consists of registering a shrinker callback with the kernel, which will try to deflate the guest's balloon by the amount of pages being requested. The shrinker callback is only registered if the host supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. Automatic inflate is performed by the host. Here are some numbers. The test-case is to run 35 VMs (1G of RAM each) in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap. SWAP IN and SWAP OUT correspond to the number of pages swapped in and swapped out, respectively. Auto-ballooning disabled: RUN TIME(s) SWAP IN SWAP OUT 1634 930980 1588522 2610 627422 1362174 3649 1079847 1616367 4543 953289 1635379 5642 913237 1514000 Auto-ballooning enabled: RUN TIME(s) SWAP IN SWAP OUT 1629 901 12537 2624 981 18506 3626 573 9085 4631 2250 42534 5627 1610 20808 Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- Nice work Luiz! Just allow me a silly question, though. I have 100% more chances of committing sillynesses than you, so please go ahead. Since your shrinker doesn't change the balloon target size, Which target size are you referring to? The one in the host (member num_pages of VirtIOBalloon in QEMU)? If it the one in the host, then my understanding is that that member is only used to communicate the new balloon target to the guest. The guest driver will only read it when told (by the host) to do so, and when it does the target value will be correct. Am I right? as soon as the shrink round finishes the balloon will re-inflate again, won't it? Doesn't this cause a sort of balloon thrashing scenario, if both guest and host are suffering from memory pressure? Forgot to say that I didn't observe this in my testing. But I'll try harder as soon as we clarify which target size we're talking about. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 1/2] virtio_balloon: move balloon_lock mutex to callers
This commit moves the balloon_lock mutex out of the fill_balloon() and leak_balloon() functions to their callers. The reason for this change is that the next commit will introduce a shrinker callback for the balloon driver, which will also call leak_balloon() but will require different locking semantics. Signed-off-by: Luiz Capitulino --- drivers/virtio/virtio_balloon.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..9d5fe2b 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb->pfns)); - mutex_lock(>balloon_lock); for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { struct page *page = balloon_page_enqueue(vb_dev_info); @@ -154,7 +153,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* Did we get any? */ if (vb->num_pfns != 0) tell_host(vb, vb->inflate_vq); - mutex_unlock(>balloon_lock); } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -176,7 +174,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb->pfns)); - mutex_lock(>balloon_lock); for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { page = balloon_page_dequeue(vb_dev_info); @@ -192,7 +189,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * is true, we *have* to do it in this order */ tell_host(vb, vb->deflate_vq); - mutex_unlock(>balloon_lock); release_pages_by_pfn(vb->pfns, vb->num_pfns); } @@ -305,11 +301,13 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb->need_stats_update) stats_handle_request(vb); + mutex_lock(>balloon_lock); if (diff > 0) fill_balloon(vb, diff); else if (diff < 0) leak_balloon(vb, -diff); update_balloon_size(vb); + mutex_unlock(>balloon_lock); } return 0; } @@ -490,9 +488,11 @@ out: static void remove_common(struct virtio_balloon *vb) { /* There might be pages left in the balloon: free them. */ + mutex_lock(>balloon_lock); while (vb->num_pages) leak_balloon(vb, vb->num_pages); update_balloon_size(vb); + mutex_unlock(>balloon_lock); /* Now we reset the device so we can clean up the queues. */ vb->vdev->config->reset(vb->vdev); -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 2/2] virtio_balloon: auto-ballooning support
Automatic ballooning consists of dynamically adjusting the guest's balloon according to memory pressure in the host and in the guest. This commit implements the guest side of automatic balloning, which basically consists of registering a shrinker callback with the kernel, which will try to deflate the guest's balloon by the amount of pages being requested. The shrinker callback is only registered if the host supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. Automatic inflate is performed by the host. Here are some numbers. The test-case is to run 35 VMs (1G of RAM each) in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap. SWAP IN and SWAP OUT correspond to the number of pages swapped in and swapped out, respectively. Auto-ballooning disabled: RUN TIME(s) SWAP IN SWAP OUT 1634 930980 1588522 2610 627422 1362174 3649 1079847 1616367 4543 953289 1635379 5642 913237 1514000 Auto-ballooning enabled: RUN TIME(s) SWAP IN SWAP OUT 1629 901 12537 2624 981 18506 3626 573 9085 4631 2250 42534 5627 1610 20808 Signed-off-by: Luiz Capitulino --- drivers/virtio/virtio_balloon.c | 55 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 56 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 9d5fe2b..f9dcae8 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -71,6 +71,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* Memory shrinker */ + struct shrinker shrinker; }; static struct virtio_device_id id_table[] = { @@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page) pfns[i] = page_to_balloon_pfn(page) + i; } +/* This function should be called with vb->balloon_mutex held */ static void fill_balloon(struct virtio_balloon *vb, size_t num) { struct balloon_dev_info *vb_dev_info = vb->vb_dev_info; @@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num) } } +/* This function should be called with vb->balloon_mutex held */ static void leak_balloon(struct virtio_balloon *vb, size_t num) { struct page *page; @@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon *vb) , sizeof(actual)); } +static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb) +{ + return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE; +} + +static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control *sc) +{ + unsigned int nr_pages, new_target; + struct virtio_balloon *vb; + + vb = container_of(shrinker, struct virtio_balloon, shrinker); + if (!mutex_trylock(>balloon_lock)) { + return -1; + } + + nr_pages = balloon_get_nr_pages(vb); + if (!sc->nr_to_scan || !nr_pages) { + goto out; + } + + /* +* If the current balloon size is greater than the number of +* pages being reclaimed by the kernel, deflate only the needed +* amount. Otherwise deflate everything we have. +*/ + new_target = 0; + if (nr_pages > sc->nr_to_scan) { + new_target = nr_pages - sc->nr_to_scan; + } + + leak_balloon(vb, new_target); + update_balloon_size(vb); + nr_pages = balloon_get_nr_pages(vb); + +out: + mutex_unlock(>balloon_lock); + return nr_pages; +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -471,6 +515,13 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out_del_vqs; } + memset(>shrinker, 0, sizeof(vb->shrinker)); + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_AUTO_BALLOON)) { + vb->shrinker.shrink = balloon_shrinker; + vb->shrinker.seeks = DEFAULT_SEEKS; + register_shrinker(>shrinker); + } + return 0; out_del_vqs: @@ -487,6 +538,9 @@ out: static void remove_common(struct virtio_balloon *vb) { + if (vb->shrinker.shrink) + unregister_shrinker(>shrinker); + /* There might be pages left in the balloon: free them. */ mutex_lock(>balloon_lock); while (vb->num_pages) @@ -543,6 +597,7 @@ static int virtballoon_restore(struct virtio_device *vdev) static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, + VIRTIO_BALLOON_F_AUTO_BALLOON, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_ball
[RFC v2 0/2] virtio_balloon: auto-ballooning support
Hi, This series is a respin of automatic ballooning support I started working on last year. Patch 2/2 contains all relevant technical details and performance measurements results. This is in RFC state because it's a work in progress. Here's some information if you want to try automatic ballooning: 1. You'll need 3.9+ for the host kernel 2. Apply this series for the guest kernel 3. Grab the QEMU bits from: git://repo.or.cz/qemu/qmp-unstable.git balloon/auto-ballooning/memcg/rfc 4. Enable the balloon device in qemu with: -device virtio-balloon-pci,auto-balloon=true 5. Balloon the guest memory down, say from 1G to 256MB 6. Generate some pressure in the guest, say a kernel build with -j16 Any feedback is appreciated! Luiz Capitulino (2): virtio_balloon: move balloon_lock mutex to callers virtio_balloon: auto-ballooning support drivers/virtio/virtio_balloon.c | 63 ++--- include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 60 insertions(+), 4 deletions(-) -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC v2 0/2] virtio_balloon: auto-ballooning support
Hi, This series is a respin of automatic ballooning support I started working on last year. Patch 2/2 contains all relevant technical details and performance measurements results. This is in RFC state because it's a work in progress. Here's some information if you want to try automatic ballooning: 1. You'll need 3.9+ for the host kernel 2. Apply this series for the guest kernel 3. Grab the QEMU bits from: git://repo.or.cz/qemu/qmp-unstable.git balloon/auto-ballooning/memcg/rfc 4. Enable the balloon device in qemu with: -device virtio-balloon-pci,auto-balloon=true 5. Balloon the guest memory down, say from 1G to 256MB 6. Generate some pressure in the guest, say a kernel build with -j16 Any feedback is appreciated! Luiz Capitulino (2): virtio_balloon: move balloon_lock mutex to callers virtio_balloon: auto-ballooning support drivers/virtio/virtio_balloon.c | 63 ++--- include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 60 insertions(+), 4 deletions(-) -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 2/2] virtio_balloon: auto-ballooning support
Automatic ballooning consists of dynamically adjusting the guest's balloon according to memory pressure in the host and in the guest. This commit implements the guest side of automatic balloning, which basically consists of registering a shrinker callback with the kernel, which will try to deflate the guest's balloon by the amount of pages being requested. The shrinker callback is only registered if the host supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. Automatic inflate is performed by the host. Here are some numbers. The test-case is to run 35 VMs (1G of RAM each) in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap. SWAP IN and SWAP OUT correspond to the number of pages swapped in and swapped out, respectively. Auto-ballooning disabled: RUN TIME(s) SWAP IN SWAP OUT 1634 930980 1588522 2610 627422 1362174 3649 1079847 1616367 4543 953289 1635379 5642 913237 1514000 Auto-ballooning enabled: RUN TIME(s) SWAP IN SWAP OUT 1629 901 12537 2624 981 18506 3626 573 9085 4631 2250 42534 5627 1610 20808 Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- drivers/virtio/virtio_balloon.c | 55 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 56 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 9d5fe2b..f9dcae8 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -71,6 +71,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* Memory shrinker */ + struct shrinker shrinker; }; static struct virtio_device_id id_table[] = { @@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page) pfns[i] = page_to_balloon_pfn(page) + i; } +/* This function should be called with vb-balloon_mutex held */ static void fill_balloon(struct virtio_balloon *vb, size_t num) { struct balloon_dev_info *vb_dev_info = vb-vb_dev_info; @@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num) } } +/* This function should be called with vb-balloon_mutex held */ static void leak_balloon(struct virtio_balloon *vb, size_t num) { struct page *page; @@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon *vb) actual, sizeof(actual)); } +static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb) +{ + return vb-num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE; +} + +static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control *sc) +{ + unsigned int nr_pages, new_target; + struct virtio_balloon *vb; + + vb = container_of(shrinker, struct virtio_balloon, shrinker); + if (!mutex_trylock(vb-balloon_lock)) { + return -1; + } + + nr_pages = balloon_get_nr_pages(vb); + if (!sc-nr_to_scan || !nr_pages) { + goto out; + } + + /* +* If the current balloon size is greater than the number of +* pages being reclaimed by the kernel, deflate only the needed +* amount. Otherwise deflate everything we have. +*/ + new_target = 0; + if (nr_pages sc-nr_to_scan) { + new_target = nr_pages - sc-nr_to_scan; + } + + leak_balloon(vb, new_target); + update_balloon_size(vb); + nr_pages = balloon_get_nr_pages(vb); + +out: + mutex_unlock(vb-balloon_lock); + return nr_pages; +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -471,6 +515,13 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out_del_vqs; } + memset(vb-shrinker, 0, sizeof(vb-shrinker)); + if (virtio_has_feature(vb-vdev, VIRTIO_BALLOON_F_AUTO_BALLOON)) { + vb-shrinker.shrink = balloon_shrinker; + vb-shrinker.seeks = DEFAULT_SEEKS; + register_shrinker(vb-shrinker); + } + return 0; out_del_vqs: @@ -487,6 +538,9 @@ out: static void remove_common(struct virtio_balloon *vb) { + if (vb-shrinker.shrink) + unregister_shrinker(vb-shrinker); + /* There might be pages left in the balloon: free them. */ mutex_lock(vb-balloon_lock); while (vb-num_pages) @@ -543,6 +597,7 @@ static int virtballoon_restore(struct virtio_device *vdev) static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, + VIRTIO_BALLOON_F_AUTO_BALLOON, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 5e26f61..bd378a4 100644
[RFC 1/2] virtio_balloon: move balloon_lock mutex to callers
This commit moves the balloon_lock mutex out of the fill_balloon() and leak_balloon() functions to their callers. The reason for this change is that the next commit will introduce a shrinker callback for the balloon driver, which will also call leak_balloon() but will require different locking semantics. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- drivers/virtio/virtio_balloon.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd3ae32..9d5fe2b 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); - mutex_lock(vb-balloon_lock); for (vb-num_pfns = 0; vb-num_pfns num; vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { struct page *page = balloon_page_enqueue(vb_dev_info); @@ -154,7 +153,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* Did we get any? */ if (vb-num_pfns != 0) tell_host(vb, vb-inflate_vq); - mutex_unlock(vb-balloon_lock); } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -176,7 +174,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); - mutex_lock(vb-balloon_lock); for (vb-num_pfns = 0; vb-num_pfns num; vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { page = balloon_page_dequeue(vb_dev_info); @@ -192,7 +189,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * is true, we *have* to do it in this order */ tell_host(vb, vb-deflate_vq); - mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); } @@ -305,11 +301,13 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb-need_stats_update) stats_handle_request(vb); + mutex_lock(vb-balloon_lock); if (diff 0) fill_balloon(vb, diff); else if (diff 0) leak_balloon(vb, -diff); update_balloon_size(vb); + mutex_unlock(vb-balloon_lock); } return 0; } @@ -490,9 +488,11 @@ out: static void remove_common(struct virtio_balloon *vb) { /* There might be pages left in the balloon: free them. */ + mutex_lock(vb-balloon_lock); while (vb-num_pages) leak_balloon(vb, vb-num_pages); update_balloon_size(vb); + mutex_unlock(vb-balloon_lock); /* Now we reset the device so we can clean up the queues. */ vb-vdev-config-reset(vb-vdev); -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] virtio_balloon: add auto-ballooning support
On Sat, 12 Jan 2013 02:13:17 +0530 Amit Shah wrote: > On (Tue) 18 Dec 2012 [18:17:30], Luiz Capitulino wrote: > > The auto-ballooning feature automatically performs balloon inflate or > > deflate based on host and guest memory pressure. This can help to > > avoid swapping or worse in both, host and guest. > > > > Auto-ballooning has a host and a guest part. The host performs > > automatic inflate by requesting the guest to inflate its balloon > > when the host is facing memory pressure. The guest performs > > automatic deflate when it's facing memory pressure itself. It's > > expected that auto-inflate and auto-deflate will balance each > > other over time. > > > > This commit implements the guest side of auto-ballooning. > > > > To perform automatic deflate, the virtio_balloon driver registers > > a shrinker callback, which will try to deflate the guest's balloon > > on guest memory pressure just like if it were a cache. The shrinker > > callback is only registered if the host supports the > > VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. > > I'm wondering if guest should auto-deflate even when the AUTO_BALLOON > feature isn't supported by the host: if a guest is under pressure, > there's no way for it to tell the host and wait for the host to > deflate the balloon, so it may be beneficial to just go ahead and > deflate the balloon for all hosts. I see two problems with this. First, this will automagically override balloon changes done by the user; and second, if we don't have the auto-inflate part and if the host starts facing memory pressure, VMs may start getting OOM. > Similarly, on the host side, management can configure a VM to either > enable or disable auto-balloon (the auto-inflate part). So even the > host can do away with the feature advertisement and negotiation. > > Is there some use-case I'm missing where doing these actions after > feature negotiation is beneficial? > > > FIXMEs > > > > o the guest kernel seems to spin when the host is performing a long > >auto-inflate > > Is this introduced by the current patches? I'd assume it happens even > without it -- these patches just introduce some heuristics, the > mechanism has stayed the same. Good point, I'll check that. > > Signed-off-by: Luiz Capitulino > > --- > > drivers/virtio/virtio_balloon.c | 54 > > + > > include/uapi/linux/virtio_balloon.h | 1 + > > 2 files changed, 55 insertions(+) > > Patch looks good, just one thing: > > > + /* > > +* If the current balloon size is greater than the number of > > +* pages being reclaimed by the kernel, deflate only the needed > > +* amount. Otherwise deflate everything we have. > > +*/ > > + if (nr_pages > sc->nr_to_scan) { > > + new_target = nr_pages - sc->nr_to_scan; > > + } else { > > + new_target = 0; > > + } > > This looks better: > > new_target = 0; > if (nr_pages > sc->nr_to_scan) { > new_target = nr_pages - sc->nr_to_scan; > } Ok. > > > Thanks, > Amit > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] virtio_balloon: add auto-ballooning support
On Sat, 12 Jan 2013 02:13:17 +0530 Amit Shah amit.s...@redhat.com wrote: On (Tue) 18 Dec 2012 [18:17:30], Luiz Capitulino wrote: The auto-ballooning feature automatically performs balloon inflate or deflate based on host and guest memory pressure. This can help to avoid swapping or worse in both, host and guest. Auto-ballooning has a host and a guest part. The host performs automatic inflate by requesting the guest to inflate its balloon when the host is facing memory pressure. The guest performs automatic deflate when it's facing memory pressure itself. It's expected that auto-inflate and auto-deflate will balance each other over time. This commit implements the guest side of auto-ballooning. To perform automatic deflate, the virtio_balloon driver registers a shrinker callback, which will try to deflate the guest's balloon on guest memory pressure just like if it were a cache. The shrinker callback is only registered if the host supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. I'm wondering if guest should auto-deflate even when the AUTO_BALLOON feature isn't supported by the host: if a guest is under pressure, there's no way for it to tell the host and wait for the host to deflate the balloon, so it may be beneficial to just go ahead and deflate the balloon for all hosts. I see two problems with this. First, this will automagically override balloon changes done by the user; and second, if we don't have the auto-inflate part and if the host starts facing memory pressure, VMs may start getting OOM. Similarly, on the host side, management can configure a VM to either enable or disable auto-balloon (the auto-inflate part). So even the host can do away with the feature advertisement and negotiation. Is there some use-case I'm missing where doing these actions after feature negotiation is beneficial? FIXMEs o the guest kernel seems to spin when the host is performing a long auto-inflate Is this introduced by the current patches? I'd assume it happens even without it -- these patches just introduce some heuristics, the mechanism has stayed the same. Good point, I'll check that. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- drivers/virtio/virtio_balloon.c | 54 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 55 insertions(+) Patch looks good, just one thing: + /* +* If the current balloon size is greater than the number of +* pages being reclaimed by the kernel, deflate only the needed +* amount. Otherwise deflate everything we have. +*/ + if (nr_pages sc-nr_to_scan) { + new_target = nr_pages - sc-nr_to_scan; + } else { + new_target = 0; + } This looks better: new_target = 0; if (nr_pages sc-nr_to_scan) { new_target = nr_pages - sc-nr_to_scan; } Ok. Thanks, Amit -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Mempressure cgroup
On Fri, 4 Jan 2013 00:27:52 -0800 Anton Vorontsov wrote: > - I've split the pach into two: 'shrinker' and 'levels' parts. While the > full-fledged userland shrinker is an interesting idea, we don't have any > users ready for it, so I won't advocate for it too much. For the next version of the automatic balloon prototype I'm planning to give the user-space shrinker a try. It seems to be a better fit, as the current prototype has to guess by how much a guest's balloon should be inflated. Also, I think it would be worth it to list possible use-cases for the two functionalities in the series' intro email. This might help choosing both, one or another. Looking forward to the next version :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Mempressure cgroup
On Fri, 4 Jan 2013 00:27:52 -0800 Anton Vorontsov anton.voront...@linaro.org wrote: - I've split the pach into two: 'shrinker' and 'levels' parts. While the full-fledged userland shrinker is an interesting idea, we don't have any users ready for it, so I won't advocate for it too much. For the next version of the automatic balloon prototype I'm planning to give the user-space shrinker a try. It seems to be a better fit, as the current prototype has to guess by how much a guest's balloon should be inflated. Also, I think it would be worth it to list possible use-cases for the two functionalities in the series' intro email. This might help choosing both, one or another. Looking forward to the next version :) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Qemu-devel] [RFC 3/3] virtio-balloon: add auto-ballooning support
On Thu, 20 Dec 2012 05:24:12 + Dietmar Maurer wrote: > > > Wow, you're fast! And I'm glad that it works for you, so we have two > > > full-featured mempressure cgroup users already. > > > > Thanks, although I think we need more testing to be sure this does what we > > want. I mean, the basic mechanics does work, but my testing has been very > > light so far. > > Is it possible to assign different weights for different VMs, something like > the vmware 'shares' setting? This series doesn't have the "weight" concept, it has auto-balloon-level and auto-balloon-granularity. The former allows you to choose which type of kernel low-mem level you want auto-inflate to trigger. The latter allows you to say by how much the balloon should grow (as a percentage of the guest's current memory). Both of them are per VM. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Qemu-devel] [RFC 3/3] virtio-balloon: add auto-ballooning support
On Thu, 20 Dec 2012 05:24:12 + Dietmar Maurer diet...@proxmox.com wrote: Wow, you're fast! And I'm glad that it works for you, so we have two full-featured mempressure cgroup users already. Thanks, although I think we need more testing to be sure this does what we want. I mean, the basic mechanics does work, but my testing has been very light so far. Is it possible to assign different weights for different VMs, something like the vmware 'shares' setting? This series doesn't have the weight concept, it has auto-balloon-level and auto-balloon-granularity. The former allows you to choose which type of kernel low-mem level you want auto-inflate to trigger. The latter allows you to say by how much the balloon should grow (as a percentage of the guest's current memory). Both of them are per VM. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 1/2] virtio_balloon: move locking to the balloon thread
On Wed, 19 Dec 2012 09:55:58 -0200 Rafael Aquini wrote: > On Tue, Dec 18, 2012 at 06:17:29PM -0200, Luiz Capitulino wrote: > > Today, the balloon_lock mutex is taken and released by fill_balloon() > > and leak_balloon() when both functions are entered and when they > > return. > > > > This commit moves the locking to the caller instead, which is > > the balloon() thread. The balloon thread is the sole caller of those > > functions today. > > > > The reason for this move is that the next commit will introduce > > a shrinker callback for the balloon driver, which will also call > > leak_balloon() but will require different locking semantics. > > > > Signed-off-by: Luiz Capitulino > > --- > > drivers/virtio/virtio_balloon.c | 6 ++ > > 1 file changed, 2 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/virtio/virtio_balloon.c > > b/drivers/virtio/virtio_balloon.c > > index 2a70558..877e695 100644 > > --- a/drivers/virtio/virtio_balloon.c > > +++ b/drivers/virtio/virtio_balloon.c > > @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, > > size_t num) > > /* We can only do one array worth at a time. */ > > num = min(num, ARRAY_SIZE(vb->pfns)); > > > > - mutex_lock(>balloon_lock); > > for (vb->num_pfns = 0; vb->num_pfns < num; > > vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { > > struct page *page = balloon_page_enqueue(vb_dev_info); > > @@ -155,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, > > size_t num) > > /* Did we get any? */ > > if (vb->num_pfns != 0) > > tell_host(vb, vb->inflate_vq); > > - mutex_unlock(>balloon_lock); > > } > > > > Since you're removing the locking scheme from within this function, I think it > would be a good idea introduce a comment stating its caller must held the > mutex > vb->balloon_lock. Will address all comments for v1 (or rfc v2), thanks Rafael. > > > > static void release_pages_by_pfn(const u32 pfns[], unsigned int num) > > @@ -177,7 +175,6 @@ static void leak_balloon(struct virtio_balloon *vb, > > size_t num) > > /* We can only do one array worth at a time. */ > > num = min(num, ARRAY_SIZE(vb->pfns)); > > > > - mutex_lock(>balloon_lock); > > for (vb->num_pfns = 0; vb->num_pfns < num; > > vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { > > page = balloon_page_dequeue(vb_dev_info); > > @@ -193,7 +190,6 @@ static void leak_balloon(struct virtio_balloon *vb, > > size_t num) > > * is true, we *have* to do it in this order > > */ > > tell_host(vb, vb->deflate_vq); > > - mutex_unlock(>balloon_lock); > > release_pages_by_pfn(vb->pfns, vb->num_pfns); > > } > > > > ditto > > > > @@ -306,11 +302,13 @@ static int balloon(void *_vballoon) > > || freezing(current)); > > if (vb->need_stats_update) > > stats_handle_request(vb); > > + mutex_lock(>balloon_lock); > > if (diff > 0) > > fill_balloon(vb, diff); > > else if (diff < 0) > > leak_balloon(vb, -diff); > > update_balloon_size(vb); > > + mutex_unlock(>balloon_lock); > > } > > return 0; > > } > > Just a nitpick: > As leak_balloon() is also called at remove_common(), you'll need to introduce > the > mutex there, similarly. > > > Thanks for move this forward. > > Cheers! > -- Rafael > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 3/3] virtio-balloon: add auto-ballooning support
On Tue, 18 Dec 2012 14:53:30 -0800 Anton Vorontsov wrote: > Hello Luiz, > > On Tue, Dec 18, 2012 at 06:16:55PM -0200, Luiz Capitulino wrote: > > The auto-ballooning feature automatically performs balloon inflate > > or deflate based on host and guest memory pressure. This can help to > > avoid swapping or worse in both, host and guest. > > > > Auto-ballooning has a host and a guest part. The host performs > > automatic inflate by requesting the guest to inflate its balloon > > when the host is facing memory pressure. The guest performs > > automatic deflate when it's facing memory pressure itself. It's > > expected that auto-inflate and auto-deflate will balance each > > other over time. > > > > This commit implements the host side of auto-ballooning. > > > > To be notified of host memory pressure, this commit makes use of this > > kernel API proposal being discussed upstream: > > > > http://marc.info/?l=linux-mm=135513372205134=2 > > Wow, you're fast! And I'm glad that it works for you, so we have two > full-featured mempressure cgroup users already. Thanks, although I think we need more testing to be sure this does what we want. I mean, the basic mechanics does work, but my testing has been very light so far. > Even though it is a qemu patch, I think we should Cc linux-mm folks on it, > just to let them know the great news. I'll do it next time. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 3/3] virtio-balloon: add auto-ballooning support
On Tue, 18 Dec 2012 14:53:30 -0800 Anton Vorontsov anton.voront...@linaro.org wrote: Hello Luiz, On Tue, Dec 18, 2012 at 06:16:55PM -0200, Luiz Capitulino wrote: The auto-ballooning feature automatically performs balloon inflate or deflate based on host and guest memory pressure. This can help to avoid swapping or worse in both, host and guest. Auto-ballooning has a host and a guest part. The host performs automatic inflate by requesting the guest to inflate its balloon when the host is facing memory pressure. The guest performs automatic deflate when it's facing memory pressure itself. It's expected that auto-inflate and auto-deflate will balance each other over time. This commit implements the host side of auto-ballooning. To be notified of host memory pressure, this commit makes use of this kernel API proposal being discussed upstream: http://marc.info/?l=linux-mmm=135513372205134w=2 Wow, you're fast! And I'm glad that it works for you, so we have two full-featured mempressure cgroup users already. Thanks, although I think we need more testing to be sure this does what we want. I mean, the basic mechanics does work, but my testing has been very light so far. Even though it is a qemu patch, I think we should Cc linux-mm folks on it, just to let them know the great news. I'll do it next time. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 1/2] virtio_balloon: move locking to the balloon thread
On Wed, 19 Dec 2012 09:55:58 -0200 Rafael Aquini aqu...@redhat.com wrote: On Tue, Dec 18, 2012 at 06:17:29PM -0200, Luiz Capitulino wrote: Today, the balloon_lock mutex is taken and released by fill_balloon() and leak_balloon() when both functions are entered and when they return. This commit moves the locking to the caller instead, which is the balloon() thread. The balloon thread is the sole caller of those functions today. The reason for this move is that the next commit will introduce a shrinker callback for the balloon driver, which will also call leak_balloon() but will require different locking semantics. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- drivers/virtio/virtio_balloon.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 2a70558..877e695 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); - mutex_lock(vb-balloon_lock); for (vb-num_pfns = 0; vb-num_pfns num; vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { struct page *page = balloon_page_enqueue(vb_dev_info); @@ -155,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* Did we get any? */ if (vb-num_pfns != 0) tell_host(vb, vb-inflate_vq); - mutex_unlock(vb-balloon_lock); } Since you're removing the locking scheme from within this function, I think it would be a good idea introduce a comment stating its caller must held the mutex vb-balloon_lock. Will address all comments for v1 (or rfc v2), thanks Rafael. static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -177,7 +175,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); - mutex_lock(vb-balloon_lock); for (vb-num_pfns = 0; vb-num_pfns num; vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { page = balloon_page_dequeue(vb_dev_info); @@ -193,7 +190,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * is true, we *have* to do it in this order */ tell_host(vb, vb-deflate_vq); - mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); } ditto @@ -306,11 +302,13 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb-need_stats_update) stats_handle_request(vb); + mutex_lock(vb-balloon_lock); if (diff 0) fill_balloon(vb, diff); else if (diff 0) leak_balloon(vb, -diff); update_balloon_size(vb); + mutex_unlock(vb-balloon_lock); } return 0; } Just a nitpick: As leak_balloon() is also called at remove_common(), you'll need to introduce the mutex there, similarly. Thanks for move this forward. Cheers! -- Rafael -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 2/2] virtio_balloon: add auto-ballooning support
The auto-ballooning feature automatically performs balloon inflate or deflate based on host and guest memory pressure. This can help to avoid swapping or worse in both, host and guest. Auto-ballooning has a host and a guest part. The host performs automatic inflate by requesting the guest to inflate its balloon when the host is facing memory pressure. The guest performs automatic deflate when it's facing memory pressure itself. It's expected that auto-inflate and auto-deflate will balance each other over time. This commit implements the guest side of auto-ballooning. To perform automatic deflate, the virtio_balloon driver registers a shrinker callback, which will try to deflate the guest's balloon on guest memory pressure just like if it were a cache. The shrinker callback is only registered if the host supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. FIXMEs o the guest kernel seems to spin when the host is performing a long auto-inflate Signed-off-by: Luiz Capitulino --- drivers/virtio/virtio_balloon.c | 54 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 55 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 877e695..12fb70e 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -71,6 +71,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* Memory shrinker */ + struct shrinker shrinker; }; static struct virtio_device_id id_table[] = { @@ -286,6 +289,46 @@ static void update_balloon_size(struct virtio_balloon *vb) , sizeof(actual)); } +static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb) +{ + return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE; +} + +static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control *sc) +{ + unsigned int nr_pages, new_target; + struct virtio_balloon *vb; + + vb = container_of(shrinker, struct virtio_balloon, shrinker); + if (!mutex_trylock(>balloon_lock)) { + return -1; + } + + nr_pages = balloon_get_nr_pages(vb); + if (!sc->nr_to_scan || !nr_pages) { + goto out; + } + + /* +* If the current balloon size is greater than the number of +* pages being reclaimed by the kernel, deflate only the needed +* amount. Otherwise deflate everything we have. +*/ + if (nr_pages > sc->nr_to_scan) { + new_target = nr_pages - sc->nr_to_scan; + } else { + new_target = 0; + } + + leak_balloon(vb, new_target); + update_balloon_size(vb); + nr_pages = balloon_get_nr_pages(vb); + +out: + mutex_unlock(>balloon_lock); + return nr_pages; +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -472,6 +515,13 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out_del_vqs; } + memset(>shrinker, 0, sizeof(vb->shrinker)); + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_AUTO_BALLOON)) { + vb->shrinker.shrink = balloon_shrinker; + vb->shrinker.seeks = DEFAULT_SEEKS; + register_shrinker(>shrinker); + } + return 0; out_del_vqs: @@ -488,6 +538,9 @@ out: static void remove_common(struct virtio_balloon *vb) { + if (vb->shrinker.shrink) + unregister_shrinker(>shrinker); + /* There might be pages left in the balloon: free them. */ while (vb->num_pages) leak_balloon(vb, vb->num_pages); @@ -542,6 +595,7 @@ static int virtballoon_restore(struct virtio_device *vdev) static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, + VIRTIO_BALLOON_F_AUTO_BALLOON, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 652dc8b..3764cac 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -31,6 +31,7 @@ /* The feature bitmap for virtio balloon */ #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming pages */ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ +#define VIRTIO_BALLOON_F_AUTO_BALLOON 2 /* Automatic ballooning */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 -- 1.8.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 0/2] auto-ballooning prototype (guest part)
Hi, This series implements an early protoype of a new feature called automatic ballooning. This is based on ideas by Rik van Riel and I also got some help from Rafael Aquini (misconceptions and bugs are all mine, though). The auto-ballooning feature automatically performs balloon inflate and deflate based on host and guest memory pressure. This can help to avoid swapping or worse in both, host and guest. This series implements the guest part. Full details on the design and implementation can be found in patch 2/2. To test this you will also need the host part, which can be found here (please, never mind the repo name): git://repo.or.cz/qemu/qmp-unstable.git balloon/auto-ballooning/rfc Any feedback is appreciated! Luiz Capitulino (2): virtio_balloon: move locking to the balloon thread virtio_balloon: add auto-ballooning support drivers/virtio/virtio_balloon.c | 60 ++--- include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 57 insertions(+), 4 deletions(-) -- 1.8.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 1/2] virtio_balloon: move locking to the balloon thread
Today, the balloon_lock mutex is taken and released by fill_balloon() and leak_balloon() when both functions are entered and when they return. This commit moves the locking to the caller instead, which is the balloon() thread. The balloon thread is the sole caller of those functions today. The reason for this move is that the next commit will introduce a shrinker callback for the balloon driver, which will also call leak_balloon() but will require different locking semantics. Signed-off-by: Luiz Capitulino --- drivers/virtio/virtio_balloon.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 2a70558..877e695 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb->pfns)); - mutex_lock(>balloon_lock); for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { struct page *page = balloon_page_enqueue(vb_dev_info); @@ -155,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* Did we get any? */ if (vb->num_pfns != 0) tell_host(vb, vb->inflate_vq); - mutex_unlock(>balloon_lock); } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -177,7 +175,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb->pfns)); - mutex_lock(>balloon_lock); for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { page = balloon_page_dequeue(vb_dev_info); @@ -193,7 +190,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * is true, we *have* to do it in this order */ tell_host(vb, vb->deflate_vq); - mutex_unlock(>balloon_lock); release_pages_by_pfn(vb->pfns, vb->num_pfns); } @@ -306,11 +302,13 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb->need_stats_update) stats_handle_request(vb); + mutex_lock(>balloon_lock); if (diff > 0) fill_balloon(vb, diff); else if (diff < 0) leak_balloon(vb, -diff); update_balloon_size(vb); + mutex_unlock(>balloon_lock); } return 0; } -- 1.8.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 1/2] virtio_balloon: move locking to the balloon thread
Today, the balloon_lock mutex is taken and released by fill_balloon() and leak_balloon() when both functions are entered and when they return. This commit moves the locking to the caller instead, which is the balloon() thread. The balloon thread is the sole caller of those functions today. The reason for this move is that the next commit will introduce a shrinker callback for the balloon driver, which will also call leak_balloon() but will require different locking semantics. Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- drivers/virtio/virtio_balloon.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 2a70558..877e695 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); - mutex_lock(vb-balloon_lock); for (vb-num_pfns = 0; vb-num_pfns num; vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { struct page *page = balloon_page_enqueue(vb_dev_info); @@ -155,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num) /* Did we get any? */ if (vb-num_pfns != 0) tell_host(vb, vb-inflate_vq); - mutex_unlock(vb-balloon_lock); } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -177,7 +175,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); - mutex_lock(vb-balloon_lock); for (vb-num_pfns = 0; vb-num_pfns num; vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { page = balloon_page_dequeue(vb_dev_info); @@ -193,7 +190,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num) * is true, we *have* to do it in this order */ tell_host(vb, vb-deflate_vq); - mutex_unlock(vb-balloon_lock); release_pages_by_pfn(vb-pfns, vb-num_pfns); } @@ -306,11 +302,13 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb-need_stats_update) stats_handle_request(vb); + mutex_lock(vb-balloon_lock); if (diff 0) fill_balloon(vb, diff); else if (diff 0) leak_balloon(vb, -diff); update_balloon_size(vb); + mutex_unlock(vb-balloon_lock); } return 0; } -- 1.8.0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 0/2] auto-ballooning prototype (guest part)
Hi, This series implements an early protoype of a new feature called automatic ballooning. This is based on ideas by Rik van Riel and I also got some help from Rafael Aquini (misconceptions and bugs are all mine, though). The auto-ballooning feature automatically performs balloon inflate and deflate based on host and guest memory pressure. This can help to avoid swapping or worse in both, host and guest. This series implements the guest part. Full details on the design and implementation can be found in patch 2/2. To test this you will also need the host part, which can be found here (please, never mind the repo name): git://repo.or.cz/qemu/qmp-unstable.git balloon/auto-ballooning/rfc Any feedback is appreciated! Luiz Capitulino (2): virtio_balloon: move locking to the balloon thread virtio_balloon: add auto-ballooning support drivers/virtio/virtio_balloon.c | 60 ++--- include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 57 insertions(+), 4 deletions(-) -- 1.8.0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 2/2] virtio_balloon: add auto-ballooning support
The auto-ballooning feature automatically performs balloon inflate or deflate based on host and guest memory pressure. This can help to avoid swapping or worse in both, host and guest. Auto-ballooning has a host and a guest part. The host performs automatic inflate by requesting the guest to inflate its balloon when the host is facing memory pressure. The guest performs automatic deflate when it's facing memory pressure itself. It's expected that auto-inflate and auto-deflate will balance each other over time. This commit implements the guest side of auto-ballooning. To perform automatic deflate, the virtio_balloon driver registers a shrinker callback, which will try to deflate the guest's balloon on guest memory pressure just like if it were a cache. The shrinker callback is only registered if the host supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit. FIXMEs o the guest kernel seems to spin when the host is performing a long auto-inflate Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- drivers/virtio/virtio_balloon.c | 54 + include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 55 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 877e695..12fb70e 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -71,6 +71,9 @@ struct virtio_balloon /* Memory statistics */ int need_stats_update; struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + + /* Memory shrinker */ + struct shrinker shrinker; }; static struct virtio_device_id id_table[] = { @@ -286,6 +289,46 @@ static void update_balloon_size(struct virtio_balloon *vb) actual, sizeof(actual)); } +static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb) +{ + return vb-num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE; +} + +static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control *sc) +{ + unsigned int nr_pages, new_target; + struct virtio_balloon *vb; + + vb = container_of(shrinker, struct virtio_balloon, shrinker); + if (!mutex_trylock(vb-balloon_lock)) { + return -1; + } + + nr_pages = balloon_get_nr_pages(vb); + if (!sc-nr_to_scan || !nr_pages) { + goto out; + } + + /* +* If the current balloon size is greater than the number of +* pages being reclaimed by the kernel, deflate only the needed +* amount. Otherwise deflate everything we have. +*/ + if (nr_pages sc-nr_to_scan) { + new_target = nr_pages - sc-nr_to_scan; + } else { + new_target = 0; + } + + leak_balloon(vb, new_target); + update_balloon_size(vb); + nr_pages = balloon_get_nr_pages(vb); + +out: + mutex_unlock(vb-balloon_lock); + return nr_pages; +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -472,6 +515,13 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out_del_vqs; } + memset(vb-shrinker, 0, sizeof(vb-shrinker)); + if (virtio_has_feature(vb-vdev, VIRTIO_BALLOON_F_AUTO_BALLOON)) { + vb-shrinker.shrink = balloon_shrinker; + vb-shrinker.seeks = DEFAULT_SEEKS; + register_shrinker(vb-shrinker); + } + return 0; out_del_vqs: @@ -488,6 +538,9 @@ out: static void remove_common(struct virtio_balloon *vb) { + if (vb-shrinker.shrink) + unregister_shrinker(vb-shrinker); + /* There might be pages left in the balloon: free them. */ while (vb-num_pages) leak_balloon(vb, vb-num_pages); @@ -542,6 +595,7 @@ static int virtballoon_restore(struct virtio_device *vdev) static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, + VIRTIO_BALLOON_F_AUTO_BALLOON, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 652dc8b..3764cac 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -31,6 +31,7 @@ /* The feature bitmap for virtio balloon */ #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming pages */ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ +#define VIRTIO_BALLOON_F_AUTO_BALLOON 2 /* Automatic ballooning */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 -- 1.8.0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Add mempressure cgroup
On Wed, 28 Nov 2012 17:27:51 -0800 Anton Vorontsov wrote: > On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote: > [...] > > Compare this with the shrink_slab() shrinkers. With these, the VM can > > query and then control the clients. If something goes wrong or is out > > of balance, it's the VM's problem to solve. > > > > So I'm thinking that a better design would be one which puts the kernel > > VM in control of userspace scanning and freeing. Presumably with a > > query-and-control interface similar to the slab shrinkers. > > Thanks for the ideas, Andrew. > > Query-and-control scheme looks very attractive, and that's actually > resembles my "balance" level idea, when userland tells the kernel how much > reclaimable memory it has. Except the your scheme works in the reverse > direction, i.e. the kernel becomes in charge. > > But there is one, rather major issue: we're crossing kernel-userspace > boundary. And with the scheme we'll have to cross the boundary four times: > query / reply-available / control / reply-shrunk / (and repeat if > necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat > synchronously (all the four stages), and/or we have to make a "userspace > shrinker" thread working in parallel with the normal shrinker, and here, > I'm afraid, we'll see more strange interactions. :) Wouldn't this be just like kswapd? > But there is a good news: for these kind of fine-grained control we have a > better interface, where we don't have to communicate [very often] w/ the > kernel. These are "volatile ranges", where userland itself marks chunks of > data as "I might need it, but I won't cry if you recycle it; but when I > access it next time, let me know if you actually recycled it". Yes, > userland no longer able to decide which exact page it permits to recycle, > but we don't have use-cases when we actually care that much. And if we do, > we'd rather introduce volatile LRUs with different priorities, or > something alike. I'm new to this stuff so please take this with a grain of salt, but I'm not sure volatile ranges would be a good fit for our use case: we want to make (kvm) guests reduce their memory when the host is getting memory pressure. Having a notification seems just fine for this purpose, but I'm not sure how this would work with volatile ranges, as we'd have to mark pages volatile in advance. Andrew's idea seems to give a lot more freedom to apps, IMHO. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Add mempressure cgroup
On Wed, 28 Nov 2012 17:27:51 -0800 Anton Vorontsov anton.voront...@linaro.org wrote: On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote: [...] Compare this with the shrink_slab() shrinkers. With these, the VM can query and then control the clients. If something goes wrong or is out of balance, it's the VM's problem to solve. So I'm thinking that a better design would be one which puts the kernel VM in control of userspace scanning and freeing. Presumably with a query-and-control interface similar to the slab shrinkers. Thanks for the ideas, Andrew. Query-and-control scheme looks very attractive, and that's actually resembles my balance level idea, when userland tells the kernel how much reclaimable memory it has. Except the your scheme works in the reverse direction, i.e. the kernel becomes in charge. But there is one, rather major issue: we're crossing kernel-userspace boundary. And with the scheme we'll have to cross the boundary four times: query / reply-available / control / reply-shrunk / (and repeat if necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat synchronously (all the four stages), and/or we have to make a userspace shrinker thread working in parallel with the normal shrinker, and here, I'm afraid, we'll see more strange interactions. :) Wouldn't this be just like kswapd? But there is a good news: for these kind of fine-grained control we have a better interface, where we don't have to communicate [very often] w/ the kernel. These are volatile ranges, where userland itself marks chunks of data as I might need it, but I won't cry if you recycle it; but when I access it next time, let me know if you actually recycled it. Yes, userland no longer able to decide which exact page it permits to recycle, but we don't have use-cases when we actually care that much. And if we do, we'd rather introduce volatile LRUs with different priorities, or something alike. I'm new to this stuff so please take this with a grain of salt, but I'm not sure volatile ranges would be a good fit for our use case: we want to make (kvm) guests reduce their memory when the host is getting memory pressure. Having a notification seems just fine for this purpose, but I'm not sure how this would work with volatile ranges, as we'd have to mark pages volatile in advance. Andrew's idea seems to give a lot more freedom to apps, IMHO. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications
Hi Anton, On Wed, 7 Nov 2012 02:53:49 -0800 Anton Vorontsov wrote: > Hi all, > > This is the third RFC. As suggested by Minchan Kim, the API is much > simplified now (comparing to vmevent_fd): Which tree is this against? I'd like to try this series, but it doesn't apply to Linus tree. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications
Hi Anton, On Wed, 7 Nov 2012 02:53:49 -0800 Anton Vorontsov anton.voront...@linaro.org wrote: Hi all, This is the third RFC. As suggested by Minchan Kim, the API is much simplified now (comparing to vmevent_fd): Which tree is this against? I'd like to try this series, but it doesn't apply to Linus tree. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Qemu-devel] [PATCH v10] kvm: notify host when the guest is panicked
On Wed, 29 Aug 2012 13:18:54 +0800 Wen Congyang wrote: > We can know the guest is panicked when the guest runs on xen. > But we do not have such feature on kvm. What's the status of this series? It got lost in my queue and I ended up not reviewing it, but it seems to be stuck. > > Another purpose of this feature is: management app(for example: > libvirt) can do auto dump when the guest is panicked. If management > app does not do auto dump, the guest's user can do dump by hand if > he sees the guest is panicked. > > We have three solutions to implement this feature: > 1. use vmcall > 2. use I/O port > 3. use virtio-serial. > > We have decided to avoid touching hypervisor. The reason why I choose > choose the I/O port is: > 1. it is easier to implememt > 2. it does not depend any virtual device > 3. it can work when starting the kernel > > Signed-off-by: Wen Congyang > --- > Documentation/virtual/kvm/pv_event.txt | 32 > > arch/ia64/include/asm/kvm_para.h | 14 ++ > arch/powerpc/include/asm/kvm_para.h| 14 ++ > arch/s390/include/asm/kvm_para.h | 14 ++ > arch/x86/include/asm/kvm_para.h| 27 +++ > arch/x86/kernel/kvm.c | 25 + > include/linux/kvm_para.h | 23 +++ > 7 files changed, 149 insertions(+), 0 deletions(-) > create mode 100644 Documentation/virtual/kvm/pv_event.txt > > diff --git a/Documentation/virtual/kvm/pv_event.txt > b/Documentation/virtual/kvm/pv_event.txt > new file mode 100644 > index 000..bb04de0 > --- /dev/null > +++ b/Documentation/virtual/kvm/pv_event.txt > @@ -0,0 +1,32 @@ > +The KVM paravirtual event interface > += > + > +Initializing the paravirtual event interface > +== > +kvm_pv_event_init() > +Argiments: > + None > + > +Return Value: > + 0: The guest kernel can use paravirtual event interface. > + 1: The guest kernel can't use paravirtual event interface. > + > +Querying whether the event can be ejected > +== > +kvm_pv_has_feature() > +Arguments: > + feature: The bit value of this paravirtual event to query > + > +Return Value: > + 0 : The guest kernel can't eject this paravirtual event. > + -1: The guest kernel can eject this paravirtual event. > + > + > +Ejecting paravirtual event > +== > +kvm_pv_eject_event() > +Arguments: > + event: The event to be ejected. > + > +Return Value: > + None > diff --git a/arch/ia64/include/asm/kvm_para.h > b/arch/ia64/include/asm/kvm_para.h > index 2019cb9..b5ec658 100644 > --- a/arch/ia64/include/asm/kvm_para.h > +++ b/arch/ia64/include/asm/kvm_para.h > @@ -31,6 +31,20 @@ static inline bool kvm_check_and_clear_guest_paused(void) > return false; > } > > +static inline int kvm_arch_pv_event_init(void) > +{ > + return 0; > +} > + > +static inline unsigned int kvm_arch_pv_features(void) > +{ > + return 0; > +} > + > +static inline void kvm_arch_pv_eject_event(unsigned int event) > +{ > +} > + > #endif > > #endif > diff --git a/arch/powerpc/include/asm/kvm_para.h > b/arch/powerpc/include/asm/kvm_para.h > index c18916b..01b98c7 100644 > --- a/arch/powerpc/include/asm/kvm_para.h > +++ b/arch/powerpc/include/asm/kvm_para.h > @@ -211,6 +211,20 @@ static inline bool kvm_check_and_clear_guest_paused(void) > return false; > } > > +static inline int kvm_arch_pv_event_init(void) > +{ > + return 0; > +} > + > +static inline unsigned int kvm_arch_pv_features(void) > +{ > + return 0; > +} > + > +static inline void kvm_arch_pv_eject_event(unsigned int event) > +{ > +} > + > #endif /* __KERNEL__ */ > > #endif /* __POWERPC_KVM_PARA_H__ */ > diff --git a/arch/s390/include/asm/kvm_para.h > b/arch/s390/include/asm/kvm_para.h > index da44867..00ce058 100644 > --- a/arch/s390/include/asm/kvm_para.h > +++ b/arch/s390/include/asm/kvm_para.h > @@ -154,6 +154,20 @@ static inline bool kvm_check_and_clear_guest_paused(void) > return false; > } > > +static inline int kvm_arch_pv_event_init(void) > +{ > + return 0; > +} > + > +static inline unsigned int kvm_arch_pv_features(void) > +{ > + return 0; > +} > + > +static inline void kvm_arch_pv_eject_event(unsigned int event) > +{ > +} > + > #endif > > #endif /* __S390_KVM_PARA_H */ > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h > index 2f7712e..7d297f0 100644 > --- a/arch/x86/include/asm/kvm_para.h > +++ b/arch/x86/include/asm/kvm_para.h > @@ -96,8 +96,11 @@ struct kvm_vcpu_pv_apf_data { > #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK > #define KVM_PV_EOI_DISABLED 0x0 > > +#define KVM_PV_EVENT_PORT(0x505UL) > + > #ifdef __KERNEL__ > #include > +#include > > extern void kvmclock_init(void); > extern int kvm_register_clock(char *txt); > @@ -228,6 +231,30 @@ static inline void
Re: [Qemu-devel] [PATCH v10] kvm: notify host when the guest is panicked
On Wed, 29 Aug 2012 13:18:54 +0800 Wen Congyang we...@cn.fujitsu.com wrote: We can know the guest is panicked when the guest runs on xen. But we do not have such feature on kvm. What's the status of this series? It got lost in my queue and I ended up not reviewing it, but it seems to be stuck. Another purpose of this feature is: management app(for example: libvirt) can do auto dump when the guest is panicked. If management app does not do auto dump, the guest's user can do dump by hand if he sees the guest is panicked. We have three solutions to implement this feature: 1. use vmcall 2. use I/O port 3. use virtio-serial. We have decided to avoid touching hypervisor. The reason why I choose choose the I/O port is: 1. it is easier to implememt 2. it does not depend any virtual device 3. it can work when starting the kernel Signed-off-by: Wen Congyang we...@cn.fujitsu.com --- Documentation/virtual/kvm/pv_event.txt | 32 arch/ia64/include/asm/kvm_para.h | 14 ++ arch/powerpc/include/asm/kvm_para.h| 14 ++ arch/s390/include/asm/kvm_para.h | 14 ++ arch/x86/include/asm/kvm_para.h| 27 +++ arch/x86/kernel/kvm.c | 25 + include/linux/kvm_para.h | 23 +++ 7 files changed, 149 insertions(+), 0 deletions(-) create mode 100644 Documentation/virtual/kvm/pv_event.txt diff --git a/Documentation/virtual/kvm/pv_event.txt b/Documentation/virtual/kvm/pv_event.txt new file mode 100644 index 000..bb04de0 --- /dev/null +++ b/Documentation/virtual/kvm/pv_event.txt @@ -0,0 +1,32 @@ +The KVM paravirtual event interface += + +Initializing the paravirtual event interface +== +kvm_pv_event_init() +Argiments: + None + +Return Value: + 0: The guest kernel can use paravirtual event interface. + 1: The guest kernel can't use paravirtual event interface. + +Querying whether the event can be ejected +== +kvm_pv_has_feature() +Arguments: + feature: The bit value of this paravirtual event to query + +Return Value: + 0 : The guest kernel can't eject this paravirtual event. + -1: The guest kernel can eject this paravirtual event. + + +Ejecting paravirtual event +== +kvm_pv_eject_event() +Arguments: + event: The event to be ejected. + +Return Value: + None diff --git a/arch/ia64/include/asm/kvm_para.h b/arch/ia64/include/asm/kvm_para.h index 2019cb9..b5ec658 100644 --- a/arch/ia64/include/asm/kvm_para.h +++ b/arch/ia64/include/asm/kvm_para.h @@ -31,6 +31,20 @@ static inline bool kvm_check_and_clear_guest_paused(void) return false; } +static inline int kvm_arch_pv_event_init(void) +{ + return 0; +} + +static inline unsigned int kvm_arch_pv_features(void) +{ + return 0; +} + +static inline void kvm_arch_pv_eject_event(unsigned int event) +{ +} + #endif #endif diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index c18916b..01b98c7 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -211,6 +211,20 @@ static inline bool kvm_check_and_clear_guest_paused(void) return false; } +static inline int kvm_arch_pv_event_init(void) +{ + return 0; +} + +static inline unsigned int kvm_arch_pv_features(void) +{ + return 0; +} + +static inline void kvm_arch_pv_eject_event(unsigned int event) +{ +} + #endif /* __KERNEL__ */ #endif /* __POWERPC_KVM_PARA_H__ */ diff --git a/arch/s390/include/asm/kvm_para.h b/arch/s390/include/asm/kvm_para.h index da44867..00ce058 100644 --- a/arch/s390/include/asm/kvm_para.h +++ b/arch/s390/include/asm/kvm_para.h @@ -154,6 +154,20 @@ static inline bool kvm_check_and_clear_guest_paused(void) return false; } +static inline int kvm_arch_pv_event_init(void) +{ + return 0; +} + +static inline unsigned int kvm_arch_pv_features(void) +{ + return 0; +} + +static inline void kvm_arch_pv_eject_event(unsigned int event) +{ +} + #endif #endif /* __S390_KVM_PARA_H */ diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 2f7712e..7d297f0 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -96,8 +96,11 @@ struct kvm_vcpu_pv_apf_data { #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK #define KVM_PV_EOI_DISABLED 0x0 +#define KVM_PV_EVENT_PORT(0x505UL) + #ifdef __KERNEL__ #include asm/processor.h +#include linux/ioport.h extern void kvmclock_init(void); extern int kvm_register_clock(char *txt); @@ -228,6 +231,30 @@ static inline void kvm_disable_steal_time(void) } #endif +static inline int kvm_arch_pv_event_init(void) +{ +