Re: [PATCH] vmpressure: implement strict mode

2013-07-02 Thread Luiz Capitulino
On Tue, 2 Jul 2013 21:47:03 +0200
Pavel Machek  wrote:

> On Tue 2013-07-02 11:06:28, Luiz Capitulino wrote:
> > On Mon, 1 Jul 2013 10:51:03 +0200
> > Pavel Machek  wrote:
> > 
> > > Hi!
> > > 
> > > > diff --git a/Documentation/cgroups/memory.txt 
> > > > b/Documentation/cgroups/memory.txt
> > > > index ddf4f93..3c589cf 100644
> > > > --- a/Documentation/cgroups/memory.txt
> > > > +++ b/Documentation/cgroups/memory.txt
> > > > @@ -807,12 +807,14 @@ register a notification, an application must:
> > > >  
> > > >  - create an eventfd using eventfd(2);
> > > >  - open memory.pressure_level;
> > > > -- write string like "  "
> > > > +- write string like "   
> > > > [strict]"
> > > >to cgroup.event_control.
> > > >  
> > > 
> > > This is.. pretty strange interface. Would it be cleaner to do ioctl()?
> > > New syscall?
> > 
> > Are you referring to my new mode or to the whole thing?
> 
> Well. The interface was already very strange and you made it even
> worse.

The existing interface is the cgroup's notification mechanism, I think
discussing it is a bit out of scope for my extension.

Now, regarding my extension itself and the current vmpressure API, I
believe that delivering all events to user-space (ie. w/o any filtering
in the kernel) is a better solution.

Point is whether we can do it with the current vmpressure API (which
is cgroup based) or whether we should move to something else.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-07-02 Thread Luiz Capitulino
On Tue, 2 Jul 2013 10:24:09 -0700
Anton Vorontsov  wrote:

> > Honestly, what Andrew suggested is the best design for me: apps
> > are notified on all events but the event name is sent to the application.
> 
> I am fine with this approach (or any other, I'm really indifferent to the
> API itself -- read/netlink/notification per file/whatever for the
> payload),

That's a very good thing because we've managed to agree on something :)

I'm also indifferent to the API, as long as we have 100% of the policy
in user-space. To me this means we do absolutely no filtering in the
kernel, which in turn means user-space gets all the events. Of course,
we need the event name as a payload.

Do we agree this solves all use-cases we have discussed so far?

> except that you still have the similar problem:
> 
>   read() oldread() new
>   --
>"low"   "low"
>"low"   "foo" -- the app does not know what does this mean
>"med"   "bar" -- ditto

It can just ignore it, have a special handling, log it, fail or whatever.
That's the good of having the policy in user-space.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vmpressure: implement strict mode

2013-07-02 Thread Luiz Capitulino
On Mon, 1 Jul 2013 10:51:03 +0200
Pavel Machek  wrote:

> Hi!
> 
> > diff --git a/Documentation/cgroups/memory.txt 
> > b/Documentation/cgroups/memory.txt
> > index ddf4f93..3c589cf 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -807,12 +807,14 @@ register a notification, an application must:
> >  
> >  - create an eventfd using eventfd(2);
> >  - open memory.pressure_level;
> > -- write string like "  "
> > +- write string like "   
> > [strict]"
> >to cgroup.event_control.
> >  
> 
> This is.. pretty strange interface. Would it be cleaner to do ioctl()?
> New syscall?

Are you referring to my new mode or to the whole thing?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-07-02 Thread Luiz Capitulino
On Fri, 28 Jun 2013 17:56:37 -0700
Anton Vorontsov  wrote:

> On Fri, Jun 28, 2013 at 03:44:02PM -0400, Luiz Capitulino wrote:
> > > Why can't you use poll() and demultiplex the events? Check if there is an
> > > event in the crit fd, and if there is, then just ignore all the rest.
> > 
> > This may be a valid workaround for current kernels, but application
> > behavior will be different among kernels with a different number of
> > events.
> 
> This is not a workaround, this is how poll works, and this is kinda
> expected...

I think this is a workaround because it's tailored to my specific
use-case and to a specific kernel version, as:

 1. Applications registering for lower levels (eg. low) are still
unable to tell which level actually caused them to be notified,
as lower levels are triggered along with higher levels

 2. Considering the interface can be extended, how can new applications
work on backwards mode? Say, we add ultra-critical on 3.12 and
I update my application to work on it, how will my application
work on 3.11?

Hint: Try and error is an horribly bad approach.

 3. I also don't believe we have good forward compatibility with
the current API, as adding new events will cause existing ones
to be triggered more often, so I'd expect app behavior to vary
among kernels with a different number of events.

Honestly, what Andrew suggested is the best design for me: apps
are notified on all events but the event name is sent to the application.

This is pretty simple and solves all the problems we've discussed
so far.

Why can't we just do it?

> But not that I had this plan in mind when I was designing the
> current scheme... :)

Hehe :)

> > Say, we events on top of critical. Then crit fd will now be
> > notified for cases where it didn't use to on older kernels.
> 
> I'm not sure I am following here... but thinking about it more, I guess
> the extra read() will be needed anyway (to reset the counter).

I hope I have explained this more clearly above.

> > > > However, it *is* possible to make non-strict work on strict if we make
> > > > strict default _and_ make reads on memory.pressure_level return
> > > > available events. Just do this on app initialization:
> > > > 
> > > > for each event in memory.pressure_level; do
> > > > /* register eventfd to be notified on "event" */
> > > > done
> > > 
> > > This scheme registers "all" events.
> > 
> > Yes, because I thought that's the user-case that matters for activity
> > manager :)
> 
> Some activity managers use only low levels (Android), some might use only
> medium levels (simple load-balancing).
> 
> Being able to register only "all" does not make sense to me.

Well, you can skip events if you want.

> > > Here is more complicated case:
> > > 
> > > Old kernels, pressure_level reads:
> > > 
> > >   low, med, crit
> > > 
> > > The app just wants to listen for med level.
> > > 
> > > New kernels, pressure_level reads:
> > > 
> > >   low, FOO, med, BAR, crit
> > > 
> > > How would application decide which of FOO and BAR are ex-med levels?
> > 
> > What you meant by ex-med?
> 
> The scale is continuous and non-overlapping. If you add some other level,
> you effectively "shrinking" other levels, so the ex-med in the list above
> might correspond to "FOO, med" or "med, BAR" or "FOO, med, BAR", and that
> is exactly the problem.

Just return the events in order?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-07-02 Thread Luiz Capitulino
On Tue, 2 Jul 2013 10:24:09 -0700
Anton Vorontsov an...@enomsg.org wrote:

  Honestly, what Andrew suggested is the best design for me: apps
  are notified on all events but the event name is sent to the application.
 
 I am fine with this approach (or any other, I'm really indifferent to the
 API itself -- read/netlink/notification per file/whatever for the
 payload),

That's a very good thing because we've managed to agree on something :)

I'm also indifferent to the API, as long as we have 100% of the policy
in user-space. To me this means we do absolutely no filtering in the
kernel, which in turn means user-space gets all the events. Of course,
we need the event name as a payload.

Do we agree this solves all use-cases we have discussed so far?

 except that you still have the similar problem:
 
   read() oldread() new
   --
low   low
low   foo -- the app does not know what does this mean
med   bar -- ditto

It can just ignore it, have a special handling, log it, fail or whatever.
That's the good of having the policy in user-space.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vmpressure: implement strict mode

2013-07-02 Thread Luiz Capitulino
On Tue, 2 Jul 2013 21:47:03 +0200
Pavel Machek pa...@ucw.cz wrote:

 On Tue 2013-07-02 11:06:28, Luiz Capitulino wrote:
  On Mon, 1 Jul 2013 10:51:03 +0200
  Pavel Machek pa...@ucw.cz wrote:
  
   Hi!
   
diff --git a/Documentation/cgroups/memory.txt 
b/Documentation/cgroups/memory.txt
index ddf4f93..3c589cf 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -807,12 +807,14 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like event_fd fd of memory.pressure_level level
+- write string like event_fd fd of memory.pressure_level level 
[strict]
   to cgroup.event_control.
 
   
   This is.. pretty strange interface. Would it be cleaner to do ioctl()?
   New syscall?
  
  Are you referring to my new mode or to the whole thing?
 
 Well. The interface was already very strange and you made it even
 worse.

The existing interface is the cgroup's notification mechanism, I think
discussing it is a bit out of scope for my extension.

Now, regarding my extension itself and the current vmpressure API, I
believe that delivering all events to user-space (ie. w/o any filtering
in the kernel) is a better solution.

Point is whether we can do it with the current vmpressure API (which
is cgroup based) or whether we should move to something else.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-07-02 Thread Luiz Capitulino
On Fri, 28 Jun 2013 17:56:37 -0700
Anton Vorontsov an...@enomsg.org wrote:

 On Fri, Jun 28, 2013 at 03:44:02PM -0400, Luiz Capitulino wrote:
   Why can't you use poll() and demultiplex the events? Check if there is an
   event in the crit fd, and if there is, then just ignore all the rest.
  
  This may be a valid workaround for current kernels, but application
  behavior will be different among kernels with a different number of
  events.
 
 This is not a workaround, this is how poll works, and this is kinda
 expected...

I think this is a workaround because it's tailored to my specific
use-case and to a specific kernel version, as:

 1. Applications registering for lower levels (eg. low) are still
unable to tell which level actually caused them to be notified,
as lower levels are triggered along with higher levels

 2. Considering the interface can be extended, how can new applications
work on backwards mode? Say, we add ultra-critical on 3.12 and
I update my application to work on it, how will my application
work on 3.11?

Hint: Try and error is an horribly bad approach.

 3. I also don't believe we have good forward compatibility with
the current API, as adding new events will cause existing ones
to be triggered more often, so I'd expect app behavior to vary
among kernels with a different number of events.

Honestly, what Andrew suggested is the best design for me: apps
are notified on all events but the event name is sent to the application.

This is pretty simple and solves all the problems we've discussed
so far.

Why can't we just do it?

 But not that I had this plan in mind when I was designing the
 current scheme... :)

Hehe :)

  Say, we events on top of critical. Then crit fd will now be
  notified for cases where it didn't use to on older kernels.
 
 I'm not sure I am following here... but thinking about it more, I guess
 the extra read() will be needed anyway (to reset the counter).

I hope I have explained this more clearly above.

However, it *is* possible to make non-strict work on strict if we make
strict default _and_ make reads on memory.pressure_level return
available events. Just do this on app initialization:

for each event in memory.pressure_level; do
/* register eventfd to be notified on event */
done
   
   This scheme registers all events.
  
  Yes, because I thought that's the user-case that matters for activity
  manager :)
 
 Some activity managers use only low levels (Android), some might use only
 medium levels (simple load-balancing).
 
 Being able to register only all does not make sense to me.

Well, you can skip events if you want.

   Here is more complicated case:
   
   Old kernels, pressure_level reads:
   
 low, med, crit
   
   The app just wants to listen for med level.
   
   New kernels, pressure_level reads:
   
 low, FOO, med, BAR, crit
   
   How would application decide which of FOO and BAR are ex-med levels?
  
  What you meant by ex-med?
 
 The scale is continuous and non-overlapping. If you add some other level,
 you effectively shrinking other levels, so the ex-med in the list above
 might correspond to FOO, med or med, BAR or FOO, med, BAR, and that
 is exactly the problem.

Just return the events in order?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vmpressure: implement strict mode

2013-07-02 Thread Luiz Capitulino
On Mon, 1 Jul 2013 10:51:03 +0200
Pavel Machek pa...@ucw.cz wrote:

 Hi!
 
  diff --git a/Documentation/cgroups/memory.txt 
  b/Documentation/cgroups/memory.txt
  index ddf4f93..3c589cf 100644
  --- a/Documentation/cgroups/memory.txt
  +++ b/Documentation/cgroups/memory.txt
  @@ -807,12 +807,14 @@ register a notification, an application must:
   
   - create an eventfd using eventfd(2);
   - open memory.pressure_level;
  -- write string like event_fd fd of memory.pressure_level level
  +- write string like event_fd fd of memory.pressure_level level 
  [strict]
 to cgroup.event_control.
   
 
 This is.. pretty strange interface. Would it be cleaner to do ioctl()?
 New syscall?

Are you referring to my new mode or to the whole thing?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Fri, 28 Jun 2013 11:55:47 -0700
Anton Vorontsov  wrote:

> On Fri, Jun 28, 2013 at 02:45:07PM -0400, Luiz Capitulino wrote:
> > On Fri, 28 Jun 2013 10:09:17 -0700
> > Anton Vorontsov  wrote:
> > 
> > > So, I would now argue that the current scheme is perfectly OK and can do
> > > everything you can do with the "strict" one,
> > 
> > I forgot commenting this bit. This is not true, because I don't want a
> > low fd to be notified on critical level. The current interface just
> > can't do that.
> 
> Why can't you use poll() and demultiplex the events? Check if there is an
> event in the crit fd, and if there is, then just ignore all the rest.

This may be a valid workaround for current kernels, but application
behavior will be different among kernels with a different number of
events.

Say, we events on top of critical. Then crit fd will now be
notified for cases where it didn't use to on older kernels.

> > However, it *is* possible to make non-strict work on strict if we make
> > strict default _and_ make reads on memory.pressure_level return
> > available events. Just do this on app initialization:
> > 
> > for each event in memory.pressure_level; do
> > /* register eventfd to be notified on "event" */
> > done
> 
> This scheme registers "all" events.

Yes, because I thought that's the user-case that matters for activity
manager :)

> Here is more complicated case:
> 
> Old kernels, pressure_level reads:
> 
>   low, med, crit
> 
> The app just wants to listen for med level.
> 
> New kernels, pressure_level reads:
> 
>   low, FOO, med, BAR, crit
> 
> How would application decide which of FOO and BAR are ex-med levels?

What you meant by ex-med?

Let's not over-design. I agree that allowing the API to be extended
is a good thing, but we shouldn't give complex meaning to events,
otherwise we're better with a numeric scale instead.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Fri, 28 Jun 2013 11:45:47 -0700
Anton Vorontsov  wrote:

> On Fri, Jun 28, 2013 at 02:25:58PM -0400, Luiz Capitulino wrote:
> > > > > That's how it's expected to work, because on strict mode you're 
> > > > > notified
> > > > > for the level you registered for. So apps registering for critical, 
> > > > > will
> > > > > still be notified on critical just like before.
> > > > 
> > > > Suppose you introduce a new level, and the system hits this level. 
> > > > Before,
> > > > the app would receive at least some notification for the given memory 
> > > > load
> > > > (i.e. one of the old levels), with the new level introduced in the 
> > > > kernel,
> > > > the app will receive no events at all.
> > 
> > That's not true. If an app registered for critical it will still get
> > critical notification when the system is at the critical level. Just as it
> > always did. No new events will change this.
> > 
> > With today's semantics though, new events will change when current events
> > are triggered. So each new extension will cause applications to have
> > different behaviors, in different kernel versions. This looks quite
> > undesirable to me.
> 
> I'll try to explain it again.
> 
> Old behaviour:
> 
> low -> event
>   x <- but the system is at this unnamed level, between low and med
> med
> crit
> 
> 
> We add a level:
> 
> low
> low-med <- system at this state, we send an event, but the old app does
>not know about it, so it won't receive *any* notifications. (In
>  older kernels it would receive low level notification
> med
> crit
> 
> You really don't see a problem here?

I do get what you're saying. We disagree it's a problem. In my mind the
best API is to get what you registered for. Nothing more, nothing less.

Now, there might be ways around it (being it a problem or not). I was
also considering this:

> 3. Never change the levels (how can we know?)

If we fail at determining levels (I honestly think current levels are
all we need), we can add a new interface later.

Also, what I said in the last email should work, which is to make
memory.pressure_level return supported levels, so an application can
register for all available levels. This way it will never miss a level.

I also think this matches having the mechanism in the kernel and
policy in user-space.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Fri, 28 Jun 2013 10:09:17 -0700
Anton Vorontsov  wrote:

> So, I would now argue that the current scheme is perfectly OK and can do
> everything you can do with the "strict" one,

I forgot commenting this bit. This is not true, because I don't want a
low fd to be notified on critical level. The current interface just
can't do that.

However, it *is* possible to make non-strict work on strict if we make
strict default _and_ make reads on memory.pressure_level return
available events. Just do this on app initialization:

for each event in memory.pressure_level; do
/* register eventfd to be notified on "event" */
done

Then eventfd will always be notified, no matter the event.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Fri, 28 Jun 2013 10:09:17 -0700
Anton Vorontsov  wrote:

> On Fri, Jun 28, 2013 at 09:57:22AM -0700, Anton Vorontsov wrote:
> > On Fri, Jun 28, 2013 at 10:00:27AM -0400, Luiz Capitulino wrote:
> > > On Thu, 27 Jun 2013 22:07:12 -0700
> > > Anton Vorontsov  wrote:
> > > 
> > > > On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote:
> > > > > ... we can add the strict mode and deprecate the
> > > > > "filtering" -- basically we'll implement the idea of requiring that
> > > > > userspace registers a separate fd for each level.
> > > > 
> > > > Btw, assuming that more levels can be added, there will be a problem:
> > > > imagine that an app hooked up onto low, med, crit levels in "strict"
> > > > mode... then once we add a new level, the app will start missing the new
> > > > level events.
> > > 
> > > That's how it's expected to work, because on strict mode you're notified
> > > for the level you registered for. So apps registering for critical, will
> > > still be notified on critical just like before.
> > 
> > Suppose you introduce a new level, and the system hits this level. Before,
> > the app would receive at least some notification for the given memory load
> > (i.e. one of the old levels), with the new level introduced in the kernel,
> > the app will receive no events at all.

That's not true. If an app registered for critical it will still get
critical notification when the system is at the critical level. Just as it
always did. No new events will change this.

With today's semantics though, new events will change when current events
are triggered. So each new extension will cause applications to have
different behaviors, in different kernel versions. This looks quite
undesirable to me.

> > This makes a serious behavioural
> > change in the app (read: it'll break it).

How? Strict mode semantics is simple: you get what you registered for.
No matter the kernel version, no matter how many events we add on top
of existing ones. How can this brake applications?

> Btw, why exactly you need the strict mode?

I'm implementing automatic guest-resize for KVM guests. What I'd like to
do is to inflate the balloon by different values depending on the
host pressure. Say, 1MB on low; 16MB on medium and 128MB on critical.

The actual values are meaningless btw, as this is going to be set by
the user. So, saying that 1MB on low is too little to be concerned about
is not an valid argument, as the user can set this to 1GB.

> Why 'medium' won't work for the
> load-balancing?

I need precision. If the system is at 'medium' level, then I'll do
medium stuff. If it gets into critical, then I'll do critical stuff.

I don't want to do critical stuff on all levels, or critical on
medium. This is really messy. Just give me what I asked for, please.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Thu, 27 Jun 2013 22:07:12 -0700
Anton Vorontsov  wrote:

> On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote:
> > ... we can add the strict mode and deprecate the
> > "filtering" -- basically we'll implement the idea of requiring that
> > userspace registers a separate fd for each level.
> 
> Btw, assuming that more levels can be added, there will be a problem:
> imagine that an app hooked up onto low, med, crit levels in "strict"
> mode... then once we add a new level, the app will start missing the new
> level events.

That's how it's expected to work, because on strict mode you're notified
for the level you registered for. So apps registering for critical, will
still be notified on critical just like before.

> In the old scheme it is not a problem because of the >= condition.

I think the problem actually lies with the current interface, because
if an app registers for critical and we add a new level after critical
then this app will now be notified on critical *and* the new level. The
app's algorithm might not be prepared to deal with that.

> With a proper versioning this won't be a problem for a new scheme too.

I don't think there's a problem to be solved here. Strict mode does
allow forward compatibility. For good backward compatibility we can make
memory.pressure_level return supported levels on read. This way
applications can check if the level they are interested in exist and
then fallback or exit with a good error message if they don't.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Thu, 27 Jun 2013 21:34:11 -0700
Anton Vorontsov  wrote:

> On Thu, Jun 27, 2013 at 06:13:53PM -0700, Andrew Morton wrote:
> > On Thu, 27 Jun 2013 17:58:53 -0700 Anton Vorontsov  wrote:
> > > Current frequency is 1/(2MB). Suppose we ended up scanning the whole
> > > memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much*
> > > to me... But for what it worth, I am against adding read() to the
> > > interface -- just because we can avoid the unnecessary switch into the
> > > kernel.
> > 
> > What was it they said about premature optimization?
> > 
> > I think I'd rather do nothing than add a mode hack (already!).
> > 
> > The information Luiz wants is already available with the existing
> > interface, so why not just use it until there is a real demonstrated
> > problem?
> > 
> > But all this does point at the fact that the chosen interface was not a
> > good one.  And it's happening so soon :( A far better interface would
> > be to do away with this level filtering stuff in the kernel altogether.
> 
> OK, I am convinced that modes might be not necessary, but I see no big
> problem in current situation, we can add the strict mode and deprecate the
> "filtering" -- basically we'll implement the idea of requiring that
> userspace registers a separate fd for each level.

Agreed this is a good solution.

> As one of the ways to change the interface, we can do the strict mode by
> writing levels in uppercase, and warn_once on lowercase levels, describing
> that the old behaviour will go away. Once (if ever) we remove the old
> behaviour, the apps trying the old-style lowercase levels will fail
> gracefully with EINVAL.

Why don't we just break it? There's no non-development kernel released
with this interface yet.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Thu, 27 Jun 2013 21:34:11 -0700
Anton Vorontsov an...@enomsg.org wrote:

 On Thu, Jun 27, 2013 at 06:13:53PM -0700, Andrew Morton wrote:
  On Thu, 27 Jun 2013 17:58:53 -0700 Anton Vorontsov an...@enomsg.org wrote:
   Current frequency is 1/(2MB). Suppose we ended up scanning the whole
   memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much*
   to me... But for what it worth, I am against adding read() to the
   interface -- just because we can avoid the unnecessary switch into the
   kernel.
  
  What was it they said about premature optimization?
  
  I think I'd rather do nothing than add a mode hack (already!).
  
  The information Luiz wants is already available with the existing
  interface, so why not just use it until there is a real demonstrated
  problem?
  
  But all this does point at the fact that the chosen interface was not a
  good one.  And it's happening so soon :( A far better interface would
  be to do away with this level filtering stuff in the kernel altogether.
 
 OK, I am convinced that modes might be not necessary, but I see no big
 problem in current situation, we can add the strict mode and deprecate the
 filtering -- basically we'll implement the idea of requiring that
 userspace registers a separate fd for each level.

Agreed this is a good solution.

 As one of the ways to change the interface, we can do the strict mode by
 writing levels in uppercase, and warn_once on lowercase levels, describing
 that the old behaviour will go away. Once (if ever) we remove the old
 behaviour, the apps trying the old-style lowercase levels will fail
 gracefully with EINVAL.

Why don't we just break it? There's no non-development kernel released
with this interface yet.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Thu, 27 Jun 2013 22:07:12 -0700
Anton Vorontsov an...@enomsg.org wrote:

 On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote:
  ... we can add the strict mode and deprecate the
  filtering -- basically we'll implement the idea of requiring that
  userspace registers a separate fd for each level.
 
 Btw, assuming that more levels can be added, there will be a problem:
 imagine that an app hooked up onto low, med, crit levels in strict
 mode... then once we add a new level, the app will start missing the new
 level events.

That's how it's expected to work, because on strict mode you're notified
for the level you registered for. So apps registering for critical, will
still be notified on critical just like before.

 In the old scheme it is not a problem because of the = condition.

I think the problem actually lies with the current interface, because
if an app registers for critical and we add a new level after critical
then this app will now be notified on critical *and* the new level. The
app's algorithm might not be prepared to deal with that.

 With a proper versioning this won't be a problem for a new scheme too.

I don't think there's a problem to be solved here. Strict mode does
allow forward compatibility. For good backward compatibility we can make
memory.pressure_level return supported levels on read. This way
applications can check if the level they are interested in exist and
then fallback or exit with a good error message if they don't.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Fri, 28 Jun 2013 10:09:17 -0700
Anton Vorontsov an...@enomsg.org wrote:

 On Fri, Jun 28, 2013 at 09:57:22AM -0700, Anton Vorontsov wrote:
  On Fri, Jun 28, 2013 at 10:00:27AM -0400, Luiz Capitulino wrote:
   On Thu, 27 Jun 2013 22:07:12 -0700
   Anton Vorontsov an...@enomsg.org wrote:
   
On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote:
 ... we can add the strict mode and deprecate the
 filtering -- basically we'll implement the idea of requiring that
 userspace registers a separate fd for each level.

Btw, assuming that more levels can be added, there will be a problem:
imagine that an app hooked up onto low, med, crit levels in strict
mode... then once we add a new level, the app will start missing the new
level events.
   
   That's how it's expected to work, because on strict mode you're notified
   for the level you registered for. So apps registering for critical, will
   still be notified on critical just like before.
  
  Suppose you introduce a new level, and the system hits this level. Before,
  the app would receive at least some notification for the given memory load
  (i.e. one of the old levels), with the new level introduced in the kernel,
  the app will receive no events at all.

That's not true. If an app registered for critical it will still get
critical notification when the system is at the critical level. Just as it
always did. No new events will change this.

With today's semantics though, new events will change when current events
are triggered. So each new extension will cause applications to have
different behaviors, in different kernel versions. This looks quite
undesirable to me.

  This makes a serious behavioural
  change in the app (read: it'll break it).

How? Strict mode semantics is simple: you get what you registered for.
No matter the kernel version, no matter how many events we add on top
of existing ones. How can this brake applications?

 Btw, why exactly you need the strict mode?

I'm implementing automatic guest-resize for KVM guests. What I'd like to
do is to inflate the balloon by different values depending on the
host pressure. Say, 1MB on low; 16MB on medium and 128MB on critical.

The actual values are meaningless btw, as this is going to be set by
the user. So, saying that 1MB on low is too little to be concerned about
is not an valid argument, as the user can set this to 1GB.

 Why 'medium' won't work for the
 load-balancing?

I need precision. If the system is at 'medium' level, then I'll do
medium stuff. If it gets into critical, then I'll do critical stuff.

I don't want to do critical stuff on all levels, or critical on
medium. This is really messy. Just give me what I asked for, please.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Fri, 28 Jun 2013 10:09:17 -0700
Anton Vorontsov an...@enomsg.org wrote:

 So, I would now argue that the current scheme is perfectly OK and can do
 everything you can do with the strict one,

I forgot commenting this bit. This is not true, because I don't want a
low fd to be notified on critical level. The current interface just
can't do that.

However, it *is* possible to make non-strict work on strict if we make
strict default _and_ make reads on memory.pressure_level return
available events. Just do this on app initialization:

for each event in memory.pressure_level; do
/* register eventfd to be notified on event */
done

Then eventfd will always be notified, no matter the event.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Fri, 28 Jun 2013 11:45:47 -0700
Anton Vorontsov an...@enomsg.org wrote:

 On Fri, Jun 28, 2013 at 02:25:58PM -0400, Luiz Capitulino wrote:
 That's how it's expected to work, because on strict mode you're 
 notified
 for the level you registered for. So apps registering for critical, 
 will
 still be notified on critical just like before.

Suppose you introduce a new level, and the system hits this level. 
Before,
the app would receive at least some notification for the given memory 
load
(i.e. one of the old levels), with the new level introduced in the 
kernel,
the app will receive no events at all.
  
  That's not true. If an app registered for critical it will still get
  critical notification when the system is at the critical level. Just as it
  always did. No new events will change this.
  
  With today's semantics though, new events will change when current events
  are triggered. So each new extension will cause applications to have
  different behaviors, in different kernel versions. This looks quite
  undesirable to me.
 
 I'll try to explain it again.
 
 Old behaviour:
 
 low - event
   x - but the system is at this unnamed level, between low and med
 med
 crit
 
 
 We add a level:
 
 low
 low-med - system at this state, we send an event, but the old app does
not know about it, so it won't receive *any* notifications. (In
  older kernels it would receive low level notification
 med
 crit
 
 You really don't see a problem here?

I do get what you're saying. We disagree it's a problem. In my mind the
best API is to get what you registered for. Nothing more, nothing less.

Now, there might be ways around it (being it a problem or not). I was
also considering this:

 3. Never change the levels (how can we know?)

If we fail at determining levels (I honestly think current levels are
all we need), we can add a new interface later.

Also, what I said in the last email should work, which is to make
memory.pressure_level return supported levels, so an application can
register for all available levels. This way it will never miss a level.

I also think this matches having the mechanism in the kernel and
policy in user-space.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-28 Thread Luiz Capitulino
On Fri, 28 Jun 2013 11:55:47 -0700
Anton Vorontsov an...@enomsg.org wrote:

 On Fri, Jun 28, 2013 at 02:45:07PM -0400, Luiz Capitulino wrote:
  On Fri, 28 Jun 2013 10:09:17 -0700
  Anton Vorontsov an...@enomsg.org wrote:
  
   So, I would now argue that the current scheme is perfectly OK and can do
   everything you can do with the strict one,
  
  I forgot commenting this bit. This is not true, because I don't want a
  low fd to be notified on critical level. The current interface just
  can't do that.
 
 Why can't you use poll() and demultiplex the events? Check if there is an
 event in the crit fd, and if there is, then just ignore all the rest.

This may be a valid workaround for current kernels, but application
behavior will be different among kernels with a different number of
events.

Say, we events on top of critical. Then crit fd will now be
notified for cases where it didn't use to on older kernels.

  However, it *is* possible to make non-strict work on strict if we make
  strict default _and_ make reads on memory.pressure_level return
  available events. Just do this on app initialization:
  
  for each event in memory.pressure_level; do
  /* register eventfd to be notified on event */
  done
 
 This scheme registers all events.

Yes, because I thought that's the user-case that matters for activity
manager :)

 Here is more complicated case:
 
 Old kernels, pressure_level reads:
 
   low, med, crit
 
 The app just wants to listen for med level.
 
 New kernels, pressure_level reads:
 
   low, FOO, med, BAR, crit
 
 How would application decide which of FOO and BAR are ex-med levels?

What you meant by ex-med?

Let's not over-design. I agree that allowing the API to be extended
is a good thing, but we shouldn't give complex meaning to events,
otherwise we're better with a numeric scale instead.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmpressure: implement strict mode

2013-06-27 Thread Luiz Capitulino
On Thu, 27 Jun 2013 11:26:16 +0200
Michal Hocko  wrote:

> On Wed 26-06-13 23:17:12, Luiz Capitulino wrote:
> > Currently, an eventfd is notified for the level it's registered for
> > _plus_ higher levels.
> > 
> > This is a problem if an application wants to implement different
> > actions for different levels. For example, an application might want
> > to release 10% of its cache on level low, 50% on medium and 100% on
> > critical. To do this, an application has to register a different
> > eventfd for each pressure level. However, fd low is always going to
> > be notified and and all fds are going to be notified on level critical.
> > 
> > Strict mode solves this problem by strictly notifiying an eventfd
> > for the pressure level it registered for. This new mode is optional,
> > by default we still notify eventfds on higher levels too.
> > 
> > Signed-off-by: Luiz Capitulino 
> 
> Two nits bellow but it looks good in general to me. You can add my
> Reviewed-by: Michal Hocko 

Thanks.

> I still think that edge triggering makes some sense but that one might
> be rebased on top of this patch. We should still figure out whether the
> edge triggering is the right approach for the use case Hyunhee Kim wants
> it for so the strict mode should go first IMO.

I also think that edge triggering makes sense.

> > ---
> > 
> > o v2
> > 
> >  - Improve documentation
> >  - Use a bit to store mode instead of a bool
> >  - Minor changelog changes
> > 
> >  Documentation/cgroups/memory.txt | 26 ++
> >  mm/vmpressure.c  | 26 --
> >  2 files changed, 46 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/memory.txt 
> > b/Documentation/cgroups/memory.txt
> > index ddf4f93..412872b 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they 
> > can to help the
> >  system. It might be too late to consult with vmstat or any other
> >  statistics, so it's advisable to take an immediate action.
> >  
> > +Applications can also choose between two notification modes when
> > +registering an eventfd for memory pressure events:
> > +
> > +When in "non-strict" mode, an eventfd is notified for the specific level
> > +it's registered for and higher levels. For example, an eventfd registered
> > +for low level is also going to be notified on medium and critical levels.
> > +This mode makes sense for applications interested on monitoring reclaim
> > +activity or implementing simple load-balacing logic. The non-strict mode
> > +is the default notification mode.
> > +
> > +When in "strict" mode, an eventfd is strictly notified for the pressure
> > +level it's registered for. For example, an eventfd registered for the low
> > +level event is not going to be notified when memory pressure gets into
> > +medium or critical levels. This allows for more complex logic based on
> > +the actual pressure level the system is experiencing.
> 
> It would be also fair to mention that there is no guarantee that lower
> levels are signaled before higher so nobody should rely on seeing LOW
> before MEDIUM or CRITICAL.

I think this is implied. Actually, as an user of this interface I didn't
expect this to happen until I read the code.

> 
> > +
> >  The events are propagated upward until the event is handled, i.e. the
> >  events are not pass-through. Here is what this means: for example you have
> >  three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> > @@ -807,12 +823,14 @@ register a notification, an application must:
> >  
> >  - create an eventfd using eventfd(2);
> >  - open memory.pressure_level;
> > -- write string like "  "
> > +- write string like "   
> > [strict]"
> >to cgroup.event_control.
> >  
> > -Application will be notified through eventfd when memory pressure is at
> > -the specific level (or higher). Read/write operations to
> > -memory.pressure_level are no implemented.
> > +Applications will be notified through eventfd when memory pressure is at
> > +the specific level or higher. If strict is passed, then applications
> > +will only be notified when memory pressure reaches the specified level.
> > +
> > +Read/write operations to memory.pressure_level are no implemented.
> >  
> >  Test:
> >  
> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > index 736a60

Re: [PATCH v2] vmpressure: implement strict mode

2013-06-27 Thread Luiz Capitulino
On Thu, 27 Jun 2013 11:26:16 +0200
Michal Hocko mho...@suse.cz wrote:

 On Wed 26-06-13 23:17:12, Luiz Capitulino wrote:
  Currently, an eventfd is notified for the level it's registered for
  _plus_ higher levels.
  
  This is a problem if an application wants to implement different
  actions for different levels. For example, an application might want
  to release 10% of its cache on level low, 50% on medium and 100% on
  critical. To do this, an application has to register a different
  eventfd for each pressure level. However, fd low is always going to
  be notified and and all fds are going to be notified on level critical.
  
  Strict mode solves this problem by strictly notifiying an eventfd
  for the pressure level it registered for. This new mode is optional,
  by default we still notify eventfds on higher levels too.
  
  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
 
 Two nits bellow but it looks good in general to me. You can add my
 Reviewed-by: Michal Hocko mho...@suse.cz

Thanks.

 I still think that edge triggering makes some sense but that one might
 be rebased on top of this patch. We should still figure out whether the
 edge triggering is the right approach for the use case Hyunhee Kim wants
 it for so the strict mode should go first IMO.

I also think that edge triggering makes sense.

  ---
  
  o v2
  
   - Improve documentation
   - Use a bit to store mode instead of a bool
   - Minor changelog changes
  
   Documentation/cgroups/memory.txt | 26 ++
   mm/vmpressure.c  | 26 --
   2 files changed, 46 insertions(+), 6 deletions(-)
  
  diff --git a/Documentation/cgroups/memory.txt 
  b/Documentation/cgroups/memory.txt
  index ddf4f93..412872b 100644
  --- a/Documentation/cgroups/memory.txt
  +++ b/Documentation/cgroups/memory.txt
  @@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they 
  can to help the
   system. It might be too late to consult with vmstat or any other
   statistics, so it's advisable to take an immediate action.
   
  +Applications can also choose between two notification modes when
  +registering an eventfd for memory pressure events:
  +
  +When in non-strict mode, an eventfd is notified for the specific level
  +it's registered for and higher levels. For example, an eventfd registered
  +for low level is also going to be notified on medium and critical levels.
  +This mode makes sense for applications interested on monitoring reclaim
  +activity or implementing simple load-balacing logic. The non-strict mode
  +is the default notification mode.
  +
  +When in strict mode, an eventfd is strictly notified for the pressure
  +level it's registered for. For example, an eventfd registered for the low
  +level event is not going to be notified when memory pressure gets into
  +medium or critical levels. This allows for more complex logic based on
  +the actual pressure level the system is experiencing.
 
 It would be also fair to mention that there is no guarantee that lower
 levels are signaled before higher so nobody should rely on seeing LOW
 before MEDIUM or CRITICAL.

I think this is implied. Actually, as an user of this interface I didn't
expect this to happen until I read the code.

 
  +
   The events are propagated upward until the event is handled, i.e. the
   events are not pass-through. Here is what this means: for example you have
   three cgroups: A-B-C. Now you set up an event listener on cgroups A, B
  @@ -807,12 +823,14 @@ register a notification, an application must:
   
   - create an eventfd using eventfd(2);
   - open memory.pressure_level;
  -- write string like event_fd fd of memory.pressure_level level
  +- write string like event_fd fd of memory.pressure_level level 
  [strict]
 to cgroup.event_control.
   
  -Application will be notified through eventfd when memory pressure is at
  -the specific level (or higher). Read/write operations to
  -memory.pressure_level are no implemented.
  +Applications will be notified through eventfd when memory pressure is at
  +the specific level or higher. If strict is passed, then applications
  +will only be notified when memory pressure reaches the specified level.
  +
  +Read/write operations to memory.pressure_level are no implemented.
   
   Test:
   
  diff --git a/mm/vmpressure.c b/mm/vmpressure.c
  index 736a601..ba5c17e 100644
  --- a/mm/vmpressure.c
  +++ b/mm/vmpressure.c
  @@ -138,8 +138,16 @@ struct vmpressure_event {
  struct eventfd_ctx *efd;
  enum vmpressure_levels level;
  struct list_head node;
  +   unsigned int mode;
 
 You would fill up a hole between level and node if you move it up on
 64b. Doesn't matter much but why not do it...

You want me to respin?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http

[PATCH v2] vmpressure: implement strict mode

2013-06-26 Thread Luiz Capitulino
Currently, an eventfd is notified for the level it's registered for
_plus_ higher levels.

This is a problem if an application wants to implement different
actions for different levels. For example, an application might want
to release 10% of its cache on level low, 50% on medium and 100% on
critical. To do this, an application has to register a different
eventfd for each pressure level. However, fd low is always going to
be notified and and all fds are going to be notified on level critical.

Strict mode solves this problem by strictly notifiying an eventfd
for the pressure level it registered for. This new mode is optional,
by default we still notify eventfds on higher levels too.

Signed-off-by: Luiz Capitulino 
---

o v2

 - Improve documentation
 - Use a bit to store mode instead of a bool
 - Minor changelog changes

 Documentation/cgroups/memory.txt | 26 ++
 mm/vmpressure.c  | 26 --
 2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..412872b 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they can 
to help the
 system. It might be too late to consult with vmstat or any other
 statistics, so it's advisable to take an immediate action.
 
+Applications can also choose between two notification modes when
+registering an eventfd for memory pressure events:
+
+When in "non-strict" mode, an eventfd is notified for the specific level
+it's registered for and higher levels. For example, an eventfd registered
+for low level is also going to be notified on medium and critical levels.
+This mode makes sense for applications interested on monitoring reclaim
+activity or implementing simple load-balacing logic. The non-strict mode
+is the default notification mode.
+
+When in "strict" mode, an eventfd is strictly notified for the pressure
+level it's registered for. For example, an eventfd registered for the low
+level event is not going to be notified when memory pressure gets into
+medium or critical levels. This allows for more complex logic based on
+the actual pressure level the system is experiencing.
+
 The events are propagated upward until the event is handled, i.e. the
 events are not pass-through. Here is what this means: for example you have
 three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
@@ -807,12 +823,14 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "  "
+- write string like "   [strict]"
   to cgroup.event_control.
 
-Application will be notified through eventfd when memory pressure is at
-the specific level (or higher). Read/write operations to
-memory.pressure_level are no implemented.
+Applications will be notified through eventfd when memory pressure is at
+the specific level or higher. If strict is passed, then applications
+will only be notified when memory pressure reaches the specified level.
+
+Read/write operations to memory.pressure_level are no implemented.
 
 Test:
 
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..ba5c17e 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -138,8 +138,16 @@ struct vmpressure_event {
struct eventfd_ctx *efd;
enum vmpressure_levels level;
struct list_head node;
+   unsigned int mode;
 };
 
+#define VMPRESSURE_MODE_STRICT 1
+
+static inline bool vmpressure_mode_is_strict(const struct vmpressure_event *ev)
+{
+   return ev->mode & VMPRESSURE_MODE_STRICT;
+}
+
 static bool vmpressure_event(struct vmpressure *vmpr,
 unsigned long scanned, unsigned long reclaimed)
 {
@@ -153,6 +161,9 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
list_for_each_entry(ev, >events, node) {
if (level >= ev->level) {
+   /* strict mode ensures level == ev->level */
+   if (vmpressure_mode_is_strict(ev) && level != ev->level)
+   continue;
eventfd_signal(ev->efd, 1);
signalled = true;
}
@@ -292,7 +303,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, 
int prio)
  * infrastructure, so that the notifications will be delivered to the
  * @eventfd. The @args parameter is a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
- * "critical").
+ * "critical") and optionally a different operating mode (i.e. "strict")
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything 

Re: [PATCH] vmpressure: implement strict mode

2013-06-26 Thread Luiz Capitulino
On Wed, 26 Jun 2013 10:08:27 +0200
Michal Hocko  wrote:

> On Tue 25-06-13 17:51:29, Luiz Capitulino wrote:
> > Currently, applications are notified for the level they registered for
> > _plus_ higher levels.
> > 
> > This is a problem if the application wants to implement different
> > actions for different levels. For example, an application might want
> > to release 10% of its cache on level low, 50% on medium and 100% on
> > critical. To do this, the application has to register a different fd
> > for each event. However, fd low is always going to be notified and
> > and all fds are going to be notified on level critical.
> 
> OK, I am not user of this interface but I thought that the application
> would take an action of the highest level it gets notification. But yes
> this might get clumsy to implement.
> 
> > Strict mode solves this problem by strictly notifiying the event
> > an fd has registered for. It's optional. By default we still notify
> > on higher levels.
> 
> OK, makes some sense to me and it should work with the proposed edge
> trigerring as well.
> 
> > Signed-off-by: Luiz Capitulino 
> > ---
> > 
> > PS: I'm following the discussion on the event storm problem, but I believe
> > strict mode is orthogonal to what has been suggested (although the
> > patches conflict)
> > 
> >  Documentation/cgroups/memory.txt | 10 ++
> >  mm/vmpressure.c  | 19 +--
> >  2 files changed, 23 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/memory.txt 
> > b/Documentation/cgroups/memory.txt
> > index ddf4f93..3c589cf 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -807,12 +807,14 @@ register a notification, an application must:
> >  
> >  - create an eventfd using eventfd(2);
> >  - open memory.pressure_level;
> > -- write string like "  "
> > +- write string like "   
> > [strict]"
> >to cgroup.event_control.
> >  
> > -Application will be notified through eventfd when memory pressure is at
> > -the specific level (or higher). Read/write operations to
> > -memory.pressure_level are no implemented.
> > +Applications will be notified through eventfd when memory pressure is at
> > +the specific level or higher. If strict is passed, then applications
> > +will only be notified when memory pressure reaches the specified level.
> 
> It would be good to describe when is strick and when the default
> appropriate.

Yeah, Anton asked for the same thing. Will do it.

> > +
> > +Read/write operations to memory.pressure_level are no implemented.
> >  
> >  Test:
> >  
> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > index 736a601..6289ede 100644
> > --- a/mm/vmpressure.c
> > +++ b/mm/vmpressure.c
> > @@ -137,6 +137,7 @@ static enum vmpressure_levels 
> > vmpressure_calc_level(unsigned long scanned,
> >  struct vmpressure_event {
> > struct eventfd_ctx *efd;
> > enum vmpressure_levels level;
> > +   bool strict_mode;
> 
> Any reason to not using a flag for this? If there are any other modes to
> come them we would end up with zilions of bools which is not very nice.

Good point, I'll change it.

> 
> > struct list_head node;
> >  };
> >  
> > @@ -153,6 +154,9 @@ static bool vmpressure_event(struct vmpressure *vmpr,
> >  
> > list_for_each_entry(ev, >events, node) {
> > if (level >= ev->level) {
> > +   /* strict mode ensures level == ev->level */
> > +   if (ev->strict_mode && level != ev->level)
> > +   continue;
> > eventfd_signal(ev->efd, 1);
> > signalled = true;
> > }
> > @@ -292,7 +296,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup 
> > *memcg, int prio)
> >   * infrastructure, so that the notifications will be delivered to the
> >   * @eventfd. The @args parameter is a string that denotes pressure level
> >   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
> > - * "critical").
> > + * "critical") and optionally a different operating mode (i.e. "strict")
> >   *
> >   * This function should not be used directly, just pass it to (struct
> >   * cftype).register_event, and then cgroup core will handle everything by
> > @@ -303,22 +307,33 @@ int vmpressure_register_event(struct cgroup *cg

Re: [PATCH] vmpressure: implement strict mode

2013-06-26 Thread Luiz Capitulino
On Wed, 26 Jun 2013 17:20:40 +0900
Minchan Kim  wrote:

> Hello Michal,
> 
> On Wed, Jun 26, 2013 at 09:59:21AM +0200, Michal Hocko wrote:
> > On Wed 26-06-13 16:50:51, Minchan Kim wrote:
> > > On Tue, Jun 25, 2013 at 05:51:29PM -0400, Luiz Capitulino wrote:
> > > > Currently, applications are notified for the level they registered for
> > > > _plus_ higher levels.
> > > > 
> > > > This is a problem if the application wants to implement different
> > > > actions for different levels. For example, an application might want
> > > > to release 10% of its cache on level low, 50% on medium and 100% on
> > > > critical. To do this, the application has to register a different fd
> > > > for each event. However, fd low is always going to be notified and
> > > > and all fds are going to be notified on level critical.
> > > > 
> > > > Strict mode solves this problem by strictly notifiying the event
> > > > an fd has registered for. It's optional. By default we still notify
> > > > on higher levels.
> > > > 
> > > > Signed-off-by: Luiz Capitulino 
> > > Acked-by: Minchan Kim 
> > > 
> > > Shouldn't we make this default?
> > 
> > The interface is not there for long but still, changing it is always
> > quite tricky. And the users who care can be modified really easily so I
> > would stick with the original default.
> 
> Yeb, I am not strong against to stick old at a moment but at least,
> this patch makes more sense to me so I'd like to know why we didn't do it
> from the beginning. Surely, Anton has a answer.

That's exactly my thinking too: I think strict mode should be the default
mode, and the current mode should be optional. But it's not a big deal.

I've discussed this issue with Anton some weeks ago, and iirc (Anton,
please correct/clarify where appropriate) the conclusion was that the
current schema makes sense for apps monitoring reclaim activity, as
they can hook on low only.

Hmm. Something just crossed my mind. Maybe we should have two
notification schemas:

 o memory.pressure_level: implements strict mode (this patch)

 o memory.reclaim_activity: apps are notified whenever there's reclaim
activity

As for changing applications, it's better to get some breakage while
we're in -rc than regret the API later.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vmpressure: implement strict mode

2013-06-26 Thread Luiz Capitulino
On Tue, 25 Jun 2013 21:03:31 -0700
Anton Vorontsov  wrote:

> On Tue, Jun 25, 2013 at 05:51:29PM -0400, Luiz Capitulino wrote:
> > Currently, applications are notified for the level they registered for
> > _plus_ higher levels.
> > 
> > This is a problem if the application wants to implement different
> > actions for different levels. For example, an application might want
> > to release 10% of its cache on level low, 50% on medium and 100% on
> > critical. To do this, the application has to register a different fd
> > for each event. However, fd low is always going to be notified and
> > and all fds are going to be notified on level critical.
> > 
> > Strict mode solves this problem by strictly notifiying the event
> > an fd has registered for. It's optional. By default we still notify
> > on higher levels.
> > 
> > Signed-off-by: Luiz Capitulino 
> 
> In the documentation I would add more information about why exactly the
> strict mode makes sense.
> 
> For example, the non-strict fd listener hooked onto the low level makes
> sense for apps that just monitor reclaiming activity (like current Android
> Activity Manager), hooking onto 'medium' non-strict mode makes sense for
> simple load-balancing logic, and the new strict mode is for the cases when
> an application wants to implement some fancy logic as it makes a decision
> based on a concrete level.

OK, I'll respin. But you said it all already, so I'll base my text on
on what you wrote.

> Otherwise, it looks good.
> 
> Acked-by: Anton Vorontsov 

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vmpressure: implement strict mode

2013-06-26 Thread Luiz Capitulino
On Tue, 25 Jun 2013 21:03:31 -0700
Anton Vorontsov an...@enomsg.org wrote:

 On Tue, Jun 25, 2013 at 05:51:29PM -0400, Luiz Capitulino wrote:
  Currently, applications are notified for the level they registered for
  _plus_ higher levels.
  
  This is a problem if the application wants to implement different
  actions for different levels. For example, an application might want
  to release 10% of its cache on level low, 50% on medium and 100% on
  critical. To do this, the application has to register a different fd
  for each event. However, fd low is always going to be notified and
  and all fds are going to be notified on level critical.
  
  Strict mode solves this problem by strictly notifiying the event
  an fd has registered for. It's optional. By default we still notify
  on higher levels.
  
  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
 
 In the documentation I would add more information about why exactly the
 strict mode makes sense.
 
 For example, the non-strict fd listener hooked onto the low level makes
 sense for apps that just monitor reclaiming activity (like current Android
 Activity Manager), hooking onto 'medium' non-strict mode makes sense for
 simple load-balancing logic, and the new strict mode is for the cases when
 an application wants to implement some fancy logic as it makes a decision
 based on a concrete level.

OK, I'll respin. But you said it all already, so I'll base my text on
on what you wrote.

 Otherwise, it looks good.
 
 Acked-by: Anton Vorontsov an...@enomsg.org

Thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vmpressure: implement strict mode

2013-06-26 Thread Luiz Capitulino
On Wed, 26 Jun 2013 17:20:40 +0900
Minchan Kim minc...@kernel.org wrote:

 Hello Michal,
 
 On Wed, Jun 26, 2013 at 09:59:21AM +0200, Michal Hocko wrote:
  On Wed 26-06-13 16:50:51, Minchan Kim wrote:
   On Tue, Jun 25, 2013 at 05:51:29PM -0400, Luiz Capitulino wrote:
Currently, applications are notified for the level they registered for
_plus_ higher levels.

This is a problem if the application wants to implement different
actions for different levels. For example, an application might want
to release 10% of its cache on level low, 50% on medium and 100% on
critical. To do this, the application has to register a different fd
for each event. However, fd low is always going to be notified and
and all fds are going to be notified on level critical.

Strict mode solves this problem by strictly notifiying the event
an fd has registered for. It's optional. By default we still notify
on higher levels.

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
   Acked-by: Minchan Kim minc...@kernel.org
   
   Shouldn't we make this default?
  
  The interface is not there for long but still, changing it is always
  quite tricky. And the users who care can be modified really easily so I
  would stick with the original default.
 
 Yeb, I am not strong against to stick old at a moment but at least,
 this patch makes more sense to me so I'd like to know why we didn't do it
 from the beginning. Surely, Anton has a answer.

That's exactly my thinking too: I think strict mode should be the default
mode, and the current mode should be optional. But it's not a big deal.

I've discussed this issue with Anton some weeks ago, and iirc (Anton,
please correct/clarify where appropriate) the conclusion was that the
current schema makes sense for apps monitoring reclaim activity, as
they can hook on low only.

Hmm. Something just crossed my mind. Maybe we should have two
notification schemas:

 o memory.pressure_level: implements strict mode (this patch)

 o memory.reclaim_activity: apps are notified whenever there's reclaim
activity

As for changing applications, it's better to get some breakage while
we're in -rc than regret the API later.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vmpressure: implement strict mode

2013-06-26 Thread Luiz Capitulino
On Wed, 26 Jun 2013 10:08:27 +0200
Michal Hocko mho...@suse.cz wrote:

 On Tue 25-06-13 17:51:29, Luiz Capitulino wrote:
  Currently, applications are notified for the level they registered for
  _plus_ higher levels.
  
  This is a problem if the application wants to implement different
  actions for different levels. For example, an application might want
  to release 10% of its cache on level low, 50% on medium and 100% on
  critical. To do this, the application has to register a different fd
  for each event. However, fd low is always going to be notified and
  and all fds are going to be notified on level critical.
 
 OK, I am not user of this interface but I thought that the application
 would take an action of the highest level it gets notification. But yes
 this might get clumsy to implement.
 
  Strict mode solves this problem by strictly notifiying the event
  an fd has registered for. It's optional. By default we still notify
  on higher levels.
 
 OK, makes some sense to me and it should work with the proposed edge
 trigerring as well.
 
  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
  ---
  
  PS: I'm following the discussion on the event storm problem, but I believe
  strict mode is orthogonal to what has been suggested (although the
  patches conflict)
  
   Documentation/cgroups/memory.txt | 10 ++
   mm/vmpressure.c  | 19 +--
   2 files changed, 23 insertions(+), 6 deletions(-)
  
  diff --git a/Documentation/cgroups/memory.txt 
  b/Documentation/cgroups/memory.txt
  index ddf4f93..3c589cf 100644
  --- a/Documentation/cgroups/memory.txt
  +++ b/Documentation/cgroups/memory.txt
  @@ -807,12 +807,14 @@ register a notification, an application must:
   
   - create an eventfd using eventfd(2);
   - open memory.pressure_level;
  -- write string like event_fd fd of memory.pressure_level level
  +- write string like event_fd fd of memory.pressure_level level 
  [strict]
 to cgroup.event_control.
   
  -Application will be notified through eventfd when memory pressure is at
  -the specific level (or higher). Read/write operations to
  -memory.pressure_level are no implemented.
  +Applications will be notified through eventfd when memory pressure is at
  +the specific level or higher. If strict is passed, then applications
  +will only be notified when memory pressure reaches the specified level.
 
 It would be good to describe when is strick and when the default
 appropriate.

Yeah, Anton asked for the same thing. Will do it.

  +
  +Read/write operations to memory.pressure_level are no implemented.
   
   Test:
   
  diff --git a/mm/vmpressure.c b/mm/vmpressure.c
  index 736a601..6289ede 100644
  --- a/mm/vmpressure.c
  +++ b/mm/vmpressure.c
  @@ -137,6 +137,7 @@ static enum vmpressure_levels 
  vmpressure_calc_level(unsigned long scanned,
   struct vmpressure_event {
  struct eventfd_ctx *efd;
  enum vmpressure_levels level;
  +   bool strict_mode;
 
 Any reason to not using a flag for this? If there are any other modes to
 come them we would end up with zilions of bools which is not very nice.

Good point, I'll change it.

 
  struct list_head node;
   };
   
  @@ -153,6 +154,9 @@ static bool vmpressure_event(struct vmpressure *vmpr,
   
  list_for_each_entry(ev, vmpr-events, node) {
  if (level = ev-level) {
  +   /* strict mode ensures level == ev-level */
  +   if (ev-strict_mode  level != ev-level)
  +   continue;
  eventfd_signal(ev-efd, 1);
  signalled = true;
  }
  @@ -292,7 +296,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup 
  *memcg, int prio)
* infrastructure, so that the notifications will be delivered to the
* @eventfd. The @args parameter is a string that denotes pressure level
* threshold (one of vmpressure_str_levels, i.e. low, medium, or
  - * critical).
  + * critical) and optionally a different operating mode (i.e. strict)
*
* This function should not be used directly, just pass it to (struct
* cftype).register_event, and then cgroup core will handle everything by
  @@ -303,22 +307,33 @@ int vmpressure_register_event(struct cgroup *cg, 
  struct cftype *cft,
   {
  struct vmpressure *vmpr = cg_to_vmpressure(cg);
  struct vmpressure_event *ev;
  +   bool smode = false;
  +   const char *p;
  int level;
   
  for (level = 0; level  VMPRESSURE_NUM_LEVELS; level++) {
  -   if (!strcmp(vmpressure_str_levels[level], args))
  +   p = vmpressure_str_levels[level];
  +   if (!strncmp(p, args, strlen(p)))
  break;
  }
   
  if (level = VMPRESSURE_NUM_LEVELS)
  return -EINVAL;
   
  +   p = strchr(args, ' ');
  +   if (p) {
  +   if (strncmp(++p, strict, 6))
  +   return -EINVAL;
  +   smode = true;
  +   }
  +
  ev = kzalloc(sizeof(*ev

[PATCH v2] vmpressure: implement strict mode

2013-06-26 Thread Luiz Capitulino
Currently, an eventfd is notified for the level it's registered for
_plus_ higher levels.

This is a problem if an application wants to implement different
actions for different levels. For example, an application might want
to release 10% of its cache on level low, 50% on medium and 100% on
critical. To do this, an application has to register a different
eventfd for each pressure level. However, fd low is always going to
be notified and and all fds are going to be notified on level critical.

Strict mode solves this problem by strictly notifiying an eventfd
for the pressure level it registered for. This new mode is optional,
by default we still notify eventfds on higher levels too.

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---

o v2

 - Improve documentation
 - Use a bit to store mode instead of a bool
 - Minor changelog changes

 Documentation/cgroups/memory.txt | 26 ++
 mm/vmpressure.c  | 26 --
 2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..412872b 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they can 
to help the
 system. It might be too late to consult with vmstat or any other
 statistics, so it's advisable to take an immediate action.
 
+Applications can also choose between two notification modes when
+registering an eventfd for memory pressure events:
+
+When in non-strict mode, an eventfd is notified for the specific level
+it's registered for and higher levels. For example, an eventfd registered
+for low level is also going to be notified on medium and critical levels.
+This mode makes sense for applications interested on monitoring reclaim
+activity or implementing simple load-balacing logic. The non-strict mode
+is the default notification mode.
+
+When in strict mode, an eventfd is strictly notified for the pressure
+level it's registered for. For example, an eventfd registered for the low
+level event is not going to be notified when memory pressure gets into
+medium or critical levels. This allows for more complex logic based on
+the actual pressure level the system is experiencing.
+
 The events are propagated upward until the event is handled, i.e. the
 events are not pass-through. Here is what this means: for example you have
 three cgroups: A-B-C. Now you set up an event listener on cgroups A, B
@@ -807,12 +823,14 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like event_fd fd of memory.pressure_level level
+- write string like event_fd fd of memory.pressure_level level [strict]
   to cgroup.event_control.
 
-Application will be notified through eventfd when memory pressure is at
-the specific level (or higher). Read/write operations to
-memory.pressure_level are no implemented.
+Applications will be notified through eventfd when memory pressure is at
+the specific level or higher. If strict is passed, then applications
+will only be notified when memory pressure reaches the specified level.
+
+Read/write operations to memory.pressure_level are no implemented.
 
 Test:
 
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..ba5c17e 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -138,8 +138,16 @@ struct vmpressure_event {
struct eventfd_ctx *efd;
enum vmpressure_levels level;
struct list_head node;
+   unsigned int mode;
 };
 
+#define VMPRESSURE_MODE_STRICT 1
+
+static inline bool vmpressure_mode_is_strict(const struct vmpressure_event *ev)
+{
+   return ev-mode  VMPRESSURE_MODE_STRICT;
+}
+
 static bool vmpressure_event(struct vmpressure *vmpr,
 unsigned long scanned, unsigned long reclaimed)
 {
@@ -153,6 +161,9 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
list_for_each_entry(ev, vmpr-events, node) {
if (level = ev-level) {
+   /* strict mode ensures level == ev-level */
+   if (vmpressure_mode_is_strict(ev)  level != ev-level)
+   continue;
eventfd_signal(ev-efd, 1);
signalled = true;
}
@@ -292,7 +303,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, 
int prio)
  * infrastructure, so that the notifications will be delivered to the
  * @eventfd. The @args parameter is a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. low, medium, or
- * critical).
+ * critical) and optionally a different operating mode (i.e. strict)
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,22 +314,33 @@ int vmpressure_register_event(struct cgroup

Re: [PATCH] vmpressure: implement strict mode

2013-06-25 Thread Luiz Capitulino
On Wed, 26 Jun 2013 10:12:15 +0900
Hyunhee Kim  wrote:

> Please see "[PATCH v3] memcg: event control at vmpressure". mail
> thread. (and also the thread I sent last Saturday.)
> There was discussion on this mode not sending lower events when "level
> != ev->level".

The new argument this patch adds should be orthogonal to what has been
discussed and suggested in that thread. The patches conflict though, but
it's just a matter of rebasing if both are ACKed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] vmpressure: implement strict mode

2013-06-25 Thread Luiz Capitulino
Currently, applications are notified for the level they registered for
_plus_ higher levels.

This is a problem if the application wants to implement different
actions for different levels. For example, an application might want
to release 10% of its cache on level low, 50% on medium and 100% on
critical. To do this, the application has to register a different fd
for each event. However, fd low is always going to be notified and
and all fds are going to be notified on level critical.

Strict mode solves this problem by strictly notifiying the event
an fd has registered for. It's optional. By default we still notify
on higher levels.

Signed-off-by: Luiz Capitulino 
---

PS: I'm following the discussion on the event storm problem, but I believe
strict mode is orthogonal to what has been suggested (although the
patches conflict)

 Documentation/cgroups/memory.txt | 10 ++
 mm/vmpressure.c  | 19 +--
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..3c589cf 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -807,12 +807,14 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "  "
+- write string like "   [strict]"
   to cgroup.event_control.
 
-Application will be notified through eventfd when memory pressure is at
-the specific level (or higher). Read/write operations to
-memory.pressure_level are no implemented.
+Applications will be notified through eventfd when memory pressure is at
+the specific level or higher. If strict is passed, then applications
+will only be notified when memory pressure reaches the specified level.
+
+Read/write operations to memory.pressure_level are no implemented.
 
 Test:
 
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..6289ede 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,6 +137,7 @@ static enum vmpressure_levels 
vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
struct eventfd_ctx *efd;
enum vmpressure_levels level;
+   bool strict_mode;
struct list_head node;
 };
 
@@ -153,6 +154,9 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
list_for_each_entry(ev, >events, node) {
if (level >= ev->level) {
+   /* strict mode ensures level == ev->level */
+   if (ev->strict_mode && level != ev->level)
+   continue;
eventfd_signal(ev->efd, 1);
signalled = true;
}
@@ -292,7 +296,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, 
int prio)
  * infrastructure, so that the notifications will be delivered to the
  * @eventfd. The @args parameter is a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
- * "critical").
+ * "critical") and optionally a different operating mode (i.e. "strict")
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,22 +307,33 @@ int vmpressure_register_event(struct cgroup *cg, struct 
cftype *cft,
 {
struct vmpressure *vmpr = cg_to_vmpressure(cg);
struct vmpressure_event *ev;
+   bool smode = false;
+   const char *p;
int level;
 
for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
-   if (!strcmp(vmpressure_str_levels[level], args))
+   p = vmpressure_str_levels[level];
+   if (!strncmp(p, args, strlen(p)))
break;
}
 
if (level >= VMPRESSURE_NUM_LEVELS)
return -EINVAL;
 
+   p = strchr(args, ' ');
+   if (p) {
+   if (strncmp(++p, "strict", 6))
+   return -EINVAL;
+   smode = true;
+   }
+
ev = kzalloc(sizeof(*ev), GFP_KERNEL);
if (!ev)
return -ENOMEM;
 
ev->efd = eventfd;
ev->level = level;
+   ev->strict_mode = smode;
 
mutex_lock(>events_lock);
list_add(>node, >events);
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] vmpressure: implement strict mode

2013-06-25 Thread Luiz Capitulino
Currently, applications are notified for the level they registered for
_plus_ higher levels.

This is a problem if the application wants to implement different
actions for different levels. For example, an application might want
to release 10% of its cache on level low, 50% on medium and 100% on
critical. To do this, the application has to register a different fd
for each event. However, fd low is always going to be notified and
and all fds are going to be notified on level critical.

Strict mode solves this problem by strictly notifiying the event
an fd has registered for. It's optional. By default we still notify
on higher levels.

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---

PS: I'm following the discussion on the event storm problem, but I believe
strict mode is orthogonal to what has been suggested (although the
patches conflict)

 Documentation/cgroups/memory.txt | 10 ++
 mm/vmpressure.c  | 19 +--
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..3c589cf 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -807,12 +807,14 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like event_fd fd of memory.pressure_level level
+- write string like event_fd fd of memory.pressure_level level [strict]
   to cgroup.event_control.
 
-Application will be notified through eventfd when memory pressure is at
-the specific level (or higher). Read/write operations to
-memory.pressure_level are no implemented.
+Applications will be notified through eventfd when memory pressure is at
+the specific level or higher. If strict is passed, then applications
+will only be notified when memory pressure reaches the specified level.
+
+Read/write operations to memory.pressure_level are no implemented.
 
 Test:
 
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..6289ede 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,6 +137,7 @@ static enum vmpressure_levels 
vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
struct eventfd_ctx *efd;
enum vmpressure_levels level;
+   bool strict_mode;
struct list_head node;
 };
 
@@ -153,6 +154,9 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
list_for_each_entry(ev, vmpr-events, node) {
if (level = ev-level) {
+   /* strict mode ensures level == ev-level */
+   if (ev-strict_mode  level != ev-level)
+   continue;
eventfd_signal(ev-efd, 1);
signalled = true;
}
@@ -292,7 +296,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, 
int prio)
  * infrastructure, so that the notifications will be delivered to the
  * @eventfd. The @args parameter is a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. low, medium, or
- * critical).
+ * critical) and optionally a different operating mode (i.e. strict)
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,22 +307,33 @@ int vmpressure_register_event(struct cgroup *cg, struct 
cftype *cft,
 {
struct vmpressure *vmpr = cg_to_vmpressure(cg);
struct vmpressure_event *ev;
+   bool smode = false;
+   const char *p;
int level;
 
for (level = 0; level  VMPRESSURE_NUM_LEVELS; level++) {
-   if (!strcmp(vmpressure_str_levels[level], args))
+   p = vmpressure_str_levels[level];
+   if (!strncmp(p, args, strlen(p)))
break;
}
 
if (level = VMPRESSURE_NUM_LEVELS)
return -EINVAL;
 
+   p = strchr(args, ' ');
+   if (p) {
+   if (strncmp(++p, strict, 6))
+   return -EINVAL;
+   smode = true;
+   }
+
ev = kzalloc(sizeof(*ev), GFP_KERNEL);
if (!ev)
return -ENOMEM;
 
ev-efd = eventfd;
ev-level = level;
+   ev-strict_mode = smode;
 
mutex_lock(vmpr-events_lock);
list_add(ev-node, vmpr-events);
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vmpressure: implement strict mode

2013-06-25 Thread Luiz Capitulino
On Wed, 26 Jun 2013 10:12:15 +0900
Hyunhee Kim hyunhee@samsung.com wrote:

 Please see [PATCH v3] memcg: event control at vmpressure. mail
 thread. (and also the thread I sent last Saturday.)
 There was discussion on this mode not sending lower events when level
 != ev-level.

The new argument this patch adds should be orthogonal to what has been
discussed and suggested in that thread. The patches conflict though, but
it's just a matter of rebasing if both are ACKed.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-17 Thread Luiz Capitulino
On Wed, 5 Jun 2013 21:18:37 -0400
Luiz Capitulino  wrote:

> The balloon_page_dequeue() function can return NULL. If it does for
> the first page being freed, then leak_balloon() will create a
> scatter list with len=0. Which in turn seems to generate an invalid
> virtio request.
> 
> I didn't get this in practice, I found it by code review. On the other
> hand, such an invalid virtio request will cause errors in QEMU and
> fill_balloon() also performs the same check implemented by this commit.
> 
> Signed-off-by: Luiz Capitulino 
> Acked-by: Rafael Aquini 

Andrew, can you pick this one?

> ---
> 
> o v2
> 
>  - Improve changelog
> 
>  drivers/virtio/virtio_balloon.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index bd3ae32..71af7b5 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, 
> size_t num)
>* virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>* is true, we *have* to do it in this order
>*/
> - tell_host(vb, vb->deflate_vq);
> + if (vb->num_pfns != 0)
> + tell_host(vb, vb->deflate_vq);
>   mutex_unlock(>balloon_lock);
>   release_pages_by_pfn(vb->pfns, vb->num_pfns);
>  }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-17 Thread Luiz Capitulino
On Wed, 5 Jun 2013 21:18:37 -0400
Luiz Capitulino lcapitul...@redhat.com wrote:

 The balloon_page_dequeue() function can return NULL. If it does for
 the first page being freed, then leak_balloon() will create a
 scatter list with len=0. Which in turn seems to generate an invalid
 virtio request.
 
 I didn't get this in practice, I found it by code review. On the other
 hand, such an invalid virtio request will cause errors in QEMU and
 fill_balloon() also performs the same check implemented by this commit.
 
 Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
 Acked-by: Rafael Aquini aqu...@redhat.com

Andrew, can you pick this one?

 ---
 
 o v2
 
  - Improve changelog
 
  drivers/virtio/virtio_balloon.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
 index bd3ae32..71af7b5 100644
 --- a/drivers/virtio/virtio_balloon.c
 +++ b/drivers/virtio/virtio_balloon.c
 @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, 
 size_t num)
* virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
* is true, we *have* to do it in this order
*/
 - tell_host(vb, vb-deflate_vq);
 + if (vb-num_pfns != 0)
 + tell_host(vb, vb-deflate_vq);
   mutex_unlock(vb-balloon_lock);
   release_pages_by_pfn(vb-pfns, vb-num_pfns);
  }

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-06 Thread Luiz Capitulino
On Thu, 6 Jun 2013 11:13:58 -0300
Rafael Aquini  wrote:

> On Wed, Jun 05, 2013 at 09:18:37PM -0400, Luiz Capitulino wrote:
> > The balloon_page_dequeue() function can return NULL. If it does for
> > the first page being freed, then leak_balloon() will create a
> > scatter list with len=0. Which in turn seems to generate an invalid
> > virtio request.
> > 
> > I didn't get this in practice, I found it by code review. On the other
> > hand, such an invalid virtio request will cause errors in QEMU and
> > fill_balloon() also performs the same check implemented by this commit.
> > 
> > Signed-off-by: Luiz Capitulino 
> > Acked-by: Rafael Aquini 
> > ---
> > 
> > o v2
> > 
> >  - Improve changelog
> > 
> >  drivers/virtio/virtio_balloon.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index bd3ae32..71af7b5 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, 
> > size_t num)
> >  * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
> >  * is true, we *have* to do it in this order
> >  */
> > -   tell_host(vb, vb->deflate_vq);
> 
> Luiz, sorry for not being clearer before. I was referring to add a commentary 
> on
> code, to explain in short words why we should not get rid of this check point.

Oh.

> > +   if (vb->num_pfns != 0)
> > +   tell_host(vb, vb->deflate_vq);
> > mutex_unlock(>balloon_lock);
> 
> If the comment is regarded as unnecessary, then just ignore my suggestion. I'm
> OK with your patch. :)

IMHO, the code is clear enough.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-06 Thread Luiz Capitulino
On Thu, 6 Jun 2013 11:13:58 -0300
Rafael Aquini aqu...@redhat.com wrote:

 On Wed, Jun 05, 2013 at 09:18:37PM -0400, Luiz Capitulino wrote:
  The balloon_page_dequeue() function can return NULL. If it does for
  the first page being freed, then leak_balloon() will create a
  scatter list with len=0. Which in turn seems to generate an invalid
  virtio request.
  
  I didn't get this in practice, I found it by code review. On the other
  hand, such an invalid virtio request will cause errors in QEMU and
  fill_balloon() also performs the same check implemented by this commit.
  
  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
  Acked-by: Rafael Aquini aqu...@redhat.com
  ---
  
  o v2
  
   - Improve changelog
  
   drivers/virtio/virtio_balloon.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)
  
  diff --git a/drivers/virtio/virtio_balloon.c 
  b/drivers/virtio/virtio_balloon.c
  index bd3ae32..71af7b5 100644
  --- a/drivers/virtio/virtio_balloon.c
  +++ b/drivers/virtio/virtio_balloon.c
  @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, 
  size_t num)
   * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
   * is true, we *have* to do it in this order
   */
  -   tell_host(vb, vb-deflate_vq);
 
 Luiz, sorry for not being clearer before. I was referring to add a commentary 
 on
 code, to explain in short words why we should not get rid of this check point.

Oh.

  +   if (vb-num_pfns != 0)
  +   tell_host(vb, vb-deflate_vq);
  mutex_unlock(vb-balloon_lock);
 
 If the comment is regarded as unnecessary, then just ignore my suggestion. I'm
 OK with your patch. :)

IMHO, the code is clear enough.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-05 Thread Luiz Capitulino
The balloon_page_dequeue() function can return NULL. If it does for
the first page being freed, then leak_balloon() will create a
scatter list with len=0. Which in turn seems to generate an invalid
virtio request.

I didn't get this in practice, I found it by code review. On the other
hand, such an invalid virtio request will cause errors in QEMU and
fill_balloon() also performs the same check implemented by this commit.

Signed-off-by: Luiz Capitulino 
Acked-by: Rafael Aquini 
---

o v2

 - Improve changelog

 drivers/virtio/virtio_balloon.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bd3ae32..71af7b5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 * is true, we *have* to do it in this order
 */
-   tell_host(vb, vb->deflate_vq);
+   if (vb->num_pfns != 0)
+   tell_host(vb, vb->deflate_vq);
mutex_unlock(>balloon_lock);
release_pages_by_pfn(vb->pfns, vb->num_pfns);
 }
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-05 Thread Luiz Capitulino
On Wed, 5 Jun 2013 18:24:49 -0300
Rafael Aquini  wrote:

> On Wed, Jun 05, 2013 at 05:10:31PM -0400, Luiz Capitulino wrote:
> > The balloon_page_dequeue() function can return NULL. If it does for
> > the first page being freed, then leak_balloon() will create a
> > scatter list with len=0. Which in turn seems to generate an invalid
> > virtio request.
> > 
> > Signed-off-by: Luiz Capitulino 
> > ---
> > 
> > PS: I didn't get this in practice. I found it by code review. On the other
> > hand, automatic-ballooning was able to put such invalid requests in
> > the virtqueue and QEMU would explode...
> >
> 
> Nice catch! The patch looks sane and replicates the check done at
> fill_balloon(). I think we also could use this P.S. as a commentary 
> to let others aware of this scenario. Thanks Luiz!

Want me to respin?

> Acked-by: Rafael Aquini 

Thanks for your review!

> 
>  
> > PPS: Very lightly tested
> >
> >  drivers/virtio/virtio_balloon.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index bd3ae32..71af7b5 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, 
> > size_t num)
> >  * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
> >  * is true, we *have* to do it in this order
> >  */
> > -   tell_host(vb, vb->deflate_vq);
> > +   if (vb->num_pfns != 0)
> > +   tell_host(vb, vb->deflate_vq);
> > mutex_unlock(>balloon_lock);
> > release_pages_by_pfn(vb->pfns, vb->num_pfns);
> >  }
> > -- 
> > 1.8.1.4
> > 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-05 Thread Luiz Capitulino
The balloon_page_dequeue() function can return NULL. If it does for
the first page being freed, then leak_balloon() will create a
scatter list with len=0. Which in turn seems to generate an invalid
virtio request.

Signed-off-by: Luiz Capitulino 
---

PS: I didn't get this in practice. I found it by code review. On the other
hand, automatic-ballooning was able to put such invalid requests in
the virtqueue and QEMU would explode...

PPS: Very lightly tested

 drivers/virtio/virtio_balloon.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bd3ae32..71af7b5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 * is true, we *have* to do it in this order
 */
-   tell_host(vb, vb->deflate_vq);
+   if (vb->num_pfns != 0)
+   tell_host(vb, vb->deflate_vq);
mutex_unlock(>balloon_lock);
release_pages_by_pfn(vb->pfns, vb->num_pfns);
 }
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-05 Thread Luiz Capitulino
The balloon_page_dequeue() function can return NULL. If it does for
the first page being freed, then leak_balloon() will create a
scatter list with len=0. Which in turn seems to generate an invalid
virtio request.

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---

PS: I didn't get this in practice. I found it by code review. On the other
hand, automatic-ballooning was able to put such invalid requests in
the virtqueue and QEMU would explode...

PPS: Very lightly tested

 drivers/virtio/virtio_balloon.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bd3ae32..71af7b5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 * is true, we *have* to do it in this order
 */
-   tell_host(vb, vb-deflate_vq);
+   if (vb-num_pfns != 0)
+   tell_host(vb, vb-deflate_vq);
mutex_unlock(vb-balloon_lock);
release_pages_by_pfn(vb-pfns, vb-num_pfns);
 }
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-05 Thread Luiz Capitulino
On Wed, 5 Jun 2013 18:24:49 -0300
Rafael Aquini aqu...@redhat.com wrote:

 On Wed, Jun 05, 2013 at 05:10:31PM -0400, Luiz Capitulino wrote:
  The balloon_page_dequeue() function can return NULL. If it does for
  the first page being freed, then leak_balloon() will create a
  scatter list with len=0. Which in turn seems to generate an invalid
  virtio request.
  
  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
  ---
  
  PS: I didn't get this in practice. I found it by code review. On the other
  hand, automatic-ballooning was able to put such invalid requests in
  the virtqueue and QEMU would explode...
 
 
 Nice catch! The patch looks sane and replicates the check done at
 fill_balloon(). I think we also could use this P.S. as a commentary 
 to let others aware of this scenario. Thanks Luiz!

Want me to respin?

 Acked-by: Rafael Aquini aqu...@redhat.com

Thanks for your review!

 
  
  PPS: Very lightly tested
 
   drivers/virtio/virtio_balloon.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)
  
  diff --git a/drivers/virtio/virtio_balloon.c 
  b/drivers/virtio/virtio_balloon.c
  index bd3ae32..71af7b5 100644
  --- a/drivers/virtio/virtio_balloon.c
  +++ b/drivers/virtio/virtio_balloon.c
  @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, 
  size_t num)
   * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
   * is true, we *have* to do it in this order
   */
  -   tell_host(vb, vb-deflate_vq);
  +   if (vb-num_pfns != 0)
  +   tell_host(vb, vb-deflate_vq);
  mutex_unlock(vb-balloon_lock);
  release_pages_by_pfn(vb-pfns, vb-num_pfns);
   }
  -- 
  1.8.1.4
  
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-05 Thread Luiz Capitulino
The balloon_page_dequeue() function can return NULL. If it does for
the first page being freed, then leak_balloon() will create a
scatter list with len=0. Which in turn seems to generate an invalid
virtio request.

I didn't get this in practice, I found it by code review. On the other
hand, such an invalid virtio request will cause errors in QEMU and
fill_balloon() also performs the same check implemented by this commit.

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
Acked-by: Rafael Aquini aqu...@redhat.com
---

o v2

 - Improve changelog

 drivers/virtio/virtio_balloon.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bd3ae32..71af7b5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 * is true, we *have* to do it in this order
 */
-   tell_host(vb, vb-deflate_vq);
+   if (vb-num_pfns != 0)
+   tell_host(vb, vb-deflate_vq);
mutex_unlock(vb-balloon_lock);
release_pages_by_pfn(vb-pfns, vb-num_pfns);
 }
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 0/2] virtio_balloon: auto-ballooning support

2013-05-16 Thread Luiz Capitulino
On Thu, 16 May 2013 16:56:34 -0400
Sasha Levin  wrote:

> On 05/09/2013 10:53 AM, Luiz Capitulino wrote:
> > Hi,
> > 
> > This series is a respin of automatic ballooning support I started
> > working on last year. Patch 2/2 contains all relevant technical
> > details and performance measurements results.
> > 
> > This is in RFC state because it's a work in progress.
> 
> Hi Luiz,
> 
> Is there a virtio spec patch I could use to get it implemented on
> kvmtool?

Not yet, this will come with v1. But I got some homework to do before
posting it (more perf tests).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 0/2] virtio_balloon: auto-ballooning support

2013-05-16 Thread Luiz Capitulino
On Thu, 16 May 2013 16:56:34 -0400
Sasha Levin sasha.le...@oracle.com wrote:

 On 05/09/2013 10:53 AM, Luiz Capitulino wrote:
  Hi,
  
  This series is a respin of automatic ballooning support I started
  working on last year. Patch 2/2 contains all relevant technical
  details and performance measurements results.
  
  This is in RFC state because it's a work in progress.
 
 Hi Luiz,
 
 Is there a virtio spec patch I could use to get it implemented on
 kvmtool?

Not yet, this will come with v1. But I got some homework to do before
posting it (more perf tests).
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] package: Makefile: unbreak binrpm-pkg target

2013-05-15 Thread Luiz Capitulino
Commit 6501320311664e11ad5489fedb17ea1f817c2617 dropped the rpm spec as a
prerequisite for the binrpm-pkg target but forgot to update $< usage,
which causes the rule to break.

This commit fixes that by replacing $< with the spec name.

Signed-off-by: Luiz Capitulino 
---
 scripts/package/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/package/Makefile b/scripts/package/Makefile
index 84a4060..a4f31c9 100644
--- a/scripts/package/Makefile
+++ b/scripts/package/Makefile
@@ -63,7 +63,7 @@ binrpm-pkg: FORCE
mv -f $(objtree)/.tmp_version $(objtree)/.version
 
$(RPM) $(RPMOPTS) --define "_builddir $(objtree)" --target \
-   $(UTS_MACHINE) -bb $<
+   $(UTS_MACHINE) -bb $(objtree)/binkernel.spec
rm binkernel.spec
 
 # Deb target
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] package: Makefile: unbreak binrpm-pkg target

2013-05-15 Thread Luiz Capitulino
Commit 6501320311664e11ad5489fedb17ea1f817c2617 dropped the rpm spec as a
prerequisite for the binrpm-pkg target but forgot to update $ usage,
which causes the rule to break.

This commit fixes that by replacing $ with the spec name.

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---
 scripts/package/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/package/Makefile b/scripts/package/Makefile
index 84a4060..a4f31c9 100644
--- a/scripts/package/Makefile
+++ b/scripts/package/Makefile
@@ -63,7 +63,7 @@ binrpm-pkg: FORCE
mv -f $(objtree)/.tmp_version $(objtree)/.version
 
$(RPM) $(RPMOPTS) --define _builddir $(objtree) --target \
-   $(UTS_MACHINE) -bb $
+   $(UTS_MACHINE) -bb $(objtree)/binkernel.spec
rm binkernel.spec
 
 # Deb target
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-13 Thread Luiz Capitulino
On Mon, 13 May 2013 22:02:50 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, May 13, 2013 at 02:25:11PM -0400, Luiz Capitulino wrote:
> > On Mon, 13 May 2013 11:34:41 -0300
> > Rafael Aquini  wrote:
> > 
> > > You're right, and the host's member is used to communicate the configured 
> > > size
> > > to guest's balloon device, however, by not changing it when the shrinker 
> > > causes 
> > > the balloon to deflate will make the balloon thread to be woken up again 
> > > in order to chase the balloon size target again, won't it? Check
> > 
> > I don't see the balloon thread waking up after the shrinker executes in my
> > testing. Maybe this is so because it will only wake up when QEMU notifies
> > a config change.
> 
> Well that's also a problem.
> Need some mechanism to re-inflate balloon

In this implemention this is done by QEMU, have you looked at the QEMU
patch yet?

 https://lists.gnu.org/archive/html/qemu-devel/2013-05/msg01295.html

> when guest memory pressure is down.

In this implementation we do this when there's pressure in the host. I expect
things to balance over time, and this seems to be what I'm observing in my
testing, but of course we need more testing.

> virtio fs mechanism worth a look?

Can you elaborate?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-13 Thread Luiz Capitulino
On Mon, 13 May 2013 11:34:41 -0300
Rafael Aquini  wrote:

> You're right, and the host's member is used to communicate the configured size
> to guest's balloon device, however, by not changing it when the shrinker 
> causes 
> the balloon to deflate will make the balloon thread to be woken up again 
> in order to chase the balloon size target again, won't it? Check

I don't see the balloon thread waking up after the shrinker executes in my
testing. Maybe this is so because it will only wake up when QEMU notifies
a config change.

But anyway, I'll think how to improve this as suggested by Michael too, as
I seem to be changing num_pages' semantics according to the virtio spec.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-13 Thread Luiz Capitulino
On Sun, 12 May 2013 21:49:34 +0300
"Michael S. Tsirkin"  wrote:

> On Sun, May 12, 2013 at 12:36:09PM -0400, Rik van Riel wrote:
> > On 05/12/2013 10:30 AM, Michael S. Tsirkin wrote:
> > >On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote:
> > >>Automatic ballooning consists of dynamically adjusting the guest's
> > >>balloon according to memory pressure in the host and in the guest.
> > >>
> > >>This commit implements the guest side of automatic balloning, which
> > >>basically consists of registering a shrinker callback with the kernel,
> > >>which will try to deflate the guest's balloon by the amount of pages
> > >>being requested. The shrinker callback is only registered if the host
> > >>supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.
> > >
> > >OK so we have a new feature bit, such that:
> > >- if AUTO_BALLOON is set in host, guest can leak a
> > >   page from a balloon at any time
> > >
> > >questions left unanswered
> > >- what meaning does num_pages have now?
> > 
> > "as large as we could go"
> 
> I see. This is the reverse of it's current meaning.
> I would suggest a new field instead of overriding
> the existing one.

I'll write a spec patch as you suggested on irc and will decide what
to do from there.

> > >- when will the balloon be re-inflated?
> > 
> > I believe the qemu changes Luiz wrote address that side,
> > with qemu-kvm getting notifications from the host kernel
> > when there is memory pressure, and shrinking the guest
> > in reaction to that notification.
> 
> But it's the guest memory pressure we care about:
> 
> - host asks balloon to inflate
> later
> - guest asks balloon to deflate
> 
> with this patch guest takes priority,
> balloon deflates. So we should only inflate
> if guest does not need the memory.

Inflate will actually fail if the guest doesn't have memory to fill
the balloon. But in any case, and as you said elsewhere in this
thread, inflate is not new and could be even done by mngt. So I don't
think this should be changed in this patch.

> > >I'd like to see a spec patch addressing these questions.
> > >
> > >Would we ever want to mix the two types of ballooning?
> > >If yes possibly when we put a page in the balloon we
> > >might want to give host a hint "this page might be
> > >leaked again soon".
> > 
> > It might not be the same page, and the host really does
> > not care which page it is.
> 
> Whether we care depends on what we do with the page.
> For example, in the future we might care which numa node is
> used.
> 
> > The automatic inflation happens when the host needs to
> > free up memory.
> 
> This can be done today by management, with no need to
> change qemu. So automatic inflate, IMHO does not need
> a feature flag. It's the automatic deflate in guest that's new.

Makes sense.

> > >>Automatic inflate is performed by the host.
> > >>
> > >>Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
> > >>in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
> > >>SWAP IN and SWAP OUT correspond to the number of pages swapped in and
> > >>swapped out, respectively.
> > >>
> > >>Auto-ballooning disabled:
> > >>
> > >>RUN  TIME(s)  SWAP IN  SWAP OUT
> > >>
> > >>1634  930980   1588522
> > >>2610  627422   1362174
> > >>3649  1079847  1616367
> > >>4543  953289   1635379
> > >>5642  913237   1514000
> > >>
> > >>Auto-ballooning enabled:
> > >>
> > >>RUN  TIME(s)  SWAP IN  SWAP OUT
> > >>
> > >>1629  901  12537
> > >>2624  981  18506
> > >>3626  573  9085
> > >>4631  2250 42534
> > >>5627  1610 20808
> > >
> > >So what exactly happens here?
> > >Much less swap in/out activity, but no gain
> > >in total runtime. Doesn't this show there's
> > >a bottleneck somewhere? Could be a problem in
> > >the implementation?
> > 
> > It could also be an issue with the benchmark chosen,
> > which may not have swap as its bottleneck at any point.
> > 
> > However, the reduced swapping is still very promising!
> 
> Isn't this a direct result of inflating the balloon?
> E.g. just inflating the balloon without
> the shrinker 

Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-13 Thread Luiz Capitulino
On Sun, 12 May 2013 21:49:34 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 On Sun, May 12, 2013 at 12:36:09PM -0400, Rik van Riel wrote:
  On 05/12/2013 10:30 AM, Michael S. Tsirkin wrote:
  On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote:
  Automatic ballooning consists of dynamically adjusting the guest's
  balloon according to memory pressure in the host and in the guest.
  
  This commit implements the guest side of automatic balloning, which
  basically consists of registering a shrinker callback with the kernel,
  which will try to deflate the guest's balloon by the amount of pages
  being requested. The shrinker callback is only registered if the host
  supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.
  
  OK so we have a new feature bit, such that:
  - if AUTO_BALLOON is set in host, guest can leak a
 page from a balloon at any time
  
  questions left unanswered
  - what meaning does num_pages have now?
  
  as large as we could go
 
 I see. This is the reverse of it's current meaning.
 I would suggest a new field instead of overriding
 the existing one.

I'll write a spec patch as you suggested on irc and will decide what
to do from there.

  - when will the balloon be re-inflated?
  
  I believe the qemu changes Luiz wrote address that side,
  with qemu-kvm getting notifications from the host kernel
  when there is memory pressure, and shrinking the guest
  in reaction to that notification.
 
 But it's the guest memory pressure we care about:
 
 - host asks balloon to inflate
 later
 - guest asks balloon to deflate
 
 with this patch guest takes priority,
 balloon deflates. So we should only inflate
 if guest does not need the memory.

Inflate will actually fail if the guest doesn't have memory to fill
the balloon. But in any case, and as you said elsewhere in this
thread, inflate is not new and could be even done by mngt. So I don't
think this should be changed in this patch.

  I'd like to see a spec patch addressing these questions.
  
  Would we ever want to mix the two types of ballooning?
  If yes possibly when we put a page in the balloon we
  might want to give host a hint this page might be
  leaked again soon.
  
  It might not be the same page, and the host really does
  not care which page it is.
 
 Whether we care depends on what we do with the page.
 For example, in the future we might care which numa node is
 used.
 
  The automatic inflation happens when the host needs to
  free up memory.
 
 This can be done today by management, with no need to
 change qemu. So automatic inflate, IMHO does not need
 a feature flag. It's the automatic deflate in guest that's new.

Makes sense.

  Automatic inflate is performed by the host.
  
  Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
  in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
  SWAP IN and SWAP OUT correspond to the number of pages swapped in and
  swapped out, respectively.
  
  Auto-ballooning disabled:
  
  RUN  TIME(s)  SWAP IN  SWAP OUT
  
  1634  930980   1588522
  2610  627422   1362174
  3649  1079847  1616367
  4543  953289   1635379
  5642  913237   1514000
  
  Auto-ballooning enabled:
  
  RUN  TIME(s)  SWAP IN  SWAP OUT
  
  1629  901  12537
  2624  981  18506
  3626  573  9085
  4631  2250 42534
  5627  1610 20808
  
  So what exactly happens here?
  Much less swap in/out activity, but no gain
  in total runtime. Doesn't this show there's
  a bottleneck somewhere? Could be a problem in
  the implementation?
  
  It could also be an issue with the benchmark chosen,
  which may not have swap as its bottleneck at any point.
  
  However, the reduced swapping is still very promising!
 
 Isn't this a direct result of inflating the balloon?
 E.g. just inflating the balloon without
 the shrinker will make us avoid swap in host.
 What we would want to check is whether shrinking works as
 expected, and whether we need to speed up shrinking.
 
 As I say above, inflating the balloon is easy.
 A good benchmark would show how we can
 deflate and re-inflate it efficiently with
 demand.

I'm going to measure this.

  Also, what happened with the balloon?
  Did we end up with balloon completely inflated? deflated?

In my test-case VMs are started with 1G. After the test, almost all
of them have between 200-300MB.

  
  One question to consider: possibly if we are
  going to reuse the page in the balloon soon,
  we might want to bypass notify before use for it?
  Maybe that will help speed things up.
  
  
  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
  ---
drivers/virtio/virtio_balloon.c | 55 
   +
include/uapi/linux/virtio_balloon.h |  1 +
2 files changed, 56 insertions(+)
  
  diff --git a/drivers/virtio/virtio_balloon.c 
  b/drivers/virtio/virtio_balloon.c
  index 9d5fe2b..f9dcae8 100644

Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-13 Thread Luiz Capitulino
On Mon, 13 May 2013 11:34:41 -0300
Rafael Aquini aqu...@redhat.com wrote:

 You're right, and the host's member is used to communicate the configured size
 to guest's balloon device, however, by not changing it when the shrinker 
 causes 
 the balloon to deflate will make the balloon thread to be woken up again 
 in order to chase the balloon size target again, won't it? Check

I don't see the balloon thread waking up after the shrinker executes in my
testing. Maybe this is so because it will only wake up when QEMU notifies
a config change.

But anyway, I'll think how to improve this as suggested by Michael too, as
I seem to be changing num_pages' semantics according to the virtio spec.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-13 Thread Luiz Capitulino
On Mon, 13 May 2013 22:02:50 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 On Mon, May 13, 2013 at 02:25:11PM -0400, Luiz Capitulino wrote:
  On Mon, 13 May 2013 11:34:41 -0300
  Rafael Aquini aqu...@redhat.com wrote:
  
   You're right, and the host's member is used to communicate the configured 
   size
   to guest's balloon device, however, by not changing it when the shrinker 
   causes 
   the balloon to deflate will make the balloon thread to be woken up again 
   in order to chase the balloon size target again, won't it? Check
  
  I don't see the balloon thread waking up after the shrinker executes in my
  testing. Maybe this is so because it will only wake up when QEMU notifies
  a config change.
 
 Well that's also a problem.
 Need some mechanism to re-inflate balloon

In this implemention this is done by QEMU, have you looked at the QEMU
patch yet?

 https://lists.gnu.org/archive/html/qemu-devel/2013-05/msg01295.html

 when guest memory pressure is down.

In this implementation we do this when there's pressure in the host. I expect
things to balance over time, and this seems to be what I'm observing in my
testing, but of course we need more testing.

 virtio fs mechanism worth a look?

Can you elaborate?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-10 Thread Luiz Capitulino
On Fri, 10 May 2013 09:20:46 -0400
Luiz Capitulino  wrote:

> On Thu, 9 May 2013 18:15:19 -0300
> Rafael Aquini  wrote:
> 
> > On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote:
> > > Automatic ballooning consists of dynamically adjusting the guest's
> > > balloon according to memory pressure in the host and in the guest.
> > > 
> > > This commit implements the guest side of automatic balloning, which
> > > basically consists of registering a shrinker callback with the kernel,
> > > which will try to deflate the guest's balloon by the amount of pages
> > > being requested. The shrinker callback is only registered if the host
> > > supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.
> > > 
> > > Automatic inflate is performed by the host.
> > > 
> > > Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
> > > in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
> > > SWAP IN and SWAP OUT correspond to the number of pages swapped in and
> > > swapped out, respectively.
> > > 
> > > Auto-ballooning disabled:
> > > 
> > > RUN  TIME(s)  SWAP IN  SWAP OUT
> > > 
> > > 1634  930980   1588522
> > > 2610  627422   1362174
> > > 3649  1079847  1616367
> > > 4543  953289   1635379
> > > 5642  913237   1514000
> > > 
> > > Auto-ballooning enabled:
> > > 
> > > RUN  TIME(s)  SWAP IN  SWAP OUT
> > > 
> > > 1629  901  12537
> > > 2624  981  18506
> > > 3626  573  9085
> > > 4631  2250 42534
> > > 5627  1610 20808
> > > 
> > > Signed-off-by: Luiz Capitulino 
> > > ---
> > 
> > Nice work Luiz! Just allow me a silly question, though. 
> 
> I have 100% more chances of committing sillynesses than you, so please
> go ahead.
> 
> > Since your shrinker
> > doesn't change the balloon target size,
> 
> Which target size are you referring to? The one in the host (member num_pages
> of VirtIOBalloon in QEMU)?
> 
> If it the one in the host, then my understanding is that that member is only
> used to communicate the new balloon target to the guest. The guest driver
> will only read it when told (by the host) to do so, and when it does the
> target value will be correct.
> 
> Am I right?
> 
> > as soon as the shrink round finishes the
> > balloon will re-inflate again, won't it? Doesn't this cause a sort of 
> > "balloon
> > thrashing" scenario, if both guest and host are suffering from memory 
> > pressure?

Forgot to say that I didn't observe this in my testing. But I'll try harder
as soon as we clarify which target size we're talking about.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-10 Thread Luiz Capitulino
On Thu, 9 May 2013 18:15:19 -0300
Rafael Aquini  wrote:

> On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote:
> > Automatic ballooning consists of dynamically adjusting the guest's
> > balloon according to memory pressure in the host and in the guest.
> > 
> > This commit implements the guest side of automatic balloning, which
> > basically consists of registering a shrinker callback with the kernel,
> > which will try to deflate the guest's balloon by the amount of pages
> > being requested. The shrinker callback is only registered if the host
> > supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.
> > 
> > Automatic inflate is performed by the host.
> > 
> > Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
> > in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
> > SWAP IN and SWAP OUT correspond to the number of pages swapped in and
> > swapped out, respectively.
> > 
> > Auto-ballooning disabled:
> > 
> > RUN  TIME(s)  SWAP IN  SWAP OUT
> > 
> > 1634  930980   1588522
> > 2610  627422   1362174
> > 3649  1079847  1616367
> > 4543  953289   1635379
> > 5642  913237   1514000
> > 
> > Auto-ballooning enabled:
> > 
> > RUN  TIME(s)  SWAP IN  SWAP OUT
> > 
> > 1629  901  12537
> > 2624  981  18506
> > 3626  573  9085
> > 4631  2250 42534
> > 5627  1610 20808
> > 
> > Signed-off-by: Luiz Capitulino 
> > ---
> 
> Nice work Luiz! Just allow me a silly question, though. 

I have 100% more chances of committing sillynesses than you, so please
go ahead.

> Since your shrinker
> doesn't change the balloon target size,

Which target size are you referring to? The one in the host (member num_pages
of VirtIOBalloon in QEMU)?

If it the one in the host, then my understanding is that that member is only
used to communicate the new balloon target to the guest. The guest driver
will only read it when told (by the host) to do so, and when it does the
target value will be correct.

Am I right?

> as soon as the shrink round finishes the
> balloon will re-inflate again, won't it? Doesn't this cause a sort of "balloon
> thrashing" scenario, if both guest and host are suffering from memory 
> pressure?
> 
> 
> The rest I have for the moment, are only nitpicks :)
> 
> 
> >  drivers/virtio/virtio_balloon.c | 55 
> > +
> >  include/uapi/linux/virtio_balloon.h |  1 +
> >  2 files changed, 56 insertions(+)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index 9d5fe2b..f9dcae8 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -71,6 +71,9 @@ struct virtio_balloon
> > /* Memory statistics */
> > int need_stats_update;
> > struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > +
> > +   /* Memory shrinker */
> > +   struct shrinker shrinker;
> >  };
> >  
> >  static struct virtio_device_id id_table[] = {
> > @@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page)
> > pfns[i] = page_to_balloon_pfn(page) + i;
> >  }
> >  
> > +/* This function should be called with vb->balloon_mutex held */
> >  static void fill_balloon(struct virtio_balloon *vb, size_t num)
> >  {
> > struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
> > @@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], 
> > unsigned int num)
> > }
> >  }
> >  
> > +/* This function should be called with vb->balloon_mutex held */
> >  static void leak_balloon(struct virtio_balloon *vb, size_t num)
> >  {
> > struct page *page;
> > @@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon 
> > *vb)
> >   , sizeof(actual));
> >  }
> >  
> > +static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb)
> > +{
> > +   return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
> > +}
> > +
> > +static int balloon_shrinker(struct shrinker *shrinker,struct 
> > shrink_control *sc)
> > +{
> > +   unsigned int nr_pages, new_target;
> > +   struct virtio_balloon *vb;
> > +
> > +   vb = container_of(shrinker, struct virtio_balloon, shrinker);
> > +   if (!mutex_trylock(>balloon_lock)) {
> > +   return -1;
> > +   }
> > +
> &

Re: [RFC 1/2] virtio_balloon: move balloon_lock mutex to callers

2013-05-10 Thread Luiz Capitulino
On Thu, 9 May 2013 18:03:09 -0300
Rafael Aquini  wrote:

> On Thu, May 09, 2013 at 10:53:48AM -0400, Luiz Capitulino wrote:
> > This commit moves the balloon_lock mutex out of the fill_balloon()
> > and leak_balloon() functions to their callers.
> > 
> > The reason for this change is that the next commit will introduce
> > a shrinker callback for the balloon driver, which will also call
> > leak_balloon() but will require different locking semantics.
> > 
> > Signed-off-by: Luiz Capitulino 
> > ---
> >  drivers/virtio/virtio_balloon.c | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index bd3ae32..9d5fe2b 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
> > size_t num)
> > /* We can only do one array worth at a time. */
> > num = min(num, ARRAY_SIZE(vb->pfns));
> >  
> > -   mutex_lock(>balloon_lock);
> > for (vb->num_pfns = 0; vb->num_pfns < num;
> >  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > struct page *page = balloon_page_enqueue(vb_dev_info);
> > @@ -154,7 +153,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
> > size_t num)
> > /* Did we get any? */
> > if (vb->num_pfns != 0)
> > tell_host(vb, vb->inflate_vq);
> > -   mutex_unlock(>balloon_lock);
> >  }
> >  
> >  static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
> > @@ -176,7 +174,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
> > size_t num)
> > /* We can only do one array worth at a time. */
> > num = min(num, ARRAY_SIZE(vb->pfns));
> >  
> > -   mutex_lock(>balloon_lock);
> > for (vb->num_pfns = 0; vb->num_pfns < num;
> >  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > page = balloon_page_dequeue(vb_dev_info);
> > @@ -192,7 +189,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
> > size_t num)
> >  * is true, we *have* to do it in this order
> >  */
> > tell_host(vb, vb->deflate_vq);
> > -   mutex_unlock(>balloon_lock);
> > release_pages_by_pfn(vb->pfns, vb->num_pfns);
> >  }
> >  
> > @@ -305,11 +301,13 @@ static int balloon(void *_vballoon)
> >  || freezing(current));
> > if (vb->need_stats_update)
> > stats_handle_request(vb);
> > +   mutex_lock(>balloon_lock);
> > if (diff > 0)
> > fill_balloon(vb, diff);
> > else if (diff < 0)
> > leak_balloon(vb, -diff);
> > update_balloon_size(vb);
> > +   mutex_unlock(>balloon_lock);
> > }
> > return 0;
> >  }
> > @@ -490,9 +488,11 @@ out:
> >  static void remove_common(struct virtio_balloon *vb)
> >  {
> > /* There might be pages left in the balloon: free them. */
> > +   mutex_lock(>balloon_lock);
> > while (vb->num_pages)
> > leak_balloon(vb, vb->num_pages);
> > update_balloon_size(vb);
> > +   mutex_unlock(>balloon_lock);
> 
> I think you will need to introduce the same change as above to 
> virtballoon_restore()

Thanks Rafael, I've fixed it in my tree.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/2] virtio_balloon: move balloon_lock mutex to callers

2013-05-10 Thread Luiz Capitulino
On Thu, 9 May 2013 18:03:09 -0300
Rafael Aquini aqu...@redhat.com wrote:

 On Thu, May 09, 2013 at 10:53:48AM -0400, Luiz Capitulino wrote:
  This commit moves the balloon_lock mutex out of the fill_balloon()
  and leak_balloon() functions to their callers.
  
  The reason for this change is that the next commit will introduce
  a shrinker callback for the balloon driver, which will also call
  leak_balloon() but will require different locking semantics.
  
  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
  ---
   drivers/virtio/virtio_balloon.c | 8 
   1 file changed, 4 insertions(+), 4 deletions(-)
  
  diff --git a/drivers/virtio/virtio_balloon.c 
  b/drivers/virtio/virtio_balloon.c
  index bd3ae32..9d5fe2b 100644
  --- a/drivers/virtio/virtio_balloon.c
  +++ b/drivers/virtio/virtio_balloon.c
  @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
  size_t num)
  /* We can only do one array worth at a time. */
  num = min(num, ARRAY_SIZE(vb-pfns));
   
  -   mutex_lock(vb-balloon_lock);
  for (vb-num_pfns = 0; vb-num_pfns  num;
   vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
  struct page *page = balloon_page_enqueue(vb_dev_info);
  @@ -154,7 +153,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
  size_t num)
  /* Did we get any? */
  if (vb-num_pfns != 0)
  tell_host(vb, vb-inflate_vq);
  -   mutex_unlock(vb-balloon_lock);
   }
   
   static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
  @@ -176,7 +174,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
  size_t num)
  /* We can only do one array worth at a time. */
  num = min(num, ARRAY_SIZE(vb-pfns));
   
  -   mutex_lock(vb-balloon_lock);
  for (vb-num_pfns = 0; vb-num_pfns  num;
   vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
  page = balloon_page_dequeue(vb_dev_info);
  @@ -192,7 +189,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
  size_t num)
   * is true, we *have* to do it in this order
   */
  tell_host(vb, vb-deflate_vq);
  -   mutex_unlock(vb-balloon_lock);
  release_pages_by_pfn(vb-pfns, vb-num_pfns);
   }
   
  @@ -305,11 +301,13 @@ static int balloon(void *_vballoon)
   || freezing(current));
  if (vb-need_stats_update)
  stats_handle_request(vb);
  +   mutex_lock(vb-balloon_lock);
  if (diff  0)
  fill_balloon(vb, diff);
  else if (diff  0)
  leak_balloon(vb, -diff);
  update_balloon_size(vb);
  +   mutex_unlock(vb-balloon_lock);
  }
  return 0;
   }
  @@ -490,9 +488,11 @@ out:
   static void remove_common(struct virtio_balloon *vb)
   {
  /* There might be pages left in the balloon: free them. */
  +   mutex_lock(vb-balloon_lock);
  while (vb-num_pages)
  leak_balloon(vb, vb-num_pages);
  update_balloon_size(vb);
  +   mutex_unlock(vb-balloon_lock);
 
 I think you will need to introduce the same change as above to 
 virtballoon_restore()

Thanks Rafael, I've fixed it in my tree.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-10 Thread Luiz Capitulino
On Thu, 9 May 2013 18:15:19 -0300
Rafael Aquini aqu...@redhat.com wrote:

 On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote:
  Automatic ballooning consists of dynamically adjusting the guest's
  balloon according to memory pressure in the host and in the guest.
  
  This commit implements the guest side of automatic balloning, which
  basically consists of registering a shrinker callback with the kernel,
  which will try to deflate the guest's balloon by the amount of pages
  being requested. The shrinker callback is only registered if the host
  supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.
  
  Automatic inflate is performed by the host.
  
  Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
  in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
  SWAP IN and SWAP OUT correspond to the number of pages swapped in and
  swapped out, respectively.
  
  Auto-ballooning disabled:
  
  RUN  TIME(s)  SWAP IN  SWAP OUT
  
  1634  930980   1588522
  2610  627422   1362174
  3649  1079847  1616367
  4543  953289   1635379
  5642  913237   1514000
  
  Auto-ballooning enabled:
  
  RUN  TIME(s)  SWAP IN  SWAP OUT
  
  1629  901  12537
  2624  981  18506
  3626  573  9085
  4631  2250 42534
  5627  1610 20808
  
  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
  ---
 
 Nice work Luiz! Just allow me a silly question, though. 

I have 100% more chances of committing sillynesses than you, so please
go ahead.

 Since your shrinker
 doesn't change the balloon target size,

Which target size are you referring to? The one in the host (member num_pages
of VirtIOBalloon in QEMU)?

If it the one in the host, then my understanding is that that member is only
used to communicate the new balloon target to the guest. The guest driver
will only read it when told (by the host) to do so, and when it does the
target value will be correct.

Am I right?

 as soon as the shrink round finishes the
 balloon will re-inflate again, won't it? Doesn't this cause a sort of balloon
 thrashing scenario, if both guest and host are suffering from memory 
 pressure?
 
 
 The rest I have for the moment, are only nitpicks :)
 
 
   drivers/virtio/virtio_balloon.c | 55 
  +
   include/uapi/linux/virtio_balloon.h |  1 +
   2 files changed, 56 insertions(+)
  
  diff --git a/drivers/virtio/virtio_balloon.c 
  b/drivers/virtio/virtio_balloon.c
  index 9d5fe2b..f9dcae8 100644
  --- a/drivers/virtio/virtio_balloon.c
  +++ b/drivers/virtio/virtio_balloon.c
  @@ -71,6 +71,9 @@ struct virtio_balloon
  /* Memory statistics */
  int need_stats_update;
  struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
  +
  +   /* Memory shrinker */
  +   struct shrinker shrinker;
   };
   
   static struct virtio_device_id id_table[] = {
  @@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page)
  pfns[i] = page_to_balloon_pfn(page) + i;
   }
   
  +/* This function should be called with vb-balloon_mutex held */
   static void fill_balloon(struct virtio_balloon *vb, size_t num)
   {
  struct balloon_dev_info *vb_dev_info = vb-vb_dev_info;
  @@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], 
  unsigned int num)
  }
   }
   
  +/* This function should be called with vb-balloon_mutex held */
   static void leak_balloon(struct virtio_balloon *vb, size_t num)
   {
  struct page *page;
  @@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon 
  *vb)
actual, sizeof(actual));
   }
   
  +static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb)
  +{
  +   return vb-num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
  +}
  +
  +static int balloon_shrinker(struct shrinker *shrinker,struct 
  shrink_control *sc)
  +{
  +   unsigned int nr_pages, new_target;
  +   struct virtio_balloon *vb;
  +
  +   vb = container_of(shrinker, struct virtio_balloon, shrinker);
  +   if (!mutex_trylock(vb-balloon_lock)) {
  +   return -1;
  +   }
  +
  +   nr_pages = balloon_get_nr_pages(vb);
  +   if (!sc-nr_to_scan || !nr_pages) {
  +   goto out;
  +   }
  +
  +   /*
  +* If the current balloon size is greater than the number of
  +* pages being reclaimed by the kernel, deflate only the needed
  +* amount. Otherwise deflate everything we have.
  +*/
  +   new_target = 0;
  +   if (nr_pages  sc-nr_to_scan) {
  +   new_target = nr_pages - sc-nr_to_scan;
  +   }
  +
 
 CodingStyle: you don't need the curly-braces for all these single staments 
 above

Oh, this comes from QEMU coding style. Fixed.

  +   leak_balloon(vb, new_target);
  +   update_balloon_size(vb);
  +   nr_pages = balloon_get_nr_pages(vb);
  +
  +out:
  +   mutex_unlock(vb-balloon_lock);
  +   return nr_pages;
  +}
  +
   static int balloon(void

Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-10 Thread Luiz Capitulino
On Fri, 10 May 2013 09:20:46 -0400
Luiz Capitulino lcapitul...@redhat.com wrote:

 On Thu, 9 May 2013 18:15:19 -0300
 Rafael Aquini aqu...@redhat.com wrote:
 
  On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote:
   Automatic ballooning consists of dynamically adjusting the guest's
   balloon according to memory pressure in the host and in the guest.
   
   This commit implements the guest side of automatic balloning, which
   basically consists of registering a shrinker callback with the kernel,
   which will try to deflate the guest's balloon by the amount of pages
   being requested. The shrinker callback is only registered if the host
   supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.
   
   Automatic inflate is performed by the host.
   
   Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
   in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
   SWAP IN and SWAP OUT correspond to the number of pages swapped in and
   swapped out, respectively.
   
   Auto-ballooning disabled:
   
   RUN  TIME(s)  SWAP IN  SWAP OUT
   
   1634  930980   1588522
   2610  627422   1362174
   3649  1079847  1616367
   4543  953289   1635379
   5642  913237   1514000
   
   Auto-ballooning enabled:
   
   RUN  TIME(s)  SWAP IN  SWAP OUT
   
   1629  901  12537
   2624  981  18506
   3626  573  9085
   4631  2250 42534
   5627  1610 20808
   
   Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
   ---
  
  Nice work Luiz! Just allow me a silly question, though. 
 
 I have 100% more chances of committing sillynesses than you, so please
 go ahead.
 
  Since your shrinker
  doesn't change the balloon target size,
 
 Which target size are you referring to? The one in the host (member num_pages
 of VirtIOBalloon in QEMU)?
 
 If it the one in the host, then my understanding is that that member is only
 used to communicate the new balloon target to the guest. The guest driver
 will only read it when told (by the host) to do so, and when it does the
 target value will be correct.
 
 Am I right?
 
  as soon as the shrink round finishes the
  balloon will re-inflate again, won't it? Doesn't this cause a sort of 
  balloon
  thrashing scenario, if both guest and host are suffering from memory 
  pressure?

Forgot to say that I didn't observe this in my testing. But I'll try harder
as soon as we clarify which target size we're talking about.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 1/2] virtio_balloon: move balloon_lock mutex to callers

2013-05-09 Thread Luiz Capitulino
This commit moves the balloon_lock mutex out of the fill_balloon()
and leak_balloon() functions to their callers.

The reason for this change is that the next commit will introduce
a shrinker callback for the balloon driver, which will also call
leak_balloon() but will require different locking semantics.

Signed-off-by: Luiz Capitulino 
---
 drivers/virtio/virtio_balloon.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bd3ae32..9d5fe2b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
 
-   mutex_lock(>balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
struct page *page = balloon_page_enqueue(vb_dev_info);
@@ -154,7 +153,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
/* Did we get any? */
if (vb->num_pfns != 0)
tell_host(vb, vb->inflate_vq);
-   mutex_unlock(>balloon_lock);
 }
 
 static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -176,7 +174,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
 
-   mutex_lock(>balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
page = balloon_page_dequeue(vb_dev_info);
@@ -192,7 +189,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
 * is true, we *have* to do it in this order
 */
tell_host(vb, vb->deflate_vq);
-   mutex_unlock(>balloon_lock);
release_pages_by_pfn(vb->pfns, vb->num_pfns);
 }
 
@@ -305,11 +301,13 @@ static int balloon(void *_vballoon)
 || freezing(current));
if (vb->need_stats_update)
stats_handle_request(vb);
+   mutex_lock(>balloon_lock);
if (diff > 0)
fill_balloon(vb, diff);
else if (diff < 0)
leak_balloon(vb, -diff);
update_balloon_size(vb);
+   mutex_unlock(>balloon_lock);
}
return 0;
 }
@@ -490,9 +488,11 @@ out:
 static void remove_common(struct virtio_balloon *vb)
 {
/* There might be pages left in the balloon: free them. */
+   mutex_lock(>balloon_lock);
while (vb->num_pages)
leak_balloon(vb, vb->num_pages);
update_balloon_size(vb);
+   mutex_unlock(>balloon_lock);
 
/* Now we reset the device so we can clean up the queues. */
vb->vdev->config->reset(vb->vdev);
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-09 Thread Luiz Capitulino
Automatic ballooning consists of dynamically adjusting the guest's
balloon according to memory pressure in the host and in the guest.

This commit implements the guest side of automatic balloning, which
basically consists of registering a shrinker callback with the kernel,
which will try to deflate the guest's balloon by the amount of pages
being requested. The shrinker callback is only registered if the host
supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.

Automatic inflate is performed by the host.

Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
SWAP IN and SWAP OUT correspond to the number of pages swapped in and
swapped out, respectively.

Auto-ballooning disabled:

RUN  TIME(s)  SWAP IN  SWAP OUT

1634  930980   1588522
2610  627422   1362174
3649  1079847  1616367
4543  953289   1635379
5642  913237   1514000

Auto-ballooning enabled:

RUN  TIME(s)  SWAP IN  SWAP OUT

1629  901  12537
2624  981  18506
3626  573  9085
4631  2250 42534
5627  1610 20808

Signed-off-by: Luiz Capitulino 
---
 drivers/virtio/virtio_balloon.c | 55 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 56 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9d5fe2b..f9dcae8 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -71,6 +71,9 @@ struct virtio_balloon
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+   /* Memory shrinker */
+   struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page)
pfns[i] = page_to_balloon_pfn(page) + i;
 }
 
+/* This function should be called with vb->balloon_mutex held */
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
@@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned 
int num)
}
 }
 
+/* This function should be called with vb->balloon_mutex held */
 static void leak_balloon(struct virtio_balloon *vb, size_t num)
 {
struct page *page;
@@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon *vb)
  , sizeof(actual));
 }
 
+static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb)
+{
+   return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control 
*sc)
+{
+   unsigned int nr_pages, new_target;
+   struct virtio_balloon *vb;
+
+   vb = container_of(shrinker, struct virtio_balloon, shrinker);
+   if (!mutex_trylock(>balloon_lock)) {
+   return -1;
+   }
+
+   nr_pages = balloon_get_nr_pages(vb);
+   if (!sc->nr_to_scan || !nr_pages) {
+   goto out;
+   }
+
+   /*
+* If the current balloon size is greater than the number of
+* pages being reclaimed by the kernel, deflate only the needed
+* amount. Otherwise deflate everything we have.
+*/
+   new_target = 0;
+   if (nr_pages > sc->nr_to_scan) {
+   new_target = nr_pages - sc->nr_to_scan;
+   }
+
+   leak_balloon(vb, new_target);
+   update_balloon_size(vb);
+   nr_pages = balloon_get_nr_pages(vb);
+
+out:
+   mutex_unlock(>balloon_lock);
+   return nr_pages;
+}
+
 static int balloon(void *_vballoon)
 {
struct virtio_balloon *vb = _vballoon;
@@ -471,6 +515,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_del_vqs;
}
 
+   memset(>shrinker, 0, sizeof(vb->shrinker));
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_AUTO_BALLOON)) {
+   vb->shrinker.shrink = balloon_shrinker;
+   vb->shrinker.seeks = DEFAULT_SEEKS;
+   register_shrinker(>shrinker);
+   }
+
return 0;
 
 out_del_vqs:
@@ -487,6 +538,9 @@ out:
 
 static void remove_common(struct virtio_balloon *vb)
 {
+   if (vb->shrinker.shrink)
+   unregister_shrinker(>shrinker);
+
/* There might be pages left in the balloon: free them. */
mutex_lock(>balloon_lock);
while (vb->num_pages)
@@ -543,6 +597,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
 static unsigned int features[] = {
VIRTIO_BALLOON_F_MUST_TELL_HOST,
VIRTIO_BALLOON_F_STATS_VQ,
+   VIRTIO_BALLOON_F_AUTO_BALLOON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_ball

[RFC v2 0/2] virtio_balloon: auto-ballooning support

2013-05-09 Thread Luiz Capitulino
Hi,

This series is a respin of automatic ballooning support I started
working on last year. Patch 2/2 contains all relevant technical
details and performance measurements results.

This is in RFC state because it's a work in progress.

Here's some information if you want to try automatic ballooning:

 1. You'll need 3.9+ for the host kernel
 2. Apply this series for the guest kernel
 3. Grab the QEMU bits from:
git://repo.or.cz/qemu/qmp-unstable.git balloon/auto-ballooning/memcg/rfc
 4. Enable the balloon device in qemu with:
-device virtio-balloon-pci,auto-balloon=true
 5. Balloon the guest memory down, say from 1G to 256MB
 6. Generate some pressure in the guest, say a kernel build with -j16

Any feedback is appreciated!

Luiz Capitulino (2):
  virtio_balloon: move balloon_lock mutex to callers
  virtio_balloon: auto-ballooning support

 drivers/virtio/virtio_balloon.c | 63 ++---
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 60 insertions(+), 4 deletions(-)

-- 
1.8.1.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC v2 0/2] virtio_balloon: auto-ballooning support

2013-05-09 Thread Luiz Capitulino
Hi,

This series is a respin of automatic ballooning support I started
working on last year. Patch 2/2 contains all relevant technical
details and performance measurements results.

This is in RFC state because it's a work in progress.

Here's some information if you want to try automatic ballooning:

 1. You'll need 3.9+ for the host kernel
 2. Apply this series for the guest kernel
 3. Grab the QEMU bits from:
git://repo.or.cz/qemu/qmp-unstable.git balloon/auto-ballooning/memcg/rfc
 4. Enable the balloon device in qemu with:
-device virtio-balloon-pci,auto-balloon=true
 5. Balloon the guest memory down, say from 1G to 256MB
 6. Generate some pressure in the guest, say a kernel build with -j16

Any feedback is appreciated!

Luiz Capitulino (2):
  virtio_balloon: move balloon_lock mutex to callers
  virtio_balloon: auto-ballooning support

 drivers/virtio/virtio_balloon.c | 63 ++---
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 60 insertions(+), 4 deletions(-)

-- 
1.8.1.4
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-09 Thread Luiz Capitulino
Automatic ballooning consists of dynamically adjusting the guest's
balloon according to memory pressure in the host and in the guest.

This commit implements the guest side of automatic balloning, which
basically consists of registering a shrinker callback with the kernel,
which will try to deflate the guest's balloon by the amount of pages
being requested. The shrinker callback is only registered if the host
supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.

Automatic inflate is performed by the host.

Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
SWAP IN and SWAP OUT correspond to the number of pages swapped in and
swapped out, respectively.

Auto-ballooning disabled:

RUN  TIME(s)  SWAP IN  SWAP OUT

1634  930980   1588522
2610  627422   1362174
3649  1079847  1616367
4543  953289   1635379
5642  913237   1514000

Auto-ballooning enabled:

RUN  TIME(s)  SWAP IN  SWAP OUT

1629  901  12537
2624  981  18506
3626  573  9085
4631  2250 42534
5627  1610 20808

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 55 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 56 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9d5fe2b..f9dcae8 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -71,6 +71,9 @@ struct virtio_balloon
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+   /* Memory shrinker */
+   struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page)
pfns[i] = page_to_balloon_pfn(page) + i;
 }
 
+/* This function should be called with vb-balloon_mutex held */
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
struct balloon_dev_info *vb_dev_info = vb-vb_dev_info;
@@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned 
int num)
}
 }
 
+/* This function should be called with vb-balloon_mutex held */
 static void leak_balloon(struct virtio_balloon *vb, size_t num)
 {
struct page *page;
@@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon *vb)
  actual, sizeof(actual));
 }
 
+static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb)
+{
+   return vb-num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control 
*sc)
+{
+   unsigned int nr_pages, new_target;
+   struct virtio_balloon *vb;
+
+   vb = container_of(shrinker, struct virtio_balloon, shrinker);
+   if (!mutex_trylock(vb-balloon_lock)) {
+   return -1;
+   }
+
+   nr_pages = balloon_get_nr_pages(vb);
+   if (!sc-nr_to_scan || !nr_pages) {
+   goto out;
+   }
+
+   /*
+* If the current balloon size is greater than the number of
+* pages being reclaimed by the kernel, deflate only the needed
+* amount. Otherwise deflate everything we have.
+*/
+   new_target = 0;
+   if (nr_pages  sc-nr_to_scan) {
+   new_target = nr_pages - sc-nr_to_scan;
+   }
+
+   leak_balloon(vb, new_target);
+   update_balloon_size(vb);
+   nr_pages = balloon_get_nr_pages(vb);
+
+out:
+   mutex_unlock(vb-balloon_lock);
+   return nr_pages;
+}
+
 static int balloon(void *_vballoon)
 {
struct virtio_balloon *vb = _vballoon;
@@ -471,6 +515,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_del_vqs;
}
 
+   memset(vb-shrinker, 0, sizeof(vb-shrinker));
+   if (virtio_has_feature(vb-vdev, VIRTIO_BALLOON_F_AUTO_BALLOON)) {
+   vb-shrinker.shrink = balloon_shrinker;
+   vb-shrinker.seeks = DEFAULT_SEEKS;
+   register_shrinker(vb-shrinker);
+   }
+
return 0;
 
 out_del_vqs:
@@ -487,6 +538,9 @@ out:
 
 static void remove_common(struct virtio_balloon *vb)
 {
+   if (vb-shrinker.shrink)
+   unregister_shrinker(vb-shrinker);
+
/* There might be pages left in the balloon: free them. */
mutex_lock(vb-balloon_lock);
while (vb-num_pages)
@@ -543,6 +597,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
 static unsigned int features[] = {
VIRTIO_BALLOON_F_MUST_TELL_HOST,
VIRTIO_BALLOON_F_STATS_VQ,
+   VIRTIO_BALLOON_F_AUTO_BALLOON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 5e26f61..bd378a4 100644

[RFC 1/2] virtio_balloon: move balloon_lock mutex to callers

2013-05-09 Thread Luiz Capitulino
This commit moves the balloon_lock mutex out of the fill_balloon()
and leak_balloon() functions to their callers.

The reason for this change is that the next commit will introduce
a shrinker callback for the balloon driver, which will also call
leak_balloon() but will require different locking semantics.

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bd3ae32..9d5fe2b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb-pfns));
 
-   mutex_lock(vb-balloon_lock);
for (vb-num_pfns = 0; vb-num_pfns  num;
 vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
struct page *page = balloon_page_enqueue(vb_dev_info);
@@ -154,7 +153,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
/* Did we get any? */
if (vb-num_pfns != 0)
tell_host(vb, vb-inflate_vq);
-   mutex_unlock(vb-balloon_lock);
 }
 
 static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -176,7 +174,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb-pfns));
 
-   mutex_lock(vb-balloon_lock);
for (vb-num_pfns = 0; vb-num_pfns  num;
 vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
page = balloon_page_dequeue(vb_dev_info);
@@ -192,7 +189,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
 * is true, we *have* to do it in this order
 */
tell_host(vb, vb-deflate_vq);
-   mutex_unlock(vb-balloon_lock);
release_pages_by_pfn(vb-pfns, vb-num_pfns);
 }
 
@@ -305,11 +301,13 @@ static int balloon(void *_vballoon)
 || freezing(current));
if (vb-need_stats_update)
stats_handle_request(vb);
+   mutex_lock(vb-balloon_lock);
if (diff  0)
fill_balloon(vb, diff);
else if (diff  0)
leak_balloon(vb, -diff);
update_balloon_size(vb);
+   mutex_unlock(vb-balloon_lock);
}
return 0;
 }
@@ -490,9 +488,11 @@ out:
 static void remove_common(struct virtio_balloon *vb)
 {
/* There might be pages left in the balloon: free them. */
+   mutex_lock(vb-balloon_lock);
while (vb-num_pages)
leak_balloon(vb, vb-num_pages);
update_balloon_size(vb);
+   mutex_unlock(vb-balloon_lock);
 
/* Now we reset the device so we can clean up the queues. */
vb-vdev-config-reset(vb-vdev);
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] virtio_balloon: add auto-ballooning support

2013-01-14 Thread Luiz Capitulino
On Sat, 12 Jan 2013 02:13:17 +0530
Amit Shah  wrote:

> On (Tue) 18 Dec 2012 [18:17:30], Luiz Capitulino wrote:
> > The auto-ballooning feature automatically performs balloon inflate or
> > deflate based on host and guest memory pressure. This can help to
> > avoid swapping or worse in both, host and guest.
> > 
> > Auto-ballooning has a host and a guest part. The host performs
> > automatic inflate by requesting the guest to inflate its balloon
> > when the host is facing memory pressure. The guest performs
> > automatic deflate when it's facing memory pressure itself. It's
> > expected that auto-inflate and auto-deflate will balance each
> > other over time.
> > 
> > This commit implements the guest side of auto-ballooning.
> > 
> > To perform automatic deflate, the virtio_balloon driver registers
> > a shrinker callback, which will try to deflate the guest's balloon
> > on guest memory pressure just like if it were a cache. The shrinker
> > callback is only registered if the host supports the
> > VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.
> 
> I'm wondering if guest should auto-deflate even when the AUTO_BALLOON
> feature isn't supported by the host: if a guest is under pressure,
> there's no way for it to tell the host and wait for the host to
> deflate the balloon, so it may be beneficial to just go ahead and
> deflate the balloon for all hosts.

I see two problems with this. First, this will automagically override
balloon changes done by the user; and second, if we don't have the
auto-inflate part and if the host starts facing memory pressure, VMs
may start getting OOM.

> Similarly, on the host side, management can configure a VM to either
> enable or disable auto-balloon (the auto-inflate part).  So even the
> host can do away with the feature advertisement and negotiation.
> 
> Is there some use-case I'm missing where doing these actions after
> feature negotiation is beneficial?
> 
> > FIXMEs
> > 
> >  o the guest kernel seems to spin when the host is performing a long
> >auto-inflate
> 
> Is this introduced by the current patches?  I'd assume it happens even
> without it -- these patches just introduce some heuristics, the
> mechanism has stayed the same.

Good point, I'll check that.

> > Signed-off-by: Luiz Capitulino 
> > ---
> >  drivers/virtio/virtio_balloon.c | 54 
> > +
> >  include/uapi/linux/virtio_balloon.h |  1 +
> >  2 files changed, 55 insertions(+)
> 
> Patch looks good, just one thing:
> 
> > +   /*
> > +* If the current balloon size is greater than the number of
> > +* pages being reclaimed by the kernel, deflate only the needed
> > +* amount. Otherwise deflate everything we have.
> > +*/
> > +   if (nr_pages > sc->nr_to_scan) {
> > +   new_target = nr_pages - sc->nr_to_scan;
> > +   } else {
> > +   new_target = 0;
> > +   }
> 
> This looks better:
> 
>   new_target = 0;
>   if (nr_pages > sc->nr_to_scan) {
>   new_target = nr_pages - sc->nr_to_scan;
>   }

Ok.

> 
> 
> Thanks,
>   Amit
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] virtio_balloon: add auto-ballooning support

2013-01-14 Thread Luiz Capitulino
On Sat, 12 Jan 2013 02:13:17 +0530
Amit Shah amit.s...@redhat.com wrote:

 On (Tue) 18 Dec 2012 [18:17:30], Luiz Capitulino wrote:
  The auto-ballooning feature automatically performs balloon inflate or
  deflate based on host and guest memory pressure. This can help to
  avoid swapping or worse in both, host and guest.
  
  Auto-ballooning has a host and a guest part. The host performs
  automatic inflate by requesting the guest to inflate its balloon
  when the host is facing memory pressure. The guest performs
  automatic deflate when it's facing memory pressure itself. It's
  expected that auto-inflate and auto-deflate will balance each
  other over time.
  
  This commit implements the guest side of auto-ballooning.
  
  To perform automatic deflate, the virtio_balloon driver registers
  a shrinker callback, which will try to deflate the guest's balloon
  on guest memory pressure just like if it were a cache. The shrinker
  callback is only registered if the host supports the
  VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.
 
 I'm wondering if guest should auto-deflate even when the AUTO_BALLOON
 feature isn't supported by the host: if a guest is under pressure,
 there's no way for it to tell the host and wait for the host to
 deflate the balloon, so it may be beneficial to just go ahead and
 deflate the balloon for all hosts.

I see two problems with this. First, this will automagically override
balloon changes done by the user; and second, if we don't have the
auto-inflate part and if the host starts facing memory pressure, VMs
may start getting OOM.

 Similarly, on the host side, management can configure a VM to either
 enable or disable auto-balloon (the auto-inflate part).  So even the
 host can do away with the feature advertisement and negotiation.
 
 Is there some use-case I'm missing where doing these actions after
 feature negotiation is beneficial?
 
  FIXMEs
  
   o the guest kernel seems to spin when the host is performing a long
 auto-inflate
 
 Is this introduced by the current patches?  I'd assume it happens even
 without it -- these patches just introduce some heuristics, the
 mechanism has stayed the same.

Good point, I'll check that.

  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
  ---
   drivers/virtio/virtio_balloon.c | 54 
  +
   include/uapi/linux/virtio_balloon.h |  1 +
   2 files changed, 55 insertions(+)
 
 Patch looks good, just one thing:
 
  +   /*
  +* If the current balloon size is greater than the number of
  +* pages being reclaimed by the kernel, deflate only the needed
  +* amount. Otherwise deflate everything we have.
  +*/
  +   if (nr_pages  sc-nr_to_scan) {
  +   new_target = nr_pages - sc-nr_to_scan;
  +   } else {
  +   new_target = 0;
  +   }
 
 This looks better:
 
   new_target = 0;
   if (nr_pages  sc-nr_to_scan) {
   new_target = nr_pages - sc-nr_to_scan;
   }

Ok.

 
 
 Thanks,
   Amit
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Mempressure cgroup

2013-01-11 Thread Luiz Capitulino
On Fri, 4 Jan 2013 00:27:52 -0800
Anton Vorontsov  wrote:

> - I've split the pach into two: 'shrinker' and 'levels' parts. While the
>   full-fledged userland shrinker is an interesting idea, we don't have any
>   users ready for it, so I won't advocate for it too much.

For the next version of the automatic balloon prototype I'm planning to give
the user-space shrinker a try. It seems to be a better fit, as the current
prototype has to guess by how much a guest's balloon should be inflated.

Also, I think it would be worth it to list possible use-cases for the two
functionalities in the series' intro email. This might help choosing both,
one or another.

Looking forward to the next version :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Mempressure cgroup

2013-01-11 Thread Luiz Capitulino
On Fri, 4 Jan 2013 00:27:52 -0800
Anton Vorontsov anton.voront...@linaro.org wrote:

 - I've split the pach into two: 'shrinker' and 'levels' parts. While the
   full-fledged userland shrinker is an interesting idea, we don't have any
   users ready for it, so I won't advocate for it too much.

For the next version of the automatic balloon prototype I'm planning to give
the user-space shrinker a try. It seems to be a better fit, as the current
prototype has to guess by how much a guest's balloon should be inflated.

Also, I think it would be worth it to list possible use-cases for the two
functionalities in the series' intro email. This might help choosing both,
one or another.

Looking forward to the next version :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Qemu-devel] [RFC 3/3] virtio-balloon: add auto-ballooning support

2012-12-22 Thread Luiz Capitulino
On Thu, 20 Dec 2012 05:24:12 +
Dietmar Maurer  wrote:

> > > Wow, you're fast! And I'm glad that it works for you, so we have two
> > > full-featured mempressure cgroup users already.
> > 
> > Thanks, although I think we need more testing to be sure this does what we
> > want. I mean, the basic mechanics does work, but my testing has been very
> > light so far.
> 
> Is it possible to assign different weights for different VMs, something like 
> the vmware 'shares' setting?

This series doesn't have the "weight" concept, it has auto-balloon-level and
auto-balloon-granularity. The former allows you to choose which type of
kernel low-mem level you want auto-inflate to trigger. The latter allows you
to say by how much the balloon should grow (as a percentage of the guest's
current memory).

Both of them are per VM.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Qemu-devel] [RFC 3/3] virtio-balloon: add auto-ballooning support

2012-12-22 Thread Luiz Capitulino
On Thu, 20 Dec 2012 05:24:12 +
Dietmar Maurer diet...@proxmox.com wrote:

   Wow, you're fast! And I'm glad that it works for you, so we have two
   full-featured mempressure cgroup users already.
  
  Thanks, although I think we need more testing to be sure this does what we
  want. I mean, the basic mechanics does work, but my testing has been very
  light so far.
 
 Is it possible to assign different weights for different VMs, something like 
 the vmware 'shares' setting?

This series doesn't have the weight concept, it has auto-balloon-level and
auto-balloon-granularity. The former allows you to choose which type of
kernel low-mem level you want auto-inflate to trigger. The latter allows you
to say by how much the balloon should grow (as a percentage of the guest's
current memory).

Both of them are per VM.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/2] virtio_balloon: move locking to the balloon thread

2012-12-19 Thread Luiz Capitulino
On Wed, 19 Dec 2012 09:55:58 -0200
Rafael Aquini  wrote:

> On Tue, Dec 18, 2012 at 06:17:29PM -0200, Luiz Capitulino wrote:
> > Today, the balloon_lock mutex is taken and released by fill_balloon()
> > and leak_balloon() when both functions are entered and when they
> > return.
> > 
> > This commit moves the locking to the caller instead, which is
> > the balloon() thread. The balloon thread is the sole caller of those
> > functions today.
> > 
> > The reason for this move is that the next commit will introduce
> > a shrinker callback for the balloon driver, which will also call
> > leak_balloon() but will require different locking semantics.
> > 
> > Signed-off-by: Luiz Capitulino 
> > ---
> >  drivers/virtio/virtio_balloon.c | 6 ++
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index 2a70558..877e695 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
> > size_t num)
> > /* We can only do one array worth at a time. */
> > num = min(num, ARRAY_SIZE(vb->pfns));
> >  
> > -   mutex_lock(>balloon_lock);
> > for (vb->num_pfns = 0; vb->num_pfns < num;
> >  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > struct page *page = balloon_page_enqueue(vb_dev_info);
> > @@ -155,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
> > size_t num)
> > /* Did we get any? */
> > if (vb->num_pfns != 0)
> > tell_host(vb, vb->inflate_vq);
> > -   mutex_unlock(>balloon_lock);
> >  }
> >
> 
> Since you're removing the locking scheme from within this function, I think it
> would be a good idea introduce a comment stating its caller must held the 
> mutex
> vb->balloon_lock.

Will address all comments for v1 (or rfc v2), thanks Rafael.

> 
>   
> >  static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
> > @@ -177,7 +175,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
> > size_t num)
> > /* We can only do one array worth at a time. */
> > num = min(num, ARRAY_SIZE(vb->pfns));
> >  
> > -   mutex_lock(>balloon_lock);
> > for (vb->num_pfns = 0; vb->num_pfns < num;
> >  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > page = balloon_page_dequeue(vb_dev_info);
> > @@ -193,7 +190,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
> > size_t num)
> >  * is true, we *have* to do it in this order
> >  */
> > tell_host(vb, vb->deflate_vq);
> > -   mutex_unlock(>balloon_lock);
> > release_pages_by_pfn(vb->pfns, vb->num_pfns);
> >  }
> >
> 
> ditto
> 
>   
> > @@ -306,11 +302,13 @@ static int balloon(void *_vballoon)
> >  || freezing(current));
> > if (vb->need_stats_update)
> > stats_handle_request(vb);
> > +   mutex_lock(>balloon_lock);
> > if (diff > 0)
> > fill_balloon(vb, diff);
> > else if (diff < 0)
> > leak_balloon(vb, -diff);
> > update_balloon_size(vb);
> > +   mutex_unlock(>balloon_lock);
> > }
> > return 0;
> >  }
> 
> Just a nitpick:
> As leak_balloon() is also called at remove_common(), you'll need to introduce 
> the
> mutex there, similarly.
> 
> 
> Thanks for move this forward.
> 
> Cheers!
> -- Rafael
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] virtio-balloon: add auto-ballooning support

2012-12-19 Thread Luiz Capitulino
On Tue, 18 Dec 2012 14:53:30 -0800
Anton Vorontsov  wrote:

> Hello Luiz,
> 
> On Tue, Dec 18, 2012 at 06:16:55PM -0200, Luiz Capitulino wrote:
> > The auto-ballooning feature automatically performs balloon inflate
> > or deflate based on host and guest memory pressure. This can help to
> > avoid swapping or worse in both, host and guest.
> > 
> > Auto-ballooning has a host and a guest part. The host performs
> > automatic inflate by requesting the guest to inflate its balloon
> > when the host is facing memory pressure. The guest performs
> > automatic deflate when it's facing memory pressure itself. It's
> > expected that auto-inflate and auto-deflate will balance each
> > other over time.
> > 
> > This commit implements the host side of auto-ballooning.
> > 
> > To be notified of host memory pressure, this commit makes use of this
> > kernel API proposal being discussed upstream:
> > 
> >  http://marc.info/?l=linux-mm=135513372205134=2
> 
> Wow, you're fast! And I'm glad that it works for you, so we have two
> full-featured mempressure cgroup users already.

Thanks, although I think we need more testing to be sure this does what
we want. I mean, the basic mechanics does work, but my testing has been
very light so far.

> Even though it is a qemu patch, I think we should Cc linux-mm folks on it,
> just to let them know the great news.

I'll do it next time.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] virtio-balloon: add auto-ballooning support

2012-12-19 Thread Luiz Capitulino
On Tue, 18 Dec 2012 14:53:30 -0800
Anton Vorontsov anton.voront...@linaro.org wrote:

 Hello Luiz,
 
 On Tue, Dec 18, 2012 at 06:16:55PM -0200, Luiz Capitulino wrote:
  The auto-ballooning feature automatically performs balloon inflate
  or deflate based on host and guest memory pressure. This can help to
  avoid swapping or worse in both, host and guest.
  
  Auto-ballooning has a host and a guest part. The host performs
  automatic inflate by requesting the guest to inflate its balloon
  when the host is facing memory pressure. The guest performs
  automatic deflate when it's facing memory pressure itself. It's
  expected that auto-inflate and auto-deflate will balance each
  other over time.
  
  This commit implements the host side of auto-ballooning.
  
  To be notified of host memory pressure, this commit makes use of this
  kernel API proposal being discussed upstream:
  
   http://marc.info/?l=linux-mmm=135513372205134w=2
 
 Wow, you're fast! And I'm glad that it works for you, so we have two
 full-featured mempressure cgroup users already.

Thanks, although I think we need more testing to be sure this does what
we want. I mean, the basic mechanics does work, but my testing has been
very light so far.

 Even though it is a qemu patch, I think we should Cc linux-mm folks on it,
 just to let them know the great news.

I'll do it next time.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/2] virtio_balloon: move locking to the balloon thread

2012-12-19 Thread Luiz Capitulino
On Wed, 19 Dec 2012 09:55:58 -0200
Rafael Aquini aqu...@redhat.com wrote:

 On Tue, Dec 18, 2012 at 06:17:29PM -0200, Luiz Capitulino wrote:
  Today, the balloon_lock mutex is taken and released by fill_balloon()
  and leak_balloon() when both functions are entered and when they
  return.
  
  This commit moves the locking to the caller instead, which is
  the balloon() thread. The balloon thread is the sole caller of those
  functions today.
  
  The reason for this move is that the next commit will introduce
  a shrinker callback for the balloon driver, which will also call
  leak_balloon() but will require different locking semantics.
  
  Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
  ---
   drivers/virtio/virtio_balloon.c | 6 ++
   1 file changed, 2 insertions(+), 4 deletions(-)
  
  diff --git a/drivers/virtio/virtio_balloon.c 
  b/drivers/virtio/virtio_balloon.c
  index 2a70558..877e695 100644
  --- a/drivers/virtio/virtio_balloon.c
  +++ b/drivers/virtio/virtio_balloon.c
  @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
  size_t num)
  /* We can only do one array worth at a time. */
  num = min(num, ARRAY_SIZE(vb-pfns));
   
  -   mutex_lock(vb-balloon_lock);
  for (vb-num_pfns = 0; vb-num_pfns  num;
   vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
  struct page *page = balloon_page_enqueue(vb_dev_info);
  @@ -155,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
  size_t num)
  /* Did we get any? */
  if (vb-num_pfns != 0)
  tell_host(vb, vb-inflate_vq);
  -   mutex_unlock(vb-balloon_lock);
   }
 
 
 Since you're removing the locking scheme from within this function, I think it
 would be a good idea introduce a comment stating its caller must held the 
 mutex
 vb-balloon_lock.

Will address all comments for v1 (or rfc v2), thanks Rafael.

 
   
   static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
  @@ -177,7 +175,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
  size_t num)
  /* We can only do one array worth at a time. */
  num = min(num, ARRAY_SIZE(vb-pfns));
   
  -   mutex_lock(vb-balloon_lock);
  for (vb-num_pfns = 0; vb-num_pfns  num;
   vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
  page = balloon_page_dequeue(vb_dev_info);
  @@ -193,7 +190,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
  size_t num)
   * is true, we *have* to do it in this order
   */
  tell_host(vb, vb-deflate_vq);
  -   mutex_unlock(vb-balloon_lock);
  release_pages_by_pfn(vb-pfns, vb-num_pfns);
   }
 
 
 ditto
 
   
  @@ -306,11 +302,13 @@ static int balloon(void *_vballoon)
   || freezing(current));
  if (vb-need_stats_update)
  stats_handle_request(vb);
  +   mutex_lock(vb-balloon_lock);
  if (diff  0)
  fill_balloon(vb, diff);
  else if (diff  0)
  leak_balloon(vb, -diff);
  update_balloon_size(vb);
  +   mutex_unlock(vb-balloon_lock);
  }
  return 0;
   }
 
 Just a nitpick:
 As leak_balloon() is also called at remove_common(), you'll need to introduce 
 the
 mutex there, similarly.
 
 
 Thanks for move this forward.
 
 Cheers!
 -- Rafael
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2/2] virtio_balloon: add auto-ballooning support

2012-12-18 Thread Luiz Capitulino
The auto-ballooning feature automatically performs balloon inflate or
deflate based on host and guest memory pressure. This can help to
avoid swapping or worse in both, host and guest.

Auto-ballooning has a host and a guest part. The host performs
automatic inflate by requesting the guest to inflate its balloon
when the host is facing memory pressure. The guest performs
automatic deflate when it's facing memory pressure itself. It's
expected that auto-inflate and auto-deflate will balance each
other over time.

This commit implements the guest side of auto-ballooning.

To perform automatic deflate, the virtio_balloon driver registers
a shrinker callback, which will try to deflate the guest's balloon
on guest memory pressure just like if it were a cache. The shrinker
callback is only registered if the host supports the
VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.

FIXMEs

 o the guest kernel seems to spin when the host is performing a long
   auto-inflate

Signed-off-by: Luiz Capitulino 
---
 drivers/virtio/virtio_balloon.c | 54 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 55 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 877e695..12fb70e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -71,6 +71,9 @@ struct virtio_balloon
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+   /* Memory shrinker */
+   struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -286,6 +289,46 @@ static void update_balloon_size(struct virtio_balloon *vb)
  , sizeof(actual));
 }
 
+static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb)
+{
+   return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control 
*sc)
+{
+   unsigned int nr_pages, new_target;
+   struct virtio_balloon *vb;
+
+   vb = container_of(shrinker, struct virtio_balloon, shrinker);
+   if (!mutex_trylock(>balloon_lock)) {
+   return -1;
+   }
+
+   nr_pages = balloon_get_nr_pages(vb);
+   if (!sc->nr_to_scan || !nr_pages) {
+   goto out;
+   }
+
+   /*
+* If the current balloon size is greater than the number of
+* pages being reclaimed by the kernel, deflate only the needed
+* amount. Otherwise deflate everything we have.
+*/
+   if (nr_pages > sc->nr_to_scan) {
+   new_target = nr_pages - sc->nr_to_scan;
+   } else {
+   new_target = 0;
+   }
+
+   leak_balloon(vb, new_target);
+   update_balloon_size(vb);
+   nr_pages = balloon_get_nr_pages(vb);
+
+out:
+   mutex_unlock(>balloon_lock);
+   return nr_pages;
+}
+
 static int balloon(void *_vballoon)
 {
struct virtio_balloon *vb = _vballoon;
@@ -472,6 +515,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_del_vqs;
}
 
+   memset(>shrinker, 0, sizeof(vb->shrinker));
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_AUTO_BALLOON)) {
+   vb->shrinker.shrink = balloon_shrinker;
+   vb->shrinker.seeks = DEFAULT_SEEKS;
+   register_shrinker(>shrinker);
+   }
+
return 0;
 
 out_del_vqs:
@@ -488,6 +538,9 @@ out:
 
 static void remove_common(struct virtio_balloon *vb)
 {
+   if (vb->shrinker.shrink)
+   unregister_shrinker(>shrinker);
+
/* There might be pages left in the balloon: free them. */
while (vb->num_pages)
leak_balloon(vb, vb->num_pages);
@@ -542,6 +595,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
 static unsigned int features[] = {
VIRTIO_BALLOON_F_MUST_TELL_HOST,
VIRTIO_BALLOON_F_STATS_VQ,
+   VIRTIO_BALLOON_F_AUTO_BALLOON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 652dc8b..3764cac 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -31,6 +31,7 @@
 /* The feature bitmap for virtio balloon */
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
+#define VIRTIO_BALLOON_F_AUTO_BALLOON  2 /* Automatic ballooning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
1.8.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 0/2] auto-ballooning prototype (guest part)

2012-12-18 Thread Luiz Capitulino
Hi,

This series implements an early protoype of a new feature called
automatic ballooning. This is based on ideas by Rik van Riel and
I also got some help from Rafael Aquini (misconceptions and bugs
are all mine, though).

The auto-ballooning feature automatically performs balloon inflate
and deflate based on host and guest memory pressure. This can help
to avoid swapping or worse in both, host and guest.

This series implements the guest part. Full details on the design
and implementation can be found in patch 2/2.

To test this you will also need the host part, which can be found
here (please, never mind the repo name):

 git://repo.or.cz/qemu/qmp-unstable.git balloon/auto-ballooning/rfc

Any feedback is appreciated!

Luiz Capitulino (2):
  virtio_balloon: move locking to the balloon thread
  virtio_balloon: add auto-ballooning support

 drivers/virtio/virtio_balloon.c | 60 ++---
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 57 insertions(+), 4 deletions(-)

-- 
1.8.0
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 1/2] virtio_balloon: move locking to the balloon thread

2012-12-18 Thread Luiz Capitulino
Today, the balloon_lock mutex is taken and released by fill_balloon()
and leak_balloon() when both functions are entered and when they
return.

This commit moves the locking to the caller instead, which is
the balloon() thread. The balloon thread is the sole caller of those
functions today.

The reason for this move is that the next commit will introduce
a shrinker callback for the balloon driver, which will also call
leak_balloon() but will require different locking semantics.

Signed-off-by: Luiz Capitulino 
---
 drivers/virtio/virtio_balloon.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 2a70558..877e695 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
 
-   mutex_lock(>balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
struct page *page = balloon_page_enqueue(vb_dev_info);
@@ -155,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
/* Did we get any? */
if (vb->num_pfns != 0)
tell_host(vb, vb->inflate_vq);
-   mutex_unlock(>balloon_lock);
 }
 
 static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -177,7 +175,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
 
-   mutex_lock(>balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
page = balloon_page_dequeue(vb_dev_info);
@@ -193,7 +190,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
 * is true, we *have* to do it in this order
 */
tell_host(vb, vb->deflate_vq);
-   mutex_unlock(>balloon_lock);
release_pages_by_pfn(vb->pfns, vb->num_pfns);
 }
 
@@ -306,11 +302,13 @@ static int balloon(void *_vballoon)
 || freezing(current));
if (vb->need_stats_update)
stats_handle_request(vb);
+   mutex_lock(>balloon_lock);
if (diff > 0)
fill_balloon(vb, diff);
else if (diff < 0)
leak_balloon(vb, -diff);
update_balloon_size(vb);
+   mutex_unlock(>balloon_lock);
}
return 0;
 }
-- 
1.8.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 1/2] virtio_balloon: move locking to the balloon thread

2012-12-18 Thread Luiz Capitulino
Today, the balloon_lock mutex is taken and released by fill_balloon()
and leak_balloon() when both functions are entered and when they
return.

This commit moves the locking to the caller instead, which is
the balloon() thread. The balloon thread is the sole caller of those
functions today.

The reason for this move is that the next commit will introduce
a shrinker callback for the balloon driver, which will also call
leak_balloon() but will require different locking semantics.

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 2a70558..877e695 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb-pfns));
 
-   mutex_lock(vb-balloon_lock);
for (vb-num_pfns = 0; vb-num_pfns  num;
 vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
struct page *page = balloon_page_enqueue(vb_dev_info);
@@ -155,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
/* Did we get any? */
if (vb-num_pfns != 0)
tell_host(vb, vb-inflate_vq);
-   mutex_unlock(vb-balloon_lock);
 }
 
 static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -177,7 +175,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb-pfns));
 
-   mutex_lock(vb-balloon_lock);
for (vb-num_pfns = 0; vb-num_pfns  num;
 vb-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
page = balloon_page_dequeue(vb_dev_info);
@@ -193,7 +190,6 @@ static void leak_balloon(struct virtio_balloon *vb, size_t 
num)
 * is true, we *have* to do it in this order
 */
tell_host(vb, vb-deflate_vq);
-   mutex_unlock(vb-balloon_lock);
release_pages_by_pfn(vb-pfns, vb-num_pfns);
 }
 
@@ -306,11 +302,13 @@ static int balloon(void *_vballoon)
 || freezing(current));
if (vb-need_stats_update)
stats_handle_request(vb);
+   mutex_lock(vb-balloon_lock);
if (diff  0)
fill_balloon(vb, diff);
else if (diff  0)
leak_balloon(vb, -diff);
update_balloon_size(vb);
+   mutex_unlock(vb-balloon_lock);
}
return 0;
 }
-- 
1.8.0

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 0/2] auto-ballooning prototype (guest part)

2012-12-18 Thread Luiz Capitulino
Hi,

This series implements an early protoype of a new feature called
automatic ballooning. This is based on ideas by Rik van Riel and
I also got some help from Rafael Aquini (misconceptions and bugs
are all mine, though).

The auto-ballooning feature automatically performs balloon inflate
and deflate based on host and guest memory pressure. This can help
to avoid swapping or worse in both, host and guest.

This series implements the guest part. Full details on the design
and implementation can be found in patch 2/2.

To test this you will also need the host part, which can be found
here (please, never mind the repo name):

 git://repo.or.cz/qemu/qmp-unstable.git balloon/auto-ballooning/rfc

Any feedback is appreciated!

Luiz Capitulino (2):
  virtio_balloon: move locking to the balloon thread
  virtio_balloon: add auto-ballooning support

 drivers/virtio/virtio_balloon.c | 60 ++---
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 57 insertions(+), 4 deletions(-)

-- 
1.8.0
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2/2] virtio_balloon: add auto-ballooning support

2012-12-18 Thread Luiz Capitulino
The auto-ballooning feature automatically performs balloon inflate or
deflate based on host and guest memory pressure. This can help to
avoid swapping or worse in both, host and guest.

Auto-ballooning has a host and a guest part. The host performs
automatic inflate by requesting the guest to inflate its balloon
when the host is facing memory pressure. The guest performs
automatic deflate when it's facing memory pressure itself. It's
expected that auto-inflate and auto-deflate will balance each
other over time.

This commit implements the guest side of auto-ballooning.

To perform automatic deflate, the virtio_balloon driver registers
a shrinker callback, which will try to deflate the guest's balloon
on guest memory pressure just like if it were a cache. The shrinker
callback is only registered if the host supports the
VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.

FIXMEs

 o the guest kernel seems to spin when the host is performing a long
   auto-inflate

Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 54 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 55 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 877e695..12fb70e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -71,6 +71,9 @@ struct virtio_balloon
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+   /* Memory shrinker */
+   struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -286,6 +289,46 @@ static void update_balloon_size(struct virtio_balloon *vb)
  actual, sizeof(actual));
 }
 
+static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb)
+{
+   return vb-num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control 
*sc)
+{
+   unsigned int nr_pages, new_target;
+   struct virtio_balloon *vb;
+
+   vb = container_of(shrinker, struct virtio_balloon, shrinker);
+   if (!mutex_trylock(vb-balloon_lock)) {
+   return -1;
+   }
+
+   nr_pages = balloon_get_nr_pages(vb);
+   if (!sc-nr_to_scan || !nr_pages) {
+   goto out;
+   }
+
+   /*
+* If the current balloon size is greater than the number of
+* pages being reclaimed by the kernel, deflate only the needed
+* amount. Otherwise deflate everything we have.
+*/
+   if (nr_pages  sc-nr_to_scan) {
+   new_target = nr_pages - sc-nr_to_scan;
+   } else {
+   new_target = 0;
+   }
+
+   leak_balloon(vb, new_target);
+   update_balloon_size(vb);
+   nr_pages = balloon_get_nr_pages(vb);
+
+out:
+   mutex_unlock(vb-balloon_lock);
+   return nr_pages;
+}
+
 static int balloon(void *_vballoon)
 {
struct virtio_balloon *vb = _vballoon;
@@ -472,6 +515,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_del_vqs;
}
 
+   memset(vb-shrinker, 0, sizeof(vb-shrinker));
+   if (virtio_has_feature(vb-vdev, VIRTIO_BALLOON_F_AUTO_BALLOON)) {
+   vb-shrinker.shrink = balloon_shrinker;
+   vb-shrinker.seeks = DEFAULT_SEEKS;
+   register_shrinker(vb-shrinker);
+   }
+
return 0;
 
 out_del_vqs:
@@ -488,6 +538,9 @@ out:
 
 static void remove_common(struct virtio_balloon *vb)
 {
+   if (vb-shrinker.shrink)
+   unregister_shrinker(vb-shrinker);
+
/* There might be pages left in the balloon: free them. */
while (vb-num_pages)
leak_balloon(vb, vb-num_pages);
@@ -542,6 +595,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
 static unsigned int features[] = {
VIRTIO_BALLOON_F_MUST_TELL_HOST,
VIRTIO_BALLOON_F_STATS_VQ,
+   VIRTIO_BALLOON_F_AUTO_BALLOON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 652dc8b..3764cac 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -31,6 +31,7 @@
 /* The feature bitmap for virtio balloon */
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
+#define VIRTIO_BALLOON_F_AUTO_BALLOON  2 /* Automatic ballooning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
1.8.0

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Add mempressure cgroup

2012-11-30 Thread Luiz Capitulino
On Wed, 28 Nov 2012 17:27:51 -0800
Anton Vorontsov  wrote:

> On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
> [...]
> > Compare this with the shrink_slab() shrinkers.  With these, the VM can
> > query and then control the clients.  If something goes wrong or is out
> > of balance, it's the VM's problem to solve.
> > 
> > So I'm thinking that a better design would be one which puts the kernel
> > VM in control of userspace scanning and freeing.  Presumably with a
> > query-and-control interface similar to the slab shrinkers.
> 
> Thanks for the ideas, Andrew.
> 
> Query-and-control scheme looks very attractive, and that's actually
> resembles my "balance" level idea, when userland tells the kernel how much
> reclaimable memory it has. Except the your scheme works in the reverse
> direction, i.e. the kernel becomes in charge.
> 
> But there is one, rather major issue: we're crossing kernel-userspace
> boundary. And with the scheme we'll have to cross the boundary four times:
> query / reply-available / control / reply-shrunk / (and repeat if
> necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
> synchronously (all the four stages), and/or we have to make a "userspace
> shrinker" thread working in parallel with the normal shrinker, and here,
> I'm afraid, we'll see more strange interactions. :)

Wouldn't this be just like kswapd?

> But there is a good news: for these kind of fine-grained control we have a
> better interface, where we don't have to communicate [very often] w/ the
> kernel. These are "volatile ranges", where userland itself marks chunks of
> data as "I might need it, but I won't cry if you recycle it; but when I
> access it next time, let me know if you actually recycled it". Yes,
> userland no longer able to decide which exact page it permits to recycle,
> but we don't have use-cases when we actually care that much. And if we do,
> we'd rather introduce volatile LRUs with different priorities, or
> something alike.

I'm new to this stuff so please take this with a grain of salt, but I'm
not sure volatile ranges would be a good fit for our use case: we want to
make (kvm) guests reduce their memory when the host is getting memory
pressure.

Having a notification seems just fine for this purpose, but I'm not sure
how this would work with volatile ranges, as we'd have to mark pages volatile
in advance.

Andrew's idea seems to give a lot more freedom to apps, IMHO.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Add mempressure cgroup

2012-11-30 Thread Luiz Capitulino
On Wed, 28 Nov 2012 17:27:51 -0800
Anton Vorontsov anton.voront...@linaro.org wrote:

 On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
 [...]
  Compare this with the shrink_slab() shrinkers.  With these, the VM can
  query and then control the clients.  If something goes wrong or is out
  of balance, it's the VM's problem to solve.
  
  So I'm thinking that a better design would be one which puts the kernel
  VM in control of userspace scanning and freeing.  Presumably with a
  query-and-control interface similar to the slab shrinkers.
 
 Thanks for the ideas, Andrew.
 
 Query-and-control scheme looks very attractive, and that's actually
 resembles my balance level idea, when userland tells the kernel how much
 reclaimable memory it has. Except the your scheme works in the reverse
 direction, i.e. the kernel becomes in charge.
 
 But there is one, rather major issue: we're crossing kernel-userspace
 boundary. And with the scheme we'll have to cross the boundary four times:
 query / reply-available / control / reply-shrunk / (and repeat if
 necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
 synchronously (all the four stages), and/or we have to make a userspace
 shrinker thread working in parallel with the normal shrinker, and here,
 I'm afraid, we'll see more strange interactions. :)

Wouldn't this be just like kswapd?

 But there is a good news: for these kind of fine-grained control we have a
 better interface, where we don't have to communicate [very often] w/ the
 kernel. These are volatile ranges, where userland itself marks chunks of
 data as I might need it, but I won't cry if you recycle it; but when I
 access it next time, let me know if you actually recycled it. Yes,
 userland no longer able to decide which exact page it permits to recycle,
 but we don't have use-cases when we actually care that much. And if we do,
 we'd rather introduce volatile LRUs with different priorities, or
 something alike.

I'm new to this stuff so please take this with a grain of salt, but I'm
not sure volatile ranges would be a good fit for our use case: we want to
make (kvm) guests reduce their memory when the host is getting memory
pressure.

Having a notification seems just fine for this purpose, but I'm not sure
how this would work with volatile ranges, as we'd have to mark pages volatile
in advance.

Andrew's idea seems to give a lot more freedom to apps, IMHO.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

2012-11-09 Thread Luiz Capitulino
Hi Anton,

On Wed, 7 Nov 2012 02:53:49 -0800
Anton Vorontsov  wrote:

> Hi all,
> 
> This is the third RFC. As suggested by Minchan Kim, the API is much
> simplified now (comparing to vmevent_fd):

Which tree is this against? I'd like to try this series, but it doesn't
apply to Linus tree.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

2012-11-09 Thread Luiz Capitulino
Hi Anton,

On Wed, 7 Nov 2012 02:53:49 -0800
Anton Vorontsov anton.voront...@linaro.org wrote:

 Hi all,
 
 This is the third RFC. As suggested by Minchan Kim, the API is much
 simplified now (comparing to vmevent_fd):

Which tree is this against? I'd like to try this series, but it doesn't
apply to Linus tree.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Qemu-devel] [PATCH v10] kvm: notify host when the guest is panicked

2012-10-01 Thread Luiz Capitulino
On Wed, 29 Aug 2012 13:18:54 +0800
Wen Congyang  wrote:

> We can know the guest is panicked when the guest runs on xen.
> But we do not have such feature on kvm.

What's the status of this series?

It got lost in my queue and I ended up not reviewing it, but it seems
to be stuck.

> 
> Another purpose of this feature is: management app(for example:
> libvirt) can do auto dump when the guest is panicked. If management
> app does not do auto dump, the guest's user can do dump by hand if
> he sees the guest is panicked.
> 
> We have three solutions to implement this feature:
> 1. use vmcall
> 2. use I/O port
> 3. use virtio-serial.
> 
> We have decided to avoid touching hypervisor. The reason why I choose
> choose the I/O port is:
> 1. it is easier to implememt
> 2. it does not depend any virtual device
> 3. it can work when starting the kernel
> 
> Signed-off-by: Wen Congyang 
> ---
>  Documentation/virtual/kvm/pv_event.txt |   32 
> 
>  arch/ia64/include/asm/kvm_para.h   |   14 ++
>  arch/powerpc/include/asm/kvm_para.h|   14 ++
>  arch/s390/include/asm/kvm_para.h   |   14 ++
>  arch/x86/include/asm/kvm_para.h|   27 +++
>  arch/x86/kernel/kvm.c  |   25 +
>  include/linux/kvm_para.h   |   23 +++
>  7 files changed, 149 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/virtual/kvm/pv_event.txt
> 
> diff --git a/Documentation/virtual/kvm/pv_event.txt 
> b/Documentation/virtual/kvm/pv_event.txt
> new file mode 100644
> index 000..bb04de0
> --- /dev/null
> +++ b/Documentation/virtual/kvm/pv_event.txt
> @@ -0,0 +1,32 @@
> +The KVM paravirtual event interface
> +=
> +
> +Initializing the paravirtual event interface
> +==
> +kvm_pv_event_init()
> +Argiments:
> + None
> +
> +Return Value:
> + 0: The guest kernel can use paravirtual event interface.
> + 1: The guest kernel can't use paravirtual event interface.
> +
> +Querying whether the event can be ejected
> +==
> +kvm_pv_has_feature()
> +Arguments:
> + feature: The bit value of this paravirtual event to query
> +
> +Return Value:
> + 0 : The guest kernel can't eject this paravirtual event.
> + -1: The guest kernel can eject this paravirtual event.
> +
> +
> +Ejecting paravirtual event
> +==
> +kvm_pv_eject_event()
> +Arguments:
> + event: The event to be ejected.
> +
> +Return Value:
> + None
> diff --git a/arch/ia64/include/asm/kvm_para.h 
> b/arch/ia64/include/asm/kvm_para.h
> index 2019cb9..b5ec658 100644
> --- a/arch/ia64/include/asm/kvm_para.h
> +++ b/arch/ia64/include/asm/kvm_para.h
> @@ -31,6 +31,20 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>   return false;
>  }
>  
> +static inline int kvm_arch_pv_event_init(void)
> +{
> + return 0;
> +}
> +
> +static inline unsigned int kvm_arch_pv_features(void)
> +{
> + return 0;
> +}
> +
> +static inline void kvm_arch_pv_eject_event(unsigned int event)
> +{
> +}
> +
>  #endif
>  
>  #endif
> diff --git a/arch/powerpc/include/asm/kvm_para.h 
> b/arch/powerpc/include/asm/kvm_para.h
> index c18916b..01b98c7 100644
> --- a/arch/powerpc/include/asm/kvm_para.h
> +++ b/arch/powerpc/include/asm/kvm_para.h
> @@ -211,6 +211,20 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>   return false;
>  }
>  
> +static inline int kvm_arch_pv_event_init(void)
> +{
> + return 0;
> +}
> +
> +static inline unsigned int kvm_arch_pv_features(void)
> +{
> + return 0;
> +}
> +
> +static inline void kvm_arch_pv_eject_event(unsigned int event)
> +{
> +}
> +
>  #endif /* __KERNEL__ */
>  
>  #endif /* __POWERPC_KVM_PARA_H__ */
> diff --git a/arch/s390/include/asm/kvm_para.h 
> b/arch/s390/include/asm/kvm_para.h
> index da44867..00ce058 100644
> --- a/arch/s390/include/asm/kvm_para.h
> +++ b/arch/s390/include/asm/kvm_para.h
> @@ -154,6 +154,20 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>   return false;
>  }
>  
> +static inline int kvm_arch_pv_event_init(void)
> +{
> + return 0;
> +}
> +
> +static inline unsigned int kvm_arch_pv_features(void)
> +{
> + return 0;
> +}
> +
> +static inline void kvm_arch_pv_eject_event(unsigned int event)
> +{
> +}
> +
>  #endif
>  
>  #endif /* __S390_KVM_PARA_H */
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 2f7712e..7d297f0 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -96,8 +96,11 @@ struct kvm_vcpu_pv_apf_data {
>  #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
>  #define KVM_PV_EOI_DISABLED 0x0
>  
> +#define KVM_PV_EVENT_PORT(0x505UL)
> +
>  #ifdef __KERNEL__
>  #include 
> +#include 
>  
>  extern void kvmclock_init(void);
>  extern int kvm_register_clock(char *txt);
> @@ -228,6 +231,30 @@ static inline void 

Re: [Qemu-devel] [PATCH v10] kvm: notify host when the guest is panicked

2012-10-01 Thread Luiz Capitulino
On Wed, 29 Aug 2012 13:18:54 +0800
Wen Congyang we...@cn.fujitsu.com wrote:

 We can know the guest is panicked when the guest runs on xen.
 But we do not have such feature on kvm.

What's the status of this series?

It got lost in my queue and I ended up not reviewing it, but it seems
to be stuck.

 
 Another purpose of this feature is: management app(for example:
 libvirt) can do auto dump when the guest is panicked. If management
 app does not do auto dump, the guest's user can do dump by hand if
 he sees the guest is panicked.
 
 We have three solutions to implement this feature:
 1. use vmcall
 2. use I/O port
 3. use virtio-serial.
 
 We have decided to avoid touching hypervisor. The reason why I choose
 choose the I/O port is:
 1. it is easier to implememt
 2. it does not depend any virtual device
 3. it can work when starting the kernel
 
 Signed-off-by: Wen Congyang we...@cn.fujitsu.com
 ---
  Documentation/virtual/kvm/pv_event.txt |   32 
 
  arch/ia64/include/asm/kvm_para.h   |   14 ++
  arch/powerpc/include/asm/kvm_para.h|   14 ++
  arch/s390/include/asm/kvm_para.h   |   14 ++
  arch/x86/include/asm/kvm_para.h|   27 +++
  arch/x86/kernel/kvm.c  |   25 +
  include/linux/kvm_para.h   |   23 +++
  7 files changed, 149 insertions(+), 0 deletions(-)
  create mode 100644 Documentation/virtual/kvm/pv_event.txt
 
 diff --git a/Documentation/virtual/kvm/pv_event.txt 
 b/Documentation/virtual/kvm/pv_event.txt
 new file mode 100644
 index 000..bb04de0
 --- /dev/null
 +++ b/Documentation/virtual/kvm/pv_event.txt
 @@ -0,0 +1,32 @@
 +The KVM paravirtual event interface
 +=
 +
 +Initializing the paravirtual event interface
 +==
 +kvm_pv_event_init()
 +Argiments:
 + None
 +
 +Return Value:
 + 0: The guest kernel can use paravirtual event interface.
 + 1: The guest kernel can't use paravirtual event interface.
 +
 +Querying whether the event can be ejected
 +==
 +kvm_pv_has_feature()
 +Arguments:
 + feature: The bit value of this paravirtual event to query
 +
 +Return Value:
 + 0 : The guest kernel can't eject this paravirtual event.
 + -1: The guest kernel can eject this paravirtual event.
 +
 +
 +Ejecting paravirtual event
 +==
 +kvm_pv_eject_event()
 +Arguments:
 + event: The event to be ejected.
 +
 +Return Value:
 + None
 diff --git a/arch/ia64/include/asm/kvm_para.h 
 b/arch/ia64/include/asm/kvm_para.h
 index 2019cb9..b5ec658 100644
 --- a/arch/ia64/include/asm/kvm_para.h
 +++ b/arch/ia64/include/asm/kvm_para.h
 @@ -31,6 +31,20 @@ static inline bool kvm_check_and_clear_guest_paused(void)
   return false;
  }
  
 +static inline int kvm_arch_pv_event_init(void)
 +{
 + return 0;
 +}
 +
 +static inline unsigned int kvm_arch_pv_features(void)
 +{
 + return 0;
 +}
 +
 +static inline void kvm_arch_pv_eject_event(unsigned int event)
 +{
 +}
 +
  #endif
  
  #endif
 diff --git a/arch/powerpc/include/asm/kvm_para.h 
 b/arch/powerpc/include/asm/kvm_para.h
 index c18916b..01b98c7 100644
 --- a/arch/powerpc/include/asm/kvm_para.h
 +++ b/arch/powerpc/include/asm/kvm_para.h
 @@ -211,6 +211,20 @@ static inline bool kvm_check_and_clear_guest_paused(void)
   return false;
  }
  
 +static inline int kvm_arch_pv_event_init(void)
 +{
 + return 0;
 +}
 +
 +static inline unsigned int kvm_arch_pv_features(void)
 +{
 + return 0;
 +}
 +
 +static inline void kvm_arch_pv_eject_event(unsigned int event)
 +{
 +}
 +
  #endif /* __KERNEL__ */
  
  #endif /* __POWERPC_KVM_PARA_H__ */
 diff --git a/arch/s390/include/asm/kvm_para.h 
 b/arch/s390/include/asm/kvm_para.h
 index da44867..00ce058 100644
 --- a/arch/s390/include/asm/kvm_para.h
 +++ b/arch/s390/include/asm/kvm_para.h
 @@ -154,6 +154,20 @@ static inline bool kvm_check_and_clear_guest_paused(void)
   return false;
  }
  
 +static inline int kvm_arch_pv_event_init(void)
 +{
 + return 0;
 +}
 +
 +static inline unsigned int kvm_arch_pv_features(void)
 +{
 + return 0;
 +}
 +
 +static inline void kvm_arch_pv_eject_event(unsigned int event)
 +{
 +}
 +
  #endif
  
  #endif /* __S390_KVM_PARA_H */
 diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
 index 2f7712e..7d297f0 100644
 --- a/arch/x86/include/asm/kvm_para.h
 +++ b/arch/x86/include/asm/kvm_para.h
 @@ -96,8 +96,11 @@ struct kvm_vcpu_pv_apf_data {
  #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
  #define KVM_PV_EOI_DISABLED 0x0
  
 +#define KVM_PV_EVENT_PORT(0x505UL)
 +
  #ifdef __KERNEL__
  #include asm/processor.h
 +#include linux/ioport.h
  
  extern void kvmclock_init(void);
  extern int kvm_register_clock(char *txt);
 @@ -228,6 +231,30 @@ static inline void kvm_disable_steal_time(void)
  }
  #endif
  
 +static inline int kvm_arch_pv_event_init(void)
 +{
 +  

<    1   2   3   4   5   6