Re: [PATCH v11 4/7] mm: introduce compaction and migration for ballooned pages

2012-11-27 Thread Rafael Aquini
On Tue, Nov 20, 2012 at 03:33:24PM -0800, Andrew Morton wrote:
> On Fri, 9 Nov 2012 12:16:02 +
> Mel Gorman  wrote:
> 
> > On Wed, Nov 07, 2012 at 01:05:51AM -0200, Rafael Aquini wrote:
> > > Memory fragmentation introduced by ballooning might reduce significantly
> > > the number of 2MB contiguous memory blocks that can be used within a 
> > > guest,
> > > thus imposing performance penalties associated with the reduced number of
> > > transparent huge pages that could be used by the guest workload.
> > > 
> > > This patch introduces the helper functions as well as the necessary 
> > > changes
> > > to teach compaction and migration bits how to cope with pages which are
> > > part of a guest memory balloon, in order to make them movable by memory
> > > compaction procedures.
> > > 
> >
> > ...
> >
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -14,6 +14,7 @@
> > >  #include <linux/backing-dev.h>
> > >  #include <linux/sysctl.h>
> > >  #include <linux/sysfs.h>
> > > +#include <linux/balloon_compaction.h>
> > >  #include "internal.h"
> > >  
> > >  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> > > @@ -565,9 +566,24 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> > >   goto next_pageblock;
> > >   }
> > >  
> > > - /* Check may be lockless but that's ok as we recheck later */
> > > - if (!PageLRU(page))
> > > + /*
> > > +  * Check may be lockless but that's ok as we recheck later.
> > > +  * It's possible to migrate LRU pages and balloon pages
> > > +  * Skip any other type of page
> > > +  */
> > > + if (!PageLRU(page)) {
> > > + if (unlikely(balloon_page_movable(page))) {
> > 
> > Because it's lockless, it really seems that the barrier stuck down there
> > is unnecessary. At worst you get a temporarily incorrect answer that you
> > recheck later under page lock in balloon_page_isolate.
> 

Sorry for the late reply.

> What happened with this?
> 
Mel's concern was addressed by the last submitted review (v12).


> Also: what barrier?

Mel was referring to these barriers, in balloon_compaction.h:
---8<---
+/*
+ * balloon_page_insert - insert a page into the balloon's page list and make
+ *  the page->mapping assignment accordingly.
+ * @page: page to be assigned as a 'balloon page'
+ * @mapping : allocated special 'balloon_mapping'
+ * @head: balloon's device page list head
+ */
+static inline void balloon_page_insert(struct page *page,
+  struct address_space *mapping,
+  struct list_head *head)
+{
+   list_add(&page->lru, head);
+   /*
+* Make sure the page is already inserted on balloon's page list
+* before assigning its ->mapping.
+*/
+   smp_wmb();
+   page->mapping = mapping;
+}
+
+/*
+ * balloon_page_delete - clear the page->mapping and delete the page from
+ *  balloon's page list accordingly.
+ * @page: page to be released from balloon's page list
+ */
+static inline void balloon_page_delete(struct page *page)
+{
+   page->mapping = NULL;
+   /*
+* Make sure page->mapping is cleared before we proceed with
+* balloon's page list deletion.
+*/
+   smp_wmb();
+   list_del(&page->lru);
+}
+
+/*
+ * __is_movable_balloon_page - helper to perform @page mapping->flags tests
+ */
+static inline bool __is_movable_balloon_page(struct page *page)
+{
+   /*
+* we might attempt to read ->mapping concurrently to other
+* threads trying to write to it.
+*/
+   struct address_space *mapping = ACCESS_ONCE(page->mapping);
+   smp_read_barrier_depends();
+   return mapping_balloon(mapping);
+}
---8<---

The last review got rid of them to stick with Mel's ACK.
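For reference, the pattern Mel describes, a lockless hint that is only
committed after a recheck under the page lock, can be sketched like this
(a hypothetical helper, not the actual v12 code):
---8<---
/*
 * Hypothetical sketch, not the actual v12 code: the compaction scanner
 * first performs a lockless check that may return a stale answer, and
 * only commits to isolating the page after rechecking under page lock.
 */
static bool try_isolate_balloon_page(struct page *page)
{
        /* lockless hint: a momentarily stale answer is acceptable */
        if (!balloon_page_movable(page))
                return false;

        /* serialize against leak_balloon()/migration via the page lock */
        if (!trylock_page(page))
                return false;

        /* recheck now that nobody can change the page under us */
        if (!__is_movable_balloon_page(page)) {
                unlock_page(page);
                return false;
        }
        /* ... isolation proper would happen here, still under page lock ... */
        unlock_page(page);
        return true;
}
--->8---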


Cheers!
--Rafael


[PATCH] mm: fix balloon_page_movable() page->flags check

2012-11-27 Thread Rafael Aquini
This patch fixes the following crash by enhancing the way page->flags are
tested to identify a ballooned page.

---8<---
BUG: unable to handle kernel NULL pointer dereference at 0194
IP: [] isolate_migratepages_range+0x344/0x7b0
--->8---

The NULL pointer deref was taking place because balloon_page_movable()
page->flags tests were incomplete and we ended up 
inadvertently poking at private pages.

Reported-by: Sasha Levin 
Signed-off-by: Rafael Aquini 
---
 include/linux/balloon_compaction.h | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 68893bc..634a19b 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -107,6 +107,23 @@ static inline void balloon_mapping_free(struct address_space *balloon_mapping)
 }
 
 /*
+ * __balloon_page_flags - helper to perform balloon @page ->flags tests.
+ *
+ * Balloon pages are obtained from Buddy, and we do not play with page->flags
+ * at the driver level (except when we take the page lock for compaction),
+ * therefore we can safely identify a ballooned page by checking whether the
+ * NR_PAGEFLAGS rightmost bits of page->flags are all cleared.
+ * This approach also helps skip ballooned pages that are locked for
+ * compaction or release, thus mitigating the racy check in
+ * balloon_page_movable().
+ */
+#define BALLOON_PAGE_FLAGS_MASK   ((1UL << NR_PAGEFLAGS) - 1)
+static inline bool __balloon_page_flags(struct page *page)
+{
+   return page->flags & BALLOON_PAGE_FLAGS_MASK ? false : true;
+}
+
+/*
  * __is_movable_balloon_page - helper to perform @page mapping->flags tests
  */
 static inline bool __is_movable_balloon_page(struct page *page)
@@ -135,8 +152,8 @@ static inline bool balloon_page_movable(struct page *page)
 * Before dereferencing and testing mapping->flags, lets make sure
 * this is not a page that uses ->mapping in a different way
 */
-   if (!PageSlab(page) && !PageSwapCache(page) && !PageAnon(page) &&
-   !page_mapped(page))
+   if (__balloon_page_flags(page) && !page_mapped(page) &&
+   page_count(page) == 1)
return __is_movable_balloon_page(page);
 
return false;
-- 
1.7.11.7



Re: [PATCH] mm: fix balloon_page_movable() page->flags check

2012-11-27 Thread Rafael Aquini
On Tue, Nov 27, 2012 at 09:31:10PM -0200, Rafael Aquini wrote:
> This patch fixes the following crash by fixing and enhancing the way 
> page->flags are tested to identify a ballooned page.
> 
> ---8<---
> BUG: unable to handle kernel NULL pointer dereference at 0194
> IP: [] isolate_migratepages_range+0x344/0x7b0
> --->8---
> 
> The NULL pointer deref was taking place because balloon_page_movable()
> page->flags tests were incomplete and we ended up 
> inadvertently poking at private pages.
> 
> Reported-by: Sasha Levin 
> Signed-off-by: Rafael Aquini 
> ---

Here it is, Andrew; sorry for the lagged reply.

Cheers!
--Rafael


>  include/linux/balloon_compaction.h | 21 +++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
> index 68893bc..634a19b 100644
> --- a/include/linux/balloon_compaction.h
> +++ b/include/linux/balloon_compaction.h
> @@ -107,6 +107,23 @@ static inline void balloon_mapping_free(struct address_space *balloon_mapping)
>  }
>  
>  /*
> + * __balloon_page_flags - helper to perform balloon @page ->flags tests.
> + *
> + * As balloon pages are got from Buddy, and we do not play with page->flags
> + * at driver level (exception made when we get the page lock for compaction),
> + * therefore we can safely identify a ballooned page by checking if the
> + * NR_PAGEFLAGS rightmost bits from the page->flags are all cleared.
> + * This approach also helps on skipping ballooned pages that are locked for
> + * compaction or release, thus mitigating their racy check at
> + * balloon_page_movable()
> + */
> +#define BALLOON_PAGE_FLAGS_MASK   ((1UL << NR_PAGEFLAGS) - 1)
> +static inline bool __balloon_page_flags(struct page *page)
> +{
> + return page->flags & BALLOON_PAGE_FLAGS_MASK ? false : true;
> +}
> +
> +/*
>   * __is_movable_balloon_page - helper to perform @page mapping->flags tests
>   */
>  static inline bool __is_movable_balloon_page(struct page *page)
> @@ -135,8 +152,8 @@ static inline bool balloon_page_movable(struct page *page)
>* Before dereferencing and testing mapping->flags, lets make sure
>* this is not a page that uses ->mapping in a different way
>*/
> - if (!PageSlab(page) && !PageSwapCache(page) && !PageAnon(page) &&
> - !page_mapped(page))
> + if (__balloon_page_flags(page) && !page_mapped(page) &&
> + page_count(page) == 1)
>   return __is_movable_balloon_page(page);
>  
>   return false;
> -- 
> 1.7.11.7
> 


Re: [PATCH] mm: fix balloon_page_movable() page->flags check

2012-11-27 Thread Rafael Aquini
On Tue, Nov 27, 2012 at 03:52:01PM -0800, Andrew Morton wrote:
> On Tue, 27 Nov 2012 21:31:10 -0200
> Rafael Aquini  wrote:
> 
> > This patch fixes the following crash by fixing and enhancing the way 
> > page->flags are tested to identify a ballooned page.
> > 
> > ---8<---
> > BUG: unable to handle kernel NULL pointer dereference at 0194
> > IP: [] isolate_migratepages_range+0x344/0x7b0
> > --->8---
> > 
> > The NULL pointer deref was taking place because balloon_page_movable()
> > page->flags tests were incomplete and we ended up 
> > inadvertently poking at private pages.
> > 
> > 
> >
> >  /*
> > + * __balloon_page_flags - helper to perform balloon @page ->flags tests.
> > + *
> > + * As balloon pages are got from Buddy, and we do not play with page->flags
> > + * at driver level (exception made when we get the page lock for compaction),
> > + * therefore we can safely identify a ballooned page by checking if the
> > + * NR_PAGEFLAGS rightmost bits from the page->flags are all cleared.
> > + * This approach also helps on skipping ballooned pages that are locked for
> > + * compaction or release, thus mitigating their racy check at
> > + * balloon_page_movable()
> > + */
> > +#define BALLOON_PAGE_FLAGS_MASK   ((1UL << NR_PAGEFLAGS) - 1)
> 
> hm, this seems a bit fragile.
> 
> What's actually going on here?  You're assuming that a page fresh from
> buddy has all flags 0..NR_PAGEFLAGS cleared?
>
 
Yes, that's the idea: when we get pages from the freelists, they are all
checked by prep_new_page()->check_new_page() before buffered_rmqueue() returns
them. On the path we use to get balloon pages, if the page has any of the
0..NR_PAGEFLAGS flags set at alloc time, it's regarded as a bad_page by
check_new_page(), and we go after another victim.
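
Stated as code, the assumption being relied upon could be checked like this
(illustrative only; the helper and its name are not from the patch):
---8<---
/*
 * Illustration only: the assumption is that a page freshly returned by
 * the buddy allocator has all of the 0..NR_PAGEFLAGS bits cleared,
 * because check_new_page() would have rejected it as a bad_page
 * otherwise.
 */
static inline void assert_fresh_balloon_page(struct page *page)
{
        WARN_ON(page->flags & ((1UL << NR_PAGEFLAGS) - 1));
}
--->8---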


> That may be true, I'm unsure.  Please take a look at
> PAGE_FLAGS_CHECK_AT_FREE and PAGE_FLAGS_CHECK_AT_PREP and work out why
> the heck these aren't the same thing!

Hmm, I can't think of why either... When I followed the prep path, I didn't
notice that difference. I'll take a closer look at it, though.


>
> Either way around, doing
> 
>   #define BALLOON_PAGE_FLAGS_MASK PAGE_FLAGS_CHECK_AT_PREP
> 
> seems rather more maintainable.

As usual, your suggestion is far better than my original proposal :)


> 
> > +static inline bool __balloon_page_flags(struct page *page)
> > +{
> > +   return page->flags & BALLOON_PAGE_FLAGS_MASK ? false : true;
> 
> We have a neater way of doing the scalar-to-boolean conversion:
> 
>   return !(page->flags & BALLOON_PAGE_FLAGS_MASK);
> 

ditto! :)
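
Folded together, both suggestions would make the helper read like this
(a sketch of the reworked version, not the resubmitted patch itself):
---8<---
/* sketch: both of Andrew's suggestions applied to the helper */
#define BALLOON_PAGE_FLAGS_MASK PAGE_FLAGS_CHECK_AT_PREP

static inline bool __balloon_page_flags(struct page *page)
{
        /* true when none of the prep-checked page flags are set */
        return !(page->flags & BALLOON_PAGE_FLAGS_MASK);
}
--->8---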


Do you want me to resubmit this patch with the changes you suggested?


Cheers!
Rafael

> > +}
> > +
> > +/*
> >   * __is_movable_balloon_page - helper to perform @page mapping->flags tests
> >   */
> >  static inline bool __is_movable_balloon_page(struct page *page)
> > @@ -135,8 +152,8 @@ static inline bool balloon_page_movable(struct page *page)
> >  * Before dereferencing and testing mapping->flags, lets make sure
> >  * this is not a page that uses ->mapping in a different way
> >  */
> > -   if (!PageSlab(page) && !PageSwapCache(page) && !PageAnon(page) &&
> > -   !page_mapped(page))
> > +   if (__balloon_page_flags(page) && !page_mapped(page) &&
> > +   page_count(page) == 1)
> > return __is_movable_balloon_page(page);
> >  
> > return false;
> >
> > ...
> >


Re: [PATCH] mm: fix balloon_page_movable() page->flags check

2012-11-27 Thread Rafael Aquini
On Tue, Nov 27, 2012 at 05:15:44PM -0800, Andrew Morton wrote:
> On Tue, 27 Nov 2012 22:34:10 -0200 Rafael Aquini  wrote:
> 
> > Do you want me to resubmit this patch with the changes you suggested?
> 
> oh, I think I can reach that far.  How's this look?
>

It looks great to me.

Just a small nitpick: here __balloon_page_flags should be changed to
page_flags_cleared too:
> @@ -109,18 +110,16 @@ static inline void balloon_mapping_free(
>  /*
>   * __balloon_page_flags - helper to perform balloon @page ->flags tests.
>   *

Thanks!
--Rafael


Re: [PATCH v9 3/5] virtio_balloon: introduce migration primitives to balloon pages

2012-08-28 Thread Rafael Aquini
On Tue, Aug 28, 2012 at 06:54:10PM +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 27, 2012 at 04:47:13PM -0300, Rafael Aquini wrote:
> > On Sun, Aug 26, 2012 at 10:42:44AM +0300, Michael S. Tsirkin wrote:
> > > 
> > > Reading two atomics and doing math? Result can even be negative.
> > > I did not look at use closely but it looks suspicious.
> > Doc on atomic_read says:
> > "
> > The read is atomic in that the return value is guaranteed to be one of the
> > values initialized or modified with the interface operations if a proper
> > implicit or explicit memory barrier is used after possible runtime
> > initialization by any other thread and the value is modified only with the
> > interface operations.
> > "
> > 
> > There's no runtime init by other thread than balloon's itself at device 
> > register,
> > and the operations (inc, dec) are made by the proper interface operations
> > only when protected by the spinlock pages_lock. It does not look 
> > suspicious, IMHO.
> 
> Any use of multiple atomics is suspicious.
> Please just avoid it if you can. What's wrong with locking?
> 
> > I'm failing to see how it could become negative in that case, since you
> > cannot isolate more pages than what was previously inflated onto the
> > balloon's list.
> 
> There is no order guarantee. So in
> A - B you can read B long after both A and B have been incremented.
> Maybe it is safe in this case but it needs careful documentation
> to explain how ordering works. Much easier to keep it all simple.
> 
> > 
> > > It's already the case everywhere except __wait_on_isolated_pages,
> > > so just fix that, and then we can keep using int instead of atomics.
> > > 
> > Sorry, I quite didn't get you here. fix what?
> 
> It's in the text you removed above. Access values under lock.
>

So, you prefer this way:

/*
 * __wait_on_isolated_pages - check if leak_balloon() must wait on isolated
 *                            pages before proceeding with the page release.
 * @vb: pointer to the struct virtio_balloon describing this device.
 * @leak_target: how many pages we are attempting to release this round.
 */
static inline void __wait_on_isolated_pages(struct virtio_balloon *vb,
                                            size_t leak_target)
{
        unsigned int num_pages, isolated_pages;

        spin_lock(&vb->pages_lock);
        num_pages = vb->num_pages;
        isolated_pages = vb->num_isolated_pages;
        spin_unlock(&vb->pages_lock);

        /*
         * If isolated pages are making our leak target bigger than the
         * total number of pages we can release this round, wait for
         * migration to return enough pages to the balloon's list.
         */
        wait_event(vb->config_change,
                   (!isolated_pages ||
                    leak_target <= (num_pages - isolated_pages)));
}

?

> >  
> > > That's 1K on stack - and can become more if we increase
> > > VIRTIO_BALLOON_ARRAY_PFNS_MAX.  Probably too much - this is the reason
> > > we use vb->pfns.
> > >
> > If we want to use vb->pfns we'll have to make leak_balloon mutual exclusive 
> > with
> > page migration (as it was before), but that will inevictably bring us back 
> > to
> > the discussion on breaking the loop when isolated pages make leak_balloon 
> > find
> > less pages than it wants to release at each leak round.
> > 
> 
> I don't think this is an issue. The issue was busy waiting in that case.
>
But, in fact, it is.
Since we cannot drop the mutex that prevents migration from happening
(otherwise the migration threads would corrupt our vb->pfns array), there is
no point in waiting for isolated pages to be reinserted on the balloon's list:
the migration threads that would accomplish that task are themselves waiting
on us to drop the mutex.

You may argue that we could flag virtballoon_migratepage() to give up and
return before even trying to acquire the mutex if a leak is ongoing, deferring
the work to virtballoon_putbackpage(). However, I'm inclined to think that in
this case the CPU time we spent isolating pages for compaction would simply be
wasted and, perhaps, no effective compaction would even be reached. That makes
me think it would have been better to stick with the old logic of breaking the
loop, since leak_balloon() originally also remains busy waiting while pursuing
its target, anyway.

That's the trade-off here, IMO. If one really wants to wait on potentially
isolated pages getting back to the list before proceeding, we'll have to burn
a little more stack space on local variables, unfortunately.




Re: [PATCH v9 3/5] virtio_balloon: introduce migration primitives to balloon pages

2012-08-28 Thread Rafael Aquini
On Tue, Aug 28, 2012 at 08:57:16PM +0300, Michael S. Tsirkin wrote:
> Sorry I do not understand what you are saying here. So find
> a different locking strategy.
> 
> For example something like:
> 
>  wait_event(vb->config_change,
>   ({ 
>  lock
>  if (target <= (num_pages - isolated_pages))
>  leak balloon
>  cond = target <= (num_pages - isolated_pages);
>  unlock;
>  cond;
>   })
>   )
> 
> seems to have no issues?

Ok, I see what you mean. I'll change it.
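
Concretely, applying that strategy to the earlier __wait_on_isolated_pages()
draft could look something like this sketch; __leak_balloon() here stands in
for the actual leak work done under the lock and is an assumption, not code
quoted from the series:
---8<---
/*
 * Sketch of the locking strategy suggested above: the wait condition
 * is evaluated, and the leak work performed, with vb->pages_lock held
 * inside the wait_event() condition expression.
 */
static void leak_balloon_wait(struct virtio_balloon *vb, size_t leak_target)
{
        wait_event(vb->config_change, ({
                bool done;

                spin_lock(&vb->pages_lock);
                done = leak_target <= (vb->num_pages - vb->num_isolated_pages);
                if (done)
                        __leak_balloon(vb, leak_target); /* hypothetical */
                spin_unlock(&vb->pages_lock);
                done;
        }));
}
--->8---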



Re: [PATCH -mm] enable CONFIG_COMPACTION by default

2012-09-13 Thread Rafael Aquini
On Thu, Sep 13, 2012 at 04:21:04PM -0400, Rik van Riel wrote:
> Now that lumpy reclaim has been removed, compaction is the
> only way left to free up contiguous memory areas. It is time
> to just enable CONFIG_COMPACTION by default.
> 
> Signed-off-by: Rik van Riel 
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d5c8019..32ea0ef 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -191,6 +191,7 @@ config SPLIT_PTLOCK_CPUS
>  # support for memory compaction
>  config COMPACTION
>   bool "Allow for memory compaction"
> + def_bool y
>   select MIGRATION
>   depends on MMU
>   help
>

Acked-by: Rafael Aquini  



Re: [ 0/4] 3.0.51-stable review

2012-11-02 Thread Rafael Aquini
On Fri, Nov 02, 2012 at 10:06:04AM -0700, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 3.0.51 release.
> There are 4 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Sun Nov  4 17:03:28 UTC 2012.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.0.51-rc1.gz
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
> 
> -
> Pseudo-Shortlog of commits:
> 
> Greg Kroah-Hartman 
> Linux 3.0.51-rc1
> 
> Ben Skeggs 
> drm/nouveau: silence modesetting spam on pre-gf8 chipsets
> 
> Jan Kara 
> mm: fix XFS oops due to dirty pages without buffers on s390
> 

Howdy Greg,

Somehow the following patch is missing from this series submission:
> Len Brown 
> x86: Remove the ancient and deprecated disable_hlt() and enable_hlt() 
> facility

I glanced at the downloadable consolidated patch and its hunks seem to be
present, though.

Cheers!
-- Rafael

> 
> Herton Ronaldo Krzesinski 
> floppy: do put_disk on current dr if blk_init_queue fails
> 
> 
> -
> 
> Diffstat:
> 
>  Documentation/feature-removal-schedule.txt |  8 ---
>  Makefile   |  4 ++--
>  arch/x86/include/asm/system.h  |  7 --
>  arch/x86/kernel/process.c  | 24 ---
>  drivers/block/floppy.c | 37 
> +-
>  drivers/gpu/drm/nouveau/nv04_dac.c |  8 +++
>  drivers/gpu/drm/nouveau/nv04_dfp.c |  6 ++---
>  drivers/gpu/drm/nouveau/nv04_tv.c  |  4 ++--
>  mm/rmap.c  | 21 +
>  9 files changed, 28 insertions(+), 91 deletions(-)
> 
> 


Re: [ 0/4] 3.0.51-stable review

2012-11-03 Thread Rafael Aquini
On Fri, Nov 02, 2012 at 09:07:11PM -0700, Greg Kroah-Hartman wrote:
> > 
> > Howdy Greg,
> > 
> > Somehow the following patch is missing for this series submission:
> > > Len Brown 
> > > x86: Remove the ancient and deprecated disable_hlt() and enable_hlt() 
> > > facility
> > 
> > I glanced at the downloadable consolidated patch and its hunks seem to be
> > present, though.
> 
> I do not understand.  Where do you not see it?  It's in the patch on
> kernel.org, right?  Where is it missing?
> 

The LKML patch series is missing PATCH 02/04.



> confused,
> 
> greg k-h


Re: [PATCH 1/9] Revert "mm: compaction: check lock contention first before taking lock"

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:15AM +0100, Mel Gorman wrote:
> This reverts
> mm-compaction-check-lock-contention-first-before-taking-lock.patch as it
> is replaced by a later patch in the series.
> 
> Signed-off-by: Mel Gorman 

Acked-by: Rafael Aquini 



Re: [PATCH 2/9] Revert "mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix"

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:16AM +0100, Mel Gorman wrote:
> This reverts
> mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix
> as it is replaced by a later patch in the series.
> 
> Signed-off-by: Mel Gorman 

Acked-by: Rafael Aquini 



Re: [PATCH 3/9] Revert "mm: compaction: abort compaction loop if lock is contended or run too long"

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:17AM +0100, Mel Gorman wrote:
> This reverts
> mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long.patch
> as it is replaced by a later patch in the series.
> 
> Signed-off-by: Mel Gorman 

Acked-by: Rafael Aquini 



Re: [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:18AM +0100, Mel Gorman wrote:
> From: Shaohua Li 
> 
> Changelog since V2
> o Fix BUG_ON triggered due to pages left on cc.migratepages
> o Make compact_zone_order() require non-NULL arg `contended'
> 
> Changelog since V1
> o only abort the compaction if lock is contended or run too long
> o Rearranged the code by Andrea Arcangeli.
> 
> isolate_migratepages_range() might isolate no pages if for example when
> zone->lru_lock is contended and running asynchronous compaction. In this
> case, we should abort compaction, otherwise, compact_zone will run a
> useless loop and make zone->lru_lock is even contended.
> 
> [minc...@kernel.org: Putback pages isolated for migration if aborting]
> [a...@linux-foundation.org: compact_zone_order requires non-NULL arg 
> contended]
> Signed-off-by: Andrea Arcangeli 
> Signed-off-by: Shaohua Li 
> Signed-off-by: Mel Gorman 
> Acked-by: Rik van Riel 
> ---

Acked-by: Rafael Aquini 



Re: [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:19AM +0100, Mel Gorman wrote:
> Compaction's migrate scanner acquires the zone->lru_lock when scanning a range
> of pages looking for LRU pages to acquire. It does this even if there are
> no LRU pages in the range. If multiple processes are compacting then this
> can cause severe locking contention. To make matters worse commit b2eef8c0
> (mm: compaction: minimise the time IRQs are disabled while isolating pages
> for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are
> scanned.
> 
> This patch makes two changes to how the migrate scanner acquires the LRU
> lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
> the lock is contended. This reduces the number of times it unnecessarily
> disables and re-enables IRQs. The second is that it defers acquiring the
> LRU lock for as long as possible. If there are no LRU pages or the only
> LRU pages are transhuge then the LRU lock will not be acquired at all
> which reduces contention on zone->lru_lock.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: Rik van Riel 
> ---

Acked-by: Rafael Aquini 



Re: [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:20AM +0100, Mel Gorman wrote:
> Compaction's free scanner acquires the zone->lock when checking for PageBuddy
> pages and isolating them. It does this even if there are no PageBuddy pages
> in the range.
> 
> This patch defers acquiring the zone lock for as long as possible. In the
> event there are no free pages in the pageblock then the lock will not be
> acquired at all which reduces contention on zone->lock.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: Rik van Riel 
> ---

Acked-by: Rafael Aquini 



Re: [PATCH 7/9] Revert "mm: have order > 0 compaction start off where it left"

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:21AM +0100, Mel Gorman wrote:
> This reverts commit 7db8889a (mm: have order > 0 compaction start off
> where it left) and commit de74f1cc (mm: have order > 0 compaction start
> near a pageblock with free pages). These patches were a good idea and
> tests confirmed that they massively reduced the amount of scanning but
> the implementation is complex and tricky to understand. A later patch
> will cache what pageblocks should be skipped and reimplements the
> concept of compact_cached_free_pfn on top for both migration and
> free scanners.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: Rik van Riel 
> ---

Acked-by: Rafael Aquini 



Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:22AM +0100, Mel Gorman wrote:
> When compaction was implemented it was known that scanning could potentially
> be excessive. The ideal was that a counter be maintained for each pageblock
> but maintaining this information would incur a severe penalty due to a
> shared writable cache line. It has reached the point where the scanning
> costs are a serious problem, particularly on long-lived systems where a
> large process starts and allocates a large number of THPs at the same time.
> 
> Instead of using a shared counter, this patch adds another bit to the
> pageblock flags called PG_migrate_skip. If a pageblock is scanned by
> either migrate or free scanner and 0 pages were isolated, the pageblock
> is marked to be skipped in the future. When scanning, this bit is checked
> before any scanning takes place and the block skipped if set.
> 
> The main difficulty with a patch like this is "when to ignore the cached
> information?" If it's ignored too often, the scanning rates will still
> be excessive. If the information is too stale then allocations will fail
> that might have otherwise succeeded. In this patch
> 
> o CMA always ignores the information
> o If the migrate and free scanner meet then the cached information will
>   be discarded if it's at least 5 seconds since the last time the cache
>   was discarded
> o If there are a large number of allocation failures, discard the cache.
> 
> The time-based heuristic is very clumsy but there are few choices for a
> better event. Depending solely on multiple allocation failures still allows
> excessive scanning when THP allocations are failing in quick succession
> due to memory pressure. Waiting until memory pressure is relieved would
> cause compaction to continually fail instead of using reclaim/compaction
> to try allocate the page. The time-based mechanism is clumsy but a better
> option is not obvious.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: Rik van Riel 
> ---

Acked-by: Rafael Aquini 



Re: [PATCH 9/9] mm: compaction: Restart compaction from near where it left off

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:23AM +0100, Mel Gorman wrote:
> This is almost entirely based on Rik's previous patches and discussions
> with him about how this might be implemented.
> 
> Order > 0 compaction stops when enough free pages of the correct page
> order have been coalesced.  When doing subsequent higher order allocations,
> it is possible for compaction to be invoked many times.
> 
> However, the compaction code always starts out looking for things to compact
> at the start of the zone, and for free pages to compact things to at the
> end of the zone.
> 
> This can cause quadratic behaviour, with isolate_freepages starting at
> the end of the zone each time, even though previous invocations of the
> compaction code already filled up all free memory on that end of the zone.
> This can cause isolate_freepages to take enormous amounts of CPU with
> certain workloads on larger memory systems.
> 
> This patch caches where the migration and free scanner should start from on
> subsequent compaction invocations using the pageblock-skip information. When
> compaction starts it begins from the cached restart points and will
> update the cached restart points until a page is isolated or a pageblock
> is skipped that would have been scanned by synchronous compaction.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: Rik van Riel 
> ---

Acked-by: Rafael Aquini 



Re: [PATCH v10 1/5] mm: introduce a common interface for balloon pages mobility

2012-09-25 Thread Rafael Aquini
On Tue, Sep 25, 2012 at 03:05:49AM +0200, Michael S. Tsirkin wrote:
> If these are all under page lock these barriers just confuse things,
> because they are almost never enough by themselves.
> So in that case it would be better to drop them and document
> usage as you are going to.
>

Would the following make more sense (with the proper comments, as well)?

---8<---
+static inline void balloon_page_set(struct page *page,
+   struct address_space *mapping,
+   struct list_head *head)
+{
+   list_add(&page->lru, head);
+   smp_wmb();
+   page->mapping = mapping;
+}
+
+static inline void balloon_page_del(struct page *page)
+{
+   page->mapping = NULL;
+   smp_wmb();
+   list_del(&page->lru);
+}
+
+static inline bool __is_movable_balloon_page(struct page *page)
+{
+   struct address_space *mapping = ACCESS_ONCE(page->mapping);
+   smp_read_barrier_depends();
+   return mapping_balloon(mapping);
+}
+
---8<---

There's still a case where we have to test page->mapping->flags and we cannot
afford to wait for, or grab, the page lock @ isolate_migratepages_range().
The barriers won't avoid leak_balloon() racing against
isolate_migratepages_range(), but they surely will make tests for
page->mapping more consistent. And for those cases where leak_balloon() races
against isolate_migratepages_range()->isolate_balloon_page(), we solve the
conflict of interest through page refcounting and the page lock. I'm preparing
a more extensive doc for Documentation/ to explain the interfaces and how we
cope with these races, as well.
 


Re: [PATCH v10 3/5] virtio_balloon: introduce migration primitives to balloon pages

2012-09-25 Thread Rafael Aquini
On Tue, Sep 25, 2012 at 02:40:24AM +0200, Michael S. Tsirkin wrote:
> > @@ -139,9 +158,15 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
> > break;
> > }
> > set_page_pfns(vb->pfns + vb->num_pfns, page);
> > -   vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > totalram_pages--;
> > +
> > +   BUG_ON(!trylock_page(page));
> 
> So here page lock is nested within balloon_lock.
> 
This page is coming from the Buddy free-lists and, as such, fill_balloon()
does not race against page migration (which takes pages from an already
isolated page list). We need to grab the page lock here to prevent page
isolation from happening while this page is still undergoing the "balloon
insertion" step (the ->mapping assignment). Also, failing to grab the page
lock here (remember this page is coming from Buddy) means a BUG.

I'll make sure to comment that in the code. Thanks for the nit.
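
For illustration, the step being described could be sketched as follows; the
wrapper name and the vb->mapping/vb->pages fields are illustrative
assumptions, not the actual driver code:
---8<---
/*
 * Sketch only (illustrative names).  The page has just come from the
 * buddy allocator, so its page lock must be free: a failed trylock
 * here indicates a bug.  Holding the lock across the insertion keeps
 * isolation away until ->mapping and the list linkage are set up.
 */
static void balloon_insert_new_page(struct virtio_balloon *vb,
                                    struct page *page)
{
        BUG_ON(!trylock_page(page));    /* fresh buddy page: must succeed */
        balloon_page_insert(page, vb->mapping, &vb->pages);
        unlock_page(page);
}
--->8---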

> > +int virtballoon_migratepage(struct address_space *mapping,
> > +   struct page *newpage, struct page *page, enum migrate_mode mode)
> > +{
> > +   struct virtio_balloon *vb = __page_balloon_device(page);
> > +
> > +   BUG_ON(!vb);
> > +
> > +   mutex_lock(&vb->balloon_lock);
> 
> 
> While here balloon_lock is taken and according to documentation
> this is called under page lock.
> 
> So nesting is reversed which is normally a problem.
> Unfortunately lockep does not seem to work for page lock
> otherwise it would detect this.
> If this reversed nesting is not a problem, please add
> comments in code documenting that this is intentional
> and how it works.
>

The scheme works because we do not actually invert the page locking. If we
stumble across a locked page while holding the balloon_lock mutex, we just
give up that page for that round. The comment at __leak_balloon() explains
that already:

+   /*
+* Grab the page lock to avoid racing against threads isolating
+* pages from, or migrating pages back to vb->pages list.
+* (both tasks are done under page lock protection)
+*
+* Failing to grab the page lock here means this page is being
+* isolated already, or its migration has not finished yet.
+*
+* We simply cannot afford to keep waiting on page lock here,
+* otherwise we might cause a lock inversion and remain dead-
+* locked with threads isolating/migrating pages.
+* So, we give up this round if we fail to grab the page lock.
+*/
+   if (!trylock_page(page))
+   break;


The opposite is true as well: if the thread trying to isolate pages stumbles
across a balloon page already locked by leak_balloon(), it gives up the
isolation step, and the balloon_lock mutex is never taken down that code path.
Check out this isolate_balloon_page() snippet:

+   /*
+* As balloon pages are not isolated from LRU lists, concurrent
+* compaction threads can race against page migration functions
+* move_to_new_page() & __unmap_and_move().
+* In order to avoid having an already isolated balloon page
+* being (wrongly) re-isolated while it is under migration,
+* lets be sure we have the page lock before proceeding with
+* the balloon page isolation steps.
+*/
+   if (likely(trylock_page(page))) {


I'll rework the comment to indicate the leak_balloon() case as well.



Re: [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-08 Thread Rafael Aquini
On Wed, Oct 02, 2013 at 03:38:32PM -0700, Tim Chen wrote:
> We will need the MCS lock code for doing optimistic spinning for rwsem.
> Extracting the MCS code from mutex.c and put into its own file allow us
> to reuse this code easily for rwsem.
> 
> Reviewed-by: Ingo Molnar 
> Reviewed-by: Peter Zijlstra 
> Signed-off-by: Tim Chen 
> Signed-off-by: Davidlohr Bueso 
> ---
>  include/linux/mcs_spinlock.h |   64 ++
>  include/linux/mutex.h|5 ++-
>  kernel/mutex.c   |   60 --
>  3 files changed, 74 insertions(+), 55 deletions(-)
>  create mode 100644 include/linux/mcs_spinlock.h
> 
> diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
> new file mode 100644
> index 000..b5de3b0
> --- /dev/null
> +++ b/include/linux/mcs_spinlock.h
> @@ -0,0 +1,64 @@
> +/*
> + * MCS lock defines
> + *
> + * This file contains the main data structure and API definitions of MCS 
> lock.
> + *
> + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
> + * with the desirable properties of being fair, and with each cpu trying
> + * to acquire the lock spinning on a local variable.
> + * It avoids expensive cache bouncings that common test-and-set spin-lock
> + * implementations incur.
> + */

nitpick:

I believe you need 

+#include 

here, to avoid breaking the build when arch_mutex_cpu_relax() is not defined
(arch/s390 is one case)

> +#ifndef __LINUX_MCS_SPINLOCK_H
> +#define __LINUX_MCS_SPINLOCK_H
> +
> +struct mcs_spinlock {
> + struct mcs_spinlock *next;
> + int locked; /* 1 if lock acquired */
> +};
> +
> +/*
> + * We don't inline mcs_spin_lock() so that perf can correctly account for the
> + * time spent in this lock function.
> + */
> +static noinline
> +void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> +{
> + struct mcs_spinlock *prev;
> +
> + /* Init node */
> + node->locked = 0;
> + node->next   = NULL;
> +
> + prev = xchg(lock, node);
> + if (likely(prev == NULL)) {
> + /* Lock acquired */
> + node->locked = 1;
> + return;
> + }
> + ACCESS_ONCE(prev->next) = node;
> + smp_wmb();
> + /* Wait until the lock holder passes the lock down */
> + while (!ACCESS_ONCE(node->locked))
> + arch_mutex_cpu_relax();
> +}
> +
> +static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock 
> *node)
> +{
> + struct mcs_spinlock *next = ACCESS_ONCE(node->next);
> +
> + if (likely(!next)) {
> + /*
> +  * Release the lock by setting it to NULL
> +  */
> + if (cmpxchg(lock, node, NULL) == node)
> + return;
> + /* Wait until the next pointer is set */
> + while (!(next = ACCESS_ONCE(node->next)))
> + arch_mutex_cpu_relax();
> + }
> + ACCESS_ONCE(next->locked) = 1;
> + smp_wmb();
> +}
> +
> +#endif /* __LINUX_MCS_SPINLOCK_H */
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index ccd4260..e6eaeea 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -46,6 +46,7 @@
>   * - detects multi-task circular deadlocks and prints out all affected
>   *   locks and tasks (and only those tasks)
>   */
> +struct mcs_spinlock;
>  struct mutex {
>   /* 1: unlocked, 0: locked, negative: locked, possible waiters */
>   atomic_tcount;
> @@ -55,7 +56,7 @@ struct mutex {
>   struct task_struct  *owner;
>  #endif
>  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
> - void*spin_mlock;/* Spinner MCS lock */
> + struct mcs_spinlock *mcs_lock;  /* Spinner MCS lock */
>  #endif
>  #ifdef CONFIG_DEBUG_MUTEXES
>   const char  *name;
> @@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, 
> struct mutex *lock);
>  #define arch_mutex_cpu_relax()   cpu_relax()
>  #endif
>  
> -#endif
> +#endif /* __LINUX_MUTEX_H */
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index 6d647ae..4640731 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -25,6 +25,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/interrupt.h>
>  #include <linux/debug_locks.h>
> +#include <linux/mcs_spinlock.h>
>  
>  /*
>   * In the DEBUG case we are using the "NULL fastpath" for mutexes,
> @@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
>   INIT_LIST_HEAD(&lock->wait_list);
>   mutex_clear_owner(lock);
>  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
> - lock->spin_mlock = NULL;
> + lock->mcs_lock = NULL;
>  #endif
>  
>   debug_mutex_init(lock, name, key);
> @@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock);
>   * more or less simultaneously, the spinners need to acquire a MCS lock
>   * first before spinning on the owner field.
>   *
> - * We don't inline mspin_lock() so that perf can correctly account for the
> - * time spent in this lock function.
>   */
> -struct mspin_node {

Re: [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-08 Thread Rafael Aquini
On Tue, Oct 08, 2013 at 01:34:55PM -0700, Tim Chen wrote:
> On Tue, 2013-10-08 at 16:51 -0300, Rafael Aquini wrote:
> > On Wed, Oct 02, 2013 at 03:38:32PM -0700, Tim Chen wrote:
> > > We will need the MCS lock code for doing optimistic spinning for rwsem.
> > > Extracting the MCS code from mutex.c and put into its own file allow us
> > > to reuse this code easily for rwsem.
> > > 
> > > Reviewed-by: Ingo Molnar 
> > > Reviewed-by: Peter Zijlstra 
> > > Signed-off-by: Tim Chen 
> > > Signed-off-by: Davidlohr Bueso 
> > > ---
> > >  include/linux/mcs_spinlock.h |   64 ++
> > >  include/linux/mutex.h|5 ++-
> > >  kernel/mutex.c   |   60 --
> > >  3 files changed, 74 insertions(+), 55 deletions(-)
> > >  create mode 100644 include/linux/mcs_spinlock.h
> > > 
> > > diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
> > > new file mode 100644
> > > index 000..b5de3b0
> > > --- /dev/null
> > > +++ b/include/linux/mcs_spinlock.h
> > > @@ -0,0 +1,64 @@
> > > +/*
> > > + * MCS lock defines
> > > + *
> > > + * This file contains the main data structure and API definitions of MCS 
> > > lock.
> > > + *
> > > + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple 
> > > spin-lock
> > > + * with the desirable properties of being fair, and with each cpu trying
> > > + * to acquire the lock spinning on a local variable.
> > > + * It avoids expensive cache bouncings that common test-and-set spin-lock
> > > + * implementations incur.
> > > + */
> > 
> > nitpick:
> > 
> > I believe you need 
> > 
> > +#include 
> > 
> > here, to avoid breaking the build when arch_mutex_cpu_relax() is not defined
> > (arch/s390 is one case)
> 

Hmm... sorry for the noise, as I was looking at an old tree, before this commit:
commit 083986e8248d978b6c961d3da6beb0c921c68220
Author: Heiko Carstens 
Date:   Sat Sep 28 11:23:59 2013 +0200

mutex: replace CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX with simple ifdef


> Probably 
> 
> +#include  
> 
> should be added instead?
> It defines arch_mutex_cpu_relax when there's no
> architecture-specific version.
> 
> Thanks.
> Tim

Yeah, but I guess right now you're OK without it, as the only place this
header is included is kernel/mutex.c, and linux/mutex.h gets in before us.

If the plan is to extend usage to other places where mutex.h doesn't go, then
perhaps the better thing would be to just copy the same #ifdef here (see the
sketch below).

Cheers! (and sorry again for the noise)
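
A sketch of what copying that fallback into mcs_spinlock.h would look like,
mirroring the #ifdef that commit 083986e introduced in mutex.h:
---8<---
/*
 * Sketch: duplicate mutex.h's fallback locally so mcs_spinlock.h
 * stands alone when linux/mutex.h has not been included first.
 */
#ifndef arch_mutex_cpu_relax
# define arch_mutex_cpu_relax() cpu_relax()
#endif
--->8---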


Re: [PATCH] mm: hugetlb: initialize PG_reserved for tail pages of gigantig compound pages

2013-10-10 Thread Rafael Aquini
On Thu, Oct 10, 2013 at 06:12:41PM +0200, Andrea Arcangeli wrote:
> 11feeb498086a3a5907b8148bdf1786a9b18fc55 introduced a memory leak when
> KVM is run on gigantic compound pages.
> 
> 11feeb498086a3a5907b8148bdf1786a9b18fc55 depends on the assumption
> that PG_reserved is identical for all head and tail pages of a
> compound page. So that if get_user_pages returns a tail page, we don't
> need to check the head page in order to know if we deal with a
> reserved page that requires different refcounting.
> 
> The assumption that PG_reserved is the same for head and tail pages is
> certainly correct for THP and regular hugepages, but gigantic
> hugepages allocated through bootmem don't clear the PG_reserved on the
> tail pages (the clearing of PG_reserved is done later only if the
> gigantic hugepage is freed).
> 
> This patch corrects the gigantic compound page initialization so that
> we can retain the optimization in
> 11feeb498086a3a5907b8148bdf1786a9b18fc55. The cacheline was already
> modified in order to set PG_tail so this won't affect the boot time of
> large memory systems.
> 
> Reported-by: andy123 
> Signed-off-by: Andrea Arcangeli 
> ---

Acked-by: Rafael Aquini 


>  mm/hugetlb.c | 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b49579c..315450e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -695,8 +695,24 @@ static void prep_compound_gigantic_page(struct page 
> *page, unsigned long order)
>   /* we rely on prep_new_huge_page to set the destructor */
>   set_compound_order(page, order);
>   __SetPageHead(page);
> + __ClearPageReserved(page);
>   for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
>   __SetPageTail(p);
> + /*
> +  * For gigantic hugepages allocated through bootmem at
> +  * boot, it's safer to be consistent with the
> +  * not-gigantic hugepages and to clear the PG_reserved
> > +  * bit from all tail pages too. Otherwise drivers using
> +  * get_user_pages() to access tail pages, may get the
> +  * reference counting wrong if they see the
> > +  * PG_reserved bitflag set on a tail page (despite the
> > +  * head page not having PG_reserved set). Enforcing
> > +  * this consistency between head and tail pages
> > +  * allows drivers to optimize away a check on the head
> > +  * page when they need to know if put_page is needed after
> +  * get_user_pages() or not.
> +  */
> + __ClearPageReserved(p);
>   set_page_count(p, 0);
>   p->first_page = page;
>   }
> @@ -1329,9 +1345,9 @@ static void __init gather_bootmem_prealloc(void)
>  #else
>   page = virt_to_page(m);
>  #endif
> - __ClearPageReserved(page);
>   WARN_ON(page_count(page) != 1);
>   prep_compound_huge_page(page, h->order);
> + WARN_ON(PageReserved(page));
>   prep_new_huge_page(h, page, page_to_nid(page));
>   /*
>* If we had gigantic hugepages allocated at boot time, we need
> 


[PATCH] mm: avoid reinserting isolated balloon pages into LRU lists

2013-09-25 Thread Rafael Aquini
Isolated balloon pages can wrongly end up in LRU lists when migrate_pages()
finishes its round without fully draining the isolated page list. The same issue
can happen when reclaim_clean_pages_from_list() tries to reclaim pages from an
isolated page list, before migration, in the CMA path. Such balloon page leak
opens a race window against LRU lists shrinkers that leads us to the following
kernel panic:

  BUG: unable to handle kernel NULL pointer dereference at 0028
  IP: [] shrink_page_list+0x24e/0x897
  PGD 3cda2067 PUD 3d713067 PMD 0
  Oops:  [#1] SMP
  CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: 88003d920ee0 ti: 88003da48000 task.ti: 88003da48000
  RIP: 0010:[]  [] 
shrink_page_list+0x24e/0x897
  RSP: :88003da499b8  EFLAGS: 00010286
  RAX:  RBX: 88003e82bd60 RCX: 000657d5
  RDX:  RSI: 031f RDI: 88003e82bd40
  RBP: 88003da49ab0 R08: 0001 R09: 81121a45
  R10: 81121a45 R11: 88003c4a9a28 R12: 88003e82bd40
  R13: 88003da0e800 R14: 0001 R15: 88003da49d58
  FS:  () GS:88003fc0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 067d9000 CR3: 3ace5000 CR4: 000407b0
  Stack:
   88003d920ee0 88003d920ee0 88003da49a50 88003da49a40
   88003da49b18 88003da49b08 00a499e0 81667fc0
      
  Call Trace:
   [] ? lock_release+0x188/0x1d7
   [] shrink_inactive_list+0x240/0x3de
   [] shrink_lruvec+0x3e0/0x566
   [] ? __lock_is_held+0x38/0x50
   [] __shrink_zone+0x94/0x178
   [] shrink_zone+0x3a/0x82
   [] balance_pgdat+0x32a/0x4c2
   [] kswapd+0x2f0/0x372
   [] ? abort_exclusive_wait+0x8b/0x8b
   [] ? balance_pgdat+0x4c2/0x4c2
   [] kthread+0xa2/0xaa
   [] ? __kthread_parkme+0x65/0x65
   [] ret_from_fork+0x7c/0xb0
   [] ? __kthread_parkme+0x65/0x65
  Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 
89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 
30 be 02 00 00 00 ff d2 eb
  RIP  [] shrink_page_list+0x24e/0x897
   RSP 
  CR2: 0028
  ---[ end trace 703d2451af6ffbfd ]---
  Kernel panic - not syncing: Fatal exception

This patch fixes the issue, by assuring the proper tests are made at
putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
isolated balloon pages being wrongly reinserted in LRU lists.

Reported-and-tested-by: Luiz Capitulino 
Signed-off-by: Rafael Aquini 
---
 include/linux/balloon_compaction.h | 25 +
 mm/migrate.c   |  2 +-
 mm/vmscan.c|  4 +++-
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index f7f1d71..b5f3ec0 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -159,6 +159,26 @@ static inline bool balloon_page_movable(struct page *page)
 }
 
 /*
+ * isolated_balloon_page - identify an isolated balloon page on private
+ *compaction/migration page lists.
+ *
+ * After a compaction thread isolates a balloon page for migration, it raises
+ * the page refcount to prevent concurrent compaction threads from re-isolating the
+ * same page. For that reason putback_movable_pages(), or other routines
+ * that need to identify isolated balloon pages on private pagelists, cannot
+ * rely on balloon_page_movable() to accomplish the task.
+ */
+static inline bool isolated_balloon_page(struct page *page)
+{
+   /* Already isolated balloon pages, by default, have a raised refcount */
+   if (page_flags_cleared(page) && !page_mapped(page) &&
+   page_count(page) >= 2)
+   return __is_movable_balloon_page(page);
+
+   return false;
+}
+
+/*
  * balloon_page_insert - insert a page into the balloon's page list and make
  *  the page->mapping assignment accordingly.
  * @page: page to be assigned as a 'balloon page'
@@ -243,6 +263,11 @@ static inline bool balloon_page_movable(struct page *page)
return false;
 }
 
+static inline bool isolated_balloon_page(struct page *page)
+{
+   return false;
+}
+
 static inline bool balloon_page_isolate(struct page *page)
 {
return false;
diff --git a/mm/migrate.c b/mm/migrate.c
index 9c8d5f5..a26bccd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -107,7 +107,7 @@ void putback_movable_pages(struct list_head *l)
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
-   if (unlikely(balloon_page_movable(page)))
+   if (unlikely(isolated_balloon_page(page)))

Re: [PATCH -next] virtio: balloon: fix missing unlock on error in fill_balloon()

2012-11-12 Thread Rafael Aquini
On Mon, Nov 12, 2012 at 09:50:40PM +0800, Wei Yongjun wrote:
> From: Wei Yongjun 
> 
> Add the missing unlock before return from function fill_balloon()
> in the error handling case.
> 
> Introduced by 9864a8(virtio_balloon: introduce migration primitives
> to balloon pages)
> 
> dpatch engine is used to auto generate this patch.
> (https://github.com/weiyj/dpatch)
> 
> Signed-off-by: Wei Yongjun 
> ---
>  drivers/virtio/virtio_balloon.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f70151b..72e8dcb 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -152,8 +152,10 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
>   }
>  
>   /* Didn't get any?  Oh well. */
> - if (vb->num_pfns == 0)
> + if (vb->num_pfns == 0) {
> + mutex_unlock(&vb->balloon_lock);
>   return;
> + }
>  
>   tell_host(vb, vb->inflate_vq);
>   mutex_unlock(&vb->balloon_lock);
>

I missed that one when rewriting v10 to v11. :(

Good catch, Wei! thanks!

Andrew, we need to pick this one up for the virtio_balloon v12 patch, as well.

-- Rafael


Re: [PATCH -next] virtio: balloon: fix missing unlock on error in fill_balloon()

2012-11-13 Thread Rafael Aquini
On Mon, Nov 12, 2012 at 03:34:53PM -0800, Andrew Morton wrote:
> On Mon, 12 Nov 2012 21:50:40 +0800
> Wei Yongjun  wrote:
> 
> > From: Wei Yongjun 
> > 
> > Add the missing unlock before return from function fill_balloon()
> > in the error handling case.
> > 
> > Introduced by 9864a8(virtio_balloon: introduce migration primitives
> > to balloon pages)
> > 
> > dpatch engine is used to auto generate this patch.
> > (https://github.com/weiyj/dpatch)
> > 
> > Signed-off-by: Wei Yongjun 
> > ---
> >  drivers/virtio/virtio_balloon.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index f70151b..72e8dcb 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -152,8 +152,10 @@ static void fill_balloon(struct virtio_balloon *vb, 
> > size_t num)
> > }
> >  
> > /* Didn't get any?  Oh well. */
> > -   if (vb->num_pfns == 0)
> > +   if (vb->num_pfns == 0) {
> > +   mutex_unlock(&vb->balloon_lock);
> > return;
> > +   }
> >  
> > tell_host(vb, vb->inflate_vq);
> > mutex_unlock(&vb->balloon_lock);
> 
> Well.  Why did this bug occur in the first place?  Because
> fill_balloon() has multiple return points and one of those was
> overlooked when adding new locking.
> 
> This (and the addition of memory leaks) is quite common.  It is part of
> the reason why so much kernel code uses goto's to avoid multiple return
> points.  A maintainability thing.
> 
> And we can fix this shortcoming in fill_balloon() without even needing a
> goto:
> 
> --- 
> a/drivers/virtio/virtio_balloon.c~virtio_balloon-introduce-migration-primitives-to-balloon-pages-fix-fix
> +++ a/drivers/virtio/virtio_balloon.c
> @@ -151,13 +151,9 @@ static void fill_balloon(struct virtio_b
>   totalram_pages--;
>   }
>  
> - /* Didn't get any?  Oh well. */
> - if (vb->num_pfns == 0) {
> - mutex_unlock(&vb->balloon_lock);
> - return;
> - }
> -
> - tell_host(vb, vb->inflate_vq);
> + /* Did we get any? */
> + if (vb->num_pfns != 0)
> + tell_host(vb, vb->inflate_vq);
>   mutex_unlock(&vb->balloon_lock);
>  }
>

Your fix is far better, as it simplifies the code without losing that test --
which only evaluates true if we fail to allocate a page right after entering
the loop. /me thinks the driver folks placed it there so tell_host() and the
virtqueues aren't bothered when no page frame made it into the pfns array.




[PATCH v11 1/7] mm: adjust address_space_operations.migratepage() return code

2012-11-06 Thread Rafael Aquini
This patch introduces MIGRATEPAGE_SUCCESS as the default return code
for address_space_operations.migratepage() method and documents the
expected return code for the same method in failure cases.

Signed-off-by: Rafael Aquini 
---
 fs/hugetlbfs/inode.c|  4 ++--
 include/linux/migrate.h |  7 +++
 mm/migrate.c| 22 +++---
 3 files changed, 20 insertions(+), 13 deletions(-)
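To illustrate the convention being documented: a migratepage() method is
expected to look roughly like this (hypothetical sketch modeled on the
migrate_page() hunk below, not a hunk from this patch):

	static int example_migratepage(struct address_space *mapping,
				       struct page *newpage, struct page *page,
				       enum migrate_mode mode)
	{
		int rc;

		rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
		if (rc != MIGRATEPAGE_SUCCESS)
			return rc;	/* negative errno, e.g. -EAGAIN */

		migrate_page_copy(newpage, page);
		return MIGRATEPAGE_SUCCESS;
	}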

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index c5bc355..bdeda2c 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -608,11 +608,11 @@ static int hugetlbfs_migrate_page(struct address_space 
*mapping,
int rc;
 
rc = migrate_huge_page_move_mapping(mapping, newpage, page);
-   if (rc)
+   if (rc != MIGRATEPAGE_SUCCESS)
return rc;
migrate_page_copy(newpage, page);
 
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..a4e886d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,6 +7,13 @@
 
 typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
+/*
+ * Return values from address_space_operations.migratepage():
+ * - negative errno on page migration failure;
+ * - zero on page migration success;
+ */
+#define MIGRATEPAGE_SUCCESS0
+
 #ifdef CONFIG_MIGRATION
 
 extern void putback_lru_pages(struct list_head *l);
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..98c7a89 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -286,7 +286,7 @@ static int migrate_page_move_mapping(struct address_space 
*mapping,
/* Anonymous page without mapping */
if (page_count(page) != 1)
return -EAGAIN;
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
}
 
spin_lock_irq(&mapping->tree_lock);
@@ -356,7 +356,7 @@ static int migrate_page_move_mapping(struct address_space 
*mapping,
}
spin_unlock_irq(&mapping->tree_lock);
 
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 
 /*
@@ -372,7 +372,7 @@ int migrate_huge_page_move_mapping(struct address_space 
*mapping,
if (!mapping) {
if (page_count(page) != 1)
return -EAGAIN;
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
}
 
spin_lock_irq(&mapping->tree_lock);
@@ -399,7 +399,7 @@ int migrate_huge_page_move_mapping(struct address_space 
*mapping,
page_unfreeze_refs(page, expected_count - 1);
 
spin_unlock_irq(&mapping->tree_lock);
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 
 /*
@@ -486,11 +486,11 @@ int migrate_page(struct address_space *mapping,
 
rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
 
-   if (rc)
+   if (rc != MIGRATEPAGE_SUCCESS)
return rc;
 
migrate_page_copy(newpage, page);
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(migrate_page);
 
@@ -513,7 +513,7 @@ int buffer_migrate_page(struct address_space *mapping,
 
rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);
 
-   if (rc)
+   if (rc != MIGRATEPAGE_SUCCESS)
return rc;
 
/*
@@ -549,7 +549,7 @@ int buffer_migrate_page(struct address_space *mapping,
 
} while (bh != head);
 
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(buffer_migrate_page);
 #endif
@@ -814,7 +814,7 @@ skip_unmap:
put_anon_vma(anon_vma);
 
 uncharge:
-   mem_cgroup_end_migration(mem, page, newpage, rc == 0);
+   mem_cgroup_end_migration(mem, page, newpage, rc == MIGRATEPAGE_SUCCESS);
 unlock:
unlock_page(page);
 out:
@@ -987,7 +987,7 @@ int migrate_pages(struct list_head *from,
case -EAGAIN:
retry++;
break;
-   case 0:
+   case MIGRATEPAGE_SUCCESS:
break;
default:
/* Permanent failure */
@@ -1024,7 +1024,7 @@ int migrate_huge_page(struct page *hpage, new_page_t 
get_new_page,
/* try again */
cond_resched();
break;
-   case 0:
+   case MIGRATEPAGE_SUCCESS:
goto out;
default:
rc = -EIO;
-- 
1.7.11.7



[PATCH v11 4/7] mm: introduce compaction and migration for ballooned pages

2012-11-06 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces the helper functions as well as the necessary changes
to teach compaction and migration bits how to cope with pages which are
part of a guest memory balloon, in order to make them movable by memory
compaction procedures.

Signed-off-by: Rafael Aquini 
---
 mm/compaction.c | 21 +++--
 mm/migrate.c| 36 ++--
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9eef558..76abd84 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -565,9 +566,24 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
goto next_pageblock;
}
 
-   /* Check may be lockless but that's ok as we recheck later */
-   if (!PageLRU(page))
+   /*
+* Check may be lockless but that's ok as we recheck later.
+* It's possible to migrate LRU pages and balloon pages;
+* skip any other type of page.
+*/
+   if (!PageLRU(page)) {
+   if (unlikely(balloon_page_movable(page))) {
+   if (locked && balloon_page_isolate(page)) {
+   /* Successfully isolated */
+   cc->finished_update_migrate = true;
+   list_add(&page->lru, migratelist);
+   cc->nr_migratepages++;
+   nr_isolated++;
+   goto check_compact_cluster;
+   }
+   }
continue;
+   }
 
/*
 * PageLRU is set. lru_lock normally excludes isolation
@@ -621,6 +637,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
cc->nr_migratepages++;
nr_isolated++;
 
+check_compact_cluster:
/* Avoid isolating too much */
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
++low_pfn;
diff --git a/mm/migrate.c b/mm/migrate.c
index 98c7a89..87ffe54 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -79,7 +80,10 @@ void putback_lru_pages(struct list_head *l)
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
-   putback_lru_page(page);
+   if (unlikely(balloon_page_movable(page)))
+   balloon_page_putback(page);
+   else
+   putback_lru_page(page);
}
 }
 
@@ -778,6 +782,18 @@ static int __unmap_and_move(struct page *page, struct page 
*newpage,
}
}
 
+   if (unlikely(balloon_page_movable(page))) {
+   /*
+* A ballooned page does not need any special attention from
+* physical to virtual reverse mapping procedures.
+* Skip any attempt to unmap PTEs or to remap swap cache,
+* in order to avoid burning cycles at rmap level, and perform
+* the page migration right away (protected by page lock).
+*/
+   rc = balloon_page_migrate(newpage, page, mode);
+   goto uncharge;
+   }
+
/*
 * Corner case handling:
 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -814,7 +830,9 @@ skip_unmap:
put_anon_vma(anon_vma);
 
 uncharge:
-   mem_cgroup_end_migration(mem, page, newpage, rc == MIGRATEPAGE_SUCCESS);
+   mem_cgroup_end_migration(mem, page, newpage,
+(rc == MIGRATEPAGE_SUCCESS ||
+ rc == MIGRATEPAGE_BALLOON_SUCCESS));
 unlock:
unlock_page(page);
 out:
@@ -846,6 +864,20 @@ static int unmap_and_move(new_page_t get_new_page, 
unsigned long private,
goto out;
 
rc = __unmap_and_move(page, newpage, force, offlining, mode);
+
+   if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) {
+   /*
+* A ballooned page has been migrated already.
+* Now, it's the time to remove the old page from the isolated
+* pageset list and 

[PATCH v11 5/7] virtio_balloon: introduce migration primitives to balloon pages

2012-11-06 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

Besides making balloon pages movable at allocation time and introducing
the necessary primitives to perform balloon page migration/compaction,
this patch also introduces the following locking scheme, in order to
enhance the synchronization methods for accessing elements of struct
virtio_balloon, thus providing protection against concurrent access
introduced by parallel memory migration threads.

 - balloon_lock (mutex) : synchronizes the access demand to elements of
  struct virtio_balloon and its queue operations;

Signed-off-by: Rafael Aquini 
---
 drivers/virtio/virtio_balloon.c | 135 
 1 file changed, 123 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0908e60..69eede7 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -34,6 +35,7 @@
  * page units.
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
+#define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
 
 struct virtio_balloon
 {
@@ -52,15 +54,19 @@ struct virtio_balloon
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
/*
-* The pages we've told the Host we're not using.
+* The pages we've told the Host we're not using are enqueued
+* at vb_dev_info->pages list.
 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
 * to num_pages above.
 */
-   struct list_head pages;
+   struct balloon_dev_info *vb_dev_info;
+
+   /* Synchronize access/update to this struct virtio_balloon elements */
+   struct mutex balloon_lock;
 
/* The array of pfns we tell the Host about. */
unsigned int num_pfns;
-   u32 pfns[256];
+   u32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
 
/* Memory statistics */
int need_stats_update;
@@ -122,18 +128,25 @@ static void set_page_pfns(u32 pfns[], struct page *page)
 
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
+   struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
+
+   static DEFINE_RATELIMIT_STATE(fill_balloon_rs,
+ DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
 
+   mutex_lock(&vb->balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-   __GFP_NOMEMALLOC | __GFP_NOWARN);
+   struct page *page = balloon_page_enqueue(vb_dev_info);
+
if (!page) {
-   if (printk_ratelimit())
+   if (__ratelimit(&fill_balloon_rs))
dev_printk(KERN_INFO, &vb->vdev->dev,
   "Out of puff! Can't get %zu pages\n",
-  num);
+  VIRTIO_BALLOON_PAGES_PER_PAGE);
/* Sleep for at least 1/5 of a second before retry. */
msleep(200);
break;
@@ -141,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
set_page_pfns(vb->pfns + vb->num_pfns, page);
vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
totalram_pages--;
-   list_add(&page->lru, &vb->pages);
}
 
/* Didn't get any?  Oh well. */
@@ -149,6 +161,7 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
return;
 
tell_host(vb, vb->inflate_vq);
+   mutex_unlock(&vb->balloon_lock);
 }
 
 static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -165,14 +178,17 @@ static void release_pages_by_pfn(const u32 pfns[], 
unsigned int num)
 static void leak_balloon(struct virtio_balloon *vb, size_t num)
 {
struct page *page;
+   struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
 
+   mutex_lock(&vb->balloon_lock);
for (vb->num_pfn
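(The archive truncates the patch at this point; judging from the fill_balloon()
counterpart above and the balloon_page_dequeue() primitive from patch 3/7, the
locked section presumably continues along these lines -- a sketch, not the
literal patch text:)

	for (vb->num_pfns = 0; vb->num_pfns < num;
	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
		page = balloon_page_dequeue(vb_dev_info);
		if (!page)
			break;
		set_page_pfns(vb->pfns + vb->num_pfns, page);
		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
	}

	/*
	 * tell_host() must run before the pages are released whenever
	 * the host negotiated VIRTIO_BALLOON_F_MUST_TELL_HOST.
	 */
	tell_host(vb, vb->deflate_vq);
	mutex_unlock(&vb->balloon_lock);
	release_pages_by_pfn(vb->pfns, vb->num_pfns);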

[PATCH v11 7/7] mm: add vm event counters for balloon pages compaction

2012-11-06 Thread Rafael Aquini
This patch introduces a new set of vm event counters to keep track of
ballooned pages compaction activity.
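With CONFIG_BALLOON_COMPACTION enabled, the new counters show up in
/proc/vmstat under the compaction block, named as the mm/vmstat.c hunk
below introduces them:

	compact_balloon_isolated
	compact_balloon_migrated
	compact_balloon_released
	compact_balloon_returned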

Signed-off-by: Rafael Aquini 
---
 drivers/virtio/virtio_balloon.c |  1 +
 include/linux/vm_event_item.h   |  8 +++-
 mm/balloon_compaction.c |  2 ++
 mm/migrate.c|  1 +
 mm/vmstat.c | 10 +-
 5 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 69eede7..3756fc1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -411,6 +411,7 @@ int virtballoon_migratepage(struct address_space *mapping,
tell_host(vb, vb->deflate_vq);
 
mutex_unlock(&vb->balloon_lock);
+   balloon_event_count(COMPACTBALLOONMIGRATED);
 
return MIGRATEPAGE_BALLOON_SUCCESS;
 }
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..cbd72fc 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -41,7 +41,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_COMPACTION
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
-#endif
+#ifdef CONFIG_BALLOON_COMPACTION
+   COMPACTBALLOONISOLATED, /* isolated from balloon pagelist */
+   COMPACTBALLOONMIGRATED, /* balloon page successfully migrated */
+   COMPACTBALLOONRELEASED, /* old-page released after migration */
+   COMPACTBALLOONRETURNED, /* putback to pagelist, not-migrated */
+#endif /* CONFIG_BALLOON_COMPACTION */
+#endif /* CONFIG_COMPACTION */
 #ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 90935aa..32927eb 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -215,6 +215,7 @@ bool balloon_page_isolate(struct page *page)
if (__is_movable_balloon_page(page) &&
page_count(page) == 2) {
__isolate_balloon_page(page);
+   balloon_event_count(COMPACTBALLOONISOLATED);
unlock_page(page);
return true;
}
@@ -237,6 +238,7 @@ void balloon_page_putback(struct page *page)
if (__is_movable_balloon_page(page)) {
__putback_balloon_page(page);
put_page(page);
+   balloon_event_count(COMPACTBALLOONRETURNED);
} else {
__WARN();
dump_page(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index adb3d44..ee3037d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -896,6 +896,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned 
long private,
page_is_file_cache(page));
put_page(page);
__free_page(page);
+   balloon_event_count(COMPACTBALLOONRELEASED);
return 0;
}
 out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..1363edc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -781,7 +781,15 @@ const char * const vmstat_text[] = {
"compact_stall",
"compact_fail",
"compact_success",
-#endif
+
+#ifdef CONFIG_BALLOON_COMPACTION
+   "compact_balloon_isolated",
+   "compact_balloon_migrated",
+   "compact_balloon_released",
+   "compact_balloon_returned",
+#endif /* CONFIG_BALLOON_COMPACTION */
+
+#endif /* CONFIG_COMPACTION */
 
 #ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
-- 
1.7.11.7



[PATCH v11 3/7] mm: introduce a common interface for balloon pages mobility

2012-11-06 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces a common interface to help a balloon driver make
its page set movable by compaction, thus allowing the system to better
leverage the compaction efforts on memory defragmentation.

Signed-off-by: Rafael Aquini 
---
 include/linux/balloon_compaction.h | 220 ++
 include/linux/migrate.h|  10 ++
 include/linux/pagemap.h|  16 +++
 mm/Kconfig |  15 +++
 mm/Makefile|   1 +
 mm/balloon_compaction.c| 269 +
 6 files changed, 531 insertions(+)
 create mode 100644 include/linux/balloon_compaction.h
 create mode 100644 mm/balloon_compaction.c

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
new file mode 100644
index 000..1865bd5
--- /dev/null
+++ b/include/linux/balloon_compaction.h
@@ -0,0 +1,220 @@
+/*
+ * include/linux/balloon_compaction.h
+ *
+ * Common interface definitions for making balloon pages movable to compaction.
+ *
+ * Copyright (C) 2012, Red Hat, Inc.  Rafael Aquini 
+ */
+#ifndef _LINUX_BALLOON_COMPACTION_H
+#define _LINUX_BALLOON_COMPACTION_H
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Balloon device information descriptor.
+ * This struct is used to allow the common balloon compaction interface
+ * procedures to find the proper balloon device holding memory pages they'll
+ * have to cope with during page compaction / migration, and it also serves the
+ * balloon driver as a page book-keeper for its registered balloon devices.
+ */
+struct balloon_dev_info {
+   void *balloon_device;   /* balloon device descriptor */
+   struct address_space *mapping;  /* balloon special page->mapping */
+   unsigned long isolated_pages;   /* # of isolated pages for migration */
+   spinlock_t pages_lock;  /* Protection to pages list */
+   struct list_head pages; /* Pages enqueued & handled to Host */
+};
+
+extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
+extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info);
+extern struct balloon_dev_info *balloon_devinfo_alloc(
+   void *balloon_dev_descriptor);
+
+static inline void balloon_devinfo_free(struct balloon_dev_info *b_dev_info)
+{
+   kfree(b_dev_info);
+}
+
+#ifdef CONFIG_BALLOON_COMPACTION
+extern bool balloon_page_isolate(struct page *page);
+extern void balloon_page_putback(struct page *page);
+extern int balloon_page_migrate(struct page *newpage,
+   struct page *page, enum migrate_mode mode);
+extern struct address_space
+*balloon_mapping_alloc(struct balloon_dev_info *b_dev_info,
+   const struct address_space_operations *a_ops);
+
+static inline void balloon_mapping_free(struct address_space *balloon_mapping)
+{
+   kfree(balloon_mapping);
+}
+
+/*
+ * balloon_page_insert - insert a page into the balloon's page list and make
+ *  the page->mapping assignment accordingly.
+ * @page: page to be assigned as a 'balloon page'
+ * @mapping : allocated special 'balloon_mapping'
+ * @head: balloon's device page list head
+ */
+static inline void balloon_page_insert(struct page *page,
+  struct address_space *mapping,
+  struct list_head *head)
+{
+   list_add(&page->lru, head);
+   /*
+* Make sure the page is already inserted on balloon's page list
+* before assigning its ->mapping.
+*/
+   smp_wmb();
+   page->mapping = mapping;
+}
+
+/*
+ * balloon_page_delete - clear the page->mapping and delete the page from
+ *  balloon's page list accordingly.
+ * @page: page to be released from balloon's page list
+ */
+static inline void balloon_page_delete(struct page *page)
+{
+   page->mapping = NULL;
+   /*
+* Make sure page->mapping is cleared before we proceed with
+* balloon's page list deletion.
+*/
+   smp_wmb();
+   list_del(&page->lru);
+}
+
+/*
+ * __is_movable_balloon_page - helper to perform @page mapping->flags tests
+ */
+static inline bool __is_movable_balloon_page(struct page *page)
+{
+   /*
+* we might attempt to read ->mapping concurrently to other
+* threads trying to write to it.
+*/
+   struct address_space *mapping = ACCESS_ONCE(page->mapping);
+   smp_read_barrier_depends();
+   return mapping_balloon(mapping);
+}

[PATCH v11 6/7] mm: introduce putback_movable_pages()

2012-11-06 Thread Rafael Aquini
The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
hacks around putback_lru_pages() in order to allow ballooned pages to be
re-inserted on balloon page list as if a ballooned page was like a LRU page.

As ballooned pages are not legitimate LRU pages, this patch introduces
putback_movable_pages() to properly cope with cases where the isolated
pageset contains ballooned pages and LRU pages, thus fixing the mentioned
inelegant hack around putback_lru_pages().
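The rule of thumb for callers becomes: whenever the isolated list was built by
isolate_migratepages_range(), drain it with the new helper. A sketch mirroring
the mm/compaction.c hunks below (arguments per this series' migrate_pages()
signature):

	err = migrate_pages(&cc->migratepages, compaction_alloc,
			    (unsigned long)cc, false, MIGRATE_SYNC);
	if (err) {
		/* LRU pages go back to the LRU; balloon pages to the balloon list */
		putback_movable_pages(&cc->migratepages);
		cc->nr_migratepages = 0;
	}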

Signed-off-by: Rafael Aquini 
---
 include/linux/migrate.h |  2 ++
 mm/compaction.c |  6 +++---
 mm/migrate.c| 20 
 mm/page_alloc.c |  2 +-
 4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e570c3c..ff074a4 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ typedef struct page *new_page_t(struct page *, unsigned long 
private, int **);
 #ifdef CONFIG_MIGRATION
 
 extern void putback_lru_pages(struct list_head *l);
+extern void putback_movable_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
@@ -50,6 +51,7 @@ extern int migrate_huge_page_move_mapping(struct 
address_space *mapping,
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
+static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 76abd84..f268bd8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -995,7 +995,7 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
switch (isolate_migratepages(zone, cc)) {
case ISOLATE_ABORT:
ret = COMPACT_PARTIAL;
-   putback_lru_pages(&cc->migratepages);
+   putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
goto out;
case ISOLATE_NONE:
@@ -1018,9 +1018,9 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
nr_remaining);
 
-   /* Release LRU pages not migrated */
+   /* Release isolated pages not migrated */
if (err) {
-   putback_lru_pages(&cc->migratepages);
+   putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
if (err == -ENOMEM) {
ret = COMPACT_PARTIAL;
diff --git a/mm/migrate.c b/mm/migrate.c
index 87ffe54..adb3d44 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -80,6 +80,26 @@ void putback_lru_pages(struct list_head *l)
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
+   putback_lru_page(page);
+   }
+}
+
+/*
+ * Put previously isolated pages back onto the appropriate lists
+ * from where they were once taken off for compaction/migration.
+ *
+ * This function shall be used instead of putback_lru_pages() whenever the
+ * isolated pageset has been built by isolate_migratepages_range().
+ */
+void putback_movable_pages(struct list_head *l)
+{
+   struct page *page;
+   struct page *page2;
+
+   list_for_each_entry_safe(page, page2, l, lru) {
+   list_del(&page->lru);
+   dec_zone_page_state(page, NR_ISOLATED_ANON +
+   page_is_file_cache(page));
if (unlikely(balloon_page_movable(page)))
balloon_page_putback(page);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b74de6..1cb0f93 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5710,7 +5710,7 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
0, false, MIGRATE_SYNC);
}
 
-   putback_lru_pages(&cc->migratepages);
+   putback_movable_pages(&cc->migratepages);
return ret > 0 ? 0 : ret;
 }
 
-- 
1.7.11.7



[PATCH v11 2/7] mm: redefine address_space.assoc_mapping

2012-11-06 Thread Rafael Aquini
This patch overhauls struct address_space.assoc_mapping, renaming it to
address_space.private_data and redefining its type as void *.
This approach consistently names the .private_* elements of
struct address_space and also allows extended use of address_space
association with other data structures through ->private_data.

Also, all users of old ->assoc_mapping element are converted to reflect
its new name and type change.
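As a hypothetical illustration of the extended use this enables (struct
my_descriptor is invented for the example; the balloon interface later in
this series becomes the first real user of ->private_data):

	struct my_descriptor *d = kzalloc(sizeof(*d), GFP_KERNEL);

	/* any pointer now fits; assoc_mapping could only hold an address_space */
	mapping->private_data = d;

	/* ...and it comes back without a cast, since the field is a void * */
	d = mapping->private_data;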

Signed-off-by: Rafael Aquini 
---
 fs/buffer.c| 12 ++--
 fs/gfs2/glock.c|  2 +-
 fs/inode.c |  2 +-
 fs/nilfs2/page.c   |  2 +-
 include/linux/fs.h |  2 +-
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b5f0442..e0bad95 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -555,7 +555,7 @@ void emergency_thaw_all(void)
  */
 int sync_mapping_buffers(struct address_space *mapping)
 {
-   struct address_space *buffer_mapping = mapping->assoc_mapping;
+   struct address_space *buffer_mapping = mapping->private_data;
 
if (buffer_mapping == NULL || list_empty(&mapping->private_list))
return 0;
@@ -588,10 +588,10 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, 
struct inode *inode)
struct address_space *buffer_mapping = bh->b_page->mapping;
 
mark_buffer_dirty(bh);
-   if (!mapping->assoc_mapping) {
-   mapping->assoc_mapping = buffer_mapping;
+   if (!mapping->private_data) {
+   mapping->private_data = buffer_mapping;
} else {
-   BUG_ON(mapping->assoc_mapping != buffer_mapping);
+   BUG_ON(mapping->private_data != buffer_mapping);
}
if (!bh->b_assoc_map) {
spin_lock(&buffer_mapping->private_lock);
@@ -788,7 +788,7 @@ void invalidate_inode_buffers(struct inode *inode)
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
struct list_head *list = &mapping->private_list;
-   struct address_space *buffer_mapping = mapping->assoc_mapping;
+   struct address_space *buffer_mapping = mapping->private_data;
 
spin_lock(&buffer_mapping->private_lock);
while (!list_empty(list))
@@ -811,7 +811,7 @@ int remove_inode_buffers(struct inode *inode)
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
struct list_head *list = &mapping->private_list;
-   struct address_space *buffer_mapping = mapping->assoc_mapping;
+   struct address_space *buffer_mapping = mapping->private_data;
 
spin_lock(&buffer_mapping->private_lock);
while (!list_empty(list)) {
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e6c2fd5..0f22d09 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -768,7 +768,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
mapping->host = s->s_bdev->bd_inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
-   mapping->assoc_mapping = NULL;
+   mapping->private_data = NULL;
mapping->backing_dev_info = s->s_bdi;
mapping->writeback_index = 0;
}
diff --git a/fs/inode.c b/fs/inode.c
index b03c719..4cac8e1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -165,7 +165,7 @@ int inode_init_always(struct super_block *sb, struct inode 
*inode)
mapping->host = inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
-   mapping->assoc_mapping = NULL;
+   mapping->private_data = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
mapping->writeback_index = 0;
 
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 3e7b2a0..07f76db 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -431,7 +431,7 @@ void nilfs_mapping_init(struct address_space *mapping, 
struct inode *inode,
mapping->host = inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
-   mapping->assoc_mapping = NULL;
+   mapping->private_data = NULL;
mapping->backing_dev_info = bdi;
mapping->a_ops = &empty_aops;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..0982565 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -418,7 +418,7 @@ struct address_space {
struct backing_dev_info *backing_dev_info; /* device readahead, etc */
spinlock_t  private_lock;   /* for use by the address_space 
*/
struct list_headprivate_list;   /* ditto */
-   struct address_space*assoc_mapping; /* ditto */
+   void*private_data;  /* ditto */
} __attribute__((aligned(sizeof(long))));

[PATCH v11 0/7] make balloon pages movable by compaction

2012-11-06 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch-set follows the main idea discussed at 2012 LSFMMS session:
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
to introduce the required changes to the virtio_balloon driver, as well as
the changes to the core compaction & migration bits, in order to make those
subsystems aware of ballooned pages and allow memory balloon pages to become
movable within a guest, thus avoiding the aforementioned fragmentation issue.

Following are numbers that demonstrate how this patch set benefits
memory-ballooned guests by allowing compaction to be more effective.

Results for the STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
running on a 4GB RAM KVM guest that was ballooning 512MB of RAM in 64MB chunks
every minute (inflating/deflating) while the test was running:

===BEGIN stress-highalloc

STRESS-HIGHALLOC
 highalloc-3.7 highalloc-3.7
 rc4-clean rc4-patch
Pass 1  55.00 ( 0.00%)62.00 ( 7.00%)
Pass 2  54.00 ( 0.00%)62.00 ( 8.00%)
while Rested75.00 ( 0.00%)80.00 ( 5.00%)

MMTests Statistics: duration
 3.7 3.7
   rc4-clean   rc4-patch
User 1207.59 1207.46
System   1300.55 1299.61
Elapsed  2273.72 2157.06

MMTests Statistics: vmstat
3.7 3.7
  rc4-clean   rc4-patch
Page Ins3581516 2374368
Page Outs  1114869210410332
Swap Ins 80  47
Swap Outs  3641 476
Direct pages scanned  37978   33826
Kswapd pages scanned1828245 1342869
Kswapd pages reclaimed  1710236 1304099
Direct pages reclaimed32207   31005
Kswapd efficiency   93% 97%
Kswapd velocity 804.077 622.546
Direct efficiency   84% 91%
Direct velocity  16.703  15.682
Percentage direct scans  2%  2%
Page writes by reclaim792529704
Page writes file  756119228
Page writes anon   3641 476
Page reclaim immediate16764   11014
Page rescued immediate0   0
Slabs scanned   2171904 2152448
Direct inode steals 3852261
Kswapd inode steals  659137  609670
Kswapd skipped wait   1  69
THP fault alloc 546 631
THP collapse alloc  361 339
THP splits  259 263
THP fault fallback   98  50
THP collapse fail20  17
Compaction stalls   747 499
Compaction success  244 145
Compaction failures 503 354
Compaction pages moved   370888  474837
Compaction move failure   77378   65259

===END stress-highalloc

Rafael Aquini (7):
  mm: adjust address_space_operations.migratepage() return code
  mm: redefine address_space.assoc_mapping
  mm: introduce a common interface for balloon pages mobility
  mm: introduce compaction and migration for ballooned pages
  virtio_balloon: introduce migration primitives to balloon pages
  mm: introduce putback_movable_pages()
  mm: add vm event counters for balloon pages compaction

 drivers/virtio/virtio_balloon.c| 136 +--
 fs/buffer.c|  12 +-
 fs/gfs2/glock.c|   2 +-
 fs/hugetlbfs/inode.c   |   4 +-
 fs/inode.c |   2 +-
 fs/nilfs2/page.c   |   2 +-
 include/linux/balloon_compaction.h | 220 ++
 include/linux/fs.h |   2 +-
 include/linux/migrate.h|  19 +++
 include/linux/pagemap.h|  16 +++
 include/linux/vm_event_item.h  |   8 +-
 mm/Kconfig |  15 ++
 mm/Makefile|   1 +
 mm/balloon_compaction.c| 271 +
 mm/compaction.c|  27 +++-
 mm/migrate.c   |  77 +--
 mm/page_alloc.c|   2 +-
 mm/vmstat.c|  10 +-
 18 files changed, 782 insertions(+), 44 deletions(-)
 create mode 100644 include/linux/balloon_compaction.h
 create mode 100644 mm/balloon_compaction.c

Change log:
v11:
 * Address AKPM's last review suggestions;
 * Extend the balloon compaction common API and simplify its usage at driver;
 * Minor nitpicks on code commentary;
v10:
 * Adjust leak_balloon() wait_event logic to make a clear locking scheme

Re: [PATCH v11 3/7] mm: introduce a common interface for balloon pages mobility

2012-11-07 Thread Rafael Aquini
On Wed, Nov 07, 2012 at 01:02:07PM -0800, Andrew Morton wrote:
> On Wed,  7 Nov 2012 01:05:50 -0200
> Rafael Aquini  wrote:
> 
> > Memory fragmentation introduced by ballooning might reduce significantly
> > the number of 2MB contiguous memory blocks that can be used within a guest,
> > thus imposing performance penalties associated with the reduced number of
> > transparent huge pages that could be used by the guest workload.
> > 
> > This patch introduces a common interface to help a balloon driver on
> > making its page set movable to compaction, and thus allowing the system
> > to better leverage the compation efforts on memory defragmentation.
> 
> 
> mm/migrate.c: In function 'unmap_and_move':
> mm/migrate.c:899: error: 'COMPACTBALLOONRELEASED' undeclared (first use in 
> this function)
> mm/migrate.c:899: error: (Each undeclared identifier is reported only once
> mm/migrate.c:899: error: for each function it appears in.)
> 
> You've been bad - you didn't test with your feature disabled. 
> Please do that.  And not just compilation testing.
>

Gasp... Shame on me, indeed. balloon_event_count() was a macro and I had it
tested a couple of review rounds earlier. You asked me to get rid of the
preprocessor pyrotechnics, and I did, but I miserably failed to re-test it.

I'm pretty sure Santa is not going to visit me this year.

Do you want me to resubmit this?

> 
> We can fix this one with a sucky macro.  I think that's better than
> unconditionally defining the enums.
> 
> --- 
> a/include/linux/balloon_compaction.h~mm-introduce-a-common-interface-for-balloon-pages-mobility-fix
> +++ a/include/linux/balloon_compaction.h
> @@ -207,10 +207,8 @@ static inline gfp_t balloon_mapping_gfp_
>   return GFP_HIGHUSER;
>  }
>  
> -static inline void balloon_event_count(enum vm_event_item item)
> -{
> - return;
> -}
> +/* A macro, to avoid generating references to the undefined COMPACTBALLOON* 
> */
> +#define balloon_event_count(item) do { } while (0)
>  
>  static inline bool balloon_compaction_check(void)
>  {
> 


Re: [PATCH v11 1/7] mm: adjust address_space_operations.migratepage() return code

2012-11-07 Thread Rafael Aquini
On Wed, Nov 07, 2012 at 11:56:10AM -0800, Andrew Morton wrote:
> On Wed,  7 Nov 2012 01:05:48 -0200
> Rafael Aquini  wrote:
> 
> > This patch introduces MIGRATEPAGE_SUCCESS as the default return code
> > for address_space_operations.migratepage() method and documents the
> > expected return code for the same method in failure cases.
> 
> I hit a large number of rejects applying this against linux-next.  Due
> to the increasingly irritating sched/numa code in there.
> 
> I attempted to fix it up and also converted some (but not all) of the
> implicit tests of `rc' against zero.
> 
> Please check the result very carefully - more changes will be needed.
> 
> All those
> 
> - if (rc)
> + if (rc != MIGRATEPAGE_SUCCESS)
> 
> changes are a pain.  Perhaps we shouldn't bother.

Thanks for doing that.

This hunk at migrate_pages(), however, is not necessary:

@@ -1001,7 +1001,7 @@ out:
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;

-   if (rc)
+   if (rc != MIGRATEPAGE_SUCCESS)
return rc;

Here, migrate_pages() is not testing rc for migration success; it's just
deciding whether it has to return -ENOMEM, actually.

I guess a change to make that snippet clearer could be:

diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..6562aee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -987,7 +987,7 @@ int migrate_pages(struct list_head *from,
case -EAGAIN:
retry++;
break;
-   case 0:
+   case MIGRATEPAGE_SUCCESS:
break;
default:
/* Permanent failure */
@@ -996,15 +996,12 @@ int migrate_pages(struct list_head *from,
}
}
}
-   rc = 0;
+   rc = nr_failed + retry;
 out:
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;
 
-   if (rc)
-   return rc;
-
-   return nr_failed + retry;
+   return rc;
 }


I can rebase this patch and resubmit if you prefer.


Re: [PATCH v11 5/7] virtio_balloon: introduce migration primitives to balloon pages

2012-11-07 Thread Rafael Aquini
On Wed, Nov 07, 2012 at 11:58:10AM -0800, Andrew Morton wrote:
> On Wed,  7 Nov 2012 01:05:52 -0200
> Rafael Aquini  wrote:
> 
> > Memory fragmentation introduced by ballooning might reduce significantly
> > the number of 2MB contiguous memory blocks that can be used within a guest,
> > thus imposing performance penalties associated with the reduced number of
> > transparent huge pages that could be used by the guest workload.
> > 
> > Besides making balloon pages movable at allocation time and introducing
> > the necessary primitives to perform balloon page migration/compaction,
> > this patch also introduces the following locking scheme, in order to
> > enhance the syncronization methods for accessing elements of struct
> > virtio_balloon, thus providing protection against concurrent access
> > introduced by parallel memory migration threads.
> > 
> > ...
> >
> > @@ -122,18 +128,25 @@ static void set_page_pfns(u32 pfns[], struct page 
> > *page)
> >  
> >  static void fill_balloon(struct virtio_balloon *vb, size_t num)
> >  {
> > +   struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
> > +
> > +   static DEFINE_RATELIMIT_STATE(fill_balloon_rs,
> > + DEFAULT_RATELIMIT_INTERVAL,
> > + DEFAULT_RATELIMIT_BURST);
> > +
> > /* We can only do one array worth at a time. */
> > num = min(num, ARRAY_SIZE(vb->pfns));
> >  
> > +   mutex_lock(&vb->balloon_lock);
> > for (vb->num_pfns = 0; vb->num_pfns < num;
> >  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > -   struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
> > -   __GFP_NOMEMALLOC | __GFP_NOWARN);
> > +   struct page *page = balloon_page_enqueue(vb_dev_info);
> > +
> > if (!page) {
> > -   if (printk_ratelimit())
> > +   if (__ratelimit(&fill_balloon_rs))
> > dev_printk(KERN_INFO, &vb->vdev->dev,
> >"Out of puff! Can't get %zu pages\n",
> > -  num);
> > +  VIRTIO_BALLOON_PAGES_PER_PAGE);
> > /* Sleep for at least 1/5 of a second before retry. */
> > msleep(200);
> > break;
> 
> linux-next's fill_balloon() has already been converted to
> dev_info_ratelimited().  I fixed everything up.  Please check the result.

Looks great, thanks for doing it.



Re: [PATCH v11 5/7] virtio_balloon: introduce migration primitives to balloon pages

2012-11-07 Thread Rafael Aquini
On Wed, Nov 07, 2012 at 04:11:46PM -0800, Andrew Morton wrote:
> On Thu, 08 Nov 2012 09:32:18 +1030
> Rusty Russell  wrote:
> 
> > Rafael Aquini  writes:
> > > + * virtballoon_migratepage - perform the balloon page migration on 
> > > behalf of
> > > + *a compation thread. (called under page 
> > > lock)
> > 
> > > + if (!mutex_trylock(&vb->balloon_lock))
> > > + return -EAGAIN;
> > 
> > Erk, OK...
> 
> Not really.  As is almost always the case with a trylock, it needs a
> comment explaining why we couldn't use the far superior mutex_lock(). 
> Data: this reader doesn't know!
>


That was just to alleviate balloon_lock contention when we're migrating pages
concurrently with fill_balloon() or leak_balloon(), as it's easier to retry
the page migration later (in the contended case).
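In other words, the migration path just backs off instead of sleeping on the
mutex; the pattern in question is simply:

	/* inside virtballoon_migratepage(), which is called under page lock */
	if (!mutex_trylock(&vb->balloon_lock))
		return -EAGAIN;	/* compaction will retry this page later */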

 
> > > + /* balloon's page migration 1st step  -- inflate "newpage" */
> > > + spin_lock_irqsave(&vb_dev_info->pages_lock, flags);
> > > + balloon_page_insert(newpage, mapping, &vb_dev_info->pages);
> > > + vb_dev_info->isolated_pages--;
> > > + spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> > > + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> > > + set_page_pfns(vb->pfns, newpage);
> > > + tell_host(vb, vb->inflate_vq);
> > 
> > tell_host does wait_event(), so you can't call it under the page_lock.
> > Right?
> 
> Sleeping inside lock_page() is OK.  More problematic is that GFP_KERNEL
> allocation.  iirc it _should_ be OK.  Core VM uses trylock_page() and
> the filesystems shouldn't be doing a synchronous lock_page() in the
> pageout path.  But I suspect it isn't a well-tested area.


The locked page under migration is not contended by any other FS / core VM path,
as it is already isolated by compaction onto a private migration page list.
OTOH, there's a chance of another parallel compaction thread hitting this page
while scanning page blocks for isolation, but that path can be considered safe
as it uses trylock_page().



> 
> > You probably get away with it because current qemu will service you
> > immediately.  You could spin here in this case for the moment.
> > 
> > There's a second call to tell_host():
> > 
> > > + /*
> > > +  * balloon's page migration 2nd step -- deflate "page"
> > > +  *
> > > +  * It's safe to delete page->lru here because this page is at
> > > +  * an isolated migration list, and this step is expected to happen here
> > > +  */
> > > + balloon_page_delete(page);
> > > + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> > > + set_page_pfns(vb->pfns, page);
> > > + tell_host(vb, vb->deflate_vq);
> > 
> > The first one can be delayed, the second one can be delayed if the host
> > didn't ask for VIRTIO_BALLOON_F_MUST_TELL_HOST (qemu doesn't).
> > 
> > We could implement a proper request queue for these, and return -EAGAIN
> > if the queue fills.  Though in practice, it's not important (it might
> > help performance).
> 


Re: [PATCH v11 5/7] virtio_balloon: introduce migration primitives to balloon pages

2012-11-07 Thread Rafael Aquini
On Thu, Nov 08, 2012 at 09:32:18AM +1030, Rusty Russell wrote:
> The first one can be delayed, the second one can be delayed if the host
> didn't ask for VIRTIO_BALLOON_F_MUST_TELL_HOST (qemu doesn't).
> 
> We could implement a proper request queue for these, and return -EAGAIN
> if the queue fills.  Though in practice, it's not important (it might
> help performance).

I liked the idea. Give me the directions to accomplish it and I'll give it a try
for sure.



Re: [PATCH v11 3/7] mm: introduce a common interface for balloon pages mobility

2012-11-09 Thread Rafael Aquini
On Fri, Nov 09, 2012 at 12:11:33PM +, Mel Gorman wrote:
> > +/*
> > + * balloon_page_insert - insert a page into the balloon's page list and 
> > make
> > + *  the page->mapping assignment accordingly.
> > + * @page: page to be assigned as a 'balloon page'
> > + * @mapping : allocated special 'balloon_mapping'
> > + * @head: balloon's device page list head
> > + */
> > +static inline void balloon_page_insert(struct page *page,
> > +  struct address_space *mapping,
> > +  struct list_head *head)
> > +{
> > +   list_add(&page->lru, head);
> > +   /*
> > +* Make sure the page is already inserted on balloon's page list
> > +* before assigning its ->mapping.
> > +*/
> > +   smp_wmb();
> > +   page->mapping = mapping;
> > +}
> > +
> 
> Elsewhere we have;
> 
>   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
>   balloon_page_insert(page, b_dev_info->mapping, &b_dev_info->pages);
>   spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
> 
> So this happens under an irq-safe lock. Why is a smp_wmb necessary?
> 
> > +
> > +/*
> > + * balloon_page_delete - clear the page->mapping and delete the page from
> > + *  balloon's page list accordingly.
> > + * @page: page to be released from balloon's page list
> > + */
> > +static inline void balloon_page_delete(struct page *page)
> > +{
> > +   page->mapping = NULL;
> > +   /*
> > +* Make sure page->mapping is cleared before we proceed with
> > +* balloon's page list deletion.
> > +*/
> > +   smp_wmb();
> > +   list_del(&page->lru);
> > +}
> > +
> 
> Same thing on locking except this also appears to be under the page lock
> making the barrier seem even more unnecessary.
> 
> > +/*
> > + * __is_movable_balloon_page - helper to perform @page mapping->flags tests
> > + */
> > +static inline bool __is_movable_balloon_page(struct page *page)
> > +{
> > +   /*
> > +* we might attempt to read ->mapping concurrently to other
> > +* threads trying to write to it.
> > +*/
> > +   struct address_space *mapping = ACCESS_ONCE(page->mapping);
> > +   smp_read_barrier_depends();
> > +   return mapping_balloon(mapping);
> > +}
> > +
> 
> What happens if this race occurs? I assume it's a racy check before you
> isolate the balloon in which case the barrier may be overkill.
>

You're 100% right. If, by any chance, we stumble across a balloon page
transitioning to a non-balloon page (about to be released by the driver) while
scanning pages for isolation, the racy checks at balloon_page_isolate() will
catch that and properly sort the situation out.
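A sketch of the interleaving being described, assuming the driver's release
path clears ->mapping under the page lock via balloon_page_delete():

	/*
	 * driver release path:              compaction isolation path:
	 *
	 *                                   balloon_page_movable(page)
	 *                                       (lockless; may be stale)
	 * lock_page(page)
	 * balloon_page_delete(page)
	 *     -> page->mapping = NULL
	 * unlock_page(page)
	 *                                   trylock_page(page)
	 *                                   __is_movable_balloon_page(page)
	 *                                       -> sees NULL ->mapping, backs off
	 */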

> > +/*
> > + * balloon_page_movable - test page->mapping->flags to identify balloon 
> > pages
> > + * that can be moved by compaction/migration.
> > + *
> > + * This function is used at core compaction's page isolation scheme, 
> > therefore
> > + * most pages exposed to it are not enlisted as balloon pages and so, to 
> > avoid
> > + * undesired side effects like racing against __free_pages(), we cannot 
> > afford
> > + * holding the page locked while testing page->mapping->flags here.
> > + *
> > + * As we might return false positives in the case of a balloon page being 
> > just
> > + * released under us, the page->mapping->flags need to be re-tested later,
> > + * under the proper page lock, at the functions that will be coping with 
> > the
> > + * balloon page case.
> > + */
> > +static inline bool balloon_page_movable(struct page *page)
> > +{

 
> > +# support for memory balloon compaction
> > +config BALLOON_COMPACTION
> > +   bool "Allow for balloon memory compaction/migration"
> > +   select COMPACTION
> > +   depends on VIRTIO_BALLOON
> > +   help
> > + Memory fragmentation introduced by ballooning might reduce
> > + significantly the number of 2MB contiguous memory blocks that can be
> > + used within a guest, thus imposing performance penalties associated
> > + with the reduced number of transparent huge pages that could be used
> > + by the guest workload. Allowing the compaction & migration for memory
> > + pages enlisted as being part of memory balloon devices avoids the
> > + scenario aforementioned and helps improving memory defragmentation.
> > +
> 
> Rather than select COMPACTION, should it depend on it? Similarly as THP
> is the primary motivation, would it make more sense to depend on
> TRANSPARENT_HUGEPAGE?
> 
> Should it default y? It seems useful, why would someone support
> VIRTIO_BALLOON and *not* use this?
>
Good catch, will change it.
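For reference, the revised stanza addressing both points might look like this
(a sketch; the final form goes into v12):

config BALLOON_COMPACTION
	bool "Allow for balloon memory compaction/migration"
	def_bool y
	depends on COMPACTION && VIRTIO_BALLOON
	help
	  (help text unchanged)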
 
> 
> Ok, I did not spot any obvious problems in this. The barriers were the
> big issue for me really - they seem overkill. I think we've discussed
> this already but even though it was recent I cannot remember the
> conclusion. In a sense, it doesn't matter because it should have been
> described in the code anyway.
> 
> If you get the barrier issue sorted out then feel 

Re: [PATCH v11 7/7] mm: add vm event counters for balloon pages compaction

2012-11-09 Thread Rafael Aquini
On Fri, Nov 09, 2012 at 12:20:33PM +, Mel Gorman wrote:
> On Wed, Nov 07, 2012 at 01:05:54AM -0200, Rafael Aquini wrote:
> > This patch introduces a new set of vm event counters to keep track of
> > ballooned pages compaction activity.
> > 
> > Signed-off-by: Rafael Aquini 
> 
> Other than confirming the thing actually works can any meaningful
> conclusions be drawn from this counters?
> 
> I know I have been inconsistent on this myself in the past but recently
> I've been taking the attitude that the counters can be used to fit into
> some other metric. I'm looking to change the compaction counters to be
> able to build a basic cost model for example. The same idea could be
> used for balloons of course but it's a less critical path than
> compaction for THP for example.
> 
> Assuming it builds and all the defines are correct when the feature is
> not configured (I didn't check) then there is nothing wrong with the
> patch. However, if it was dropped would it make life very hard or would
> you notice?
> 

Originally, I proposed this patch as droppable (and it's still droppable)
because its major purpose was solely to show the thing working consistently.

OTOH, it might make it easier to spot breakages if it remains with the
merged bits, and per a reviewer's request I removed its 'DROP BEFORE MERGE'
disclaimer.

   https://lkml.org/lkml/2012/8/8/616

-- Rafael


Re: [PATCH v11 3/7] mm: introduce a common interface for balloon pages mobility

2012-11-09 Thread Rafael Aquini
On Fri, Nov 09, 2012 at 04:23:27PM +, Mel Gorman wrote:
> On Fri, Nov 09, 2012 at 12:53:22PM -0200, Rafael Aquini wrote:
> > > 
> > > If you get the barrier issue sorted out then feel free to add
> > > 
> > > Acked-by: Mel Gorman 
> > > 
> > 
> > I believe we can drop the barriers stuff, as the locking scheme is now 
> > provinding
> > enough protection against collisions between isolation page scanning and
> > balloon_leak() page release (the major concern that has lead to the barriers
> > originally)
> > 
> > I'll refactor this patch with no barriers and ensure a better commentary on 
> > the
> > aforementioned locking scheme and resubmit, if it's OK to everyone
> > 
> 
> Sounds good to me. When they are dropped again feel free to stick my ack
> on for the compaction and migration parts. The virtio aspects are up to
> someone else :)
> 

Andrew,

If no further objections are raised on dropping those barriers, would you
like me to resubmit the whole series rebased on the latest -next,
or just this (new) refactored patch?

-- Rafael


[PATCH v12 1/7] mm: adjust address_space_operations.migratepage() return code

2012-11-11 Thread Rafael Aquini
This patch introduces MIGRATEPAGE_SUCCESS as the default return code
for address_space_operations.migratepage() method and documents the
expected return code for the same method in failure cases.

Signed-off-by: Rafael Aquini 
---
 fs/hugetlbfs/inode.c|  4 ++--
 include/linux/migrate.h |  7 +++
 mm/migrate.c| 33 +++--
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 14bc0c1..fed1cd5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -582,11 +582,11 @@ static int hugetlbfs_migrate_page(struct address_space 
*mapping,
int rc;
 
rc = migrate_huge_page_move_mapping(mapping, newpage, page);
-   if (rc)
+   if (rc != MIGRATEPAGE_SUCCESS)
return rc;
migrate_page_copy(newpage, page);
 
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9a5afea..fab15ae 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,6 +7,13 @@
 
 typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
+/*
+ * Return values from address_space_operations.migratepage():
+ * - negative errno on page migration failure;
+ * - zero on page migration success;
+ */
+#define MIGRATEPAGE_SUCCESS0
+
 #ifdef CONFIG_MIGRATION
 
 extern void putback_lru_pages(struct list_head *l);
diff --git a/mm/migrate.c b/mm/migrate.c
index 0c5ec37..6f408c7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -286,7 +286,7 @@ static int migrate_page_move_mapping(struct address_space 
*mapping,
expected_count += 1;
if (page_count(page) != expected_count)
return -EAGAIN;
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
}
 
spin_lock_irq(&mapping->tree_lock);
@@ -356,7 +356,7 @@ static int migrate_page_move_mapping(struct address_space 
*mapping,
}
spin_unlock_irq(&mapping->tree_lock);
 
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 
 /*
@@ -372,7 +372,7 @@ int migrate_huge_page_move_mapping(struct address_space 
*mapping,
if (!mapping) {
if (page_count(page) != 1)
return -EAGAIN;
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
}
 
spin_lock_irq(&mapping->tree_lock);
@@ -399,7 +399,7 @@ int migrate_huge_page_move_mapping(struct address_space 
*mapping,
page_unfreeze_refs(page, expected_count - 1);
 
spin_unlock_irq(&mapping->tree_lock);
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 
 /*
@@ -486,11 +486,11 @@ int migrate_page(struct address_space *mapping,
 
rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
 
-   if (rc)
+   if (rc != MIGRATEPAGE_SUCCESS)
return rc;
 
migrate_page_copy(newpage, page);
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(migrate_page);
 
@@ -513,7 +513,7 @@ int buffer_migrate_page(struct address_space *mapping,
 
rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);
 
-   if (rc)
+   if (rc != MIGRATEPAGE_SUCCESS)
return rc;
 
/*
@@ -549,7 +549,7 @@ int buffer_migrate_page(struct address_space *mapping,
 
} while (bh != head);
 
-   return 0;
+   return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(buffer_migrate_page);
 #endif
@@ -628,7 +628,7 @@ static int fallback_migrate_page(struct address_space 
*mapping,
  *
  * Return value:
  *   < 0 - error code
- *  == 0 - success
+ *  MIGRATEPAGE_SUCCESS - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
int remap_swapcache, enum migrate_mode mode)
@@ -665,7 +665,7 @@ static int move_to_new_page(struct page *newpage, struct 
page *page,
else
rc = fallback_migrate_page(mapping, newpage, page, mode);
 
-   if (rc) {
+   if (rc != MIGRATEPAGE_SUCCESS) {
newpage->mapping = NULL;
} else {
if (remap_swapcache)
@@ -814,7 +814,7 @@ skip_unmap:
put_anon_vma(anon_vma);
 
 uncharge:
-   mem_cgroup_end_migration(mem, page, newpage, rc == 0);
+   mem_cgroup_end_migration(mem, page, newpage, rc == MIGRATEPAGE_SUCCESS);
 unlock:
unlock_page(page);
 out:
@@ -987,7 +987,7 @@ int migrate_pages(struct list_head *from,
case -EAGAIN:
retry++;
break;
-   case 0:
+   case MIGRATEPAGE_SUCCESS:
break;
default:
/* Permanent failure */
@@ -996,15 +996,12 @@ int mi

[PATCH v12 3/7] mm: introduce a common interface for balloon pages mobility

2012-11-11 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces a common interface to help a balloon driver make
its page set movable by compaction, thus allowing the system to better
leverage compaction's efforts on memory defragmentation.

Signed-off-by: Rafael Aquini 
Acked-by: Mel Gorman 
---
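To make the intended usage concrete, here is a rough sketch of the driver-side
flow this interface enables (hypothetical driver snippet; error handling is
omitted, and "my_balloon" is just a placeholder for the driver's descriptor):
---8<---
/* device setup: allocate the page book-keeper for this balloon device */
struct balloon_dev_info *b_dev_info = balloon_devinfo_alloc(my_balloon);

/* inflate path: grab a movable page and enqueue it under pages_lock */
struct page *page = balloon_page_enqueue(b_dev_info);

/* deflate path: dequeue safely, racing against compaction's isolation */
page = balloon_page_dequeue(b_dev_info);
if (page)
	balloon_page_free(page);	/* release back to the free lists */
---8<---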
 include/linux/balloon_compaction.h | 256 +++
 include/linux/migrate.h|  10 ++
 include/linux/pagemap.h|  16 ++
 mm/Kconfig |  15 ++
 mm/Makefile|   3 +-
 mm/balloon_compaction.c| 302 +
 6 files changed, 601 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/balloon_compaction.h
 create mode 100644 mm/balloon_compaction.c

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
new file mode 100644
index 000..2e63d94
--- /dev/null
+++ b/include/linux/balloon_compaction.h
@@ -0,0 +1,256 @@
+/*
+ * include/linux/balloon_compaction.h
+ *
+ * Common interface definitions for making balloon pages movable by compaction.
+ *
+ * Although it is perfectly possible to migrate ballooned pages, they pose a
+ * special corner case to compaction scans because balloon pages are not
+ * enlisted on any LRU list like the other pages we compact / migrate.
+ *
+ * As the page isolation scanning step a compaction thread does is a lockless
+ * procedure (from a page standpoint), it might bring some racy situations
+ * while performing balloon page compaction. In order to sort out these racy
+ * scenarios and safely perform balloon's page compaction and migration we
+ * must, always, ensure following these three simple rules:
+ *
+ *   i. when updating a balloon's page ->mapping element, strictly do it under
+ *  the following lock order, independently of the far superior
+ *  locking scheme (lru_lock, balloon_lock):
+ * +-page_lock(page);
+ *   +--spin_lock_irq(&b_dev_info->pages_lock);
+ * ... page->mapping updates here ...
+ *
+ *  ii. before isolating or dequeueing a balloon page from the balloon device
+ *  pages list, the page reference counter must be raised by one and the
+ *  extra refcount must be dropped when the page is enqueued back into
+ *  the balloon device page list, thus a balloon page keeps its reference
+ *  counter raised only while it is under our special handling;
+ *
+ * iii. after the lockless scan step has selected a potential balloon page for
+ *  isolation, re-test the page->mapping flags and the page ref counter
+ *  under the proper page lock, to ensure isolating a valid balloon page
+ *  (not yet isolated, nor under release procedure)
+ *
+ * The functions provided by this interface are meant to help cope with
+ * the aforementioned balloon page corner case, as well as to ensure the simple
+ * set of exposed rules is satisfied while we are dealing with balloon pages
+ * compaction / migration.
+ *
+ * Copyright (C) 2012, Red Hat, Inc.  Rafael Aquini 
+ */
+#ifndef _LINUX_BALLOON_COMPACTION_H
+#define _LINUX_BALLOON_COMPACTION_H
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Balloon device information descriptor.
+ * This struct is used to allow the common balloon compaction interface
+ * procedures to find the proper balloon device holding memory pages they'll
+ * have to cope with for page compaction / migration, and it also serves the
+ * balloon driver as a page book-keeper for its registered balloon devices.
+ */
+struct balloon_dev_info {
+   void *balloon_device;   /* balloon device descriptor */
+   struct address_space *mapping;  /* balloon special page->mapping */
+   unsigned long isolated_pages;   /* # of isolated pages for migration */
+   spinlock_t pages_lock;  /* Protection to pages list */
+   struct list_head pages; /* Pages enqueued & handled to Host */
+};
+
+extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
+extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info);
+extern struct balloon_dev_info *balloon_devinfo_alloc(
+   void *balloon_dev_descriptor);
+
+static inline void balloon_devinfo_free(struct balloon_dev_info *b_dev_info)
+{
+   kfree(b_dev_info);
+}
+
+/*
+ * balloon_page_free - release a balloon page back to the page free lists
+ * @page: ballooned page to be set free
+ *
+ * This function must be used to properly set free an isolated/dequeued balloon
+ * page at the end of a successful page migration, or at the balloon driver's
+ * page release procedure.

[PATCH v12 5/7] virtio_balloon: introduce migration primitives to balloon pages

2012-11-11 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

Besides making balloon pages movable at allocation time and introducing
the necessary primitives to perform balloon page migration/compaction,
this patch also introduces the following locking scheme, in order to
enhance the synchronization methods for accessing elements of struct
virtio_balloon, thus providing protection against concurrent access
introduced by parallel memory migration threads.

 - balloon_lock (mutex) : serializes accesses to elements of
  struct virtio_balloon and its queue operations;

Signed-off-by: Rafael Aquini 
Acked-by: Michael S. Tsirkin 
---
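In sketch form, the rule this patch establishes is that every access to the
pfn array and every virtqueue kick happens under the new mutex (schematic
only; the real fill/leak bodies are in the diff below):
---8<---
/* Schematic, not the driver source: with balloon_lock held, parallel
 * migration threads cannot interleave with an in-flight fill/leak. */
static void example_tell_host_locked(struct virtio_balloon *vb,
				     struct virtqueue *vq)
{
	mutex_lock(&vb->balloon_lock);
	tell_host(vb, vq);		/* vb->pfns[] is stable here */
	mutex_unlock(&vb->balloon_lock);
}
---8<---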
 drivers/virtio/virtio_balloon.c | 139 
 1 file changed, 127 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 586395c..8f92ab7 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -34,6 +35,7 @@
  * page units.
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
+#define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
 
 struct virtio_balloon
 {
@@ -52,15 +54,19 @@ struct virtio_balloon
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
/*
-* The pages we've told the Host we're not using.
+* The pages we've told the Host we're not using are enqueued
+* at vb_dev_info->pages list.
 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
 * to num_pages above.
 */
-   struct list_head pages;
+   struct balloon_dev_info *vb_dev_info;
+
+   /* Synchronize access/update to this struct virtio_balloon elements */
+   struct mutex balloon_lock;
 
/* The array of pfns we tell the Host about. */
unsigned int num_pfns;
-   u32 pfns[256];
+   u32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
 
/* Memory statistics */
int need_stats_update;
@@ -122,17 +128,20 @@ static void set_page_pfns(u32 pfns[], struct page *page)
 
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
+   struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
+
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
 
+   mutex_lock(&vb->balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-   __GFP_NOMEMALLOC | __GFP_NOWARN);
+   struct page *page = balloon_page_enqueue(vb_dev_info);
+
if (!page) {
dev_info_ratelimited(&vb->vdev->dev,
 "Out of puff! Can't get %zu 
pages\n",
-num);
+VIRTIO_BALLOON_PAGES_PER_PAGE);
/* Sleep for at least 1/5 of a second before retry. */
msleep(200);
break;
@@ -140,7 +149,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
set_page_pfns(vb->pfns + vb->num_pfns, page);
vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
totalram_pages--;
-   list_add(&page->lru, &vb->pages);
}
 
/* Didn't get any?  Oh well. */
@@ -148,6 +156,7 @@ static void fill_balloon(struct virtio_balloon *vb, size_t 
num)
return;
 
tell_host(vb, vb->inflate_vq);
+   mutex_unlock(&vb->balloon_lock);
 }
 
 static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -156,7 +165,7 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned 
int num)
 
/* Find pfns pointing at start of each page, get pages and free them. */
for (i = 0; i < num; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   __free_page(balloon_pfn_to_page(pfns[i]));
+   balloon_page_free(balloon_pfn_to_page(pfns[i]));
totalram_pages++;
}
 }
@@ -164,14 +173,17 @@ static void release_pages_by_pfn(const u32 pfns[], 
unsigned int num)
 static void leak_balloon(struct virtio_balloon *vb, size_t num)
 {
struct page *page;
+   struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
 
/* We can only do one array

[PATCH v12 0/7] make balloon pages movable by compaction

2012-11-11 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch-set follows the main idea discussed at 2012 LSFMMS session:
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
to introduce the required changes to the virtio_balloon driver, as well as
the changes to the core compaction & migration bits, in order to make those
subsystems aware of ballooned pages and allow memory balloon pages to become
movable within a guest, thus avoiding the aforementioned fragmentation issue.

The following numbers demonstrate how this patch set benefits compaction,
allowing it to be more effective on memory-ballooned guests.

Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
running on a 4GB RAM KVM guest which was ballooning 512MB RAM in 64MB chunks
every minute (inflating/deflating), while the test was running:

===BEGIN stress-highalloc

STRESS-HIGHALLOC
 highalloc-3.7 highalloc-3.7
 rc4-clean rc4-patch
Pass 1  55.00 ( 0.00%)62.00 ( 7.00%)
Pass 2  54.00 ( 0.00%)62.00 ( 8.00%)
while Rested75.00 ( 0.00%)80.00 ( 5.00%)

MMTests Statistics: duration
 3.7 3.7
   rc4-clean   rc4-patch
User 1207.59 1207.46
System   1300.55 1299.61
Elapsed  2273.72 2157.06

MMTests Statistics: vmstat
3.7 3.7
  rc4-clean   rc4-patch
Page Ins3581516 2374368
Page Outs  1114869210410332
Swap Ins 80  47
Swap Outs  3641 476
Direct pages scanned  37978   33826
Kswapd pages scanned1828245 1342869
Kswapd pages reclaimed  1710236 1304099
Direct pages reclaimed32207   31005
Kswapd efficiency   93% 97%
Kswapd velocity 804.077 622.546
Direct efficiency   84% 91%
Direct velocity  16.703  15.682
Percentage direct scans  2%  2%
Page writes by reclaim792529704
Page writes file  756119228
Page writes anon   3641 476
Page reclaim immediate16764   11014
Page rescued immediate0   0
Slabs scanned   2171904 2152448
Direct inode steals 3852261
Kswapd inode steals  659137  609670
Kswapd skipped wait   1  69
THP fault alloc 546 631
THP collapse alloc  361 339
THP splits  259 263
THP fault fallback   98  50
THP collapse fail20  17
Compaction stalls   747 499
Compaction success  244 145
Compaction failures 503 354
Compaction pages moved   370888  474837
Compaction move failure   77378   65259

===END stress-highalloc

Rafael Aquini (7):
  mm: adjust address_space_operations.migratepage() return code
  mm: redefine address_space.assoc_mapping
  mm: introduce a common interface for balloon pages mobility
  mm: introduce compaction and migration for ballooned pages
  virtio_balloon: introduce migration primitives to balloon pages
  mm: introduce putback_movable_pages()
  mm: add vm event counters for balloon pages compaction

 drivers/virtio/virtio_balloon.c| 139 +++--
 fs/buffer.c|  12 +-
 fs/gfs2/glock.c|   2 +-
 fs/hugetlbfs/inode.c   |   4 +-
 fs/inode.c |   2 +-
 fs/nilfs2/page.c   |   2 +-
 include/linux/balloon_compaction.h | 263 
 include/linux/fs.h |   2 +-
 include/linux/migrate.h|  19 +++
 include/linux/pagemap.h|  16 ++
 include/linux/vm_event_item.h  |   7 +-
 mm/Kconfig |  15 ++
 mm/Makefile|   3 +-
 mm/balloon_compaction.c| 304 +
 mm/compaction.c|  27 +++-
 mm/migrate.c   |  86 ---
 mm/page_alloc.c|   2 +-
 mm/vmstat.c|   9 +-
 18 files changed, 862 insertions(+), 52 deletions(-)
 create mode 100644 include/linux/balloon_compaction.h
 create mode 100644 mm/balloon_compaction.c

Change log:
v12:
 * Address last suggestions on sorting the barriers usage out  (Mel Gorman);
 * Fix reported build breakages for CONFIG_BALLOON_COMPACTION=n (Andrew Morton);
 * Enhance commentary on the locking scheme used for balloon page compacti

[PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

2012-11-11 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces the helper functions as well as the necessary changes
to teach compaction and migration bits how to cope with pages which are
part of a guest memory balloon, in order to make them movable by memory
compaction procedures.

Signed-off-by: Rafael Aquini 
Acked-by: Mel Gorman 
---
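The isolation path relies on the double-check pattern laid out by the rules in
PATCH 3; paraphrased (not the exact source), balloon_page_isolate() boils
down to:
---8<---
static bool example_balloon_isolate(struct page *page)
{
	if (likely(get_page_unless_zero(page))) {	/* rule ii: extra ref */
		if (trylock_page(page)) {
			/* rule iii: re-test flags + refcount under lock */
			if (__is_movable_balloon_page(page) &&
			    page_count(page) == 2) {
				__isolate_balloon_page(page);
				unlock_page(page);
				return true;
			}
			unlock_page(page);
		}
		put_page(page);		/* lost the race: drop our ref */
	}
	return false;
}
---8<---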
 mm/compaction.c | 21 +++--
 mm/migrate.c| 34 --
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9eef558..76abd84 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -565,9 +566,24 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
goto next_pageblock;
}
 
-   /* Check may be lockless but that's ok as we recheck later */
-   if (!PageLRU(page))
+   /*
+* Check may be lockless but that's ok as we recheck later.
+* It's possible to migrate LRU pages and balloon pages;
+* skip any other type of page.
+*/
+   if (!PageLRU(page)) {
+   if (unlikely(balloon_page_movable(page))) {
+   if (locked && balloon_page_isolate(page)) {
+   /* Successfully isolated */
+   cc->finished_update_migrate = true;
+   list_add(&page->lru, migratelist);
+   cc->nr_migratepages++;
+   nr_isolated++;
+   goto check_compact_cluster;
+   }
+   }
continue;
+   }
 
/*
 * PageLRU is set. lru_lock normally excludes isolation
@@ -621,6 +637,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
cc->nr_migratepages++;
nr_isolated++;
 
+check_compact_cluster:
/* Avoid isolating too much */
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
++low_pfn;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f408c7..a771751 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -79,7 +80,10 @@ void putback_lru_pages(struct list_head *l)
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
-   putback_lru_page(page);
+   if (unlikely(balloon_page_movable(page)))
+   balloon_page_putback(page);
+   else
+   putback_lru_page(page);
}
 }
 
@@ -778,6 +782,18 @@ static int __unmap_and_move(struct page *page, struct page 
*newpage,
}
}
 
+   if (unlikely(balloon_page_movable(page))) {
+   /*
+* A ballooned page does not need any special attention from
+* physical to virtual reverse mapping procedures.
+* Skip any attempt to unmap PTEs or to remap swap cache,
+* in order to avoid burning cycles at rmap level, and perform
+* the page migration right away (protected by page lock).
+*/
+   rc = balloon_page_migrate(newpage, page, mode);
+   goto uncharge;
+   }
+
/*
 * Corner case handling:
 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -814,7 +830,9 @@ skip_unmap:
put_anon_vma(anon_vma);
 
 uncharge:
-   mem_cgroup_end_migration(mem, page, newpage, rc == MIGRATEPAGE_SUCCESS);
+   mem_cgroup_end_migration(mem, page, newpage,
+(rc == MIGRATEPAGE_SUCCESS ||
+ rc == MIGRATEPAGE_BALLOON_SUCCESS));
 unlock:
unlock_page(page);
 out:
@@ -846,6 +864,18 @@ static int unmap_and_move(new_page_t get_new_page, 
unsigned long private,
goto out;
 
rc = __unmap_and_move(page, newpage, force, offlining, mode);
+
+   if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) {
+   /*
+* A ballooned page has been migrated already.
+* Now, it's the time to wrap-up count

[PATCH v12 6/7] mm: introduce putback_movable_pages()

2012-11-11 Thread Rafael Aquini
The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
hacks around putback_lru_pages() in order to allow ballooned pages to be
re-inserted on the balloon page list as if a ballooned page were an LRU page.

As ballooned pages are not legitimate LRU pages, this patch introduces
putback_movable_pages() to properly cope with cases where the isolated
pageset contains ballooned pages and LRU pages, thus fixing the mentioned
inelegant hack around putback_lru_pages().

Signed-off-by: Rafael Aquini 
---
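A sketch of the calling convention this patch settles on (compaction-style
caller, assuming the usual compaction_alloc() allocator; illustrative only):
---8<---
/* a pageset built by isolate_migratepages_range() may now mix LRU and
 * balloon pages, so failed migrations must go back via the new helper */
err = migrate_pages(&cc->migratepages, compaction_alloc,
		    (unsigned long)cc, false, MIGRATE_SYNC);
if (err) {
	putback_movable_pages(&cc->migratepages);
	cc->nr_migratepages = 0;
}
---8<---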
 include/linux/migrate.h |  2 ++
 mm/compaction.c |  6 +++---
 mm/migrate.c| 20 
 mm/page_alloc.c |  2 +-
 4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4ce2ee9..42fafb4 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ typedef struct page *new_page_t(struct page *, unsigned long 
private, int **);
 #ifdef CONFIG_MIGRATION
 
 extern void putback_lru_pages(struct list_head *l);
+extern void putback_movable_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
@@ -51,6 +52,7 @@ extern int migrate_misplaced_page(struct page *page, int 
node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
+static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 76abd84..f268bd8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -995,7 +995,7 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
switch (isolate_migratepages(zone, cc)) {
case ISOLATE_ABORT:
ret = COMPACT_PARTIAL;
-   putback_lru_pages(&cc->migratepages);
+   putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
goto out;
case ISOLATE_NONE:
@@ -1018,9 +1018,9 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
nr_remaining);
 
-   /* Release LRU pages not migrated */
+   /* Release isolated pages not migrated */
if (err) {
-   putback_lru_pages(&cc->migratepages);
+   putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
if (err == -ENOMEM) {
ret = COMPACT_PARTIAL;
diff --git a/mm/migrate.c b/mm/migrate.c
index a771751..107a281 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -80,6 +80,26 @@ void putback_lru_pages(struct list_head *l)
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
+   putback_lru_page(page);
+   }
+}
+
+/*
+ * Put previously isolated pages back onto the appropriate lists
+ * from where they were once taken off for compaction/migration.
+ *
+ * This function shall be used instead of putback_lru_pages() whenever the
+ * isolated pageset has been built by isolate_migratepages_range().
+ */
+void putback_movable_pages(struct list_head *l)
+{
+   struct page *page;
+   struct page *page2;
+
+   list_for_each_entry_safe(page, page2, l, lru) {
+   list_del(&page->lru);
+   dec_zone_page_state(page, NR_ISOLATED_ANON +
+   page_is_file_cache(page));
if (unlikely(balloon_page_movable(page)))
balloon_page_putback(page);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fd1919..6b990cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5795,7 +5795,7 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
0, false, MIGRATE_SYNC);
}
 
-   putback_lru_pages(&cc->migratepages);
+   putback_movable_pages(&cc->migratepages);
return ret > 0 ? 0 : ret;
 }
 
-- 
1.7.11.7



[PATCH v12 2/7] mm: redefine address_space.assoc_mapping

2012-11-11 Thread Rafael Aquini
This patch overhauls struct address_space.assoc_mapping renaming it to
address_space.private_data and its type is redefined to void*.
By this approach we consistently name the .private_* elements from
struct address_space as well as allow extended usage for address_space
association with other data structures through ->private_data.

Also, all users of old ->assoc_mapping element are converted to reflect
its new name and type change (->private_data).

Signed-off-by: Rafael Aquini 
---
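In sketch form, what the type change buys us (the balloon usage shown here is
an assumption based on later patches in this series):
---8<---
/* old association, unchanged in behavior, now through the void pointer */
mapping->private_data = buffer_mapping;	/* was ->assoc_mapping */

/* new possibility: hang arbitrary private state off the mapping, e.g.
 * the balloon device descriptor introduced later in this series */
mapping->private_data = b_dev_info;
---8<---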
 fs/buffer.c| 12 ++--
 fs/gfs2/glock.c|  2 +-
 fs/inode.c |  2 +-
 fs/nilfs2/page.c   |  2 +-
 include/linux/fs.h |  2 +-
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b5f0442..e0bad95 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -555,7 +555,7 @@ void emergency_thaw_all(void)
  */
 int sync_mapping_buffers(struct address_space *mapping)
 {
-   struct address_space *buffer_mapping = mapping->assoc_mapping;
+   struct address_space *buffer_mapping = mapping->private_data;
 
if (buffer_mapping == NULL || list_empty(&mapping->private_list))
return 0;
@@ -588,10 +588,10 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, 
struct inode *inode)
struct address_space *buffer_mapping = bh->b_page->mapping;
 
mark_buffer_dirty(bh);
-   if (!mapping->assoc_mapping) {
-   mapping->assoc_mapping = buffer_mapping;
+   if (!mapping->private_data) {
+   mapping->private_data = buffer_mapping;
} else {
-   BUG_ON(mapping->assoc_mapping != buffer_mapping);
+   BUG_ON(mapping->private_data != buffer_mapping);
}
if (!bh->b_assoc_map) {
spin_lock(&buffer_mapping->private_lock);
@@ -788,7 +788,7 @@ void invalidate_inode_buffers(struct inode *inode)
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
struct list_head *list = &mapping->private_list;
-   struct address_space *buffer_mapping = mapping->assoc_mapping;
+   struct address_space *buffer_mapping = mapping->private_data;
 
spin_lock(&buffer_mapping->private_lock);
while (!list_empty(list))
@@ -811,7 +811,7 @@ int remove_inode_buffers(struct inode *inode)
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
struct list_head *list = &mapping->private_list;
-   struct address_space *buffer_mapping = mapping->assoc_mapping;
+   struct address_space *buffer_mapping = mapping->private_data;
 
spin_lock(&buffer_mapping->private_lock);
while (!list_empty(list)) {
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 6114571..904a808 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -766,7 +766,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
mapping->host = s->s_bdev->bd_inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
-   mapping->assoc_mapping = NULL;
+   mapping->private_data = NULL;
mapping->backing_dev_info = s->s_bdi;
mapping->writeback_index = 0;
}
diff --git a/fs/inode.c b/fs/inode.c
index b03c719..4cac8e1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -165,7 +165,7 @@ int inode_init_always(struct super_block *sb, struct inode 
*inode)
mapping->host = inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
-   mapping->assoc_mapping = NULL;
+   mapping->private_data = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
mapping->writeback_index = 0;
 
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 3e7b2a0..07f76db 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -431,7 +431,7 @@ void nilfs_mapping_init(struct address_space *mapping, 
struct inode *inode,
mapping->host = inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
-   mapping->assoc_mapping = NULL;
+   mapping->private_data = NULL;
mapping->backing_dev_info = bdi;
mapping->a_ops = &empty_aops;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..0982565 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -418,7 +418,7 @@ struct address_space {
struct backing_dev_info *backing_dev_info; /* device readahead, etc */
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
-	struct address_space	*assoc_mapping;	/* ditto */
+	void			*private_data;

[PATCH v12 7/7] mm: add vm event counters for balloon pages compaction

2012-11-11 Thread Rafael Aquini
This patch introduces a new set of vm event counters to keep track of
ballooned pages compaction activity.

Signed-off-by: Rafael Aquini 
---
 include/linux/balloon_compaction.h | 7 +++
 include/linux/vm_event_item.h  | 7 ++-
 mm/balloon_compaction.c| 2 ++
 mm/migrate.c   | 1 +
 mm/vmstat.c| 9 -
 5 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index 2e63d94..68893bc 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -197,8 +197,15 @@ static inline bool balloon_compaction_check(void)
return true;
 }
 
+static inline void balloon_event_count(enum vm_event_item item)
+{
+   count_vm_event(item);
+}
 #else /* !CONFIG_BALLOON_COMPACTION */
 
+/* A macro to avoid generating references to the undefined COMPACTBALLOON* */
+#define balloon_event_count(item) do { } while (0)
+
 static inline void *balloon_mapping_alloc(void *balloon_device,
const struct address_space_operations *a_ops)
 {
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..bd67c3f 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -41,7 +41,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_COMPACTION
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
-#endif
+#ifdef CONFIG_BALLOON_COMPACTION
+   COMPACTBALLOONISOLATED, /* isolated from balloon pagelist */
+   COMPACTBALLOONMIGRATED, /* balloon page successfully migrated */
+   COMPACTBALLOONRETURNED, /* putback to pagelist, not-migrated */
+#endif /* CONFIG_BALLOON_COMPACTION */
+#endif /* CONFIG_COMPACTION */
 #ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 07dbc8e..2c8ce49 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -242,6 +242,7 @@ bool balloon_page_isolate(struct page *page)
if (__is_movable_balloon_page(page) &&
page_count(page) == 2) {
__isolate_balloon_page(page);
+   balloon_event_count(COMPACTBALLOONISOLATED);
unlock_page(page);
return true;
}
@@ -265,6 +266,7 @@ void balloon_page_putback(struct page *page)
__putback_balloon_page(page);
/* drop the extra ref count taken for page isolation */
put_page(page);
+   balloon_event_count(COMPACTBALLOONRETURNED);
} else {
WARN_ON(1);
dump_page(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 107a281..ecae213 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -894,6 +894,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned 
long private,
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
balloon_page_free(page);
+   balloon_event_count(COMPACTBALLOONMIGRATED);
return MIGRATEPAGE_SUCCESS;
}
 out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..18a76ea 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -781,7 +781,14 @@ const char * const vmstat_text[] = {
"compact_stall",
"compact_fail",
"compact_success",
-#endif
+
+#ifdef CONFIG_BALLOON_COMPACTION
+   "compact_balloon_isolated",
+   "compact_balloon_migrated",
+   "compact_balloon_returned",
+#endif /* CONFIG_BALLOON_COMPACTION */
+
+#endif /* CONFIG_COMPACTION */
 
 #ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
-- 
1.7.11.7



Re: [PATCH v12 0/7] make balloon pages movable by compaction

2012-11-11 Thread Rafael Aquini
On Sun, Nov 11, 2012 at 05:01:13PM -0200, Rafael Aquini wrote:
> Change log:
> v12:
>  * Address last suggestions on sorting the barriers usage out  (Mel 
> Gorman);
>  * Fix reported build breakages for CONFIG_BALLOON_COMPACTION=n (Andrew 
> Morton);
>  * Enhance commentary on the locking scheme used for balloon page compaction;
>  * Move all the 'balloon vm_event counter' changes to PATCH 07;

Andrew, 

could you drop the earlier review (v11) and pick this submission (v12) instead,
please?


Aside: before submitting I rebased the series on -next-20121109, after reverting
the v11 commits, to make those clean-up hunks from PATCH 01 apply smoothly.

-- Rafael



Re: [PATCH v11 7/7] mm: add vm event counters for balloon pages compaction

2012-11-11 Thread Rafael Aquini
On Sat, Nov 10, 2012 at 05:55:38PM +0200, Michael S. Tsirkin wrote:
> > mutex_unlock(&vb->balloon_lock);
> > +   balloon_event_count(COMPACTBALLOONMIGRATED);
> >  
> > return MIGRATEPAGE_BALLOON_SUCCESS;
> >  }
> 
> Looks like any ballon would need to do this.
> Can this  chunk go into caller instead?
>

Good catch. It's done, already (v12 just hit the wild).

Thanks!
-- Rafael


Re: [RFC 1/2] virtio_balloon: move locking to the balloon thread

2012-12-19 Thread Rafael Aquini
On Tue, Dec 18, 2012 at 06:17:29PM -0200, Luiz Capitulino wrote:
> Today, the balloon_lock mutex is taken and released by fill_balloon()
> and leak_balloon() when both functions are entered and when they
> return.
> 
> This commit moves the locking to the caller instead, which is
> the balloon() thread. The balloon thread is the sole caller of those
> functions today.
> 
> The reason for this move is that the next commit will introduce
> a shrinker callback for the balloon driver, which will also call
> leak_balloon() but will require different locking semantics.
> 
> Signed-off-by: Luiz Capitulino 
> ---
>  drivers/virtio/virtio_balloon.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 2a70558..877e695 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
> size_t num)
>   /* We can only do one array worth at a time. */
>   num = min(num, ARRAY_SIZE(vb->pfns));
>  
> - mutex_lock(&vb->balloon_lock);
>   for (vb->num_pfns = 0; vb->num_pfns < num;
>vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
>   struct page *page = balloon_page_enqueue(vb_dev_info);
> @@ -155,7 +154,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
> size_t num)
>   /* Did we get any? */
>   if (vb->num_pfns != 0)
>   tell_host(vb, vb->inflate_vq);
> - mutex_unlock(&vb->balloon_lock);
>  }
>

Since you're removing the locking scheme from within this function, I think it
would be a good idea to introduce a comment stating that its caller must hold
the mutex vb->balloon_lock.
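Something along these lines, perhaps (just a sketch; lockdep_assert_held()
would make the new contract both documented and runtime-checkable):
---8<---
/*
 * fill_balloon - inflate the balloon by up to @num pages.
 * Caller must hold vb->balloon_lock.
 */
static void fill_balloon(struct virtio_balloon *vb, size_t num)
{
	lockdep_assert_held(&vb->balloon_lock);
	/* ... existing body, unchanged ... */
}
---8<---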

  
>  static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
> @@ -177,7 +175,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
> size_t num)
>   /* We can only do one array worth at a time. */
>   num = min(num, ARRAY_SIZE(vb->pfns));
>  
> - mutex_lock(&vb->balloon_lock);
>   for (vb->num_pfns = 0; vb->num_pfns < num;
>vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
>   page = balloon_page_dequeue(vb_dev_info);
> @@ -193,7 +190,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
> size_t num)
>* is true, we *have* to do it in this order
>*/
>   tell_host(vb, vb->deflate_vq);
> - mutex_unlock(&vb->balloon_lock);
>   release_pages_by_pfn(vb->pfns, vb->num_pfns);
>  }
>

ditto

  
> @@ -306,11 +302,13 @@ static int balloon(void *_vballoon)
>|| freezing(current));
>   if (vb->need_stats_update)
>   stats_handle_request(vb);
> + mutex_lock(&vb->balloon_lock);
>   if (diff > 0)
>   fill_balloon(vb, diff);
>   else if (diff < 0)
>   leak_balloon(vb, -diff);
>   update_balloon_size(vb);
> + mutex_unlock(&vb->balloon_lock);
>   }
>   return 0;
>  }

Just a nitpick:
As leak_balloon() is also called at remove_common(), you'll need to introduce
the same mutex locking there as well.
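I.e., something like this in the teardown path (a sketch; the surrounding
remove_common() body is assumed):
---8<---
/* teardown also deflates, so it must honor the new locking contract */
mutex_lock(&vb->balloon_lock);
while (vb->num_pages)
	leak_balloon(vb, vb->num_pages);
mutex_unlock(&vb->balloon_lock);
---8<---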


Thanks for moving this forward.

Cheers!
-- Rafael


Re: [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line

2012-12-23 Thread Rafael Aquini
On Fri, Dec 21, 2012 at 06:50:38PM -0500, Rik van Riel wrote:
> Subject: x86,smp: move waiting on contended ticket lock out of line
> 
> Moving the wait loop for congested loops to its own function allows
> us to add things to that wait loop, without growing the size of the
> kernel text appreciably.
> 
> Signed-off-by: Rik van Riel 
> ---

Reviewed-by: Rafael Aquini 


>  arch/x86/include/asm/spinlock.h |   13 +++--
>  arch/x86/kernel/smp.c   |   14 ++
>  2 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..2a45eb0 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,8 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>  
> +extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head 
> of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -53,12 +55,11 @@ static __always_inline void 
> __ticket_spin_lock(arch_spinlock_t *lock)
>  
>   inc = xadd(&lock->tickets, inc);
>  
> - for (;;) {
> - if (inc.head == inc.tail)
> - break;
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> - }
> + if (inc.head == inc.tail)
> + goto out;
> +
> + ticket_spin_lock_wait(lock, inc);
> + out:
>   barrier();  /* make sure nothing creeps before the lock is 
> taken */
>  }
>  
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 48d2b7d..20da354 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>  static bool smp_no_nmi_ipi = false;
>  
>  /*
> + * Wait on a congested ticket spinlock.
> + */
> +void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> +{
> + for (;;) {
> + cpu_relax();
> + inc.head = ACCESS_ONCE(lock->tickets.head);
> +
> + if (inc.head == inc.tail)
> + break;
> + }
> +}
> +
> +/*
>   * this function sends a 'reschedule' IPI to another CPU.
>   * it goes straight through and wastes no time serializing
>   * anything. Worst case is that we lose a reschedule ...


Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks

2012-12-23 Thread Rafael Aquini
On Fri, Dec 21, 2012 at 06:51:15PM -0500, Rik van Riel wrote:
> Subject: x86,smp: proportional backoff for ticket spinlocks
> 
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
> 
> Signed-off-by: Rik van Riel 
> ---

Reviewed-by: Rafael Aquini 

>  arch/x86/kernel/smp.c |6 --
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 20da354..4e44840 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -118,9 +118,11 @@ static bool smp_no_nmi_ipi = false;
>  void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>  {
>   for (;;) {
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> + int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> + while (loops--)
> + cpu_relax();
>  
> + inc.head = ACCESS_ONCE(lock->tickets.head);
>   if (inc.head == inc.tail)
>   break;
>   }
> 
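Just to illustrate the arithmetic for other readers: a CPU that is N tickets
behind the head spins roughly 50 * N cpu_relax() rounds between cacheline
reads. A standalone toy program (not kernel code; the 16-bit __ticket_t is an
assumption for the common configuration):
---8<---
#include <stdio.h>

typedef unsigned short __ticket_t;	/* assumed ticket width */

int main(void)
{
	__ticket_t head = 3, tail = 7;	/* four waiters ahead of us */
	/* same expression as the patch: fixed 50 spins per waiter */
	int loops = 50 * (__ticket_t)(tail - head);

	printf("%d cpu_relax() rounds before re-reading head\n", loops);
	return 0;			/* prints 200 */
}
---8<---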


Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor

2012-12-23 Thread Rafael Aquini
On Fri, Dec 21, 2012 at 10:58:48PM -0500, Rik van Riel wrote:
> On 12/21/2012 10:49 PM, Steven Rostedt wrote:
> >On Fri, Dec 21, 2012 at 09:51:35PM -0500, Rik van Riel wrote:
> 
> >>However, since spinlock contention should not be the
> >>usual state, and all a scalable lock does is make sure
> >>that N+1 CPUs does not perform worse than N CPUs, using
> >>scalable locks is a stop-gap measure.
> >>
> >>I believe a stop-gap measure should be kept as simple as
> >>we can. I am willing to consider moving to a per-lock
> >>delay factor if we can figure out an easy way to do it,
> >>but I would like to avoid too much extra complexity...
> >
> >Rik,
> >
> >I like your solution. It's rather simple and simple solutions tend to
> >end up being the closest to optimal. The more complex a solution gets,
> >the more it starts chasing fireflies.
> 
> >Anyway, I'd like to see this code tested, and more benchmarks run
> >against it.
> 
> Absolutely.  I would love to see if this code actually
> causes regressions anywhere.
> 
> It is simple enough that I suspect it will not, but there
> really is only one way to find out.
> 
> The more people test this with different workloads on
> different SMP systems, the better.
>

Great work Rik,

I have a couple of small SMP systems I'll start testing with your patches.
I might also be able to test this work after New Year's Eve on a big SMP box
that seems to be facing a severe lock starvation issue due to the bus
saturation your work is aiming to reduce.

Cheers!
-- Rafael


Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

2012-11-17 Thread Rafael Aquini
On Sat, Nov 17, 2012 at 01:01:30PM -0500, Sasha Levin wrote:
> 
> I'm getting the following while fuzzing using trinity inside a KVM tools 
> guest,
> on latest -next:
> 
> [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 
> 0194
> [ 1642.785083] IP: [] isolate_migratepages_range+0x344/0x7b0
> 
> My guess is that we see those because of a race during the check in
> isolate_migratepages_range().
> 
> 
> Thanks,
> Sasha

Sasha, could you share your .config and the steps you used with trinity, so I
can attempt to reproduce the issue you reported?

Thanks, 
Rafael


Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

2012-11-20 Thread Rafael Aquini
On Sun, Nov 18, 2012 at 09:59:47AM -0500, Sasha Levin wrote:
> On Sat, Nov 17, 2012 at 4:54 PM, Rafael Aquini  wrote:
> > On Sat, Nov 17, 2012 at 01:01:30PM -0500, Sasha Levin wrote:
> >>
> >> I'm getting the following while fuzzing using trinity inside a KVM tools 
> >> guest,
> >> on latest -next:
> >>
> >> [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 
> >> 0194
> >> [ 1642.785083] IP: [] 
> >> isolate_migratepages_range+0x344/0x7b0
> >>
> >> My guess is that we see those because of a race during the check in
> >> isolate_migratepages_range().
> >>
> >>
> >> Thanks,
> >> Sasha
> >
> > Sasha, could you share your .config and the steps you used with trinity,
> > so I can attempt to reproduce the issue you reported?
> 
> Basically try running trinity (with ./trinity -m --quiet --dangerous
> -l off) inside a disposable guest as root.
> 
> I manage to hit that every couple of hours.
> 
> Config attached.
> 

Howdy Sasha,

After several hours of running trinity tests since last Sunday, on a traditional
KVM-QEMU guest as well as on a lkvm guest (both running next-20121115),
I couldn't hit the crash you've reported a single time, (un)fortunately.

Also, the .config you gave me, applied on top of next-20121115, apparently
hasn't produced the same binary you've been running and hitting the
mentioned bug with.

Here's the RIP for your crash:
[ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at
0194
[ 1642.785083] IP: [] isolate_migratepages_range+0x344/0x7b0


And here's the symbol address for the next-20121115 with your .config I've been
running tests on:
[raquini@x61 linux]$ nm -n vmlinux | grep isolate_migratepages_range 
8122d890 T isolate_migratepages_range

Also, it seems quite clear I'm missing something from your tree, as applying the
RIP displacement (0x344) to my local isolate_migratepages_range symbol address
leads me to the _middle_ of an instruction opcode that does not dereference any
pointers at all.

So, if you're consistently reproducing the same crash, please consider sharing
a disassembled dump of the isolate_migratepages_range() you're running, along
with the crash stack dump.

Cheers!
-- Rafael



Re: [PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-05 Thread Rafael Aquini
On Wed, Jun 05, 2013 at 05:10:31PM -0400, Luiz Capitulino wrote:
> The balloon_page_dequeue() function can return NULL. If it does for
> the first page being freed, then leak_balloon() will create a
> scatter list with len=0. Which in turn seems to generate an invalid
> virtio request.
> 
> Signed-off-by: Luiz Capitulino 
> ---
> 
> PS: I didn't get this in practice. I found it by code review. On the other
> hand, automatic-ballooning was able to put such invalid requests in
> the virtqueue and QEMU would explode...
>

Nice catch! The patch looks sane and replicates the check done at
fill_balloon(). I think we could also use this P.S. as a code comment
to make others aware of this scenario. Thanks Luiz!

Acked-by: Rafael Aquini 
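For reference, the scenario being guarded against, in sketch form (a
paraphrase of the leak_balloon() loop, not the exact source):
---8<---
/* if the very first dequeue returns NULL, the loop exits with
 * vb->num_pfns == 0, and an unconditional tell_host() would then
 * post a zero-length scatterlist to the host */
for (vb->num_pfns = 0; vb->num_pfns < num;
     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
	page = balloon_page_dequeue(vb_dev_info);
	if (!page)
		break;			/* can happen on the 1st iteration */
	set_page_pfns(vb->pfns + vb->num_pfns, page);
	vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
---8<---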

 
> PPS: Very lightly tested
>
>  drivers/virtio/virtio_balloon.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index bd3ae32..71af7b5 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, 
> size_t num)
>* virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>* is true, we *have* to do it in this order
>*/
> - tell_host(vb, vb->deflate_vq);
> + if (vb->num_pfns != 0)
> + tell_host(vb, vb->deflate_vq);
>   mutex_unlock(&vb->balloon_lock);
>   release_pages_by_pfn(vb->pfns, vb->num_pfns);
>  }
> -- 
> 1.8.1.4
> 


Re: [PATCH] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-05 Thread Rafael Aquini
On Wed, Jun 05, 2013 at 07:08:44PM -0400, Luiz Capitulino wrote:
> On Wed, 5 Jun 2013 18:24:49 -0300
> Rafael Aquini  wrote:
> 
> > On Wed, Jun 05, 2013 at 05:10:31PM -0400, Luiz Capitulino wrote:
> > > The balloon_page_dequeue() function can return NULL. If it does for
> > > the first page being freed, then leak_balloon() will create a
> > > scatter list with len=0. Which in turn seems to generate an invalid
> > > virtio request.
> > > 
> > > Signed-off-by: Luiz Capitulino 
> > > ---
> > > 
> > > PS: I didn't get this in practice. I found it by code review. On the other
> > > hand, automatic-ballooning was able to put such invalid requests in
> > > the virtqueue and QEMU would explode...
> > >
> > 
> > Nice catch! The patch looks sane and replicates the check done at
> > fill_balloon(). I think we also could use this P.S. as a commentary 
> > to let others aware of this scenario. Thanks Luiz!
> 
> Want me to respin?
>

That would be great, indeed. I guess the comment could also cover the same
if case at fill_balloon(), assuming those tests are placed to prevent the same
scenario you described in the changelog. You can stick my Ack on it, if reposting.

Cheers!
-- Rafael


Re: [PATCH v2] virtio_balloon: leak_balloon(): only tell host if we got pages deflated

2013-06-06 Thread Rafael Aquini
On Wed, Jun 05, 2013 at 09:18:37PM -0400, Luiz Capitulino wrote:
> The balloon_page_dequeue() function can return NULL. If it does for
> the first page being freed, then leak_balloon() will create a
> scatter list with len=0. Which in turn seems to generate an invalid
> virtio request.
> 
> I didn't get this in practice, I found it by code review. On the other
> hand, such an invalid virtio request will cause errors in QEMU and
> fill_balloon() also performs the same check implemented by this commit.
> 
> Signed-off-by: Luiz Capitulino 
> Acked-by: Rafael Aquini 
> ---
> 
> o v2
> 
>  - Improve changelog
> 
>  drivers/virtio/virtio_balloon.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index bd3ae32..71af7b5 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -191,7 +191,8 @@ static void leak_balloon(struct virtio_balloon *vb, 
> size_t num)
>* virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>* is true, we *have* to do it in this order
>*/
> - tell_host(vb, vb->deflate_vq);

Luiz, sorry for not being clearer before. I was referring to adding a comment in
the code, to explain in a few words why we should not get rid of this checkpoint.

> + if (vb->num_pfns != 0)
> + tell_host(vb, vb->deflate_vq);
>   mutex_unlock(&vb->balloon_lock);

If the comment is regarded as unnecessary, then just ignore my suggestion. I'm
OK with your patch. :)

Cheers!
-- Rafael



[PATCH] swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O compeletion

2013-05-30 Thread Rafael Aquini
read_swap_cache_async() can race against get_swap_page(), and stumble across
a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought into the
swapcache yet. This transient swap_map state is expected to be transitory,
but the actual placement of discard at scan_swap_map() inserts a wait for
I/O completion thus making the thread at read_swap_cache_async() to loop
around its -EEXIST case, while the other end at get_swap_page()
is scheduled away at scan_swap_map(). This can leave the system deadlocked
if the I/O completion happens to be waiting on the CPU workqueue where
read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.

This patch introduces a cond_resched() call to make the aforementioned
read_swap_cache_async() busy loop condition to bail out when necessary,
thus avoiding the subtle race window.

Signed-off-by: Rafael Aquini 
---
 mm/swap_state.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index b3d40dc..9ad9e3b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -336,8 +336,20 @@ struct page *read_swap_cache_async(swp_entry_t entry, 
gfp_t gfp_mask,
 * Swap entry may have been freed since our caller observed it.
 */
err = swapcache_prepare(entry);
-   if (err == -EEXIST) {   /* seems racy */
+   if (err == -EEXIST) {
radix_tree_preload_end();
+   /*
+* We might race against get_swap_page() and stumble
+* across a SWAP_HAS_CACHE swap_map entry whose page
+* has not been brought into the swapcache yet, while
+* the other end is scheduled away waiting on discard
+* I/O completion.
+* In order to avoid turning this transitory state
+* into a permanent loop around this -EEXIST case,
+* lets just conditionally invoke the scheduler,
+* if there are some more important tasks to run.
+*/
+   cond_resched();
continue;
}
if (err) {  /* swp entry is obsolete ? */
-- 
1.8.1.4



Re: [PATCH] swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O compeletion

2013-05-30 Thread Rafael Aquini
On Thu, May 30, 2013 at 03:55:39PM -0400, Johannes Weiner wrote:
> On Thu, May 30, 2013 at 03:05:00PM -0300, Rafael Aquini wrote:
> > read_swap_cache_async() can race against get_swap_page(), and stumble across
> > a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought into the
> > swapcache yet. This transient swap_map state is expected to be transitory,
> > but the actual placement of discard at scan_swap_map() inserts a wait for
> > I/O completion thus making the thread at read_swap_cache_async() to loop
> > around its -EEXIST case, while the other end at get_swap_page()
> > is scheduled away at scan_swap_map(). This can leave the system deadlocked
> > if the I/O completion happens to be waiting on the CPU workqueue where
> 
> waitqueue?
>

Ugh! I will repost this to fix it and the "compeletion" typo at subject...


> > read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.
> > 
> > This patch introduces a cond_resched() call to make the aforementioned
> > read_swap_cache_async() busy loop condition to bail out when necessary,
> > thus avoiding the subtle race window.
> > 
> > Signed-off-by: Rafael Aquini 
> > ---
> >  mm/swap_state.c | 14 +-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index b3d40dc..9ad9e3b 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -336,8 +336,20 @@ struct page *read_swap_cache_async(swp_entry_t entry, 
> > gfp_t gfp_mask,
> >  * Swap entry may have been freed since our caller observed it.
> >  */
> > err = swapcache_prepare(entry);
> > -   if (err == -EEXIST) {   /* seems racy */
> > +   if (err == -EEXIST) {
> > radix_tree_preload_end();
> > +   /*
> > +* We might race against get_swap_page() and stumble
> > +* across a SWAP_HAS_CACHE swap_map entry whose page
> > +* has not been brought into the swapcache yet, while
> > +* the other end is scheduled away waiting on discard
> > +* I/O completion.
> > +* In order to avoid turning this transitory state
> > +* into a permanent loop around this -EEXIST case,
> > +* lets just conditionally invoke the scheduler,
> > +* if there are some more important tasks to run.
> > +*/
> > +   cond_resched();
> 
> Might be worth mentioning the !CONFIG_PREEMPT deadlock scenario here,
> especially since under CONFIG_PREEMPT the radix_tree_preload_end() is
> already a scheduling point through the preempt_enable().
>

Nice suggestion, will do it. Thanks for reviewing this patch!


> Other than that, the patch looks good to me!
> 
> Acked-by: Johannes Weiner 


[PATCH v2] swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion

2013-05-30 Thread Rafael Aquini
read_swap_cache_async() can race against get_swap_page(), and stumble across
a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought into the
swapcache yet. This swap_map state is expected to be transitory,
but the actual placement of discard at scan_swap_map() inserts a wait for
I/O completion, thus making the thread at read_swap_cache_async() loop
around its -EEXIST case, while the other end at get_swap_page()
is scheduled away at scan_swap_map(). This can leave the system deadlocked
if the I/O completion happens to be waiting on the CPU waitqueue where
read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.

This patch introduces a cond_resched() call to make the aforementioned
read_swap_cache_async() busy loop condition to bail out when necessary,
thus avoiding the subtle race window.

Signed-off-by: Rafael Aquini 
Acked-by: Johannes Weiner 
Acked-by: KOSAKI Motohiro 
Acked-by: Hugh Dickins 
Cc: sta...@vger.kernel.org
---
Changelog:
* fixed typos in commit message;(hannes)
* enhanced the commentary on the deadlock scenario; (hannes)
* include the received ACKs and Cc: stable

 mm/swap_state.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index b3d40dc..f24ab0d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -336,8 +336,24 @@ struct page *read_swap_cache_async(swp_entry_t entry, 
gfp_t gfp_mask,
 * Swap entry may have been freed since our caller observed it.
 */
err = swapcache_prepare(entry);
-   if (err == -EEXIST) {   /* seems racy */
+   if (err == -EEXIST) {
radix_tree_preload_end();
+   /*
+* We might race against get_swap_page() and stumble
+* across a SWAP_HAS_CACHE swap_map entry whose page
+* has not been brought into the swapcache yet, while
+* the other end is scheduled away waiting on discard
+* I/O completion at scan_swap_map().
+*
+* In order to avoid turning this transitory state
+* into a permanent loop around this -EEXIST case
+* if !CONFIG_PREEMPT and the I/O completion happens
+* to be waiting on the CPU waitqueue where we are now
+* busy looping, we just conditionally invoke the
+* scheduler here, if there are some more important
+* tasks to run.
+*/
+   cond_resched();
continue;
}
if (err) {  /* swp entry is obsolete ? */
-- 
1.8.1.4
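
Roughly, the deadlock being avoided looks like this (a sketch only, with both
tasks sharing the same CPU and !CONFIG_PREEMPT):
---8<---
/*
 * task A: get_swap_page()               task B: read_swap_cache_async()
 *   scan_swap_map(si, SWAP_HAS_CACHE)
 *     swap_map[offset] = SWAP_HAS_CACHE
 *     discard_swap_cluster()
 *       blocks waiting on discard I/O
 *                                         swapcache_prepare() == -EEXIST
 *                                         continue;  /* loops, never yields */
 *
 * The discard I/O completion never gets to run on the CPU that task B is
 * monopolizing, so neither side can make progress.
 */
---8<---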



Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-13 Thread Rafael Aquini
On Fri, May 10, 2013 at 09:20:46AM -0400, Luiz Capitulino wrote:
> On Thu, 9 May 2013 18:15:19 -0300
> Rafael Aquini  wrote:
> 
> > Since your shrinker
> > doesn't change the balloon target size,
> 
> Which target size are you referring to? The one in the host (member num_pages
> of VirtIOBalloon in QEMU)?
>

Yes, I'm referring to the struct virtio_balloon_config->num_pages element,
which basically is the balloon size target. The guest's struct
virtio_balloon->num_pages is just a bookkeeping element that allows the guest
balloon thread to track how far it is from achieving the target set by the host,
IIUC.
 
> If it the one in the host, then my understanding is that that member is only
> used to communicate the new balloon target to the guest. The guest driver
> will only read it when told (by the host) to do so, and when it does the
> target value will be correct.
>
> Am I right?
> 

You're right, and the host's member is used to communicate the configured size
to the guest's balloon device. However, if it is not changed when the shrinker
causes the balloon to deflate, the balloon thread will be woken up again
to chase the balloon size target once more, won't it? Check
towards_target() and balloon() and you will see where my concern arises from.
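
For reference, the target chasing on the guest side is roughly the following
(a sketch of the virtio_balloon logic of that time; the exact config-space
accessors may differ):
---8<---
static inline s64 towards_target(struct virtio_balloon *vb)
{
	__le32 v;

	/* re-read the target the host wrote into the device config space */
	vb->vdev->config->get(vb->vdev,
			      offsetof(struct virtio_balloon_config, num_pages),
			      &v, sizeof(v));
	return (s64)le32_to_cpu(v) - vb->num_pages;
}

/* balloon() then loops: diff > 0 ? fill_balloon(vb, diff) : leak_balloon(vb, -diff) */
---8<---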

Cheers!
-- Rafael

 
> > as soon as the shrink round finishes the
> > balloon will re-inflate again, won't it? Doesn't this cause a sort of 
> > "balloon
> > thrashing" scenario, if both guest and host are suffering from memory 
> > pressure?
> > 
> > 
> > The rest I have for the moment, are only nitpicks :)
> > 
> > 
> > >  drivers/virtio/virtio_balloon.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/virtio_balloon.h |  1 +
> > >  2 files changed, 56 insertions(+)
> > > 
> > > diff --git a/drivers/virtio/virtio_balloon.c 
> > > b/drivers/virtio/virtio_balloon.c
> > > index 9d5fe2b..f9dcae8 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -71,6 +71,9 @@ struct virtio_balloon
> > >   /* Memory statistics */
> > >   int need_stats_update;
> > >   struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > +
> > > + /* Memory shrinker */
> > > + struct shrinker shrinker;
> > >  };
> > >  
> > >  static struct virtio_device_id id_table[] = {
> > > @@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page 
> > > *page)
> > >   pfns[i] = page_to_balloon_pfn(page) + i;
> > >  }
> > >  
> > > +/* This function should be called with vb->balloon_lock held */
> > >  static void fill_balloon(struct virtio_balloon *vb, size_t num)
> > >  {
> > >   struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
> > > @@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], 
> > > unsigned int num)
> > >   }
> > >  }
> > >  
> > > +/* This function should be called with vb->balloon_lock held */
> > >  static void leak_balloon(struct virtio_balloon *vb, size_t num)
> > >  {
> > >   struct page *page;
> > > @@ -285,6 +290,45 @@ static void update_balloon_size(struct 
> > > virtio_balloon *vb)
> > > &actual, sizeof(actual));
> > >  }
> > >  
> > > +static unsigned long balloon_get_nr_pages(const struct virtio_balloon 
> > > *vb)
> > > +{
> > > + return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
> > > +}
> > > +
> > > +static int balloon_shrinker(struct shrinker *shrinker, struct shrink_control *sc)
> > > +{
> > > + unsigned int nr_pages, new_target;
> > > + struct virtio_balloon *vb;
> > > +
> > > + vb = container_of(shrinker, struct virtio_balloon, shrinker);
> > > + if (!mutex_trylock(&vb->balloon_lock)) {
> > > + return -1;
> > > + }
> > > +
> > > + nr_pages = balloon_get_nr_pages(vb);
> > > + if (!sc->nr_to_scan || !nr_pages) {
> > > + goto out;
> > > + }
> > > +
> > > + /*
> > > +  * If the current balloon size is greater than the number of
> > > +  * pages being reclaimed by the kernel, deflate only the needed
> > > +  * amount. Otherwise deflate everything we have.
> > > +  */
> > > + new_target = 0;
> > > + if (nr_pages > sc->nr_to_scan) {
> > >

Re: [RFC 1/2] virtio_balloon: move balloon_lock mutex to callers

2013-05-09 Thread Rafael Aquini
On Thu, May 09, 2013 at 10:53:48AM -0400, Luiz Capitulino wrote:
> This commit moves the balloon_lock mutex out of the fill_balloon()
> and leak_balloon() functions to their callers.
> 
> The reason for this change is that the next commit will introduce
> a shrinker callback for the balloon driver, which will also call
> leak_balloon() but will require different locking semantics.
> 
> Signed-off-by: Luiz Capitulino 
> ---
>  drivers/virtio/virtio_balloon.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index bd3ae32..9d5fe2b 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -133,7 +133,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
> size_t num)
>   /* We can only do one array worth at a time. */
>   num = min(num, ARRAY_SIZE(vb->pfns));
>  
> - mutex_lock(&vb->balloon_lock);
>   for (vb->num_pfns = 0; vb->num_pfns < num;
>vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
>   struct page *page = balloon_page_enqueue(vb_dev_info);
> @@ -154,7 +153,6 @@ static void fill_balloon(struct virtio_balloon *vb, 
> size_t num)
>   /* Did we get any? */
>   if (vb->num_pfns != 0)
>   tell_host(vb, vb->inflate_vq);
> - mutex_unlock(&vb->balloon_lock);
>  }
>  
>  static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
> @@ -176,7 +174,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
> size_t num)
>   /* We can only do one array worth at a time. */
>   num = min(num, ARRAY_SIZE(vb->pfns));
>  
> - mutex_lock(&vb->balloon_lock);
>   for (vb->num_pfns = 0; vb->num_pfns < num;
>vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
>   page = balloon_page_dequeue(vb_dev_info);
> @@ -192,7 +189,6 @@ static void leak_balloon(struct virtio_balloon *vb, 
> size_t num)
>* is true, we *have* to do it in this order
>*/
>   tell_host(vb, vb->deflate_vq);
> - mutex_unlock(&vb->balloon_lock);
>   release_pages_by_pfn(vb->pfns, vb->num_pfns);
>  }
>  
> @@ -305,11 +301,13 @@ static int balloon(void *_vballoon)
>|| freezing(current));
>   if (vb->need_stats_update)
>   stats_handle_request(vb);
> + mutex_lock(&vb->balloon_lock);
>   if (diff > 0)
>   fill_balloon(vb, diff);
>   else if (diff < 0)
>   leak_balloon(vb, -diff);
>   update_balloon_size(vb);
> + mutex_unlock(&vb->balloon_lock);
>   }
>   return 0;
>  }
> @@ -490,9 +488,11 @@ out:
>  static void remove_common(struct virtio_balloon *vb)
>  {
>   /* There might be pages left in the balloon: free them. */
> + mutex_lock(&vb->balloon_lock);
>   while (vb->num_pages)
>   leak_balloon(vb, vb->num_pages);
>   update_balloon_size(vb);
> + mutex_unlock(&vb->balloon_lock);

I think you will need to introduce the same change as above to 
virtballoon_restore()




Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-09 Thread Rafael Aquini
On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote:
> Automatic ballooning consists of dynamically adjusting the guest's
> balloon according to memory pressure in the host and in the guest.
> 
> This commit implements the guest side of automatic balloning, which
> basically consists of registering a shrinker callback with the kernel,
> which will try to deflate the guest's balloon by the amount of pages
> being requested. The shrinker callback is only registered if the host
> supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.
> 
> Automatic inflate is performed by the host.
> 
> Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
> in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
> SWAP IN and SWAP OUT correspond to the number of pages swapped in and
> swapped out, respectively.
> 
> Auto-ballooning disabled:
> 
> RUN  TIME(s)  SWAP IN  SWAP OUT
> 
> 1634  930980   1588522
> 2610  627422   1362174
> 3649  1079847  1616367
> 4543  953289   1635379
> 5642  913237   1514000
> 
> Auto-ballooning enabled:
> 
> RUN  TIME(s)  SWAP IN  SWAP OUT
> 
> 1629  901  12537
> 2624  981  18506
> 3626  573  9085
> 4631  2250 42534
> 5627  1610 20808
> 
> Signed-off-by: Luiz Capitulino 
> ---

Nice work Luiz! Just allow me a silly question, though. Since your shrinker
doesn't change the balloon target size, as soon as the shrink round finishes the
balloon will re-inflate again, won't it? Doesn't this cause a sort of "balloon
thrashing" scenario, if both guest and host are suffering from memory pressure?


The rest I have for the moment, are only nitpicks :)


>  drivers/virtio/virtio_balloon.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_balloon.h |  1 +
>  2 files changed, 56 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 9d5fe2b..f9dcae8 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -71,6 +71,9 @@ struct virtio_balloon
>   /* Memory statistics */
>   int need_stats_update;
>   struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> +
> + /* Memory shrinker */
> + struct shrinker shrinker;
>  };
>  
>  static struct virtio_device_id id_table[] = {
> @@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page)
>   pfns[i] = page_to_balloon_pfn(page) + i;
>  }
>  
> +/* This function should be called with vb->balloon_lock held */
>  static void fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>   struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
> @@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], 
> unsigned int num)
>   }
>  }
>  
> +/* This function should be called with vb->balloon_lock held */
>  static void leak_balloon(struct virtio_balloon *vb, size_t num)
>  {
>   struct page *page;
> @@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon 
> *vb)
> &actual, sizeof(actual));
>  }
>  
> +static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb)
> +{
> + return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
> +}
> +
> +static int balloon_shrinker(struct shrinker *shrinker, struct shrink_control *sc)
> +{
> + unsigned int nr_pages, new_target;
> + struct virtio_balloon *vb;
> +
> + vb = container_of(shrinker, struct virtio_balloon, shrinker);
> + if (!mutex_trylock(&vb->balloon_lock)) {
> + return -1;
> + }
> +
> + nr_pages = balloon_get_nr_pages(vb);
> + if (!sc->nr_to_scan || !nr_pages) {
> + goto out;
> + }
> +
> + /*
> +  * If the current balloon size is greater than the number of
> +  * pages being reclaimed by the kernel, deflate only the needed
> +  * amount. Otherwise deflate everything we have.
> +  */
> + new_target = 0;
> + if (nr_pages > sc->nr_to_scan) {
> + new_target = nr_pages - sc->nr_to_scan;
> + }
> +

CodingStyle: you don't need the curly braces for any of these single statements above.
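
I.e., something along these lines (sketch):
---8<---
	if (!mutex_trylock(&vb->balloon_lock))
		return -1;

	nr_pages = balloon_get_nr_pages(vb);
	if (!sc->nr_to_scan || !nr_pages)
		goto out;

	new_target = 0;
	if (nr_pages > sc->nr_to_scan)
		new_target = nr_pages - sc->nr_to_scan;
---8<---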



> + leak_balloon(vb, new_target);
> + update_balloon_size(vb);
> + nr_pages = balloon_get_nr_pages(vb);
> +
> +out:
> + mutex_unlock(&vb->balloon_lock);
> + return nr_pages;
> +}
> +
>  static int balloon(void *_vballoon)
>  {
>   struct virtio_balloon *vb = _vballoon;
> @@ -471,6 +515,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
>   goto out_del_vqs;
>   }
>  
> + memset(&vb->shrinker, 0, sizeof(vb->shrinker));
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_AUTO_BALLOON)) {
> + vb->shrinker.shrink = balloon_shrinker;
> + vb->shrinker.seeks = DEFAULT_SEEKS;
> + register_shrinker(&vb->shrinker);
> + }
> +
>   return 0;
>  
>  out_del_vqs:
> @@ -487,6 

Re: [Patch v2] mm: slab: Verify the nodeid passed to ____cache_alloc_node

2013-04-23 Thread Rafael Aquini
On Tue, Apr 23, 2013 at 10:31:36AM -0400, Aaron Tomlin wrote:
> Hi,
> 
> This patch is in response to BZ#42967 [1]. 
> Using VM_BUG_ON so it's used only when CONFIG_DEBUG_VM is set,
> given that ____cache_alloc_node() is a hot code path.
> 
This seems to be a valid condition to BUG_ON, though.

> Cheers,
> Aaron
> 
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=42967
> 
> ---8<---
> mm: slab: Verify the nodeid passed to ____cache_alloc_node
> 
> If the nodeid is > num_online_nodes() this can cause an
> Oops and a panic(). The purpose of this patch is to assert
> if this condition is true to aid debugging efforts rather
> than some random NULL pointer dereference or page fault.
> 
> Signed-off-by: Aaron Tomlin 
> Reviewed-by: Rik van Riel 
>

Acked-by: Rafael Aquini 


 
> 
>  slab.c | 1 +
>  1 file changed, 1 insertion(+)
>  
> diff --git a/mm/slab.c b/mm/slab.c
> index e7667a3..735e8bd 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3412,6 +3412,7 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
>   void *obj;
>   int x;
>  
> + VM_BUG_ON(nodeid > num_online_nodes());
>   l3 = cachep->nodelists[nodeid];
>   BUG_ON(!l3);


[RFC PATCH 01/02] swap: discard while swapping only if SWAP_FLAG_DISCARD_CLUSTER

2013-05-20 Thread Rafael Aquini
Introduce a new flag to make page-cluster fine-grained discards while swapping
conditional, as they can be considered detrimental to some setups. However,
keep allowing batched discards at sys_swapon() time, when enabled by the
system administrator. 

Signed-off-by: Rafael Aquini 
---
 include/linux/swap.h |  8 +++++---
 mm/swapfile.c        | 12 ++++++++----
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1701ce4..ab2e742 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -19,10 +19,11 @@ struct bio;
 #define SWAP_FLAG_PREFER   0x8000  /* set if swap priority specified */
 #define SWAP_FLAG_PRIO_MASK0x7fff
 #define SWAP_FLAG_PRIO_SHIFT   0
-#define SWAP_FLAG_DISCARD  0x1 /* discard swap cluster after use */
+#define SWAP_FLAG_DISCARD  0x1 /* enable discard for swap areas */
+#define SWAP_FLAG_DISCARD_CLUSTER 0x2 /* discard swap clusters after use */
 
 #define SWAP_FLAGS_VALID   (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
-SWAP_FLAG_DISCARD)
+SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_CLUSTER)
 
 static inline int current_is_kswapd(void)
 {
@@ -152,8 +153,9 @@ enum {
SWP_CONTINUED   = (1 << 5), /* swap_map has count continuation */
SWP_BLKDEV  = (1 << 6), /* its a block device */
SWP_FILE= (1 << 7), /* set after swap_activate success */
+   SWP_CLUSTERDISCARD = (1 << 8),  /* discard swap cluster after usage */
/* add others here before... */
-   SWP_SCANNING= (1 << 8), /* refcount in scan_swap_map */
+   SWP_SCANNING= (1 << 9), /* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6c340d9..197461f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -212,7 +212,7 @@ static unsigned long scan_swap_map(struct swap_info_struct 
*si,
si->cluster_nr = SWAPFILE_CLUSTER - 1;
goto checks;
}
-   if (si->flags & SWP_DISCARDABLE) {
+   if (si->flags & SWP_CLUSTERDISCARD) {
/*
 * Start range check on racing allocations, in case
 * they overlap the cluster we eventually decide on
@@ -322,7 +322,7 @@ checks:
 
if (si->lowest_alloc) {
/*
-* Only set when SWP_DISCARDABLE, and there's a scan
+* Only set when SWP_CLUSTERDISCARD, and there's a scan
 * for a free cluster in progress or just completed.
 */
if (found_free_cluster) {
@@ -2123,8 +2123,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
specialfile, int, swap_flags)
p->flags |= SWP_SOLIDSTATE;
p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
}
-   if ((swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0)
+   if ((swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0) {
p->flags |= SWP_DISCARDABLE;
+   if (swap_flags & SWAP_FLAG_DISCARD_CLUSTER)
+   p->flags |= SWP_CLUSTERDISCARD;
+   }
}
 
mutex_lock(&swapon_mutex);
@@ -2135,11 +2138,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
specialfile, int, swap_flags)
enable_swap_info(p, prio, swap_map, frontswap_map);
 
printk(KERN_INFO "Adding %uk swap on %s.  "
-   "Priority:%d extents:%d across:%lluk %s%s%s\n",
+   "Priority:%d extents:%d across:%lluk %s%s%s%s\n",
p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
(p->flags & SWP_SOLIDSTATE) ? "SS" : "",
(p->flags & SWP_DISCARDABLE) ? "D" : "",
+   (p->flags & SWP_CLUSTERDISCARD) ? "C" : "",
(frontswap_map) ? "FS" : "");
 
mutex_unlock(&swapon_mutex);
-- 
1.7.11.7



[RFC PATCH 00/02] swap: allowing a more flexible DISCARD policy

2013-05-20 Thread Rafael Aquini
Howdy folks,

While working on a backport for the following changes:

3399446 swap: discard while swapping only if SWAP_FLAG_DISCARD
052b198 swap: don't do discard if no discard option added

We found ourselves in an interesting discussion on how limited the
user-visible configuration of swap areas has become
after applying the aforementioned changesets.

Before commit 3399446, if the swap backing device supported DISCARD,
then a batched discard was issued at swapon(8) time, and fine-grained DISCARDs
were issued in between freeing swap page-clusters and re-writing to them.
As noted in 3399446's commit message, the fine-grained discards often
didn't help improve performance as expected, and were potentially causing
more trouble than desired. So, commit 3399446 introduced a new swapon flag
to make the fine-grained discards while swapping conditional.
However, a batched discard was still issued every time swapon(8)
turned a new swap area available.

This batched operation that remained at sys_swapon was considered troublesome
for some setups, especially because it ran even when the sysadmin had not
flagged swapon(8) to do discards -- http://www.spinics.net/lists/linux-mm/msg31741.html --
so commit 052b198 got merged to address the scenario described above.
After this last commit, we can now only do both batched and fine-grained
discards for swap, or neither of them.

As depicted above, this is not as flexible as it could be, and the whole
discussion we had (internally) left us wondering whether upstream feels it
would be useful to allow for both a batched discard as well as a
fine-grained discard option for swap. (By batched, here, it could mean just
the one-time operation at swapon, or something similar to what fstrim does.)

In fact, we all agreed on having no discards sent down at all as the default
behaviour. But thinking a little more about the use cases where the device
supports discard:
a) and can do it quickly;
b) but it's slow to do in small granularities (or concurrent with other
   I/O);
c) but the implementation is so horrendous that you don't even want to
   send one down;

And assuming that the sysadmin considers it useful to send the discards down
at all, we would (probably) want the following solutions:

1) do the fine-grained discards if swap device is capable of doing so;
2) do batched discards, either at swapon or via something like fstrim; or
3) turn it off completely (default behavior nowadays)


i.e.: Today, if we have a swap device like (b), we cannot perform (2) even if
it might be regarded as interesting, or necessary to the workload, because it
would imply (1), and the device cannot do that and still perform optimally.


With all that in mind, and in order to attempt to sort out the (un)flexibility
problem exposed above, I came up with the following patches:

01 (kernel) swap: discard while swapping only if SWAP_FLAG_DISCARD_CLUSTER
02 (util-linux) swapon: add "cluster-discard" support


Sorry for the long email.

Your feedback is very much appreciated!
-- Rafael


[RFC PATCH 02/02] swapon: add "cluster-discard" support

2013-05-20 Thread Rafael Aquini
Introduce a new swapon flag/option to support more flexible swap discard setup.
The --cluster-discard swapon(8) option can be used by a system admin to flag
sys_swapon() to perform page-cluster fine-grained discards.

This patch also changes the behaviour of the swapon(8) --discard option, which
now will only be used to tell sys_swapon() that batched discards are to be
issued at swapon(8) time.

Signed-off-by: Rafael Aquini 
---
 sys-utils/swapon.8 | 19 ++++++++-----------
 sys-utils/swapon.c | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/sys-utils/swapon.8 b/sys-utils/swapon.8
index 385bf5a..0c8ac69 100644
--- a/sys-utils/swapon.8
+++ b/sys-utils/swapon.8
@@ -112,10 +112,23 @@ All devices marked as ``swap'' in
 are made available, except for those with the ``noauto'' option.
 Devices that are already being used as swap are silently skipped.
 .TP
+.TP
+.B "\-c, \-\-cluster\-discard"
+Swapping will discard clusters of swap pages in between freeing them
+and re-writing to them, if the swap device supports that. This option
+also implies the
+.I \-d, \-\-discard
+swapon flag.
+The
+.I /etc/fstab
+mount option
+.BI cluster\-discard
+may be also used to enable this flag.
+
+.TP
 .B "\-d, \-\-discard"
-Discard freed swap pages before they are reused, if the swap
-device supports the discard or trim operation.  This may improve
-performance on some Solid State Devices, but often it does not.
+Enables swap discards, if the swap device supports that, and performs
+a batch discard operation for the swap device at swapon time.
 The
 .I /etc/fstab
 mount option
diff --git a/sys-utils/swapon.c b/sys-utils/swapon.c
index f1e2433..a71f69e 100644
--- a/sys-utils/swapon.c
+++ b/sys-utils/swapon.c
@@ -34,7 +34,11 @@
 #endif
 
 #ifndef SWAP_FLAG_DISCARD
-# define SWAP_FLAG_DISCARD 0x1 /* discard swap cluster after use */
+# define SWAP_FLAG_DISCARD 0x1 /* enable discard for swap */
+#endif
+
+#ifndef SWAP_FLAG_DISCARD_CLUSTER
+# define SWAP_FLAG_DISCARD_CLUSTER 0x2 /* discard swap cluster after use */
 #endif
 
 #ifndef SWAP_FLAG_PREFER
@@ -70,7 +74,7 @@ enum {
 
 static int all;
 static int priority = -1;  /* non-prioritized swap by default */
-static int discard;
> +static int discard = 0;	/* don't send swap discards by default */
 
 /* If true, don't complain if the device/file doesn't exist */
 static int ifexists;
@@ -570,8 +574,11 @@ static int do_swapon(const char *orig_special, int prio,
   << SWAP_FLAG_PRIO_SHIFT);
}
 #endif
-   if (fl_discard)
+   if (fl_discard) {
flags |= SWAP_FLAG_DISCARD;
+   if (fl_discard > 1)
+   flags |= SWAP_FLAG_DISCARD_CLUSTER;
+   }
 
status = swapon(special, flags);
if (status < 0)
@@ -615,8 +622,14 @@ static int swapon_all(void)
 
if (mnt_fs_get_option(fs, "noauto", NULL, NULL) == 0)
continue;
-   if (mnt_fs_get_option(fs, "discard", NULL, NULL) == 0)
-   dsc = 1;
+   if (mnt_fs_get_option(fs, "discard", NULL, NULL) == 0) {
> +   if (!dsc)
+   dsc = 1;
+   }
+   if (mnt_fs_get_option(fs, "cluster-discard", NULL, NULL) == 0) {
+   if (!dsc || dsc == 1)
+   dsc = 2;
+   }
if (mnt_fs_get_option(fs, "nofail", NULL, NULL) == 0)
nofail = 1;
if (mnt_fs_get_option(fs, "pri", &p, NULL) == 0 && p)
@@ -647,7 +660,8 @@ static void __attribute__ ((__noreturn__)) usage(FILE * out)
 
fputs(USAGE_OPTIONS, out);
fputs(_(" -a, --all  enable all swaps from /etc/fstab\n"
-   " -d, --discard  discard freed pages before they are 
reused\n"
+   " -c, --cluster-discard  discard freed pages before they are 
reused, while swapping\n"
+   " -d, --discard  discard freed pages before they are 
reused, all at once, at swapon time\n"
" -e, --ifexists silently skip devices that do not 
exist\n"
" -f, --fixpgsz  reinitialize the swap space if 
necessary\n"
" -p, --priority   specify the priority of the swap 
device\n"
@@ -696,6 +710,7 @@ int main(int argc, char *argv[])
 
static const struct option long_opts[] = {
{ "priority", 1, 0, 'p' },
+   { "cluster-discard",  0, 0, 'c' },
{ "discard",  0, 0, 'd' },
{ "ifexists", 0, 0, 'e' },
{ "summary",

Re: [RFC PATCH 01/02] swap: discard while swapping only if SWAP_FLAG_DISCARD_CLUSTER

2013-05-21 Thread Rafael Aquini
Howdy Kosaki-san,

Thanks for your time over this one :)

On Mon, May 20, 2013 at 08:55:33PM -0400, KOSAKI Motohiro wrote:
> (5/20/13 8:04 PM), Rafael Aquini wrote:
> > Introduce a new flag to make page-cluster fine-grained discards while
> > swapping
> > conditional, as they can be considered detrimental to some setups. However,
> > keep allowing batched discards at sys_swapon() time, when enabled by the
> > system administrator. 
> > 
> > Signed-off-by: Rafael Aquini 
> > ---
> >  include/linux/swap.h |  8 +++++---
> >  mm/swapfile.c        | 12 ++++++++----
> >  2 files changed, 13 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 1701ce4..ab2e742 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -19,10 +19,11 @@ struct bio;
> >  #define SWAP_FLAG_PREFER   0x8000  /* set if swap priority specified */
> >  #define SWAP_FLAG_PRIO_MASK0x7fff
> >  #define SWAP_FLAG_PRIO_SHIFT   0
> > -#define SWAP_FLAG_DISCARD  0x1 /* discard swap cluster after use */
> > +#define SWAP_FLAG_DISCARD  0x1 /* enable discard for swap areas */
> > +#define SWAP_FLAG_DISCARD_CLUSTER 0x2 /* discard swap clusters after 
> > use */
> 
> From point of backward compatibility view, 0x1 should be disable both 
> discarding
> when mount and when IO.

I think you mean 0x1 should enable both here, then. That's a nice catch. I'll
try to think of a way to accomplish it in a simple fashion.


> And, introducing new two flags, enable mount time discard and enable IO time 
> discard.
> 
> IOW, Please consider newer kernel and older swapon(8) conbination.
> Other than that, looks good to me.
> 
> 


Re: [RFC PATCH 02/02] swapon: add "cluster-discard" support

2013-05-21 Thread Rafael Aquini
Karel, Motohiro,

Thanks a lot for your time reviewing this patch and providing me with valuable
feedback.

On Tue, May 21, 2013 at 04:17:04PM -0400, KOSAKI Motohiro wrote:
> (5/21/13 6:26 AM), Karel Zak wrote:
> > On Mon, May 20, 2013 at 09:02:43PM -0400, KOSAKI Motohiro wrote:
> >>> - if (fl_discard)
> >>> + if (fl_discard) {
> >>>   flags |= SWAP_FLAG_DISCARD;
> >>> + if (fl_discard > 1)
> >>> + flags |= SWAP_FLAG_DISCARD_CLUSTER;
> >>
> >> This is not enough, IMHO. When running this code on old kernel, swapon() 
> >> return EINVAL.
> >> At that time, we should fall back swapon(0x1).
> > 
> >  Hmm.. currently we don't use any fallback for any swap flag (e.g.
> >  0x1) for compatibility with old kernels. Maybe it's better to
> >  keep it simple and stupid and return an error message than introduce
> >  any super-smart semantic to hide incompatible fstab configuration.
> 
> Hm. If so, I'd propose to revert the following change. 
> 
> > .B "\-d, \-\-discard"
> >-Discard freed swap pages before they are reused, if the swap
> >-device supports the discard or trim operation.  This may improve
> >-performance on some Solid State Devices, but often it does not.
> >+Enables swap discards, if the swap device supports that, and performs
> >+a batch discard operation for the swap device at swapon time.
> 
> 
> And instead, I suggest to make --discard-on-swapon like the following.
> (better name idea is welcome) 
> 
> +--discard-on-swapon
> +Enables swap discards, if the swap device supports that, and performs
> +a batch discard operation for the swap device at swapon time.
> 
> I mean, preserving flags semantics removes the reason we need make a fallback.
> 
>

Instead of reverting and renaming --discard, what about making it accept an
optional argument, so we could use --discard (to enable everything and keep
backward compatibility), and --discard=cluster & --discard=batch (or whatever
we think they should be named)? I'll try to sort this approach out if you folks
think it's worthwhile.

-- Rafael



[PATCH 00/02] swap: allowing a more flexible DISCARD policy V2

2013-05-25 Thread Rafael Aquini
Considering the use cases where the swap device supports discard:
a) and can do it quickly;
b) but it's slow to do in small granularities (or concurrent with other
   I/O);
c) but the implementation is so horrendous that you don't even want to
   send one down;

And assuming that the sysadmin considers it useful to send the discards down
at all, we would (probably) want the following solutions:

  i. do the fine-grained discards for freed swap pages, if device is 
 capable of doing so optimally;
 ii. do single-time (batched) swap area discards, either at swapon 
 or via something like fstrim (not implemented yet);
iii. allow doing both single-time and fine-grained discards; or
 iv. turn it off completely (default behavior)


As implemented today, one can only enable/disable discards for swap, but
one cannot select, for instance, solution (ii) on a swap device like (b),
even though the single-time discard is regarded as interesting, or
necessary to the workload, because it would imply (i), and the device
is not capable of performing it optimally.

This patchset addresses the scenario depicted above by introducing a
way to ensure the (probably) wanted solutions (i, ii, iii and iv) can
be flexibly flagged through swapon(8), allowing a sysadmin to select
the most suitable swap discard policy according to system constraints.

Changeset from V1:
01 (kernel)   swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES
* ensure backwards compatibility with older swapon(8) releases;  (mkosaki)

02 (util-linux) swapon: allow a more flexible swap discard policy
* introduce discard policy selector as an optional argument of --discard option;
* rename user-visible discard policy names;   (mkosaki, karel, jmoyer) 


[PATCH 01/02] swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES

2013-05-25 Thread Rafael Aquini
This patch introduces the new SWAP_FLAG_DISCARD_PAGES and SWAP_FLAG_DISCARD_ONCE
flags to allow more flexible swap discard policies to be flagged through
swapon(8). The default behavior is to keep both single-time, or batched, area
discards (SWAP_FLAG_DISCARD_ONCE) and fine-grained discards for page-clusters
(SWAP_FLAG_DISCARD_PAGES) enabled, in order to keep consistency with older
kernel behavior, as well as to maintain compatibility with older swapon(8).
However, through the newly introduced flags, the most suitable discard policy
can be selected according to any given swap device constraint.

Signed-off-by: Rafael Aquini 
---
 include/linux/swap.h | 13 +++++++++----
 mm/swapfile.c        | 55 ++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1701ce4..33fa21f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -19,10 +19,13 @@ struct bio;
 #define SWAP_FLAG_PREFER   0x8000  /* set if swap priority specified */
 #define SWAP_FLAG_PRIO_MASK0x7fff
 #define SWAP_FLAG_PRIO_SHIFT   0
-#define SWAP_FLAG_DISCARD  0x1 /* discard swap cluster after use */
+#define SWAP_FLAG_DISCARD  0x1 /* enable discard for swap */
+#define SWAP_FLAG_DISCARD_ONCE 0x2 /* discard swap area at swapon-time */
+#define SWAP_FLAG_DISCARD_PAGES 0x4 /* discard page-clusters after use */
 
 #define SWAP_FLAGS_VALID   (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
-SWAP_FLAG_DISCARD)
+SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
+SWAP_FLAG_DISCARD_PAGES)
 
 static inline int current_is_kswapd(void)
 {
@@ -146,14 +149,16 @@ struct swap_extent {
 enum {
SWP_USED= (1 << 0), /* is slot in swap_info[] used? */
SWP_WRITEOK = (1 << 1), /* ok to write to this swap?*/
-   SWP_DISCARDABLE = (1 << 2), /* swapon+blkdev support discard */
+   SWP_DISCARDABLE = (1 << 2), /* blkdev support discard */
SWP_DISCARDING  = (1 << 3), /* now discarding a free cluster */
SWP_SOLIDSTATE  = (1 << 4), /* blkdev seeks are cheap */
SWP_CONTINUED   = (1 << 5), /* swap_map has count continuation */
SWP_BLKDEV  = (1 << 6), /* its a block device */
SWP_FILE= (1 << 7), /* set after swap_activate success */
+   SWP_AREA_DISCARD = (1 << 8),/* single-time swap area discards */
+   SWP_PAGE_DISCARD = (1 << 9),/* freed swap page-cluster discards */
/* add others here before... */
-   SWP_SCANNING= (1 << 8), /* refcount in scan_swap_map */
+   SWP_SCANNING= (1 << 10),/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6c340d9..719513d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -212,7 +212,7 @@ static unsigned long scan_swap_map(struct swap_info_struct 
*si,
si->cluster_nr = SWAPFILE_CLUSTER - 1;
goto checks;
}
-   if (si->flags & SWP_DISCARDABLE) {
+   if (si->flags & SWP_PAGE_DISCARD) {
/*
 * Start range check on racing allocations, in case
 * they overlap the cluster we eventually decide on
@@ -322,7 +322,7 @@ checks:
 
if (si->lowest_alloc) {
/*
-* Only set when SWP_DISCARDABLE, and there's a scan
+* Only set when SWP_PAGE_DISCARD, and there's a scan
 * for a free cluster in progress or just completed.
 */
if (found_free_cluster) {
@@ -2016,6 +2016,20 @@ static int setup_swap_map_and_extents(struct 
swap_info_struct *p,
return nr_extents;
 }
 
+/*
+ * Helper to sys_swapon determining if a given swap
+ * backing device queue supports DISCARD operations.
+ */
+static bool swap_discardable(struct swap_info_struct *si)
+{
+   struct request_queue *q = bdev_get_queue(si->bdev);
+
+   if (!q || !blk_queue_discard(q))
+   return false;
+
+   return true;
+}
+
 SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 {
struct swap_info_struct *p;
@@ -2123,8 +2137,37 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
specialfile, int, swap_flags)
p->flags |= SWP_SOLIDSTATE;
p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
}
-   if ((swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0)
-   p->flags |= SWP_DISCARDABLE;
+
+   if ((swap_flags & SWAP_FLAG_DISCARD) && swap_discardable
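
Per the quotes in the follow-up replies, the remainder of this hunk reads
roughly as follows (a sketch reconstructed from those quotes, not the
verbatim patch):
---8<---
	if ((swap_flags & SWAP_FLAG_DISCARD) && swap_discardable(p)) {
		/* old swapon(8) passes only SWAP_FLAG_DISCARD: keep both */
		p->flags |= (SWP_AREA_DISCARD | SWP_PAGE_DISCARD);

		/*
		 * By flagging sys_swapon, a sysadmin can tell us to
		 * either do single-time area discards only, or to just
		 * perform discards for released swap page-clusters.
		 * Now it's time to adjust the p->flags accordingly.
		 */
		if (swap_flags & SWAP_FLAG_DISCARD_ONCE)
			p->flags &= ~SWP_PAGE_DISCARD;
		else if (swap_flags & SWAP_FLAG_DISCARD_PAGES)
			p->flags &= ~SWP_AREA_DISCARD;
		/* ... */
	}
---8<---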

[PATCH 02/02] swapon: allow a more flexible swap discard policy

2013-05-25 Thread Rafael Aquini
Introduce the necessary changes to swapon(8), allowing a sysadmin to leverage
the new changes introduced to sys_swapon by "swap: discard while swapping
only if SWAP_FLAG_DISCARD_PAGES", therefore allowing a more flexible set of
choices when selecting the discard policy for mounted swap areas.
This patch introduces the following optional arguments to the already
existent swapon(8) "--discard" option, in order to allow a discard type to
be selected at swapon time:
 * once  : only single-time area discards are issued. (at swapon time)
 * pages : discard freed pages before they are reused.
If no policy is selected, both discard types are enabled. (default)
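
For example (device name hypothetical): "swapon --discard=once /dev/sdb2"
selects single-time area discards only, "swapon --discard=pages /dev/sdb2"
selects fine-grained discards only, and a plain "swapon --discard /dev/sdb2"
keeps both enabled.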

Signed-off-by: Rafael Aquini 
---
 sys-utils/swapon.8 | 24 +--
 sys-utils/swapon.c | 70 ++
 2 files changed, 77 insertions(+), 17 deletions(-)

diff --git a/sys-utils/swapon.8 b/sys-utils/swapon.8
index 385bf5a..17f7970 100644
--- a/sys-utils/swapon.8
+++ b/sys-utils/swapon.8
@@ -112,15 +112,25 @@ All devices marked as ``swap'' in
 are made available, except for those with the ``noauto'' option.
 Devices that are already being used as swap are silently skipped.
 .TP
-.B "\-d, \-\-discard"
-Discard freed swap pages before they are reused, if the swap
-device supports the discard or trim operation.  This may improve
-performance on some Solid State Devices, but often it does not.
+.B "\-d, \-\-discard\fR [\fIpolicy\fR]"
+Enable swap discards, if the swap backing device supports the discard or
+trim operation. This may improve performance on some Solid State Devices,
+but often it does not. The long option \-\-discard allows one to select
+between two available swap discard policies:
+.BI \-\-discard=once
+to perform a single-time discard operation for the whole swap area at swapon;
+or
+.BI \-\-discard=pages
+to discard freed swap pages before they are reused, while swapping.
+If no policy is selected, the default behavior is to enable both discard types.
 The
 .I /etc/fstab
-mount option
-.BI discard
-may be also used to enable discard flag.
+mount options
+.BI discard,
+.BI discard=once,
+or
+.BI discard=pages
+may be also used to enable discard flags.
 .TP
 .B "\-e, \-\-ifexists"
 Silently skip devices that do not exist.
diff --git a/sys-utils/swapon.c b/sys-utils/swapon.c
index f1e2433..8a90bfe 100644
--- a/sys-utils/swapon.c
+++ b/sys-utils/swapon.c
@@ -34,9 +34,20 @@
 #endif
 
 #ifndef SWAP_FLAG_DISCARD
-# define SWAP_FLAG_DISCARD 0x1 /* discard swap cluster after use */
+# define SWAP_FLAG_DISCARD 0x1 /* enable discard for swap */
 #endif
 
+#ifndef SWAP_FLAG_DISCARD_ONCE
+# define SWAP_FLAG_DISCARD_ONCE 0x2 /* discard swap area at swapon-time */
+#endif
+
+#ifndef SWAP_FLAG_DISCARD_PAGES
+# define SWAP_FLAG_DISCARD_PAGES 0x4 /* discard page-clusters after use */
+#endif
+
+#define SWAP_FLAGS_DISCARD_VALID (SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | 
\
+ SWAP_FLAG_DISCARD_PAGES)
+
 #ifndef SWAP_FLAG_PREFER
 # define SWAP_FLAG_PREFER  0x8000  /* set if swap priority specified */
 #endif
@@ -70,7 +81,7 @@ enum {
 
 static int all;
 static int priority = -1;  /* non-prioritized swap by default */
-static int discard;
> +static int discard = 0;	/* don't send swap discards by default */
 
 /* If true, don't complain if the device/file doesn't exist */
 static int ifexists;
@@ -570,8 +581,22 @@ static int do_swapon(const char *orig_special, int prio,
   << SWAP_FLAG_PRIO_SHIFT);
}
 #endif
-   if (fl_discard)
-   flags |= SWAP_FLAG_DISCARD;
+   /*
+* Validate the discard flags passed and set them
+* accordingly before calling sys_swapon.
+*/
+   if (fl_discard && !(fl_discard & ~SWAP_FLAGS_DISCARD_VALID)) {
+   /*
+* If we get here with both discard policy flags set,
> +* we just need to tell the kernel to enable discards
> +* and it will do the right thing, just as we expect.
+*/
+   if ((fl_discard & SWAP_FLAG_DISCARD_ONCE) &&
+   (fl_discard & SWAP_FLAG_DISCARD_PAGES))
+   flags |= SWAP_FLAG_DISCARD;
+   else
+   flags |= fl_discard;
+   }
 
status = swapon(special, flags);
if (status < 0)
@@ -611,12 +636,22 @@ static int swapon_all(void)
while (mnt_table_find_next_fs(tb, itr, match_swap, NULL, &fs) == 0) {
/* defaults */
int pri = priority, dsc = discard, nofail = ifexists;
-   char *p, *src;
+   char *p, *src, *dscarg;
 
if (mnt_fs_get_option(fs, "noauto", NULL, NULL) == 0)
continue;
-   if (mnt_fs_get_option(fs

Re: [PATCH 01/02] swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES

2013-05-26 Thread Rafael Aquini
On Sun, May 26, 2013 at 07:44:56AM -0400, KOSAKI Motohiro wrote:
> > +   /*
> > +* By flagging sys_swapon, a sysadmin can tell us to
> > +* either do single-time area discards only, or to just
> > +* perform discards for released swap page-clusters.
> > +* Now it's time to adjust the p->flags accordingly.
> > +*/
> > +   if (swap_flags & SWAP_FLAG_DISCARD_ONCE)
> > +   p->flags &= ~SWP_PAGE_DISCARD;
> > +   else if (swap_flags & SWAP_FLAG_DISCARD_PAGES)
> > +   p->flags &= ~SWP_AREA_DISCARD;
> 
> When using old swapon(8), this code turn off both flags, right?

As SWAP_FLAG_DISCARD -- the flag that enables swap discards -- keeps the
same meaning it had before, when using an old swapon(8),
(SWP_PAGE_DISCARD|SWP_AREA_DISCARD)
will remain flagged whenever discard is enabled, so we keep doing discards
the same way we did before (at swapon time, and for every released
page-cluster).
The flags are cleared orthogonally only when a new swapon(8) selects one of
the particular discard policies available, by using either the
SWAP_FLAG_DISCARD_ONCE or the SWAP_FLAG_DISCARD_PAGES flag.
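
In table form, assuming SWAP_FLAG_DISCARD is set (a sketch):
---8<---
/*
 *  flags passed by swapon(8)       resulting p->flags discard bits
 *  -------------------------       -------------------------------
 *  DISCARD only (old tool)     ->  SWP_AREA_DISCARD | SWP_PAGE_DISCARD
 *  DISCARD | DISCARD_ONCE      ->  SWP_AREA_DISCARD
 *  DISCARD | DISCARD_PAGES     ->  SWP_PAGE_DISCARD
 */
---8<---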



Re: [PATCH 01/02] swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES

2013-05-26 Thread Rafael Aquini
On Sun, May 26, 2013 at 10:55:32AM -0400, KOSAKI Motohiro wrote:
> On Sun, May 26, 2013 at 9:52 AM, Rafael Aquini  wrote:
> > On Sun, May 26, 2013 at 07:44:56AM -0400, KOSAKI Motohiro wrote:
> >> > +   /*
> >> > +* By flagging sys_swapon, a sysadmin can tell 
> >> > us to
> >> > +* either do sinle-time area discards only, or 
> >> > to just
> >> > +* perform discards for released swap 
> >> > page-clusters.
> >> > +* Now it's time to adjust the p->flags 
> >> > accordingly.
> >> > +*/
> >> > +   if (swap_flags & SWAP_FLAG_DISCARD_ONCE)
> >> > +   p->flags &= ~SWP_PAGE_DISCARD;
> >> > +   else if (swap_flags & SWAP_FLAG_DISCARD_PAGES)
> >> > +   p->flags &= ~SWP_AREA_DISCARD;
> >>
> >> When using old swapon(8), this code turn off both flags, right?
> >
> > As SWAP_FLAG_DISCARD -- the flag that enables swap discards -- keeps the
> > same meaning it had before, when using an old swapon(8),
> > (SWP_PAGE_DISCARD|SWP_AREA_DISCARD)
> 
> But old swapon(8) uses neither SWAP_FLAG_DISCARD_ONCE nor
> SWAP_FLAG_DISCARD_PAGES.  It uses only SWAP_FLAG_DISCARD. So, this
> condition disables both SWP_PAGE_DISCARD and SWP_AREA_DISCARD.
>

This condition _only_ disables one of the new flags orthogonally if swapon(8)
flags a policy to sys_swapon. As old swapon(8) can only flag SWAP_FLAG_DISCARD,
the original behavior is kept. Nothing will change when one is using an old
swapon(8) with this changeset. 

 
> And you changed that SWP_DISCARDABLE is not checked in IO path  at all.
> 
> >-   if (si->flags & SWP_DISCARDABLE) {
> >+   if (si->flags & SWP_PAGE_DISCARD) {
> 

And this is exactly what this change is about -- only enabling that particular
I/O path if we've been told to discard swap page-clusters. Notice that having
SWP_PAGE_DISCARD flagged already implies SWP_DISCARDABLE.


> I suggest new swapon(8) don't pass SWP_DISCARDABLE and kernel handle
> SWP_DISCARDABLE as (SWAP_FLAG_DISCARD_ONCE | SWAP_FLAG_DISCARD_PAGES).

As the old swapon(8) case can only pass SWAP_FLAG_DISCARD along, this change
would nullify the backwards compatibility, wouldn't it?



> 
> Optionally, warn SWP_DISCARDABLE is a good idea.
> 
> 
> > will remain flagged when discard is enabled, so we keep doing discards the 
> > same way
> > we did before (at swapon, and for every released page-cluster).
> > The flags are removed orthogonally only when the new swapon(8) selects one 
> > of the
> > particular discard policy available by using either SWAP_FLAG_DISCARD_ONCE,
> > or SWAP_FLAG_DISCARD_PAGES flags.
> >


Re: [Patch v2] skbuff: Hide GFP_ATOMIC page allocation failures for dropped packets

2013-05-27 Thread Rafael Aquini
On Sun, May 26, 2013 at 09:45:01PM +0100, atom...@redhat.com wrote:
> From: Aaron Tomlin 
> 
> Since v1:
>  - Removed unnecessary parentheses (sergei.shtylyov)
> 
> ---8<---
> 
> Failed GFP_ATOMIC allocations by the network stack result in dropped
> packets, which will be received on a subsequent retransmit, and an
> unnecessary, noisy warning with a kernel backtrace.
> 
> These warnings are harmless, but they still cause users to panic and
> file bug reports over dropped packets. It would be better to hide the
> failed allocation warnings and backtraces, and let retransmits handle
> dropped packets quietly.
> 
> Signed-off-by: Aaron Tomlin 
> ---

Acked-by: Rafael Aquini 


>  net/core/skbuff.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index af9185d..84aa870 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -236,7 +236,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> gfp_mask,
>   ? skbuff_fclone_cache : skbuff_head_cache;
>  
>   if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
> - gfp_mask |= __GFP_MEMALLOC;
> + gfp_mask |= __GFP_MEMALLOC | __GFP_NOWARN;
>  
>   /* Get the HEAD */
>   skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> -- 
> 1.8.1.4
> 


Re: [Patch v2] skbuff: Hide GFP_ATOMIC page allocation failures for dropped packets

2013-05-28 Thread Rafael Aquini
On Tue, May 28, 2013 at 09:00:45AM -0700, Ben Greear wrote:
> On 05/27/2013 03:41 PM, Francois Romieu wrote:
> >atom...@redhat.com  :
> >[...]
> >>Failed GFP_ATOMIC allocations by the network stack result in dropped
> >>packets, which will be received on a subsequent retransmit, and an
> >>unnecessary, noisy warning with a kernel backtrace.
> >>
> >>These warnings are harmless, but they still cause users to panic and
> >>file bug reports over dropped packets. It would be better to hide the
> >>failed allocation warnings and backtraces, and let retransmits handle
> >>dropped packets quietly.
> >
> >Linux VM may be perfect but device drivers do stupid things.
> >
> >Please don't paper over it just because some shit ends in your backyard.
> 
> We should rate-limit these messages at least.  When a system is low on memory
> the logs can quickly fill up with useless OOM messages, further slowing
> the system...
>

The real problem seems to be that, more and more, the network stack (drivers,
perhaps)
is relying on chunks of contiguous page-blocks without a fallback mechanism to
order-0 page allocations. When memory gets fragmented, these alloc failures
start to pop up more often and they scare ordinary sysadmins out of their
pants.
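
A fallback along these lines is what I have in mind (a hypothetical sketch,
not taken from any specific driver):
---8<---
	/* quietly try a contiguous high-order block first */
	page = alloc_pages(gfp_mask | __GFP_NOWARN | __GFP_NORETRY, order);
	if (!page && order > 0) {
		/* memory is fragmented: fall back to a single order-0 page */
		page = alloc_pages(gfp_mask, 0);
		order = 0;
	}
---8<---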

The big point of this change was to attempt to relieve some of these warnings,
which we believed to be useless, since the net stack would recover from them
by re-transmissions.
We might have misjudged the scenario, though. Perhaps a better approach would be
making the warning less verbose for all page-alloc failures. We could, perhaps,
only print a stack-dump out if some debug flag is passed along, either as
reference, or by some CONFIG_DEBUG_ preprocessor directive.

Rafael
 
> Ben
> 
> >
> 
> 
> -- 
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com
> 


Re: [Patch v2] skbuff: Hide GFP_ATOMIC page allocation failures for dropped packets

2013-05-28 Thread Rafael Aquini
On Tue, May 28, 2013 at 09:29:37AM -0700, Eric Dumazet wrote:
> On Tue, 2013-05-28 at 13:15 -0300, Rafael Aquini wrote:
> 
> > The real problem seems to be that, more and more, the network stack (drivers, 
> > perhaps)
> > is relying on chunks of contiguous page-blocks without a fallback mechanism 
> > to
> > order-0 page allocations. When memory gets fragmented, these alloc failures
> > start to pop up more often and they scare ordinary sysadmins out of their 
> > pants.
> > 
> 
> Where do you see that ?
>
> I see exactly the opposite trend. 
> 
> We have less and less buggy drivers, and we want to catch last
> offenders.
> 

Perhaps the explanation is that we're looking at the bad effects of old stuff,
then. But just to list a few for your appreciation:

Apr 23 11:25:31 217-IDC kernel: httpd: page allocation failure. order:1, mode:0x20
Apr 23 11:25:31 217-IDC kernel: Pid: 19747, comm: httpd Not tainted 2.6.32-358.2.1.el6.x86_64 #1
Apr 23 11:25:31 217-IDC kernel: Call Trace:
Apr 23 11:25:31 217-IDC kernel:  [] ? __alloc_pages_nodemask+0x757/0x8d0
Apr 23 11:25:31 217-IDC kernel:  [] ? bond_start_xmit+0x2f1/0x5d0 [bonding]


Apr  4 18:51:32 exton kernel: swapper: page allocation failure. order:1, mode:0x20
Apr  4 18:51:32 exton kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.19.1.el6.x86_64 #1
Apr  4 18:51:32 exton kernel: Call Trace:
Apr  4 18:51:32 exton kernel:   [] ? __alloc_pages_nodemask+0x77f/0x940
Apr  4 18:51:32 exton kernel: [] ? kmem_getpages+0x62/0x170
Apr  4 18:51:32 exton kernel: [] ? fallback_alloc+0x1ba/0x270
Apr  4 18:51:32 exton kernel: [] ? cache_grow+0x2cf/0x320
Apr  4 18:51:32 exton kernel: [] ? ____cache_alloc_node+0x99/0x160
Apr  4 18:51:32 exton kernel: [] ? kmem_cache_alloc_node_trace+0x90/0x200
Apr  4 18:51:32 exton kernel: [] ? __kmalloc_node+0x4d/0x60
Apr  4 18:51:32 exton kernel: [] ? __alloc_skb+0x6d/0x190
Apr  4 18:51:32 exton kernel: [] ? dev_alloc_skb+0x1d/0x40
Apr  4 18:51:32 exton kernel: [] ? ipoib_cm_alloc_rx_skb+0x30/0x430 [ib_ipoib]
Apr  4 18:51:32 exton kernel: [] ? ipoib_cm_handle_rx_wc+0x29f/0x770 [ib_ipoib]
Apr  4 18:51:32 exton kernel: [] ? mlx4_ib_poll_cq+0x2c6/0x7f0 [mlx4_ib]


May 14 09:00:34 ifil03 kernel: swapper: page allocation failure. order:1, mode:0x20
May 14 09:00:34 ifil03 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-220.el6.x86_64 #1
May 14 09:00:34 ifil03 kernel: Call Trace:
May 14 09:00:34 ifil03 kernel:   [] ? __alloc_pages_nodemask+0x77f/0x940
May 14 09:00:34 ifil03 kernel: [] ? kmem_getpages+0x62/0x170
May 14 09:00:34 ifil03 kernel: [] ? fallback_alloc+0x1ba/0x270
May 14 09:00:34 ifil03 kernel: [] ? cache_grow+0x2cf/0x320
May 14 09:00:34 ifil03 kernel: [] ? ____cache_alloc_node+0x99/0x160
May 14 09:00:34 ifil03 kernel: [] ? kmem_cache_alloc+0x11b/0x190
May 14 09:00:34 ifil03 kernel: [] ? sk_prot_alloc+0x48/0x1c0
May 14 09:00:34 ifil03 kernel: [] ? sk_clone+0x22/0x2e0
May 14 09:00:34 ifil03 kernel: [] ? inet_csk_clone+0x16/0xd0
May 14 09:00:34 ifil03 kernel: [] ? tcp_create_openreq_child+0x23/0x450
May 14 09:00:34 ifil03 kernel: [] ? tcp_v4_syn_recv_sock+0x4d/0x2a0
May 14 09:00:34 ifil03 kernel: [] ? tcp_check_req+0x201/0x420
May 14 09:00:34 ifil03 kernel: [] ? tcp_rcv_state_process+0x116/0xa30
May 14 09:00:34 ifil03 kernel: [] ? tcp_v4_do_rcv+0x35b/0x430
May 14 09:00:34 ifil03 kernel: [] ? tcp_v4_rcv+0x4e1/0x860
May 14 09:00:34 ifil03 kernel: [] ? ip_local_deliver_finish+0xdd/0x2d0
May 14 09:00:34 ifil03 kernel: [] ? ip_local_deliver+0x98/0xa0
May 14 09:00:34 ifil03 kernel: [] ? ip_rcv_finish+0x12d/0x440
May 14 09:00:34 ifil03 kernel: [] ? intel_pmu_enable_all+0xa6/0x150
May 14 09:00:34 ifil03 kernel: [] ? ip_rcv+0x275/0x350
May 14 09:00:34 ifil03 kernel: [] ? __netif_receive_skb+0x49b/0x6e0
May 14 09:00:34 ifil03 kernel: [] ? netif_receive_skb+0x58/0x60
May 14 09:00:34 ifil03 kernel: [] ? vmxnet3_rq_rx_complete+0x36e/0x880 [vmxnet3]




 
> > The big point of this change was to attempt to relieve some of these 
> > warnings,
> > which we believed to be useless, since the net stack would recover from 
> > them
> > by re-transmissions.
> > We might have misjudged the scenario, though. Perhaps a better approach 
> > would be
> > making the warning less verbose for all page-alloc failures. We could, 
> > perhaps,
> > only print a stack-dump out if some debug flag is passed along, either as
> > reference, or by some CONFIG_DEBUG_ preprocessor directive.
> 
> 
> warn_alloc_failed() uses the standard DEFAULT_RATELIMIT_INTERVAL which
> is very small (5 * HZ)
> 
> I would bump nopage_rs to somethin more reasonable, like one hour or one
> day.
>
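
For reference, the rate limit in question (mm/page_alloc.c of that era; a sketch):
---8<---
static DEFINE_RATELIMIT_STATE(nopage_rs,
			      DEFAULT_RATELIMIT_INTERVAL,	/* 5 * HZ */
			      DEFAULT_RATELIMIT_BURST);		/* 10 */
---8<---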

Re: [PATCH v4 1/3] mm: introduce compaction and migration for virtio ballooned pages

2012-08-03 Thread Rafael Aquini
On Wed, Aug 01, 2012 at 04:53:47PM -0400, Rik van Riel wrote:
> On 07/23/2012 02:19 PM, Rafael Aquini wrote:
> 
> >In a glance, I believe this whole dance you're suggesting might just be too 
> >much
> >of an overcomplication, and the best approach would be simply teaching the
> >hotplug bits about the ballooned corner case just like it's being done to
> >compaction/migration. However, I'll look at it carefully before making any 
> >other
> >adjustments/propositions over here.
> 
> Compaction and hotplug do essentially the same thing
> here: "collect all the movable pages from a page block,
> and move them elsewhere".
> 
> Whether or not it is easier for them to share code, or
> to duplicate a few lines of code, is something that can
> be looked into later.

I'm 100% in agreement with your thoughts here.



Re: [PATCH v4 1/3] mm: introduce compaction and migration for virtio ballooned pages

2012-08-03 Thread Rafael Aquini
On Mon, Jul 23, 2012 at 11:33:32AM +0900, Minchan Kim wrote:
> Look at memory-hotplug, offline_page calls has_unmovable_pages, scan_lru_pages
> and do_migrate_range which calls isolate_lru_page. They consider only LRU 
> pages
> to migratable ones.
>
As promised, I looked into those bits. Yes, they only isolate LRU pages, and as
such, having this series merged or not doesn't change a bit for that code path.
In fact, having this series merged and teaching hotplug's
offline_pages()/do_migrate_range() about ballooned pages might be extremely
beneficial in the rare event offlining memory stumbles across a balloon page.

As Rik said, I believe this is something we can look into in the near future.
 
> IMHO, better approach is that after we can get complete free pageblocks
> by compaction or reclaim, move balloon pages into that pageblocks and make
> that blocks to unmovable. It can prevent fragmentation and it makes
> current or future code don't need to consider balloon page.
> 
I totally agree with Rik on this one, as well. This is the wrong approach here.

All that said, I'll soon respin a v5 based on your comments on branch hinting
and commentary improvements, as well as addressing AKPM's concerns. I'll also
revert isolate_balloon_page()'s last changes to make it a public symbol again,
as (I believe) we'll shortly be using it to make the hotplug bits aware of how
to isolate ballooned pages.




[PATCH v5 0/3] make balloon pages movable by compaction

2012-08-06 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch-set follows the main idea discussed at 2012 LSFMMS session:
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
to introduce the required changes to the virtio_balloon driver, as well as
the changes to the core compaction & migration bits, in order to make those
subsystems aware of ballooned pages and allow memory balloon pages to become
movable within a guest, thus avoiding the aforementioned fragmentation issue.

Rafael Aquini (3):
  mm: introduce compaction and migration for virtio ballooned pages
  virtio_balloon: introduce migration primitives to balloon pages
  mm: add vm event counters for balloon pages compaction

 drivers/virtio/virtio_balloon.c | 139 +---
 include/linux/mm.h  |  26 
 include/linux/virtio_balloon.h  |   4 ++
 include/linux/vm_event_item.h   |   2 +
 mm/compaction.c | 131 +++--
 mm/migrate.c|  34 +-
 mm/vmstat.c |   4 ++
 7 files changed, 312 insertions(+), 28 deletions(-)

Change log:
v5:
 * address Andrew Morton's review comments on the patch series;
 * address a couple extra nitpick suggestions on PATCH 01 (Minchan);
v4: 
 * address Rusty Russell's review comments on PATCH 02;
 * re-base virtio_balloon patch on 9c378abc5c0c6fc8e3acf5968924d274503819b3;
V3: 
 * address reviewers nitpick suggestions on PATCH 01 (Mel, Minchan);
V2: 
 * address Mel Gorman's review comments on PATCH 01;


Preliminary test results:
(2 VCPU 1024mB RAM KVM guest running 3.6.0_rc1+ -- after a reboot)

* 64mB balloon:
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 0
compact_pages_moved 0
compact_pagemigrate_failed 0
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_migrated 0
compact_balloon_failed 0
compact_balloon_isolated 0
compact_balloon_freed 0
[root@localhost ~]#
[root@localhost ~]# for i in $(seq 1 6); do echo 1 > /proc/sys/vm/compact_memory & done &>/dev/null
[1]   Done    echo 1 > /proc/sys/vm/compact_memory
[2]   Done    echo 1 > /proc/sys/vm/compact_memory
[3]   Done    echo 1 > /proc/sys/vm/compact_memory
[4]   Done    echo 1 > /proc/sys/vm/compact_memory
[5]-  Done    echo 1 > /proc/sys/vm/compact_memory
[6]+  Done    echo 1 > /proc/sys/vm/compact_memory
[root@localhost ~]# 
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 3520
compact_pages_moved 47548
compact_pagemigrate_failed 120
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_migrated 16378
compact_balloon_failed 0
compact_balloon_isolated 16378
compact_balloon_freed 16378

* 128mB balloon:
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 0
compact_pages_moved 0
compact_pagemigrate_failed 0
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_migrated 0
compact_balloon_failed 0
compact_balloon_isolated 0
compact_balloon_freed 0
[root@localhost ~]#
[root@localhost ~]# for i in $(seq 1 6); do echo 1 > /proc/sys/vm/compact_memory & done &>/dev/null
[1]   Done    echo 1 > /proc/sys/vm/compact_memory
[2]   Done    echo 1 > /proc/sys/vm/compact_memory
[3]   Done    echo 1 > /proc/sys/vm/compact_memory
[4]   Done    echo 1 > /proc/sys/vm/compact_memory
[5]-  Done    echo 1 > /proc/sys/vm/compact_memory
[6]+  Done    echo 1 > /proc/sys/vm/compact_memory
[root@localhost ~]# 
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 3356
compact_pages_moved 47099
compact_pagemigrate_failed 158
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_migrated 26275
compact_balloon_failed 42
compact_balloon_isolated 26317
compact_balloon_freed 26275

-- 
1.7.11.2



[PATCH v5 1/3] mm: introduce compaction and migration for virtio ballooned pages

2012-08-06 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces the helper functions as well as the necessary changes
to teach compaction and migration bits how to cope with pages which are
part of a guest memory balloon, in order to make them movable by memory
compaction procedures.

Signed-off-by: Rafael Aquini 
---
 include/linux/mm.h |  26 +++
 mm/compaction.c| 130 +
 mm/migrate.c   |  32 -
 3 files changed, 168 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..01a2dc9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1662,5 +1662,31 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
+#if (defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE))
+extern bool isolate_balloon_page(struct page *);
+extern bool putback_balloon_page(struct page *);
+extern struct address_space *balloon_mapping;
+
+static inline bool is_balloon_page(struct page *page)
+{
+   return (page->mapping && page->mapping == balloon_mapping);
+}
+
+static inline bool balloon_compaction_enabled(void)
+{
+#if defined(CONFIG_COMPACTION)
+   return true;
+#else
+   return false;
+#endif /* CONFIG_COMPACTION */
+}
+
+#else
+static inline bool isolate_balloon_page(struct page *page) { return false; }
+static inline bool putback_balloon_page(struct page *page) { return false; }
+static inline bool is_balloon_page(struct page *page)  { return false; }
+static inline bool balloon_compaction_enabled(void){ return false; }
+#endif /* (CONFIG_VIRTIO_BALLOON || CONFIG_VIRTIO_BALLOON_MODULE) */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index e78cb96..9499d85 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -21,6 +22,89 @@
 #define CREATE_TRACE_POINTS
 #include 
 
+#if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)
+/*
+ * Balloon pages special page->mapping.
+ * Users must properly allocate and initialize an instance of balloon_mapping,
+ * and set it as the page->mapping for balloon enlisted page instances.
+ * There is no need on utilizing struct address_space locking schemes for
+ * balloon_mapping as, once it gets initialized at balloon driver, it will
+ * remain just like a static reference that helps us on identifying a guest
+ * ballooned page by its mapping, as well as it will keep the 'a_ops' callback
+ * pointers to the functions that will execute the balloon page mobility tasks.
+ *
+ * address_space_operations necessary methods for ballooned pages:
+ *   .migratepage- used to perform balloon's page migration (as is)
+ *   .invalidatepage - used to isolate a page from balloon's page list
+ *   .freepage   - used to reinsert an isolated page to balloon's page list
+ */
+struct address_space *balloon_mapping;
+EXPORT_SYMBOL_GPL(balloon_mapping);
+
+static inline void __isolate_balloon_page(struct page *page)
+{
+   page->mapping->a_ops->invalidatepage(page, 0);
+}
+
+static inline void __putback_balloon_page(struct page *page)
+{
+   page->mapping->a_ops->freepage(page);
+}
+
+/* __isolate_lru_page() counterpart for a ballooned page */
+bool isolate_balloon_page(struct page *page)
+{
+   if (WARN_ON(!is_balloon_page(page)))
+   return false;
+
+   if (likely(get_page_unless_zero(page))) {
+   /*
+* As balloon pages are not isolated from LRU lists, concurrent
+* compaction threads can race against page migration functions
+* move_to_new_page() & __unmap_and_move().
+* In order to avoid having an already isolated balloon page
+* being (wrongly) re-isolated while it is under migration,
+* lets be sure we have the page lock before proceeding with
+* the balloon page isolation steps.
+*/
+   if (likely(trylock_page(page))) {
+   /*
+* A ballooned page, by default, has just one refcount.
+* Prevent concurrent compaction threads from isolating
+* an already isolated balloon page.
+*/
+   if (is_balloon_page(page) && (page_count(page) == 2)) {
+

[PATCH v5 2/3] virtio_balloon: introduce migration primitives to balloon pages

2012-08-06 Thread Rafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

Besides making balloon pages movable at allocation time and introducing
the necessary primitives to perform balloon page migration/compaction,
this patch also introduces the following locking scheme to provide the
proper synchronization and protection for struct virtio_balloon elements
against concurrent accesses due to parallel operations introduced by
memory compaction / page migration.
 - balloon_lock (mutex) : synchronizes the access demand to elements of
  struct virtio_balloon and its queue operations;
 - pages_lock (spinlock): special protection to balloon pages list against
  concurrent list handling operations;

Signed-off-by: Rafael Aquini 
---
 drivers/virtio/virtio_balloon.c | 138 +---
 include/linux/virtio_balloon.h  |   4 ++
 2 files changed, 134 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0908e60..7c937a0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -35,6 +36,12 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 
+/* Synchronizes accesses/updates to the struct virtio_balloon elements */
+DEFINE_MUTEX(balloon_lock);
+
+/* Protects 'virtio_balloon->pages' list against concurrent handling */
+DEFINE_SPINLOCK(pages_lock);
+
 struct virtio_balloon
 {
struct virtio_device *vdev;
@@ -51,6 +58,7 @@ struct virtio_balloon
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+
/*
 * The pages we've told the Host we're not using.
 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
@@ -125,10 +133,12 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
 
+   mutex_lock(&balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-   __GFP_NOMEMALLOC | __GFP_NOWARN);
+   struct page *page = alloc_page(GFP_HIGHUSER_MOVABLE |
+   __GFP_NORETRY | __GFP_NOWARN |
+   __GFP_NOMEMALLOC);
if (!page) {
if (printk_ratelimit())
dev_printk(KERN_INFO, &vb->vdev->dev,
@@ -141,7 +151,10 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
set_page_pfns(vb->pfns + vb->num_pfns, page);
vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
totalram_pages--;
+   spin_lock(&pages_lock);
list_add(&page->lru, &vb->pages);
+   page->mapping = balloon_mapping;
+   spin_unlock(&pages_lock);
}
 
/* Didn't get any?  Oh well. */
@@ -149,6 +162,7 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
return;
 
tell_host(vb, vb->inflate_vq);
+   mutex_unlock(&balloon_lock);
 }
 
 static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -169,10 +183,22 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
 
+   mutex_lock(&balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
+   /*
+* We can race against virtballoon_isolatepage() and end up
+* stumbling across a _temporarily_ empty 'pages' list.
+*/
+   spin_lock(&pages_lock);
+   if (unlikely(list_empty(&vb->pages))) {
+   spin_unlock(&pages_lock);
+   break;
+   }
page = list_first_entry(&vb->pages, struct page, lru);
+   page->mapping = NULL;
list_del(&page->lru);
+   spin_unlock(&pages_lock);
set_page_pfns(vb->pfns + vb->num_pfns, page);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}

[PATCH v5 3/3] mm: add vm event counters for balloon pages compaction

2012-08-06 Thread Rafael Aquini
This patch is only for testing report purposes and shall be dropped in case of
the rest of this patchset getting accepted for merging.

Signed-off-by: Rafael Aquini 
---
 drivers/virtio/virtio_balloon.c | 1 +
 include/linux/vm_event_item.h   | 2 ++
 mm/compaction.c | 1 +
 mm/migrate.c| 6 --
 mm/vmstat.c | 4 
 5 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7c937a0..b8f7ea5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -414,6 +414,7 @@ int virtballoon_migratepage(struct address_space *mapping,
 
mutex_unlock(&balloon_lock);
 
+   count_vm_event(COMPACTBALLOONMIGRATED);
return 0;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 57f7b10..a632a5d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -41,6 +41,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_COMPACTION
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
+   COMPACTBALLOONMIGRATED, COMPACTBALLOONFAILED,
+   COMPACTBALLOONISOLATED, COMPACTBALLOONFREED,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/compaction.c b/mm/compaction.c
index 9499d85..4e2e46a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -76,6 +76,7 @@ bool isolate_balloon_page(struct page *page)
if (is_balloon_page(page) && (page_count(page) == 2)) {
__isolate_balloon_page(page);
unlock_page(page);
+   count_vm_event(COMPACTBALLOONISOLATED);
return true;
}
unlock_page(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index fc56968..f98804a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -80,9 +80,10 @@ void putback_lru_pages(struct list_head *l)
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
if (unlikely(is_balloon_page(page) &&
-   balloon_compaction_enabled()))
+   balloon_compaction_enabled())) {
+   count_vm_event(COMPACTBALLOONFAILED);
WARN_ON(!putback_balloon_page(page));
-   else
+   } else
putback_lru_page(page);
}
 }
@@ -874,6 +875,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
page_is_file_cache(page));
put_page(page);
__free_page(page);
+   count_vm_event(COMPACTBALLOONFREED);
return rc;
}
 out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index df7a674..8d80f60 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -768,6 +768,10 @@ const char * const vmstat_text[] = {
"compact_stall",
"compact_fail",
"compact_success",
+   "compact_balloon_migrated",
+   "compact_balloon_failed",
+   "compact_balloon_isolated",
+   "compact_balloon_freed",
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-- 
1.7.11.2



Re: [PATCH v5 1/3] mm: introduce compaction and migration for virtio ballooned pages

2012-08-06 Thread Rafael Aquini
On Mon, Aug 06, 2012 at 02:38:23PM -0400, Rik van Riel wrote:
> On 08/06/2012 09:56 AM, Rafael Aquini wrote:
> 
> >@@ -846,6 +861,21 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> > goto out;
> >
> > rc = __unmap_and_move(page, newpage, force, offlining, mode);
> >+
> >+if (unlikely(is_balloon_page(newpage)&&
> >+ balloon_compaction_enabled())) {
> 
> Could that be collapsed into one movable_balloon_page(newpage) function
> call?
> 
Keeping is_balloon_page() as is and introducing this new movable_balloon_page()
function call, or just doing a plain rename, as Andrew first suggested?




Re: [PATCH v5 1/3] mm: introduce compaction and migration for virtio ballooned pages

2012-08-06 Thread Rafael Aquini
On Mon, Aug 06, 2012 at 03:06:49PM -0400, Rik van Riel wrote:
> 
> Just a plain rename would work.
>
Ok, I will rename it.
 
> >+static inline bool is_balloon_page(struct page *page)
> >+{
> >+return (page->mapping && page->mapping == balloon_mapping);
> >+}
> 
> As an aside, since you are only comparing page->mapping and
> not dereferencing it, it can be simplified to just:
> 
>   return (page->mapping == balloon_mapping);
> 
We really need both comparisons to avoid potential NULL pointer dereferences at
__isolate_balloon_page() & __putback_balloon_page() while running on bare metal
with no balloon driver loaded, since balloon_mapping itself is a pointer which
each balloon driver can set to its own structure.
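
To make the hazard concrete (a sketch of the bare metal case, where both
pointers are NULL):

---8<---
/*
 * With no balloon driver loaded, balloon_mapping == NULL. For any page
 * whose ->mapping is also NULL, a bare pointer comparison would yield
 *
 *	page->mapping == balloon_mapping   ->   NULL == NULL   ->   true
 *
 * and __isolate_balloon_page() would then oops on
 *
 *	page->mapping->a_ops->invalidatepage(page, 0);
 *
 * The extra page->mapping test up front rules that case out.
 */
---8<---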

Thanks, Rik, for taking the time to look at this patch and provide (always)
valuable feedback.

I'll shortly respin a v6 with your suggestions.

-- Rafael


Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-21 Thread Rafael Aquini
On Tue, Aug 21, 2012 at 10:13:30PM +0300, Michael S. Tsirkin wrote:
> > 
> > I believe rcu_dereference_protected() is what I want/need here, since this
> > code is always called for pages which we hold locked (PG_locked bit).
> 
> It would only help if we locked the page while updating the mapping,
> as far as I can see we don't.
>

But we can do it. In fact, by doing it (locking the page) we can easily avoid
the nasty race balloon_isolate_page / leak_balloon, in a much simpler way, IMHO.




Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-21 Thread Rafael Aquini
On Tue, Aug 21, 2012 at 10:16:12PM +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 21, 2012 at 02:55:03PM -0300, Rafael Aquini wrote:
> > On Tue, Aug 21, 2012 at 04:52:23PM +0300, Michael S. Tsirkin wrote:
> > > > + * address_space_operations utilized methods for ballooned pages:
> > > > + *   .migratepage- used to perform balloon's page migration (as is)
> > > > + *   .launder_page   - used to isolate a page from balloon's page list
> > > > + *   .freepage   - used to reinsert an isolated page to balloon's page list
> > > > + */
> > > 
> > > It would be a good idea to document the assumptions here.
> > > Looks like .launder_page and .freepage are called in rcu critical section.
> > > But migratepage isn't - why is that safe?
> > > 
> > 
> > The migratepage callback for virtio_balloon can sleep, and IIUC we cannot
> > sleep within an RCU critical section.
> > 
> > Also, the migratepage callback is called from migration's inner-circle
> > function, move_to_new_page(), and I don't think embedding it in an RCU
> > critical section would be a good idea, for the same reason aforementioned.
> 
> Yes but this means it is still exposed to the module unloading
> races that RCU was supposed to fix.
> So need to either rework that code so it won't sleep
> or switch to some other synchronization.
>
Can you refactor tell_host() to not sleep? Or, can I get rid of calling it at
virtballoon_migratepage()? If 'no' is the answer to both questions, that's the
way that code has to remain, even if we find a workaround to hack the
migratepage callback and have it embedded in an RCU critical section.

That's why I believe that once the balloon driver is commanded to unload, we must
flag virtballoon_migratepage to skip its work. By doing this, the thread
performing memory compaction will have to resort to the 'putback' path, which is
RCU protected (IMHO).

As the module will not unload until it leaks all pages on its list, the unload
race you pointed out before will be covered.
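
Roughly along these lines (a sketch only; the 'vb_unloading' flag is
hypothetical and not part of the posted series):

---8<---
/* set by remove_common() before it starts leaking the balloon back */
static bool vb_unloading;

static int virtballoon_migratepage(struct address_space *mapping,
				   struct page *newpage, struct page *page,
				   enum migrate_mode mode)
{
	/*
	 * Driver unload in progress: refuse the migration, so the
	 * compaction thread falls back to the RCU-protected putback path.
	 */
	if (unlikely(vb_unloading))
		return -EAGAIN;

	/* ... the usual balloon page migration steps would follow ... */
	return 0;
}
---8<---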



Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-21 Thread Rafael Aquini
On Tue, Aug 21, 2012 at 10:30:31PM +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 21, 2012 at 04:23:58PM -0300, Rafael Aquini wrote:
> > On Tue, Aug 21, 2012 at 10:13:30PM +0300, Michael S. Tsirkin wrote:
> > > > 
> > > > I believe rcu_dereference_protected() is what I want/need here, since
> > > > this code is always called for pages which we hold locked (PG_locked bit).
> > > 
> > > It would only help if we locked the page while updating the mapping,
> > > as far as I can see we don't.
> > >
> > 
> > But we can do it. In fact, by doing it (locking the page) we can easily
> > avoid the nasty race balloon_isolate_page / leak_balloon, in a much
> > simpler way, IMHO.
> 
> Absolutely. Further, we should look hard at whether most RCU uses
> in this patchset can be replaced with page lock.
>

Yeah, in fact, by testing/grabbing the page lock at leak_balloon(), even the
module unload vs. migration/putback race seems to fade away, since the migration
code holds the page locked all the way.

And that seems a quite easy task to accomplish:


@@ -169,21 +197,61 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));

+   mutex_lock(&vb->balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
 vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
+   spin_lock(&vb->pages_lock);
+   /*
+* 'virtballoon_isolatepage()' can drain vb->pages list
+* making us to stumble across a _temporarily_ empty list.
+*
+* Release the spinlock and resume from here in order to
+* give page migration a shot to refill vb->pages list.
+*/
+   if (unlikely(list_empty(&vb->pages))) {
+   spin_unlock(&vb->pages_lock);
+   break;
+   }
+
page = list_first_entry(&vb->pages, struct page, lru);
+
+   /*
+* Grab the page lock to avoid racing against threads isolating
+* pages from vb->pages list (it's done under page lock).
+*
+* Failing to grab the page lock here means this page has been
+* selected for isolation already.
+*/
+   if (!trylock_page(page)) {
+   spin_unlock(&vb->pages_lock);
+   break;
+   }
+
+   clear_balloon_mapping(page);
list_del(&page->lru);
set_page_pfns(vb->pfns + vb->num_pfns, page);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+   unlock_page(page);
+   spin_unlock(&vb->pages_lock);
}

.


Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-21 Thread Rafael Aquini
On Wed, Aug 22, 2012 at 03:07:41AM +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 21, 2012 at 05:45:56PM -0300, Rafael Aquini wrote:
> > On Tue, Aug 21, 2012 at 10:30:31PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Aug 21, 2012 at 04:23:58PM -0300, Rafael Aquini wrote:
> > > > On Tue, Aug 21, 2012 at 10:13:30PM +0300, Michael S. Tsirkin wrote:
> > > > > > 
> > > > > > I believe rcu_dereference_protected() is what I want/need here,
> > > > > > since this code is always called for pages which we hold locked
> > > > > > (PG_locked bit).
> > > > > 
> > > > > It would only help if we locked the page while updating the mapping,
> > > > > as far as I can see we don't.
> > > > >
> > > > 
> > > > But we can do it. In fact, by doing it (locking the page) we can
> > > > easily avoid the nasty race balloon_isolate_page / leak_balloon, in a
> > > > much simpler way, IMHO.
> > > 
> > > Absolutely. Further, we should look hard at whether most RCU uses
> > > in this patchset can be replaced with page lock.
> > >
> > 
> > Yeah, in fact, by testing/grabbing the page lock at leak_balloon(), even
> > the module unload vs. migration/putback race seems to fade away, since the
> > migration code holds the page locked all the way.
> > And that seems a quite easy task to accomplish:
> > 
> > 
> > @@ -169,21 +197,61 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
> > /* We can only do one array worth at a time. */
> > num = min(num, ARRAY_SIZE(vb->pfns));
> > 
> > +   mutex_lock(&vb->balloon_lock);
> > for (vb->num_pfns = 0; vb->num_pfns < num;
> >  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > +   spin_lock(&vb->pages_lock);
> > +   /*
> > +* 'virtballoon_isolatepage()' can drain vb->pages list
> > +* making us stumble across a _temporarily_ empty list.
> 
> This still worries me. If this happens we do not
> lock the page, so the module can go away?
> If not, we need to document why.
>
The module won't unload unless it leaks all its pages. If we hit that test that
worries you, leak_balloon() will get back to its caller -- remove_common(), and
it will keep looping at:

/* There might be pages left in the balloon: free them. */
while (vb->num_pages)
leak_balloon(vb, vb->num_pages);

This is true because we do not mess with vb->num_pages while isolating/migrating
balloon pages, so the module will only unload when all isolated pages get back
to the vb->pages list and leak_balloon() reaps them appropriately. As we will be
doing the isolation/migration/putback steps under the page lock, that race is gone.



Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-22 Thread Rafael Aquini
On Wed, Aug 22, 2012 at 12:33:17PM +0300, Michael S. Tsirkin wrote:
> Hmm, so this will busy wait, which is inelegant.
> We need some event IMO.

No, it does not busy wait. leak_balloon() is mutually exclusive with the
migration steps, so in case one races against the other, we really want
leak_balloon() dropping the mutex temporarily to allow migration to complete its
work of refilling the vb->pages list. Also, leak_balloon() calls tell_host(),
which will potentially make it schedule for each round of vb->pfns that
leak_balloon() releases. So, when remove_common() calls leak_balloon() looping
on vb->num_pages, that won't become a tight loop.
The scheme was apparently working before this series, and it will remain working
after it.


> Also, reading num_pages without a lock here
> seems wrong.

I'll protect it with vb->balloon_lock mutex. That will be consistent with the
lock protection scheme this patch is introducing for struct virtio_balloon
elements.


> A similar concern applies to normal leaking
> of the balloon: here we might leak less than
> required, then wait for the next config change
> event.

Just as before, same thing here. If you leaked less than required, balloon()
will keep calling leak_balloon() until the balloon target is reached. This
scheme was working before, and it will keep working after this patch.


> How about we signal config_change
> event when pages are back to pages_list?

I really don't know what to tell you here but, to me, it seems like an
overcomplication that isn't directly entangled with this patch's purposes.
Besides, you cannot expect compaction / migration happening and racing against
leak_balloon() all the time to make them signal events to the latter, so we might
just be creating a wait-forever condition for leak_balloon(), IMHO.

Cheers!



Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-23 Thread Rafael Aquini
On Thu, Aug 23, 2012 at 01:01:07PM +0300, Michael S. Tsirkin wrote:
> > So, when remove_common() calls leak_balloon() looping on
> > vb->num_pages, that won't become a tight loop. 
> > The scheme was apparently working before this series, and it will remain 
> > working
> > after it.
> 
> It seems that before we would always leak all requested memory
> in one go. I can't tell why we have a while loop there at all.
> Rusty, could you clarify please?
>

It seems that your claim isn't right. leak_balloon() cannot do it all at once,
as for each round it only releases 256 pages, at most; and the 'one go' would
require a couple of loop rounds at remove_common().
So, nothing has changed here.
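
For reference, the per-round cap follows from the pfns array size
(back-of-envelope numbers, assuming 4K pages and a 256-entry vb->pfns array):

---8<---
/*
 *   256 pfns/round * 4KB  =  1MB released per leak_balloon() call
 *   a 128MB balloon       =  128 rounds of the remove_common() loop
 */
---8<---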

 
> > Just as before, same thing here. If you leaked less than required, balloon()
> > will keep calling leak_balloon() until the balloon target is reached. This
> > scheme was working before, and it will keep working after this patch.
> >
> 
> IIUC we never hit this path before.
>  
So, how does balloon() work, then?


> > > How about we signal config_change
> > > event when pages are back to pages_list?
> > 
> > I really don't know what to tell you here, but, to me, it seems like an
> > overcomplication that isn't directly entangled with this patch purposes.
> > Besides, you cannot expect compation / migration happening and racing 
> > against
> > leak_balloon() all the time to make them signal events to the later, so we 
> > might
> > just be creating a wait-forever condition for leak_balloon(), IMHO.
> 
> So use wait_event or similar, check for existence of isolated pages.
> 

The thing here is that expecting compaction, as an external event, to signal
actions to the balloon driver won't work as you desire. Also, as far as the
balloon driver is concerned, it's only a matter of time to accomplish a total,
or partial, balloon leak, even when we have some pages isolated from the
balloon's page list.

IMHO, you're attempting to complicate a simple thing that is already working
well. As said before, there are no guarantees you'll have isolated pages 
by the time you're leaking the balloon, so you might leave it waiting forever
on something that will not happen. And if there are isolated pages while balloon
is leaking, they'll have their chance to get back to the list before the device
finishes its leaking job.



Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-23 Thread Rafael Aquini
On Thu, Aug 23, 2012 at 03:34:32PM +0300, Michael S. Tsirkin wrote:
> > So, nothing has changed here.
> 
> Yes, your patch does change things:
> leak_balloon now might return without freeing any pages.
> In that case we will not be making any progress, and just
> spin, pinning CPU.

That's a transitory condition, which might happen if leak_balloon() takes place
while compaction or migration are under way, and it might only affect the
module unload case. Also, it won't pin the CPU, because it keeps releasing the
locks it grabs as it loops. So, we are lucubrating about rarities, IMHO.

> 
> >  
> > > > Just as before, same thing here. If you leaked less than required, 
> > > > balloon()
> > > > will keep calling leak_balloon() until the balloon target is reached. 
> > > > This
> > > > scheme was working before, and it will keep working after this patch.
> > > >
> > > 
> > > IIUC we never hit this path before.
> > >  
> > So, how does balloon() works then?
> > 
> 
> It gets a request to leak a given number of pages
> and executes it, then tells host that it is done.
> It never needs to spin busy-waiting on a CPU for this.
>

So, what does this patch change for the ordinary leak_balloon() case?

 
> Well, busy waiting pinning a CPU is ugly.  Instead we should block the thread
> and wake it up when done.  I don't mind how we fix it specifically.
>

I already told you that we do not do that by any means introduced by this patch.
You're just being stubborn here. If those bits are broken, they were already
broken before I came up with this proposal.



Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-23 Thread Rafael Aquini
On Thu, Aug 23, 2012 at 04:53:29PM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 23, 2012 at 10:06:07AM -0300, Rafael Aquini wrote:
> > On Thu, Aug 23, 2012 at 03:34:32PM +0300, Michael S. Tsirkin wrote:
> > > > So, nothing has changed here.
> > > 
> > > Yes, your patch does change things:
> > > leak_balloon now might return without freeing any pages.
> > > In that case we will not be making any progress, and just
> > > spin, pinning CPU.
> > 
> > That's a transitory condition, which might happen if leak_balloon() takes
> > place while compaction or migration are under way, and it might only
> > affect the module unload case.
> 
> Regular operation seems even more broken: host might ask
> you to leak memory but because it is under compaction
> you might leak nothing. No?
>

And that is exactly what it wants to do. If there is (temporarily) nothing to
leak, then not leaking is the only sane thing to do. Having balloon pages under
migration does not break the leak at all, although it can last a little longer.

 
 
> > 
> > I already told you that we do not do that by any means introduced by this
> > patch. You're just being stubborn here. If those bits are broken, they were
> > already broken before I came up with this proposal.
> 
> Sorry you don't address the points I am making.  Maybe there are no
> bugs. But it looks like there are.  And assuming I am just seeing things
> this just means patch needs more comments, in commit log and in
> code to explain the design so that it stops looking like that.
>

Yes, I believe you're biased here.


> Basically it was very simple: we assumed page->lru was never
> touched for an allocated page, so it's safe to use it for
> internal book-keeping by the driver.
>
> Now, this is not the case anymore, you add some logic in mm/ that might
> or might not touch page->lru depending on things like reference count.
> And you are asking why things break even though you change very little
> in balloon itself?  Because the interface between balloon and mm is now
> big, fragile and largely undocumented.
> 

The driver doesn't use page->lru for its bookkeeping at all; it uses
vb->num_pages instead.


> Another strangeness I just noticed: if we ever do an extra get_page in
> balloon, compaction logic in mm will break, yes?  But one expects to be
> able to do get_page after alloc_page without ill effects
> as long as one does put_page before free.
>

You can do it (bump up the balloon page refcount), and it will only prevent
balloon pages from being isolated and migrated, thus reducing the effectiveness
of defragmenting memory when balloon pages are present, just like happens today.

It really doesn't seem to be the case, though, that the virtio_balloon driver,
or any other driver which allocates pages directly from the buddy allocator,
would keep raising the page refcount.
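
That behaviour falls straight out of the isolation test -- a sketch of what a
hypothetical extra driver-held reference would do:

---8<---
get_page(page);		/* hypothetical extra reference: count goes 1 -> 2 */

/*
 * isolate_balloon_page() takes its own reference and then tests
 *
 *	if (is_balloon_page(page) && (page_count(page) == 2)) { ... }
 *
 * With the extra reference the count reads 3, the test fails, and the
 * page is simply skipped: never isolated, never migrated.
 */

put_page(page);		/* must be dropped before the page is freed */
---8<---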
 

> Just a thought: maybe it is cleaner to move all balloon page tracking
> into mm/?  Implement alloc_balloon/free_balloon with methods to fill and
> leak pages, and callbacks to invoke when done.  This should be good for
> other hypervisors too. If you like this idea, I can even try to help out
> by refactoring current code in this way, so that you can build on it.
> But this is just a thought, not a must.
>

That seems to be a good thought for a future enhancements wish-list, for sure.
We can start thinking of it, and I surely would be more than happy to be doing
it along with you. But I don't think not having it right away is a dealbreaker
for this proposal as is.

I'm not against your thoughts, and I'm really glad that you're providing such
good discussion over this subject, but for now I'll wait for Rusty's thoughts on
this one question.

Cheers!


Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-23 Thread Rafael Aquini
On Thu, Aug 23, 2012 at 12:03:15PM -0400, Rik van Riel wrote:
> >
> >Not "longer" - apparently forever unless user resend the leak command.
> >It's wrong - it should
> >1. not tell host if nothing was done
> >2. after migration finished leak and tell host
> 
> Agreed.  If the balloon is told to leak N pages, and could
> not do so because those pages were locked, the balloon driver
> needs to retry (maybe waiting on a page lock?) and not signal
> completion until after the job has been completed.
> 
> Having the balloon driver wait on the page lock should be
> fine, because compaction does not hold the page lock for
> long.

And that is precisely what leak_balloon() is doing. When it stumbles across a
locked page, it gives up on that leak round to give compaction a shot at
finishing its task.




Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility

2012-08-23 Thread Rafael Aquini
On Thu, Aug 23, 2012 at 07:25:05PM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 23, 2012 at 04:53:28PM +0300, Michael S. Tsirkin wrote:
> > Basically it was very simple: we assumed page->lru was never
> > touched for an allocated page, so it's safe to use it for
> > internal book-keeping by the driver.
> > 
> > Now, this is not the case anymore, you add some logic in mm/ that might
> > or might not touch page->lru depending on things like reference count.
> 
> Another thought: would the issue go away if balloon used
> page->private to link pages instead of LRU?
> mm core could keep a reference on page to avoid it
> being used while mm handles it (maybe it does already?).
>
I don't think so. That would be a lot trickier and more complex, IMHO.
 
> If we do this, will not the only change to balloon be to tell mm that it
> can use compaction for these pages when it allocates the page: using
> some GPF flag or a new API?
> 

What about keeping a counter in the virtio_balloon structure of how many pages
are isolated from the balloon's list, and checking it at leak time?
If the counter is > 0, then we can safely put leak_balloon() to wait until the
balloon page list gets completely refilled. I guess that is simple to
accomplish, and it potentially addresses all your concerns on this issue.
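
A minimal sketch of that idea ('isolated_pages' and 'isolated_wait' are
illustrative names, not code from this series):

---8<---
struct virtio_balloon {
	/* ... existing fields ... */
	unsigned int isolated_pages;	/* protected by pages_lock */
	wait_queue_head_t isolated_wait;
};

/* isolation bumps the counter; putback/migration-completion drops it: */
if (--vb->isolated_pages == 0)
	wake_up(&vb->isolated_wait);

/* and leak_balloon() waits until the page list is whole again: */
wait_event(vb->isolated_wait, vb->isolated_pages == 0);
---8<---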

Cheers!


