[PATCH v2] zsmalloc: add comments for ->inuse to zspage

2015-09-23 Thread Hui Zhu
Signed-off-by: Hui Zhu 
---
 mm/zsmalloc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index f135b1b..f62f2fb 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -38,6 +38,7 @@
  * page->lru: links together first pages of various zspages.
  * Basically forming list of zspages in a fullness group.
  * page->mapping: class index and fullness group of the zspage
+ * page->inuse: the number of objects that are used in this zspage
  *
  * Usage of struct page flags:
  * PG_private: identifies the first component page
-- 
1.9.1



Call for Topics and Sponsors

2015-06-25 Thread Hui Zhu
*
Call for Topics and Sponsors

Workshop on Open Source Development Tools 2015
Beijing, China
Sep. 12, 2015 (TBD)
HelloGCC Work Group (www.hellogcc.org)
*
The Open Source Development Tools Workshop is a meeting for open
source software developers. You can share your work, studies, and
learning experience with open source software development here.
Our main topic is open source development tools.

The content of topics can be:
* GNU toolchain (gcc, binutils, gdb, etc)
* Clang/LLVM toolchain
* Other tools of open source development, debug and simulation

The form of a topic can be:
* an introduction to your own current work
* an introduction to work you did in the past
* a tutorial, an experience report, etc.
* other forms of presentation, such as a lightning talk

If you have a topic, please contact us:
* send email to hello...@freelists.org (you need to subscribe at
http://www.freelists.org/list/hellogcc first)
* log in to the #hellogcc room on freenode IRC

Important Date:
* deadline for topic and sponsor solicitation: Aug 1st, 2015

Previous Meetings:
* OSDT 2014: http://www.hellogcc.org/?p=33910
* HelloGCC 2013: http://www.hellogcc.org/?p=33518
* HelloGCC 2012: http://linux.chinaunix.net/hellogcc2012
* HelloGCC 2011: http://linux.chinaunix.net/hellogcc2011
* HelloGCC 2010: http://linux.chinaunix.net/hellogcc2010
* HelloGCC 2009: http://www.aka-kernel.org/news/hellogcc/index.html

If you would like to sponsor us, we would greatly appreciate it; please
contact us via hellogcc.workgr...@gmail.com


[RFC 2/2] Change limit of HighAtomic from 1% to 10%

2017-09-26 Thread Hui Zhu
After "Try to use HighAtomic if try to alloc umovable page that order
is not 0".  The result is still not very well because the the limit of
HighAtomic make kernel cannot reserve more pageblock to HighAtomic.

The patch change max_managed from 1% to 10% make HighAtomic can get more
pageblocks.

Signed-off-by: Hui Zhu 
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b54e94a..9322458 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2101,7 +2101,7 @@ static void reserve_highatomic_pageblock(struct page 
*page, struct zone *zone,
 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
 * Check is race-prone but harmless.
 */
-   max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
+   max_managed = (zone->managed_pages / 10) + pageblock_nr_pages;
if (zone->nr_reserved_highatomic >= max_managed)
return;
 
-- 
1.9.1



[RFC 1/2] Try to use HighAtomic when allocating an unmovable page whose order is not 0

2017-09-26 Thread Hui Zhu
This patch adds a new condition so that gfp_to_alloc_flags returns
alloc_flags with ALLOC_HARDER if the order is not 0 and the migratetype is
MIGRATE_UNMOVABLE.

Then allocating an unmovable page whose order is not 0 will try to use
HighAtomic.
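
As an aside, purely for illustration (this hypothetical caller is not part
of the patch): a GFP_KERNEL request maps to MIGRATE_UNMOVABLE, so an
allocation like the one below, with order greater than 0, is the kind that
would now get ALLOC_HARDER and be allowed to dip into the HighAtomic
reserve.

#include <linux/gfp.h>

/*
 * Hypothetical example only: an order-2 unmovable kernel allocation.
 * GFP_KERNEL (without __GFP_MOVABLE or __GFP_RECLAIMABLE) maps to
 * MIGRATE_UNMOVABLE, so with this patch gfp_to_alloc_flags() would
 * add ALLOC_HARDER for it.
 */
static struct page *example_unmovable_buffer(void)
{
	return alloc_pages(GFP_KERNEL, 2);	/* order > 0, unmovable */
}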

Signed-off-by: Hui Zhu 
---
 mm/page_alloc.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af8..b54e94a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3642,7 +3642,7 @@ static void wake_all_kswapds(unsigned int order, const 
struct alloc_context *ac)
 }
 
 static inline unsigned int
-gfp_to_alloc_flags(gfp_t gfp_mask)
+gfp_to_alloc_flags(gfp_t gfp_mask, int order, int migratetype)
 {
unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 
@@ -3671,6 +3671,8 @@ static void wake_all_kswapds(unsigned int order, const 
struct alloc_context *ac)
alloc_flags &= ~ALLOC_CPUSET;
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
+   else if (order > 0 && migratetype == MIGRATE_UNMOVABLE)
+   alloc_flags |= ALLOC_HARDER;
 
 #ifdef CONFIG_CMA
if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
@@ -3903,7 +3905,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 * kswapd needs to be woken up, and to avoid the cost of setting up
 * alloc_flags precisely. So we do that now.
 */
-   alloc_flags = gfp_to_alloc_flags(gfp_mask);
+   alloc_flags = gfp_to_alloc_flags(gfp_mask, order, ac->migratetype);
 
/*
 * We need to recalculate the starting point for the zonelist iterator
-- 
1.9.1



[RFC 0/2] Use HighAtomic against long-term fragmentation

2017-09-26 Thread Hui Zhu
Currently HighAtomic only handles high-order atomic page allocations.
But I found that using it to handle normal unmovable contiguous page
allocations helps against long-term fragmentation.

Using HighAtomic for normal page allocation is odd.  But I really got some
good results with our internal tests and mmtests.

Do you think it is worth working on?

The patches were tested with mmtests stress-highalloc modified to do
GFP_KERNEL order-4 allocations, on 4.14.0-rc1+ in a VirtualBox VM with
2 CPUs and 1G memory.
  orig  ch
Minor Faults  4565947743315623
Major Faults   319 371
Swap Ins 0   0
Swap Outs0   0
Allocation stalls0   0
DMA allocs   93518   18345
DMA32 allocs  4239569940406865
Normal allocs0   0
Movable allocs   0   0
Direct pages scanned  7056   16232
Kswapd pages scanned946174  961750
Kswapd pages reclaimed  945077  942821
Direct pages reclaimed7022   16170
Kswapd efficiency  99% 98%
Kswapd velocity   1576.3521567.977
Direct efficiency  99% 99%
Direct velocity 11.755  26.464
Percentage direct scans 0%  1%
Zone normal velocity  1588.1081594.441
Zone dma32 velocity  0.000   0.000
Zone dma velocity0.000   0.000
Page writes by reclaim   0.000   0.000
Page writes file 0   0
Page writes anon 0   0
Page reclaim immediate 405   16429
Sector Reads   2027848 2109324
Sector Writes  3386260 3299388
Page rescued immediate   0   0
Slabs scanned   867805  877005
Direct inode steals3372072
Kswapd inode steals  33911   41777
Kswapd skipped wait  0   0
THP fault alloc 30  84
THP collapse alloc 188 244
THP splits   0   0
THP fault fallback  67  51
THP collapse fail6   4
Compaction stalls  111  49
Compaction success  81  35
Compaction failures 30  14
Page migrate success 57962   43921
Page migrate failure67 183
Compaction pages isolated   117473   88823
Compaction migrate scanned   75548   50403
Compaction free scanned1454638  672310
Compaction cost 62  47
NUMA alloc hit4212949340018326
NUMA alloc miss  0   0
NUMA interleave hit  0   0
NUMA alloc local  4212949340018326
NUMA base PTE updates0   0
NUMA huge PMD updates0   0
NUMA page range updates  0   0
NUMA hint faults 0   0
NUMA hint local faults   0   0
NUMA hint local percent100 100
NUMA pages migrated  0   0
AutoNUMA cost   0%  0%

Hui Zhu (2):
Try to use HighAtomic when allocating an unmovable page whose order is not 0
Change limit of HighAtomic from 1% to 10%

 page_alloc.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)


Re: [RFC 0/2] Use HighAtomic against long-term fragmentation

2017-09-26 Thread Hui Zhu
2017-09-26 17:51 GMT+08:00 Mel Gorman :
> On Tue, Sep 26, 2017 at 04:46:42PM +0800, Hui Zhu wrote:
>> Current HighAtomic just to handle the high atomic page alloc.
>> But I found that use it handle the normal unmovable continuous page
>> alloc will help to against long-term fragmentation.
>>
>
> This is not wise. High-order atomic allocations do not always have a
> smooth recovery path such as network drivers with large MTUs that have no
> choice but to drop the traffic and hope for a retransmit. That's why they
> have the highatomic reserve. If the reserve is used for normal unmovable
> allocations then allocation requests that could have waited for reclaim
> may cause high-order atomic allocations to fail. Changing it may allow
> improve latencies in some limited cases while causing functional failures
> in others.  If there is a special case where there are a large number of
> other high-order allocations then I would suggest increasing min_free_kbytes
> instead as a workaround.

I think letting order-0 unmovable page allocations and higher-order
unmovable page allocations use different migrate types will help against
long-term fragmentation.

Do you think the kernel could add a special migrate type for unmovable
page allocations whose order is greater than 0?

Thanks,
Hui

>
> --
> Mel Gorman
> SUSE Labs


[PATCH] zsmalloc: zs_page_migrate: not check inuse if migrate_mode is not MIGRATE_ASYNC

2017-07-14 Thread Hui Zhu
Got some -EBUSY returns from zs_page_migrate that make migration
slow (retry) or fail (zs_page_putback will schedule_work free_work,
but it cannot guarantee success).

And I didn't find anything that keeps zs_page_migrate from working with
a ZS_EMPTY zspage.
So this patch does not check inuse if migrate_mode is not
MIGRATE_ASYNC.

Signed-off-by: Hui Zhu 
---
 mm/zsmalloc.c | 66 +--
 1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index d41edd2..c298e5c 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1982,6 +1982,7 @@ int zs_page_migrate(struct address_space *mapping, struct 
page *newpage,
unsigned long old_obj, new_obj;
unsigned int obj_idx;
int ret = -EAGAIN;
+   int inuse;
 
VM_BUG_ON_PAGE(!PageMovable(page), page);
VM_BUG_ON_PAGE(!PageIsolated(page), page);
@@ -1996,21 +1997,24 @@ int zs_page_migrate(struct address_space *mapping, 
struct page *newpage,
offset = get_first_obj_offset(page);
 
spin_lock(&class->lock);
-   if (!get_zspage_inuse(zspage)) {
+   inuse = get_zspage_inuse(zspage);
+   if (mode == MIGRATE_ASYNC && !inuse) {
ret = -EBUSY;
goto unlock_class;
}
 
pos = offset;
s_addr = kmap_atomic(page);
-   while (pos < PAGE_SIZE) {
-   head = obj_to_head(page, s_addr + pos);
-   if (head & OBJ_ALLOCATED_TAG) {
-   handle = head & ~OBJ_ALLOCATED_TAG;
-   if (!trypin_tag(handle))
-   goto unpin_objects;
+   if (inuse) {
+   while (pos < PAGE_SIZE) {
+   head = obj_to_head(page, s_addr + pos);
+   if (head & OBJ_ALLOCATED_TAG) {
+   handle = head & ~OBJ_ALLOCATED_TAG;
+   if (!trypin_tag(handle))
+   goto unpin_objects;
+   }
+   pos += class->size;
}
-   pos += class->size;
}
 
/*
@@ -2020,20 +2024,22 @@ int zs_page_migrate(struct address_space *mapping, 
struct page *newpage,
memcpy(d_addr, s_addr, PAGE_SIZE);
kunmap_atomic(d_addr);
 
-   for (addr = s_addr + offset; addr < s_addr + pos;
-   addr += class->size) {
-   head = obj_to_head(page, addr);
-   if (head & OBJ_ALLOCATED_TAG) {
-   handle = head & ~OBJ_ALLOCATED_TAG;
-   if (!testpin_tag(handle))
-   BUG();
-
-   old_obj = handle_to_obj(handle);
-   obj_to_location(old_obj, &dummy, &obj_idx);
-   new_obj = (unsigned long)location_to_obj(newpage,
-   obj_idx);
-   new_obj |= BIT(HANDLE_PIN_BIT);
-   record_obj(handle, new_obj);
+   if (inuse) {
+   for (addr = s_addr + offset; addr < s_addr + pos;
+   addr += class->size) {
+   head = obj_to_head(page, addr);
+   if (head & OBJ_ALLOCATED_TAG) {
+   handle = head & ~OBJ_ALLOCATED_TAG;
+   if (!testpin_tag(handle))
+   BUG();
+
+   old_obj = handle_to_obj(handle);
+   obj_to_location(old_obj, &dummy, &obj_idx);
+   new_obj = (unsigned long)
+   location_to_obj(newpage, obj_idx);
+   new_obj |= BIT(HANDLE_PIN_BIT);
+   record_obj(handle, new_obj);
+   }
}
}
 
@@ -2055,14 +2061,16 @@ int zs_page_migrate(struct address_space *mapping, 
struct page *newpage,
 
ret = MIGRATEPAGE_SUCCESS;
 unpin_objects:
-   for (addr = s_addr + offset; addr < s_addr + pos;
+   if (inuse) {
+   for (addr = s_addr + offset; addr < s_addr + pos;
addr += class->size) {
-   head = obj_to_head(page, addr);
-   if (head & OBJ_ALLOCATED_TAG) {
-   handle = head & ~OBJ_ALLOCATED_TAG;
-   if (!testpin_tag(handle))
-   BUG();
-   unpin_tag(handle);
+   head = obj_to_head(page, addr);
+   if (head & OBJ_ALLOCATED_TAG) {
+   handle = head & ~OBJ_ALLOCATED_TAG;
+   if (

[PATCH] zsmalloc: zs_page_migrate: schedule free_work if zspage is ZS_EMPTY

2017-08-13 Thread Hui Zhu
After commit e2846124f9a2 ("zsmalloc: zs_page_migrate: skip unnecessary
loops but not return -EBUSY if zspage is not inuse") zs_page_migrate
can handle a ZS_EMPTY zspage.

But it keeps free_work from freeing the zspage.  That makes this
ZS_EMPTY zspage stay in the system until another zspage wakes up free_work.

This patch lets zs_page_migrate wake up free_work if needed.

Fixes: e2846124f9a2 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not 
return -EBUSY if zspage is not inuse")
Signed-off-by: Hui Zhu 
---
 mm/zsmalloc.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 62457eb..48ce043 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -2035,8 +2035,14 @@ int zs_page_migrate(struct address_space *mapping, 
struct page *newpage,
 * Page migration is done so let's putback isolated zspage to
 * the list if @page is final isolated subpage in the zspage.
 */
-   if (!is_zspage_isolated(zspage))
-   putback_zspage(class, zspage);
+   if (!is_zspage_isolated(zspage)) {
+   /*
+* The page and class is locked, we cannot free zspage
+* immediately so let's defer.
+*/
+   if (putback_zspage(class, zspage) == ZS_EMPTY)
+   schedule_work(&pool->free_work);
+   }
 
reset_page(page);
put_page(page);
-- 
1.9.1



Re: [PATCH] zsmalloc: zs_page_migrate: schedule free_work if zspage is ZS_EMPTY

2017-08-14 Thread Hui Zhu
2017-08-14 16:31 GMT+08:00 Minchan Kim :
> Hi Hui,
>
> On Mon, Aug 14, 2017 at 02:34:46PM +0800, Hui Zhu wrote:
>> After commit e2846124f9a2 ("zsmalloc: zs_page_migrate: skip unnecessary
>> loops but not return -EBUSY if zspage is not inuse") zs_page_migrate
>> can handle the ZS_EMPTY zspage.
>>
>> But it will affect the free_work free the zspage.  That will make this
>> ZS_EMPTY zspage stay in system until another zspage wake up free_work.
>>
>> Make this patch let zs_page_migrate wake up free_work if need.
>>
>> Fixes: e2846124f9a2 ("zsmalloc: zs_page_migrate: skip unnecessary loops but 
>> not return -EBUSY if zspage is not inuse")
>> Signed-off-by: Hui Zhu 
>
> This patch makes me remind why I didn't try to migrate empty zspage
> as you did e2846124f9a2. I have forgotten it toally.
>
> We cannot guarantee when the freeing of the page happens if we use
> deferred freeing in zs_page_migrate. However, we returns
> MIGRATEPAGE_SUCCESS which is totally lie.
> Without instant freeing the page, it doesn't help the migration
> situation. No?
>

Sorry, I think the reason is that I didn't explain this clearly.
After my patch e2846124f9a2, I got some false returns in zs_page_isolate:
if (get_zspage_inuse(zspage) == 0) {
spin_unlock(&class->lock);
return false;
}
A page of this zspage had been migrated in before.

So I think e2846124f9a2 is OK in that it returns MIGRATEPAGE_SUCCESS for
the "page".  But it keeps the "newpage" with an empty zspage inside the
system.  The root cause is that zs_page_isolate removes it from the
ZS_EMPTY list, but zs_page_putback's "schedule_work(&pool->free_work);"
is never called, because zs_page_migrate finished the job without
"schedule_work(&pool->free_work);".

That is why I made the new patch.

Thanks,
Hui

> I start to wonder why your patch e2846124f9a2 helped your test.
> I will think over the issue with fresh mind after the holiday.
>
>> ---
>>  mm/zsmalloc.c | 10 --
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>> index 62457eb..48ce043 100644
>> --- a/mm/zsmalloc.c
>> +++ b/mm/zsmalloc.c
>> @@ -2035,8 +2035,14 @@ int zs_page_migrate(struct address_space *mapping, 
>> struct page *newpage,
>>* Page migration is done so let's putback isolated zspage to
>>* the list if @page is final isolated subpage in the zspage.
>>*/
>> - if (!is_zspage_isolated(zspage))
>> - putback_zspage(class, zspage);
>> + if (!is_zspage_isolated(zspage)) {
>> + /*
>> +  * The page and class is locked, we cannot free zspage
>> +  * immediately so let's defer.
>> +  */
>> + if (putback_zspage(class, zspage) == ZS_EMPTY)
>> + schedule_work(&pool->free_work);
>> + }
>>
>>   reset_page(page);
>>   put_page(page);
>> --
>> 1.9.1
>>


[PATCH v2] zsmalloc: zs_page_migrate: schedule free_work if zspage is ZS_EMPTY

2017-08-14 Thread Hui Zhu
After commit e2846124f9a2 ("zsmalloc: zs_page_migrate: skip unnecessary
loops but not return -EBUSY if zspage is not inuse") zs_page_migrate
can handle a ZS_EMPTY zspage.

But I got some false returns in zs_page_isolate:
if (get_zspage_inuse(zspage) == 0) {
spin_unlock(&class->lock);
return false;
}
A page of this zspage had been migrated in before.

The reason is that commit e2846124f9a2 ("zsmalloc: zs_page_migrate: skip
unnecessary loops but not return -EBUSY if zspage is not inuse") only
handles the "page" but not the "newpage", so it keeps the "newpage" with
an empty zspage inside the system.
The root cause is that zs_page_isolate removes it from the ZS_EMPTY list,
but zs_page_putback's "schedule_work(&pool->free_work);" is never called,
because zs_page_migrate finished the job without
"schedule_work(&pool->free_work);".

This patch lets zs_page_migrate wake up free_work if needed.

Fixes: e2846124f9a2 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not 
return -EBUSY if zspage is not inuse")
Signed-off-by: Hui Zhu 
---
 mm/zsmalloc.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 62457eb..c6cc77c 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -2035,8 +2035,17 @@ int zs_page_migrate(struct address_space *mapping, 
struct page *newpage,
 * Page migration is done so let's putback isolated zspage to
 * the list if @page is final isolated subpage in the zspage.
 */
-   if (!is_zspage_isolated(zspage))
-   putback_zspage(class, zspage);
+   if (!is_zspage_isolated(zspage)) {
+   /*
+* Page will be freed in following part. But newpage and
+* zspage will stay in system if zspage is in ZS_EMPTY
+* list.  So call free_work to free it.
+* The page and class is locked, we cannot free zspage
+* immediately so let's defer.
+*/
+   if (putback_zspage(class, zspage) == ZS_EMPTY)
+   schedule_work(&pool->free_work);
+   }
 
reset_page(page);
put_page(page);
-- 
1.9.1



Re: [PATCH] zsmalloc: zs_page_migrate: not check inuse if migrate_mode is not MIGRATE_ASYNC

2017-07-20 Thread Hui Zhu
2017-07-20 16:47 GMT+08:00 Minchan Kim :
> Hi Hui,
>
> On Thu, Jul 20, 2017 at 02:39:17PM +0800, Hui Zhu wrote:
>> Hi Minchan,
>>
>> I am sorry for answering late.
>> I spent some time on Ubuntu 16.04 with mmtests on an old laptop.
>>
>> 2017-07-17 13:39 GMT+08:00 Minchan Kim :
>> > Hello Hui,
>> >
>> > On Fri, Jul 14, 2017 at 03:51:07PM +0800, Hui Zhu wrote:
>> >> Got some -EBUSY from zs_page_migrate that will make migration
>> >> slow (retry) or fail (zs_page_putback will schedule_work free_work,
>> >> but it cannot ensure the success).
>> >
>> > I think EAGAIN(migration retrial) is better than EBUSY(bailout) because
>> > expectation is that zsmalloc will release the empty zs_page soon so
>> > at next retrial, it will be succeeded.
>>
>>
>> I am not sure.
>>
>> This is the call trace of zs_page_migrate:
>> zs_page_migrate
>> mapping->a_ops->migratepage
>> move_to_new_page
>> __unmap_and_move
>> unmap_and_move
>> migrate_pages
>>
>> unmap_and_move will remove the page from the migration page list
>> and call putback_movable_page (which calls mapping->a_ops->putback_page) if
>> the return value of zs_page_migrate is not -EAGAIN.
>> The comment for this part:
>> after mapping->a_ops->putback_page is called, zsmalloc can free the page
>> from the ZS_EMPTY list.
>>
>> If -EAGAIN is returned, the page will not be put back.  An EAGAIN page will
>> be tried again in migrate_pages without re-isolation.
>
> You're right. With -EAGAIN, it burns out CPU pointlessly.
>
>>
>> > About schedule_work, as you said, we don't make sure when it happens but
>> > I believe it will happen in a migration iteration most of case.
>> > How often do you see that case?
>>
>> I noticed this issue because my kernel patch
>> https://lkml.org/lkml/2014/5/28/113
>> removes the retry in __alloc_contig_migrate_range.
>> This retry handles the -EBUSY because it re-isolates the page
>> and re-calls migrate_pages.
>> Without it, cma_alloc fails at once with -EBUSY.
>
> LKML.org server is not responding so hard to see patch you mentioned
> but I just got your point now so I don't care any more. Your patch is
> enough simple as considering the benefit.
> Just look at below comment.
>
>>
>> >
>> >>
>> >> And I didn't find anything that make zs_page_migrate cannot work with
>> >> a ZS_EMPTY zspage.
>> >> So make the patch to not check inuse if migrate_mode is not
>> >> MIGRATE_ASYNC.
>> >
>> > At a first glance, I think it work but the question is that it a same 
>> > problem
>> > ith schedule_work of zs_page_putback. IOW, Until the work is done, 
>> > compaction
>> > cannot succeed. Do you have any number before and after?
>> >
>>
>>
>> Following is what I got with highalloc-performance in a VirtualBox VM with
>> 2 CPUs, 1G memory, and 512 zram as swap:
>>                               orig        after
>> Minor Faults  5080511350801261
>> Major Faults 43918   46692
>> Swap Ins 42087   46299
>> Swap Outs89718  105495
>> Allocation stalls0   0
>> DMA allocs   57787   69787
>> DMA32 allocs  4796459947983772
>> Normal allocs0   0
>> Movable allocs   0   0
>> Direct pages scanned 45493   28837
>> Kswapd pages scanned   1565222 1512947
>> Kswapd pages reclaimed 134 1334030
>> Direct pages reclaimed   45615   30174
>> Kswapd efficiency  85% 88%
>> Kswapd velocity   1897.1011708.309
>> Direct efficiency 100%104%
>> Direct velocity 55.139  32.561
>> Percentage direct scans 2%  1%
>> Zone normal velocity  1952.2401740.870
>> Zone dma32 velocity  0.000   0.000
>> Zone dma velocity0.000   0.000
>> Page writes by reclaim   89764.000  106043.000
>> Page writes file46 548
>> Page writes anon 89718  105495
>> Page reclaim immediate   214577269
>> Sector Reads   3259688 3144160
>

[PATCH] zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse

2017-07-24 Thread Hui Zhu
The first version is in [1].

Getting -EBUSY from zs_page_migrate makes migration
slow (retry) or fail (zs_page_putback will schedule_work free_work,
but it cannot guarantee success).

I noticed this issue because my kernel is patched with [2],
which removes the retry in __alloc_contig_migrate_range.
This retry handles the -EBUSY because it re-isolates the page
and re-calls migrate_pages.
Without it, cma_alloc fails at once with -EBUSY.

According to the review from Minchan Kim in [3], I updated the patch
to skip unnecessary loops but not return -EBUSY if the zspage is not in use.

Following is what I got with highalloc-performance in a VirtualBox VM with
2 CPUs, 1G memory, and 512 zram as swap.  The swappiness is set to 100.
                                  orig         new
Minor Faults  5080511350830235
Major Faults 43918   56530
Swap Ins 42087   55680
Swap Outs89718  104700
Allocation stalls0   0
DMA allocs   57787   52364
DMA32 allocs  4796459948043563
Normal allocs0   0
Movable allocs   0   0
Direct pages scanned 45493   23167
Kswapd pages scanned   1565222 1725078
Kswapd pages reclaimed 134 1503037
Direct pages reclaimed   45615   25186
Kswapd efficiency  85% 87%
Kswapd velocity   1897.1011949.042
Direct efficiency 100%108%
Direct velocity 55.139  26.175
Percentage direct scans 2%  1%
Zone normal velocity  1952.2401975.217
Zone dma32 velocity  0.000   0.000
Zone dma velocity0.000   0.000
Page writes by reclaim   89764.000  105233.000
Page writes file46 533
Page writes anon 89718  104700
Page reclaim immediate   214573699
Sector Reads   3259688 3441368
Sector Writes  3667252 3754836
Page rescued immediate   0   0
Slabs scanned  1042872 1160855
Direct inode steals   8042   10089
Kswapd inode steals  54295   29170
Kswapd skipped wait  0   0
THP fault alloc175 154
THP collapse alloc 226 289
THP splits   0   0
THP fault fallback  11  14
THP collapse fail3   2
Compaction stalls  536 646
Compaction success 322 358
Compaction failures214 288
Page migrate success119608  111063
Page migrate failure  27232593
Compaction pages isolated   250179  232652
Compaction migrate scanned 9131832 9942306
Compaction free scanned2093272 2613998
Compaction cost192 189
NUMA alloc hit4712455547193990
NUMA alloc miss  0   0
NUMA interleave hit  0   0
NUMA alloc local  4712455547193990
NUMA base PTE updates0   0
NUMA huge PMD updates0   0
NUMA page range updates  0   0
NUMA hint faults 0   0
NUMA hint local faults   0   0
NUMA hint local percent100 100
NUMA pages migrated  0   0
AutoNUMA cost   0%  0%

[1]: https://lkml.org/lkml/2017/7/14/93
[2]: https://lkml.org/lkml/2014/5/28/113
[3]: https://lkml.org/lkml/2017/7/21/10

Signed-off-by: Hui Zhu 
---
 mm/zsmalloc.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index d41edd2..c2c7ba9 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1997,8 +1997,11 @@ int zs_page_migrate(struct address_space *mapping, 
struct page *newpage,
 
spin_lock(&class->lock);
if (!get_zspage_inuse(zspage)) {
-   ret = -EBUSY;
-   goto unlock_class;
+   /*
+* Set "offset" to end of the page so that every loops
+* skips unnecessary object scanning.
+*/
+   offset = PAGE_SIZE;
}
 
pos = offset;
@@ -2066,7 +2069,7 @@ int zs_page_migrate(struct address_space *mapping, struct 
page *newpage,
}
}
kunmap_atomic(s_addr);
-unlock_class:
+
spin_unlock(&class->lock);
migrate_write_unlock(zspage);
 
-- 
1.9.1



[RFC 3/4] module: add /proc/modules_update_version

2017-10-13 Thread Hui Zhu
With "BloodTest: perf", we can get the address of kernel from "cpu0/page"
without symbol.
The application that call BloodTest need translate the address to symbol
with itself.  For normal address, just vmlinux is OK to get the right
symbol.  But for the address of kernel module, it also need the address
of modules from /proc/modules.

Add /proc/modules_update_version will help the application to get if the
kernel modules address is changed or not.
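
As a rough userspace illustration (not part of this patch; the file name and
format follow the description above), an application could cache its copy of
/proc/modules and re-read it only when the version value changes:

#include <stdio.h>

/*
 * Hypothetical sketch: poll /proc/modules_update_version and report
 * whether /proc/modules needs to be reloaded.
 */
static unsigned long long cached_version;

int modules_changed(void)
{
	unsigned long long v = 0;
	FILE *f = fopen("/proc/modules_update_version", "r");

	if (!f)
		return -1;			/* interface not available */
	if (fscanf(f, "%llu", &v) != 1)
		v = 0;
	fclose(f);
	if (v != cached_version) {
		cached_version = v;
		return 1;			/* reload /proc/modules */
	}
	return 0;
}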

Signed-off-by: Hui Zhu 
---
 kernel/module.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/kernel/module.c b/kernel/module.c
index de66ec8..ed6f370 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -317,6 +317,8 @@ struct load_info {
} index;
 };
 
+static atomic_t modules_update_version = ATOMIC_INIT(0);
+
 /*
  * We require a truly strong try_module_get(): 0 means success.
  * Otherwise an error is returned due to ongoing or failed
@@ -1020,6 +1022,9 @@ int module_refcount(struct module *mod)
strlcpy(last_unloaded_module, mod->name, sizeof(last_unloaded_module));
 
free_module(mod);
+
+   atomic_inc(&modules_update_version);
+
return 0;
 out:
mutex_unlock(&module_mutex);
@@ -3183,6 +3188,8 @@ static int move_module(struct module *mod, struct 
load_info *info)
 (long)shdr->sh_addr, info->secstrings + shdr->sh_name);
}
 
+   atomic_inc(&modules_update_version);
+
return 0;
 }
 
@@ -4196,9 +4203,21 @@ static int modules_open(struct inode *inode, struct file 
*file)
.release= seq_release,
 };
 
+static int modules_update_version_get(void *data, u64 *val)
+{
+   *val = (u64)atomic_read(&modules_update_version);
+
+   return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(proc_modules_update_version_operations,
+   modules_update_version_get, NULL, "%llu\n");
+
 static int __init proc_modules_init(void)
 {
proc_create("modules", 0, NULL, &proc_modules_operations);
+   proc_create("modules_update_version", 0, NULL,
+   &proc_modules_update_version_operations);
return 0;
 }
 module_init(proc_modules_init);
-- 
1.9.1



[RFC 4/4] BloodTest: task

2017-10-13 Thread Hui Zhu
This patch adds a function that records how tasks use system resources,
for example CPU time, read_bytes, and write_bytes.
The interface is in "/sys/kernel/debug/bloodtest/task".
"on" is the switch.  When it is set to 1, accessing "test" will record task
information.
After recording, accessing "str" will return the recorded data as a string.
Accessing "page" will return the recorded data in binary; its format is
described in "bin_format".

Signed-off-by: Hui Zhu 
---
 include/linux/bloodtest.h   |  10 +
 kernel/bloodtest/Makefile   |   2 +-
 kernel/bloodtest/core.c |  21 +++
 kernel/bloodtest/internal.h |  13 ++
 kernel/bloodtest/perf.c |  33 +---
 kernel/bloodtest/task.c | 447 
 kernel/exit.c   |   4 +
 7 files changed, 505 insertions(+), 25 deletions(-)
 create mode 100644 include/linux/bloodtest.h
 create mode 100644 kernel/bloodtest/task.c

diff --git a/include/linux/bloodtest.h b/include/linux/bloodtest.h
new file mode 100644
index 000..55f4ebc
--- /dev/null
+++ b/include/linux/bloodtest.h
@@ -0,0 +1,10 @@
+#ifndef __LINUX_BLOODTEST_H
+#define __LINUX_BLOODTEST_H
+
+#ifdef CONFIG_BLOODTEST
+extern void bt_task_exit_record(struct task_struct *p);
+#else
+static inline void bt_task_exit_record(struct task_struct *p)  { }
+#endif
+
+#endif /* __LINUX_BLOODTEST_H */
diff --git a/kernel/bloodtest/Makefile b/kernel/bloodtest/Makefile
index 79b7ea0..a6f1a7a 100644
--- a/kernel/bloodtest/Makefile
+++ b/kernel/bloodtest/Makefile
@@ -1,3 +1,3 @@
-obj-y  = core.o pages.o kernel_stat.o
+obj-y  = core.o pages.o kernel_stat.o task.o
 
 obj-$(CONFIG_PERF_EVENTS) += perf.o
diff --git a/kernel/bloodtest/core.c b/kernel/bloodtest/core.c
index 5ba800c..6cfcdf2 100644
--- a/kernel/bloodtest/core.c
+++ b/kernel/bloodtest/core.c
@@ -16,6 +16,7 @@
 /* This function must be called under the protection of bt_lock.  */
 static void bt_insert(void)
 {
+   bt_insert_task();
bt_insert_perf();
bt_insert_kernel_stat();
 }
@@ -25,6 +26,7 @@ static void bt_pullout(void)
 {
bt_pullout_kernel_stat();
bt_pullout_perf();
+   bt_pullout_task();
 }
 
 /* This function must be called under the protection of bt_lock.  */
@@ -99,13 +101,32 @@ static int __init bt_init(void)
bt_ktime = ktime_set(1, 0);
 
ret = bt_perf_init(d);
+   if (ret < 0)
+   goto out;
+
+   ret = bt_task_init(d);
 
 out:
if (ret != 0) {
debugfs_remove(t);
debugfs_remove(d);
+   pr_err("bloodtest: init get error %d\n", ret);
}
return ret;
 }
 
 core_initcall(bt_init);
+
+int bt_number_get(void *data, u64 *val)
+{
+   unsigned int *number_point = data;
+
+   down_read(&bt_lock);
+
+   *val = (u64)*number_point;
+
+   up_read(&bt_lock);
+
+   return 0;
+}
+
diff --git a/kernel/bloodtest/internal.h b/kernel/bloodtest/internal.h
index f6befc4..5aacf37 100644
--- a/kernel/bloodtest/internal.h
+++ b/kernel/bloodtest/internal.h
@@ -3,6 +3,13 @@
 
 #include 
 
+#define SHOW_FORMAT_1(p, s, entry, type, sign, size) \
+   seq_printf(p, "%s format:%s %s offset:%lu size:%lu\n", \
+  #entry, #type, sign, offsetof(s, entry), \
+  (unsigned long)size)
+#define SHOW_FORMAT(p, s, entry, type, sign) \
+   SHOW_FORMAT_1(p, s, entry, type, sign, sizeof(type))
+
 extern struct rw_semaphore bt_lock;
 
 struct bt_pages {
@@ -45,4 +52,10 @@ static inline void bt_task_pullout_perf(void)
{ }
 static inline int bt_perf_init(struct dentry *d)   { return 0; }
 #endif
 
+extern void bt_insert_task(void);
+extern void bt_pullout_task(void);
+extern int bt_task_init(struct dentry *d);
+
+extern int bt_number_get(void *data, u64 *val);
+
 #endif /* _KERNEL_BLOODTEST_INTERNAL_H */
diff --git a/kernel/bloodtest/perf.c b/kernel/bloodtest/perf.c
index cf23844..d495258 100644
--- a/kernel/bloodtest/perf.c
+++ b/kernel/bloodtest/perf.c
@@ -40,20 +40,7 @@ struct perf_rec {
 struct dentry *perf_dir;
 struct dentry *perf_str_dir;
 
-static int perf_number_get(void *data, u64 *val)
-{
-   unsigned int *number_point = data;
-
-   down_read(&bt_lock);
-
-   *val = (u64)*number_point;
-
-   up_read(&bt_lock);
-
-   return 0;
-}
-
-DEFINE_SIMPLE_ATTRIBUTE(perf_number_fops, perf_number_get, NULL, "%llu\n");
+DEFINE_SIMPLE_ATTRIBUTE(perf_number_fops, bt_number_get, NULL, "%llu\n");
 
 static void perf_overflow_handler(struct perf_event *event,
struct perf_sample_data *data,
@@ -402,7 +389,7 @@ static int perf_event_set(void *data, u64 val)
 }
 
 DEFINE_SIMPLE_ATTRIBUTE(perf_event_fops,
-   perf_number_get,
+   bt_number_get,
perf_event_set, "%llu\n");
 
 static int perf_bin_format_show(struct seq_file *p, void *unused)
@@ 

[RFC 1/4] BloodTest: kernel status

2017-10-13 Thread Hui Zhu
This patch includes the base framework of BloodTest and the function that
gets the kernel status.

The interface is in "/sys/kernel/debug/bloodtest".
Accessing "test" will call bt_insert, which calls all the start-record
functions, and register an hrtimer that calls bt_pullout to stop recording.

bt_insert and bt_pullout call the analysing tools.

Signed-off-by: Hui Zhu 
---
 fs/proc/stat.c |   8 +--
 include/linux/kernel_stat.h|   3 ++
 init/Kconfig   |   3 ++
 kernel/Makefile|   2 +
 kernel/bloodtest/Makefile  |   1 +
 kernel/bloodtest/core.c| 117 +
 kernel/bloodtest/internal.h|  19 +++
 kernel/bloodtest/kernel_stat.c |  62 ++
 8 files changed, 211 insertions(+), 4 deletions(-)
 create mode 100644 kernel/bloodtest/Makefile
 create mode 100644 kernel/bloodtest/core.c
 create mode 100644 kernel/bloodtest/internal.h
 create mode 100644 kernel/bloodtest/kernel_stat.c

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index bd4e55f..c6f4fd4 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -22,7 +22,7 @@
 
 #ifdef arch_idle_time
 
-static u64 get_idle_time(int cpu)
+u64 get_idle_time(int cpu)
 {
u64 idle;
 
@@ -32,7 +32,7 @@ static u64 get_idle_time(int cpu)
return idle;
 }
 
-static u64 get_iowait_time(int cpu)
+u64 get_iowait_time(int cpu)
 {
u64 iowait;
 
@@ -44,7 +44,7 @@ static u64 get_iowait_time(int cpu)
 
 #else
 
-static u64 get_idle_time(int cpu)
+u64 get_idle_time(int cpu)
 {
u64 idle, idle_usecs = -1ULL;
 
@@ -60,7 +60,7 @@ static u64 get_idle_time(int cpu)
return idle;
 }
 
-static u64 get_iowait_time(int cpu)
+u64 get_iowait_time(int cpu)
 {
u64 iowait, iowait_usecs = -1ULL;
 
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66be8b6..bf8d3f0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -96,4 +96,7 @@ static inline void account_process_tick(struct task_struct 
*tsk, int user)
 
 extern void account_idle_ticks(unsigned long ticks);
 
+extern u64 get_idle_time(int cpu);
+extern u64 get_iowait_time(int cpu);
+
 #endif /* _LINUX_KERNEL_STAT_H */
diff --git a/init/Kconfig b/init/Kconfig
index 78cb246..f63550c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1909,3 +1909,6 @@ config ASN1
  functions to call on what tags.
 
 source "kernel/Kconfig.locks"
+
+config BLOODTEST
+   bool "Blood test"
diff --git a/kernel/Makefile b/kernel/Makefile
index ed470aa..2a04e42 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,8 @@ obj-$(CONFIG_BPF) += bpf/
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
+obj-$(CONFIG_BLOODTEST) += bloodtest/
+
 obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
diff --git a/kernel/bloodtest/Makefile b/kernel/bloodtest/Makefile
new file mode 100644
index 000..7f289af
--- /dev/null
+++ b/kernel/bloodtest/Makefile
@@ -0,0 +1 @@
+obj-y  = core.o kernel_stat.o
diff --git a/kernel/bloodtest/core.c b/kernel/bloodtest/core.c
new file mode 100644
index 000..7b39cbb
--- /dev/null
+++ b/kernel/bloodtest/core.c
@@ -0,0 +1,117 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "internal.h"
+
+enum bt_stat_enum bt_stat;
+DEFINE_SPINLOCK(bt_lock);
+
+static DECLARE_WAIT_QUEUE_HEAD(bt_wq);
+static struct hrtimer bt_timer;
+static ktime_t bt_ktime;
+
+static bool is_bt_stat(enum bt_stat_enum stat)
+{
+   unsigned long flags;
+   bool ret = false;
+
+   spin_lock_irqsave(&bt_lock, flags);
+   if (bt_stat == stat)
+   ret = true;
+   spin_unlock_irqrestore(&bt_lock, flags);
+
+   return ret;
+}
+
+/* This function must be called under the protection of bt_lock.  */
+static void bt_insert(void)
+{
+   bt_stat = bt_running;
+
+   bt_insert_kernel_stat();
+}
+
+/* This function must be called under the protection of bt_lock.  */
+static void bt_pullout(void)
+{
+   bt_pullout_kernel_stat();
+
+   bt_stat = bt_done;
+}
+
+/* This function must be called under the protection of bt_lock.  */
+static void bt_report(struct seq_file *p)
+{
+   bt_report_kernel_stat(p);
+}
+
+static enum hrtimer_restart bt_timer_fn(struct hrtimer *data)
+{
+   spin_lock(&bt_lock);
+   bt_pullout();
+   spin_unlock(&bt_lock);
+
+   wake_up_interruptible_all(&bt_wq);
+
+   return HRTIMER_NORESTART;
+}
+
+static int test_show(struct seq_file *p, void *v)
+{
+   int ret = 0;
+
+   spin_lock(&bt_lock);
+   if (bt_stat == bt_running)
+   goto wait;
+
+   hrtimer_start(&bt_timer, bt_ktime, HRTIMER_MODE_REL);
+   bt_insert();
+
+wait:
+   spin_unlock(&bt_lock);
+   ret = wait_event_interruptible(bt_wq, is_bt_stat(bt_done));
+   if (ret)
+   goto out;
+
+   spin_lock(&bt_

[RFC 0/4] BloodTest: kernel status

2017-10-13 Thread Hui Zhu
BloodTest: an interface to call other analysing tools

The Linux kernel has a lot of analysing tools: perf, ftrace, systemtap, KGTP
and so on.
The kernel also supplies a lot of internal values through procfs and sysfs
for analysing performance.

Sometimes a user needs to get performance information quickly, with low
overhead and full coverage.
BloodTest is for that.
It is an interface that can access the functions of other analysing tools
and record to an internal buffer that a user or application can access very
quickly (mmap).

For now, BloodTest only supports recording cpu, perf and task information
over one second.

Hui Zhu (4):
BloodTest: kernel status
BloodTest: perf
module: add /proc/modules_update_version
BloodTest: task

 fs/proc/stat.c |8 
 include/linux/bloodtest.h  |   10 
 include/linux/kernel_stat.h|3 
 init/Kconfig   |3 
 kernel/Makefile|2 
 kernel/bloodtest/Makefile  |3 
 kernel/bloodtest/core.c|  132 +
 kernel/bloodtest/internal.h|   61 
 kernel/bloodtest/kernel_stat.c |   62 
 kernel/bloodtest/pages.c   |  266 ++
 kernel/bloodtest/perf.c|  576 +
 kernel/bloodtest/task.c|  447 +++
 kernel/exit.c  |4 
 kernel/module.c|   19 +
 14 files changed, 1592 insertions(+), 4 deletions(-)


[RFC 2/4] BloodTest: perf

2017-10-13 Thread Hui Zhu
This patch adds the function that calls perf, and bt_pages that
can record the data obtained from perf.

The interface is in "/sys/kernel/debug/bloodtest/perf".
"on" is the switch.  When it is set to 1, accessing "test" will call perf.
"perf_config", "perf_freq", "perf_period" and "perf_type" can be used to
set the perf options.
After recording, accessing "str" will return the recorded data as a string.
Accessing "cpu0/page" will return the recorded data in binary; its format is
described in "bin_format".

Signed-off-by: Hui Zhu 
---
 kernel/bloodtest/Makefile   |   4 +-
 kernel/bloodtest/core.c |  76 +++---
 kernel/bloodtest/internal.h |  43 +++-
 kernel/bloodtest/pages.c| 266 
 kernel/bloodtest/perf.c | 591 
 5 files changed, 931 insertions(+), 49 deletions(-)
 create mode 100644 kernel/bloodtest/pages.c
 create mode 100644 kernel/bloodtest/perf.c

diff --git a/kernel/bloodtest/Makefile b/kernel/bloodtest/Makefile
index 7f289af..79b7ea0 100644
--- a/kernel/bloodtest/Makefile
+++ b/kernel/bloodtest/Makefile
@@ -1 +1,3 @@
-obj-y  = core.o kernel_stat.o
+obj-y  = core.o pages.o kernel_stat.o
+
+obj-$(CONFIG_PERF_EVENTS) += perf.o
diff --git a/kernel/bloodtest/core.c b/kernel/bloodtest/core.c
index 7b39cbb..5ba800c 100644
--- a/kernel/bloodtest/core.c
+++ b/kernel/bloodtest/core.c
@@ -6,31 +6,17 @@
 
 #include "internal.h"
 
-enum bt_stat_enum bt_stat;
-DEFINE_SPINLOCK(bt_lock);
+DECLARE_RWSEM(bt_lock);
 
 static DECLARE_WAIT_QUEUE_HEAD(bt_wq);
 static struct hrtimer bt_timer;
 static ktime_t bt_ktime;
-
-static bool is_bt_stat(enum bt_stat_enum stat)
-{
-   unsigned long flags;
-   bool ret = false;
-
-   spin_lock_irqsave(&bt_lock, flags);
-   if (bt_stat == stat)
-   ret = true;
-   spin_unlock_irqrestore(&bt_lock, flags);
-
-   return ret;
-}
+static bool bt_timer_stop;
 
 /* This function must be called under the protection of bt_lock.  */
 static void bt_insert(void)
 {
-   bt_stat = bt_running;
-
+   bt_insert_perf();
bt_insert_kernel_stat();
 }
 
@@ -38,8 +24,13 @@ static void bt_insert(void)
 static void bt_pullout(void)
 {
bt_pullout_kernel_stat();
+   bt_pullout_perf();
+}
 
-   bt_stat = bt_done;
+/* This function must be called under the protection of bt_lock.  */
+static void bt_task_pullout(void)
+{
+   bt_task_pullout_perf();
 }
 
 /* This function must be called under the protection of bt_lock.  */
@@ -50,38 +41,33 @@ static void bt_report(struct seq_file *p)
 
 static enum hrtimer_restart bt_timer_fn(struct hrtimer *data)
 {
-   spin_lock(&bt_lock);
bt_pullout();
-   spin_unlock(&bt_lock);
 
-   wake_up_interruptible_all(&bt_wq);
+   bt_timer_stop = true;
+   wake_up_all(&bt_wq);
 
return HRTIMER_NORESTART;
 }
 
-static int test_show(struct seq_file *p, void *v)
+static int test_show(struct seq_file *p, void *unused)
 {
-   int ret = 0;
+   down_write(&bt_lock);
 
-   spin_lock(&bt_lock);
-   if (bt_stat == bt_running)
-   goto wait;
+   bt_timer_stop = false;
 
-   hrtimer_start(&bt_timer, bt_ktime, HRTIMER_MODE_REL);
bt_insert();
+   hrtimer_start(&bt_timer, bt_ktime, HRTIMER_MODE_REL);
 
-wait:
-   spin_unlock(&bt_lock);
-   ret = wait_event_interruptible(bt_wq, is_bt_stat(bt_done));
-   if (ret)
-   goto out;
+   wait_event(bt_wq, bt_timer_stop);
 
-   spin_lock(&bt_lock);
-   bt_report(p);
-   spin_unlock(&bt_lock);
+   bt_task_pullout();
+   up_write(&bt_lock);
 
-out:
-   return ret;
+   down_read(&bt_lock);
+   bt_report(p);
+   up_read(&bt_lock);
+   
+   return 0;
 }
 
 static int test_open(struct inode *inode, struct file *file)
@@ -98,20 +84,28 @@ static int test_open(struct inode *inode, struct file *file)
 
 static int __init bt_init(void)
 {
-   struct dentry *d, *t;
+   int ret = -ENOMEM;
+   struct dentry *d = NULL, *t = NULL;
 
d = debugfs_create_dir("bloodtest", NULL);
if (!d)
-   return -ENOMEM;
+   goto out;
t = debugfs_create_file("test", S_IRUSR, d, NULL, &test_fops);
if (!t)
-   return -ENOMEM;
+   goto out;
 
hrtimer_init(&bt_timer, CLOCK_REALTIME, HRTIMER_MODE_REL);
bt_timer.function = bt_timer_fn;
bt_ktime = ktime_set(1, 0);
 
-   return 0;
+   ret = bt_perf_init(d);
+
+out:
+   if (ret != 0) {
+   debugfs_remove(t);
+   debugfs_remove(d);
+   }
+   return ret;
 }
 
 core_initcall(bt_init);
diff --git a/kernel/bloodtest/internal.h b/kernel/bloodtest/internal.h
index 48faf4d..f6befc4 100644
--- a/kernel/bloodtest/internal.h
+++ 

Re: [PATCH] zsmalloc: zs_page_migrate: not check inuse if migrate_mode is not MIGRATE_ASYNC

2017-07-19 Thread Hui Zhu
Hi Minchan,

I am sorry for answering late.
I spent some time on Ubuntu 16.04 with mmtests on an old laptop.

2017-07-17 13:39 GMT+08:00 Minchan Kim :
> Hello Hui,
>
> On Fri, Jul 14, 2017 at 03:51:07PM +0800, Hui Zhu wrote:
>> Got some -EBUSY from zs_page_migrate that will make migration
>> slow (retry) or fail (zs_page_putback will schedule_work free_work,
>> but it cannot ensure the success).
>
> I think EAGAIN(migration retrial) is better than EBUSY(bailout) because
> expectation is that zsmalloc will release the empty zs_page soon so
> at next retrial, it will be succeeded.


I am not sure.

This is the call trace of zs_page_migrate:
zs_page_migrate
mapping->a_ops->migratepage
move_to_new_page
__unmap_and_move
unmap_and_move
migrate_pages

unmap_and_move will remove the page from the migration page list
and call putback_movable_page (which calls mapping->a_ops->putback_page) if
the return value of zs_page_migrate is not -EAGAIN.
The comment for this part:
after mapping->a_ops->putback_page is called, zsmalloc can free the page
from the ZS_EMPTY list.

If -EAGAIN is returned, the page will not be put back.  An EAGAIN page will
be tried again in migrate_pages without re-isolation.

> About schedule_work, as you said, we don't make sure when it happens but
> I believe it will happen in a migration iteration most of case.
> How often do you see that case?

I noticed this issue because my kernel patch https://lkml.org/lkml/2014/5/28/113
removes the retry in __alloc_contig_migrate_range.
This retry handles the -EBUSY because it re-isolates the page
and re-calls migrate_pages.
Without it, cma_alloc fails at once with -EBUSY.

>
>>
>> And I didn't find anything that make zs_page_migrate cannot work with
>> a ZS_EMPTY zspage.
>> So make the patch to not check inuse if migrate_mode is not
>> MIGRATE_ASYNC.
>
> At a first glance, I think it work but the question is that it a same problem
> ith schedule_work of zs_page_putback. IOW, Until the work is done, compaction
> cannot succeed. Do you have any number before and after?
>


Following is what I got with highalloc-performance in a VirtualBox VM with
2 CPUs, 1G memory, and 512 zram as swap:
                              orig        after
Minor Faults  5080511350801261
Major Faults 43918   46692
Swap Ins 42087   46299
Swap Outs89718  105495
Allocation stalls0   0
DMA allocs   57787   69787
DMA32 allocs  4796459947983772
Normal allocs0   0
Movable allocs   0   0
Direct pages scanned 45493   28837
Kswapd pages scanned   1565222 1512947
Kswapd pages reclaimed 134 1334030
Direct pages reclaimed   45615   30174
Kswapd efficiency  85% 88%
Kswapd velocity   1897.1011708.309
Direct efficiency 100%104%
Direct velocity 55.139  32.561
Percentage direct scans 2%  1%
Zone normal velocity  1952.2401740.870
Zone dma32 velocity  0.000   0.000
Zone dma velocity0.000   0.000
Page writes by reclaim   89764.000  106043.000
Page writes file46 548
Page writes anon 89718  105495
Page reclaim immediate   214577269
Sector Reads   3259688 3144160
Sector Writes  3667252 3675528
Page rescued immediate   0   0
Slabs scanned  1042872 1035438
Direct inode steals   80427772
Kswapd inode steals  54295   55075
Kswapd skipped wait  0   0
THP fault alloc175 200
THP collapse alloc 226 363
THP splits   0   0
THP fault fallback  11   1
THP collapse fail3   1
Compaction stalls  536 647
Compaction success 322 384
Compaction failures214 263
Page migrate success119608  127002
Page migrate failure  27232309
Compaction pages isolated   250179  265318
Compaction migrate scanned 9131832 9351314
Compaction free scanned2093272 3059014
Compaction cost192 202
NUMA alloc hit4712455547086375
NUMA alloc miss  0   0
NUMA interleave hit  0   0
NUMA alloc local  4712455547086375
NUMA base PTE updates0   0
NUMA huge PMD up

[PATCH] usemem: Add option touch-alloc

2020-12-16 Thread Hui Zhu
Some environments will not fault in memory even if MAP_POPULATE is set.
This commit adds the option touch-alloc, which reads memory after
allocating it to make sure the pages are faulted in.
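
A minimal standalone sketch of the same idea (independent of usemem's own
helpers, assuming an anonymous mapping): reading one byte per page forces
each page to be faulted in even when MAP_POPULATE is not honoured.

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustrative only: touch every page of an anonymous mapping so the
 * kernel must fault it in. */
static void *alloc_and_touch(size_t bytes)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	void *m = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	volatile unsigned char *p = m;
	size_t off;

	if (m == MAP_FAILED)
		return NULL;
	for (off = 0; off < bytes; off += pagesize)
		(void)p[off];		/* one read per page faults it in */
	return m;
}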

Signed-off-by: Hui Zhu 
---
 usemem.c | 37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/usemem.c b/usemem.c
index 6d1d575..d93691b 100644
--- a/usemem.c
+++ b/usemem.c
@@ -97,6 +97,7 @@ unsigned long opt_delay = 0;
 int opt_read_again = 0;
 int opt_punch_holes = 0;
 int opt_init_time = 0;
+int opt_touch_alloc = 0;
 int nr_task;
 int nr_thread;
 int nr_cpu;
@@ -157,6 +158,7 @@ void usage(int ok)
"-Z|--read-again read memory again after access the memory\n"
"--punch-holes   free every other page after allocation\n"
"--init-time remove the initialization time from the run 
time and show the initialization time\n"
+   "--touch-alloc   read memory after allocate it\n"
"-h|--help   show this message\n"
,   ourname);
 
@@ -197,6 +199,7 @@ static const struct option opts[] = {
{ "read-again"  , 0, NULL, 'Z' },
{ "punch-holes" , 0, NULL,   0 },
{ "init-time"   , 0, NULL,   0 },
+   { "touch-alloc" , 0, NULL,   0 },
{ "help", 0, NULL, 'h' },
{ NULL  , 0, NULL, 0 }
 };
@@ -326,6 +329,18 @@ void detach(void)
}
 }
 
+unsigned long do_access(unsigned long *p, unsigned long idx, int read)
+{
+   volatile unsigned long *vp = p;
+
+   if (read)
+   return vp[idx]; /* read data */
+   else {
+   vp[idx] = idx;  /* write data */
+   return 0;
+   }
+}
+
 unsigned long * allocate(unsigned long bytes)
 {
unsigned long *p;
@@ -352,6 +367,14 @@ unsigned long * allocate(unsigned long bytes)
p = (unsigned long *)ALIGN((unsigned long)p, pagesize - 1);
}
 
+   if (opt_touch_alloc) {
+   unsigned long i;
+   unsigned long m = bytes / sizeof(*p);
+
+   for (i = 0; i < m; i += 1)
+   do_access(p, i, 1);
+   }
+
return p;
 }
 
@@ -433,18 +456,6 @@ void shm_unlock(int seg_id)
shmctl(seg_id, SHM_UNLOCK, NULL);
 }
 
-unsigned long do_access(unsigned long *p, unsigned long idx, int read)
-{
-   volatile unsigned long *vp = p;
-
-   if (read)
-   return vp[idx]; /* read data */
-   else {
-   vp[idx] = idx;  /* write data */
-   return 0;
-   }
-}
-
 #define NSEC_PER_SEC  (1UL * 1000 * 1000 * 1000)
 
 long nsec_sub(long nsec1, long nsec2)
@@ -950,6 +961,8 @@ int main(int argc, char *argv[])
opt_punch_holes = 1;
} else if (strcmp(opts[opt_index].name, "init-time") == 
0) { 
opt_init_time = 1;
+   } else if (strcmp(opts[opt_index].name, "touch-alloc") 
== 0) { 
+   opt_touch_alloc = 1;
} else
usage(1);
break;
-- 
2.17.1



[PATCH] usemem: Add option init-time

2020-12-16 Thread Hui Zhu
From: Hui Zhu 

This commit adds a new option, init-time, which removes the initialization
time from the run time and reports the initialization time separately.

Signed-off-by: Hui Zhu 
---
 usemem.c | 29 +++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/usemem.c b/usemem.c
index 823647e..6d1d575 100644
--- a/usemem.c
+++ b/usemem.c
@@ -96,6 +96,7 @@ int opt_bind_interval = 0;
 unsigned long opt_delay = 0;
 int opt_read_again = 0;
 int opt_punch_holes = 0;
+int opt_init_time = 0;
 int nr_task;
 int nr_thread;
 int nr_cpu;
@@ -155,6 +156,7 @@ void usage(int ok)
"-U|--hugetlballocate hugetlbfs page\n"
"-Z|--read-again read memory again after access the memory\n"
"--punch-holes   free every other page after allocation\n"
+   "--init-time remove the initialization time from the run 
time and show the initialization time\n"
"-h|--help   show this message\n"
,   ourname);
 
@@ -193,7 +195,8 @@ static const struct option opts[] = {
{ "delay"   , 1, NULL, 'e' },
{ "hugetlb" , 0, NULL, 'U' },
{ "read-again"  , 0, NULL, 'Z' },
-   { "punch-holes" , 0, NULL,   0 },
+   { "punch-holes" , 0, NULL,   0 },
+   { "init-time"   , 0, NULL,   0 },
{ "help", 0, NULL, 'h' },
{ NULL  , 0, NULL, 0 }
 };
@@ -945,6 +948,8 @@ int main(int argc, char *argv[])
case 0:
if (strcmp(opts[opt_index].name, "punch-holes") == 0) {
opt_punch_holes = 1;
+   } else if (strcmp(opts[opt_index].name, "init-time") == 
0) { 
+   opt_init_time = 1;
} else
usage(1);
break;
@@ -1128,7 +1133,7 @@ int main(int argc, char *argv[])
if (optind != argc - 1)
usage(0);
 
-   if (!opt_write_signal_read)
+   if (!opt_write_signal_read || opt_init_time)
gettimeofday(&start_time, NULL);
 
opt_bytes = memparse(argv[optind], NULL);
@@ -1263,5 +1268,25 @@ int main(int argc, char *argv[])
if (!nr_task)
nr_task = 1;
 
+   if (opt_init_time) {
+   struct timeval stop;
+   char buf[1024];
+   size_t len;
+   unsigned long delta_us;
+
+   gettimeofday(&stop, NULL);
+   delta_us = (stop.tv_sec - start_time.tv_sec) * 1000000 +
+   (stop.tv_usec - start_time.tv_usec);
+   len = snprintf(buf, sizeof(buf),
+   "the initialization time is %lu secs %lu usecs\n",
+   delta_us / 1000000, delta_us % 1000000);
+   fflush(stdout);
+   if (write(1, buf, len) != len)
+   fprintf(stderr, "WARNING: statistics output may be 
incomplete.\n");
+
+   if (!opt_write_signal_read)
+   gettimeofday(&start_time, NULL);
+   }
+
return do_tasks();
 }
-- 
2.17.1



[RFC for Linux v4 2/2] virtio_balloon: Add deflate_cont_vq to deflate continuous pages

2020-07-15 Thread Hui Zhu
This commit adds a vq, deflate_cont_vq, to deflate continuous pages.
When VIRTIO_BALLOON_F_CONT_PAGES is set, leak_balloon_cont is called to leak
the balloon.
leak_balloon_cont calls balloon_page_list_dequeue_cont to get continuous
pages from the balloon and reports them using deflate_cont_vq.

Signed-off-by: Hui Zhu 
---
 drivers/virtio/virtio_balloon.c| 73 
 include/linux/balloon_compaction.h |  3 ++
 mm/balloon_compaction.c| 76 ++
 3 files changed, 144 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index b89f566..258b3d9 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -44,6 +44,7 @@
 
 #define VIRTIO_BALLOON_INFLATE_MAX_ORDER min((int) (sizeof(__virtio32) * 
BITS_PER_BYTE - \
1 - PAGE_SHIFT), 
(MAX_ORDER-1))
+#define VIRTIO_BALLOON_DEFLATE_MAX_PAGES_NUM (((__virtio32)~0U) >> PAGE_SHIFT)
 
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
@@ -56,6 +57,7 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_FREE_PAGE,
VIRTIO_BALLOON_VQ_REPORTING,
VIRTIO_BALLOON_VQ_INFLATE_CONT,
+   VIRTIO_BALLOON_VQ_DEFLATE_CONT,
VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -65,7 +67,8 @@ enum virtio_balloon_config_read {
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq, 
*inflate_cont_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
+*inflate_cont_vq, *deflate_cont_vq;
 
/* Balloon's own wq for cpu-intensive work items */
struct workqueue_struct *balloon_wq;
@@ -215,6 +218,16 @@ static void set_page_pfns(struct virtio_balloon *vb,
  page_to_balloon_pfn(page) + i);
 }
 
+static void set_page_pfns_size(struct virtio_balloon *vb,
+  __virtio32 pfns[], struct page *page,
+  size_t size)
+{
+   /* Set the first pfn of the continuous pages.  */
+   pfns[0] = cpu_to_virtio32(vb->vdev, page_to_balloon_pfn(page));
+   /* Set the size of the continuous pages.  */
+   pfns[1] = (__virtio32) size;
+}
+
 static void set_page_pfns_order(struct virtio_balloon *vb,
__virtio32 pfns[], struct page *page,
unsigned int order)
@@ -222,10 +235,7 @@ static void set_page_pfns_order(struct virtio_balloon *vb,
if (order == 0)
return set_page_pfns(vb, pfns, page);
 
-   /* Set the first pfn of the continuous pages.  */
-   pfns[0] = cpu_to_virtio32(vb->vdev, page_to_balloon_pfn(page));
-   /* Set the size of the continuous pages.  */
-   pfns[1] = PAGE_SIZE << order;
+   set_page_pfns_size(vb, pfns, page, PAGE_SIZE << order);
 }
 
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
@@ -367,6 +377,42 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
return num_freed_pages;
 }
 
+static unsigned int leak_balloon_cont(struct virtio_balloon *vb, size_t num)
+{
+   unsigned int num_freed_pages;
+   struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+   LIST_HEAD(pages);
+   size_t num_pages;
+
+   mutex_lock(&vb->balloon_lock);
+   for (vb->num_pfns = 0, num_freed_pages = 0;
+vb->num_pfns < ARRAY_SIZE(vb->pfns) && num_freed_pages < num;
+vb->num_pfns += 2,
+num_freed_pages += num_pages << (PAGE_SHIFT - 
VIRTIO_BALLOON_PFN_SHIFT)) {
+   struct page *page;
+
+   num_pages = balloon_page_list_dequeue_cont(vb_dev_info, &pages, 
&page,
+   min_t(size_t,
+ 
VIRTIO_BALLOON_DEFLATE_MAX_PAGES_NUM,
+ num - num_freed_pages));
+   if (!num_pages)
+   break;
+   set_page_pfns_size(vb, vb->pfns + vb->num_pfns, page, num_pages 
<< PAGE_SHIFT);
+   }
+   vb->num_pages -= num_freed_pages;
+
+   /*
+* Note that if
+* virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
+* is true, we *have* to do it in this order
+*/
+   if (vb->num_pfns != 0)
+   tell_host(vb, vb->deflate_cont_vq);
+   release_pages_balloon(vb, &pages);
+   mutex_unlock(&vb->balloon_lock);
+   return num_freed_pages;
+}
+
 static inline void update_stat(struct virtio_balloon *vb, int idx,
   u16 tag, u64 val)
 {
@@ -551,8 +597,12 @@ static void update_balloon_size_func(struct work_struct 
*work)
 
if (diff > 0)

[RFC for qemu v4 0/2] virtio-balloon: Add option cont-pages to set VIRTIO_BALLOON_F_CONT_PAGES

2020-07-15 Thread Hui Zhu
Code of current version for Linux and qemu is available in [1] and [2].
Update of this version:
1. Report continuous pages will increase the speed.  So added deflate
   continuous pages.
2. According to the comments from David in [3], added 2 new vqs icvq and
   dcvq to get continuous pages with format 32 bits pfn and 32 bits size.

Following is an introduction of the function.
Setting the option cont-pages to on enables the flag VIRTIO_BALLOON_F_CONT_PAGES.
qemu then gets continuous pages from icvq and dcvq and applies madvise
MADV_DONTNEED (inflate) or MADV_WILLNEED (deflate) to the pages.
Enabling this flag brings two benefits:
1. It increases the speed of balloon inflate and deflate.
2. It decreases the number of split THPs in the host.
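
For reference, a minimal sketch of one icvq/dcvq entry as this series lays it
out.  The struct name and the example values are illustrative only and not
part of the patch; on the wire the two words are ordinary virtio 32-bit
values, so endianness is handled by the usual virtio32 helpers.

#include <stdio.h>
#include <stdint.h>

/* One icvq/dcvq entry: two virtio 32-bit words. */
struct cont_pages_entry {
    uint32_t pfn;    /* first page of the run: guest physical address >> 12 */
    uint32_t size;   /* length of the run in bytes */
};

int main(void)
{
    /* Example: a 2 MiB run starting at guest physical address 0x40000000. */
    uint64_t gpa = 0x40000000ULL;
    struct cont_pages_entry e = { (uint32_t)(gpa >> 12), 2 * 1024 * 1024 };

    printf("pfn 0x%x, size %u bytes\n", e.pfn, e.size);
    return 0;
}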

[1] https://github.com/teawater/linux/tree/balloon_conts
[2] https://github.com/teawater/qemu/tree/balloon_conts
[3] https://lkml.org/lkml/2020/5/13/1211

Hui Zhu (2):
  virtio_balloon: Add cont-pages and icvq
  virtio_balloon: Add dcvq to deflate continuous pages

 hw/virtio/virtio-balloon.c  |   92 +++-
 include/hw/virtio/virtio-balloon.h  |2
 include/standard-headers/linux/virtio_balloon.h |1
 3 files changed, 63 insertions(+), 32 deletions(-)


[RFC for Linux v4 0/2] virtio_balloon: Add VIRTIO_BALLOON_F_CONT_PAGES to report continuous pages

2020-07-15 Thread Hui Zhu
The first, second and third versions are in [1], [2] and [3].
Code of the current version for Linux and qemu is available in [4] and [5].
Update of this version:
1. Reporting continuous pages increases the speed, so deflating
   continuous pages was added as well.
2. Following David's comments in [6], added 2 new vqs, inflate_cont_vq
   and deflate_cont_vq, that report continuous pages as a 32-bit pfn plus
   a 32-bit size.
Following is an introduction of the function.
These patches add VIRTIO_BALLOON_F_CONT_PAGES to virtio_balloon. With this
flag, the balloon tries to use continuous pages to inflate and deflate.
Enabling this flag brings two benefits:
1. Reporting continuous pages increases the amount of memory reported by
   each tell_host call, which speeds up balloon inflate and deflate
   (a back-of-the-envelope sketch of the gain follows the examples below).
2. Host THPs are split when qemu releases the pages of a balloon inflate.
   Inflating the balloon with continuous pages lets QEMU release the pages
   of the same THPs together, which decreases the number of split THPs in
   the host.
   Following is an example in a VM with 1G memory and 1 CPU.  The test sets
   up an environment with a lot of fragmented pages; inflating the balloon
   then splits the THPs.
// This is the THP number before VM execution in the host.
// No THP is in use yet.
cat /proc/meminfo | grep AnonHugePages:
AnonHugePages: 0 kB
// After VM start, run usemem
// (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git);
// its punch-holes function generates 400M of fragmented pages in the guest
// kernel.
usemem --punch-holes -s -1 800m &
// This is the THP number after this command in the host.
// Some THP is used by VM because usemem will access 800M memory
// in the guest.
cat /proc/meminfo | grep AnonHugePages:
AnonHugePages:911360 kB
// Connect to the QEMU monitor, set up the balloon, and set its size to 600M.
(qemu) device_add virtio-balloon-pci,id=balloon1
(qemu) info balloon
balloon: actual=1024
(qemu) balloon 600
(qemu) info balloon
balloon: actual=600
// This is the THP number in the host after inflating the balloon.
cat /proc/meminfo | grep AnonHugePages:
AnonHugePages: 88064 kB
// Set the size back to 1024M in the QEMU monitor.
(qemu) balloon 1024
(qemu) info balloon
balloon: actual=1024
// Use usemem to increase the memory usage of QEMU.
killall usemem
usemem 800m
// This is the THP number after this operation.
cat /proc/meminfo | grep AnonHugePages:
AnonHugePages: 65536 kB

The following example switches to the continuous-pages balloon.  The number of
split THPs is decreased.
// This is the THP number before VM execution in the host.
// No THP is in use yet.
cat /proc/meminfo | grep AnonHugePages:
AnonHugePages: 0 kB
// After VM start, usemem's punch-holes function generates 400M of
// fragmented pages in the guest kernel.
usemem --punch-holes -s -1 800m &
// This is the THP number after this command in the host.
// Some THP is used by VM because usemem will access 800M memory
// in the guest.
cat /proc/meminfo | grep AnonHugePages:
AnonHugePages:911360 kB
// Connect to the QEMU monitor, set up the balloon, and set its size to 600M.
(qemu) device_add virtio-balloon-pci,id=balloon1,cont-pages=on
(qemu) info balloon
balloon: actual=1024
(qemu) balloon 600
(qemu) info balloon
balloon: actual=600
// This is the THP number in the host after inflating the balloon.
cat /proc/meminfo | grep AnonHugePages:
AnonHugePages:616448 kB
// Set the size back to 1024M in the QEMU monitor.
(qemu) balloon 1024
(qemu) info balloon
balloon: actual=1024
// Use usemem to increase the memory usage of QEMU.
killall usemem
usemem 800m
// This is the THP number after this operation.
cat /proc/meminfo | grep AnonHugePages:
AnonHugePages:907264 kB
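
A back-of-the-envelope sketch of benefit 1 above.  The constants below are
assumptions for illustration (a 256-entry pfns[] array, 4 KiB guest pages,
order-10 chunks), not values taken from the patch:

#include <stdio.h>

int main(void)
{
    const unsigned long page_size = 4096;  /* assumed 4 KiB guest pages */
    const unsigned long entries = 256;     /* assumed size of the pfns[] array */
    const unsigned int order = 10;         /* assumed largest chunk order (4 MiB) */

    /* Plain inflate: one pfn entry per 4 KiB page. */
    unsigned long plain = entries * page_size;

    /* Continuous inflate: two entries (pfn, size) per chunk, each up to 4 MiB. */
    unsigned long cont = (entries / 2) * (page_size << order);

    printf("one tell_host covers: plain %lu KiB, continuous up to %lu MiB\n",
           plain / 1024, cont / (1024 * 1024));
    return 0;
}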

[1] https://lkml.org/lkml/2020/3/12/144
[2] https://lore.kernel.org/linux-mm/1584893097-12317-1-git-send-email-teawa...@gmail.com/
[3] https://lkml.org/lkml/2020/5/12/324
[4] https://github.com/teawater/linux/tree/balloon_conts
[5] https://github.com/teawater/qemu/tree/balloon_conts
[6] https://lkml.org/lkml/2020/5/13/1211

Hui Zhu (2):
  virtio_balloon: Add VIRTIO_BALLOON_F_CONT_PAGES and inflate_cont_vq
  virtio_balloon: Add deflate_cont_vq to deflate continuous pages

 drivers/virtio/virtio_balloon.c |  180 +++-
 include/linux/balloon_compaction.h  |   12 ++
 include/uapi/linux/virtio_balloon.h |1
 mm/balloon_compaction.c |  117 +--
 4 files changed, 280 insertions(+), 30 deletions(-)


[RFC for qemu v4 1/2] virtio_balloon: Add cont-pages and icvq

2020-07-15 Thread Hui Zhu
This commit adds a cont-pages option to virtio_balloon.  With this option,
virtio_balloon enables the flag VIRTIO_BALLOON_F_CONT_PAGES
and adds a vq, icvq, to inflate continuous pages.
When VIRTIO_BALLOON_F_CONT_PAGES is set, QEMU gets continuous pages
from icvq and releases them with madvise MADV_DONTNEED.
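
Once QEMU has translated a (pfn, size) pair from icvq to a host virtual
address, the inflate path boils down to something like the sketch below.
It is illustrative only: discard_cont_run() is a made-up name and plain
madvise() stands in for QEMU's ram_block_discard_range():

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Illustrative only: release one continuous run reported by the guest. */
static int discard_cont_run(void *host_addr, uint32_t size)
{
    /* One MADV_DONTNEED for the whole run instead of one per 4 KiB page. */
    return madvise(host_addr, size, MADV_DONTNEED);
}

int main(void)
{
    uint32_t size = 2 * 1024 * 1024;   /* a 2 MiB run, i.e. one host THP */
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return 1;
    return discard_cont_run(buf, size) ? 1 : 0;
}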

Signed-off-by: Hui Zhu 
---
 hw/virtio/virtio-balloon.c  | 80 -
 include/hw/virtio/virtio-balloon.h  |  2 +-
 include/standard-headers/linux/virtio_balloon.h |  1 +
 3 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index a4729f7..d36a5c8 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -65,23 +65,26 @@ static bool virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp,
 
 static void balloon_inflate_page(VirtIOBalloon *balloon,
  MemoryRegion *mr, hwaddr mr_offset,
+ size_t size,
  PartiallyBalloonedPage *pbp)
 {
 void *addr = memory_region_get_ram_ptr(mr) + mr_offset;
 ram_addr_t rb_offset, rb_aligned_offset, base_gpa;
 RAMBlock *rb;
 size_t rb_page_size;
-int subpages;
+int subpages, pages_num;
 
 /* XXX is there a better way to get to the RAMBlock than via a
  * host address? */
 rb = qemu_ram_block_from_host(addr, false, &rb_offset);
 rb_page_size = qemu_ram_pagesize(rb);
 
+size &= ~(rb_page_size - 1);
+
 if (rb_page_size == BALLOON_PAGE_SIZE) {
 /* Easy case */
 
-ram_block_discard_range(rb, rb_offset, rb_page_size);
+ram_block_discard_range(rb, rb_offset, size);
 /* We ignore errors from ram_block_discard_range(), because it
  * has already reported them, and failing to discard a balloon
  * page is not fatal */
@@ -99,32 +102,38 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
 
 rb_aligned_offset = QEMU_ALIGN_DOWN(rb_offset, rb_page_size);
 subpages = rb_page_size / BALLOON_PAGE_SIZE;
-base_gpa = memory_region_get_ram_addr(mr) + mr_offset -
-   (rb_offset - rb_aligned_offset);
 
-if (pbp->bitmap && !virtio_balloon_pbp_matches(pbp, base_gpa)) {
-/* We've partially ballooned part of a host page, but now
- * we're trying to balloon part of a different one.  Too hard,
- * give up on the old partial page */
-virtio_balloon_pbp_free(pbp);
-}
+for (pages_num = size / BALLOON_PAGE_SIZE;
+ pages_num > 0; pages_num--) {
+base_gpa = memory_region_get_ram_addr(mr) + mr_offset -
+   (rb_offset - rb_aligned_offset);
 
-if (!pbp->bitmap) {
-virtio_balloon_pbp_alloc(pbp, base_gpa, subpages);
-}
+if (pbp->bitmap && !virtio_balloon_pbp_matches(pbp, base_gpa)) {
+/* We've partially ballooned part of a host page, but now
+* we're trying to balloon part of a different one.  Too hard,
+* give up on the old partial page */
+virtio_balloon_pbp_free(pbp);
+}
 
-set_bit((rb_offset - rb_aligned_offset) / BALLOON_PAGE_SIZE,
-pbp->bitmap);
+if (!pbp->bitmap) {
+virtio_balloon_pbp_alloc(pbp, base_gpa, subpages);
+}
 
-if (bitmap_full(pbp->bitmap, subpages)) {
-/* We've accumulated a full host page, we can actually discard
- * it now */
+set_bit((rb_offset - rb_aligned_offset) / BALLOON_PAGE_SIZE,
+pbp->bitmap);
 
-ram_block_discard_range(rb, rb_aligned_offset, rb_page_size);
-/* We ignore errors from ram_block_discard_range(), because it
- * has already reported them, and failing to discard a balloon
- * page is not fatal */
-virtio_balloon_pbp_free(pbp);
+if (bitmap_full(pbp->bitmap, subpages)) {
+/* We've accumulated a full host page, we can actually discard
+* it now */
+
+ram_block_discard_range(rb, rb_aligned_offset, rb_page_size);
+/* We ignore errors from ram_block_discard_range(), because it
+* has already reported them, and failing to discard a balloon
+* page is not fatal */
+virtio_balloon_pbp_free(pbp);
+}
+
+mr_offset += BALLOON_PAGE_SIZE;
 }
 }
 
@@ -340,12 +349,21 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
 unsigned int p = virtio_ldl_p(vdev, &pfn);
 hwaddr pa;
+unsigned int psize = BALLOON_PAGE_SIZE;
 
 pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
 offset += 4;
 
-section = memory_region_find(get_system_memory(), pa,
- 

[RFC for Linux v4 1/2] virtio_balloon: Add VIRTIO_BALLOON_F_CONT_PAGES and inflate_cont_vq

2020-07-15 Thread Hui Zhu
This commit adds a new flag, VIRTIO_BALLOON_F_CONT_PAGES, to virtio_balloon.
It also adds a vq, inflate_cont_vq, to inflate continuous pages.
When VIRTIO_BALLOON_F_CONT_PAGES is set, the driver tries to allocate
continuous pages and reports them through inflate_cont_vq.
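
The patch also caps the order of a continuous chunk (see
VIRTIO_BALLOON_INFLATE_MAX_ORDER in the diff below): the order is bounded by
sizeof(__virtio32) * 8 - 1 - PAGE_SHIFT, so that PAGE_SIZE << order still
fits in the 32-bit size field, and by MAX_ORDER - 1.  A quick check of that
arithmetic, assuming 4 KiB pages and MAX_ORDER = 11 (assumed values, not
taken from the patch):

#include <stdio.h>

int main(void)
{
    int page_shift = 12;                 /* assumed 4 KiB pages */
    int max_order = 11;                  /* assumed buddy allocator MAX_ORDER */
    int field_cap = 32 - 1 - page_shift; /* keep PAGE_SIZE << order within 32 bits */
    int buddy_cap = max_order - 1;
    int order_cap = field_cap < buddy_cap ? field_cap : buddy_cap;

    printf("order cap %d -> largest chunk %d KiB\n", order_cap, 4 << order_cap);
    return 0;
}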

Signed-off-by: Hui Zhu 
---
 drivers/virtio/virtio_balloon.c | 119 ++--
 include/linux/balloon_compaction.h  |   9 ++-
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/balloon_compaction.c |  41 ++---
 4 files changed, 142 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f157d2..b89f566 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,9 @@
(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))
 #define VIRTIO_BALLOON_HINT_BLOCK_PAGES (1 << VIRTIO_BALLOON_HINT_BLOCK_ORDER)
 
+#define VIRTIO_BALLOON_INFLATE_MAX_ORDER min((int) (sizeof(__virtio32) * BITS_PER_BYTE - \
+   1 - PAGE_SHIFT), (MAX_ORDER-1))
+
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
 #endif
@@ -52,6 +55,7 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_STATS,
VIRTIO_BALLOON_VQ_FREE_PAGE,
VIRTIO_BALLOON_VQ_REPORTING,
+   VIRTIO_BALLOON_VQ_INFLATE_CONT,
VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -61,7 +65,7 @@ enum virtio_balloon_config_read {
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq, *inflate_cont_vq;
 
/* Balloon's own wq for cpu-intensive work items */
struct workqueue_struct *balloon_wq;
@@ -126,6 +130,9 @@ struct virtio_balloon {
/* Free page reporting device */
struct virtqueue *reporting_vq;
struct page_reporting_dev_info pr_dev_info;
+
+   /* Current order of inflate continuous pages - VIRTIO_BALLOON_F_CONT_PAGES */
+   __u32 current_pages_order;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -208,19 +215,59 @@ static void set_page_pfns(struct virtio_balloon *vb,
  page_to_balloon_pfn(page) + i);
 }
 
+static void set_page_pfns_order(struct virtio_balloon *vb,
+   __virtio32 pfns[], struct page *page,
+   unsigned int order)
+{
+   if (order == 0)
+   return set_page_pfns(vb, pfns, page);
+
+   /* Set the first pfn of the continuous pages.  */
+   pfns[0] = cpu_to_virtio32(vb->vdev, page_to_balloon_pfn(page));
+   /* Set the size of the continuous pages.  */
+   pfns[1] = PAGE_SIZE << order;
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
unsigned num_allocated_pages;
-   unsigned num_pfns;
+   unsigned int num_pfns, pfn_per_alloc;
struct page *page;
LIST_HEAD(pages);
+   bool is_cont = vb->current_pages_order != 0;
 
-   /* We can only do one array worth at a time. */
-   num = min(num, ARRAY_SIZE(vb->pfns));
-
-   for (num_pfns = 0; num_pfns < num;
-num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   struct page *page = balloon_page_alloc();
+   if (is_cont)
+   pfn_per_alloc = 2;
+   else
+   pfn_per_alloc = VIRTIO_BALLOON_PAGES_PER_PAGE;
+
+   for (num_pfns = 0, num_allocated_pages = 0;
+num_pfns < ARRAY_SIZE(vb->pfns) && num_allocated_pages < num;
+num_pfns += pfn_per_alloc,
+num_allocated_pages += VIRTIO_BALLOON_PAGES_PER_PAGE << vb->current_pages_order) {
+   struct page *page;
+
+   for (; vb->current_pages_order >= 0; vb->current_pages_order--) {
+   if (vb->current_pages_order &&
+   num - num_allocated_pages <
+   VIRTIO_BALLOON_PAGES_PER_PAGE << vb->current_pages_order)
+   continue;
+   page = balloon_pages_alloc(vb->current_pages_order);
+   if (page) {
+   /* If the first allocated page is not continuous pages,
+* go back to transporting pages as single pages.
+*/
+   if (is_cont && num_pfns == 0 && !vb->current_pages_order) {
+   is_cont = false;
+   pfn_per_alloc = VIRTIO_BALLOON_PAGES_PER_PAGE;
+   }
+   set_page_private(page, vb->current_pages_order);
+   balloon_page_push(&pages, page);
+   br

[RFC for qemu v4 2/2] virtio_balloon: Add dcvq to deflate continuous pages

2020-07-15 Thread Hui Zhu
This commit adds a vq dcvq to deflate continuous pages.
When VIRTIO_BALLOON_F_CONT_PAGES is set, try to get continuous pages
from icvq and use madvise MADV_WILLNEED with the pages.
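
On the host side the deflate path mirrors the inflate one: the reported size
is trimmed to whole host pages and the run gets a single MADV_WILLNEED.  A
minimal sketch under the assumption of a plain anonymous mapping;
willneed_cont_run() is a made-up name standing in for balloon_deflate_page():

#include <stddef.h>
#include <sys/mman.h>

/* Illustrative only: hint one continuous run back in on deflate. */
static int willneed_cont_run(void *host_addr, size_t size, size_t host_page_size)
{
    size &= ~(host_page_size - 1);  /* trim to whole host pages, as the patch does */
    return madvise(host_addr, size, MADV_WILLNEED);
}

int main(void)
{
    size_t size = 2 * 1024 * 1024;
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return 1;
    return willneed_cont_run(buf, size, 4096) ? 1 : 0;
}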

Signed-off-by: Hui Zhu 
---
 hw/virtio/virtio-balloon.c | 14 +-
 include/hw/virtio/virtio-balloon.h |  2 +-
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index d36a5c8..165adf7 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -138,7 +138,8 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
 }
 
 static void balloon_deflate_page(VirtIOBalloon *balloon,
- MemoryRegion *mr, hwaddr mr_offset)
+ MemoryRegion *mr, hwaddr mr_offset,
+ size_t size)
 {
 void *addr = memory_region_get_ram_ptr(mr) + mr_offset;
 ram_addr_t rb_offset;
@@ -153,10 +154,11 @@ static void balloon_deflate_page(VirtIOBalloon *balloon,
 rb_page_size = qemu_ram_pagesize(rb);
 
 host_addr = (void *)((uintptr_t)addr & ~(rb_page_size - 1));
+size &= ~(rb_page_size - 1);
 
 /* When a page is deflated, we hint the whole host page it lives
  * on, since we can't do anything smaller */
-ret = qemu_madvise(host_addr, rb_page_size, QEMU_MADV_WILLNEED);
+ret = qemu_madvise(host_addr, size, QEMU_MADV_WILLNEED);
 if (ret != 0) {
 warn_report("Couldn't MADV_WILLNEED on balloon deflate: %s",
 strerror(errno));
@@ -354,7 +356,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
 offset += 4;
 
-if (vq == s->icvq) {
+if (vq == s->icvq || vq == s->dcvq) {
 uint32_t psize_ptr;
+if (iov_to_buf(elem->out_sg, elem->out_num, offset, &psize_ptr, 4) != 4) {
 break;
@@ -383,8 +385,9 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 balloon_inflate_page(s, section.mr,
  section.offset_within_region,
  psize, &pbp);
-} else if (vq == s->dvq) {
-balloon_deflate_page(s, section.mr, section.offset_within_region);
+} else if (vq == s->dvq || vq == s->dcvq) {
+balloon_deflate_page(s, section.mr, section.offset_within_region,
+ psize);
 } else {
 g_assert_not_reached();
 }
@@ -838,6 +841,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
 
 if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
 s->icvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
+s->dcvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
 }
 
 reset_stats(s);
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 6a2514d..848a7fb 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -42,7 +42,7 @@ enum virtio_balloon_free_page_report_status {
 
 typedef struct VirtIOBalloon {
 VirtIODevice parent_obj;
-VirtQueue *ivq, *dvq, *svq, *free_page_vq, *icvq;
+VirtQueue *ivq, *dvq, *svq, *free_page_vq, *icvq, *dcvq;
 uint32_t free_page_report_status;
 uint32_t num_pages;
 uint32_t actual;
-- 
2.7.4


