[RFC] A multithread lockless deduplication engine

2017-09-19 Thread XaviLi
PageONE (Page Object Non-duplicate Engine) is a multithreaded kernel page 
deduplication engine. It is based on a lock-less tree algorithm we currently 
call the SD (Static and Dynamic) Tree. Normal operations on this tree, such as 
insert/query/delete, are non-blocking. As far as we have tested, adding more 
CPU cores boosts speed linearly. Multithreading is not only an opportunity to 
work faster; it also allows any CPU to donate spare time to the job, and 
therefore offers a way to use CPUs more efficiently. PageONE is a component of 
an open source solution named Dynamic VM:
https://github.com/baibantech/dynamic_vm.git 

The patch can be found here:
https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5

One work thread of PageONE can match the speed of the KSM daemon, and adding 
more CPUs increases speed linearly. Here is a brief test:

Test environment
DELL R730
Intel® Xeon® E5-2650 v4 (2.20 GHz, 12 cores, 24 threads);
256GB RAM
Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
Qemu: 2.9.0
Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76

We ran 12 VMs together, each creating 16GB of data in memory. After all data 
was ready, we started the dedup engine and watched how the host-side used 
memory changed.

KSM:
Configuration: sleep_millisecs = 0, pages_to_scan = 100
Starting used memory: 216.8G
Result: KSM started merging pages immediately after being turned on. The KSM 
daemon took 100% of one CPU for 13:16 until used memory was reduced to 79.0GB.

PageONE:
Configuration: merge_period(secs) = 20, work threads = 12
Starting used memory: 207.3G
(This means PageONE scans the full physical memory in a 20-second period. 
Pages are merged if not changed for 2 merge_periods.)
Result: In the first two periods PageONE only observes and identifies 
unchanged pages, using little CPU. As the third period began, all 12 threads 
started using 100% CPU to do the real merge work. 00:58 later, used memory was 
reduced to 70.5GB.
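
As an illustration of the merge policy described above (scan every 
merge_period, merge a page only after it has stayed unchanged for two 
periods), a minimal sketch could look like the following. This is not 
PageONE's actual code; page_hash(), try_merge_page() and struct scan_state are 
hypothetical names.

/*
 * Illustrative sketch of the stated scan/merge policy, not PageONE's code.
 * page_hash(), try_merge_page() and struct scan_state are hypothetical.
 */
struct scan_state {
	unsigned long long last_hash;	/* content hash from the previous period */
	unsigned int unchanged_periods;	/* consecutive periods without change */
};

static void scan_one_page(struct page *page, struct scan_state *st)
{
	unsigned long long h = page_hash(page);

	if (h == st->last_hash)
		st->unchanged_periods++;
	else
		st->unchanged_periods = 0;
	st->last_hash = h;

	/* Merge only after the page has survived two full periods unchanged. */
	if (st->unchanged_periods >= 2)
		try_merge_page(page);
}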

The above test used data that is quite easy for KSM's red-black tree: every 
difference can be detected by comparing the first 8 bytes. Then we ran another 
test in which each data block began with a run of zero bytes of random length, 
128 bytes on average, so comparisons have to look deeper. The results are 
shown below:

KSM:
Configuration: sleep_millisecs = 0, pages_to_scan = 100
Starting used memory: 216.8G
Result: it took 19:49 until used memory was reduced to 78.7GB.

PageONE:
Configuration: merge_period(secs) = 20, work threads = 12
Starting used memory: 210.3G
Result: The first 2 periods were the same as above. 1:09 after the merge work 
started, used memory was reduced to 72GB.

PageONE shows little difference between the two tests because an SD tree 
search compares each key bit just once in most cases.
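
That property is what a radix-style (bit-indexed) search gives: the walk is 
driven by the bits of the key, so the cost does not grow when stored keys 
share long common prefixes. The SD tree itself is not shown in this mail; the 
fragment below is only a generic illustration of the idea, with hypothetical 
node fields.

/*
 * Generic bit-indexed lookup, illustrating why search cost does not depend
 * on how long a prefix the keys share: each key bit is consulted at most
 * once.  This is NOT the SD tree implementation; struct sd_node and its
 * fields are hypothetical.
 */
struct sd_node {
	unsigned int bit;		/* which key bit this node tests */
	struct sd_node *child[2];	/* subtree for bit value 0 / 1 */
	void *leaf;			/* non-NULL at leaf nodes */
};

static void *bitwise_lookup(struct sd_node *node, const unsigned char *key)
{
	while (node && !node->leaf) {
		unsigned int b = node->bit;

		/* Descend by the value of key bit b; no bit is revisited. */
		node = node->child[(key[b >> 3] >> (b & 7)) & 1];
	}
	return node ? node->leaf : NULL;
}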

[RFC] Another Para-Virtualization page recycler. Empty Guest OS free pages every few seconds

2017-09-19 Thread XaviLi
PPR (Per Page Recycler) is a para-virtualization driver currently available 
for KVM hosts and Linux/Windows guests. With PPR, every page freed to the 
Guest OS can be recycled by the hypervisor within seconds. VMs can therefore 
dynamically allocate and free pages from the hypervisor according to 
applications' requests. This can result in a memory density boost in many 
cases with little CPU cost. PPR is a component of an open source solution 
named Dynamic VM:
PPR is a component of an open source solution named Dynamic VM:
https://github.com/baibantech/dynamic_vm.git

The patch can be found here:
https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5

The guest side is very simple. PPR hooks the entry of the Guest OS page-free 
interface to mark every freed page. Marked pages can be recycled and replaced 
by the read-only zero page within one scan period. If the page is allocated 
again before the recycling happens, the mark is simply canceled by PPR's 
page-allocation interface hook.

The guest driver works per page, rather than in VirtIO balloon's batch mode. 
The recycling is based on lockless operations; the guest and host do not send 
any interrupts to each other for this.
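
To make "no interrupts" concrete: the guest only writes into a shared memory 
pool, and a host-side daemon polls that pool on its own reclaim period. The 
sketch below only illustrates that rendezvous under assumed names (struct 
shared_pool, reclaim_zero_page()); it is not the actual PPR host code.

/*
 * Illustrative host-side poll: no interrupt or hypercall per page, the
 * daemon simply walks the shared pool once per reclaim period.  The pool
 * layout and reclaim_zero_page() are assumptions, not PPR's code.
 */
static void host_scan_shared_pool(struct shared_pool *pool)
{
	unsigned long long i;

	for (i = 0; i < pool->desc_max; i++) {
		unsigned long long pfn = pool->desc[i];

		if (!pfn)
			continue;

		/* Claim the slot with CAS so guest and host cannot race. */
		if (atomic64_cmpxchg((atomic64_t *)&pool->desc[i], pfn, 0) != pfn)
			continue;

		/* Back the guest page with the read-only zero page. */
		reclaim_zero_page(pfn);
	}
}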

We ran a test on our machine. The host uses about 6500+ CPU cycles on average 
to recycle each page, of which 3900 cycles are spent in the Linux kernel 
page-table operations. The guest hook costs about 300 cycles to mark or unmark 
each page, which is cheap enough in many cases. The daemon costs 5% of one CPU 
when the scan finds nothing to recycle; the other threads only run when there 
is work to do.

Saving memory usually raises a CPU-memory tradeoff, but here things are 
different. Memory-intensive workloads often already use page deduplication, 
and it is far cheaper to recycle a page than to dedup it. In that case PPR 
lets us win both CPU and memory.

We have a test to demonstrate how density is boosted, in which PPR works with 
a multithreaded page deduplication engine named PageONE rather than KSM. 
PageONE is discussed in another topic.

Test environment
DELL R730
Intel® Xeon® E5-2650 v4 (2.20 GHz, 12 cores, 24 threads);
256GB RAM
Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
Qemu: 2.9.0
Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76

The test VMs' memory size was configured to be 4GB. Each VM ran an application 
that allocates and frees memory every 30 minutes; the peak allocation of each 
period is 3GB and the average is 150MB. We ran many such VMs at the same time 
on the same machine to see the total host-side memory cost.


Raw KVM
Result: 67 VMs used 230GB memory. 

KVM + KSM 
Configuration: sleep_millisecs = 0, pages_to_scan = 100
Result: 73 VMs used 230GB of memory. The KSM daemon cost 100% of one CPU.

KVM + PPR + PageONE:
Configuration: 
work threads = 12, reclaim_period = 10 secs, merge_period = 600 secs 
(This means freed pages are recycled every 10 secs, and pages are merged if 
not modified for 2 merge_periods.)
Result: 516 VMs ran together. The total used memory oscillated between 139GB 
and 218GB, averaging 182GB.

PPR and PageONE shared a daemon thread and 12 work threads. We measured their 
CPU cost together because the two features share threads and offload work for 
each other.

The average CPU cost of the work threads is 6.1%; the daemon thread's cost is 
22.1%.

Here we tried other combinations in the same test case:

reclaim_period(sec)  merge_period(sec)  average memory(GB)  work threads' average CPU  daemon's CPU
5                    200                164                 6.8%                       34%
5                    600                176                 6.3%                       35%
10                   200                171                 6.6%                       22%
10                   600                182                 6.1%                       22%
NOTE: This is a high-pressure test; PPR was recycling 3.2TB of memory per 
hour. In normal cases PPR costs far less CPU than that.

[RFC] [Resend] Another Para-Virtualization page recycler -- Code details, Trap-less way to return free pages to kernel

2017-09-21 Thread XaviLi
We raised a topic about PPR (Per Page Recycler) earlier; thanks to Jan Kiszka 
for his advice. Here we break up the patch and explain the code in detail. 
There is too much to explain in one topic, so we would like to do it part by 
part. The content of the original mails and patches can be found at the end.

1.  Why another page recycler?

Freed memory is normally returned to the kernel in batches: user-mode 
applications call munmap when the freed chunk has accumulated and become big 
enough, and in the VM world the balloon driver is triggered when the free 
memory is worth collecting. PPR offers a way to reclaim each free-able page 
individually, because it costs less CPU and is trap-less.

APPs or VMs releasing unused pages to the kernel instead of holding on to 
them means we can use memory more efficiently. We started testing with the 
virtual machine scenario because the effect there is the most obvious: in our 
experiment we can run 516 VMs with PPR, in contrast to 60+ without PPR. The 
approach also works for normal applications. Below we refer to VMs and 
applications simply as APPs.

2.  Basic Method:

Let us begin with a question: can an APP set a “freed” mark in the first 
bytes of a page, so that the kernel can take a glance and know the page is 
reclaim-able? By itself that is NOT possible, because memory content is 
arbitrary; no particular value can be reserved to stand for “reclaim-able”. 
Instead, we let the first bytes of a freed page indicate a location in a pool 
of freed-page pointers, and the pointers in that pool are the reliable proof 
that a page is free-able. A wrong indicator only leads to a non-matching 
pointer and does not cause any further trouble. We call this method “PIP” 
(Pointer Indicator Pair).
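
A minimal sketch of the PIP cross-check on the host side, assuming the mark 
layout used by the guest hook shown later (pool_id, alloc_id) plus a 
hypothetical lookup_pool() helper and shared_pool structure: the indicator at 
the start of the page names a pool slot, and the page is treated as free-able 
only if that slot points back to the same page.

/*
 * Minimal sketch of the PIP (Pointer Indicator Pair) cross-check.
 * lookup_pool() and struct shared_pool are hypothetical; the mark fields
 * follow the guest hook shown in the code break-up below.
 */
static bool pip_page_is_freeable(struct virt_release_mark *mark,
				 unsigned long long page_pfn)
{
	struct shared_pool *pool;

	/* The indicator at the start of the page names a pool slot... */
	pool = lookup_pool(mark->pool_id);
	if (!pool || mark->alloc_id >= pool->desc_max)
		return false;	/* garbage indicator: just skip the page */

	/*
	 * ...and that slot must hold a pointer back to this very page.
	 * Arbitrary page content cannot satisfy both sides of the pair,
	 * so a wrong indicator causes no further trouble.
	 */
	return pool->desc[mark->alloc_id] == page_pfn;
}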

In some cases page contents are scanned periodically anyway; page 
deduplication is one example. If the scanner can tell from the first bytes 
that a page is recycle-able, the rest of the work can be skipped. PPR working 
alone is very cheap, and it can win both CPU and memory when working with 
other scanners. The cost and test results can be found in the original mails 
below.
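
For example, a periodic scanner such as a dedup engine can order its checks 
so that the cheap PIP check on the first bytes comes first, and the expensive 
content work is only done when that check fails. A rough sketch, reusing the 
hypothetical helpers from above (dedup_scan_full_content() is also 
hypothetical):

/*
 * Sketch of a scanner using the PIP check as a shortcut: reclaim the page
 * outright when the cheap check succeeds, otherwise fall through to the
 * expensive full-content work.  Helper names are hypothetical.
 */
static void scan_page(struct page *page)
{
	struct virt_release_mark *mark = get_page_content((void *)page);
	bool freeable = pip_page_is_freeable(mark, page_to_pfn(page));

	put_page_content((void *)mark);

	if (freeable)
		reclaim_zero_page(page_to_pfn(page));	/* cheap path */
	else
		dedup_scan_full_content(page);		/* expensive path */
}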

3.  Code Break Up:

The APP side:
Page Free hook: virt_mark_page_release() (page_reclaim_guest.c)
The free-page hook is called when a page is about to be freed. It marks the 
beginning of the page with an indicator, allocates a pointer slot from the 
pool and sets that pointer to the freed page, so that the page can be recycled 
within seconds. The allocation is quite simple because we can assume the pool 
is big enough and reclaim happens in time, so the head does not catch up with 
the tail in most cases. The pool is large but does not consume much memory, 
because when it is empty and zeroed it can be shrunk by page deduplication.

int virt_mark_page_release(struct page *page)
{
	int pool_id;
	unsigned long long alloc_id;
	unsigned long long state;
	unsigned long long idx;
	volatile struct virt_release_mark *mark;
	unsigned long long time_begin = 0;

	if (!guest_mem_pool) {
		clear_page_content((void *)page);
		set_guest_page_clear_ok();
		return -1;
	}
	if (!pone_page_reclaim_enable) {
		reset_guest_page_clear_ok();
		return -1;
	}

	time_begin = rdtsc_ordered();
	pool_id = guest_mem_pool->pool_id;

	/* The shared memory pool allocates a slot; its default content is 0. */
	alloc_id = atomic64_add_return(1, (atomic64_t *)&guest_mem_pool->alloc_idx) - 1;
	idx = alloc_id % guest_mem_pool->desc_max;
	state = guest_mem_pool->desc[idx];

	mark = get_page_content((void *)page);
	if (state == 0) {
		/*
		 * Store the reclaim identification (the guest frame number)
		 * in the shared memory slot.
		 */
		if (atomic64_cmpxchg((atomic64_t *)&guest_mem_pool->desc[idx],
				     0, page_to_pfn(page)) != 0) {
			/*
			 * The allocated slot was taken by another thread, so
			 * the release mark is written as invalid.
			 */
			pool_id = guest_mem_pool->pool_max + 1;
			idx = guest_mem_pool->desc_max + 1;
			/* atomic64_add(1, (atomic64_t *)&guest_mem_pool->mark_release_err_conflict); */
		} else {
			/* atomic64_add(1, (atomic64_t *)&guest_mem_pool->mark_release_ok); */
		}

		/* Write the release mark at the beginning of the freed page. */
		mark->pool_id = pool_id;
		mark->alloc_id = idx;
		barrier();
		mark->desc = guest_mem_pool->mem_ind;
		barrier();
		put_page_content((void *)mark);
		PONE_TIMEPOINT_SET(page_reclaim_free_ok, rdtsc_ordered() - time_begin);
		return 0;
	} else {
		/*
		 * The allocated slot is still in use by another thread;
		 * write an invalid release mark instead.
		 */
		mark->pool_id = guest_mem_pool->pool_max + 1;
		mark->alloc_id = guest_mem_pool->desc_max + 1;
		barrier();
		mark->desc = guest_mem_pool->mem_ind;
		barrier();
		put_page_content((void *)mark);
	}
	/* atomic64_add(1, (atomic64_t *)&guest_mem_pool->mark_release_err_state); */
	PONE_TIMEPOINT_SET(page_reclaim_free_fail, rdtsc_ordered() - time_begin);
	return -1;
}

Page Allocation hook: virt_mark_pa

Re: [RFC] [Resend] Another Para-Virtualization page recycler -- Code details, Trap-less way to return free pages to kernel

2017-09-22 Thread XaviLi
On Thu, 21 Sep 2017 20:16 +0800
Jonathan Corbet  wrote:
> If you want these patches to be reviewed, you really need to submit a
> proper patch series.  Please look at
> Documentation/process/submitting-patches.rst for all the details.

> You will also want to make the code compliant with the kernel's coding
> style.

Thank you for your advice. We have switched to the upstream Linux version and 
updated the project code.
Now we are working on coding-style compliance. The patch will be refreshed 
once that is finished.

> I have not reviewed this code (nor am I really the person to do a proper
> review), but this jumped at me:

>> while(mark->desc != 0)
>> {
>> barrier();
>> }

> Busy waits in the memory-management code are going to raise a lot of
> eyebrows, and you really need to document what you think that barrier()
> call is doing.  I suspect it's not giving you the protection you think it
> is.

Your advice is very valuable for this piece of code.

This is a special case of a lock-less operation. The PIP pointer acts like a 
ticket: once set, only one host or guest thread can take it (protected by 
CAS). If the APP sees the indicator but fails to fetch the pointer, in the 
normal case it means the kernel is recycling the same page at that moment.
This APP-side code spins on the indicator to wait for the host to finish the 
recycling.
The barrier() call forces the indicator to be re-read on each iteration so 
the result is seen in time. We chose spinning because other waiting methods 
are heavier than the host operation we are waiting for (about 4000 cycles in 
the worst case).

Your comment made us notice that the while() loop may never quit. As far as 
we can imagine, this can arise in two cases:
1. The host kernel is experiencing a serious malfunction.
2. The APP or Guest OS memory pool is corrupted, so that two allocators get 
the same page.

In both cases we have not found a way to survive or finish the job, but that 
is not a proper reason to leave it unhandled. We should also keep in mind to 
document the code that can raise questions.

Thanks again for your precious time and suggestions!


l...@baibantech.com.cn