We raised a topic about PPR (Per Page Recycler) earlier, and thanks to Jan Kiszka
for his advice. Here we break the patch up and explain the code in detail. There
is too much to cover in one message, so we will do it part by part. The original
mails and patches are linked at the end.

1.  Why another page recycler?

Freed memory is normally returned to the kernel in batches. A user-mode
application calls munmap only once a freed chunk has grown big enough; in the VM
world, the balloon driver is triggered only when the amount of free memory is
worth collecting. PPR offers a way to reclaim each freeable page individually,
because doing so costs little CPU and needs no trap.

Having APPs or VMs release freed pages to the kernel, instead of keeping them
reserved, lets us use memory more efficiently. We started testing with the
virtual machine scenario because the effect is most visible there: in our
experiment we can run 516 VMs with PPR, compared to 60+ without it. The same
approach also works for normal applications. Below we refer to both VMs and
applications as "APPs" for short.

2.  Basic Method:

Let us begin with a question: can an APP set a "freed" mark in the first bytes
of a page, so that the kernel can tell at a glance that the page is reclaimable?
By itself this is NOT possible, because page content is arbitrary: no particular
value can be reserved to mean "reclaimable". Instead, we let the first bytes of
a freed page indicate a location in a pool of freed-page pointers; the pointers
in that pool are the reliable proof that a page is freeable. A stale or bogus
indicator merely leads to a pointer that does not match, and causes no further
trouble. We call this method "PIP" (Pointer Indicator Pair).

In some cases page content is already scanned periodically; page deduplication
is one example. If the scanner can tell from the first bytes that a page is
reclaimable, the rest of the scan can be skipped. PPR on its own is very cheap,
and it saves both CPU and memory when combined with such scanners. Costs and
test results can be found in the original mails linked below.

3.  Code Break Up:

The APP side:
Page free hook: virt_mark_page_release() (page_reclaim_guest.c)
The free hook is called when a page is about to be freed. It marks the beginning
of the page with an indicator: it allocates a slot from the pointer pool and
sets that pointer to the freed page, so the page can be recycled within seconds.
The allocation is quite simple because we can assume the pool is big enough and
reclaim happens promptly, so in most cases the head does not catch up with the
tail. The pool is large but does not consume much memory: while it is empty and
zeroed, it can be shrunk by page deduplication.

int virt_mark_page_release(struct page *page)
{
    int pool_id;
    unsigned long long alloc_id;
    unsigned long long state;
    unsigned long long idx;
    volatile struct virt_release_mark *mark;
    unsigned long long time_begin = 0;

    if (!guest_mem_pool) {
        clear_page_content((void *)page);
        set_guest_page_clear_ok();
        return -1;
    }
    if (!pone_page_reclaim_enable) {
        reset_guest_page_clear_ok();
        return -1;
    }
    time_begin = rdtsc_ordered();
    pool_id = guest_mem_pool->pool_id;
    /* allocate a slot in the shared memory pool; its default content is 0 */
    alloc_id = atomic64_add_return(1, (atomic64_t *)&guest_mem_pool->alloc_idx) - 1;
    idx = alloc_id % guest_mem_pool->desc_max;
    state = guest_mem_pool->desc[idx];

    mark = get_page_content((void *)page);
    if (0 == state) {
        /* store the reclaim identification (the gfn) in the shared mem slot */
        if (0 != atomic64_cmpxchg((atomic64_t *)&guest_mem_pool->desc[idx],
                                  0, page_to_pfn(page))) {
            /* the allocated slot was taken by another thread;
             * make the release mark invalid */
            pool_id = guest_mem_pool->pool_max + 1;
            idx = guest_mem_pool->desc_max + 1;
            //atomic64_add(1, (atomic64_t *)&guest_mem_pool->mark_release_err_conflict);
        } else {
            //atomic64_add(1, (atomic64_t *)&guest_mem_pool->mark_release_ok);
        }
        /* write the release mark at the beginning of the freed page */
        mark->pool_id = pool_id;
        mark->alloc_id = idx;
        barrier();
        mark->desc = guest_mem_pool->mem_ind;
        barrier();
        put_page_content((void *)mark);
        PONE_TIMEPOINT_SET(page_reclaim_free_ok, rdtsc_ordered() - time_begin);
        return 0;
    } else {
        /* the allocated slot is used by another thread; the release mark
         * is invalid */
        mark->pool_id = guest_mem_pool->pool_max + 1;
        mark->alloc_id = guest_mem_pool->desc_max + 1;
        barrier();
        mark->desc = guest_mem_pool->mem_ind;
        barrier();
        put_page_content((void *)mark);
    }
    //atomic64_add(1, (atomic64_t *)&guest_mem_pool->mark_release_err_state);
    PONE_TIMEPOINT_SET(page_reclaim_free_fail, rdtsc_ordered() - time_begin);
    return -1;
}

Page allocation hook: virt_mark_page_alloc() (page_reclaim_guest.c)
The allocation hook is called when a page is allocated. If the page has not been
recycled yet, it still begins with an indicator; in that case the hook undoes
the pointer and indicator in a lockless way. If the beginning is zero, the page
has already been reclaimed: it can safely be handed to the user, leaving the
real allocation to a future Copy-On-Write fault.

int virt_mark_page_alloc(struct page *page)
{
    unsigned long long state;
    unsigned long long idx;
    volatile struct virt_release_mark *mark;
    unsigned long long time_begin = 0;

    if (!guest_mem_pool)
        return 0;
    if (!pone_page_reclaim_enable)
        return 0;

    time_begin = rdtsc_ordered();
    mark = get_page_content((void *)page);

    if (mark->desc == guest_mem_pool->mem_ind) {
        if (mark->pool_id == guest_mem_pool->pool_id) {
            if (mark->alloc_id < guest_mem_pool->desc_max) {
                idx = mark->alloc_id;
                state = guest_mem_pool->desc[mark->alloc_id];
                if (state == page_to_pfn(page)) {
                    /* clear the reclaim identification in the shared mem pool */
                    if (state == atomic64_cmpxchg((atomic64_t *)&guest_mem_pool->desc[idx],
                                                  state, 0)) {
                        //atomic64_add(1, (atomic64_t *)&guest_mem_pool->mark_alloc_ok);
                    } else {
                        /* the identification was already cleared: the host
                         * kernel is reclaiming, or has reclaimed, this page */
                        while (mark->desc != 0)
                            barrier();
                    }
                }
            }
        }
        /* clear the release mark in the page */
        mark->pool_id = 0;
        mark->alloc_id = 0;
        barrier();
        mark->desc = 0;
        barrier();
        put_page_content((void *)mark);
        PONE_TIMEPOINT_SET(page_reclaim_alloc_ok, rdtsc_ordered() - time_begin);
        return 0;
    }
    //atomic64_add(1, (atomic64_t *)&guest_mem_pool->mark_alloc_err_state);
    put_page_content((void *)mark);
    //PONE_TIMEPOINT_SET(page_reclaim_alloc_fail, rdtsc_ordered() - time_begin);
    return -1;
}

Kernel reclaim process: process_virt_page_release() (page_reclaim_host.c)
This function is called from a kernel thread when a page is found to be
reclaimable. It undoes the pointer with a lockless operation and replaces the
page with a zero page. This is protected by the invariant that a page pointed
to by a PIP pointer can never begin with zero.
int process_virt_page_release(void *page_mem, unsigned long identification)
{
    int pool_id = 0;
    unsigned long long alloc_id = 0;
    unsigned long dsc_page_off = 0;
    void *page = NULL;
    void *dsc_page = NULL;
    unsigned long long *dsc = NULL;
    unsigned long new_ident = 0;
    unsigned long cmp_args[8] = {0};
    struct virt_release_mark *mark = page_mem;
    struct virt_mem_pool *pool = NULL;
    unsigned long time_begin = 0;

    pool_id = mark->pool_id;
    alloc_id = mark->alloc_id;
    if (pool_id > MEM_POOL_MAX) {
        if (pool_id != MEM_POOL_MAX + 1)
            PONE_DEBUG("virt mem error \r\n");
        return VIRT_MEM_FAIL;
    }
    if (NULL == mem_pool_addr[pool_id])
        return VIRT_MEM_FAIL;
    pool = mem_pool_addr[pool_id];

    if (alloc_id > pool->desc_max)
        return VIRT_MEM_FAIL;

    time_begin = rdtsc_ordered();
    /* look up the shared mem pool page named by the release mark, where the
     * page reclaim identification is recorded */
    page = get_reclaim_identification_page(pool, mark, &dsc_page_off);
    if (NULL == page)
        return VIRT_MEM_FAIL;
    PONE_TIMEPOINT_SET(ppr_get_ident_page, rdtsc_ordered() - time_begin);
    dsc_page = get_page_content(page);
    dsc = dsc_page + dsc_page_off;
    /* read the page reclaim identification from the shared mem pool page */
    new_ident = *dsc;

    time_begin = rdtsc_ordered();
    /* compare the identification from the pool with the identification
     * argument; if they match, reclaim the page */
    if (VIRT_MEM_OK == compare_reclaim_identification(pool, new_ident,
                                                      identification, cmp_args)) {
        PONE_TIMEPOINT_SET(ppr_cmp_ident, rdtsc_ordered() - time_begin);
        /* clear the identification in the shared mem pool; if it is already
         * cleared, the guest kernel has allocated this page again */
        if (new_ident == atomic64_cmpxchg((atomic64_t *)dsc, new_ident, 0)) {
            time_begin = rdtsc_ordered();
            /* reclaim the page */
            if (VIRT_MEM_OK == replace_reclaim_page(pool, identification, cmp_args)) {
                PONE_TIMEPOINT_SET(ppr_replace_page, rdtsc_ordered() - time_begin);
                put_page_content(page);
                put_page(page);
                return VIRT_MEM_OK;
            }
        } else {
            free_reclaim_cmp_args(pool, cmp_args);
        }
    }

    put_page_content(page);
    put_page(page);
    return VIRT_MEM_FAIL;
}


Kernel scanning process: splitter_daemon_thread() (slice_state_daemon.c)
This is the body of the daemon thread. It periodically scans memory for
deduplication purposes; when it finds a page beginning with a PIP indicator,
it delivers the page to the reclaim entry.
static int splitter_daemon_thread(void *data)
{
    int i = 0;
    int j = 0;
    long long slice_num = 0;
    long long slice_state = 0;
    unsigned long slice_idx = 0;
    unsigned long slice_begin = 0;
    int volatile_oper = 0;
    int need_repeat = 0;
    unsigned int scan_count = 0;
    unsigned long long start_jiffies = 0;
    unsigned long long end_jiffies = 0;
    unsigned long long cost_time = 0;
    unsigned long long slice_vcnt = 0;
    unsigned long long slice_scan_cnt = 0;
    long que_id;
    struct page *page = NULL;
    void *page_addr = NULL;
    unsigned long long time_begin = 0;

    __set_current_state(TASK_RUNNING);

    do {
        volatile_oper = 0;
        need_repeat = 0;
        scan_count++;
        if ((scan_count % pone_daemon_merge_scan) == 0)
            volatile_oper = 1;
        start_jiffies = get_jiffies_64();
        if (pone_daemon_run) {
            for (i = 0; i < global_block->node_num; i++) {
                slice_num = global_block->slice_node[i].slice_num;
                slice_begin = global_block->slice_node[i].slice_start;

                for (j = 0; j < slice_num; j++) {
                    slice_state = get_slice_state(i, j);

                    if ((SLICE_VOLATILE == slice_state) || (SLICE_WATCH == slice_state)) {
                        slice_idx = slice_begin + j;
                        page = pfn_to_page(slice_idx);

                        if ((SLICE_VOLATILE == slice_state) && (pone_page_reclaim_enable == 1)) {
                            /* if the page state is volatile, check whether
                             * the page carries a PPR release mark */
                            page_addr = kmap_atomic(page);
                            if (PONE_OK == is_virt_page_release(page_addr)) {
                                /* the page has a release mark: queue it for
                                 * processing; otherwise check whether the
                                 * merge period has been reached */
                                kunmap_atomic(page_addr);
                                goto get_que;
                            }
                            kunmap_atomic(page_addr);
                        }

                        if (!volatile_oper)
                            continue;
                        /* the merge period has been reached: watch-state
                         * merge processing */
                        if (SLICE_WATCH == slice_state) {
                            /* change state from watch to watch_que and send
                             * the page to the daemon order queue */
                            if (0 != change_slice_state(i, j, SLICE_WATCH, SLICE_WATCH_QUE)) {
                                need_repeat++;
                                continue;
                            }
                            slice_daemon_find_watch++;
                            /* processing of this queue is load balanced */
                            lfo_write(slice_daemon_order_que, 48, (unsigned long)page);
                            continue;
                        }

                        /* volatile-state merge processing; the volatile cnt
                         * is an optimization: when a watched page is modified
                         * the volatile cnt is incremented, and we scan the
                         * page again in the next merge period */
get_cnt:
                        if (0 != (slice_vcnt = get_slice_volatile_cnt(i, j))) {
                            slice_scan_cnt = get_slice_scan_cnt(i, j);
                            if (slice_scan_cnt == slice_vcnt) {
                                if (0 != change_slice_scan_cnt(i, j, slice_scan_cnt, 0))
                                    goto get_cnt;
                                atomic64_add(1, (atomic64_t *)&slice_daemon_volatile_cnt[slice_vcnt]);
                            } else {
                                if (slice_scan_cnt > slice_vcnt) {
                                    printk("daemon cnt bug bug bug bug %lld,%lld \r\n",
                                           slice_scan_cnt, slice_vcnt);
                                    if (0 == slice_vcnt)
                                        continue;
                                }
                                if (0 != change_slice_scan_cnt(i, j, slice_scan_cnt,
                                                               slice_scan_cnt + 1))
                                    goto get_cnt;
                                continue;
                            }
                        }

                        /* processing the volatile-state queue may mprotect
                         * the page; to avoid the page table lock we dispatch
                         * pages of the same process to the same queue */
get_que:
                        que_id = pone_get_slice_que_id(page);
                        if ((-1 == que_id) || (0 == que_id))
                            continue;
                        que_id = hash_64(que_id, 48);
                        que_id = pone_que_stat_lookup(que_id);
                        if (SLICE_VOLATILE == slice_state) {
                            if (0 != change_slice_state(i, j, SLICE_VOLATILE, SLICE_ENQUE)) {
                                need_repeat++;
                                continue;
                            }
                            slice_daemon_find_volatile++;
                            time_begin = rdtsc_ordered();
                            lfo_write(slice_order_que[que_id], 0, (unsigned long)page);
                            PONE_TIMEPOINT_SET(lf_order_que_write, rdtsc_ordered() - time_begin);
                        }
                    }
                }
            }
        }

        end_jiffies = get_jiffies_64();

        cost_time = jiffies_to_msecs(end_jiffies - start_jiffies);
        daemon_sleep_period_in_loop++;
        if (cost_time > pone_daemon_base_scan_period)
            msleep(pone_daemon_base_scan_period);
        else
            msleep(pone_daemon_base_scan_period - cost_time);
    } while (!kthread_should_stop());
    return 0;
}
The content of the original emails and patches can be found here:
PPR description:
https://github.com/baibantech/dynamic_vm/wiki/PPR-Details
Patch:
https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5
DynamicVM project (includes both technologies):
https://github.com/baibantech/dynamic_vm.git
User's guide:
https://github.com/baibantech/dynamic_vm/wiki/Dynamic-Vm-Usage
