Very interesting, thanks!

Tim

On Wed, Sep 22, 2010 at 10:36 AM, Venkatesh Srinivas <[email protected]> wrote:

> Hi,
>
> A feature added to DragonFly during the 2.8 development cycle was
> idle-time page zeroing. Stated simply, the system will use some of its
> idle time to zero free pages, possibly saving time when they are
> allocated. Walking through the idle zero code is instructive - it provides
> a view into a number of DragonFly kernel subsystems.
>
> Some background:
>
> The DragonFly (and FreeBSD) virtual memory systems are organized around a
> number of queues, describing all of the page frames in a system. The
> queues are:
>
>   active   := Pages that are actively mapped and in use
>   inactive := Pages that are dirty; these may be mapped, but will be
>               reclaimed under memory pressure
>   cache    := Pages that are clean and reusable, but still hold their
>               contents until needed under pressure
>   free     := Pages not actively holding data, ready for allocation
>
> The cache and free queues are actually divided into a number of
> sub-queues, one for each cache color, but they function as single queues.
> They are also loosely sorted, with zeroed pages at the tail.
>
> Some page allocation requests, for example a user process zero-fill
> fault, need pages of zeroes. The fault handler code will call
> vm_page_alloc (found in /usr/src/sys/vm/vm_page.c) with the VM_ALLOC_ZERO
> flag set, which will take a page from the tail of the free queue, if
> available. If the page was not already zeroed, it will be (by the
> caller!). Having zeroed pages around would save that time.
>
> In DragonFly, the idle zero logic runs in its own LWKT, which runs at
> system idle time. The LWKT is somewhat atypical - it works pretty hard to
> get out of the way, at a cost to its own idle zero rate. (In FreeBSD 4.x,
> it ran as part of the idle loop; in 6.x+, it runs in its own kernel
> thread).
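
To make the "zeroed pages at the tail" idea concrete, here is a minimal user-space sketch. It is not the DragonFly code: `toy_page`, `toy_freeq`, `toy_free` and `toy_alloc_zero` are hypothetical names made up for illustration, and the real queues are per-color lists rather than a small array. It only shows why a zero-fill request that takes from the tail can often skip the zeroing work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_NPAGES    4

/* Hypothetical stand-in for a page frame; not the kernel's struct vm_page. */
struct toy_page {
    bool zeroed;                        /* plays the role of PG_ZERO */
    unsigned char data[TOY_PAGE_SIZE];
};

/* Tiny array-backed queue: slot 0 is the head, slot count-1 is the tail. */
struct toy_freeq {
    struct toy_page *slots[TOY_NPAGES];
    size_t count;
};

/* Free a page: pre-zeroed pages go to the tail, all others to the head. */
static void toy_free(struct toy_freeq *q, struct toy_page *p)
{
    assert(q->count < TOY_NPAGES);
    if (p->zeroed) {
        q->slots[q->count] = p;
    } else {
        /* Shift existing entries up by one to open slot 0. */
        memmove(&q->slots[1], &q->slots[0], q->count * sizeof(q->slots[0]));
        q->slots[0] = p;
    }
    q->count++;
}

/*
 * Zero-fill allocation: take from the tail, where zeroed pages collect.
 * *was_prezeroed reports whether the zeroing work could be skipped.
 * (In the kernel the caller zeroes the page; it is folded in here to keep
 * the sketch short.)
 */
static struct toy_page *toy_alloc_zero(struct toy_freeq *q, bool *was_prezeroed)
{
    if (q->count == 0)
        return NULL;
    struct toy_page *p = q->slots[--q->count];
    *was_prezeroed = p->zeroed;
    if (!p->zeroed)
        memset(p->data, 0, TOY_PAGE_SIZE);
    p->zeroed = false;                  /* the page is about to be used */
    return p;
}
```

With one zeroed and one dirty page on the queue, the first zero-fill allocation gets the zeroed page for free and only the second pays for a memset.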
>
> Code time:
>
> The code is in /usr/src/sys/vm/vm_zeroidle.c
> (http://grok.x12.su/source/xref/dragonfly/sys/vm/vm_zeroidle.c) if you'd
> like to follow along.
>
> Typical of such walkthroughs, we will start at the very last line of the
> file:
>
>     SYSINIT(pagezero, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, pagezero_start, NULL);
>
> SYSINIT is a DFly/FBSD kernel macro, which marks a function to be called
> during boot. This SYSINIT invocation is saying 'call the function
> pagezero_start, when starting the VM daemons (SI_SUB_KTHREAD_VM), at any
> point during the VM daemon startup (SI_ORDER_ANY), with NULL args'.
>
> The pagezero_start function, just above the SYSINIT invocation, looks
> like (simplified):
>
>     static void
>     pagezero_start(void __unused *arg)
>     {
>             struct thread *td;
>
>             idlezero_nocache = bzeront_avail;
>             kthread_create(vm_pagezero, NULL, &td, "pagezero");
>     }
>
> This function captures a flag from the platform specific code: whether
> the bzeront function is available. (On SSE2 i386 systems, we use the
> MOVNTI instruction to zero pages, which avoids polluting a processor's
> data cache with lots of zeroes; this flag indicates whether MOVNTI is
> available.) The function then kicks off an LWKT, named 'pagezero',
> running the vm_pagezero function. The LWKT starts up with the MP lock
> held.
>
> The vm_pagezero() function, lurking just above in this file, is the core
> of the idle zero logic. It performs some setup work:
>
>     lwkt_setpri_self(TDPRI_IDLE_WORK);
>     lwkt_setcpu_self(globaldata_find(ncpus - 1));
>
> This sets its priority to just above the idle thread's and moves it to
> the last CPU on the system. It then enters its main loop.
>
> The idle zero main loop is constructed as a state machine, with a few
> states - IDLE, GET_PAGE, ZERO_PAGE, and RELEASE_PAGE. The main loop
> switches on the current state, executes a small block of code, then
> transitions states.
> At each transition, it calls lwkt_yield() to switch
> to any ready LWKTs on the current CPU.
>
> The idle state is the state that the logic starts in:
>
>     case STATE_IDLE:
>             tsleep(&zero_state, 0, "pgzero", sleep_time);
>             if (vm_page_zero_check())
>                     npages = idlezero_rate / 10;
>             sleep_time = vm_page_zero_time();
>             if (npages)
>                     state = STATE_GET_PAGE;
>             break;
>
> In the idle state, the idle zero LWKT sleeps for 'sleep_time'. When there
> are no pages to zero, sleep_time is a long interval - 'LONG_SLEEP_TIME',
> or ten times the system clock; when there are, we sleep for
> 'DEFAULT_SLEEP_TIME', a tenth of the system clock. When the LWKT wakes
> from its sleep, it calls vm_page_zero_check(), also in this file;
> vm_page_zero_check() will be described later, but it returns true if we
> should be zeroing pages. If so, we compute the number of pages to zero
> and how long to sleep on the next entry to the idle state, and transition
> to the GET_PAGE state. We break between transitions, to attempt
> lwkt_yield() again.
>
> The GET_PAGE state logic looks like:
>
>     case STATE_GET_PAGE:
>             m = vm_page_free_fromq_fast();
>             if (m == NULL) {
>                     state = STATE_IDLE;
>             } else {
>                     state = STATE_ZERO_PAGE;
>                     buf = lwbuf_alloc(m);
>                     pg = (char *)lwbuf_kva(buf);
>             }
>             break;
>
> In the GET_PAGE state we attempt to acquire a page to zero, using a
> relatively new interface, vm_page_free_fromq_fast(). This routine, in
> vm_page.c, attempts to get a page from one of the free queues. If it
> fails to get one, we return to the idle state; otherwise, we prepare to
> enter the ZERO_PAGE state. We allocate an lwbuf and bind it to the page
> we wish to zero.
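
The overall shape of the loop (switch on the state, run one small block, then yield) can be sketched in user space. This is a simplified simulation under stated assumptions, not the kernel code: `struct zsim` and `zsim_step` are hypothetical, pages are reduced to counters, the sleep and the npages budget are left out, and a counter stands in for lwkt_yield().

```c
#include <assert.h>
#include <stddef.h>

/* The four states named in the walkthrough. */
enum zstate {
    STATE_IDLE,
    STATE_GET_PAGE,
    STATE_ZERO_PAGE,
    STATE_RELEASE_PAGE
};

/* Hypothetical simulation state; pages are just counters here. */
struct zsim {
    enum zstate state;
    int free_unzeroed;  /* free pages still waiting to be zeroed */
    int zeroed;         /* pages zeroed and released so far */
    int yields;         /* how many times we "yielded" */
};

/* Run one state's block, pick the next state, then yield. */
static void zsim_step(struct zsim *s)
{
    switch (s->state) {
    case STATE_IDLE:
        /* The real code sleeps here, then checks whether to zero. */
        if (s->free_unzeroed > 0)
            s->state = STATE_GET_PAGE;
        break;
    case STATE_GET_PAGE:
        if (s->free_unzeroed == 0) {
            s->state = STATE_IDLE;      /* nothing to take */
        } else {
            s->free_unzeroed--;         /* take one page off the queue */
            s->state = STATE_ZERO_PAGE;
        }
        break;
    case STATE_ZERO_PAGE:
        /* The real code zeroes the page in small runs here. */
        s->state = STATE_RELEASE_PAGE;
        break;
    case STATE_RELEASE_PAGE:
        s->zeroed++;                    /* page goes back, marked zeroed */
        s->state = STATE_GET_PAGE;
        break;
    }
    s->yields++;                        /* stands in for lwkt_yield() */
}
```

Starting in STATE_IDLE with two pages to zero, eight calls to `zsim_step` take the machine through both pages and back to STATE_IDLE, with a yield opportunity after every single step. Keeping each state's block small is what lets the thread get out of the way quickly.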
>
> In the ZERO_PAGE state, we actually zero the page:
>
>     case STATE_ZERO_PAGE:
>             while (i < PAGE_SIZE) {
>                     if (idlezero_nocache == 1)
>                             bzeront(&pg[i], IDLEZERO_RUN);
>                     else
>                             bzero(&pg[i], IDLEZERO_RUN);
>                     i += IDLEZERO_RUN;
>                     lwkt_yield();
>             }
>             state = STATE_RELEASE_PAGE;
>             break;
>
> We loop across the entire page, zeroing 64 bytes at a time. After each
> 64-byte run, we call lwkt_yield(), in case any LWKTs are waiting to run.
> If the MOVNTI instruction is available, we use it via bzeront();
> otherwise, we use bzero(). When we are done zeroing the page, we enter
> the RELEASE_PAGE state.
>
> In the RELEASE_PAGE state, we tear down the lwbuf and return the page to
> the free queue:
>
>     case STATE_RELEASE_PAGE:
>             lwbuf_free(buf);
>             vm_page_flag_set(m, PG_ZERO);
>             vm_page_free_toq(m);
>             state = STATE_GET_PAGE;
>             ++idlezero_count;
>             break;
>
> We first release the lwbuf; we then mark the page as zeroed and return it
> to the free queue. We transition back to the GET_PAGE state, and bump an
> idlezero counter.
>
> The operation of the idle zero code can be monitored via sysctls - the
> sysctl vm.stats.vm.v_ozfod tracks the total number of zero-fill faults
> which found a zero-filled page waiting for them (vm.stats.vm.v_zfod
> tracks total zfod faults). The vm.idlezero_count sysctl tracks the total
> number of pages the idle zero logic has managed to zero-fill.
>
> Hopefully this was interesting,
> -- vs
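
For anyone who wants to play with the ZERO_PAGE idea outside the kernel, here is a minimal user-space sketch of zeroing a page in 64-byte runs with a yield point after each run. The names (`zero_page_in_runs`, `count_yield`, `SKETCH_RUN`) are hypothetical, memset stands in for bzero()/bzeront(), and the yield is just a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096
#define SKETCH_RUN       64     /* matches the 64-byte runs described above */

/* Callback standing in for lwkt_yield(); ctx is caller-defined. */
typedef void (*yield_fn)(void *ctx);

/* Example yield callback: counts how often it was given the chance to run. */
static void count_yield(void *ctx)
{
    ++*(size_t *)ctx;
}

/*
 * Zero one page in SKETCH_RUN-byte runs, calling yield() after each run.
 * Returns the number of runs performed.
 */
static size_t zero_page_in_runs(unsigned char *pg, yield_fn yield, void *ctx)
{
    size_t runs = 0;

    assert(SKETCH_PAGE_SIZE % SKETCH_RUN == 0);
    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i += SKETCH_RUN) {
        memset(&pg[i], 0, SKETCH_RUN);  /* the kernel uses bzero/bzeront */
        if (yield != NULL)
            yield(ctx);                 /* let other work run between runs */
        runs++;
    }
    return runs;
}
```

With these sizes a page takes 64 runs, so there are 64 chances to give up the CPU per page. That is the trade-off the original describes: a lower zeroing rate in exchange for staying out of the way.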
