Very interesting, thanks!

Tim


On Wed, Sep 22, 2010 at 10:36 AM, Venkatesh Srinivas <[email protected]> wrote:

> Hi,
>
> A feature added to DragonFly during the 2.8 development cycle was
> idle-time page zeroing. Stated simply, the system will use some of its
> idle time to zero free pages, possibly saving time when they are
> allocated. Walking through the idle zero code is instructive - it provides
> a view into a number of DragonFly kernel subsystems.
>
> Some background:
>
> The DragonFly (and FreeBSD) virtual memory systems are organized around a
> number of queues, describing all of the page frames in a system. The
> queues are:
>        active := Pages that are actively mapped and in use
>        inactive := Pages that are dirty; these may be mapped, but will be
>                    reclaimed under memory pressure
>        cache := Pages that are clean and reusable, but still hold their
>                 contents until needed under pressure
>        free := Pages not actively holding data, ready for allocation
>
> The cache and free queues are actually divided into a number of
> sub-queues, one for each cache color, but they function as single queues.
> They are also loosely sorted, with zeroed pages at the tail.
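>
> As a rough sketch of the 'zeroed pages at the tail' policy, an insertion
> path might look like the following; this is illustrative only, using the
> standard sys/queue.h TAILQ macros, and the queue/field names here are not
> the exact kernel structures:
>        > /* Keep pages that already carry PG_ZERO at the tail, so that
>        >  * VM_ALLOC_ZERO allocations can pull from the tail cheaply. */
>        > if (m->flags & PG_ZERO)
>        >         TAILQ_INSERT_TAIL(&fq->pl, m, pageq);
>        > else
>        >         TAILQ_INSERT_HEAD(&fq->pl, m, pageq);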
>
> Page allocation requests, for example from a user process zero-fill fault,
> need pages of zeroes. The fault handler code will call vm_page_alloc
> (found in /usr/src/sys/vm/vm_page.c) with the VM_ALLOC_ZERO flag set,
> which will take a page from the tail of the free queue, if available. If
> the page was not already zeroed, it will be zeroed - by the caller! Having
> zeroed pages around saves that time.
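>
> As a hedged sketch of that caller-side pattern (the names mirror the
> FreeBSD/DragonFly VM code, but treat this as an illustration rather than
> the verbatim fault-handler code):
>        > /* Prefer a page that is already zeroed, if one is on the queue. */
>        > m = vm_page_alloc(object, pindex, VM_ALLOC_NORMAL | VM_ALLOC_ZERO);
>        > if ((m->flags & PG_ZERO) == 0)
>        >         vm_page_zero_fill(m);   /* caller pays for the zeroing */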
>
> In DragonFly, the idle zero logic runs in its own LWKT, which runs at
> system idle time. The LWKT is somewhat atypical - it works pretty hard to
> get out of the way, at some cost to its own idle zero rate. (In FreeBSD 4.x,
> it ran as part of the idle loop; in 6.x+, it runs in its own kernel
> thread).
>
> Code time:
>
> The code is in /usr/src/sys/vm/vm_zeroidle.c
> (http://grok.x12.su/source/xref/dragonfly/sys/vm/vm_zeroidle.c) if you'd
> like to follow along.
>
> Typical of such walkthroughs, we will start at the very last line of the
> file:
>
>        > SYSINIT(pagezero, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, pagezero_start, NULL);
>
> SYSINIT is a DFly/FBSD kernel macro that marks a function to be called
> during boot. This SYSINIT invocation says 'call the function
> pagezero_start when starting the VM daemons (SI_SUB_KTHREAD_VM), at any
> point during the VM daemon startup (SI_ORDER_ANY), with a NULL argument'.
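>
> For illustration, registering one's own boot-time hook with SYSINIT would
> look something like this (the hook name and message are made up; only the
> macro and the ordering constants come from above):
>        > static void
>        > example_hook(void __unused *arg)
>        > {
>        >         kprintf("example_hook: running at VM kthread startup\n");
>        > }
>        > SYSINIT(examplehook, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, example_hook, NULL);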
>
> The pagezero_start function, just above the SYSINIT invocation, looks
> like (simplified):
>
>        > static void
>        > pagezero_start(void __unused *arg)
>        > {
>        >         struct thread *td;
>        >
>        >         idlezero_nocache = bzeront_avail;
>        >         kthread_create(vm_pagezero, NULL, &td, "pagezero");
>        > }
>
> This function captures a flag from the platform-specific code: whether the
> bzeront function is available. (On i386 systems with SSE2, we can use the
> MOVNTI instruction to zero pages without polluting the processor's data
> cache with lots of zeroes; this flag indicates whether MOVNTI is
> available.) The function then kicks off an LWKT, named 'pagezero', running
> the vm_pagezero function. The LWKT starts up with the MP lock held.
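>
> Since the thread starts with the MP lock held, a thread that wants to stay
> out of everyone's way can drop the lock early. A hedged sketch of that
> shape (example_kthread is made up; rel_mplock() and lwkt_yield() are real
> DragonFly interfaces):
>        > static void
>        > example_kthread(void __unused *arg)
>        > {
>        >         rel_mplock();           /* we do not need the MP lock */
>        >         for (;;) {
>        >                 /* ... do low-priority background work ... */
>        >                 lwkt_yield();
>        >         }
>        > }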
>
> The vm_pagezero() function, lurking just above in this file, is the core
> of the idle zero logic. It performs some setup work:
>        > lwkt_setpri_self(TDPRI_IDLE_WORK);
>        > lwkt_setcpu_self(globaldata_find(ncpus - 1));
>
> These calls set its priority to just above the idle thread and move it to
> the last CPU in the system. It then enters its main loop.
>
> The idle zero main loop is constructed as a state machine, with a few
> states - IDLE, GET_PAGE, ZERO_PAGE, and RELEASE_PAGE. The main loop
> switches on the current state, executes a small block of code, and then
> transitions states. At each transition, it calls lwkt_yield(), to switch
> to any ready LWKTs on the current CPU.
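>
> In outline, the loop described above has roughly this shape (a simplified
> skeleton of the structure, not the verbatim kernel code):
>        > for (;;) {
>        >         lwkt_yield();            /* let any ready LWKTs run */
>        >         switch (state) {
>        >         case STATE_IDLE:         /* sleep; decide whether to zero */
>        >                 break;
>        >         case STATE_GET_PAGE:     /* pull a page off a free queue */
>        >                 break;
>        >         case STATE_ZERO_PAGE:    /* zero it in small runs */
>        >                 break;
>        >         case STATE_RELEASE_PAGE: /* mark it PG_ZERO, requeue it */
>        >                 break;
>        >         }
>        > }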
>
> The idle state is where the logic starts:
>        > case STATE_IDLE:
>        >       tsleep(&zero_state, 0, "pgzero", sleep_time);
>        >       if (vm_page_zero_check())
>        >               npages = idlezero_rate / 10;
>        >       sleep_time = vm_page_zero_time();
>        >       if (npages)
>        >               state = STATE_GET_PAGE;
>        >       break;
>
> In the idle state, the idle zero LWKT sleeps for 'sleep_time'; when there
> are no pages to zero, sleep_time is a long interval - 'LONG_SLEEP_TIME',
> ten times the system clock; when there are, we sleep for
> 'DEFAULT_SLEEP_TIME', a tenth of the system clock. When the LWKT wakes from
> its sleep, it calls vm_page_zero_check(), also in this file;
> vm_page_zero_check() returns true if we should be zeroing pages. If so, we
> compute the number of pages to zero and how long to sleep on the next entry
> to the idle state, and transition to the GET_PAGE state. We break between
> transitions to give lwkt_yield() another chance to run.
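>
> vm_page_zero_check() is not shown here, but conceptually it asks 'is idle
> zeroing enabled, and are there too few pre-zeroed pages relative to the
> free page count?'. A rough conceptual sketch of that kind of check - the
> threshold and the exact counter names are illustrative, not the verbatim
> code:
>        > static int
>        > example_zero_check(void)
>        > {
>        >         if (idlezero_enable == 0)
>        >                 return (0);
>        >         /* zero until roughly half the free pages carry PG_ZERO */
>        >         return (vm_page_zero_count < vmstats.v_free_count / 2);
>        > }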
>
> The GET_PAGE state logic looks like:
>        > case STATE_GET_PAGE:
>        >       m = vm_page_free_fromq_fast();
>        >       if (m == NULL) {
>        >               state = STATE_IDLE;
>        >       } else {
>        >               state = STATE_ZERO_PAGE;
>        >               buf = lwbuf_alloc(m);
>        >               pg = (char *)lwbuf_kva(buf);
>        >       }
>        >       break;
>
> In GET_PAGE state we attempt to acquire a page to zero, using a
> relatively new interface, vm_page_free_fromq_fast(). This routine, in
> vm_page.c, attempts to get a page from one of the free queues. If it fails
> to get one, we return to the idle state; otherwise, we prepare to enter the
> ZERO_PAGE state by allocating an lwbuf - a lightweight buffer that gives us
> a kernel virtual mapping of the page we wish to zero.
>
> In the ZERO_PAGE state, we actually zero the page:
>        > case STATE_ZERO_PAGE:
>        >       while (i < PAGE_SIZE) {
>        >               if (idlezero_nocache == 1)
>        >                       bzeront(&pg[i], IDLEZERO_RUN);
>        >               else
>        >                       bzero(&pg[i], IDLEZERO_RUN);
>        >               i += IDLEZERO_RUN;
>        >               lwkt_yield();
>        >       }
>        >       state = STATE_RELEASE_PAGE;
>        >       break;
>
> We loop across the entire page, zeroing 64 bytes (IDLEZERO_RUN) at a time.
> After each 64-byte run, we call lwkt_yield(), switching away if any LWKTs
> are waiting to run. If the MOVNTI instruction is available, we use it via
> bzeront(); otherwise, we use bzero(). When we are done zeroing the page, we
> enter the RELEASE_PAGE state.
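>
> Outside the kernel, the same non-temporal-store idea can be demonstrated
> with the SSE2 intrinsic that compiles down to MOVNTI; this is a standalone
> userspace illustration, not the kernel's bzeront():
>        > #include <emmintrin.h>  /* SSE2 intrinsics */
>        > #include <stddef.h>
>        >
>        > /* Zero 'len' bytes (int-aligned, a multiple of sizeof(int))
>        >  * with non-temporal stores, bypassing the data cache. */
>        > static void
>        > zero_nocache(void *p, size_t len)
>        > {
>        >         char *cp = p;
>        >         size_t i;
>        >
>        >         for (i = 0; i < len; i += sizeof(int))
>        >                 _mm_stream_si32((int *)(cp + i), 0);
>        >         _mm_sfence();   /* order the streaming stores */
>        > }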
>
> In the RELEASE_PAGE state, we tear down the lwbuf and return the page to
> the free queue:
>        >       case STATE_RELEASE_PAGE:
>        >               lwbuf_free(buf);
>        >               vm_page_flag_set(m, PG_ZERO);
>        >               vm_page_free_toq(m);
>        >               state = STATE_GET_PAGE;
>        >               ++idlezero_count;
>        >               break;
>
> We first release the lwbuf; we then mark the page as zeroed and return it
> to the free queue. We transition back to the GET_PAGE state, and bump an
> idlezero counter.
>
> The operation of the idle zero code can be monitored via sysctls - the
> sysctl vm.stats.vm.v_ozfod tracks the total number of zero-fill faults that
> found a zero-filled page waiting for them (vm.stats.vm.v_zfod tracks total
> zero-fill faults). The vm.idlezero_count sysctl tracks the total number of
> pages the idle zero logic has managed to zero.
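>
> For reference, the same counters can also be read programmatically with
> sysctlbyname(3); a minimal userspace sketch, assuming the counters are
> exposed as plain unsigned integers:
>        > #include <sys/types.h>
>        > #include <sys/sysctl.h>
>        > #include <stdio.h>
>        >
>        > int
>        > main(void)
>        > {
>        >         const char *names[] = { "vm.stats.vm.v_zfod",
>        >                                 "vm.stats.vm.v_ozfod",
>        >                                 "vm.idlezero_count" };
>        >         unsigned int i, val;
>        >         size_t len;
>        >
>        >         for (i = 0; i < 3; i++) {
>        >                 len = sizeof(val);
>        >                 if (sysctlbyname(names[i], &val, &len, NULL, 0) == 0)
>        >                         printf("%s = %u\n", names[i], val);
>        >         }
>        >         return (0);
>        > }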
>
> Hopefully this was interesting,
> -- vs
>
