On Wed, 2008-01-30 at 18:28 +0100, Peter Zijlstra wrote:
> Subject: mm: MADV_WILLNEED implementation for anonymous memory
>
> Implement MADV_WILLNEED for anonymous pages by walking the page tables and
> starting asynchronous swap cache reads for all encountered swap pages.
>
> Doing so required a modification to the page table walking library functions.
> Previously ->pte_entry() could be called while holding a kmap_atomic; to
> overcome this problem the pte walker is changed to copy batches of the pmd
> and iterate them.
That's a pretty reasonable approach. My original approach was to buffer a
page's worth of PTEs, with all the attendant malloc annoyances. Then Andrew
and I came up with another fix a while ago, effectively doing a batch of
size 1: mapping and immediately unmapping per PTE. That's basically a no-op
in the !HIGHPTE case but could potentially be expensive with HIGHPTE. Your
approach might be a good complexity/performance middle ground.

Unfortunately, I think we only implemented our fix in one of the relevant
places: the /proc/pid/pagemap code hooks a callback at the pte table level
and then does its own walk across the table. Perhaps I should refactor it
so that it hooks in at the pte entry level of the walker instead.

> +/*
> + * Much of the complication here is to work around CONFIG_HIGHPTE which needs
> + * to kmap the pmd. So copy batches of ptes from the pmd and iterate over
> + * those.
> + */
> +#define WALK_BATCH_SIZE 32
> +
>  static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  			  const struct mm_walk *walk, void *private)
>  {
>  	pte_t *pte;
> +	pte_t ptes[WALK_BATCH_SIZE];
> +	unsigned long start;
> +	unsigned int i;
>  	int err = 0;
>
> -	pte = pte_offset_map(pmd, addr);
>  	do {
> -		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
> -		if (err)
> -			break;
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +		start = addr;
>
> -	pte_unmap(pte);
> +		pte = pte_offset_map(pmd, addr);
> +		for (i = 0; i < WALK_BATCH_SIZE && addr != end;
> +		     i++, pte++, addr += PAGE_SIZE)
> +			ptes[i] = *pte;

Looks like this could be:

	for (i = 0; i < WALK_BATCH_SIZE && addr + i * PAGE_SIZE != end; i++)
		ptes[i] = pte[i];

> +		pte_unmap(pte);
> +
> +		for (i = 0, pte = ptes, addr = start;
> +		     i < WALK_BATCH_SIZE && addr != end;
> +		     i++, pte++, addr += PAGE_SIZE) {
> +			err = walk->pte_entry(pte, addr, addr + PAGE_SIZE,
> +					      private);

and this could be:

	for (i = 0; i < WALK_BATCH_SIZE && addr != end;
	     i++, addr += PAGE_SIZE) {
		err = walk->pte_entry(&ptes[i], addr, addr + PAGE_SIZE,
				      private);

and then we can ditch 'start'. Also, one wonders if setting the batch size
to 1 would convince the compiler to collapse this into a more trivial loop
in the !HIGHPTE case.

-- 
Mathematics is the supreme nostalgia of our time.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/