On Mon, Jul 9, 2018 at 5:44 PM, Dave Hansen <dave.han...@intel.com> wrote:
> ... cc'ing a few folks who I know have been looking at this code
> lately. The full oops is below if any of you want to take a look.
>
> OK, well, annotating the disassembly a bit:
>
>> (gdb) disass free_pages_and_swap_cache
>> Dump of assembler code for function free_pages_and_swap_cache:
>>    0xffffffff8124c0d0 <+0>:     callq  0xffffffff81a017a0 <__fentry__>
>>    0xffffffff8124c0d5 <+5>:     push   %r14
>>    0xffffffff8124c0d7 <+7>:     push   %r13
>>    0xffffffff8124c0d9 <+9>:     push   %r12
>>    0xffffffff8124c0db <+11>:    mov    %rdi,%r12                // %r12 = pages
>>    0xffffffff8124c0de <+14>:    push   %rbp
>>    0xffffffff8124c0df <+15>:    mov    %esi,%ebp                // %ebp = nr
>>    0xffffffff8124c0e1 <+17>:    push   %rbx
>>    0xffffffff8124c0e2 <+18>:    callq  0xffffffff81205a10 <lru_add_drain>
>>    0xffffffff8124c0e7 <+23>:    test   %ebp,%ebp                // test nr==0
>>    0xffffffff8124c0e9 <+25>:    jle    0xffffffff8124c156 <free_pages_and_swap_cache+134>
>>    0xffffffff8124c0eb <+27>:    lea    -0x1(%rbp),%eax
>>    0xffffffff8124c0ee <+30>:    mov    %r12,%rbx                // %rbx = pages
>>    0xffffffff8124c0f1 <+33>:    lea    0x8(%r12,%rax,8),%r14    // load &pages[nr] into %r14?
>>    0xffffffff8124c0f6 <+38>:    mov    (%rbx),%r13              // %r13 = pages[i]
>>    0xffffffff8124c0f9 <+41>:    mov    0x20(%r13),%rdx          //<<<<<<<<<<<<<<<<<<<< GPF here.
>
> %r13 is 64-byte aligned, so looks like a halfway reasonable 'struct page *'.
>
> %R14 looks OK (0xffff93d4abb5f000) because it points to the end of a
> dynamically-allocated (not on-stack) mmu_gather_batch page. %RBX is
> pointing 50 pages up from the start of the previous page. That makes it
> the 48th page in pages[] after a pointer and two integers in the
> beginning of the structure. That 48 is important because it's way
> larger than the on-stack size of 8.
>
> It's hard to make much sense of %R13 (pages[48] / 0xfffbf0809e304bc0)
> because the vmemmap addresses get randomized. But, I _think_ that's too
> high of an address for a 4-level paging vmemmap[] entry. Does anybody
> else know offhand?
>
> I'd really want to see this reproduced without KASLR to make the oops
> easier to read. It would also be handy to try your workload with all
> the pedantic debugging: KASAN, slab debugging, DEBUG_PAGE_ALLOC, etc...
> and see if it still triggers.
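[Editorial aside: a minimal, stand-alone sketch of the arithmetic behind the
pages[48] observation quoted above. It assumes PAGE_SIZE is 4096, that the
dynamically allocated batch is page-aligned, and the mmu_gather_batch header
layout (one pointer plus two unsigned ints) quoted further down, and it reads
"%RBX is 50 slots above the start of the batch page" from the analysis;
slot_of() is a made-up helper for illustration, not kernel code.]

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define BATCH_HDR (8 + 4 + 4)	/* next pointer + nr + max */

/* Which pages[] slot does a pointer into a batch page correspond to? */
static size_t slot_of(uintptr_t p)
{
	uintptr_t batch = p & ~(PAGE_SIZE - 1);	/* start of the batch page */
	return (p - batch - BATCH_HDR) / sizeof(void *);
}

int main(void)
{
	/* %r14 from the oops is the end of the batch page; %rbx sits 50
	 * pointer slots above that page's start. */
	uintptr_t batch_end = 0xffff93d4abb5f000UL;
	uintptr_t rbx = (batch_end - PAGE_SIZE) + 50 * sizeof(void *);

	/* Prints "pages[48]": two slots of header, then index 48, far
	 * beyond the 8-entry on-stack bundle. */
	printf("%%rbx -> pages[%zu]\n", slot_of(rbx));
	return 0;
}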
How can I turn them on at boot time?

> Some relevant functions and structures below for reference.
>
> void free_pages_and_swap_cache(struct page **pages, int nr)
> {
>         for (i = 0; i < nr; i++)
>                 free_swap_cache(pages[i]);
> }
>
> static void tlb_flush_mmu_free(struct mmu_gather *tlb)
> {
>         for (batch = &tlb->local; batch && batch->nr;
>              batch = batch->next) {
>                 free_pages_and_swap_cache(batch->pages, batch->nr);
>         }
>
> zap_pte_range()
> {
>         if (force_flush)
>                 tlb_flush_mmu_free(tlb);
> }
>
> ... all the way up to the on-stack-allocated mmu_gather:
>
> void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>                     unsigned long size)
> {
>         struct mmu_gather tlb;
>
> #define MMU_GATHER_BUNDLE       8
>
> struct mmu_gather {
> ...
>         struct mmu_gather_batch local;
>         struct page             *__pages[MMU_GATHER_BUNDLE];
> }
>
> struct mmu_gather_batch {
>         struct mmu_gather_batch *next;
>         unsigned int            nr;
>         unsigned int            max;
>         struct page             *pages[0];
> };
>
> #define MAX_GATHER_BATCH \
>         ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))

--
H.J.
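[Editorial aside: a small, self-contained sketch of the capacity arithmetic
for the structures quoted above, assuming 4 KiB pages. It only restates
MMU_GATHER_BUNDLE and MAX_GATHER_BATCH to show why an index like pages[48]
implies a dynamically allocated batch rather than the 8-entry on-stack
bundle; it is not kernel code.]

#include <stdio.h>

struct page;				/* opaque for this sketch */

struct mmu_gather_batch {
	struct mmu_gather_batch	*next;
	unsigned int		nr;
	unsigned int		max;
	struct page		*pages[];	/* [0] in the kernel source */
};

#define PAGE_SIZE		4096UL
#define MMU_GATHER_BUNDLE	8
#define MAX_GATHER_BATCH \
	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))

int main(void)
{
	/* Header is 16 bytes, so a one-page batch holds
	 * (4096 - 16) / 8 = 510 page pointers. */
	printf("on-stack bundle: %d pointers\n", MMU_GATHER_BUNDLE);
	printf("heap batch:      %lu pointers\n",
	       (unsigned long)MAX_GATHER_BATCH);
	return 0;
}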