Hello, for the Hack Week at openSUSE (see http://idea.opensuse.org/) I've been working on a new feature called CONFIG_PAGE_SHIFT.
In the last few days, while reading the topics of the VM summit, I replied that I disliked the dependency on defrag for reliable I/O, and I mentioned I had an alternative design that so far is able to run static binaries. Some discussion came up and I got requests to disclose my stuff sooner rather than later. Frankly I wanted to make a bit more progress before posting it, because I'm still unsure if this is a totally good idea, and the largest part of the userland page fault frequency reduction isn't implemented yet; but OTOH I sure like it more than everything else I've seen so far in this space, and I don't want to hide it any further after explicit requests to disclose it from other vm developers ;).

Some background: the x86 and amd64 architectures only support a fixed 4k page size. The smaller the page size, the less memory is wasted in partially unused ram, but the slower the global performance is. The next available hardware page size is 2M, which is generally too big for general purpose applications, and the x86 ABI requires the mmap offset parameter to work with 4k granularity (the amd64 ABI fixes that problem, but apps have been written for the x86 ABI, so we'd rather keep supporting the 4k file offset granularity in mmap if we want to be sure not to break backwards compatibility with userland, especially for the 32bit compatibility mode).

While there's nothing we can do in software to alleviate the _hardware_ related overhead of the 4k page size (like tlb caching and the frequency of the hardware pagetable walking), the 4k page size ends up hurting many purely software things. The xfs developers for example want to enlarge their filesystem blocksize (the filesystem blocksize has a tradeoff similar to the PAGE_SIZE: the larger, the faster the filesystem, but more disk space is potentially wasted), and they also want to use the normal efficient writeback pagecache behavior when using a writable fs on top of a dvd-ram with a hard blocksize of 64k. But they can't on x86/amd64, because the PAGE_SIZE is still 4k and the whole linux kernel can't handle a blocksize larger than PAGE_SIZE. What they miss is that the problem with the 4k PAGE_SIZE isn't just the maximum blocksize we can support (i.e. a dvd with a 64k hard blocksize): the _whole_ kernel (not only the storage/fs subsystems) is slower because of the 4k thing. This starts from the page faults in a memcpy(), which are double the number they would be with an 8k page size, and all the memory allocations (including slub/slab/slob/whatever) are double, 4-fold or 8-fold the ones that would happen with an 8k/16k/32k page size.

So my whole idea is to once and for all decouple the size of the pte entry (4k on x86/amd64) from the page allocator granularity. The HARD_PAGE_SIZE will still be 4k, while the common code PAGE_SIZE will be variable and configurable at compile time with CONFIG_PAGE_SHIFT. I feel this needs to happen at some point in the linux VM, since once it's done I can't imagine any server running with a 4k page size anymore. Rule number 1: the moment you need to rely on order > 0 allocations for critical things like basic buffered I/O, you must make everything an order > 0 allocation and just boost the PAGE_SIZE. Only vm_pgoff and the other pte manipulations will still be indexed at HARD_PAGE_SIZE granularity; all common code won't notice. The backwards compatibility is provided by tracking vm_pgoff + ((addr & ~PAGE_MASK) >> HARD_PAGE_SHIFT), see hardpfn_offset_to_index.
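To make the compatibility trick concrete, here's a minimal userspace model of that index arithmetic, assuming CONFIG_PAGE_SHIFT = 14 (16k software pages) on top of the fixed 4k hardware pages. The standalone program, the hard_index name and the simplification to the first soft page of the vma are mine for illustration; only the vm_pgoff expression quoted above (the role played by hardpfn_offset_to_index) comes from the actual design:

/* Userspace model of the soft/hard page split: CONFIG_PAGE_SHIFT = 14
 * (an assumption for the example) over the fixed 4k hardware pages. */
#include <stdio.h>

#define HARD_PAGE_SHIFT	12		/* fixed by the x86/amd64 hardware */
#define PAGE_SHIFT	14		/* CONFIG_PAGE_SHIFT, chosen at compile time */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define HARD_PAGE_SIZE	(1UL << HARD_PAGE_SHIFT)

/*
 * File index, in HARD_PAGE_SIZE units, that the pte for "addr" must
 * point at, for the first soft page of a vma: vm_pgoff keeps its 4k
 * granularity, so any old x86 mmap file offset still works.
 * (Hypothetical simplification of the hardpfn_offset_to_index idea.)
 */
static unsigned long hard_index(unsigned long vm_pgoff, unsigned long addr)
{
	return vm_pgoff + ((addr & ~PAGE_MASK) >> HARD_PAGE_SHIFT);
}

int main(void)
{
	unsigned long vm_pgoff = 3;	/* mmap at file offset 12k: not 16k aligned */
	unsigned long addr;

	/* one 16k soft page spans 4 ptes when PAGE_SHIFT - HARD_PAGE_SHIFT = 2 */
	for (addr = 0x4000; addr < 0x4000 + PAGE_SIZE; addr += HARD_PAGE_SIZE)
		printf("addr 0x%lx -> file block %lu (4k units)\n",
		       addr, hard_index(vm_pgoff, addr));
	return 0;
}

Running it shows one 16k soft page backed by the four 4k file blocks 3..6, i.e. ptes that aren't naturally aligned, which is exactly what keeps the old 4k mmap offset granularity working.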
Thanks to anon-vma this should work for anon+mremap too; I still need to figure out some bits there, but I don't see anything fundamentally different (anon-vma's whole point is to reduce the differences in that area and to allow doing with anonymous memory anything we can do with pagecache). The pagecache side apparently already works; it still needs a restart of the pagetable walking loop over the PAGE_SHIFT-HARD_PAGE_SHIFT bits, bounded by vm_start >> HARD_PAGE_SHIFT and vm_end >> HARD_PAGE_SHIFT, so as to reduce the page fault rate. The pte unmapping may be severely broken too. Once finished, this should allow for a totally backwards compatible design without any aliasing in the pagecache (only the ptes won't be naturally aligned, but that's ok: aliasing at the virtual level is a fundamental property of the VM and it always happens).

This whole issue is really a pure tradeoff between memory consumption and I/O and CPU performance (and for the dvd-ram and xfs, also a way to use a larger hard blocksize), so being able to benchmark is the first priority: if there's no significant benchmark gain, this whole thing may be a failure. I'm not talking about the I/O bound side, where the performance boost is guaranteed (exactly the same as with the variable order page cache). 64k (CONFIG_PAGE_SHIFT = 16) is probably the ideal PAGE_SIZE for db servers: only 8 times faster at allocating ram, but without huge ram waste, and with an especially optimal I/O size for ide (and better for scsi too, of course).

Comparison with the "variable order page cache": that design tries to keep the page allocator at 4k and changes the pagecache layer to use order > 0 allocations. The major showstopper with their design is that there's no way they can reliably defrag the kernel memory as long as any driver is still allowed to run alloc_page(). Worst of all, the defragmenter will waste lots of cpu if it has a hard time defragging, so it's not a straightforward tradeoff, and it has corner cases whose underperformance will be hard to evaluate because they normally won't trigger (even if in the best case the I/O performance will be good). Even worse, if it eventually fails to defrag (no guarantees can be made unless certain areas of memory are marked non-generic), I/O reliability will be decreased. So it would need at least a fallback to order 0 to be really reliable. And despite all the above downsides, it provides no advantage except being able to access devices with a hard/soft blocksize larger than 4k (it only tackles the I/O performance side; with a 4k fs it hurts CPU performance if anything). My design solves their troubles (I/O performance) and at the same time it boosts the performance of everything else too. It does require compiling a kernel with a special CONFIG_PAGE_SHIFT, but then you also have to specially create an xfs with a >4k blocksize, so that seems a minor issue (especially for the 1024 cpu systems ;), and in theory CONFIG_PAGE_SHIFT could also become a boot time parameter if we're ok with wasting quite a few cycles at runtime.
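Here's a sketch, again in plain userspace C, of the clamped walk I mean: on a fault, fill all the hard ptes covering the faulting soft page, bounded by the vma limits. The function name and the example vma are made up for illustration; only the vm_start/vm_end clamping at HARD_PAGE_SHIFT granularity is the point being shown:

/* Userspace sketch of the fault-path walk: one fault instantiates all
 * the 4k ptes of the soft page, clamped to the (hypothetical) vma. */
#include <stdio.h>

#define HARD_PAGE_SHIFT	12
#define PAGE_SHIFT	14		/* 16k soft pages, assumed for the example */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define HARD_PAGE_SIZE	(1UL << HARD_PAGE_SHIFT)

static void fill_soft_page(unsigned long fault_addr,
			   unsigned long vm_start, unsigned long vm_end)
{
	/* the soft page containing the fault... */
	unsigned long start = fault_addr & PAGE_MASK;
	unsigned long end = start + PAGE_SIZE;
	unsigned long addr;

	/* ...clamped so we never instantiate ptes outside the vma */
	if (start < vm_start)
		start = vm_start;
	if (end > vm_end)
		end = vm_end;

	for (addr = start; addr < end; addr += HARD_PAGE_SIZE)
		printf("set pte for 0x%lx (hard slot %lu of the soft page)\n",
		       addr, (addr & ~PAGE_MASK) >> HARD_PAGE_SHIFT);
}

int main(void)
{
	/* a vma whose start isn't aligned to the 16k soft page size */
	unsigned long vm_start = 0x5000, vm_end = 0xc000;

	fill_soft_page(0x5000, vm_start, vm_end);	/* partial: 3 ptes */
	fill_soft_page(0x8000, vm_start, vm_end);	/* full soft page: 4 ptes */
	return 0;
}

The first call instantiates only 3 ptes because the vma starts in the middle of a soft page; the second instantiates all 4 in a single fault, which is where the page fault rate reduction comes from.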
The original idea of having a software page size larger than the hardware page size originated at SUSE, by myself and Andi Kleen, while helping AMD design their amd64 cpu. IIRC the conclusion was not to worry too much about the 4k page size being too small, because we could make a soft page size if the time came (or even a 2M PAGE_SIZE kernel); it's just that at the time we thought we had to break backwards compatibility (hence the ABI change in amd64, not requiring 4k mmap offset alignment anymore). I hope my current improved/refined Hack Week idea of handling not naturally aligned pages, using the vm_pgoff indexed at HARD_PAGE_SIZE plus the few bits of virtual address between PAGE_SHIFT and HARD_PAGE_SHIFT, won't need to break anything anymore.

The following simple bench seems to run fine on one real machine and on kvm (a friend of mine has so far failed to run it on his hardware though, so perhaps some driver triggers some remaining bugs) when booted as init=/tmp/bench-static after "cp -a /dev/hda /tmp/".

#include <stdio.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>

#define BUFSIZE (100*1024*1024)

int main(void)
{
	struct timeval before, after;
	int fd = open("/tmp/hda", O_RDONLY);
	unsigned long usec;
	char *c = malloc(BUFSIZE);

	assert(c);
	assert(fd > 2);

	for (;;) {
		/* time one 100M sequential read */
		gettimeofday(&before, NULL);
		if (read(fd, c, BUFSIZE) != BUFSIZE)
			printf("error\n");
		gettimeofday(&after, NULL);
		lseek(fd, 0, SEEK_SET);

		usec = (after.tv_sec - before.tv_sec) * 1000000;
		usec += after.tv_usec - before.tv_usec;
		printf("%lu usec\n", usec);
	}
}

CONFIG_PAGE_SHIFT = 12 (default, 4k page size):

	109770 usec
	109673 usec

CONFIG_PAGE_SHIFT = 13 (8k page size):

	108738 usec
	108667 usec

Numbers are totally repeatable.

Because I was too lazy to adapt the anonymous memory page faults so far, the page coloring is guaranteed to be the worst possible in the least significant bit of the page color. But once I stop wasting gigantic amounts of ram on anon memory and I reduce the page fault rate by 2, 4, 8, 16, etc. times, the anonymous memory will be automatically page-colored (for the first time; not a perfect coloring actually, but a better coloring for sure: the larger the PAGE_SHIFT, the better the coloring), so after that there shouldn't be slowdowns anymore even at very large page sizes like 64k and over.

The max PAGE_SIZE supported is 8M, but implementation details in pgattr will likely prevent booting (right now, even compiling) with a page size over 2M (easy to fix, but going over 2M wasn't a short term worry). Clearly, once we reach those large PAGE_SIZEs, it'll also become possible to use the pmd to map the 'struct page' with a large tlb, if it has been mapped naturally aligned in the virtual address space.

If you want to help or have a look, here's the patch:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.22-rc7/hard-page-size

I'm tracking it with the hg mq extension so far, but I can change that if it helps.

Thanks.
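P.S.: to give a feel of the fault-rate arithmetic behind the anonymous memory work described above, here's a little back-of-the-envelope program (mine, not part of the patch) that prints how many page faults it takes to touch the 100M benchmark buffer for each candidate CONFIG_PAGE_SHIFT, assuming one fault per software page:

/* Not part of the patch: faults needed to touch the 100M bench buffer
 * at each CONFIG_PAGE_SHIFT, assuming one fault per software page.
 * Every +1 of the shift halves the fault count. */
#include <stdio.h>

#define BUFSIZE (100UL*1024*1024)

int main(void)
{
	int shift;

	for (shift = 12; shift <= 16; shift++)
		printf("CONFIG_PAGE_SHIFT = %d (%2luk pages): %lu faults\n",
		       shift, (1UL << shift) >> 10, BUFSIZE >> shift);
	return 0;
}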