Support mmap() on /dev/oldmem to improve the performance of reading /proc/vmcore. Currently, a read of /proc/vmcore is done by read_oldmem(), which calls ioremap and iounmap for each single page; for example, if the memory is 1GB, ioremap/iounmap is called (1GB / 4KB) = 262144 times. This causes significant performance degradation.
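For illustration, here is a simplified sketch of that read path. This is not the literal mem.c code (the real loop goes through the arch-specific copy_oldmem_page()), but the cost structure is the same:

/*
 * Simplified sketch of the current read path: every PAGE_SIZE chunk
 * costs one ioremap/iounmap pair, so reading N pages costs N
 * expensive mapping operations plus a kernel-to-user copy.
 */
static ssize_t read_oldmem_sketch(char __user *buf, size_t count,
				  loff_t *ppos)
{
	ssize_t read = 0;

	while (count) {
		unsigned long pfn = *ppos >> PAGE_SHIFT;
		unsigned long off = *ppos & (PAGE_SIZE - 1);
		size_t csize = min(count, (size_t)(PAGE_SIZE - off));
		void *vaddr;

		if (pfn > saved_max_pfn)
			break;

		vaddr = ioremap(pfn << PAGE_SHIFT, PAGE_SIZE);	/* map one page */
		if (!vaddr)
			return -ENOMEM;
		if (copy_to_user(buf, vaddr + off, csize)) {
			iounmap(vaddr);
			return -EFAULT;
		}
		iounmap(vaddr);		/* and unmap it again */

		buf += csize;
		*ppos += csize;
		count -= csize;
		read += csize;
	}
	return read;
}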
By this patch, we saw an improvement on a simple benchmark from about 200 [MiB/sec] to over 100 [GiB/sec].

Benchmark
=========

= Machine spec
- CPU: Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz (4 sockets, 8 cores) (*)
- memory: 32GB
- kernel: 3.8-rc5 with this patch
- vmcore size: 31.7GB

(*) only 1 CPU is used in the 2nd kernel now.

= Benchmark

1) copy /proc/vmcore with mmap() on /dev/oldmem

I ran the next command and recorded the real time:

$ for n in $(seq 1 15) ; do \
> time copyvmcore --blocksize=$((4096 * (1 << (n - 1)))) /proc/vmcore /dev/null \
> done

where copyvmcore is an ad-hoc test tool that parses the ELF headers and copies the old memory sequentially using mmap() on /dev/oldmem. See the attached file.

|  n | map size |  time | page table | performance |
|    |          | (sec) |            |   [GiB/sec] |
|----+----------+-------+------------+-------------|
|  1 |    4 KiB | 41.86 |        8 B |        0.76 |
|  2 |    8 KiB | 25.43 |       16 B |        1.25 |
|  3 |   16 KiB | 13.28 |       32 B |        2.39 |
|  4 |   32 KiB |  7.20 |       64 B |        4.40 |
|  5 |   64 KiB |  3.45 |      128 B |        9.19 |
|  6 |  128 KiB |  1.82 |      256 B |       17.42 |
|  7 |  256 KiB |  1.03 |      512 B |       30.78 |
|  8 |  512 KiB |  0.61 |      1 KiB |       51.97 |
|  9 |    1 MiB |  0.41 |      2 KiB |       77.32 |
| 10 |    2 MiB |  0.32 |      4 KiB |       99.06 |
| 11 |    4 MiB |  0.27 |      8 KiB |      117.41 |
| 12 |    8 MiB |  0.24 |     16 KiB |      132.08 |
| 13 |   16 MiB |  0.23 |     32 KiB |      137.83 |
| 14 |   32 MiB |  0.22 |     64 KiB |      144.09 |
| 15 |   64 MiB |  0.22 |    128 KiB |      144.09 |

2) copy /proc/vmcore without mmap() on /dev/oldmem

$ time dd bs=4096 if=/proc/vmcore of=/dev/null
8307246+1 records in
8307246+1 records out
real    2m 31.50s
user    0m 1.06s
sys     2m 27.60s

So the performance is 214.26 [MiB/sec].

3) the benchmark on the previous patch

See: http://lists.infradead.org/pipermail/kexec/2013-January/007758.html

where an improvement to more than 2.5 [GiB/sec] was shown.

= Discussion

When the map size is small, there are many mmap() calls and we see the same situation as in the ioremap() case. When the map size is large enough, we see a drastic improvement, because the number of mmap() calls is then small enough that page table modification and TLB flushes no longer matter. Another reason why performance is drastically better than with the previous patch is that no memory copy from kernel space to user space is performed any more.

The performance improvement saturates at a relatively small map size, where the page table is also relatively small, so I guess we don't need to support large pages in remap_pfn_range() for now, or perhaps ever.

Design Concern
==============

The previous patch mapped the whole memory range targeted by kdump at the same time in the linear direct-mapping region. But that way, it is in the worst case difficult to estimate the amount of memory used for the page tables mapping the range. I once investigated improving this by mapping all the DIMM ranges, which are expected to be 1GB-aligned so that 1GB pages keep the page tables small, but I didn't choose this way for two reasons: first, a memory hot-plugging issue, since reading a physical memory hole has undefined behaviour and typically results in a system hang; and second, complicating any fix to the first issue, there is no reliable source of the actually present DIMM ranges available to the kernel; SMBIOS is not sufficient since not all firmware exports them.

On the other hand, /dev/oldmem is a simple interface whose offset value corresponds to the physical address of the whole system memory, and even with this, there is enough room for userland tools to improve performance to a sufficient quality. For example, makedumpfile spends much of its time reading huge contiguous memory sequentially, such as the huge mem_map array or each chunk corresponding to a PT_LOAD entry, and we can improve the performance even by using mmap() only there. A minimal sketch of such a copy loop follows.
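The following sketch is illustration only: it is not the attached copyvmcore tool, the CHUNK size and the function name are arbitrary choices, and paddr is assumed to be page-aligned.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy "len" bytes of old memory starting at the page-aligned
 * physical address "paddr" to the file descriptor "out".  The file
 * offset of /dev/oldmem equals the physical address, so each mmap()
 * covers CHUNK bytes of old memory at once; per the table above, a
 * map size of a few MiB already saturates performance.
 */
#define CHUNK	(8UL << 20)	/* 8 MiB map size */

static int copy_oldmem(int out, off_t paddr, size_t len)
{
	int fd = open("/dev/oldmem", O_RDONLY);

	if (fd < 0)
		return -1;
	while (len) {
		size_t n = len < CHUNK ? len : CHUNK;
		void *p = mmap(NULL, n, PROT_READ, MAP_SHARED, fd, paddr);

		if (p == MAP_FAILED)
			goto err;
		if (write(out, p, n) != (ssize_t)n) {
			munmap(p, n);
			goto err;
		}
		munmap(p, n);
		paddr += n;
		len -= n;
	}
	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}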
As a design decision, I didn't support mmap() on /proc/vmcore, because it abstracts the old memory as an ELF format: there are ranges that are contiguous on /proc/vmcore but not contiguous in the actual old memory. For example, consider the ELF headers and the note objects created on the 2nd kernel and the memory chunks corresponding to PT_LOAD entries on the 1st kernel; they are not contiguous in the old memory. So remapping them so that /proc/vmcore appears contiguous using the existing remap_pfn_range() would need some complicated work.

TODO
====

- fix makedumpfile to use mmap() on /dev/oldmem and benchmark it to confirm whether we can see enough performance improvement.

Test
====

Built and tested on x86_64.

Thanks.
HATAYAMA, Daisuke

>From cf89aace87c8e7192909eb35334a139143a806e8 Mon Sep 17 00:00:00 2001
From: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com>
Date: Wed, 30 Jan 2013 13:02:02 +0900
Subject: [PATCH] kdump, oldmem: support mmap on /dev/oldmem

Support mmap() on /dev/oldmem to improve the performance of reading
/proc/vmcore. Currently, a read of /proc/vmcore is done by
read_oldmem(), which calls ioremap and iounmap for each single page;
for example, if the memory is 1GB, ioremap/iounmap is called
(1GB / 4KB) = 262144 times. This causes significant performance
degradation.

By this patch, we saw an improvement on a simple benchmark from about
200 [MiB/sec] to over 100 [GiB/sec].

We don't permit mapping beyond saved_max_pfn, as is done in
read_oldmem(), nor mapping the memory as writable or executable.

Signed-off-by: HATAYAMA Daisuke <d.hatay...@jp.fujitsu.com>
---
 drivers/char/mem.c | 27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index c6fa3bc..e9046634 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -388,6 +388,32 @@ static ssize_t read_oldmem(struct file *file, char __user *buf,
 	}
 	return read;
 }
+
+/*
+ * Mmap memory corresponding to the old kernel.
+ */
+static int mmap_oldmem(struct file *file, struct vm_area_struct *vma)
+{
+	size_t size = vma->vm_end - vma->vm_start;
+	unsigned long pfn = vma->vm_pgoff;
+
+	if (pfn + (size >> PAGE_SHIFT) > saved_max_pfn + 1)
+		return -EINVAL;
+
+	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+		return -EPERM;
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+
+	if (remap_pfn_range(vma,
+			    vma->vm_start,
+			    pfn,
+			    size,
+			    vma->vm_page_prot))
+		return -EAGAIN;
+
+	return 0;
+}
 #endif
 
 #ifdef CONFIG_DEVKMEM
@@ -806,6 +832,7 @@ static const struct file_operations oldmem_fops = {
 	.read = read_oldmem,
 	.open = open_oldmem,
 	.llseek = default_llseek,
+	.mmap = mmap_oldmem,
 };
 #endif
 
-- 
1.7.7.6
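As a usage note: under this patch a mapping of /dev/oldmem must be requested read-only, and VM_MAYWRITE is cleared, so even a later mprotect() cannot make it writable. A hypothetical userspace check of this behaviour (illustration only; the use of page 0 and the 4096-byte size are assumptions) could look like:

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical check of the constraints mmap_oldmem() enforces: a
 * read-only mapping of the first old-memory page succeeds, but
 * upgrading it with mprotect() fails with EACCES because
 * VM_MAYWRITE has been cleared.
 */
int main(void)
{
	int fd = open("/dev/oldmem", O_RDONLY);
	void *p;

	assert(fd >= 0);

	p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	assert(p != MAP_FAILED);

	assert(mprotect(p, 4096, PROT_READ | PROT_WRITE) == -1);
	assert(errno == EACCES);

	munmap(p, 4096);
	close(fd);
	return 0;
}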
[Attachment: tool.tar.gz]