Re: another pmem variant V2
* Christoph Hellwig wrote: > On Thu, Apr 02, 2015 at 03:11:36PM +, Elliott, Robert (Server Storage) > wrote: > > AttrCopyRead IOPS Write IOPS > > = == > > UC memcpy 36 K22 K > > UC NT rd,wr513 K 326 K > > > > WB memcpy 3.4 M 2.5 M > > WB NT rd,wr3.3 M 3.5 M > > > > WC memcpy 776 K 3.5 M > > WC NT rd,wr3.0 M 3.9 M > > > > WT memcpy 2.1 M 22 K > > WT NT rd,wr3.3 M 2.1 M > > > > a few other variations yielded the peak numbers: > > WC NT rd only 3.2 M 4.1 M > > WC NT wr only 712 K 4.6 M > > WT NT wr only 2.6 M 4.0 M > > > > There are lots of tuning considerations for those memcpy > > functions - how far to unroll the loop, whether to > > include PRFETCHNTA instructions, etc. > > Looks like we should aіm for WC + NT would be a good start. > > Can you prepare a patch to add your NT memcpy variants and > a second one to use them in the pmem driver? So we already have NT memcpy variants, see the copy_*_user_*_nocache() primitives in arch/x86/. They could be used almost straight away for kernel memory as well, as kernel buffers will not fault. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On Thu, Apr 02, 2015 at 03:11:36PM +, Elliott, Robert (Server Storage) wrote: > Attr CopyRead IOPS Write IOPS > = == > UCmemcpy 36 K22 K > UCNT rd,wr513 K 326 K > > WBmemcpy 3.4 M 2.5 M > WBNT rd,wr3.3 M 3.5 M > > WCmemcpy 776 K 3.5 M > WCNT rd,wr3.0 M 3.9 M > > WTmemcpy 2.1 M 22 K > WTNT rd,wr3.3 M 2.1 M > > a few other variations yielded the peak numbers: > WCNT rd only 3.2 M 4.1 M > WCNT wr only 712 K 4.6 M > WTNT wr only 2.6 M 4.0 M > > There are lots of tuning considerations for those memcpy > functions - how far to unroll the loop, whether to > include PRFETCHNTA instructions, etc. Looks like we should aіm for WC + NT would be a good start. Can you prepare a patch to add your NT memcpy variants and a second one to use them in the pmem driver? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: another pmem variant V2
> -Original Message- > From: Christoph Hellwig [mailto:h...@lst.de] > Sent: Wednesday, April 1, 2015 2:26 AM > To: Elliott, Robert (Server Storage) > Cc: Christoph Hellwig; linux-nvd...@ml01.01.org; linux- > fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; x...@kernel.org; > ross.zwis...@linux.intel.com; ax...@kernel.dk; b...@plexistor.com; Kani, > Toshimitsu > Subject: Re: another pmem variant V2 > > On Tue, Mar 31, 2015 at 10:11:29PM +, Elliott, Robert (Server Storage) > wrote: > > I used fio to test 4 KiB random read and write IOPS > > on a 2-socket x86 DDR4 system. With various cache attributes: > > > > attrreadwrite notes > > - - > > UC 37 K21 Kioremap_nocache > > WB 3.6 M 2.5 M ioremap > > WC 764 K 3.7 M ioremap_wc > > WT ioremap_wt > > > > So, although UC and WT are the only modes certain to be safe, > > the V1 default of UC provides abysmal performance - worse than > > a consumer-class SATA SSD. > > It doesn't look quite as bad on my setup, but performance is fairly > bad here as well. > > > A solution for x86 is to use the MOVNTI instruction in WB > > mode. This non-temporal hint uses a buffer like the write > > combining buffer, not filling the cache and not stopping > > everything in the CPU. The kernel function __copy_from_user() > > uses that instruction (with SFENCE at the end) - see > > arch/x86/lib/copy_user_nocache_64.S. > > > > If I made the change from memcpy() to __copy_from_user() > > correctly, that results in: > > > > attrreadwrite notes > > - - > > WB w/NTI2.4 M 2.6 M __copy_from_user() > > WC w/NTI3.2 M 2.1 M __copy_from_user() > > That looks a lot better. It doesn't help us with a pmem device > mapped directly into userspace using mmap with the DAX infrastructure, > though. > > Note when we want to move to non-temporal copies we'll need to add > a new prototype, as __copy_from_user isn't guaranteed to use these, > and it is defined to only work on user addresses. That doesn't matter > on x86 but would blow up on say sparc or s390. Here are some updated numbers including: * WT (writethrough) cache attribute * memcpy that uses non-temporal stores (MOVNTDQ) to the persistent memory for block writes (rather than MOVNTI) * memcpy that uses non-temporal loads (MOVNTDQA) from the persistent memory for block reads AttrCopyRead IOPS Write IOPS = == UC memcpy 36 K22 K UC NT rd,wr513 K 326 K WB memcpy 3.4 M 2.5 M WB NT rd,wr3.3 M 3.5 M WC memcpy 776 K 3.5 M WC NT rd,wr3.0 M 3.9 M WT memcpy 2.1 M 22 K WT NT rd,wr3.3 M 2.1 M a few other variations yielded the peak numbers: WC NT rd only 3.2 M 4.1 M WC NT wr only 712 K 4.6 M WT NT wr only 2.6 M 4.0 M There are lots of tuning considerations for those memcpy functions - how far to unroll the loop, whether to include PRFETCHNTA instructions, etc. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On Wed, Apr 01, 2015 at 07:33:38PM +, Elliott, Robert (Server Storage) wrote: > I triggered a paging error in the memcpy call for a block read > from system-udevd (actually in a modified memcpy() for the cache > attribute experiments). > > 1. This triggered an illegal schedule() call from an atomic context. > The call trace is shown below. > > 2. memcpy() doesn't provide exception handling or error reporting. > Some functions like do so, like __copy_user_nocache in > arch/x85/lib/copy_user_nocache_64.S. > > Should pmem only use functions that do so, if available on the > architecture? We'll need to define an interface for the function to use if it isn't plain memcpy, which would include that detail. But I can't see how that memcpy should ever have to handle a page fault, I'd be curious how your reproduces this issue. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: another pmem variant V2
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel- > ow...@vger.kernel.org] On Behalf Of Christoph Hellwig > Sent: Thursday, March 26, 2015 3:33 AM > To: linux-nvd...@ml01.01.org; linux-fsde...@vger.kernel.org; linux- > ker...@vger.kernel.org; x...@kernel.org > Cc: ross.zwis...@linux.intel.com; ax...@kernel.dk; b...@plexistor.com > Subject: another pmem variant V2 > I triggered a paging error in the memcpy call for a block read from system-udevd (actually in a modified memcpy() for the cache attribute experiments). 1. This triggered an illegal schedule() call from an atomic context. The call trace is shown below. 2. memcpy() doesn't provide exception handling or error reporting. Some functions like do so, like __copy_user_nocache in arch/x85/lib/copy_user_nocache_64.S. Should pmem only use functions that do so, if available on the architecture? pmem_rw_page can pass along the return value from the copy function. pmem_make_request can report the error, if any, via bio_endio. Call trace == [62117.317216] BUG: scheduling while atomic: systemd-udevd/22135/0x0001 [62117.317232] Modules linked in: pmem ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd xhci_pci hpilo xhci_hcd sb_edac edac_core microcode iTCO_wdt iTCO_vendor_support hpwdt ioatdma shpchp pcspkr lpc_ich mfd_core i2c_i801 wmi pcc_cpufreq dca acpi_cpufreq uinput nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs exportfs sr_mod cdrom sd_mod bnx2x tg3 ahci mdio libahci ptp pps_core hpsa libcrc32c dm_mirror dm_region_hash dm_log dm_mod ipv6 autofs4 [last unloaded: pmem] [62117.317233] CPU: 31 PID: 22135 Comm: systemd-udevd Tainted: G D 4.0.0-rc6+ #7 [62117.317234] Hardware name: HP ProLiant DL380 Gen9 [62117.317235] 88047f3f3ac0 8804241db2e8 815a8866 ff86ff86 [62117.317236] 8804241dbfd8 8804241db2f8 815a4b45 8804241db348 [62117.317237] 815ab893 880457091050 88047f3fbb20 [62117.317237] Call Trace: [62117.317240] [] dump_stack+0x45/0x57 [62117.317245] [] __schedule_bug+0x46/0x54 [62117.317247] [] __schedule+0x793/0x870 [62117.317251] [] ? bit_wait+0x50/0x50 [62117.317252] [] schedule+0x37/0x90 [62117.317253] [] schedule_timeout+0x1dc/0x260 [62117.317258] [] ? ktime_get+0x3e/0xa0 [62117.317259] [] io_schedule_timeout+0xac/0x140 [62117.317261] [] bit_wait_io+0x36/0x50 [62117.317262] [] __wait_on_bit_lock+0x4b/0xb0 [62117.317263] [] ? find_get_entries+0xe2/0x130 [62117.317265] [] __lock_page+0xac/0xb0 [62117.317269] [] ? autoremove_wake_function+0x40/0x40 [62117.317276] [] truncate_inode_pages_range+0x3af/0x620 [62117.317278] [] ? cpumask_next_and+0x37/0x50 [62117.317279] [] ? __brelse+0x40/0x40 [62117.317283] [] ? smp_call_function_many+0x5d/0x280 [62117.317284] [] ? free_cpumask_var+0x9/0x10 [62117.317285] [] ? on_each_cpu_cond+0xbd/0x160 [62117.317286] [] ? __brelse+0x40/0x40 [62117.317288] [] truncate_inode_pages+0x15/0x20 [62117.317289] [] kill_bdev+0x33/0x40 [62117.317291] [] __blkdev_put+0x68/0x210 [62117.317293] [] blkdev_put+0x50/0x130 [62117.317294] [] blkdev_close+0x25/0x30 [62117.317296] [] __fput+0xe7/0x220 [62117.317298] [] fput+0xe/0x10 [62117.317302] [] task_work_run+0xc4/0xe0 [62117.317306] [] do_exit+0x2d8/0xb10 [62117.317308] [] ? kmsg_dump+0x9c/0xc0 [62117.317312] [] oops_end+0x8e/0xd0 [62117.317313] [] no_context+0x2d4/0x334 [62117.317314] [] __bad_area_nosemaphore+0x6d/0x1c6 [62117.317317] [] ? zone_statistics+0x80/0xa0 [62117.317319] [] bad_area_nosemaphore+0x13/0x15 [62117.317321] [] __do_page_fault+0x91/0x430 [62117.317322] [] do_page_fault+0xc/0x10 [62117.317324] [] page_fault+0x22/0x30 [62117.317325] [] ? pmem_do_bvec.isra.6+0x212/0x3f0 [pmem] [62117.317326] [] pmem_rw_page+0x43/0x60 [pmem] [62117.317328] [] ? __radix_tree_preload+0x38/0xa0 [62117.317329] [] bdev_read_page+0x2e/0x40 [62117.317330] [] do_mpage_readpage+0x51f/0x6c0 [62117.317331] [] ? lru_cache_add+0xe/0x10 [62117.317332] [] mpage_readpages+0xdb/0x130 [62117.317333] [] ? I_BDEV+0x10/0x10 [62117.317334] [] ? I_BDEV+0x10/0x10 [62117.317336] [] blkdev_readpages+0x1d/0x20 [62117.317336] [] __do_page_cache_readahead+0x194/0x210 [62117.317337] [] force_page_cache_readahead+0x75/0xb0 [62117.317338] [] page_cache_sync_readahead+0x43/0x50 [62117.317339] [] generic_file_read_iter+0x431/0x630 [62117.317341] [] blkdev_read_iter+0x37/0x40 [62117.317342] [] new_sync_read+0x7e/0xb0 [62117.317343] [] __vfs_read+0x18/0x50 [62117.317344] [] vfs_read+0x86/0x140 [62117.317345] [] SyS_read+0x46/0xb0 [62117.317346] [] ? __audit_syscall_entry+0xb4/0x110 [62117.317348] [] system_call_fastpath+0x12/0x17 [62121.618505] note: systemd-udevd[22133] exited with preempt_count 1 -- To unsubscr
Re: another pmem variant V2
On 03/31/2015 07:16 PM, Christoph Hellwig wrote: > On Tue, Mar 31, 2015 at 06:14:15PM +0300, Boaz Harrosh wrote: >> We can not accept it as is right now. > > Who is we? > >> We have conducted farther tests. And it messes up NUMA. > > Only you if you use the memmap option in weird ways. > No weird ways just the single range. I would be very happy if you can teach me the proper way, believe me I'm trying for 2 days. > Sounds like I should simply remove the memmap= option so people don't > abuse it. The main point is to parse the e820 tables, which works fine. > What abuse? the single range and the problem shows up. So you are just pasting over the problem sir. The patch that you are submitting has grave problems and covering it up with not allowing memmap=! will not fix it. The Kernel as is after your patch does not like this half baked beast as we defined it, the defined pmem-memory-range messes things up. > And those people having fake pmem, or pcie cards that they are too lazy > to submit proper drivers for can stick to their out of tree hacks? > I have now conducted more tests on real type-12 DDR3 system and the exact same problem as I reported exists with real type-12 chips! And who are we kidding? the "memmap=!" yes or no, makes no difference at all. All it does is edit the table as if it was the table the BIOS gave us. There is no extra processing done on memmap=. Your e820 patch trashes NUMA. But I fix it for good this time here is the fix below. After I apply below patch every thing boots and work just as expected. All the problems I reported disappear. Any configuration, any number of ranges, cross NUMA or not, just works, exactly as before with my patches. The fix is over your V2, I will post one later that fixes your V3 and adds back the memmap=! --- If you inspect my fix below You will see that what happened is that the original patch was too aggressive in making pmem look like ram. In fact it started the ARCH side memory initialization and was only skipping the generic initialization of memory. This messed up internal real-memory structures. With this fix below block/drivers/pmem.ko loads just fine finds its resources and maps them, and everything just works. including any "abuse" to memmap=! and any NUMA configurations. Both real HW, type-12 HW and NUMA Vms. (Preliminary testing we are conducting the full test rig as we speak) (I'm still going over it, I might send some more cleaning) Cheers --- git diff --stat -p -M HEAD arch/x86/kernel/e820.c | 7 --- arch/x86/kernel/pmem.c | 17 - arch/x86/kernel/setup.c | 2 -- 3 files changed, 4 insertions(+), 22 deletions(-) diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 07246e5..dce0e84 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -963,9 +963,10 @@ void __init e820_reserve_resources(void) * pci device BAR resource and insert them later in * pcibios_resource_survey() */ - if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) { - if (e820.map[i].type != E820_PRAM) - res->flags |= IORESOURCE_BUSY; + if (((e820.map[i].type != E820_RESERVED) && +(e820.map[i].type != E820_PRAM)) || +res->start < (1ULL<<20)) { + res->flags |= IORESOURCE_BUSY; insert_resource(&iomem_resource, res); } res++; diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c index f970048..fcdbc20 100644 --- a/arch/x86/kernel/pmem.c +++ b/arch/x86/kernel/pmem.c @@ -9,23 +9,6 @@ #include #include -void __init reserve_pmem(void) -{ - int i; - - for (i = 0; i < e820.nr_map; i++) { - struct e820entry *ei = &e820.map[i]; - - if (ei->type != E820_PRAM) - continue; - - memblock_reserve(ei->addr, ei->addr + ei->size); - max_pfn_mapped = init_memory_mapping( - ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr, - ei->addr + ei->size); - } -} - static __init void register_pmem_device(struct resource *res) { struct platform_device *pdev; diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index f2bed2b..0a2421c 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1158,8 +1158,6 @@ void __init setup_arch(char **cmdline_p) early_acpi_boot_init(); - reserve_pmem(); - initmem_init(); dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On Tue, Mar 31, 2015 at 10:11:29PM +, Elliott, Robert (Server Storage) wrote: > I used fio to test 4 KiB random read and write IOPS > on a 2-socket x86 DDR4 system. With various cache attributes: > > attr readwrite notes > - - > UC37 K21 Kioremap_nocache > WB3.6 M 2.5 M ioremap > WC764 K 3.7 M ioremap_wc > WTioremap_wt > > So, although UC and WT are the only modes certain to be safe, > the V1 default of UC provides abysmal performance - worse than > a consumer-class SATA SSD. It doesn't look quite as bad on my setup, but performance is fairly bad here as well. > A solution for x86 is to use the MOVNTI instruction in WB > mode. This non-temporal hint uses a buffer like the write > combining buffer, not filling the cache and not stopping > everything in the CPU. The kernel function __copy_from_user() > uses that instruction (with SFENCE at the end) - see > arch/x86/lib/copy_user_nocache_64.S. > > If I made the change from memcpy() to __copy_from_user() > correctly, that results in: > > attr readwrite notes > - - > WB w/NTI 2.4 M 2.6 M __copy_from_user() > WC w/NTI 3.2 M 2.1 M __copy_from_user() That looks a lot better. It doesn't help us with a pmem device mapped directly into userspace using mmap with the DAX infrastructure, though. Note when we want to move to non-temporal copies we'll need to add a new prototype, as __copy_from_user isn't guaranteed to use these, and it is defined to only work on user addresses. That doesn't matter on x86 but would blow up on say sparc or s390. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: another pmem variant V2
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel- > ow...@vger.kernel.org] On Behalf Of Christoph Hellwig > Sent: Thursday, March 26, 2015 3:33 AM > To: linux-nvd...@ml01.01.org; linux-fsde...@vger.kernel.org; linux- > ker...@vger.kernel.org; x...@kernel.org > Cc: ross.zwis...@linux.intel.com; ax...@kernel.dk; b...@plexistor.com > Subject: another pmem variant V2 > > Here is another version of the same trivial pmem driver, because two > obviously aren't enough. The first patch is the same pmem driver > that Ross posted a short time ago, just modified to use platform_devices > to find the persistant memory region instead of hardconding it in the > Kconfig. This allows to keep pmem.c separate from any discovery mechanism, > but still allow auto-discovery. > ... > This has been tested both with a real NVDIMM on a system with a type 12 > capable bios, as well as with "fake persistent" memory using the memmap= > option. > > Changes since V1: > - s/E820_PROTECTED_KERN/E820_PMEM/g > - map the persistent memory as uncached > - better kernel parameter description > - various typo fixes > - MODULE_LICENSE fix I used fio to test 4 KiB random read and write IOPS on a 2-socket x86 DDR4 system. With various cache attributes: attrreadwrite notes - - UC 37 K21 Kioremap_nocache WB 3.6 M 2.5 M ioremap WC 764 K 3.7 M ioremap_wc WT ioremap_wt So, although UC and WT are the only modes certain to be safe, the V1 default of UC provides abysmal performance - worse than a consumer-class SATA SSD. A solution for x86 is to use the MOVNTI instruction in WB mode. This non-temporal hint uses a buffer like the write combining buffer, not filling the cache and not stopping everything in the CPU. The kernel function __copy_from_user() uses that instruction (with SFENCE at the end) - see arch/x86/lib/copy_user_nocache_64.S. If I made the change from memcpy() to __copy_from_user() correctly, that results in: attrreadwrite notes - - WB w/NTI2.4 M 2.6 M __copy_from_user() WC w/NTI3.2 M 2.1 M __copy_from_user() There is also a non-temporal streaming load hint instruction called MOVNTDQA that might be helpful for reads for both WB and WC. I don't see any existing kernel memcpy-like function that utilizes this instruction, so haven't tried it yet. Intel64 and IA-32 Architectures Software Developers Manual excerpts (Jan 2015) === "The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD) allow data to be moved from the processor's registers directly into system memory without being also written into the L1, L2, and/or L3 caches. These instructions can be used to prevent cache pollution when operating on data that is going to be modified only once before being stored back into system memory. ... MOVNTI ... The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. ... MOVNTDQA Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers ("streaming load buffers"). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput. ... A processor implementation may make use of the non-temporal hint associated with this instruction if the memory source is WC (write combining) memory type. An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (writeback) memory type." -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On Tue, Mar 31, 2015 at 06:44:56PM +0200, Ingo Molnar wrote: > I'd be fine with that too - mind sending an updated series? I will send an updated one tonight or early tomorrow. Btw, do you want to keep the E820_PRAM name instead of E820_PMEM? Seems like most people either don't care or prefer E820_PMEM. I'm fine either way. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
* Christoph Hellwig wrote: > On Tue, Mar 31, 2015 at 06:14:15PM +0300, Boaz Harrosh wrote: > > We can not accept it as is right now. > > Who is we? > > > We have conducted farther tests. And it messes up NUMA. > > Only you if you use the memmap option in weird ways. > > Sounds like I should simply remove the memmap= option so people don't > abuse it. The main point is to parse the e820 tables, which works fine. I'd be fine with that too - mind sending an updated series? the memmap= option can be added on top of that. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On Tue, Mar 31, 2015 at 06:14:15PM +0300, Boaz Harrosh wrote: > We can not accept it as is right now. Who is we? > We have conducted farther tests. And it messes up NUMA. Only you if you use the memmap option in weird ways. Sounds like I should simply remove the memmap= option so people don't abuse it. The main point is to parse the e820 tables, which works fine. And those people having fake pmem, or pcie cards that they are too lazy to submit proper drivers for can stick to their out of tree hacks? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On Tue, Mar 31, 2015 at 01:25:46PM +0300, Boaz Harrosh wrote: > The problem I see is that if I state a memmap=nn!aa that crosses a NUMA > boundary then the machine will not boot. > So BTW for sure I need that "don't merge E820_PMEM ranges" patch because > otherwise I will not be able to boot if I have pmem on both NUMA nodes > and they happen to be contiguous. Ok. > Regarding the SQUASHMEs to PMEM. Originally I had them as 3-4 patches. > But I thought since you are squashing them into a single submitted patch > I can just send just the one patch. Tell me what you prefer and I'll > resend (The one vs the three) The slpit is mostly to get a good description for each change. > And one last issue. I have some configuration "hardness" with the > memmap=nn!aa Kernel command line API, it was better for me with the > pmem map= module param. Will you be OK if I split pmem_probe() into > calling pmem_alloc(addr, length), so I can keep an out-of-tree patch > that adds the map= parameter to pmem? Can't your out of tree patch do that as well? In fact you might want to rewrite it to a module that allows your to create/destroy the platform_devices you use. And for your PCIe case I'd really prefer a proper in-tree PCI driver for it. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On 03/31/2015 12:25 PM, Christoph Hellwig wrote: > On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote: <> > > Any news? I'd really like to resend this ASAP to get it into 4.1.. Hi Christoph I hate to be bearer of bad news but we have a problem with the e820 patch: x86: add support for the non-standard protected e820 type We can not accept it as is right now. We have conducted farther tests. And it messes up NUMA. All the below is based on your latest patches on top of 4.0-rc5 Before any modprobe of pmem.ko, just a clean boot. In the same exact Kernel, if you use memmap=nn!aa ie add "type-12" section we have below problems, but if we do memmap=nn\$aa ie add "reserved" section then everything is fine. [With my old e820 patches it all works fine because it is closer to the memmap=nn\$aa "reserved" section way] The problems we see in a NUMA machine. * On some machines we cannot boot if a single memmap=nn!aa crosses a NUMA boundary. Some VMs sometime boot sometimes do not. Some VMs never boot. Some machines boot just fine. Doing memmap=nn1!aa1,nn2!aa2 where the split is at the NUMA boundary will enable all machines to boot * Regardless if we use memmap=nn!aa crossing a NUMA boundary or if we do memmap=nn1!aa1,nn2!aa2 the output of cat /sys/devices/system/node/node1/meminfo Is all ZEROs. Yes very scary everything is ZERO. Even though the dmseg prints show the correct numbers the above node1/meminfo is broken. If with the same Kernel we do memmap=nn\$aa then everything is clean. So something in the way we defined our new type-12 region upsets the NUMA code. And we need farther investigation. Perhaps you would like to start with my e820 much more conservative fix, that makes type-12 memory behave exactly as reserved memory. And go from there, step by step. Until we fix the problem above. So we can submit something like that for 4.1. Talk to me, tell me what you need me to experiment with. Should I try your platform device way but based on my e820 fix? how can I please help push this through? Thanks Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On 03/31/2015 01:25 PM, Boaz Harrosh wrote: > On 03/31/2015 12:25 PM, Christoph Hellwig wrote: <> > The problem I see is that if I state a memmap=nn!aa that crosses a NUMA > boundary then the machine will not boot. > So BTW for sure I need that "don't merge E820_PMEM ranges" patch because > otherwise I will not be able to boot if I have pmem on both NUMA nodes > and they happen to be contiguous. > BTW if you need NUMA emulated in your KVM add the below to your virsh edit memory= is in KiB only The sum of these must exactly match what you have for 8388608 8388608 And you probably already have say: Cheers Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On 03/31/2015 12:25 PM, Christoph Hellwig wrote: > On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote: >> On 03/26/2015 10:32 AM, Christoph Hellwig wrote: >>> Here is another version of the same trivial pmem driver, because two >>> obviously aren't enough. The first patch is the same pmem driver >>> that Ross posted a short time ago, just modified to use platform_devices >>> to find the persistant memory region instead of hardconding it in the >>> Kconfig. This allows to keep pmem.c separate from any discovery mechanism, >>> but still allow auto-discovery. >>> >> >> Hi Christoph >> >> So I've been trying to test your version, and play around with it. >> I currently have some problems, but this is the end of the week for me >> so I will debug and fix it after the weekend on Sunday. > > Any news? I'd really like to resend this ASAP to get it into 4.1.. Yes sorry I got stuck with the NUMA thing. I will finally finish today. For some reason your patch with the memmap=nn!aa behaves differently than with memmap=nn\$aa And also compared to my old e820.c fix + my old pmem. My fixes are effectively the same as with your Kernel and the memmap=nn\$aa which sets the range "reserved" like. And less "ram" like as in your patch. The problem I see is that if I state a memmap=nn!aa that crosses a NUMA boundary then the machine will not boot. So BTW for sure I need that "don't merge E820_PMEM ranges" patch because otherwise I will not be able to boot if I have pmem on both NUMA nodes and they happen to be contiguous. I do not understand why this happens, because a contiguous range of RAM is fine with cross NUMA and pmem not. (Also the pmem defined as memmap=nn!aa behaves the same) Also we have another problem with NUMA that I'm researching for a solution long term. Is that if the second NUMA node has only pmem and no RAM than the Kernel will not define a second NUMA node. And we are NUMA screwed with pmem. So one must put v-ram and nv-ram equally spread across his nodes. Regarding the SQUASHMEs to PMEM. Originally I had them as 3-4 patches. But I thought since you are squashing them into a single submitted patch I can just send just the one patch. Tell me what you prefer and I'll resend (The one vs the three) And one last issue. I have some configuration "hardness" with the memmap=nn!aa Kernel command line API, it was better for me with the pmem map= module param. Will you be OK if I split pmem_probe() into calling pmem_alloc(addr, length), so I can keep an out-of-tree patch that adds the map= parameter to pmem? Will send the patch in one hour. Just tell me if you need just the one or three. Thanks, Christoph Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote: > On 03/26/2015 10:32 AM, Christoph Hellwig wrote: > > Here is another version of the same trivial pmem driver, because two > > obviously aren't enough. The first patch is the same pmem driver > > that Ross posted a short time ago, just modified to use platform_devices > > to find the persistant memory region instead of hardconding it in the > > Kconfig. This allows to keep pmem.c separate from any discovery mechanism, > > but still allow auto-discovery. > > > > Hi Christoph > > So I've been trying to test your version, and play around with it. > I currently have some problems, but this is the end of the week for me > so I will debug and fix it after the weekend on Sunday. Any news? I'd really like to resend this ASAP to get it into 4.1.. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On Thu, Mar 26, 2015 at 07:31:03PM +0200, Boaz Harrosh wrote: > But I hope you are not ignoring my real problem. any two memmap= ranges > will halt the boot. Specially if they are dis-contiguous. not the case here, verified it in various cofigurations. > Also I need the contiguous variant split into two devices because > they might belong to two NUMA nodes. It is very hard to manage > if a NUMA crossing is in a middle of a single pmemX device > The way we like to configure it is that each /dev/pmem belongs to > a single NUMA node. And in a multy device setup each CPU node > allocates from "his" pmem device If there is space. > (And it lets me set application affinity if need to) The hack below ensures two separate type 12 entries stay separate, but I'm not sure I really want this. so far it seems like very special hacks for your very specialized fake-pmem config. > BTW: Will device mapper let me call ->direct_access() Right now it doesn't, but it's not hard to add.. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
And here's the patch, sorry: diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 4bd525a..e7bf89e 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -346,7 +346,7 @@ int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map, * continue building up new bios map based on this * information */ - if (current_type != last_type) { + if (current_type != last_type || current_type == E820_PRAM) { if (last_type != 0) { new_bios[new_bios_entry].size = change_point[chgidx]->addr - last_addr; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On 03/26/2015 07:18 PM, Christoph Hellwig wrote: > On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote: >> For one this auto discovery of yours is very (very) nice but is a bit >> inconvenience. Before I would reserve a big chuck on each NUMA range >> on Kernel's memmap= And then at pmem map= would slice and dice it >> as I want hot style on modprobe with no need for reboot. Now I need >> to do it on reboot theoretically. (You know xfstest needs lots of devices >> some big some small ;-)) > > Slicing up a block device based on kernel options is not exactly a smart > idea. We have partitions that are perfectly fine for that. If you > really don't are about persistance of your partitioning you can just > set up a device mapper table. No need to reinvent the wheel. > I know! fdisk is my friend, I know. But I hope you are not ignoring my real problem. any two memmap= ranges will halt the boot. Specially if they are dis-contiguous. Also I need the contiguous variant split into two devices because they might belong to two NUMA nodes. It is very hard to manage if a NUMA crossing is in a middle of a single pmemX device The way we like to configure it is that each /dev/pmem belongs to a single NUMA node. And in a multy device setup each CPU node allocates from "his" pmem device If there is space. (And it lets me set application affinity if need to) BTW: Will device mapper let me call ->direct_access() Thanks Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote: > For one this auto discovery of yours is very (very) nice but is a bit > inconvenience. Before I would reserve a big chuck on each NUMA range > on Kernel's memmap= And then at pmem map= would slice and dice it > as I want hot style on modprobe with no need for reboot. Now I need > to do it on reboot theoretically. (You know xfstest needs lots of devices > some big some small ;-)) Slicing up a block device based on kernel options is not exactly a smart idea. We have partitions that are perfectly fine for that. If you really don't are about persistance of your partitioning you can just set up a device mapper table. No need to reinvent the wheel. > Also with the modprob pmem map= I was supporting a PCIE memory card but > I guess I need to throw this one out the door. Send it my way to support it properly... Just like any other PCIe device it should have a proper driver. For now it can register a pmem platform device, but when you submit it I'd rather add slightly more low-level versions of pmem_probe and pmem_remove that take a struct device * and possibly a resource structure so that you can directly call into it. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: another pmem variant V2
On 03/26/2015 10:32 AM, Christoph Hellwig wrote: > Here is another version of the same trivial pmem driver, because two > obviously aren't enough. The first patch is the same pmem driver > that Ross posted a short time ago, just modified to use platform_devices > to find the persistant memory region instead of hardconding it in the > Kconfig. This allows to keep pmem.c separate from any discovery mechanism, > but still allow auto-discovery. > Hi Christoph So I've been trying to test your version, and play around with it. I currently have some problems, but this is the end of the week for me so I will debug and fix it after the weekend on Sunday. But just a heads up for what I stumbled on. For one this auto discovery of yours is very (very) nice but is a bit inconvenience. Before I would reserve a big chuck on each NUMA range on Kernel's memmap= And then at pmem map= would slice and dice it as I want hot style on modprobe with no need for reboot. Now I need to do it on reboot theoretically. (You know xfstest needs lots of devices some big some small ;-)) theoretically because if I try: memmap=1G!4G,1G!6G It will not boot at all. (will debug on Sunday) I must have 2 devices because each one is on another NUMA node. If I do memmap=2G!4G,1G!6G , which is a contiguous range it will give me only one device. Just an inconvenience. (And also this one sometimes does not boot) {All this in a VM for now did not even get to real HW yet} Also with the modprob pmem map= I was supporting a PCIE memory card but I guess I need to throw this one out the door. Am also posting an RFC cleanup to your pmem driver. RFC because I was not yet able to boot the all thing in our lab for heavy testing. Thanks Boaz > The other two patches are a heavily rewritten version of the code that > Intel gave to various storage vendors to discover the type 12 (and earlier > type 6) nvdimms, which I massaged into a form that is hopefully suitable > for mainline. > > Note that pmem.c really is the minimal version as I think we need something > included ASAP. We'll eventually need to be able to do other I/O from and > to it, and as most people know everyone has their own preferre method to > do it, which I'd like to discuss once we have the basic driver in. > > This has been tested both with a real NVDIMM on a system with a type 12 > capable bios, as well as with "fake persistent" memory using the memmap= > option. > > Changes since V1: > - s/E820_PROTECTED_KERN/E820_PMEM/g > - map the persistent memory as uncached > - better kernel parameter description > - various typo fixes > - MODULE_LICENSE fix > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/