[PATCH v3] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage of
__copy_user_nocache() with explicit cache management in these unaligned
cases.

Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.

Fixes: 5de490daec8b ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: Jan Kara
Cc: Jeff Moyer
Cc: Ingo Molnar
Cc: Christoph Hellwig
Cc: "H. Peter Anvin"
Cc: Al Viro
Cc: Thomas Gleixner
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Toshi Kani
Signed-off-by: Dan Williams
---
Changes in v3:
* match the implementation to the notes at the top of __copy_user_nocache
  (Toshi)
* Switch to using the IS_ALIGNED() macro to make alignment checks easier
  to read and harder to get wrong like they were in v2. (Toshi, Dan)

 arch/x86/include/asm/pmem.h | 42 +++---
 1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 2c1ebeb4d737..529bb4a6487a 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -55,7 +55,8 @@ static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
  * @size: number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
- * instruction.
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
  */
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
@@ -69,15 +70,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
 		clwb(p);
 }
 
-/*
- * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
- * iterators, so for other types (bvec & kvec) we must do a cache write-back.
- */
-static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
-{
-	return iter_is_iovec(i) == false;
-}
-
 /**
  * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
  * @addr:	PMEM destination address
@@ -94,7 +86,35 @@ static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
 	/* TODO: skip the write-back by always using non-temporal stores */
 	len = copy_from_iter_nocache(addr, bytes, i);
 
-	if (__iter_needs_pmem_wb(i))
+	/*
+	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
+	 * non-temporal stores for the bulk of the transfer, but we need
+	 * to manually flush if the transfer is unaligned. A cached
+	 * memory copy is used when destination or size is not naturally
+	 * aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 *
+	 * In the non-iovec case the entire destination needs to be
+	 * flushed.
+	 */
+	if (iter_is_iovec(i)) {
+		unsigned long flushed, dest = (unsigned long) addr;
+
+		if (bytes < 8) {
+			if (!IS_ALIGNED(dest, 4) || (bytes != 4))
+				arch_wb_cache_pmem(addr, 1);
+		} else {
+			if (!IS_ALIGNED(dest, 8)) {
+				dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+				arch_wb_cache_pmem(addr, 1);
+			}
+
+			flushed = dest - (unsigned long) addr;
+			if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
+				arch_wb_cache_pmem(addr + bytes - 1, 1);
+		}
+	} else
 		arch_wb_cache_pmem(addr, bytes);
 
 	return len;

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
RE: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
> >> > The clflush here flushes for the cacheline size. So, we do not need to flush
> >> > the same cacheline again when the unaligned tail is in the same line.
> >>
> >> Ok, makes sense. Last question, can't we reduce the check to be:
> >>
> >>     if ((bytes > flushed) && ((bytes - flushed) & 3))
> >>
> >> ...since if 'bytes' was 4-byte aligned we would have performed
> >> non-temporal stores.
> >
> > That is not documented behavior of copy_user_nocache, but as long as the pmem
> > version of copy_user_nocache follows the same implemented behavior, yes, that
> > works.
>
> Hmm, sorry this comment confuses me, I'm only referring to the current
> version of __copy_user_nocache not the new pmem version. The way I
> read the current code we only ever jump to the cached copy loop
> (.L_1b_cache_copy_loop) if the trailing byte-count is 4-byte
> misaligned.

Yes, you are right and that's how the code is implemented. I added this
trailing 4-byte handling for the >=8B case, which is shared with the <8B
case, since it was easy to do. But I considered it a bonus.

This function also needs to handle a 4B-aligned destination if it is to
state that it handles 4B alignment for the >=8B case as well. Otherwise,
it's inconsistent. Since I did not see much point in supporting such a
case, I simply documented in the Note that 8-byte alignment is required
for the >=8B case.

Thanks,
-Toshi
Re: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
On Mon, Apr 10, 2017 at 2:45 PM, Kani, Toshimitsu wrote: >> On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu >> wrote: >> >> > Thanks for the update. I think the alignment check should be based on >> >> > the following note in copy_user_nocache. >> >> > >> >> > * Note: Cached memory copy is used when destination or size is not >> >> > * naturally aligned. That is: >> >> > * - Require 8-byte alignment when size is 8 bytes or larger. >> >> > * - Require 4-byte alignment when size is 4 bytes. >> >> > >> >> > So, I think the code may be something like this. I also made the >> >> > following >> >> changes: >> >> >> >> Thanks! >> >> >> >> > - Mask with 7, not 8. >> >> >> >> Yes, good catch. >> >> >> >> > - ALIGN with cacheline size, instead of 8. >> >> > - Add (bytes > flushed) test since calculation with unsigned long still >> results >> >> in a negative >> >> >value (as a positive value). >> >> > >> >> > if (bytes < 8) { >> >> > if ((dest & 3) || (bytes != 4)) >> >> > arch_wb_cache_pmem(addr, 1); >> >> > } else { >> >> > if (dest & 7) { >> >> > dest = ALIGN(dest, >> >> > boot_cpu_data.x86_clflush_size); >> >> >> >> Why align the destination to the next cacheline? As far as I can see >> >> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns >> >> to the next 8-byte boundary. >> > >> > The clflush here flushes for the cacheline size. So, we do not need to >> > flush >> > the same cacheline again when the unaligned tail is in the same line. >> >> Ok, makes sense. Last question, can't we reduce the check to be: >> >> if ((bytes > flushed) && ((bytes - flushed) & 3)) >> >> ...since if 'bytes' was 4-byte aligned we would have performed >> non-temporal stores. > > That is not documented behavior of copy_user_nocache, but as long as the pmem > version of copy_user_nocache follows the same implemented behavior, yes, that > works. Hmm, sorry this comment confuses me, I'm only referring to the current version of __copy_user_nocache not the new pmem version. 
The way I read the current code we only ever jump to the cached copy loop
(.L_1b_cache_copy_loop) if the trailing byte-count is 4-byte misaligned.
RE: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
> On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu > wrote: > >> > Thanks for the update. I think the alignment check should be based on > >> > the following note in copy_user_nocache. > >> > > >> > * Note: Cached memory copy is used when destination or size is not > >> > * naturally aligned. That is: > >> > * - Require 8-byte alignment when size is 8 bytes or larger. > >> > * - Require 4-byte alignment when size is 4 bytes. > >> > > >> > So, I think the code may be something like this. I also made the > >> > following > >> changes: > >> > >> Thanks! > >> > >> > - Mask with 7, not 8. > >> > >> Yes, good catch. > >> > >> > - ALIGN with cacheline size, instead of 8. > >> > - Add (bytes > flushed) test since calculation with unsigned long still > results > >> in a negative > >> >value (as a positive value). > >> > > >> > if (bytes < 8) { > >> > if ((dest & 3) || (bytes != 4)) > >> > arch_wb_cache_pmem(addr, 1); > >> > } else { > >> > if (dest & 7) { > >> > dest = ALIGN(dest, > >> > boot_cpu_data.x86_clflush_size); > >> > >> Why align the destination to the next cacheline? As far as I can see > >> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns > >> to the next 8-byte boundary. > > > > The clflush here flushes for the cacheline size. So, we do not need to > > flush > > the same cacheline again when the unaligned tail is in the same line. > > Ok, makes sense. Last question, can't we reduce the check to be: > > if ((bytes > flushed) && ((bytes - flushed) & 3)) > > ...since if 'bytes' was 4-byte aligned we would have performed > non-temporal stores. That is not documented behavior of copy_user_nocache, but as long as the pmem version of copy_user_nocache follows the same implemented behavior, yes, that works. > Can I add your Signed-off-by: on v3? Sure. Thanks, -Toshi ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu wrote: >> > Thanks for the update. I think the alignment check should be based on >> > the following note in copy_user_nocache. >> > >> > * Note: Cached memory copy is used when destination or size is not >> > * naturally aligned. That is: >> > * - Require 8-byte alignment when size is 8 bytes or larger. >> > * - Require 4-byte alignment when size is 4 bytes. >> > >> > So, I think the code may be something like this. I also made the following >> changes: >> >> Thanks! >> >> > - Mask with 7, not 8. >> >> Yes, good catch. >> >> > - ALIGN with cacheline size, instead of 8. >> > - Add (bytes > flushed) test since calculation with unsigned long still >> > results >> in a negative >> >value (as a positive value). >> > >> > if (bytes < 8) { >> > if ((dest & 3) || (bytes != 4)) >> > arch_wb_cache_pmem(addr, 1); >> > } else { >> > if (dest & 7) { >> > dest = ALIGN(dest, boot_cpu_data.x86_clflush_size); >> >> Why align the destination to the next cacheline? As far as I can see >> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns >> to the next 8-byte boundary. > > The clflush here flushes for the cacheline size. So, we do not need to flush > the same cacheline again when the unaligned tail is in the same line. Ok, makes sense. Last question, can't we reduce the check to be: if ((bytes > flushed) && ((bytes - flushed) & 3)) ...since if 'bytes' was 4-byte aligned we would have performed non-temporal stores. Can I add your Signed-off-by: on v3? ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
RE: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
> > Thanks for the update. I think the alignment check should be based on
> > the following note in copy_user_nocache.
> >
> >  * Note: Cached memory copy is used when destination or size is not
> >  * naturally aligned. That is:
> >  *  - Require 8-byte alignment when size is 8 bytes or larger.
> >  *  - Require 4-byte alignment when size is 4 bytes.
> >
> > So, I think the code may be something like this. I also made the following
> > changes:
>
> Thanks!
>
> > - Mask with 7, not 8.
>
> Yes, good catch.
>
> > - ALIGN with cacheline size, instead of 8.
> > - Add (bytes > flushed) test since calculation with unsigned long still
> >   results in a negative value (as a positive value).
> >
> >         if (bytes < 8) {
> >                 if ((dest & 3) || (bytes != 4))
> >                         arch_wb_cache_pmem(addr, 1);
> >         } else {
> >                 if (dest & 7) {
> >                         dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
>
> Why align the destination to the next cacheline? As far as I can see
> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
> to the next 8-byte boundary.

The clflush here flushes for the cacheline size. So, we do not need to flush
the same cacheline again when the unaligned tail is in the same line.

Thanks,
-Toshi
Re: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
On Mon, Apr 10, 2017 at 11:53 AM, Kani, Toshimitsu wrote: >> Subject: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache- >> bypass assumptions >> >> Before we rework the "pmem api" to stop abusing __copy_user_nocache() >> for memcpy_to_pmem() we need to fix cases where we may strand dirty data >> in the cpu cache. The problem occurs when copy_from_iter_pmem() is used >> for arbitrary data transfers from userspace. There is no guarantee that >> these transfers, performed by dax_iomap_actor(), will have aligned >> destinations or aligned transfer lengths. Backstop the usage >> __copy_user_nocache() with explicit cache management in these unaligned >> cases. >> >> Yes, copy_from_iter_pmem() is now too big for an inline, but addressing >> that is saved for a later patch that moves the entirety of the "pmem >> api" into the pmem driver directly. > : >> --- >> v2: Change the condition for flushing the last cacheline of the >> destination from 8-byte to 4-byte misalignment (Toshi) > : >> arch/x86/include/asm/pmem.h | 41 ++ > : >> @@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void >> *addr, size_t bytes, >> /* TODO: skip the write-back by always using non-temporal stores */ >> len = copy_from_iter_nocache(addr, bytes, i); >> >> -if (__iter_needs_pmem_wb(i)) >> +/* >> + * In the iovec case on x86_64 copy_from_iter_nocache() uses >> + * non-temporal stores for the bulk of the transfer, but we need >> + * to manually flush if the transfer is unaligned. In the >> + * non-iovec case the entire destination needs to be flushed. 
>> + */ >> +if (iter_is_iovec(i)) { >> +unsigned long dest = (unsigned long) addr; >> + >> +/* >> + * If the destination is not 8-byte aligned then >> + * __copy_user_nocache (on x86_64) uses cached copies >> + */ >> +if (dest & 8) { >> +arch_wb_cache_pmem(addr, 1); >> +dest = ALIGN(dest, 8); >> +} >> + >> +/* >> + * If the remaining transfer length, after accounting >> + * for destination alignment, is not 4-byte aligned >> + * then __copy_user_nocache() falls back to cached >> + * copies for the trailing bytes in the final cacheline >> + * of the transfer. >> + */ >> +if ((bytes - (dest - (unsigned long) addr)) & 4) >> +arch_wb_cache_pmem(addr + bytes - 1, 1); >> +} else >> arch_wb_cache_pmem(addr, bytes); >> >> return len; > > Thanks for the update. I think the alignment check should be based on > the following note in copy_user_nocache. > > * Note: Cached memory copy is used when destination or size is not > * naturally aligned. That is: > * - Require 8-byte alignment when size is 8 bytes or larger. > * - Require 4-byte alignment when size is 4 bytes. > > So, I think the code may be something like this. I also made the following > changes: Thanks! > - Mask with 7, not 8. Yes, good catch. > - ALIGN with cacheline size, instead of 8. > - Add (bytes > flushed) test since calculation with unsigned long still > results in a negative >value (as a positive value). > > if (bytes < 8) { > if ((dest & 3) || (bytes != 4)) > arch_wb_cache_pmem(addr, 1); > } else { > if (dest & 7) { > dest = ALIGN(dest, boot_cpu_data.x86_clflush_size); Why align the destination to the next cacheline? As far as I can see the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns to the next 8-byte boundary. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH] dax: fix radix tree insertion race
On Mon, Apr 10, 2017 at 03:41:11PM +0200, Jan Kara wrote:
> On Thu 06-04-17 15:29:44, Ross Zwisler wrote:
> > While running generic/340 in my test setup I hit the following race. It can
> > happen with kernels that support FS DAX PMDs, so v4.10 thru v4.11-rc5.
> >
> > Thread 1                                 Thread 2
> >
> > dax_iomap_pmd_fault()
> >   grab_mapping_entry()
> >     spin_lock_irq()
> >     get_unlocked_mapping_entry()
> >     'entry' is NULL, can't call lock_slot()
> >     spin_unlock_irq()
> >     radix_tree_preload()
> >                                          dax_iomap_pmd_fault()
> >                                            grab_mapping_entry()
> >                                              spin_lock_irq()
> >                                              get_unlocked_mapping_entry()
> >                                              ...
> >                                              lock_slot()
> >                                              spin_unlock_irq()
> >                                            dax_pmd_insert_mapping()
> >     spin_lock_irq()
> >     __radix_tree_insert() fails with -EEXIST
> >     when inserting a 4k entry where a PMD exists>
> >
> > The issue is that we have to drop mapping->tree_lock while calling
> > radix_tree_preload(), but since we didn't have a radix tree entry to lock
> > (unlike in the pmd_downgrade case) we have no protection against Thread 2
> > coming along and inserting a PMD at the same index. For 4k entries we
> > handled this with a special-case response to -EEXIST coming from the
> > __radix_tree_insert(), but this doesn't save us for PMDs because the
> > -EEXIST case can also mean that we collided with a 4k entry in the radix
> > tree at a different index, but one that is covered by our PMD range.
> >
> > So, correctly handle both the 4k and 2M collision cases by explicitly
> > re-checking the radix tree for an entry at our index once we reacquire
> > mapping->tree_lock.
> >
> > This patch has made it through a clean xfstests run with the current
> > v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
> > loop. It used to fail within the first 10 iterations.
> >
> > Signed-off-by: Ross Zwisler
> > Cc: [4.10+]
>
> The patch looks good to me (and I can see Andrew already sent it to Linus),
> I'm just wondering where did things actually go wrong?
> I'd expect we would
> return VM_FAULT_FALLBACK from dax_iomap_pmd_fault() and then do PTE fault
> for the address which should just work out fine...

Yep, that's what I thought as well, and I think it does work for processes
which have separate page tables. The second process will do a 4k fault (just
as it would have if it had a VMA that was smaller than 2MiB, for example),
map the 4k page into its page table and just dirty the 2MiB DAX entry in the
radix tree. I've tested this case manually in the past.

I think the error case that I was seeing was for threads that share page
tables. In that case the 2nd thread falls back to PTEs, but there is already
a PMD in the page table from the first fault. When we try and insert a PTE
over the PMD we get the following BUG:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: do_raw_spin_trylock+0x5/0x40
PGD 8d6ee0067
PUD 8db6e8067
PMD 0
Oops: [#1] PREEMPT SMP
Modules linked in: dax_pmem nd_pmem dax nd_btt nd_e820 libnvdimm [last unloaded: scsi_debug]
CPU: 2 PID: 25323 Comm: holetest Not tainted 4.11.0-rc4 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
task: 880095492a00 task.stack: c90014048000
RIP: 0010:do_raw_spin_trylock+0x5/0x40
RSP: :c9001404bb60 EFLAGS: 00010296
RAX: 880095492a00 RBX: 0018 RCX: RDX:
RSI: RDI: RBP: c9001404bb80
R08: 0001 R09: R10: 880095492a00 R11:
R12: R13: 8808d5fe4220 R14: 88004c3e3c80 R15: 8025
FS: 7f7ed7dff700() GS:8808de40() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 0008d86f6000 CR4: 001406e0
Call Trace:
 ? _raw_spin_lock+0x49/0x80
 ? __get_locked_pte+0x16b/0x1d0
 __get_locked_pte+0x16b/0x1d0
 insert_pfn.isra.68+0x3a/0x100
 vm_insert_mixed+0x64/0x90
 dax_iomap_fault+0xa41/0x1680
 ext4_dax_huge_fault+0xa9/0xd0
 ext4_dax_fault+0x10/0x20
 __do_fault+0x20/0x130
 __handle_mm_fault+0x9b3/0x1190
 handle_mm_fault+0x169/0x370
 ? handle_mm_fault+0x47/0x370
 __do_page_fault+0x28f/0x590
 trace_do_page_fault+0x58/0x2c0
 do_async_page_fault+0x2c/0x90
 async_page_fault+0x28/0x30
RIP: 0033:0x4014b2
RSP: 002b:7f7ed7dfef20 EFLAGS: 00010216
RAX: 7f7ec6c00400 RBX: 0001 RCX: 01c0
RDX: 1c01 RSI: RDI: RBP: 7f7ed7dff700
R08: 7f
Re: KASLR causes intermittent boot failures on some systems
Kees Cook writes:

> On Mon, Apr 10, 2017 at 11:22 AM, Jeff Moyer wrote:
>> Kees Cook writes:
>>
>>> On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer wrote:
>>>> Kees Cook writes:
>>>>
>>>>> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer wrote:
>>>>>> Hi,
>>>>>>
>>>>>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>>>>>> regions") causes some of my systems with persistent memory (whether real
>>>>>> or emulated) to fail to boot with a couple of different crash
>>>>>> signatures. The first signature is a NMI watchdog lockup of all but 1
>>>>>> cpu, which causes much difficulty in extracting useful information from
>>>>>> the console. The second variant is an invalid paging request, listed
>>>>>> below.
>>>>>
>>>>> Just to rule out some of the stuff in the boot path, does booting with
>>>>> "nokaslr" solve this? (i.e. I want to figure out if this is from some
>>>>> of the rearrangements done that are exposed under that commit, or if
>>>>> it is genuinely the randomization that is killing the systems...)
>>>>
>>>> Adding "nokaslr" to the boot line does indeed make the problem go away.
>>>
>>> Are you booting with a memmap= flag?
>>
>> From my first email:
>>
>> [ 0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc5+
>> root=/dev/mapper/rhel_intel--lizardhead--04-root ro memmap=192G!1024G
>> crashkernel=auto rd.lvm.lv=rhel_intel-lizardhead-04/root
>> rd.lvm.lv=rhel_intel-lizardhead-04/swap console=ttyS0,115200n81
>> LANG=en_US.UTF-8
>>
>> Did you not receive the attachments?
>
> I see it now, thanks!
>
> The memmap parsing was added in -rc1 (f28442497b5ca), so I'd expect
> that to be handled. Hmmm.

I can also reproduce this on a system with real persistent memory, which
does not require the memmap parameter.

Cheers,
Jeff
Re: KASLR causes intermittent boot failures on some systems
On Mon, Apr 10, 2017 at 11:22 AM, Jeff Moyer wrote:
> Kees Cook writes:
>
>> On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer wrote:
>>> Kees Cook writes:
>>>
>>>> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer wrote:
>>>>> Hi,
>>>>>
>>>>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>>>>> regions") causes some of my systems with persistent memory (whether real
>>>>> or emulated) to fail to boot with a couple of different crash
>>>>> signatures. The first signature is a NMI watchdog lockup of all but 1
>>>>> cpu, which causes much difficulty in extracting useful information from
>>>>> the console. The second variant is an invalid paging request, listed
>>>>> below.
>>>>
>>>> Just to rule out some of the stuff in the boot path, does booting with
>>>> "nokaslr" solve this? (i.e. I want to figure out if this is from some
>>>> of the rearrangements done that are exposed under that commit, or if
>>>> it is genuinely the randomization that is killing the systems...)
>>>
>>> Adding "nokaslr" to the boot line does indeed make the problem go away.
>>
>> Are you booting with a memmap= flag?
>
> From my first email:
>
> [ 0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc5+
> root=/dev/mapper/rhel_intel--lizardhead--04-root ro memmap=192G!1024G
> crashkernel=auto rd.lvm.lv=rhel_intel-lizardhead-04/root
> rd.lvm.lv=rhel_intel-lizardhead-04/swap console=ttyS0,115200n81
> LANG=en_US.UTF-8
>
> Did you not receive the attachments?

I see it now, thanks!

The memmap parsing was added in -rc1 (f28442497b5ca), so I'd expect
that to be handled. Hmmm.

-Kees

--
Kees Cook
Pixel Security
RE: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
> Subject: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache- > bypass assumptions > > Before we rework the "pmem api" to stop abusing __copy_user_nocache() > for memcpy_to_pmem() we need to fix cases where we may strand dirty data > in the cpu cache. The problem occurs when copy_from_iter_pmem() is used > for arbitrary data transfers from userspace. There is no guarantee that > these transfers, performed by dax_iomap_actor(), will have aligned > destinations or aligned transfer lengths. Backstop the usage > __copy_user_nocache() with explicit cache management in these unaligned > cases. > > Yes, copy_from_iter_pmem() is now too big for an inline, but addressing > that is saved for a later patch that moves the entirety of the "pmem > api" into the pmem driver directly. : > --- > v2: Change the condition for flushing the last cacheline of the > destination from 8-byte to 4-byte misalignment (Toshi) : > arch/x86/include/asm/pmem.h | 41 ++ : > @@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void > *addr, size_t bytes, > /* TODO: skip the write-back by always using non-temporal stores */ > len = copy_from_iter_nocache(addr, bytes, i); > > - if (__iter_needs_pmem_wb(i)) > + /* > + * In the iovec case on x86_64 copy_from_iter_nocache() uses > + * non-temporal stores for the bulk of the transfer, but we need > + * to manually flush if the transfer is unaligned. In the > + * non-iovec case the entire destination needs to be flushed. 
> +	 */
> +	if (iter_is_iovec(i)) {
> +		unsigned long dest = (unsigned long) addr;
> +
> +		/*
> +		 * If the destination is not 8-byte aligned then
> +		 * __copy_user_nocache (on x86_64) uses cached copies
> +		 */
> +		if (dest & 8) {
> +			arch_wb_cache_pmem(addr, 1);
> +			dest = ALIGN(dest, 8);
> +		}
> +
> +		/*
> +		 * If the remaining transfer length, after accounting
> +		 * for destination alignment, is not 4-byte aligned
> +		 * then __copy_user_nocache() falls back to cached
> +		 * copies for the trailing bytes in the final cacheline
> +		 * of the transfer.
> +		 */
> +		if ((bytes - (dest - (unsigned long) addr)) & 4)
> +			arch_wb_cache_pmem(addr + bytes - 1, 1);
> +	} else
> 		arch_wb_cache_pmem(addr, bytes);
>
> 	return len;

Thanks for the update. I think the alignment check should be based on
the following note in copy_user_nocache.

 * Note: Cached memory copy is used when destination or size is not
 * naturally aligned. That is:
 *  - Require 8-byte alignment when size is 8 bytes or larger.
 *  - Require 4-byte alignment when size is 4 bytes.

So, I think the code may be something like this. I also made the following
changes:

- Mask with 7, not 8.
- ALIGN with cacheline size, instead of 8.
- Add (bytes > flushed) test since calculation with unsigned long still
  results in a negative value (as a positive value).

	if (bytes < 8) {
		if ((dest & 3) || (bytes != 4))
			arch_wb_cache_pmem(addr, 1);
	} else {
		if (dest & 7) {
			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
			arch_wb_cache_pmem(addr, 1);
		}

		flushed = dest - (unsigned long) addr;
		if ((bytes > flushed) && ((bytes - flushed) & 7))
			arch_wb_cache_pmem(addr + bytes - 1, 1);
	}

Thanks,
-Toshi
Re: KASLR causes intermittent boot failures on some systems
Kees Cook writes:

> On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer wrote:
>> Kees Cook writes:
>>
>>> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer wrote:
>>>> Hi,
>>>>
>>>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>>>> regions") causes some of my systems with persistent memory (whether real
>>>> or emulated) to fail to boot with a couple of different crash
>>>> signatures. The first signature is a NMI watchdog lockup of all but 1
>>>> cpu, which causes much difficulty in extracting useful information from
>>>> the console. The second variant is an invalid paging request, listed
>>>> below.
>>>
>>> Just to rule out some of the stuff in the boot path, does booting with
>>> "nokaslr" solve this? (i.e. I want to figure out if this is from some
>>> of the rearrangements done that are exposed under that commit, or if
>>> it is genuinely the randomization that is killing the systems...)
>>
>> Adding "nokaslr" to the boot line does indeed make the problem go away.
>
> Are you booting with a memmap= flag?

From my first email:

[ 0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc5+
root=/dev/mapper/rhel_intel--lizardhead--04-root ro memmap=192G!1024G
crashkernel=auto rd.lvm.lv=rhel_intel-lizardhead-04/root
rd.lvm.lv=rhel_intel-lizardhead-04/swap console=ttyS0,115200n81
LANG=en_US.UTF-8

Did you not receive the attachments?

Cheers,
Jeff
Re: KASLR causes intermittent boot failures on some systems
On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer wrote:
> Kees Cook writes:
>
>> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer wrote:
>>> Hi,
>>>
>>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>>> regions") causes some of my systems with persistent memory (whether real
>>> or emulated) to fail to boot with a couple of different crash
>>> signatures. The first signature is a NMI watchdog lockup of all but 1
>>> cpu, which causes much difficulty in extracting useful information from
>>> the console. The second variant is an invalid paging request, listed
>>> below.
>>
>> Just to rule out some of the stuff in the boot path, does booting with
>> "nokaslr" solve this? (i.e. I want to figure out if this is from some
>> of the rearrangements done that are exposed under that commit, or if
>> it is genuinely the randomization that is killing the systems...)
>
> Adding "nokaslr" to the boot line does indeed make the problem go away.

Are you booting with a memmap= flag?

-Kees

--
Kees Cook
Pixel Security
Re: [PATCH v5 3/4] libnvdimm: add support for clear poison list and badblocks for device dax
On 04/07/2017 05:46 PM, Dan Williams wrote:
> On Fri, Apr 7, 2017 at 5:31 PM, Dan Williams wrote:
>> On Fri, Apr 7, 2017 at 3:07 PM, Dave Jiang wrote:
>>> Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR
>>> call. We will update the poison list and also the badblocks at region level
>>> if the region is in dax mode or in pmem mode and not active. In other
>>> words we force badblocks to be cleared through write requests if the
>>> address is currently accessed through a block device, otherwise it can
>>> only be done via the ioctl+dsm path.
>>>
>>> Signed-off-by: Dave Jiang
>>> Reviewed-by: Johannes Thumshirn
>>> ---
>>>  drivers/nvdimm/bus.c      | 78 +
>>>  drivers/nvdimm/core.c     | 17 --
>>>  drivers/nvdimm/region.c   | 25 ++
>>>  include/linux/libnvdimm.h |  7 +++-
>>>  4 files changed, 115 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
>>> index 351bac8..af64e9c 100644
>>> --- a/drivers/nvdimm/bus.c
>>> +++ b/drivers/nvdimm/bus.c
>>> @@ -27,6 +27,7 @@
>>>  #include
>>>  #include "nd-core.h"
>>>  #include "nd.h"
>>> +#include "pfn.h"
>>>
>>>  int nvdimm_major;
>>>  static int nvdimm_bus_major;
>>> @@ -218,11 +219,19 @@ long nvdimm_clear_poison(struct device *dev, phys_addr_t phys,
>>>         if (cmd_rc < 0)
>>>                 return cmd_rc;
>>>
>>> -       nvdimm_clear_from_poison_list(nvdimm_bus, phys, len);
>>>         return clear_err.cleared;
>>
>> This seems like a typo. We want to clear poison in the pmem write path.
>
> Changing this into an nvdimm_forget_poison() call works (passes your
> daxdev-errors.sh test), and that's what I've pushed out to my pending
> branch to let 0day beat up on it a bit.

Thanks. For some reason I was thinking __nvdimm_forget_poison() is being
called through ->ndctl() and so we didn't need it.
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
On 10/04/17 02:29 AM, Sagi Grimberg wrote:
> What you are saying is surprising to me. The switch needs to preserve
> ordering across different switch ports ??
> You are suggesting that there is a *switch-wide* state that tracks
> MemRds never pass MemWrs across all the switch ports? That is a very
> non-trivial statement...

Yes, it is a requirement of the PCIe spec that transactions be strongly
ordered throughout the fabric, so switches must maintain this ordering
state across all of their ports. Without that, it would be impossible for
PCI cards to work together even when they use system memory to do so.
Also, I believe it was done this way to maintain maximum compatibility
with the legacy PCI bus. There is also a relaxed-ordering bit that allows
specific transactions to ignore ordering, which can help performance.

Obviously this becomes impossible if you have some kind of complex
multi-path fabric.

Logan
Re: KASLR causes intermittent boot failures on some systems
Baoquan He writes:
> On 04/07/17 at 10:41am, Jeff Moyer wrote:
>> Hi,
>>
>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>> regions") causes some of my systems with persistent memory (whether real
>> or emulated) to fail to boot with a couple of different crash
>> signatures. The first signature is a NMI watchdog lockup of all but 1
>> cpu, which causes much difficulty in extracting useful information from
>> the console. The second variant is an invalid paging request, listed
>> below.
>>
>> On some systems, I haven't hit this problem at all. Other systems
>> experience a failed boot maybe 20-30% of the time. To reproduce it,
>> configure some emulated pmem on your system. You can find directions
>> for that here: https://nvdimm.wiki.kernel.org/
>
> I got below problem when configure ndctl, didn't find a package named
> libkmod:

Sorry, you can skip the ndctl step altogether. This is emulated pmem, so
the setting won't stick anyway.

-Jeff
Re: KASLR causes intermittent boot failures on some systems
Kees Cook writes:
> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer wrote:
>> Hi,
>>
>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>> regions") causes some of my systems with persistent memory (whether real
>> or emulated) to fail to boot with a couple of different crash
>> signatures. The first signature is a NMI watchdog lockup of all but 1
>> cpu, which causes much difficulty in extracting useful information from
>> the console. The second variant is an invalid paging request, listed
>> below.
>
> Just to rule out some of the stuff in the boot path, does booting with
> "nokaslr" solve this? (i.e. I want to figure out if this is from some
> of the rearrangements done that are exposed under that commit, or if
> it is genuinely the randomization that is killing the systems...)

Adding "nokaslr" to the boot line does indeed make the problem go away.

Cheers,
Jeff
Re: [PATCH] dax: fix radix tree insertion race
On Thu 06-04-17 15:29:44, Ross Zwisler wrote:
> While running generic/340 in my test setup I hit the following race. It can
> happen with kernels that support FS DAX PMDs, so v4.10 thru v4.11-rc5.
>
> Thread 1                                  Thread 2
> --------                                  --------
> dax_iomap_pmd_fault()
>   grab_mapping_entry()
>     spin_lock_irq()
>     get_unlocked_mapping_entry()
>     'entry' is NULL, can't call lock_slot()
>     spin_unlock_irq()
>     radix_tree_preload()
>                                           dax_iomap_pmd_fault()
>                                             grab_mapping_entry()
>                                               spin_lock_irq()
>                                               get_unlocked_mapping_entry()
>                                               ...
>                                               lock_slot()
>                                               spin_unlock_irq()
>                                             dax_pmd_insert_mapping()
>     spin_lock_irq()
>     __radix_tree_insert() fails with -EEXIST
>       <when inserting a 4k entry where a PMD exists>
>
> The issue is that we have to drop mapping->tree_lock while calling
> radix_tree_preload(), but since we didn't have a radix tree entry to lock
> (unlike in the pmd_downgrade case) we have no protection against Thread 2
> coming along and inserting a PMD at the same index. For 4k entries we
> handled this with a special-case response to -EEXIST coming from the
> __radix_tree_insert(), but this doesn't save us for PMDs because the
> -EEXIST case can also mean that we collided with a 4k entry in the radix
> tree at a different index, but one that is covered by our PMD range.
>
> So, correctly handle both the 4k and 2M collision cases by explicitly
> re-checking the radix tree for an entry at our index once we reacquire
> mapping->tree_lock.
>
> This patch has made it through a clean xfstests run with the current
> v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
> loop. It used to fail within the first 10 iterations.
>
> Signed-off-by: Ross Zwisler
> Cc: [4.10+]

The patch looks good to me (and I can see Andrew already sent it to Linus),
I'm just wondering where did things actually go wrong? I'd expect we would
return VM_FAULT_FALLBACK from dax_iomap_pmd_fault() and then do PTE fault
for the address which should just work out fine...
								Honza

> ---
>  fs/dax.c | 35 ++-
>  1 file changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index de622d4..85abd74 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -373,6 +373,22 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
>  	}
>  	spin_lock_irq(&mapping->tree_lock);
>
> +	if (!entry) {
> +		/*
> +		 * We needed to drop the page_tree lock while calling
> +		 * radix_tree_preload() and we didn't have an entry to
> +		 * lock.  See if another thread inserted an entry at
> +		 * our index during this time.
> +		 */
> +		entry = __radix_tree_lookup(&mapping->page_tree, index,
> +				NULL, &slot);
> +		if (entry) {
> +			radix_tree_preload_end();
> +			spin_unlock_irq(&mapping->tree_lock);
> +			goto restart;
> +		}
> +	}
> +
>  	if (pmd_downgrade) {
>  		radix_tree_delete(&mapping->page_tree, index);
>  		mapping->nrexceptional--;
> @@ -388,19 +404,12 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
>  	if (err) {
>  		spin_unlock_irq(&mapping->tree_lock);
>  		/*
> -		 * Someone already created the entry?  This is a
> -		 * normal failure when inserting PMDs in a range
> -		 * that already contains PTEs.  In that case we want
> -		 * to return -EEXIST immediately.
> -		 */
> -		if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
> -			goto restart;
> -		/*
> -		 * Our insertion of a DAX PMD entry failed, most
> -		 * likely because it collided with a PTE sized entry
> -		 * at a different index in the PMD range.  We haven't
> -		 * inserted anything into the radix tree and have no
> -		 * waiters to wake.
> +		 * Our insertion of a DAX entry failed, most likely
> +		 * because we were insert
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
> Sagi
>
> As long as legA, legB and the RC are all connected to the same switch
> then ordering will be preserved (I think many other topologies also
> work). Here is how it would work for the problem case you are concerned
> about (which is a read from the NVMe drive).
>
> 1. Disk device DMAs out the data to the p2pmem device via a string of
>    PCIe MemWr TLPs.
>
> 2. Disk device writes to the completion queue (in system memory) via a
>    MemWr TLP.
>
> 3. The last of the MemWrs from step 1 might have got stalled in the
>    PCIe switch due to congestion, but if so they are stalled in the
>    egress path of the switch for the p2pmem port.
>
> 4. The RC determines the IO is complete when the TLP associated with
>    step 2 updates the memory associated with the CQ. It issues some
>    operation to read the p2pmem.
>
> 5. Regardless of whether the MemRd TLP comes from the RC or another
>    device connected to the switch, it is queued in the egress queue
>    for the p2pmem port behind the last DMA TLP (from step 1). PCIe
>    ordering ensures that this MemRd cannot overtake the MemWr (reads
>    can never pass writes). Therefore the MemRd can never get to the
>    p2pmem device until after the last DMA MemWr has.

What you are saying is surprising to me. The switch needs to preserve
ordering across different switch ports ??

You are suggesting that there is a *switch-wide* state that tracks
MemRds never pass MemWrs across all the switch ports? That is a very
non-trivial statement...