[PATCH v3] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-10 Thread Dan Williams
Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage
__copy_user_nocache() with explicit cache management in these unaligned
cases.

Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.

Fixes: 5de490daec8b ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: 
Cc: 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: "H. Peter Anvin" 
Cc: Al Viro 
Cc: Thomas Gleixner 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Toshi Kani 
Signed-off-by: Dan Williams 
---
Changes in v3:
* match the implementation to the notes at the top of
  __copy_user_nocache (Toshi)

* Switch to using the IS_ALIGNED() macro to make alignment checks easier
  to read and harder to get wrong like they were in v2. (Toshi, Dan)

 arch/x86/include/asm/pmem.h |   42 +++---
 1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 2c1ebeb4d737..529bb4a6487a 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -55,7 +55,8 @@ static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
  * @size:  number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
- * instruction.
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
  */
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
@@ -69,15 +70,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
clwb(p);
 }
 
-/*
- * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
- * iterators, so for other types (bvec & kvec) we must do a cache write-back.
- */
-static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
-{
-   return iter_is_iovec(i) == false;
-}
-
 /**
  * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
  * @addr:  PMEM destination address
@@ -94,7 +86,35 @@ static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
/* TODO: skip the write-back by always using non-temporal stores */
len = copy_from_iter_nocache(addr, bytes, i);
 
-   if (__iter_needs_pmem_wb(i))
+   /*
+* In the iovec case on x86_64 copy_from_iter_nocache() uses
+* non-temporal stores for the bulk of the transfer, but we need
+* to manually flush if the transfer is unaligned. A cached
+* memory copy is used when destination or size is not naturally
+* aligned. That is:
+*   - Require 8-byte alignment when size is 8 bytes or larger.
+*   - Require 4-byte alignment when size is 4 bytes.
+*
+* In the non-iovec case the entire destination needs to be
+* flushed.
+*/
+   if (iter_is_iovec(i)) {
+   unsigned long flushed, dest = (unsigned long) addr;
+
+   if (bytes < 8) {
+   if (!IS_ALIGNED(dest, 4) || (bytes != 4))
+   arch_wb_cache_pmem(addr, 1);
+   } else {
+   if (!IS_ALIGNED(dest, 8)) {
+   dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+   arch_wb_cache_pmem(addr, 1);
+   }
+
+   flushed = dest - (unsigned long) addr;
+   if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
+   arch_wb_cache_pmem(addr + bytes - 1, 1);
+   }
+   } else
arch_wb_cache_pmem(addr, bytes);
 
return len;
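
For readers who want to poke at the alignment rules outside the kernel tree, below is a minimal standalone sketch of the same decision logic. It is an illustration only, not kernel code: IS_ALIGNED()/ALIGN() are re-defined here to mirror the kernel macros, CLFLUSH_SIZE stands in for boot_cpu_data.x86_clflush_size, and printf() stands in for the arch_wb_cache_pmem() calls the patch actually makes.

#include <stddef.h>
#include <stdio.h>

#define CLFLUSH_SIZE		64UL	/* stand-in for boot_cpu_data.x86_clflush_size */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))

/*
 * Report which ends of [dest, dest + bytes) the patch above would write
 * back manually after the nocache copy: the head cache line when the
 * destination is not naturally aligned, and the tail cache line when
 * the remaining length is not a multiple of 8.
 */
static void report_flushes(unsigned long dest, size_t bytes)
{
	unsigned long flushed = 0;

	if (bytes < 8) {
		if (!IS_ALIGNED(dest, 4) || bytes != 4)
			printf("flush head line at %#lx\n", dest);
		return;
	}

	if (!IS_ALIGNED(dest, 8)) {
		printf("flush head line at %#lx\n", dest);
		flushed = ALIGN(dest, CLFLUSH_SIZE) - dest;
	}
	if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
		printf("flush tail line at %#lx\n", dest + bytes - 1);
}

int main(void)
{
	report_flushes(0x1003, 100);	/* unaligned start: head and tail write-backs */
	report_flushes(0x1000, 100);	/* aligned start, 4-byte tail: tail write-back */
	report_flushes(0x1000, 64);	/* fully aligned: no write-backs */
	return 0;
}

Compiled standalone, the three calls in main() show an unaligned transfer needing head and tail write-backs, an aligned-start transfer needing only a tail write-back, and a fully aligned transfer needing none.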



RE: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-10 Thread Kani, Toshimitsu
> >> > The clflush here flushes for the cacheline size.  So, we do not need to flush
> >> > the same cacheline again when the unaligned tail is in the same line.
> >>
> >> Ok, makes sense. Last question, can't we reduce the check to be:
> >>
> >> if ((bytes > flushed) && ((bytes - flushed) & 3))
> >>
> >> ...since if 'bytes' was 4-byte aligned we would have performed
> >> non-temporal stores.
> >
> > That is not documented behavior of copy_user_nocache, but as long as the pmem
> > version of copy_user_nocache follows the same implemented behavior, yes, that
> > works.
> 
> Hmm, sorry this comment confuses me, I'm only referring to the current
> version of __copy_user_nocache not the new pmem version. The way I
> read the current code we only ever jump to the cached copy loop
> (.L_1b_cache_copy_loop) if the trailing byte-count is 4-byte
> misaligned.

Yes, you are right, and that's how the code is implemented.  I added this
trailing 4-byte handling for the >=8B case, which is shared with the <8B case,
since it was easy to do.  But I considered it a bonus.  This function would also
need to handle a 4B-aligned destination if it were to state that it handles 4B
alignment for the >=8B case as well; otherwise it's inconsistent.  Since I did
not see much point in supporting such a case, I simply documented in the Note
that 8-byte alignment is required for the >=8B case.

Thanks,
-Toshi
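
As a reference point for the rule being discussed, here is a minimal sketch (a paraphrase with a hypothetical helper name, not kernel code) of the documented Note in __copy_user_nocache(): a cached copy may be used when the destination or size is not naturally aligned, i.e. 8-byte alignment is required when the size is 8 bytes or larger and 4-byte alignment when the size is 4 bytes. It deliberately ignores the undocumented trailing-4-byte "bonus" handling described above.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: true when __copy_user_nocache() is allowed,
 * per its Note, to route some or all of the copy through the cache. */
bool nocache_copy_may_use_cache(unsigned long dest, size_t size)
{
	if (size >= 8)
		return (dest & 7) || (size & 7);
	if (size == 4)
		return (dest & 3) != 0;
	return true;	/* other small sizes: no non-temporal guarantee */
}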
 


Re: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-10 Thread Dan Williams
On Mon, Apr 10, 2017 at 2:45 PM, Kani, Toshimitsu  wrote:
>> On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu 
>> wrote:
>> >> > Thanks for the update.  I think the alignment check should be based on
>> >> > the following note in copy_user_nocache.
>> >> >
>> >> >  * Note: Cached memory copy is used when destination or size is not
>> >> >  * naturally aligned. That is:
>> >> >  *  - Require 8-byte alignment when size is 8 bytes or larger.
>> >> >  *  - Require 4-byte alignment when size is 4 bytes.
>> >> >
>> >> > So, I think the code may be something like this.  I also made the 
>> >> > following
>> >> changes:
>> >>
>> >> Thanks!
>> >>
>> >> >  - Mask with 7, not 8.
>> >>
>> >> Yes, good catch.
>> >>
>> >> >  - ALIGN with cacheline size, instead of 8.
>> >> >  - Add (bytes > flushed) test since calculation with unsigned long still
>> results
>> >> in a negative
>> >> >value (as a positive value).
>> >> >
>> >> > if (bytes < 8) {
>> >> > if ((dest & 3) || (bytes != 4))
>> >> > arch_wb_cache_pmem(addr, 1);
>> >> > } else {
>> >> > if (dest & 7) {
>> >> > dest = ALIGN(dest, 
>> >> > boot_cpu_data.x86_clflush_size);
>> >>
>> >> Why align the destination to the next cacheline? As far as I can see
>> >> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
>> >> to the next 8-byte boundary.
>> >
>> > The clflush here flushes for the cacheline size.  So, we do not need to 
>> > flush
>> > the same cacheline again when the unaligned tail is in the same line.
>>
>> Ok, makes sense. Last question, can't we reduce the check to be:
>>
>> if ((bytes > flushed) && ((bytes - flushed) & 3))
>>
>> ...since if 'bytes' was 4-byte aligned we would have performed
>> non-temporal stores.
>
> That is not documented behavior of copy_user_nocache, but as long as the pmem
> version of copy_user_nocache follows the same implemented behavior, yes, that
> works.

Hmm, sorry this comment confuses me, I'm only referring to the current
version of __copy_user_nocache not the new pmem version. The way I
read the current code we only ever jump to the cached copy loop
(.L_1b_cache_copy_loop) if the trailing byte-count is 4-byte
misaligned.


RE: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-10 Thread Kani, Toshimitsu
> On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu 
> wrote:
> >> > Thanks for the update.  I think the alignment check should be based on
> >> > the following note in copy_user_nocache.
> >> >
> >> >  * Note: Cached memory copy is used when destination or size is not
> >> >  * naturally aligned. That is:
> >> >  *  - Require 8-byte alignment when size is 8 bytes or larger.
> >> >  *  - Require 4-byte alignment when size is 4 bytes.
> >> >
> >> > So, I think the code may be something like this.  I also made the 
> >> > following
> >> changes:
> >>
> >> Thanks!
> >>
> >> >  - Mask with 7, not 8.
> >>
> >> Yes, good catch.
> >>
> >> >  - ALIGN with cacheline size, instead of 8.
> >> >  - Add (bytes > flushed) test since calculation with unsigned long still
> results
> >> in a negative
> >> >value (as a positive value).
> >> >
> >> > if (bytes < 8) {
> >> > if ((dest & 3) || (bytes != 4))
> >> > arch_wb_cache_pmem(addr, 1);
> >> > } else {
> >> > if (dest & 7) {
> >> > dest = ALIGN(dest, 
> >> > boot_cpu_data.x86_clflush_size);
> >>
> >> Why align the destination to the next cacheline? As far as I can see
> >> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
> >> to the next 8-byte boundary.
> >
> > The clflush here flushes for the cacheline size.  So, we do not need to 
> > flush
> > the same cacheline again when the unaligned tail is in the same line.
> 
> Ok, makes sense. Last question, can't we reduce the check to be:
> 
> if ((bytes > flushed) && ((bytes - flushed) & 3))
> 
> ...since if 'bytes' was 4-byte aligned we would have performed
> non-temporal stores.

That is not documented behavior of copy_user_nocache, but as long as the pmem
version of copy_user_nocache follows the same implemented behavior, yes, that
works.

> Can I add your Signed-off-by: on v3?

Sure.

Thanks,
-Toshi



Re: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-10 Thread Dan Williams
On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu  wrote:
>> > Thanks for the update.  I think the alignment check should be based on
>> > the following note in copy_user_nocache.
>> >
>> >  * Note: Cached memory copy is used when destination or size is not
>> >  * naturally aligned. That is:
>> >  *  - Require 8-byte alignment when size is 8 bytes or larger.
>> >  *  - Require 4-byte alignment when size is 4 bytes.
>> >
>> > So, I think the code may be something like this.  I also made the following
>> changes:
>>
>> Thanks!
>>
>> >  - Mask with 7, not 8.
>>
>> Yes, good catch.
>>
>> >  - ALIGN with cacheline size, instead of 8.
>> >  - Add (bytes > flushed) test since calculation with unsigned long still 
>> > results
>> in a negative
>> >value (as a positive value).
>> >
>> > if (bytes < 8) {
>> > if ((dest & 3) || (bytes != 4))
>> > arch_wb_cache_pmem(addr, 1);
>> > } else {
>> > if (dest & 7) {
>> > dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
>>
>> Why align the destination to the next cacheline? As far as I can see
>> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
>> to the next 8-byte boundary.
>
> The clflush here flushes for the cacheline size.  So, we do not need to flush
> the same cacheline again when the unaligned tail is in the same line.

Ok, makes sense. Last question, can't we reduce the check to be:

if ((bytes > flushed) && ((bytes - flushed) & 3))

...since if 'bytes' was 4-byte aligned we would have performed
non-temporal stores.

Can I add your Signed-off-by: on v3?


RE: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-10 Thread Kani, Toshimitsu
> > Thanks for the update.  I think the alignment check should be based on
> > the following note in copy_user_nocache.
> >
> >  * Note: Cached memory copy is used when destination or size is not
> >  * naturally aligned. That is:
> >  *  - Require 8-byte alignment when size is 8 bytes or larger.
> >  *  - Require 4-byte alignment when size is 4 bytes.
> >
> > So, I think the code may be something like this.  I also made the following
> changes:
> 
> Thanks!
> 
> >  - Mask with 7, not 8.
> 
> Yes, good catch.
> 
> >  - ALIGN with cacheline size, instead of 8.
> >  - Add (bytes > flushed) test since calculation with unsigned long still 
> > results
> in a negative
> >value (as a positive value).
> >
> > if (bytes < 8) {
> > if ((dest & 3) || (bytes != 4))
> > arch_wb_cache_pmem(addr, 1);
> > } else {
> > if (dest & 7) {
> > dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
> 
> Why align the destination to the next cacheline? As far as I can see
> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
> to the next 8-byte boundary.

The clflush here flushes for the cacheline size.  So, we do not need to flush
the same cacheline again when the unaligned tail is in the same line.

Thanks,
-Toshi


Re: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-10 Thread Dan Williams
On Mon, Apr 10, 2017 at 11:53 AM, Kani, Toshimitsu  wrote:
>> Subject: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-
>> bypass assumptions
>>
>> Before we rework the "pmem api" to stop abusing __copy_user_nocache()
>> for memcpy_to_pmem() we need to fix cases where we may strand dirty data
>> in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
>> for arbitrary data transfers from userspace. There is no guarantee that
>> these transfers, performed by dax_iomap_actor(), will have aligned
>> destinations or aligned transfer lengths. Backstop the usage
>> __copy_user_nocache() with explicit cache management in these unaligned
>> cases.
>>
>> Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
>> that is saved for a later patch that moves the entirety of the "pmem
>> api" into the pmem driver directly.
>  :
>> ---
>> v2: Change the condition for flushing the last cacheline of the
>> destination from 8-byte to 4-byte misalignment (Toshi)
>  :
>>  arch/x86/include/asm/pmem.h |   41 ++
>  :
>> @@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void
>> *addr, size_t bytes,
>>  /* TODO: skip the write-back by always using non-temporal stores */
>>  len = copy_from_iter_nocache(addr, bytes, i);
>>
>> -if (__iter_needs_pmem_wb(i))
>> +/*
>> + * In the iovec case on x86_64 copy_from_iter_nocache() uses
>> + * non-temporal stores for the bulk of the transfer, but we need
>> + * to manually flush if the transfer is unaligned. In the
>> + * non-iovec case the entire destination needs to be flushed.
>> + */
>> +if (iter_is_iovec(i)) {
>> +unsigned long dest = (unsigned long) addr;
>> +
>> +/*
>> + * If the destination is not 8-byte aligned then
>> + * __copy_user_nocache (on x86_64) uses cached copies
>> + */
>> +if (dest & 8) {
>> +arch_wb_cache_pmem(addr, 1);
>> +dest = ALIGN(dest, 8);
>> +}
>> +
>> +/*
>> + * If the remaining transfer length, after accounting
>> + * for destination alignment, is not 4-byte aligned
>> + * then __copy_user_nocache() falls back to cached
>> + * copies for the trailing bytes in the final cacheline
>> + * of the transfer.
>> + */
>> +if ((bytes - (dest - (unsigned long) addr)) & 4)
>> +arch_wb_cache_pmem(addr + bytes - 1, 1);
>> +} else
>>  arch_wb_cache_pmem(addr, bytes);
>>
>>  return len;
>
> Thanks for the update.  I think the alignment check should be based on
> the following note in copy_user_nocache.
>
>  * Note: Cached memory copy is used when destination or size is not
>  * naturally aligned. That is:
>  *  - Require 8-byte alignment when size is 8 bytes or larger.
>  *  - Require 4-byte alignment when size is 4 bytes.
>
> So, I think the code may be something like this.  I also made the following 
> changes:

Thanks!

>  - Mask with 7, not 8.

Yes, good catch.

>  - ALIGN with cacheline size, instead of 8.
>  - Add (bytes > flushed) test since calculation with unsigned long still
>    results in a negative value (as a positive value).
>
> if (bytes < 8) {
> if ((dest & 3) || (bytes != 4))
> arch_wb_cache_pmem(addr, 1);
> } else {
> if (dest & 7) {
> dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);

Why align the destination to the next cacheline? As far as I can see
the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
to the next 8-byte boundary.


Re: [PATCH] dax: fix radix tree insertion race

2017-04-10 Thread Ross Zwisler
On Mon, Apr 10, 2017 at 03:41:11PM +0200, Jan Kara wrote:
> On Thu 06-04-17 15:29:44, Ross Zwisler wrote:
> > While running generic/340 in my test setup I hit the following race. It can
> > happen with kernels that support FS DAX PMDs, so v4.10 thru v4.11-rc5.
> > 
> > Thread 1                                 Thread 2
> >
> > dax_iomap_pmd_fault()
> >   grab_mapping_entry()
> >     spin_lock_irq()
> >     get_unlocked_mapping_entry()
> >     'entry' is NULL, can't call lock_slot()
> >     spin_unlock_irq()
> >     radix_tree_preload()
> >                                          dax_iomap_pmd_fault()
> >                                            grab_mapping_entry()
> >                                              spin_lock_irq()
> >                                              get_unlocked_mapping_entry()
> >                                              ...
> >                                              lock_slot()
> >                                              spin_unlock_irq()
> >                                            dax_pmd_insert_mapping()
> >
> >     spin_lock_irq()
> >     __radix_tree_insert() fails with -EEXIST
> >       <... when inserting a 4k entry where a PMD exists>
> > 
> > The issue is that we have to drop mapping->tree_lock while calling
> > radix_tree_preload(), but since we didn't have a radix tree entry to lock
> > (unlike in the pmd_downgrade case) we have no protection against Thread 2
> > coming along and inserting a PMD at the same index.  For 4k entries we
> > handled this with a special-case response to -EEXIST coming from the
> > __radix_tree_insert(), but this doesn't save us for PMDs because the
> > -EEXIST case can also mean that we collided with a 4k entry in the radix
> > tree at a different index, but one that is covered by our PMD range.
> > 
> > So, correctly handle both the 4k and 2M collision cases by explicitly
> > re-checking the radix tree for an entry at our index once we reacquire
> > mapping->tree_lock.
> > 
> > This patch has made it through a clean xfstests run with the current
> > v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
> > loop.  It used to fail within the first 10 iterations.
> > 
> > Signed-off-by: Ross Zwisler 
> > Cc: [4.10+]
> 
> The patch looks good to me (and I can see Andrew already sent it to Linus),
> I'm just worndering where did things actually go wrong? I'd expect we would
> return VM_FAULT_FALLBACK from dax_iomap_pmd_fault() and then do PTE fault
> for the address which should just work out fine...

Yep, that's what I thought as well, and I think it does work for processes
which have separate page tables.  The second process will do a 4k fault (just
as it would have if it had a VMA that was smaller than 2MiB, for example), map
the 4k page into its page table and just dirty the 2MiB DAX entry in the radix
tree.  I've tested this case manually in the past.

I think the error case that I was seeing was for threads that share page
tables.  In that case the 2nd thread falls back to PTEs, but there is already
a PMD in the page table from the first fault.  When we try and insert a PTE
over the PMD we get the following BUG:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: do_raw_spin_trylock+0x5/0x40
PGD 8d6ee0067
PUD 8db6e8067
PMD 0

Oops:  [#1] PREEMPT SMP
Modules linked in: dax_pmem nd_pmem dax nd_btt nd_e820 libnvdimm [last 
unloaded: scsi_debug]
CPU: 2 PID: 25323 Comm: holetest Not tainted 4.11.0-rc4 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 
04/01/2014
task: 880095492a00 task.stack: c90014048000
RIP: 0010:do_raw_spin_trylock+0x5/0x40
RSP: :c9001404bb60 EFLAGS: 00010296
RAX: 880095492a00 RBX: 0018 RCX: 
RDX:  RSI:  RDI: 
RBP: c9001404bb80 R08: 0001 R09: 
R10: 880095492a00 R11:  R12: 
R13: 8808d5fe4220 R14: 88004c3e3c80 R15: 8025
FS:  7f7ed7dff700() GS:8808de40() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 0008d86f6000 CR4: 001406e0
Call Trace:
 ? _raw_spin_lock+0x49/0x80
 ? __get_locked_pte+0x16b/0x1d0
 __get_locked_pte+0x16b/0x1d0
 insert_pfn.isra.68+0x3a/0x100
 vm_insert_mixed+0x64/0x90
 dax_iomap_fault+0xa41/0x1680
 ext4_dax_huge_fault+0xa9/0xd0
 ext4_dax_fault+0x10/0x20
 __do_fault+0x20/0x130
 __handle_mm_fault+0x9b3/0x1190
 handle_mm_fault+0x169/0x370
 ? handle_mm_fault+0x47/0x370
 __do_page_fault+0x28f/0x590
 trace_do_page_fault+0x58/0x2c0
 do_async_page_fault+0x2c/0x90
 async_page_fault+0x28/0x30
RIP: 0033:0x4014b2
RSP: 002b:7f7ed7dfef20 EFLAGS: 00010216
RAX: 7f7ec6c00400 RBX: 0001 RCX: 01c0
RDX: 1c01 RSI:  RDI: 
RBP: 7f7ed7dff700 R08: 7f

Re: KASLR causes intermittent boot failures on some systems

2017-04-10 Thread Jeff Moyer
Kees Cook  writes:

> On Mon, Apr 10, 2017 at 11:22 AM, Jeff Moyer  wrote:
>> Kees Cook  writes:
>>
>>> On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer  wrote:
 Kees Cook  writes:

> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer  wrote:
>> Hi,
>>
>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>> regions") causes some of my systems with persistent memory (whether real
>> or emulated) to fail to boot with a couple of different crash
>> signatures.  The first signature is a NMI watchdog lockup of all but 1
>> cpu, which causes much difficulty in extracting useful information from
>> the console.  The second variant is an invalid paging request, listed
>> below.
>
> Just to rule out some of the stuff in the boot path, does booting with
> "nokaslr" solve this? (i.e. I want to figure out if this is from some
> of the rearrangements done that are exposed under that commit, or if
> it is genuinely the randomization that is killing the systems...)

 Adding "nokaslr" to the boot line does indeed make the problem go away.
>>>
>>> Are you booting with a memmap= flag?
>>
>> From my first email:
>>
>> [ 0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc5+
>> root=/dev/mapper/rhel_intel--lizardhead--04-root ro memmap=192G!1024G
>> crashkernel=auto rd.lvm.lv=rhel_intel-lizardhead-04/root
>> rd.lvm.lv=rhel_intel-lizardhead-04/swap console=ttyS0,115200n81
>> LANG=en_US.UTF-8
>>
>> Did you not receive the attachments?
>
> I see it now, thanks!
>
> The memmap parsing was added in -rc1 (f28442497b5ca), so I'd expect
> that to be handled. Hmmm.

I can also reproduce this on a system with real persistent memory, which
does not require the memmap parameter.

Cheers,
Jeff


Re: KASLR causes intermittent boot failures on some systems

2017-04-10 Thread Kees Cook
On Mon, Apr 10, 2017 at 11:22 AM, Jeff Moyer  wrote:
> Kees Cook  writes:
>
>> On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer  wrote:
>>> Kees Cook  writes:
>>>
 On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer  wrote:
> Hi,
>
> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
> regions") causes some of my systems with persistent memory (whether real
> or emulated) to fail to boot with a couple of different crash
> signatures.  The first signature is a NMI watchdog lockup of all but 1
> cpu, which causes much difficulty in extracting useful information from
> the console.  The second variant is an invalid paging request, listed
> below.

 Just to rule out some of the stuff in the boot path, does booting with
 "nokaslr" solve this? (i.e. I want to figure out if this is from some
 of the rearrangements done that are exposed under that commit, or if
 it is genuinely the randomization that is killing the systems...)
>>>
>>> Adding "nokaslr" to the boot line does indeed make the problem go away.
>>
>> Are you booting with a memmap= flag?
>
> From my first email:
>
> [ 0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc5+
> root=/dev/mapper/rhel_intel--lizardhead--04-root ro memmap=192G!1024G
> crashkernel=auto rd.lvm.lv=rhel_intel-lizardhead-04/root
> rd.lvm.lv=rhel_intel-lizardhead-04/swap console=ttyS0,115200n81
> LANG=en_US.UTF-8
>
> Did you not receive the attachments?

I see it now, thanks!

The memmap parsing was added in -rc1 (f28442497b5ca), so I'd expect
that to be handled. Hmmm.

-Kees

-- 
Kees Cook
Pixel Security


RE: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-10 Thread Kani, Toshimitsu
> Subject: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-
> bypass assumptions
> 
> Before we rework the "pmem api" to stop abusing __copy_user_nocache()
> for memcpy_to_pmem() we need to fix cases where we may strand dirty data
> in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
> for arbitrary data transfers from userspace. There is no guarantee that
> these transfers, performed by dax_iomap_actor(), will have aligned
> destinations or aligned transfer lengths. Backstop the usage
> __copy_user_nocache() with explicit cache management in these unaligned
> cases.
> 
> Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
> that is saved for a later patch that moves the entirety of the "pmem
> api" into the pmem driver directly.
 :
> ---
> v2: Change the condition for flushing the last cacheline of the
> destination from 8-byte to 4-byte misalignment (Toshi)
 :
>  arch/x86/include/asm/pmem.h |   41 ++
 :
> @@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void
> *addr, size_t bytes,
>   /* TODO: skip the write-back by always using non-temporal stores */
>   len = copy_from_iter_nocache(addr, bytes, i);
> 
> - if (__iter_needs_pmem_wb(i))
> + /*
> +  * In the iovec case on x86_64 copy_from_iter_nocache() uses
> +  * non-temporal stores for the bulk of the transfer, but we need
> +  * to manually flush if the transfer is unaligned. In the
> +  * non-iovec case the entire destination needs to be flushed.
> +  */
> + if (iter_is_iovec(i)) {
> + unsigned long dest = (unsigned long) addr;
> +
> + /*
> +  * If the destination is not 8-byte aligned then
> +  * __copy_user_nocache (on x86_64) uses cached copies
> +  */
> + if (dest & 8) {
> + arch_wb_cache_pmem(addr, 1);
> + dest = ALIGN(dest, 8);
> + }
> +
> + /*
> +  * If the remaining transfer length, after accounting
> +  * for destination alignment, is not 4-byte aligned
> +  * then __copy_user_nocache() falls back to cached
> +  * copies for the trailing bytes in the final cacheline
> +  * of the transfer.
> +  */
> + if ((bytes - (dest - (unsigned long) addr)) & 4)
> + arch_wb_cache_pmem(addr + bytes - 1, 1);
> + } else
>   arch_wb_cache_pmem(addr, bytes);
> 
>   return len;

Thanks for the update.  I think the alignment check should be based on
the following note in copy_user_nocache.

 * Note: Cached memory copy is used when destination or size is not
 * naturally aligned. That is:
 *  - Require 8-byte alignment when size is 8 bytes or larger.
 *  - Require 4-byte alignment when size is 4 bytes.

So, I think the code may be something like this.  I also made the following 
changes:
 - Mask with 7, not 8.
 - ALIGN with cacheline size, instead of 8.
 - Add (bytes > flushed) test since calculation with unsigned long still
   results in a negative value (as a positive value).

if (bytes < 8) {
        if ((dest & 3) || (bytes != 4))
                arch_wb_cache_pmem(addr, 1);
} else {
        if (dest & 7) {
                dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
                arch_wb_cache_pmem(addr, 1);
        }

        flushed = dest - (unsigned long) addr;
        if ((bytes > flushed) && ((bytes - flushed) & 7))
                arch_wb_cache_pmem(addr + bytes - 1, 1);
}

Thanks,
-Toshi


Re: KASLR causes intermittent boot failures on some systems

2017-04-10 Thread Jeff Moyer
Kees Cook  writes:

> On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer  wrote:
>> Kees Cook  writes:
>>
>>> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer  wrote:
 Hi,

 commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
 regions") causes some of my systems with persistent memory (whether real
 or emulated) to fail to boot with a couple of different crash
 signatures.  The first signature is a NMI watchdog lockup of all but 1
 cpu, which causes much difficulty in extracting useful information from
 the console.  The second variant is an invalid paging request, listed
 below.
>>>
>>> Just to rule out some of the stuff in the boot path, does booting with
>>> "nokaslr" solve this? (i.e. I want to figure out if this is from some
>>> of the rearrangements done that are exposed under that commit, or if
>>> it is genuinely the randomization that is killing the systems...)
>>
>> Adding "nokaslr" to the boot line does indeed make the problem go away.
>
> Are you booting with a memmap= flag?

From my first email:

[ 0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc5+
root=/dev/mapper/rhel_intel--lizardhead--04-root ro memmap=192G!1024G
crashkernel=auto rd.lvm.lv=rhel_intel-lizardhead-04/root
rd.lvm.lv=rhel_intel-lizardhead-04/swap console=ttyS0,115200n81
LANG=en_US.UTF-8

Did you not receive the attachments?

Cheers,
Jeff


Re: KASLR causes intermittent boot failures on some systems

2017-04-10 Thread Kees Cook
On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer  wrote:
> Kees Cook  writes:
>
>> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer  wrote:
>>> Hi,
>>>
>>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>>> regions") causes some of my systems with persistent memory (whether real
>>> or emulated) to fail to boot with a couple of different crash
>>> signatures.  The first signature is a NMI watchdog lockup of all but 1
>>> cpu, which causes much difficulty in extracting useful information from
>>> the console.  The second variant is an invalid paging request, listed
>>> below.
>>
>> Just to rule out some of the stuff in the boot path, does booting with
>> "nokaslr" solve this? (i.e. I want to figure out if this is from some
>> of the rearrangements done that are exposed under that commit, or if
>> it is genuinely the randomization that is killing the systems...)
>
> Adding "nokaslr" to the boot line does indeed make the problem go away.

Are you booting with a memmap= flag?

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v5 3/4] libnvdimm: add support for clear poison list and badblocks for device dax

2017-04-10 Thread Dave Jiang


On 04/07/2017 05:46 PM, Dan Williams wrote:
> On Fri, Apr 7, 2017 at 5:31 PM, Dan Williams  wrote:
>> On Fri, Apr 7, 2017 at 3:07 PM, Dave Jiang  wrote:
>>> Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR
>>> call. We will update the poison list and also the badblocks at region level
>>> if the region is in dax mode or in pmem mode and not active. In other
>>> words we force badblocks to be cleared through write requests if the
>>> address is currently accessed through a block device, otherwise it can
>>> only be done via the ioctl+dsm path.
>>>
>>> Signed-off-by: Dave Jiang 
>>> Reviewed-by: Johannes Thumshirn 
>>> ---
>>>  drivers/nvdimm/bus.c  |   78 
>>> +
>>>  drivers/nvdimm/core.c |   17 --
>>>  drivers/nvdimm/region.c   |   25 ++
>>>  include/linux/libnvdimm.h |7 +++-
>>>  4 files changed, 115 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
>>> index 351bac8..af64e9c 100644
>>> --- a/drivers/nvdimm/bus.c
>>> +++ b/drivers/nvdimm/bus.c
>>> @@ -27,6 +27,7 @@
>>>  #include 
>>>  #include "nd-core.h"
>>>  #include "nd.h"
>>> +#include "pfn.h"
>>>
>>>  int nvdimm_major;
>>>  static int nvdimm_bus_major;
>>> @@ -218,11 +219,19 @@ long nvdimm_clear_poison(struct device *dev, 
>>> phys_addr_t phys,
>>> if (cmd_rc < 0)
>>> return cmd_rc;
>>>
>>> -   nvdimm_clear_from_poison_list(nvdimm_bus, phys, len);
>>> return clear_err.cleared;
>>
>> This seems like a typo. We want to clear poison in the pmem write path.
> 
> Changing this into an nvdimm_forget_poison() call works (passes your
> daxdev-errors.sh test), and that's what I've pushed out to my pending
> branch to let 0day beat up on it a bit.
> 

Thanks. For some reason I was thinking __nvdimm_forget_poison() is being
called through ->ndctl() and so we didn't need it.


Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-10 Thread Logan Gunthorpe


On 10/04/17 02:29 AM, Sagi Grimberg wrote:
> What you are saying is surprising to me. The switch needs to preserve
> ordering across different switch ports ??

> You are suggesting that there is a *switch-wide* state that tracks
> MemRds never pass MemWrs across all the switch ports? That is a very
> non-trivial statement...

Yes, it is a requirement of the PCIe spec for transactions to be
strongly ordered throughout the fabric so switches must have an internal
state across all ports. Without that, it would be impossible to have PCI
cards work together even if they are using system memory to do so. Also,
I believe, it was done this way to maintain maximum compatibility with
the legacy PCI bus. There is also a relaxed ordering bit that allows
specific transactions to ignore ordering which can help performance.

Obviously this becomes impossible if you have some kind of complex
multi-path fabric.

Logan


Re: KASLR causes intermittent boot failures on some systems

2017-04-10 Thread Jeff Moyer
Baoquan He  writes:

> On 04/07/17 at 10:41am, Jeff Moyer wrote:
>> Hi,
>> 
>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>> regions") causes some of my systems with persistent memory (whether real
>> or emulated) to fail to boot with a couple of different crash
>> signatures.  The first signature is a NMI watchdog lockup of all but 1
>> cpu, which causes much difficulty in extracting useful information from
>> the console.  The second variant is an invalid paging request, listed
>> below.
>> 
>> On some systems, I haven't hit this problem at all.  Other systems
>> experience a failed boot maybe 20-30% of the time.  To reproduce it,
>> configure some emulated pmem on your system.  You can find directions
>> for that here: https://nvdimm.wiki.kernel.org/
>
> I got below problem when configure ndctl, didn't find a package named
> libkmod:

Sorry, you can skip the ndctl step altogether.  This is emulated pmem,
so the setting won't stick anyway.

-Jeff
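
For anyone trying to reproduce this, the emulated-pmem setup referenced above (see https://nvdimm.wiki.kernel.org/) comes down to a memmap= boot parameter; the sizes below are only an example, not the values from Jeff's machines:

# Added to the kernel command line: reserve 4G of RAM starting at the
# 12G physical offset as emulated persistent memory (e820 type 12),
# which the pmem driver then exposes as a /dev/pmemN block device.
memmap=4G!12G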


Re: KASLR causes intermittent boot failures on some systems

2017-04-10 Thread Jeff Moyer
Kees Cook  writes:

> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer  wrote:
>> Hi,
>>
>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>> regions") causes some of my systems with persistent memory (whether real
>> or emulated) to fail to boot with a couple of different crash
>> signatures.  The first signature is a NMI watchdog lockup of all but 1
>> cpu, which causes much difficulty in extracting useful information from
>> the console.  The second variant is an invalid paging request, listed
>> below.
>
> Just to rule out some of the stuff in the boot path, does booting with
> "nokaslr" solve this? (i.e. I want to figure out if this is from some
> of the rearrangements done that are exposed under that commit, or if
> it is genuinely the randomization that is killing the systems...)

Adding "nokaslr" to the boot line does indeed make the problem go away.

Cheers,
Jeff


Re: [PATCH] dax: fix radix tree insertion race

2017-04-10 Thread Jan Kara
On Thu 06-04-17 15:29:44, Ross Zwisler wrote:
> While running generic/340 in my test setup I hit the following race. It can
> happen with kernels that support FS DAX PMDs, so v4.10 thru v4.11-rc5.
> 
> Thread 1                                 Thread 2
>
> dax_iomap_pmd_fault()
>   grab_mapping_entry()
>     spin_lock_irq()
>     get_unlocked_mapping_entry()
>     'entry' is NULL, can't call lock_slot()
>     spin_unlock_irq()
>     radix_tree_preload()
>                                          dax_iomap_pmd_fault()
>                                            grab_mapping_entry()
>                                              spin_lock_irq()
>                                              get_unlocked_mapping_entry()
>                                              ...
>                                              lock_slot()
>                                              spin_unlock_irq()
>                                            dax_pmd_insert_mapping()
>
>     spin_lock_irq()
>     __radix_tree_insert() fails with -EEXIST
>       <... when inserting a 4k entry where a PMD exists>
> 
> The issue is that we have to drop mapping->tree_lock while calling
> radix_tree_preload(), but since we didn't have a radix tree entry to lock
> (unlike in the pmd_downgrade case) we have no protection against Thread 2
> coming along and inserting a PMD at the same index.  For 4k entries we
> handled this with a special-case response to -EEXIST coming from the
> __radix_tree_insert(), but this doesn't save us for PMDs because the
> -EEXIST case can also mean that we collided with a 4k entry in the radix
> tree at a different index, but one that is covered by our PMD range.
> 
> So, correctly handle both the 4k and 2M collision cases by explicitly
> re-checking the radix tree for an entry at our index once we reacquire
> mapping->tree_lock.
> 
> This patch has made it through a clean xfstests run with the current
> v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
> loop.  It used to fail within the first 10 iterations.
> 
> Signed-off-by: Ross Zwisler 
> Cc: [4.10+]

The patch looks good to me (and I can see Andrew already sent it to Linus),
I'm just worndering where did things actually go wrong? I'd expect we would
return VM_FAULT_FALLBACK from dax_iomap_pmd_fault() and then do PTE fault
for the address which should just work out fine...

Honza

> ---
>  fs/dax.c | 35 ++-
>  1 file changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index de622d4..85abd74 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -373,6 +373,22 @@ static void *grab_mapping_entry(struct address_space 
> *mapping, pgoff_t index,
>   }
>   spin_lock_irq(&mapping->tree_lock);
>  
> + if (!entry) {
> + /*
> +  * We needed to drop the page_tree lock while calling
> +  * radix_tree_preload() and we didn't have an entry to
> +  * lock.  See if another thread inserted an entry at
> +  * our index during this time.
> +  */
> + entry = __radix_tree_lookup(&mapping->page_tree, index,
> + NULL, &slot);
> + if (entry) {
> + radix_tree_preload_end();
> + spin_unlock_irq(&mapping->tree_lock);
> + goto restart;
> + }
> + }
> +
>   if (pmd_downgrade) {
>   radix_tree_delete(&mapping->page_tree, index);
>   mapping->nrexceptional--;
> @@ -388,19 +404,12 @@ static void *grab_mapping_entry(struct address_space 
> *mapping, pgoff_t index,
>   if (err) {
>   spin_unlock_irq(&mapping->tree_lock);
>   /*
> -  * Someone already created the entry?  This is a
> -  * normal failure when inserting PMDs in a range
> -  * that already contains PTEs.  In that case we want
> -  * to return -EEXIST immediately.
> -  */
> - if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
> - goto restart;
> - /*
> -  * Our insertion of a DAX PMD entry failed, most
> -  * likely because it collided with a PTE sized entry
> -  * at a different index in the PMD range.  We haven't
> -  * inserted anything into the radix tree and have no
> -  * waiters to wake.
> +  * Our insertion of a DAX entry failed, most likely
> +  * because we were insert



Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-10 Thread Sagi Grimberg



Sagi

As long as legA, legB and the RC are all connected to the same switch then 
ordering will be preserved (I think many other topologies also work). Here is 
how it would work for the problem case you are concerned about (which is a read 
from the NVMe drive).

1. Disk device DMAs out the data to the p2pmem device via a string of PCIe 
MemWr TLPs.
2. Disk device writes to the completion queue (in system memory) via a MemWr 
TLP.
3. The last of the MemWrs from step 1 might have got stalled in the PCIe switch 
due to congestion but if so they are stalled in the egress path of the switch 
for the p2pmem port.
4. The RC determines the IO is complete when the TLP associated with step 2 
updates the memory associated with the CQ. It issues some operation to read the 
p2pmem.
5. Regardless of whether the MemRd TLP comes from the RC or another device 
connected to the switch it is queued in the egress queue for the p2pmem FIO 
behind the last DMA TLP (from step 1).
PCIe ordering ensures that this MemRd cannot overtake the MemWr (Reads can 
never pass writes).
Therefore the MemRd can never get to the p2pmem device until after the last DMA 
MemWr has.


What you are saying is surprising to me. The switch needs to preserve
ordering across different switch ports ??

You are suggesting that there is a *switch-wide* state that tracks
MemRds never pass MemWrs across all the switch ports? That is a very
non-trivial statement...