Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-02 Thread Dan Williams
On Wed, May 2, 2018 at 9:19 AM, Andy Lutomirski  wrote:
> On Tue, May 1, 2018 at 8:34 PM Linus Torvalds
> 
> wrote:
>
>> On Tue, May 1, 2018 at 8:22 PM Dan Williams 
>> wrote:
>
>> > All that to say that having a typical RAM page covering poisoned pmem
>> > would complicate the 'clear badblocks' implementation.
>
>> Ugh, ok.
>
>> I guess the good news is that your patches aren't so big, and don't really
>> affect anything else.
>
>
> I pondered this a bit.  Doing better might be a big pain in the arse.  The
> interesting case is where ordinary kernel code (memcpy, plain old memory
> operands, etc) access faulty pmem.  This means that there's no extable
> entry around.  If we actually try to recover, we have a few problems:
>
>   - We can't sanely skip the instruction without causing random errors.
>
>   - If the access was through the kernel direct map, then we could plausibly
> remap a different page in place of the faulty page.  The problem is that,
> if the page is *writable* and we share it between more than one faulty
> page, then we're enabling a giant information leak.  But we still need to
> figure out how we're supposed to invalidate the old mapping from a random,
> potentially atomic context.
>
>   - If the access is through kmap or similar, then we're talking about
> modifying a PTE out from under kernel code that really isn't expecting us
> to modify it.
>
>   - How are we supposed to signal the process or fail a syscall?  The fault
> could have come from interrupt context, softirq context, kernel thread
> context, etc, and figuring out who's to blame seems quite awkward and
> fragile.
>
> All that being said, I suspect that we still have issues even with accesses
> to user VAs that are protected by extable entries.  The whole #MC mechanism
> is a supremely shitty interface for recoverable errors (especially on
> Intel), and I'm a bit scared of what happens if the offending access is,
> say, inside a perf NMI.
>
> Dan, is there any chance you could put some pressure on the architecture
> folks to invent an entirely new, less shitty way to tell the OS about
> recoverable memory errors?  And to make it testable by normal people?
> Needing big metal EINJ hardware to test the house of cards that is #MC is
> just awful and means that there are few enough kernel developers that are
> actually able to test that I can probably count them on one hand.  And I'm
> not one of them...

I feel this testing pain too. The EINJ facility is not ubiquitous
which is why I punted and wrote patch 6 to unit test this. You're
right that does not scale for all the potential places we'd like to be
able to safely handle memory errors in the kernel.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-02 Thread Andy Lutomirski
On Tue, May 1, 2018 at 8:34 PM Linus Torvalds

wrote:

> On Tue, May 1, 2018 at 8:22 PM Dan Williams 
> wrote:

> > All that to say that having a typical RAM page covering poisoned pmem
> > would complicate the 'clear badblocks' implementation.

> Ugh, ok.

> I guess the good news is that your patches aren't so big, and don't really
> affect anything else.


I pondered this a bit.  Doing better might be a big pain in the arse.  The
interesting case is where ordinary kernel code (memcpy, plain old memory
operands, etc) access faulty pmem.  This means that there's no extable
entry around.  If we actually try to recover, we have a few problems:

  - We can't sanely skip the instruction without causing random errors.

  - If the access was through the kernel direct map, then we could plausibly
remap a different page in place of the faulty page.  The problem is that,
if the page is *writable* and we share it between more than one faulty
page, then we're enabling a giant information leak.  But we still need to
figure out how we're supposed to invalidate the old mapping from a random,
potentially atomic context.

  - If the access is through kmap or similar, then we're talking about
modifying a PTE out from under kernel code that really isn't expecting us
to modify it.

  - How are we supposed to signal the process or fail a syscall?  The fault
could have come from interrupt context, softirq context, kernel thread
context, etc, and figuring out who's to blame seems quite awkward and
fragile.

All that being said, I suspect that we still have issues even with accesses
to user VAs that are protected by extable entries.  The whole #MC mechanism
is a supremely shitty interface for recoverable errors (especially on
Intel), and I'm a bit scared of what happens if the offending access is,
say, inside a perf NMI.

Dan, is there any chance you could put some pressure on the architecture
folks to invent an entirely new, less shitty way to tell the OS about
recoverable memory errors?  And to make it testable by normal people?
Needing big metal EINJ hardware to test the house of cards that is #MC is
just awful and means that there are few enough kernel developers that are
actually able to test that I can probably count them on one hand.  And I'm
not one of them...
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-02 Thread Borislav Petkov
On Tue, May 01, 2018 at 07:25:57PM -0700, Dan Williams wrote:
> Right, but the only way to make MCE non-fatal is to teach the machine
> check handler about recoverable conditions. This patch teaches the
> machine check handler how to recover copy_to_iter() errors.

Yeah, about that: maybe we talked about this at the time but does the
actual MCE signature state the error was caused by a read from an nvdimm
range?

Because if so, we could lower the severity of the error when we look at
it in the #MC handler and do some more graceful handling later...

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Dan Williams
On Tue, May 1, 2018 at 9:14 PM, Linus Torvalds
 wrote:
> On Tue, May 1, 2018 at 9:00 PM Dan Williams 
> wrote:
>> >
>> > I  have some dim memory of "rep movs doesn't work well for pmem", but
> does
>> > it *seriously* need unrolling to cacheline boundaries? And if it does,
> who
>> > designed it, and why is anybody using it?
>> >
>
>> I think this is an FAQ from the original submission, in fact some guy
>> named "Linus Torvalds" asked [1]:
>
> Oh, I already mentioned that  I remembered that "rep movs" didn't work well.
>
> But there's a big gap between "just use 'rep movs' and 'do some cacheline
> unrollong'".
>
> Why isn't it just doing a simple word-at-a-time loop and letting the CPU do
> the unrolling that it will already do on its own?
>
> I may have gotten that answered too, but there's no comment in the code
> about why it's such a disgusting mess, so I've long since forgotten _why_
> it's such a disgusting mess.
>
> That loop unrolling _used_ to be "hey, it's simple".
>
> Now it's "Hey, that's truly disgusting", with the separate fault handling
> for every single case in the unrolled loop.
>
> Just look at the nasty _ASM_EXTABLE_FAULT() uses and those E_cache_x error
> labels, and getting the number rof bytes copied right.
>
> And then ask yourself "what if we didn't unroll that thing 8 times, AND WE
> COULD GET RID OF ALL OF THOSE?"
>
> Maybe you already did ask yourself.  But I'm asking because it sure isn't
> explained in the code.

Ah, sorry. Yeah, I don't see a good reason to keep the unrolling. It
would definitely clean up the fault handling, I'll respin.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Linus Torvalds
On Tue, May 1, 2018 at 9:00 PM Dan Williams 
wrote:
> >
> > I  have some dim memory of "rep movs doesn't work well for pmem", but
does
> > it *seriously* need unrolling to cacheline boundaries? And if it does,
who
> > designed it, and why is anybody using it?
> >

> I think this is an FAQ from the original submission, in fact some guy
> named "Linus Torvalds" asked [1]:

Oh, I already mentioned that  I remembered that "rep movs" didn't work well.

But there's a big gap between "just use 'rep movs' and 'do some cacheline
unrollong'".

Why isn't it just doing a simple word-at-a-time loop and letting the CPU do
the unrolling that it will already do on its own?

I may have gotten that answered too, but there's no comment in the code
about why it's such a disgusting mess, so I've long since forgotten _why_
it's such a disgusting mess.

That loop unrolling _used_ to be "hey, it's simple".

Now it's "Hey, that's truly disgusting", with the separate fault handling
for every single case in the unrolled loop.

Just look at the nasty _ASM_EXTABLE_FAULT() uses and those E_cache_x error
labels, and getting the number rof bytes copied right.

And then ask yourself "what if we didn't unroll that thing 8 times, AND WE
COULD GET RID OF ALL OF THOSE?"

Maybe you already did ask yourself.  But I'm asking because it sure isn't
explained in the code.

 Linus
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Dan Williams
On Tue, May 1, 2018 at 8:33 PM, Linus Torvalds
 wrote:
> On Tue, May 1, 2018 at 8:22 PM Dan Williams 
> wrote:
>
>> All that to say that having a typical RAM page covering poisoned pmem
>> would complicate the 'clear badblocks' implementation.
>
> Ugh, ok.
>
> I guess the good news is that your patches aren't so big, and don't really
> affect anything else.
>
> But can we at least take this to be the impetus for just getting rid of
> that disgusting unrolled memcpy? Ablout half of the lines in the patch set
> comes from that thing.
>
> Is anybody seriously going to use pmem with some in-order chip that can't
> even get something as simple as a memory copy loop right? "git blame"
> fingers Tony Luck, I think he may have been influenced by the fumes from
> Itanium.
>
> I  have some dim memory of "rep movs doesn't work well for pmem", but does
> it *seriously* need unrolling to cacheline boundaries? And if it does, who
> designed it, and why is anybody using it?
>

I think this is an FAQ from the original submission, in fact some guy
named "Linus Torvalds" asked [1]:

---

>  - why does this use the complex - and slower, on modern machines -
> unrolled manual memory copy, when you might as well just use a single
>
>  rep ; movsb
>
> which not only makes it smaller, but makes the exception fixup trivial.

Because current generation cpus don't give a recoverable machine
check if we consume with a "rep ; movsb" :-(
When we have that we can pick the best copy function based
on the capabilities of the cpu we are running on.

---

[1]: https://lkml.org/lkml/2016/2/18/608
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Linus Torvalds
On Tue, May 1, 2018 at 8:22 PM Dan Williams 
wrote:

> All that to say that having a typical RAM page covering poisoned pmem
> would complicate the 'clear badblocks' implementation.

Ugh, ok.

I guess the good news is that your patches aren't so big, and don't really
affect anything else.

But can we at least take this to be the impetus for just getting rid of
that disgusting unrolled memcpy? Ablout half of the lines in the patch set
comes from that thing.

Is anybody seriously going to use pmem with some in-order chip that can't
even get something as simple as a memory copy loop right? "git blame"
fingers Tony Luck, I think he may have been influenced by the fumes from
Itanium.

I  have some dim memory of "rep movs doesn't work well for pmem", but does
it *seriously* need unrolling to cacheline boundaries? And if it does, who
designed it, and why is anybody using it?

  Linus
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Dan Williams
On Tue, May 1, 2018 at 8:20 PM, Dan Williams  wrote:
> On Tue, May 1, 2018 at 8:13 PM, Linus Torvalds
>  wrote:
>> On Tue, May 1, 2018 at 8:03 PM Dan Williams 
>> wrote:
>>
>>> Because dax. There's no page cache indirection games we can play here
>>> to poison a page and map in another page. The mapped page is 1:1
>>> associated with the filesystem block and physical memory address.
>>
>> I'm not talking page cache indirection.
>>
>> I'm talking literally mapping a different page into the kernel virtual
>> address space that the failing read was done for.
>>
>> But you seem to be right that we don't actually support that. I'm guessing
>> the hwpoison code has never had to run in that kind of situation and will
>> just give up.
>>
>> That would seem to be sad. It really feels like the obvious solution to any
>> MCE's - just map a dummy page at the address that causes problems.
>>
>> That can have bad effects for real memory (because who knows what internal
>> kernel data structure might be in there), but would seem to be the
>> _optimal_ solution for some  random pmem access. And makes it absolutely
>> trivial to just return to the execution that got  the error exception.
>
> The other property of pmem that we need to contend with that makes it
> a snowflake relative to typical memory is that errors can be repaired
> by sending a slow-path command to the DIMM device. We trap block-layer
> writes in the pmem driver that target known 'badblocks' and send the
> sideband command to clear the error along with the new data.

All that to say that having a typical RAM page covering poisoned pmem
would complicate the 'clear badblocks' implementation.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Dan Williams
On Tue, May 1, 2018 at 8:13 PM, Linus Torvalds
 wrote:
> On Tue, May 1, 2018 at 8:03 PM Dan Williams 
> wrote:
>
>> Because dax. There's no page cache indirection games we can play here
>> to poison a page and map in another page. The mapped page is 1:1
>> associated with the filesystem block and physical memory address.
>
> I'm not talking page cache indirection.
>
> I'm talking literally mapping a different page into the kernel virtual
> address space that the failing read was done for.
>
> But you seem to be right that we don't actually support that. I'm guessing
> the hwpoison code has never had to run in that kind of situation and will
> just give up.
>
> That would seem to be sad. It really feels like the obvious solution to any
> MCE's - just map a dummy page at the address that causes problems.
>
> That can have bad effects for real memory (because who knows what internal
> kernel data structure might be in there), but would seem to be the
> _optimal_ solution for some  random pmem access. And makes it absolutely
> trivial to just return to the execution that got  the error exception.

The other property of pmem that we need to contend with that makes it
a snowflake relative to typical memory is that errors can be repaired
by sending a slow-path command to the DIMM device. We trap block-layer
writes in the pmem driver that target known 'badblocks' and send the
sideband command to clear the error along with the new data.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Linus Torvalds
On Tue, May 1, 2018 at 8:03 PM Dan Williams 
wrote:

> Because dax. There's no page cache indirection games we can play here
> to poison a page and map in another page. The mapped page is 1:1
> associated with the filesystem block and physical memory address.

I'm not talking page cache indirection.

I'm talking literally mapping a different page into the kernel virtual
address space that the failing read was done for.

But you seem to be right that we don't actually support that. I'm guessing
the hwpoison code has never had to run in that kind of situation and will
just give up.

That would seem to be sad. It really feels like the obvious solution to any
MCE's - just map a dummy page at the address that causes problems.

That can have bad effects for real memory (because who knows what internal
kernel data structure might be in there), but would seem to be the
_optimal_ solution for some  random pmem access. And makes it absolutely
trivial to just return to the execution that got  the error exception.

Linus
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Dan Williams
On Tue, May 1, 2018 at 7:53 PM, Linus Torvalds
 wrote:
> On Tue, May 1, 2018 at 7:26 PM Dan Williams 
> wrote:
>
>> Right, but the only way to make MCE non-fatal is to teach the machine
>> check handler about recoverable conditions. This patch teaches the
>> machine check handler how to recover copy_to_iter() errors.
>
> Why not just unmap the page and remap a new page  in its place? Meaning
> that it needs absolutely no special error handling in the callers.
>
> IOW, treat it *exactly*  like the whole page poisoning.
>
> We _have_ the technology. Why does this code think it's such a special
> snow-flake?

Because dax. There's no page cache indirection games we can play here
to poison a page and map in another page. The mapped page is 1:1
associated with the filesystem block and physical memory address.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Linus Torvalds
On Tue, May 1, 2018 at 7:26 PM Dan Williams 
wrote:

> Right, but the only way to make MCE non-fatal is to teach the machine
> check handler about recoverable conditions. This patch teaches the
> machine check handler how to recover copy_to_iter() errors.

Why not just unmap the page and remap a new page  in its place? Meaning
that it needs absolutely no special error handling in the callers.

IOW, treat it *exactly*  like the whole page poisoning.

We _have_ the technology. Why does this code think it's such a special
snow-flake?

  Linus
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Dan Williams
On Tue, May 1, 2018 at 5:09 PM, Linus Torvalds
 wrote:
> On Tue, May 1, 2018 at 4:03 PM Dan Williams 
> wrote:
>
>> I'm confused. Are you talking about getting rid of the block-layer
>> bypass or changing how MCS errors are handled?
>
> The latter.
>
>> If it's the latter, MCS error handling, I don't see how get
>> around something like copy_to_iter_mcsafe().
>
> So the basic issue is that since everybody wants mmap() to be at least an
> option (and preferably one of the _main_ options), I think that the whole
> "MCS errors are fatal" is fundamentally flawed.
>
> Which means that MCS errors can't be fatal.
>
> Which in turn means that the whole "special memcpy" seems very suspect.
>
> Can't we just do
>
>   - use a normal memcpy()
>
>   - basically set an "IO error flag" on MCE.
>
>   - for a user access the IO error flag potentially causes a SIGBUS as you
> mention, but even there it's not 100% clear that's necessarily possible or
> a good idea (I'm assuming that it can be damned hard to figure out _who_
> caused the problem if it was a cached write that causes an MCE much much
> later).

Writes don't trigger MCE. Only consumed poison / media errors trigger
MCE. I.e. even a read-modify-write operation to write-back a partially
dirty cacheline will not trigger an MCE because the read is not
consumed by the core only the cache. We'll get notified when that
happens, but only by CMCI interrupt not an MCE exception.

>   - for the kernel, the "IO error flag" can hopefully be then (again,
> assuming you can correlate the MCE with the right process) be turned into
> EIO.

This is precisely the current implementation / usage of
memcpy_mcsafe(). Reads go through the driver and the driver does the
right / simple thing to turn an MCE into EIO. I'd like to make this
the only model and kill the driver bypass in fs/dax.c so that the vfs
does not need to contend with these low level architecture details.

To be clear I'm not against dax specific optimization that does not go
through the block layer, but it should still be a driver call.

>> You mention mmap. Yes, we want the predominant access model to be
>> dax-mmap for Persistent Memory, but there's still the question about
>> what to do with media errors. To date we are trying to mirror the
>> error handling model for System Memory, i.e. SIGBUS to the process
>> that consumed the error. Is that error handling model also problematic
>> in your view?
>
> See above: if you can handle user space errors "gracefully" (ie with a
> SIGBUS, no crazy "system fatal (reboot)" garbage), then I really don't see
> why you can't do the same for the kernel accesses.
>
> IOW, why do we need that special "copy_to_iter_mcsafe()", when a normal
> "copy_to_iter()" should just work (and basically _has_ to work) anyway?
>
> Put another way: I think the whole basic premis of your patch is wrong,
> because (to quote your original patch descriptor), the fundamental starting
> point is garbage:
>
> The result of the bypass is that the kernel treats machine checks during
> read as system fatal (reboot) [..]
>
> See? If you are able to map that memory into user space, and recover, then
> why the whole crazy "system fatal" thing for kernel accesses?

Right, but the only way to make MCE non-fatal is to teach the machine
check handler about recoverable conditions. This patch teaches the
machine check handler how to recover copy_to_iter() errors.

We already have copy_from_iter_flushcache() that is used as a 'struct
dax_operations' op. I can do the same for this copy_to_iter() case so
at least it's up to the driver and not the vfs (fs/dax.c) to decide
how to handle this case.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Linus Torvalds
On Tue, May 1, 2018 at 4:03 PM Dan Williams 
wrote:

> I'm confused. Are you talking about getting rid of the block-layer
> bypass or changing how MCS errors are handled?

The latter.

> If it's the latter, MCS error handling, I don't see how get
> around something like copy_to_iter_mcsafe().

So the basic issue is that since everybody wants mmap() to be at least an
option (and preferably one of the _main_ options), I think that the whole
"MCS errors are fatal" is fundamentally flawed.

Which means that MCS errors can't be fatal.

Which in turn means that the whole "special memcpy" seems very suspect.

Can't we just do

  - use a normal memcpy()

  - basically set an "IO error flag" on MCE.

  - for a user access the IO error flag potentially causes a SIGBUS as you
mention, but even there it's not 100% clear that's necessarily possible or
a good idea (I'm assuming that it can be damned hard to figure out _who_
caused the problem if it was a cached write that causes an MCE much much
later).

  - for the kernel, the "IO error flag" can hopefully be then (again,
assuming you can correlate the MCE with the right process) be turned into
EIO.

> You mention mmap. Yes, we want the predominant access model to be
> dax-mmap for Persistent Memory, but there's still the question about
> what to do with media errors. To date we are trying to mirror the
> error handling model for System Memory, i.e. SIGBUS to the process
> that consumed the error. Is that error handling model also problematic
> in your view?

See above: if you can handle user space errors "gracefully" (ie with a
SIGBUS, no crazy "system fatal (reboot)" garbage), then I really don't see
why you can't do the same for the kernel accesses.

IOW, why do we need that special "copy_to_iter_mcsafe()", when a normal
"copy_to_iter()" should just work (and basically _has_ to work) anyway?

Put another way: I think the whole basic premis of your patch is wrong,
because (to quote your original patch descriptor), the fundamental starting
point is garbage:

The result of the bypass is that the kernel treats machine checks during
read as system fatal (reboot) [..]

See? If you are able to map that memory into user space, and recover, then
why the whole crazy "system fatal" thing for kernel accesses?

 Linus
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Dan Williams
On Tue, May 1, 2018 at 4:28 PM, Andy Lutomirski  wrote:
> On Tue, May 1, 2018 at 4:02 PM Dan Williams 
> wrote:
>
>> On Tue, May 1, 2018 at 2:05 PM, Linus Torvalds
>>  wrote:
>> > On Tue, May 1, 2018 at 1:55 PM Dan Williams 
>> > wrote:
>> >
>> >> The result of the bypass is that the kernel treats machine checks
> during
>> >> read as system fatal (reboot) when they could simply be flagged as an
>> >> I/O error, similar to performing reads through the pmem driver. Prevent
>> >> this fatal condition by deploying memcpy_mcsafe() in the fsdax read
>> >> path.
>> >
>> > How about just changing the rules, and go the old "Don't do that then"
> way?
>> >
>> > IOW, get rid of the whole idea that MCS errors should be fatal. It's
> wrong
>> > and pointless anyway.
>> >
>> > The while approach seems fundamentally buggered, if you ever want to
> mmap
>> > one of these things. And don't you want that?
>> >
>> > So why continue down a fundamentally broken path?
>
>> I'm confused. Are you talking about getting rid of the block-layer
>> bypass or changing how MCS errors are handled? If it's the former I've
>> gotten push back in the past trying to remove the bypass, but I feel
>> better about my chances to slay that beast wielding the +5 Hammer of
>> Linus. If it's the latter, MCS error handling, I don't see how get
>> around something like copy_to_iter_mcsafe().
>
>> You mention mmap. Yes, we want the predominant access model to be
>> dax-mmap for Persistent Memory, but there's still the question about
>> what to do with media errors. To date we are trying to mirror the
>> error handling model for System Memory, i.e. SIGBUS to the process
>> that consumed the error. Is that error handling model also problematic
>> in your view?
>
> I'm not sure exactly what you mean here, but my understanding of the status
> quo is that memory errors in user code are non-fatal but that memory errors
> in kernel code are fatal unless there's an appropriate extable entry.  The
> old iov_iter code assumes that memcpy() on kernel addresses can't fail.
> I'm not sure how else it could work.

Right, I'm trying to clarify the "IOW, get rid of the whole idea that
MCS errors should be fatal" comment. Especially as I am about to go
fix memory_failure() to understand that ZONE_DEVICE pages != typical
"struct page", and do the right thing with respect to un-mapping
userspace dax mapped pages.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Linus Torvalds
On Tue, May 1, 2018 at 1:55 PM Dan Williams 
wrote:

> The result of the bypass is that the kernel treats machine checks during
> read as system fatal (reboot) when they could simply be flagged as an
> I/O error, similar to performing reads through the pmem driver. Prevent
> this fatal condition by deploying memcpy_mcsafe() in the fsdax read
> path.

How about just changing the rules, and go the old "Don't do that then" way?

IOW, get rid of the whole idea that MCS errors should be fatal. It's wrong
and pointless anyway.

The while approach seems fundamentally buggered, if you ever want to mmap
one of these things. And don't you want that?

So why continue down a fundamentally broken path?

  Linus
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()

2018-05-01 Thread Dan Williams
Currently memcpy_mcsafe() is only deployed in the pmem driver when
reading through a /dev/pmemX block device. However, a filesystem in dax
mode mounted on a /dev/pmemX block device will bypass the block layer
and the driver for reads. The filesystem-dax (fsdax) read case uses
dax_direct_access() and copy_to_iter() to bypass the block layer.

The result of the bypass is that the kernel treats machine checks during
read as system fatal (reboot) when they could simply be flagged as an
I/O error, similar to performing reads through the pmem driver. Prevent
this fatal condition by deploying memcpy_mcsafe() in the fsdax read
path.

The main differences between this copy_to_user_mcsafe() and
copy_user_generic_unrolled() are:

* Typical tail/residue handling after a fault retries the copy
  byte-by-byte until the fault happens again. Re-triggering machine
  checks is potentially fatal so the implementation uses source alignment
  and poison alignment assumptions to limit the residue copying to known
  good bytes.

* SMAP coordination is handled external to the assembly with
  __uaccess_begin() and __uaccess_end().

* ITER_KVEC and ITER_BVEC can now end prematurely with an error.

The new MCSAFE_DEBUG facility is proposed as a way to unit test the
exception handling without requiring an ACPI EINJ capable platform.

Thanks to Tony Luck for his review, test, and implementation ideas on
initial versions of this patchset.

---

Dan Williams (6):
  x86, memcpy_mcsafe: update labels in support of write fault handling
  x86, memcpy_mcsafe: return bytes remaining
  x86, memcpy_mcsafe: add write-protection-fault handling
  x86, memcpy_mcsafe: define copy_to_iter_mcsafe()
  dax: use copy_to_iter_mcsafe() in dax_iomap_actor()
  x86, nfit_test: unit test for memcpy_mcsafe()


 arch/x86/Kconfig.debug  |3 +
 arch/x86/include/asm/mcsafe_debug.h |   50 ++
 arch/x86/include/asm/string_64.h|8 +-
 arch/x86/include/asm/uaccess_64.h   |   14 +++
 arch/x86/lib/memcpy_64.S|  178 ---
 arch/x86/lib/usercopy_64.c  |   12 ++
 drivers/nvdimm/claim.c  |3 -
 drivers/nvdimm/pmem.c   |6 +
 fs/dax.c|   20 ++--
 include/linux/string.h  |4 -
 include/linux/uio.h |   10 ++
 lib/iov_iter.c  |   59 
 tools/testing/nvdimm/test/nfit.c|   48 +
 13 files changed, 360 insertions(+), 55 deletions(-)
 create mode 100644 arch/x86/include/asm/mcsafe_debug.h
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm