Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-11 Thread Vivek Goyal
On Thu, Apr 30, 2020 at 06:21:45PM -0700, Dan Williams wrote:
> On Thu, Apr 30, 2020 at 5:10 PM Linus Torvalds
>  wrote:
> >
> > On Thu, Apr 30, 2020 at 4:52 PM Dan Williams  
> > wrote:
> > >
> > > You had me until here. Up to this point I was grokking that Andy's
> > > "_fallible" suggestion does help explain better than "_safe", because
> > > the copy is doing extra safety checks. copy_to_user() and
> > > copy_to_user_fallible() mean *something* where copy_to_user_safe()
> > > does not.
> >
> > It's a horrible word, btw. The word doesn't actually mean what Andy
> > means it to mean. "fallible" means "can make mistakes", not "can
> > fault".
> >
> > So "fallible" is a horrible name.
> >
> > But anyway, I don't hate something like "copy_to_user_fallible()"
> > conceptually. The naming needs to be fixed, in that "user" can always
> > take a fault, so it's the _source_ that can fault, not the "user"
> > part.
> >
> > It was the "copy_safe()" model that I find unacceptable, that uses
> > _one_ name for what is at the very least *four* different operations:
> >
> >  - copy from faulting memory to user
> >
> >  - copy from faulting memory to kernel
> >
> >  - copy from kernel to faulting memory
> >
> >  - copy within faulting memory
> >
> > No way can you do that with one single function. A kernel address and
> > a user address may literally have the exact same bit representation.
> > So the user vs kernel distinction _has_ to be in the name.
> >
> > The "kernel vs faulting" doesn't necessarily have to be there from an
> > implementation standpoint, but it *should* be there, because
> >
> >  - it might affect implemmentation
> >
> >  - but even if it DOESN'T affect implementation, it should be separate
> > just from the standpoint of being self-documenting code.
> >
> > > However you lose me on this "broken nvdimm semantics" contention.
> > > There is nothing nvdimm-hardware specific about the copy_safe()
> > > implementation, zero, nada, nothing new to the error model that DRAM
> > > did not also inflict on the Linux implementation.
> >
> > Ok, so good. Let's kill this all, and just use memcpy(), and copy_to_user().
> >
> > Just make sure that the nvdimm code doesn't use invalid kernel
> > addresses or other broken poisoning.
> >
> > Problem solved.
> >
> > You can't have it both ways. Either memcpy just works, or it doesn't.
> 
> It doesn't, but copy_to_user() is frustratingly close and you can see
> in the patch that I went ahead and used copy_user_generic() to
> implement the backend of the default "fast" implementation.
> 
> However now I see that copy_user_generic() works for the wrong reason.
> It works because the exception on the source address due to poison
> looks no different than a write fault on the user address to the
> caller, it's still just a short copy. So it makes copy_to_user() work
> for the wrong reason relative to the name.
> 
> How about, following your suggestion, introduce copy_mc_to_user() (can
> just use copy_user_generic() internally) and copy_mc_to_kernel() for
> the other the helpers that the copy_to_iter() implementation needs?
> That makes it clear that no mmu-faults are expected on reads, only
> exceptions, and no protection-faults are expected at all for
> copy_mc_to_kernel() even if it happens to accidentally handle it.
> Following Jann's ex_handler_uaccess() example I could arrange for
> copy_mc_to_kernel() to use a new _ASM_EXTABLE_MC() to validate that
> the only type of exception meant to be handled is MC and warn
> otherwise?

While we are discussing this, I wanted to mention another use case
I am looking at. That is using DAX for virtiofs. virtiofs device
exports a shared memory region which guest maps using DAX. virtiofs
driver dax ops ->copy_to_iter() and ->copy_from_iter() needs to now
copy contents from/to this shared memmory region to user space.

So far we are focussed only on nvdimm and expecting only a machine
check on read side, IIUC. But this virtual device will probably
need something more.

- A page can go missing on host (because file got truncated). So error
  can happen both in read and write path.

- It might not be a machine check to report this kind of error. KVM as
  of now considering using an interrupt to report errors or possibly
  using #VE to report errors.

IOW, tying these new helpers to only machine check will work well for
nvdimm use case but not for virtual devices like virtiofs and we will
end up defining more helpers. Something more generic then machine check
might be able to address both.

Thanks
Vivek



Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-04 Thread Dan Williams
On Mon, May 4, 2020 at 1:26 PM Andy Lutomirski  wrote:
>
> On Mon, May 4, 2020 at 1:05 PM Luck, Tony  wrote:
> >
> > > When a copy function hits a bad page and the page is not yet known to
> > > be bad, what does it do?  (I.e. the page was believed to be fine but
> > > the copy function gets #MC.)  Does it unmap it right away?  What does
> > > it return?
> >
> > I suspect that we will only ever find a handful of situations where the
> > kernel can recover from memory that has gone bad that are worth fixing
> > (got to be some code path that touches a meaningful fraction of memory,
> > otherwise we get code complexity without any meaningful payoff).
> >
> > I don't think we'd want different actions for the cases of "we just found 
> > out
> > now that this page is bad" and "we got a notification an hour ago that this
> > page had gone bad". Currently we treat those the same for application
> > errors ... SIGBUS either way[1].
>
> Oh, I agree that the end result should be the same.  I'm thinking more
> about the mechanism and the internal API.  As a somewhat silly example
> of why there's a difference, the first time we try to read from bad
> memory, we can expect #MC (I assume, on a sensibly functioning
> platform).  But, once we get the #MC, I imagine that the #MC handler
> will want to unmap the page to prevent a storm of additional #MC
> events on the same page -- given the awful x86 #MC design, too many
> all at once is fatal.  So the next time we copy_mc_to_user() or
> whatever from the memory, we'll get #PF instead.  Or maybe that #MC
> will defer the unmap?

After the consumption the PMEM driver arranges for the page to never
be mapped again via its "badblocks" list.

>
> So the point of my questions is that the overall design should be at
> least somewhat settled before anyone tries to review just the copy
> functions.

I would say that DAX / PMEM stretches the Linux memory error handling
model beyond what it was originally designed. The primary concepts
that bend the assumptions of mm/memory-failure.c are:

1/ DAX pages can not be offlined via the page allocator.

2/ DAX pages (well cachelines in those pages) can be asynchronously
marked poisoned by a platform or device patrol scrub facility.

3/ DAX pages might be repaired by writes.

Currently 1/ and 2/ are managed by a per-block-device "badblocks" list
that is populated by scrub results and also amended when #MC is raised
(see nfit_handle_mce()). When fs/dax.c services faults it will decline
to map the page if the physical file extent intersects a bad block.

There is also support for sending SIGBUS if userspace races the
scrubber to consume the badblock. However, that uses the standard
'struct page' error model and assumes that a file backed page is 1:1
mapped to a file. This requirement prevents filesystems from enabling
reflink. That collision and the desire to enable reflink is why we are
now investigating supplanting the mm/memory-failure.c model. When the
page is "owned" by a filesystem invoke the filesystem to handle the
memory error across all impacted files.

The presence of 3/ means that any action error handling takes to
disable access to the page needs to be capable of being undone, which
runs counter to the mm/memory-failure.c assumption that offlining is a
one-way trip.


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-04 Thread Andy Lutomirski
On Mon, May 4, 2020 at 1:05 PM Luck, Tony  wrote:
>
> > When a copy function hits a bad page and the page is not yet known to
> > be bad, what does it do?  (I.e. the page was believed to be fine but
> > the copy function gets #MC.)  Does it unmap it right away?  What does
> > it return?
>
> I suspect that we will only ever find a handful of situations where the
> kernel can recover from memory that has gone bad that are worth fixing
> (got to be some code path that touches a meaningful fraction of memory,
> otherwise we get code complexity without any meaningful payoff).
>
> I don't think we'd want different actions for the cases of "we just found out
> now that this page is bad" and "we got a notification an hour ago that this
> page had gone bad". Currently we treat those the same for application
> errors ... SIGBUS either way[1].

Oh, I agree that the end result should be the same.  I'm thinking more
about the mechanism and the internal API.  As a somewhat silly example
of why there's a difference, the first time we try to read from bad
memory, we can expect #MC (I assume, on a sensibly functioning
platform).  But, once we get the #MC, I imagine that the #MC handler
will want to unmap the page to prevent a storm of additional #MC
events on the same page -- given the awful x86 #MC design, too many
all at once is fatal.  So the next time we copy_mc_to_user() or
whatever from the memory, we'll get #PF instead.  Or maybe that #MC
will defer the unmap?

So the point of my questions is that the overall design should be at
least somewhat settled before anyone tries to review just the copy
functions.


RE: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-04 Thread Luck, Tony
> When a copy function hits a bad page and the page is not yet known to
> be bad, what does it do?  (I.e. the page was believed to be fine but
> the copy function gets #MC.)  Does it unmap it right away?  What does
> it return?

I suspect that we will only ever find a handful of situations where the
kernel can recover from memory that has gone bad that are worth fixing
(got to be some code path that touches a meaningful fraction of memory,
otherwise we get code complexity without any meaningful payoff).

I don't think we'd want different actions for the cases of "we just found out
now that this page is bad" and "we got a notification an hour ago that this
page had gone bad". Currently we treat those the same for application
errors ... SIGBUS either way[1].

-Tony

[1] well there are options both globally and at the per-process level to have
the "early" notifications delivered right away.


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-04 Thread Dan Williams
On Sun, May 3, 2020 at 5:57 AM David Laight  wrote:
>
> From: Linus Torvalds
> > Sent: 01 May 2020 19:29
> ...
> > And as DavidL pointed out - if you ever have "iomem" as a source or
> > destination, you need yet another case. Not because they can take
> > another kind of fault (although on some platforms you have the machine
> > checks for that too), but because they have *very* different
> > performance profiles (and the ERMS "rep movsb" sucks baby donkeys
> > through a straw).
>
>
> I was actually thinking that the nvdimm accesses need to be treated
> much more like (cached) memory mapped io space than normal system
> memory.
> So treating them the same as "iomem" and then having access functions
> that report access failures (which the current readq() doesn't)
> might make sense.

While I agree that something like copy_mc_iomem_to_{user,kernel} could
have users, nvdimm is not one of them.

> If you are using memory that 'might fail' for kernel code or data
> you really get what you deserve.

nvdimms are no less "might fail" than DRAM, recall that some nvdimms
are just DRAM with a platform promise that their contents are battery
backed.

> OTOH system response to PCIe errors is currently rather problematic.
> Mostly reads time out and return ~0u.
> This can be checked for and, if possibly valid, a second location read.

Yes, the ambiguous ~0u return needs careful handling.


RE: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-03 Thread David Laight
From: Linus Torvalds
> Sent: 01 May 2020 19:29
...
> And as DavidL pointed out - if you ever have "iomem" as a source or
> destination, you need yet another case. Not because they can take
> another kind of fault (although on some platforms you have the machine
> checks for that too), but because they have *very* different
> performance profiles (and the ERMS "rep movsb" sucks baby donkeys
> through a straw).


I was actually thinking that the nvdimm accesses need to be treated
much more like (cached) memory mapped io space than normal system
memory.
So treating them the same as "iomem" and then having access functions
that report access failures (which the current readq() doesn't)
might make sense.

If you are using memory that 'might fail' for kernel code or data
you really get what you deserve.

OTOH system response to PCIe errors is currently rather problematic.
Mostly reads time out and return ~0u.
This can be checked for and, if possibly valid, a second location read.

However we have a x86 server box (I've forgotten whether it is HP or
Dell) that generates an NMI whenever a PCIe link goes down.
(The 'platform' takes the AER interrupt and uses an NMI to pass
it to the kernel - whose bright idea was it to use an NMI???)
This happens even after we've done an 'echo 1 >remove'.
The system is supposed to be NEBS (I think that is the term) compliant
which is supposed to be suitable for telephony work (including
emergency calls), but any PCIe failure crashes the box!

I've another system here that sometimes fails to bring the PCIe
link back up.
I guess these code paths don't get regular testing.
In my case the PCIe slave is an fpga, reloading the fpga image
(either over JTAG or after rewriting eeprom) doesn't always work.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-02 Thread Andy Lutomirski
On Fri, May 1, 2020 at 7:09 AM Luck, Tony  wrote:
>
> > Now maybe copy_to_user() should *always* work this way, but I’m not 
> > convinced.
> > Certainly put_user() shouldn’t — the result wouldn’t even be well defined. 
> > And I’m
> >  unconvinced that it makes much sense for the majority of copy_to_user() 
> > callers
> >  that are also directly accessing the source structure.
>
> One case that might work is copy_to_user() that's copying from the kernel 
> page cache
> to the user in response to a read(2) system call.  Action would be to check 
> if we could
> re-read from the file system to a different page. If not, return -EIO. Either 
> way ditch the
> poison page from the page cache.
>

I think that, before we do too much design of the semantics of just
the copy function, we need a design for the whole system.
Specifically:

When the kernel finds out that a kernel page is bad (via #MC or via
any other mechanism), what does the kernel do?  Does it unmap it?
Does it replace it with a dummy page?  Does it leave it there?

When a copy function hits a bad page and the page is not yet known to
be bad, what does it do?  (I.e. the page was believed to be fine but
the copy function gets #MC.)  Does it unmap it right away?  What does
it return?

When a copy function hits a page that is already known to be bad
because the kernel got the "oh crap, bad page" notification earlier,
what does it do?  Return -EIO?  Take some fancier action under the
assumption that it's called in a preemptible, IRQs-on context, whereas
the original #MC or other hardware notification may have come at a
less opportune time?


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-01 Thread Dave Hansen
On 5/1/20 11:28 AM, Linus Torvalds wrote:
> Plus on x86 you can't reasonably even have different code sequences
> for that case, because CLAC/STAC don't have a "enable users read
> accesses" vs "write accesses" case. It's an all-or-nothing "enable
> user faults".
> 
> We _used_ to have a difference on x86, back when we did the whole "fs
> segment points to user space".

Protection keys might give us _some_ of this back.  If we're doing a
copy_from_user(), we could (logically) do:

stac()
save_pkru()
pkru |= ~0x
... do userspace read
restore_pkru()
clac()

That *should* generate a fault if we try to write to userspace in there
because PKRU affects all user *addresses* (PTEs with _PAGE_USER set) not
user-mode *accesses*.

Properly stashing the value off and context switching it correctly would
be fun, but probably not impossible to pull off.  You actually wouldn't
even technically need to restore PKRU in this path.  It would just need
to be restored before the thread runs userspace or hits a copy_to_user()
equivalent.

I can't imagine this would all be worth the trouble, but there are
crazier people out there than me.


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-01 Thread Linus Torvalds
On Thu, Apr 30, 2020 at 6:21 PM Dan Williams  wrote:
>
> However now I see that copy_user_generic() works for the wrong reason.
> It works because the exception on the source address due to poison
> looks no different than a write fault on the user address to the
> caller, it's still just a short copy. So it makes copy_to_user() work
> for the wrong reason relative to the name.

Right.

And it won't work that way on other architectures. On x86, we have a
generic function that can take faults on either side, and we use it
for both cases (and for the "in_user" case too), but that's an
artifact of the architecture oddity.

In fact, it's probably wrong even on x86 - because it can hide bugs -
but writing those things is painful enough that everybody prefers
having just one function.

That's particularly true since that "one function" is actually a whole
family of functions - just for different optimizations.

Plus on x86 you can't reasonably even have different code sequences
for that case, because CLAC/STAC don't have a "enable users read
accesses" vs "write accesses" case. It's an all-or-nothing "enable
user faults".

We _used_ to have a difference on x86, back when we did the whole "fs
segment points to user space".

But on other architectures, there really is a difference between
"copy_to_user()" and "copy_from_user()", and the functions won't do
fault handling for the kernel side accesses.

> How about, following your suggestion, introduce copy_mc_to_user() (can
> just use copy_user_generic() internally) and copy_mc_to_kernel() for
> the other the helpers that the copy_to_iter() implementation needs?

Yes. That at least solves my naming worries, and is conceptually
something that works on other architectures.

Those other architectures may not have nvdimm support yet, but I think
everybody is at least looking at it.

And I really do think it will make the users more readable too, when
you see on a source level that "oh, this code is expecting that it
could take a poison fault/machine check on the source/destination".

> Following Jann's ex_handler_uaccess() example I could arrange for
> copy_mc_to_kernel() to use a new _ASM_EXTABLE_MC() to validate that
> the only type of exception meant to be handled is MC and warn
> otherwise?

That may be a good idea, but won't work for any shared case.

IOW, it would be lovely to have a "copy_mc_to_user()" check that if
it's a write fault, it's because it's a user address (and if it's a
read fault it's because it's a poisoned page or mc or whatever, but a
valid kernel address).

But it's exactly the kind of thing that we currently don't do even for
the bog-standard "copy_to_user()", because we share all the code
because we're lazy.

And as DavidL pointed out - if you ever have "iomem" as a source or
destination, you need yet another case. Not because they can take
another kind of fault (although on some platforms you have the machine
checks for that too), but because they have *very* different
performance profiles (and the ERMS "rep movsb" sucks baby donkeys
through a straw).

  Linus


RE: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-01 Thread Luck, Tony
> Now maybe copy_to_user() should *always* work this way, but I’m not convinced.
> Certainly put_user() shouldn’t — the result wouldn’t even be well defined. 
> And I’m
>  unconvinced that it makes much sense for the majority of copy_to_user() 
> callers
>  that are also directly accessing the source structure.

One case that might work is copy_to_user() that's copying from the kernel page 
cache
to the user in response to a read(2) system call.  Action would be to check if 
we could
re-read from the file system to a different page. If not, return -EIO. Either 
way ditch the
poison page from the page cache.

-Tony


RE: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-01 Thread David Laight
From: Andy Lutomirski
> Sent: 30 April 2020 19:42
...
> I suppose there could be a consistent naming like this:
> 
> copy_from_user()
> copy_to_user()
> 
> copy_from_unchecked_kernel_address() [what probe_kernel_read() is]
> copy_to_unchecked_kernel_address() [what probe_kernel_write() is]
> 
> copy_from_fallible() [from a kernel address that can fail to a kernel
> address that can't fail]
> copy_to_fallible() [the opposite, but hopefully identical to memcpy() on x86]
> 
> copy_from_fallible_to_user()
> copy_from_user_to_fallible()

You missed out:
copy_to/from_io()
copy_to_io_from_user()
copy_from_io_to_user()
All of which want aligned addresses on the 'io' side.

It might even be worth saying that the copy_to/from_io() can
fail due to bad IO accesses (rather than bad addresses).
This is not entirely unexpected since all PCIe accesses
can fail unexpectedly (usually without a trap and returning -1).
But a system could arrange to generate a synchronous fault.

If you are copying directly from io to user you need to
differentiate between a user page fault and an io access
error. The latter shouldn't generate SIGSEGV.
Possibly return -EFAULT on user page fault and 'transfer
length remaining' on io access error.
Although filling the rest of the buffer with 0xff might be
appropriate.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Dan Williams
On Thu, Apr 30, 2020 at 5:10 PM Linus Torvalds
 wrote:
>
> On Thu, Apr 30, 2020 at 4:52 PM Dan Williams  wrote:
> >
> > You had me until here. Up to this point I was grokking that Andy's
> > "_fallible" suggestion does help explain better than "_safe", because
> > the copy is doing extra safety checks. copy_to_user() and
> > copy_to_user_fallible() mean *something* where copy_to_user_safe()
> > does not.
>
> It's a horrible word, btw. The word doesn't actually mean what Andy
> means it to mean. "fallible" means "can make mistakes", not "can
> fault".
>
> So "fallible" is a horrible name.
>
> But anyway, I don't hate something like "copy_to_user_fallible()"
> conceptually. The naming needs to be fixed, in that "user" can always
> take a fault, so it's the _source_ that can fault, not the "user"
> part.
>
> It was the "copy_safe()" model that I find unacceptable, that uses
> _one_ name for what is at the very least *four* different operations:
>
>  - copy from faulting memory to user
>
>  - copy from faulting memory to kernel
>
>  - copy from kernel to faulting memory
>
>  - copy within faulting memory
>
> No way can you do that with one single function. A kernel address and
> a user address may literally have the exact same bit representation.
> So the user vs kernel distinction _has_ to be in the name.
>
> The "kernel vs faulting" doesn't necessarily have to be there from an
> implementation standpoint, but it *should* be there, because
>
>  - it might affect implemmentation
>
>  - but even if it DOESN'T affect implementation, it should be separate
> just from the standpoint of being self-documenting code.
>
> > However you lose me on this "broken nvdimm semantics" contention.
> > There is nothing nvdimm-hardware specific about the copy_safe()
> > implementation, zero, nada, nothing new to the error model that DRAM
> > did not also inflict on the Linux implementation.
>
> Ok, so good. Let's kill this all, and just use memcpy(), and copy_to_user().
>
> Just make sure that the nvdimm code doesn't use invalid kernel
> addresses or other broken poisoning.
>
> Problem solved.
>
> You can't have it both ways. Either memcpy just works, or it doesn't.

It doesn't, but copy_to_user() is frustratingly close and you can see
in the patch that I went ahead and used copy_user_generic() to
implement the backend of the default "fast" implementation.

However now I see that copy_user_generic() works for the wrong reason.
It works because the exception on the source address due to poison
looks no different than a write fault on the user address to the
caller, it's still just a short copy. So it makes copy_to_user() work
for the wrong reason relative to the name.

How about, following your suggestion, introduce copy_mc_to_user() (can
just use copy_user_generic() internally) and copy_mc_to_kernel() for
the other the helpers that the copy_to_iter() implementation needs?
That makes it clear that no mmu-faults are expected on reads, only
exceptions, and no protection-faults are expected at all for
copy_mc_to_kernel() even if it happens to accidentally handle it.
Following Jann's ex_handler_uaccess() example I could arrange for
copy_mc_to_kernel() to use a new _ASM_EXTABLE_MC() to validate that
the only type of exception meant to be handled is MC and warn
otherwise?


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Andy Lutomirski



> On Apr 30, 2020, at 5:25 PM, Linus Torvalds  
> wrote:
> 
> 
> It wasn't clear how "copy_to_mc()" could ever fault. Poisoning
> after-the-fact? Why would that be preferable to just mapping a dummy
> page?

If the kernel gets an async memory error and maps a dummy page, then subsequent 
reads will subsequently succeed and return garbage when they should fail.  If 
x86 had write-only pages, we could map a dummy write-only page. But we don’t, 
so I think we’re stuck.

As for naming the kind of memory we’re taking about, ISTM there are two 
classes: DAX and monstrous memory-mapped non-persistent cache devices.  Both 
could be Optane, I suppose.

But I also think it’s legitimate to use these accessors to increase the chance 
of surviving a failure of normal memory. If a normal page happens to be page 
cache when it fails and if page cache access use these fancy accessors, then we 
might actually survive a failure.

We could be ambitious: declare that all page cache and all get_user_page’d 
memory should use the new accessors.  I doubt we’ll ever really succeed due to 
magical things like rseq and anything that thinks that users can set up their 
own memory as a kernel-accessed ring buffer, but I suppose we could try.



Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Andy Lutomirski



> On Apr 30, 2020, at 5:40 PM, Linus Torvalds  
> wrote:
> 
> On Thu, Apr 30, 2020 at 5:23 PM Andy Lutomirski  wrote:
>> 
>>> But anyway, I don't hate something like "copy_to_user_fallible()"
>>> conceptually. The naming needs to be fixed, in that "user" can always
>>> take a fault, so it's the _source_ that can fault, not the "user"
>>> part.
>> 
>> I don’t like this.  “user” already implied that basically anything can be 
>> wrong with the memory
> 
> Maybe I didn't explain.
> 
> "user" already implies faulting. We agree.
> 
> And since we by definition cannot know what the user has mapped into
> user space, *every* normal copy_to_user() has to be able to handle
> whatever faults that throws at us.
> 
> The reason I dislike "copy_to_user_fallible()" is that the user side
> already has that 'fallible".
> 
> If it's the _source_ being "fallible" (it really needs a better name -
> I will not call it just "f") then it should be "copy_f_to_user()".
> 
> That would be ok.
> 
> So "copy_f_to_user()" makes sense. But "copy_to_user_f()" does not.
> That puts the "f" on the "user", which we already know can fault.
> 
> See what I want in the name? I want the name to say which side can
> cause problems!

We are in violent agreement. I’m moderately confident that I never suggested 
copy_from_user_f(). We appear to agree completely.



Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Linus Torvalds
On Thu, Apr 30, 2020 at 5:23 PM Andy Lutomirski  wrote:
>
> > But anyway, I don't hate something like "copy_to_user_fallible()"
> > conceptually. The naming needs to be fixed, in that "user" can always
> > take a fault, so it's the _source_ that can fault, not the "user"
> > part.
>
> I don’t like this.  “user” already implied that basically anything can be 
> wrong with the memory

Maybe I didn't explain.

"user" already implies faulting. We agree.

And since we by definition cannot know what the user has mapped into
user space, *every* normal copy_to_user() has to be able to handle
whatever faults that throws at us.

The reason I dislike "copy_to_user_fallible()" is that the user side
already has that 'fallible".

If it's the _source_ being "fallible" (it really needs a better name -
I will not call it just "f") then it should be "copy_f_to_user()".

That would be ok.

So "copy_f_to_user()" makes sense. But "copy_to_user_f()" does not.
That puts the "f" on the "user", which we already know can fault.

See what I want in the name? I want the name to say which side can
cause problems!

If you are copying memory from some f source, it must not be
"copy_safe()". That doesn't say if the _source_ is f, or the
destination is f.

So "copy_to_f()" makes sense (we don't say "copy_kernel_to_user()" -
the "normal" case is silent), and "copy_from_f()" makes sense.
"copy_in_f()" makes sense too.

But not this "randomly copy some randomly f memory area that I don't
know if it's the source or the destination".

Sometimes you may then use a common implementation for the different
directions - if that works on the architecture.

For example, "copy_to_user()" and "copy_from_user()" on x86 both end
up internally using a shared "copy_user_generic()" implementation. I
wish that wasn't the case (when I asked for what was to become
STAC/CLAC, I asked for one that could determine which direction of a
"rep movs" could touch user space), but it so happens that the
implementations end up being symmetric on x86.

But that's a pure implementation issue, and it very much is not going
to be true in general, and it shouldn't be exposed to users as such
(and we obviously don't).

Linus


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Linus Torvalds
On Thu, Apr 30, 2020 at 5:10 PM Linus Torvalds
 wrote:
>
> It's a horrible word, btw. The word doesn't actually mean what Andy
> means it to mean. "fallible" means "can make mistakes", not "can
> fault".

Btw, on naming: the name should be about _why_ it can fault, not about
whether it faults.

Which hasn't been explained to me.

I know why user accesses can fault. I still don't know why these new
accesses can fault. I know of the old name (mcs), but the newly
suggested name (safe) is the _opposite_ of an explanation of why it
faults.

Naming - like comments - shouldn't be about what some implementation
is, but about the concept.

Again, let me use that "copy_to_user()" as an example of this. Yes, it
can fault. Notice how the name doesn't say "copy_to_faulting()". That
would be WRONG. It can fault _because_ it is user memory, so
"copy_to_user()" not only describes what it does, but it also
implicitly describes that it can fault.

THAT is the kind of explanation I'm looking for. The "memcpy_mcsafe()"
at least had _some_ of that in it. It was wrong for all the _other_
reasons (not having a direction, and the hardware just being complete
and utter garbage), but at least there was a reason in the name.

I am not interested in adding new infrastructure that cannot even be
explained. Why would writes ever fault, considering they are posted
(and again, "user space" is not a valid reason, we have that case
already and have had it since day #1 even if the original naming was
the same kind of bad implementation-specific name that "mcsafe" was).

If the ONLY reason for the fault is a machine check, then the name
should say so, and "copy_mc_to_user()" would be a perfectly fine name
(along with copy_to_mc(), copy_from_mc(), and copy_in_mc()).

It wasn't clear how "copy_to_mc()" could ever fault. Poisoning
after-the-fact? Why would that be preferable to just mapping a dummy
page? But even if it cannot fault, I can see that you might want to do
it as a special kind of copy to avoid any read-mask-write artifacts
(which can definitely happen on other architectures, and I could see a
manual memcpy() implementation doing even on x86)

  Linus


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Andy Lutomirski


> On Apr 30, 2020, at 5:10 PM, Linus Torvalds  
> wrote:
> 
> On Thu, Apr 30, 2020 at 4:52 PM Dan Williams  
> wrote:
>> 
>> You had me until here. Up to this point I was grokking that Andy's
>> "_fallible" suggestion does help explain better than "_safe", because
>> the copy is doing extra safety checks. copy_to_user() and
>> copy_to_user_fallible() mean *something* where copy_to_user_safe()
>> does not.
> 
> It's a horrible word, btw. The word doesn't actually mean what Andy
> means it to mean. "fallible" means "can make mistakes", not "can
> fault".
> 
> So "fallible" is a horrible name.

What I was trying to get at was not “can fault” but “can malfunction”.  Maybe 
“unreliable”?  Better words welcome.

> 
> But anyway, I don't hate something like "copy_to_user_fallible()"
> conceptually. The naming needs to be fixed, in that "user" can always
> take a fault, so it's the _source_ that can fault, not the "user"
> part.

I don’t like this.  “user” already implied that basically anything can be wrong 
with the memory — it can be unmapped entirely, it can have the wrong 
permissions, it can have the wrong protection key, it can have an ECC error, 
etc.  If the operation you want is “copy from unreliable kernel memory (but 
with a definitely valid pointer) to user memory”, you want 
copy_unreliable_to_user().

Now maybe copy_to_user() should *always* work this way, but I’m not convinced. 
Certainly put_user() shouldn’t — the result wouldn’t even be well defined. And 
I’m unconvinced that it makes much sense for the majority of copy_to_user() 
callers that are also directly accessing the source structure.

I also tend to think that the probe_kernel stuff should just stay separate. 
Those are really for two totally separate types of use: either the kernel is 
trying its best to print an errr message without exploding worse, or it’s 
involved in eBPF or trading hacks in which address is arbitrary and essentially 
untrusted.


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Linus Torvalds
On Thu, Apr 30, 2020 at 4:52 PM Dan Williams  wrote:
>
> You had me until here. Up to this point I was grokking that Andy's
> "_fallible" suggestion does help explain better than "_safe", because
> the copy is doing extra safety checks. copy_to_user() and
> copy_to_user_fallible() mean *something* where copy_to_user_safe()
> does not.

It's a horrible word, btw. The word doesn't actually mean what Andy
means it to mean. "fallible" means "can make mistakes", not "can
fault".

So "fallible" is a horrible name.

But anyway, I don't hate something like "copy_to_user_fallible()"
conceptually. The naming needs to be fixed, in that "user" can always
take a fault, so it's the _source_ that can fault, not the "user"
part.

It was the "copy_safe()" model that I find unacceptable, that uses
_one_ name for what is at the very least *four* different operations:

 - copy from faulting memory to user

 - copy from faulting memory to kernel

 - copy from kernel to faulting memory

 - copy within faulting memory

No way can you do that with one single function. A kernel address and
a user address may literally have the exact same bit representation.
So the user vs kernel distinction _has_ to be in the name.

The "kernel vs faulting" doesn't necessarily have to be there from an
implementation standpoint, but it *should* be there, because

 - it might affect implemmentation

 - but even if it DOESN'T affect implementation, it should be separate
just from the standpoint of being self-documenting code.

> However you lose me on this "broken nvdimm semantics" contention.
> There is nothing nvdimm-hardware specific about the copy_safe()
> implementation, zero, nada, nothing new to the error model that DRAM
> did not also inflict on the Linux implementation.

Ok, so good. Let's kill this all, and just use memcpy(), and copy_to_user().

Just make sure that the nvdimm code doesn't use invalid kernel
addresses or other broken poisoning.

Problem solved.

You can't have it both ways. Either memcpy just works, or it doesn't.

So which way is it?

  Linus


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Dan Williams
On Thu, Apr 30, 2020 at 12:56 PM Linus Torvalds
 wrote:
>
> On Thu, Apr 30, 2020 at 12:23 PM Luck, Tony  wrote:
> >
> > How about
> >
> > try_copy_catch(void *dst, void *src, size_t count, int *fault)
> >
> > returns number of bytes not-copied (like copy_to_user etc).
> >
> > if return is not zero, "fault" tells you what type of fault
> > cause the early stop (#PF, #MC).
>
> That just makes things worse.
>
> The problem isn't "what fault did I get?".
>
> The problem is "what is the point of this function?".
>
> In other words, I want the code to explain _why_ it uses a particular 
> function.
>
> So the question the function should answer is not "Why did it take a
> fault?", but "Why isn't this just a 'memcpy()'"?
>
> When somebody writes
>
> x = copy_to_user(a,b,c);
>
> the "why is it not a memcpy" question never comes up, because the code
> is obvious. It's not a memory copy, because it's copying to user
> space, and user space is different!
>
> In contrast, if you write
>
> x = copy_safe(a,b,c);
>
> then what is going on? There is no rhyme or reason to the call. Is the
> source unsafe? Wny? Is the destination faulting? Why? Both? How?
>
> So "copy_to_user()" _answers_ a question. But "copy_safe()" just
> results in more questions. See the difference?
>
> And notice that the "why did it fault" question is NOT about your
> "what kind of fault did it take" question. That question is generally
> pretty much uninteresting.
>
> The question I want answered is "why is this function being called AT ALL".
>
> Again, "copy_to_user()" can fail, and we know to check failure cases.
> But more importantly, the _reason_ it can fail is obvious from the
> name and from the use. There's no confusion about "why".
>
> "copy_safe()"? or "try_copy_catch()"? No such thing. It doesn't answer
> that fundamental "why" question.
>
> And yes, this also has practical consequences. If you know that the
> failure is due to the source being some special memory (and if we care
> about the MC case with unaligned accesses), then the function in
> question should probably strive to make those _source_ accesses be the
> aligned ones.  But if it's the destination that is special memory,
> then it's the writes to the destination that should be aligned. If you
> need both, you may need to be either mutually aligned, or do byte
> accesses, or do special shifting copies. So it matters for any initial
> alignment code (if the thing has alignment issues to begin with).
>
> I haven't even gotten an answer to the "why would the write fail".
> When I asked, Dan said
>
>  "writes can mmu-fault now that memcpy_mcsafe() is also used by
> _copy_to_iter_mcsafe()"
>
> but as mentioned, the only reason I can find for _that_ is that the
> destination is user space.
>
> In which case a "copy_safe()" absolutely could never work.
>
> If I can't figure out the "why is this function used" or even figure
> out why it needs the semantics it claims it needs, then there's a
> problem.
>
> Personally, I suspect that the *easiest* way to support the broken
> nvdimm semantics is to not have a "copy" function at all as the basic
> accessor model.

You had me until here. Up to this point I was grokking that Andy's
"_fallible" suggestion does help explain better than "_safe", because
the copy is doing extra safety checks. copy_to_user() and
copy_to_user_fallible() mean *something* where copy_to_user_safe()
does not.

However you lose me on this "broken nvdimm semantics" contention.
There is nothing nvdimm-hardware specific about the copy_safe()
implementation, zero, nada, nothing new to the error model that DRAM
did not also inflict on the Linux implementation.

This is about Linux treating memory as a disk and performing bulk
transfers with the CPU instead of a DMA engine. Whereas existing
memory error handling has a high chance for it to trigger on userspace
accesses, large copies in kernel mode now have a higher chance of
tripping over the same errors in kernel space. Since there is an error
model overlaid on top of the block-I/O, software is well prepared to
handle the short transfer case.

> Why? Exactly because "copy" is not a fundamental well-defined action.
> You get nasty combinatorial explosions of different things, where you
> have three different kinds of sources (user, RAM, nvdimm) and three
> different kinds of destinations.

True, copy is not a fundamentally well defined operation. It's
unfortunate that the low-level of a direct-I/O read(2) eventually
turns into a call to memcpy() inside a typical ramdisk driver. So
you're right, this isn't a copy routine as much as it's the backend of
read(2)/write(2) with short transfer semantics, but by the time it
gets to pmem driver it's been disconnected from its original intent
and all it sees is "move bytes from here to there if you can".

> And that's ignoring the whole "maybe you don't want to do a plain
> copy, maybe you want to calculate a RAID checksum, or do a 'strcpy()'
> or whatever". If 

Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Luck, Tony
On Thu, Apr 30, 2020 at 12:50:40PM -0700, Linus Torvalds wrote:

I see your point about the namimg being important.  I think Dan's
case is indeed "copy from pmem to user" where only options for faulting
are #MC on the source addresses, and #PF on the destination.

> The only *fundamental* access would likely be a single read/write
> operation, not a copy operation. Think "get_user()" instead of
> "copy_from_user()".  Even there you get combinatorial explosions with
> access sizes, but you can often generate those automatically or with
> simple patterns, and then you can build up the copy functions from
> that if you really need to.

That's maybe very clean. But it looks like it would be hard to build
a high performance interface on top of that primitive. Remember that
for Dan's copy 99.9367673%[1] of copies will not hit a machine
check on the read from pmem.

Dan wants (whatever the function name) to get to a "REP MOVS" with an
exception table entry to handle the cases where there is a fault.

-Tony

[1] Likely several more '9's in there


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Andy Lutomirski


> On Apr 30, 2020, at 12:51 PM, Dan Williams  wrote:
> 
> On Thu, Apr 30, 2020 at 12:23 PM Luck, Tony  wrote:
>> 
>>> On Thu, Apr 30, 2020 at 11:42:20AM -0700, Andy Lutomirski wrote:
>>> I suppose there could be a consistent naming like this:
>>> 
>>> copy_from_user()
>>> copy_to_user()
>>> 
>>> copy_from_unchecked_kernel_address() [what probe_kernel_read() is]
>>> copy_to_unchecked_kernel_address() [what probe_kernel_write() is]
>>> 
>>> copy_from_fallible() [from a kernel address that can fail to a kernel
>>> address that can't fail]
>>> copy_to_fallible() [the opposite, but hopefully identical to memcpy() on 
>>> x86]
>>> 
>>> copy_from_fallible_to_user()
>>> copy_from_user_to_fallible()
>>> 
>>> These names are fairly verbose and could probably be improved.
>> 
>> How about
>> 
>>try_copy_catch(void *dst, void *src, size_t count, int *fault)
>> 
>> returns number of bytes not-copied (like copy_to_user etc).
>> 
>> if return is not zero, "fault" tells you what type of fault
>> cause the early stop (#PF, #MC).
> 
> I do like try_copy_catch() for the case when neither of the buffers
> are __user (like in the pmem driver) and _copy_to_iter_fallible()
> (plus all the helpers it implies) for the other cases.
> 
> copy_to_user_fallible
> copy_fallible_to_page
> copy_pipe_to_iter_fallible
> 
> ...because the mmu-fault handling is an aspect of "_user" and fallible
> implies other source fault reasons. It does leave a gap if an
> architecture has a concept of a fallible write, but that seems like
> something that will be handled asynchronously and not subject to this
> interface.


I’m suspicious that, as a practical matter, x86 does have a fallible write. In 
particular, if a page fails and the memory failure code is notified, the page 
will be unmapped. At that point, an attempt to write to the failed fallible 
page will get #PF, and code that writes to it needs to be prepared to handle 
#PF.  Perhaps copy_to_fallible(), etc can still return void, but I’m unconvinced
the implementation can be the same as memcpy.

Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Linus Torvalds
On Thu, Apr 30, 2020 at 12:23 PM Luck, Tony  wrote:
>
> How about
>
> try_copy_catch(void *dst, void *src, size_t count, int *fault)
>
> returns number of bytes not-copied (like copy_to_user etc).
>
> if return is not zero, "fault" tells you what type of fault
> cause the early stop (#PF, #MC).

That just makes things worse.

The problem isn't "what fault did I get?".

The problem is "what is the point of this function?".

In other words, I want the code to explain _why_ it uses a particular function.

So the question the function should answer is not "Why did it take a
fault?", but "Why isn't this just a 'memcpy()'"?

When somebody writes

x = copy_to_user(a,b,c);

the "why is it not a memcpy" question never comes up, because the code
is obvious. It's not a memory copy, because it's copying to user
space, and user space is different!

In contrast, if you write

x = copy_safe(a,b,c);

then what is going on? There is no rhyme or reason to the call. Is the
source unsafe? Wny? Is the destination faulting? Why? Both? How?

So "copy_to_user()" _answers_ a question. But "copy_safe()" just
results in more questions. See the difference?

And notice that the "why did it fault" question is NOT about your
"what kind of fault did it take" question. That question is generally
pretty much uninteresting.

The question I want answered is "why is this function being called AT ALL".

Again, "copy_to_user()" can fail, and we know to check failure cases.
But more importantly, the _reason_ it can fail is obvious from the
name and from the use. There's no confusion about "why".

"copy_safe()"? or "try_copy_catch()"? No such thing. It doesn't answer
that fundamental "why" question.

And yes, this also has practical consequences. If you know that the
failure is due to the source being some special memory (and if we care
about the MC case with unaligned accesses), then the function in
question should probably strive to make those _source_ accesses be the
aligned ones.  But if it's the destination that is special memory,
then it's the writes to the destination that should be aligned. If you
need both, you may need to be either mutually aligned, or do byte
accesses, or do special shifting copies. So it matters for any initial
alignment code (if the thing has alignment issues to begin with).

I haven't even gotten an answer to the "why would the write fail".
When I asked, Dan said

 "writes can mmu-fault now that memcpy_mcsafe() is also used by
_copy_to_iter_mcsafe()"

but as mentioned, the only reason I can find for _that_ is that the
destination is user space.

In which case a "copy_safe()" absolutely could never work.

If I can't figure out the "why is this function used" or even figure
out why it needs the semantics it claims it needs, then there's a
problem.

Personally, I suspect that the *easiest* way to support the broken
nvdimm semantics is to not have a "copy" function at all as the basic
accessor model.

Why? Exactly because "copy" is not a fundamental well-defined action.
You get nasty combinatorial explosions of different things, where you
have three different kinds of sources (user, RAM, nvdimm) and three
different kinds of destinations.

And that's ignoring the whole "maybe you don't want to do a plain
copy, maybe you want to calculate a RAID checksum, or do a 'strcpy()'
or whatever". If those are ever issues, you get another explosion of
combinations.

The only *fundamental* access would likely be a single read/write
operation, not a copy operation. Think "get_user()" instead of
"copy_from_user()".  Even there you get combinatorial explosions with
access sizes, but you can often generate those automatically or with
simple patterns, and then you can build up the copy functions from
that if you really need to.

   Linus


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Dan Williams
On Thu, Apr 30, 2020 at 12:23 PM Luck, Tony  wrote:
>
> On Thu, Apr 30, 2020 at 11:42:20AM -0700, Andy Lutomirski wrote:
> > I suppose there could be a consistent naming like this:
> >
> > copy_from_user()
> > copy_to_user()
> >
> > copy_from_unchecked_kernel_address() [what probe_kernel_read() is]
> > copy_to_unchecked_kernel_address() [what probe_kernel_write() is]
> >
> > copy_from_fallible() [from a kernel address that can fail to a kernel
> > address that can't fail]
> > copy_to_fallible() [the opposite, but hopefully identical to memcpy() on 
> > x86]
> >
> > copy_from_fallible_to_user()
> > copy_from_user_to_fallible()
> >
> > These names are fairly verbose and could probably be improved.
>
> How about
>
> try_copy_catch(void *dst, void *src, size_t count, int *fault)
>
> returns number of bytes not-copied (like copy_to_user etc).
>
> if return is not zero, "fault" tells you what type of fault
> cause the early stop (#PF, #MC).

I do like try_copy_catch() for the case when neither of the buffers
are __user (like in the pmem driver) and _copy_to_iter_fallible()
(plus all the helpers it implies) for the other cases.

copy_to_user_fallible
copy_fallible_to_page
copy_pipe_to_iter_fallible

...because the mmu-fault handling is an aspect of "_user" and fallible
implies other source fault reasons. It does leave a gap if an
architecture has a concept of a fallible write, but that seems like
something that will be handled asynchronously and not subject to this
interface.


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Luck, Tony
On Thu, Apr 30, 2020 at 11:42:20AM -0700, Andy Lutomirski wrote:
> I suppose there could be a consistent naming like this:
> 
> copy_from_user()
> copy_to_user()
> 
> copy_from_unchecked_kernel_address() [what probe_kernel_read() is]
> copy_to_unchecked_kernel_address() [what probe_kernel_write() is]
> 
> copy_from_fallible() [from a kernel address that can fail to a kernel
> address that can't fail]
> copy_to_fallible() [the opposite, but hopefully identical to memcpy() on x86]
> 
> copy_from_fallible_to_user()
> copy_from_user_to_fallible()
> 
> These names are fairly verbose and could probably be improved.

How about

try_copy_catch(void *dst, void *src, size_t count, int *fault)

returns number of bytes not-copied (like copy_to_user etc).

if return is not zero, "fault" tells you what type of fault
cause the early stop (#PF, #MC).

-Tony


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Andy Lutomirski
On Thu, Apr 30, 2020 at 10:17 AM Linus Torvalds
 wrote:
>
> On Thu, Apr 30, 2020 at 9:52 AM Andy Lutomirski  wrote:
> >
> > If I'm going to copy from memory that might be bad but is at least a
> > valid pointer, I want a function to do this.  If I'm going to copy
> > from memory that might be entirely bogus, that's a different
> > operation.  In other words, if I'm writing e.g. filesystem that is
> > touching get_user_pages()'d persistent memory, I don't want to panic
> > if the memory fails, but I do want at least a very loud warning if I
> > follow a wild pointer.
> >
> > So I think that probe_kernel_copy() is not a valid replacement for
> > memcpy_mcsafe().
>
> Fair enough.
>
> That said, the part I do like about probe_kernel_read/write() is that
> it does indicate which part we think is possibly the one that needs
> more care.
>
> Sure, it _might_ be both sides, but honestly, that's likely the much
> less common case. Kind of like "copy_{to,from}_user()" vs
> "copy_in_user()".
>
> Yes, the "copy_in_user()" case exists, but it's the odd and unusual case.

I suppose there could be a consistent naming like this:

copy_from_user()
copy_to_user()

copy_from_unchecked_kernel_address() [what probe_kernel_read() is]
copy_to_unchecked_kernel_address() [what probe_kernel_write() is]

copy_from_fallible() [from a kernel address that can fail to a kernel
address that can't fail]
copy_to_fallible() [the opposite, but hopefully identical to memcpy() on x86]

copy_from_fallible_to_user()
copy_from_user_to_fallible()

These names are fairly verbose and could probably be improved.


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Linus Torvalds
On Thu, Apr 30, 2020 at 9:52 AM Andy Lutomirski  wrote:
>
> If I'm going to copy from memory that might be bad but is at least a
> valid pointer, I want a function to do this.  If I'm going to copy
> from memory that might be entirely bogus, that's a different
> operation.  In other words, if I'm writing e.g. filesystem that is
> touching get_user_pages()'d persistent memory, I don't want to panic
> if the memory fails, but I do want at least a very loud warning if I
> follow a wild pointer.
>
> So I think that probe_kernel_copy() is not a valid replacement for
> memcpy_mcsafe().

Fair enough.

That said, the part I do like about probe_kernel_read/write() is that
it does indicate which part we think is possibly the one that needs
more care.

Sure, it _might_ be both sides, but honestly, that's likely the much
less common case. Kind of like "copy_{to,from}_user()" vs
"copy_in_user()".

Yes, the "copy_in_user()" case exists, but it's the odd and unusual case.

Looking at the existing cases of "memcpy_mcsafe()", they do seem to
generally have a very clearly defined direction, not "both sides can
break".

I also find myself suspecting that one case people _do_ want to
possibly do is to copy from nvdimm memory into user space. So then
that needs yet another function.

And we have that copy_to_user_mcsafe() for that, and used in the
disgustingly named "copyout_mcsafe()". Ugly incomprehensible BSD'ism.

But oddly we don't have the "from_user" case.

So this thing seems messy, the naming is odd and inconsistent, and I'd
really like the whole "access with exception handling" to have some
clear rules and clear names.

The whole "there are fifty different special cases" really drives me
wild. It's why I think the hardware was so broken.

And now the special "writes can fault" rule still confuses me.
_copy_to_iter_mcsafe() was mentioned, which makes me think that it's
literally about that "copy from nvram to user space" issue.

But that can't just trap on the destination, that fundamentally needs
special user space accesses anyway. Even on x86 you have the whole
STAC/CLAC issue, on other architectures the stores may not be normal
stores at all.

So a "copy_safe()" model doesn't actually work for that at all.

So I'm a bit (maybe a _lot_) confused about what the semantics should
actually be. And I want the naming to reflect whatever those semantics
are. And I don't think "copy_safe()" reflects any semantics at all.

 Linus


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Andy Lutomirski
On Thu, Apr 30, 2020 at 7:03 AM Linus Torvalds
 wrote:
>
> On Thu, Apr 30, 2020 at 1:41 AM Dan Williams  wrote:
> >
> > With the above realizations the name "mcsafe" is no longer accurate and
> > copy_safe() is proposed as its replacement. x86 grows a copy_safe_fast()
> > implementation as a default implementation that is independent of
> > detecting the presence of x86-MCA.
>
> How is this then different from "probe_kernel_read()" and
> "probe_kernel_write()"? Other than the obvious "it does it for both
> reads and writes"?
>
> IOW, wouldn't it be sensible to try to match the naming and try to
> find some unified model for all these things?
>
> "probe_kernel_copy()"?

I don't like this whole concept.

If I'm going to copy from memory that might be bad but is at least a
valid pointer, I want a function to do this.  If I'm going to copy
from memory that might be entirely bogus, that's a different
operation.  In other words, if I'm writing e.g. filesystem that is
touching get_user_pages()'d persistent memory, I don't want to panic
if the memory fails, but I do want at least a very loud warning if I
follow a wild pointer.

So I think that probe_kernel_copy() is not a valid replacement for
memcpy_mcsafe().

--Andy


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-04-30 Thread Linus Torvalds
On Thu, Apr 30, 2020 at 1:41 AM Dan Williams  wrote:
>
> With the above realizations the name "mcsafe" is no longer accurate and
> copy_safe() is proposed as its replacement. x86 grows a copy_safe_fast()
> implementation as a default implementation that is independent of
> detecting the presence of x86-MCA.

How is this then different from "probe_kernel_read()" and
"probe_kernel_write()"? Other than the obvious "it does it for both
reads and writes"?

IOW, wouldn't it be sensible to try to match the naming and try to
find some unified model for all these things?

"probe_kernel_copy()"?

  Linus