Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Dan Williams
On Wed, Apr 25, 2018 at 10:44 AM, Alex Deucher  wrote:
> On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig  wrote:
>> On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
>>> > It has a non-coherent transaction mode (which the chipset can opt to
>>> > not implement and still flush), to make sure the AGP horror show
>>> > doesn't happen again and GPU folks are happy with PCIe. That's at
>>> > least my understanding from digging around in amd the last time we had
>>> > coherency issues between intel and amd gpus. GPUs have some bits
>>> > somewhere (in the pagetables, or in the buffer object description
>>> > table created by userspace) to control that stuff.
>>>
>>> Right.  We have a bit in the GPU page table entries that determines
>>> whether we snoop the CPU's cache or not.
>>
>> I can see how that works with the GPU on the same SOC or SOC set as the
>> CPU.  But how is that going to work for a GPU that is a plain old PCIe
>> card?  The cache snooping in that case is happening in the PCIe root
>> complex.
>
> I'm not a pci expert, but as far as I know, the device sends either a
> snooped or non-snooped transaction on the bus.  I think the
> transaction descriptor supports a no snoop attribute.  Our GPUs have
> supported this feature for probably 20 years if not more, going back
> to PCI.  Using non-snooped transactions have lower latency and faster
> throughput compared to snooped transactions.

Right, 'no snoop' and 'relaxed ordering' have been part of the PCI
spec since forever. With a plain old PCI-E card the root complex
indeed arranges for caches to be snooped. Although it's not always
faster depending on the platform. 'No snoop' traffic may be relegated
to less bus resources relative to the much more common snooped
traffic.


Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-18 Thread Dan Williams
On Thu, Jan 18, 2018 at 5:18 AM, Will Deacon  wrote:
> Hi Dan, Linus,
>
> On Thu, Jan 11, 2018 at 05:41:08PM -0800, Dan Williams wrote:
>> On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds
>>  wrote:
>> > On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams  
>> > wrote:
>> >>
>> >> This series incorporates Mark Rutland's latest ARM changes and adds
>> >> the x86 specific implementation of 'ifence_array_ptr'. That ifence
>> >> based approach is provided as an opt-in fallback, but the default
>> >> mitigation, '__array_ptr', uses a 'mask' approach that removes
>> >> conditional branches instructions, and otherwise aims to redirect
>> >> speculation to use a NULL pointer rather than a user controlled value.
>> >
>> > Do you have any performance numbers and perhaps example code
>> > generation? Is this noticeable? Are there any microbenchmarks showing
>> > the difference between lfence use and the masking model?
>>
>> I don't have performance numbers, but here's a sample code generation
>> from __fcheck_files, where the 'and; lea; and' sequence is portion of
>> array_ptr() after the mask generation with 'sbb'.
>>
>> fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
>>  8e7:   8b 02   mov(%rdx),%eax
>>  8e9:   48 39 c7cmp%rax,%rdi
>>  8ec:   48 19 c9sbb%rcx,%rcx
>>  8ef:   48 8b 42 08 mov0x8(%rdx),%rax
>>  8f3:   48 89 femov%rdi,%rsi
>>  8f6:   48 21 ceand%rcx,%rsi
>>  8f9:   48 8d 04 f0 lea(%rax,%rsi,8),%rax
>>  8fd:   48 21 c8and%rcx,%rax
>>
>>
>> > Having both seems good for testing, but wouldn't we want to pick one in 
>> > the end?
>>
>> I was thinking we'd keep it as a 'just in case' sort of thing, at
>> least until the 'probably safe' assumption of the 'mask' approach has
>> more time to settle out.
>
> From the arm64 side, the only concern I have (and this actually applies to
> our CSDB sequence as well) is the calculation of the array size by the
> caller. As Linus mentioned at the end of [1], if the determination of the
> size argument is based on a conditional branch, then masking doesn't help
> because you bound within the wrong range under speculation.
>
> We ran into this when trying to use masking to protect our uaccess routines
> where the conditional bound is either KERNEL_DS or USER_DS. It's possible
> that a prior conditional set_fs(KERNEL_DS) could defeat the masking and so
> we'd need to throw some heavy barriers in set_fs to make it robust.

At least in the conditional mask case near set_fs() usage the approach
we are taking is to use a barrier. I.e. the following guidance from
Linus:

"Basically, the rule is trivial: find all 'stac' users, and use address
masking if those users already integrate the limit check, and lfence
they don't."

...which translates to narrow the pointer for get_user() and use a
barrier  for __get_user().


Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds
 wrote:
> On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams  
> wrote:
>>
>> This series incorporates Mark Rutland's latest ARM changes and adds
>> the x86 specific implementation of 'ifence_array_ptr'. That ifence
>> based approach is provided as an opt-in fallback, but the default
>> mitigation, '__array_ptr', uses a 'mask' approach that removes
>> conditional branches instructions, and otherwise aims to redirect
>> speculation to use a NULL pointer rather than a user controlled value.
>
> Do you have any performance numbers and perhaps example code
> generation? Is this noticeable? Are there any microbenchmarks showing
> the difference between lfence use and the masking model?

I don't have performance numbers, but here's a sample code generation
from __fcheck_files, where the 'and; lea; and' sequence is portion of
array_ptr() after the mask generation with 'sbb'.

fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
 8e7:   8b 02   mov(%rdx),%eax
 8e9:   48 39 c7cmp%rax,%rdi
 8ec:   48 19 c9sbb%rcx,%rcx
 8ef:   48 8b 42 08 mov0x8(%rdx),%rax
 8f3:   48 89 femov%rdi,%rsi
 8f6:   48 21 ceand%rcx,%rsi
 8f9:   48 8d 04 f0 lea(%rax,%rsi,8),%rax
 8fd:   48 21 c8and%rcx,%rax


> Having both seems good for testing, but wouldn't we want to pick one in the 
> end?

I was thinking we'd keep it as a 'just in case' sort of thing, at
least until the 'probably safe' assumption of the 'mask' approach has
more time to settle out.

>
> Also, I do think that there is one particular array load that would
> seem to be pretty obvious: the system call function pointer array.
>
> Yes, yes, the actual call is now behind a retpoline, but that protects
> against a speculative BTB access, it's not obvious that it  protects
> against the mispredict of the __NR_syscall_max comparison in
> arch/x86/entry/entry_64.S.
>
> The act of fetching code is a kind of read too. And retpoline protects
> against BTB stuffing etc, but what if the _actual_ system call
> function address is wrong (due to mis-prediction of the system call
> index check)?
>
> Should the array access in entry_SYSCALL_64_fastpath be made to use
> the masking approach?

I'll take a look. I'm firmly in the 'patch first / worry later' stance
on these investigations.


[PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
Changes since v1 [1]:
* fixup the ifence definition to use alternative_2 per recent AMD
  changes in tip/x86/pti (Tom)

* drop 'nospec_ptr' (Linus, Mark)

* rename 'nospec_array_ptr' to 'array_ptr' (Alexei)

* rename 'nospec_barrier' to 'ifence' (Peter, Ingo)

* clean up occasions of 'variable assignment in if()' (Sergei, Stephen)

* make 'array_ptr' use a mask instead of an architectural ifence by
  default (Linus, Alexei)

* provide a command line and compile-time opt-in to the ifence
  mechanism, if an architecture provides 'ifence_array_ptr'.

* provide an optimized mask generation helper, 'array_ptr_mask', for
  x86 (Linus)

* move 'get_user' hardening from '__range_not_ok' to '__uaccess_begin'
  (Linus)

* drop "Thermal/int340x: prevent bounds-check..." since userspace does
  not have arbitrary control over the 'trip' index (Srinivas)

* update the changelog of "net: mpls: prevent bounds-check..." and keep
  it in the series to continue the debate about Spectre hygiene patches.
  (Eric).

* record a reviewed-by from Laurent on "[media] uvcvideo: prevent
  bounds-check..."

* update the cover letter

[1]: https://lwn.net/Articles/743376/

---

Quoting Mark's original RFC:

"Recently, Google Project Zero discovered several classes of attack
against speculative execution. One of these, known as variant-1, allows
explicit bounds checks to be bypassed under speculation, providing an
arbitrary read gadget. Further details can be found on the GPZ blog [2]
and the Documentation patch in this series."

This series incorporates Mark Rutland's latest ARM changes and adds
the x86 specific implementation of 'ifence_array_ptr'. That ifence
based approach is provided as an opt-in fallback, but the default
mitigation, '__array_ptr', uses a 'mask' approach that removes
conditional branches instructions, and otherwise aims to redirect
speculation to use a NULL pointer rather than a user controlled value.

The mask is generated by the following from Alexei, and Linus:

mask = ~(long)(_i | (_s - 1 - _i)) >> (BITS_PER_LONG - 1);

...and Linus provided an optimized mask generation helper for x86:

asm ("cmpq %1,%2; sbbq %0,%0;"
:"=r" (mask)
:"r"(sz),"r" (idx)
:"cc");

The 'array_ptr' mechanism can be switched between 'mask' and 'ifence'
via the spectre_v1={mask,ifence} command line option, and the
compile-time default is set by selecting either CONFIG_SPECTRE1_MASK or
CONFIG_SPECTRE1_IFENCE.

The 'array_ptr' infrastructure is the primary focus this patch set. The
individual patches that perform 'array_ptr' conversions are a point in
time (i.e. earlier kernel, early analysis tooling, x86 only etc...)
start at finding some of these gadgets.

Another consideration for reviewing these patches is the 'hygiene'
argument. When a patch refers to hygiene it is concerned with stopping
speculation on an unconstrained or insufficiently constrained pointer
value under userspace control. That by itself is not sufficient for
attack (per current understanding) [3], but it is a necessary
pre-condition.  So 'hygiene' refers to cleaning up those suspect
pointers regardless of whether they are usable as a gadget.

These patches are also be available via the 'nospec-v2' git branch
here:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec-v2

Note that the BPF fix for Spectre variant1 is merged in the bpf.git
tree [4], and is not included in this branch.

[2]: 
https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
[3]: https://spectreattack.com/spectre.pdf
[4]: 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=b2157399cc98

---

Dan Williams (16):
  x86: implement ifence()
  x86: implement ifence_array_ptr() and array_ptr_mask()
  asm-generic/barrier: mask speculative execution flows
  x86: introduce __uaccess_begin_nospec and ASM_IFENCE
  x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  ipv6: prevent bounds-check bypass via speculative execution
  ipv4: prevent bounds-check bypass via speculative execution
  vfs, fdtable: prevent bounds-check bypass via speculative execution
  userns: prevent bounds-check bypass via speculative execution
  udf: prevent bounds-check bypass via speculative execution
  [media] uvcvideo: prevent bounds-check bypass via speculative execution
  carl9170: prevent bounds-check bypass via speculative execution
  p54: prevent bounds-check bypass via speculative execution
  qla2xxx: prevent bounds-check bypass via speculative execution
  cw1200: prevent bounds-che

[PATCH v2 14/19] [media] uvcvideo: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency to read 'pin' from the
'selector->baSourceID' array. In order to avoid potential leaks of
kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'pin'.

Based on an original patch by Elena Reshetova.

Laurent notes:

"...as this is nowhere close to being a fast path, I think we can close
this potential hole as proposed in the patch"

Cc: Mauro Carvalho Chehab 
Cc: linux-media@vger.kernel.org
Reviewed-by: Laurent Pinchart 
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/media/usb/uvc/uvc_v4l2.c |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
index 3e7e283a44a8..30ee200206ee 100644
--- a/drivers/media/usb/uvc/uvc_v4l2.c
+++ b/drivers/media/usb/uvc/uvc_v4l2.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -809,8 +810,12 @@ static int uvc_ioctl_enum_input(struct file *file, void 
*fh,
const struct uvc_entity *selector = chain->selector;
struct uvc_entity *iterm = NULL;
u32 index = input->index;
+   __u8 *elem = NULL;
int pin = 0;
 
+   if (selector)
+   elem = array_ptr(selector->baSourceID, index,
+   selector->bNrInPins);
if (selector == NULL ||
(chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
if (index != 0)
@@ -820,8 +825,8 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
break;
}
pin = iterm->id;
-   } else if (index < selector->bNrInPins) {
-   pin = selector->baSourceID[index];
+   } else if (elem) {
+   pin = *elem;
list_for_each_entry(iterm, &chain->entities, chain) {
if (!UVC_ENTITY_IS_ITERM(iterm))
continue;



Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
On Thu, Jan 11, 2018 at 1:54 AM, Jiri Kosina  wrote:
> On Tue, 9 Jan 2018, Josh Poimboeuf wrote:
>
>> On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote:
>> > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina  wrote:
>> > > On Fri, 5 Jan 2018, Dan Williams wrote:
>> > >
>> > > [ ... snip ... ]
>> > >> Andi Kleen (1):
>> > >>   x86, barrier: stop speculation for failed access_ok
>> > >>
>> > >> Dan Williams (13):
>> > >>   x86: implement nospec_barrier()
>> > >>   [media] uvcvideo: prevent bounds-check bypass via speculative 
>> > >> execution
>> > >>   carl9170: prevent bounds-check bypass via speculative execution
>> > >>   p54: prevent bounds-check bypass via speculative execution
>> > >>   qla2xxx: prevent bounds-check bypass via speculative execution
>> > >>   cw1200: prevent bounds-check bypass via speculative execution
>> > >>   Thermal/int340x: prevent bounds-check bypass via speculative 
>> > >> execution
>> > >>   ipv6: prevent bounds-check bypass via speculative execution
>> > >>   ipv4: prevent bounds-check bypass via speculative execution
>> > >>   vfs, fdtable: prevent bounds-check bypass via speculative 
>> > >> execution
>> > >>   net: mpls: prevent bounds-check bypass via speculative execution
>> > >>   udf: prevent bounds-check bypass via speculative execution
>> > >>   userns: prevent bounds-check bypass via speculative execution
>> > >>
>> > >> Mark Rutland (4):
>> > >>   asm-generic/barrier: add generic nospec helpers
>> > >>   Documentation: document nospec helpers
>> > >>   arm64: implement nospec_ptr()
>> > >>   arm: implement nospec_ptr()
>> > >
>> > > So considering the recent publication of [1], how come we all of a sudden
>> > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
>> > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
>> > >
>> > > Is this going to be handled in eBPF in some other way?
>> > >
>> > > Without that in place, and considering Jann Horn's paper, it would seem
>> > > like PTI doesn't really lock it down fully, right?
>> >
>> > Here is the latest (v3) bpf fix:
>> >
>> > https://patchwork.ozlabs.org/patch/856645/
>> >
>> > I currently have v2 on my 'nospec' branch and will move that to v3 for
>> > the next update, unless it goes upstream before then.
>
> Daniel, I guess you're planning to send this still for 4.15?

It's pending in the bpf.git tree:


https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=b2157399cc9

>> That patch seems specific to CONFIG_BPF_SYSCALL.  Is the bpf() syscall
>> the only attack vector?  Or are there other ways to run bpf programs
>> that we should be worried about?
>
> Seems like Alexei is probably the only person in the whole universe who
> isn't CCed here ... let's fix that.

He will be cc'd on v2 of this series which will be available later today.


Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-09 Thread Dan Williams
On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina  wrote:
> On Fri, 5 Jan 2018, Dan Williams wrote:
>
> [ ... snip ... ]
>> Andi Kleen (1):
>>   x86, barrier: stop speculation for failed access_ok
>>
>> Dan Williams (13):
>>   x86: implement nospec_barrier()
>>   [media] uvcvideo: prevent bounds-check bypass via speculative execution
>>   carl9170: prevent bounds-check bypass via speculative execution
>>   p54: prevent bounds-check bypass via speculative execution
>>   qla2xxx: prevent bounds-check bypass via speculative execution
>>   cw1200: prevent bounds-check bypass via speculative execution
>>   Thermal/int340x: prevent bounds-check bypass via speculative execution
>>   ipv6: prevent bounds-check bypass via speculative execution
>>   ipv4: prevent bounds-check bypass via speculative execution
>>   vfs, fdtable: prevent bounds-check bypass via speculative execution
>>   net: mpls: prevent bounds-check bypass via speculative execution
>>   udf: prevent bounds-check bypass via speculative execution
>>   userns: prevent bounds-check bypass via speculative execution
>>
>> Mark Rutland (4):
>>   asm-generic/barrier: add generic nospec helpers
>>   Documentation: document nospec helpers
>>   arm64: implement nospec_ptr()
>>   arm: implement nospec_ptr()
>
> So considering the recent publication of [1], how come we all of a sudden
> don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
> LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
>
> Is this going to be handled in eBPF in some other way?
>
> Without that in place, and considering Jann Horn's paper, it would seem
> like PTI doesn't really lock it down fully, right?

Here is the latest (v3) bpf fix:

https://patchwork.ozlabs.org/patch/856645/

I currently have v2 on my 'nospec' branch and will move that to v3 for
the next update, unless it goes upstream before then.


Re: [PATCH 07/18] [media] uvcvideo: prevent bounds-check bypass via speculative execution

2018-01-08 Thread Dan Williams
On Mon, Jan 8, 2018 at 3:23 AM, Laurent Pinchart
 wrote:
> Hi Dan,
>
> Thank you for the patch.
>
> On Saturday, 6 January 2018 03:10:32 EET Dan Williams wrote:
>> Static analysis reports that 'index' may be a user controlled value that
>> is used as a data dependency to read 'pin' from the
>> 'selector->baSourceID' array. In order to avoid potential leaks of
>> kernel memory values, block speculative execution of the instruction
>> stream that could issue reads based on an invalid value of 'pin'.
>
> I won't repeat the arguments already made in the thread regarding having
> documented coverity rules for this, even if I agree with them.
>
>> Based on an original patch by Elena Reshetova.
>>
>> Cc: Laurent Pinchart 
>> Cc: Mauro Carvalho Chehab 
>> Cc: linux-media@vger.kernel.org
>> Signed-off-by: Elena Reshetova 
>> Signed-off-by: Dan Williams 
>> ---
>>  drivers/media/usb/uvc/uvc_v4l2.c |7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/media/usb/uvc/uvc_v4l2.c
>> b/drivers/media/usb/uvc/uvc_v4l2.c index 3e7e283a44a8..7442626dc20e 100644
>> --- a/drivers/media/usb/uvc/uvc_v4l2.c
>> +++ b/drivers/media/usb/uvc/uvc_v4l2.c
>> @@ -22,6 +22,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include 
>> @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void
>> *fh, struct uvc_entity *iterm = NULL;
>>   u32 index = input->index;
>>   int pin = 0;
>> + __u8 *elem;
>>
>>   if (selector == NULL ||
>>   (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
>> @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void
>> *fh, break;
>>   }
>>   pin = iterm->id;
>> - } else if (index < selector->bNrInPins) {
>> - pin = selector->baSourceID[index];
>> + } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
>> + selector->bNrInPins))) {
>> + pin = *elem;
>>   list_for_each_entry(iterm, &chain->entities, chain) {
>>   if (!UVC_ENTITY_IS_ITERM(iterm))
>>   continue;
>
> (adding a bit more context)
>
>>   if (iterm->id == pin)
>>   break;
>>   }
>>   }
>>
>>   if (iterm == NULL || iterm->id != pin)
>>   return -EINVAL;
>>
>>   memset(input, 0, sizeof(*input));
>>   input->index = index;
>>   strlcpy(input->name, iterm->name, sizeof(input->name));
>>   if (UVC_ENTITY_TYPE(iterm) == UVC_ITT_CAMERA)
>>   input->type = V4L2_INPUT_TYPE_CAMERA;
>
> So pin is used to search for an entry in the chain->entities list. Entries in
> that list are allocated separately through kmalloc and can thus end up in
> different cache lines, so I agree we have an issue. However, this is mitigated
> by the fact that typical UVC devices have a handful (sometimes up to a dozen)
> entities, so an attacker would only be able to read memory values that are
> equal to the entity IDs used by the device. Entity IDs can be freely allocated
> but typically count continuously from 0. It would take a specially-crafted UVC
> device to be able to read all memory.
>
> On the other hand, as this is nowhere close to being a fast path, I think we
> can close this potential hole as proposed in the patch. So,
>
> Reviewed-by: Laurent Pinchart 

Thanks Laurent!

> Will you merge the whole series in one go, or would you like me to take the
> patch in my tree ? In the latter case I'll wait until the nospec_array_ptr()
> gets merged in mainline.

I'll track it for now. Until the 'nospec_array_ptr()' discussion
resolves there won't be a stabilized commit-id for you to base a
branch.


Re: [PATCH 07/18] [media] uvcvideo: prevent bounds-check bypass via speculative execution

2018-01-07 Thread Dan Williams
On Sun, Jan 7, 2018 at 1:09 AM, Greg KH  wrote:
[..]
> Sorry for the confusion, no, I don't mean the "taint tracking", I mean
> the generic pattern of "speculative out of bounds access" that we are
> fixing here.
>
> Yes, as you mentioned before, there are tons of false-positives in the
> tree, as to find the real problems you have to show that userspace
> controls the access index.  But if we have a generic pattern that can
> rewrite that type of logic into one where it does not matter at all
> (i.e. like the ebpf proposed changes), then it would not be an issue if
> they are false or not, we just rewrite them all to be safe.
>
> We need to find some way not only to fix these issues now (like you are
> doing with this series), but to prevent them from every coming back into
> the codebase again.  It's that second part that we need to keep in the
> back of our minds here, while doing the first portion of this work.

I understand the goal, but I'm not sure any of our current annotation
mechanisms are suitable. We have:

__attribute__((noderef, address_space(x)))

...for the '__user' annotation and other pointers that must not be
de-referenced without a specific accessor. We also have:

__attribute__((bitwise))

...for values that should not be consumed directly without a specific
conversion like endian swapping.

The problem is that we need to see if a value derived from a userspace
controlled input is used to trigger a chain of dependent reads. As far
as I can see the annotation would need to be guided by taint analysis
to be useful, at which point we can just "annotate" the problem spot
with nospec_array_ptr(). Otherwise it seems the scope of a
"__nospec_array_index" annotation would have a low signal to noise
ratio.

Stopping speculation past a uacess_begin() boundary appears to handle
a wide swath of potential problems, and the rest likely needs taint
analysis, at least for now.

All that to say, yes, we need better tooling and infrastructure going forward.


Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-06 Thread Dan Williams
On Sat, Jan 6, 2018 at 11:37 AM, Dan Williams  wrote:
> On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams  wrote:
>> Quoting Mark's original RFC:
>>
>> "Recently, Google Project Zero discovered several classes of attack
>> against speculative execution. One of these, known as variant-1, allows
>> explicit bounds checks to be bypassed under speculation, providing an
>> arbitrary read gadget. Further details can be found on the GPZ blog [1]
>> and the Documentation patch in this series."
>>
>> This series incorporates Mark Rutland's latest api and adds the x86
>> specific implementation of nospec_barrier. The
>> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
>> wide analysis performed by Elena Reshetova to address static analysis
>> reports where speculative execution on a userspace controlled value
>> could bypass a bounds check. The patches address a precondition for the
>> attack discussed in the Spectre paper [2].
>>
>> A consideration worth noting for reviewing these patches is to weigh the
>> dramatic cost of being wrong about whether a given report is exploitable
>> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
>> lets make the bar for applying these patches be "can you prove that the
>> bounds check bypass is *not* exploitable". Consider that the Spectre
>> paper reports one example of a speculation window being ~180 cycles.
>>
>> Note that there is also a proposal from Linus, array_access [3], that
>> attempts to quash speculative execution past a bounds check without
>> introducing an lfence instruction. That may be a future optimization
>> possibility that is compatible with this api, but it would appear to
>> need guarantees from the compiler that it is not clear the kernel can
>> rely on at this point. It is also not clear that it would be a
>> significant performance win vs lfence.
>>
>> These patches also will also be available via the 'nospec' git branch
>> here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec
>
> It appears that git.kernel.org has not mirrored out the new branch. In
> the meantime here's an alternative location:
>
> https://github.com/djbw/linux.git nospec
>
> If there are updates to these patches they will appear in nospec-v2,
> nospec-v3, etc... branches.

For completeness I appended the bpf fix [1] to the git branch.

https://lwn.net/Articles/743288/


Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-06 Thread Dan Williams
On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams  wrote:
> Quoting Mark's original RFC:
>
> "Recently, Google Project Zero discovered several classes of attack
> against speculative execution. One of these, known as variant-1, allows
> explicit bounds checks to be bypassed under speculation, providing an
> arbitrary read gadget. Further details can be found on the GPZ blog [1]
> and the Documentation patch in this series."
>
> This series incorporates Mark Rutland's latest api and adds the x86
> specific implementation of nospec_barrier. The
> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
> wide analysis performed by Elena Reshetova to address static analysis
> reports where speculative execution on a userspace controlled value
> could bypass a bounds check. The patches address a precondition for the
> attack discussed in the Spectre paper [2].
>
> A consideration worth noting for reviewing these patches is to weigh the
> dramatic cost of being wrong about whether a given report is exploitable
> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
> lets make the bar for applying these patches be "can you prove that the
> bounds check bypass is *not* exploitable". Consider that the Spectre
> paper reports one example of a speculation window being ~180 cycles.
>
> Note that there is also a proposal from Linus, array_access [3], that
> attempts to quash speculative execution past a bounds check without
> introducing an lfence instruction. That may be a future optimization
> possibility that is compatible with this api, but it would appear to
> need guarantees from the compiler that it is not clear the kernel can
> rely on at this point. It is also not clear that it would be a
> significant performance win vs lfence.
>
> These patches also will also be available via the 'nospec' git branch
> here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec

It appears that git.kernel.org has not mirrored out the new branch. In
the meantime here's an alternative location:

https://github.com/djbw/linux.git nospec

If there are updates to these patches they will appear in nospec-v2,
nospec-v3, etc... branches.


Re: [PATCH 07/18] [media] uvcvideo: prevent bounds-check bypass via speculative execution

2018-01-06 Thread Dan Williams
On Sat, Jan 6, 2018 at 1:40 AM, Greg KH  wrote:
> On Sat, Jan 06, 2018 at 10:09:07AM +0100, Greg KH wrote:
>> On Fri, Jan 05, 2018 at 05:10:32PM -0800, Dan Williams wrote:
>> > Static analysis reports that 'index' may be a user controlled value that
>> > is used as a data dependency to read 'pin' from the
>> > 'selector->baSourceID' array. In order to avoid potential leaks of
>> > kernel memory values, block speculative execution of the instruction
>> > stream that could issue reads based on an invalid value of 'pin'.
>> >
>> > Based on an original patch by Elena Reshetova.
>> >
>> > Cc: Laurent Pinchart 
>> > Cc: Mauro Carvalho Chehab 
>> > Cc: linux-media@vger.kernel.org
>> > Signed-off-by: Elena Reshetova 
>> > Signed-off-by: Dan Williams 
>> > ---
>> >  drivers/media/usb/uvc/uvc_v4l2.c |7 +--
>> >  1 file changed, 5 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/media/usb/uvc/uvc_v4l2.c 
>> > b/drivers/media/usb/uvc/uvc_v4l2.c
>> > index 3e7e283a44a8..7442626dc20e 100644
>> > --- a/drivers/media/usb/uvc/uvc_v4l2.c
>> > +++ b/drivers/media/usb/uvc/uvc_v4l2.c
>> > @@ -22,6 +22,7 @@
>> >  #include 
>> >  #include 
>> >  #include 
>> > +#include 
>> >
>> >  #include 
>> >  #include 
>> > @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, 
>> > void *fh,
>> > struct uvc_entity *iterm = NULL;
>> > u32 index = input->index;
>> > int pin = 0;
>> > +   __u8 *elem;
>> >
>> > if (selector == NULL ||
>> > (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
>> > @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, 
>> > void *fh,
>> > break;
>> > }
>> > pin = iterm->id;
>> > -   } else if (index < selector->bNrInPins) {
>> > -   pin = selector->baSourceID[index];
>> > +   } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
>> > +   selector->bNrInPins))) {
>> > +   pin = *elem;
>>
>> I dug through this before, and I couldn't find where index came from
>> userspace, I think seeing the coverity rule would be nice.
>
> Ok, I take it back, this looks correct.  Ugh, the v4l ioctl api is
> crazy complex (rightfully so), it's amazing that coverity could navigate
> that whole thing :)
>
> While I'm all for fixing this type of thing, I feel like we need to do
> something "else" for this as playing whack-a-mole for this pattern is
> going to be a never-ending battle for all drivers for forever.  Either
> we need some way to mark this data path to make it easy for tools like
> sparse to flag easily, or we need to catch the issue in the driver
> subsystems, which unfortunatly, would harm the drivers that don't have
> this type of issue (like here.)
>
> I'm guessing that other operating systems, which don't have the luxury
> of auditing all of their drivers are going for the "big hammer in the
> subsystem" type of fix, right?
>
> I don't have a good answer for this, but if there was some better way to
> rewrite these types of patterns to just prevent the need for the
> nospec_array_ptr() type thing, that might be the best overall for
> everyone.  Much like ebpf did with their changes.  That way a simple
> coccinelle rule would be able to catch the pattern and rewrite it.
>
> Or am I just dreaming?

At least on the coccinelle front you're dreaming. Julia already took a
look and said:

"I don't think Coccinelle would be good for doing this (ie
implementing taint analysis) because the dataflow is too complicated."

Perhaps the Coverity instance Dave mentioned at Ksummit 2012 has a
role to play here?


Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams
On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman  wrote:
> Dan Williams  writes:
>
>> Quoting Mark's original RFC:
>>
>> "Recently, Google Project Zero discovered several classes of attack
>> against speculative execution. One of these, known as variant-1, allows
>> explicit bounds checks to be bypassed under speculation, providing an
>> arbitrary read gadget. Further details can be found on the GPZ blog [1]
>> and the Documentation patch in this series."
>>
>> This series incorporates Mark Rutland's latest api and adds the x86
>> specific implementation of nospec_barrier. The
>> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
>> wide analysis performed by Elena Reshetova to address static analysis
>> reports where speculative execution on a userspace controlled value
>> could bypass a bounds check. The patches address a precondition for the
>> attack discussed in the Spectre paper [2].
>
> Please expand this.
>
> It is not clear what the static analysis is looking for.  Have a clear
> description of what is being fixed is crucial for allowing any of these
> changes.
>
> For the details given in the change description what I read is magic
> changes because a magic process says this code is vunlerable.

Yes, that was my first reaction to the patches as well, I try below to
add some more background and guidance, but in the end these are static
analysis reports across a wide swath of sub-systems. It's going to
take some iteration with domain experts to improve the patch
descriptions, and that's the point of this series, to get the better
trained eyes from the actual sub-system owners to take a look at these
reports.

For example, I'm looking for feedback like what Srinivas gave where he
identified that the report is bogus, the branch condition can not be
seeded with bad values in that path. Be like Srinivas.

> Given the similarities in the code that is being patched to many other
> places in the kernel it is not at all clear that this small set of
> changes is sufficient for any purpose.

I find this assertion absurd, when in the past have we as kernel
developers ever been handed a static analysis report and then
questioned why the static analysis did not flag other call sites
before first reviewing the ones it did find?

>> A consideration worth noting for reviewing these patches is to weigh the
>> dramatic cost of being wrong about whether a given report is exploitable
>> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
>> lets make the bar for applying these patches be "can you prove that the
>> bounds check bypass is *not* exploitable". Consider that the Spectre
>> paper reports one example of a speculation window being ~180 cycles.
>
>
>> Note that there is also a proposal from Linus, array_access [3], that
>> attempts to quash speculative execution past a bounds check without
>> introducing an lfence instruction. That may be a future optimization
>> possibility that is compatible with this api, but it would appear to
>> need guarantees from the compiler that it is not clear the kernel can
>> rely on at this point. It is also not clear that it would be a
>> significant performance win vs lfence.
>
> It is also not clear that these changes fix anything, or are in any
> sense correct for the problem they are trying to fix as the problem
> is not clearly described.

I'll try my best. I don't have first hand knowledge of how the static
analyzer is doing this job, and I don't think it matters for
evaluating these reports. I'll give you my thoughts on how I would
handle one of these reports if it flagged one of the sub-systems I
maintain.

Start with the example from the Spectre paper:

if (x < array1_size)
y = array2[array1[x] * 256];

In all the patches 'x' and 'array1' are called out explicitly. For example:

net: mpls: prevent bounds-check bypass via speculative execution

Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency reading 'rt' from the 'platform_label'
array...

So the first thing to review is whether the analyzer got it wrong and
'x' is not arbitrarily controllable by userspace to cause speculation
outside of the checked bounds. Be like Srinivas. The next step is to
ask whether the code can be refactored so that 'x' is sanitized
earlier in the call stack, especially if the nospec_array_ptr() lands
in a hot path. The next aspect that I expect most would be tempted to
go check is whether 'array2[array1[x]]' occurs later in the code
stream, but with speculation windows being architecture dependent and
p

[PATCH 07/18] [media] uvcvideo: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams
Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency to read 'pin' from the
'selector->baSourceID' array. In order to avoid potential leaks of
kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'pin'.

Based on an original patch by Elena Reshetova.

Cc: Laurent Pinchart 
Cc: Mauro Carvalho Chehab 
Cc: linux-media@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/media/usb/uvc/uvc_v4l2.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
index 3e7e283a44a8..7442626dc20e 100644
--- a/drivers/media/usb/uvc/uvc_v4l2.c
+++ b/drivers/media/usb/uvc/uvc_v4l2.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
struct uvc_entity *iterm = NULL;
u32 index = input->index;
int pin = 0;
+   __u8 *elem;
 
if (selector == NULL ||
(chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
@@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
break;
}
pin = iterm->id;
-   } else if (index < selector->bNrInPins) {
-   pin = selector->baSourceID[index];
+   } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
+   selector->bNrInPins))) {
+   pin = *elem;
list_for_each_entry(iterm, &chain->entities, chain) {
if (!UVC_ENTITY_IS_ITERM(iterm))
continue;



[PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams
Quoting Mark's original RFC:

"Recently, Google Project Zero discovered several classes of attack
against speculative execution. One of these, known as variant-1, allows
explicit bounds checks to be bypassed under speculation, providing an
arbitrary read gadget. Further details can be found on the GPZ blog [1]
and the Documentation patch in this series."

This series incorporates Mark Rutland's latest api and adds the x86
specific implementation of nospec_barrier. The
nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
wide analysis performed by Elena Reshetova to address static analysis
reports where speculative execution on a userspace controlled value
could bypass a bounds check. The patches address a precondition for the
attack discussed in the Spectre paper [2].

A consideration worth noting for reviewing these patches is to weigh the
dramatic cost of being wrong about whether a given report is exploitable
vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
lets make the bar for applying these patches be "can you prove that the
bounds check bypass is *not* exploitable". Consider that the Spectre
paper reports one example of a speculation window being ~180 cycles.

Note that there is also a proposal from Linus, array_access [3], that
attempts to quash speculative execution past a bounds check without
introducing an lfence instruction. That may be a future optimization
possibility that is compatible with this api, but it would appear to
need guarantees from the compiler that it is not clear the kernel can
rely on at this point. It is also not clear that it would be a
significant performance win vs lfence.

These patches also will also be available via the 'nospec' git branch
here:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec

[1]: 
https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
[2]: https://spectreattack.com/spectre.pdf
[3]: https://marc.info/?l=linux-kernel&m=151510446027625&w=2

---

Andi Kleen (1):
  x86, barrier: stop speculation for failed access_ok

Dan Williams (13):
  x86: implement nospec_barrier()
  [media] uvcvideo: prevent bounds-check bypass via speculative execution
  carl9170: prevent bounds-check bypass via speculative execution
  p54: prevent bounds-check bypass via speculative execution
  qla2xxx: prevent bounds-check bypass via speculative execution
  cw1200: prevent bounds-check bypass via speculative execution
  Thermal/int340x: prevent bounds-check bypass via speculative execution
  ipv6: prevent bounds-check bypass via speculative execution
  ipv4: prevent bounds-check bypass via speculative execution
  vfs, fdtable: prevent bounds-check bypass via speculative execution
  net: mpls: prevent bounds-check bypass via speculative execution
  udf: prevent bounds-check bypass via speculative execution
  userns: prevent bounds-check bypass via speculative execution

Mark Rutland (4):
  asm-generic/barrier: add generic nospec helpers
  Documentation: document nospec helpers
  arm64: implement nospec_ptr()
  arm: implement nospec_ptr()

 Documentation/speculation.txt  |  166 
 arch/arm/include/asm/barrier.h |   75 +
 arch/arm64/include/asm/barrier.h   |   55 +++
 arch/x86/include/asm/barrier.h |6 +
 arch/x86/include/asm/uaccess.h |   17 ++
 drivers/media/usb/uvc/uvc_v4l2.c   |7 +
 drivers/net/wireless/ath/carl9170/main.c   |6 -
 drivers/net/wireless/intersil/p54/main.c   |8 +
 drivers/net/wireless/st/cw1200/sta.c   |   10 +
 drivers/net/wireless/st/cw1200/wsm.h   |4 
 drivers/scsi/qla2xxx/qla_mr.c  |   15 +-
 .../thermal/int340x_thermal/int340x_thermal_zone.c |   14 +-
 fs/udf/misc.c  |   39 +++--
 include/asm-generic/barrier.h  |   68 
 include/linux/fdtable.h|5 -
 kernel/user_namespace.c|   10 -
 net/ipv4/raw.c |9 +
 net/ipv6/raw.c |9 +
 net/mpls/af_mpls.c |   12 +
 19 files changed, 466 insertions(+), 69 deletions(-)
 create mode 100644 Documentation/speculation.txt


[PATCH v3 0/4] introduce get_user_pages_longterm()

2017-11-29 Thread Dan Williams
Changes since v2 [1]:
* Add a comment for the vma_is_fsdax() check in get_vaddr_frames() (Jan)
* Collect Jan's Reviewed-by.
* Rebased on v4.15-rc1

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-November/013295.html

The summary text below is unchanged from v2.

---

Andrew,

Here is a new get_user_pages api for cases where a driver intends to
keep an elevated page count indefinitely. This is distinct from usages
like iov_iter_get_pages where the elevated page counts are transient.
The iov_iter_get_pages cases immediately turn around and submit the
pages to a device driver which will put_page when the i/o operation
completes (under kernel control).

In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future. This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime
of the block / page and needs reasonable limits on how long it can wait
for pages in a mapping to become idle.

Fixing filesystems to actually wait for dax pages to be idle before
blocks from a truncate/hole-punch operation are repurposed is saved for
a later patch series.

Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.

I have also tagged these for -stable to purposely break cases that might
assume that longterm memory registrations for filesystem-dax mappings
were supported by the kernel. The behavior regression this policy change
implies is one of the reasons we maintain the "dax enabled. Warning:
EXPERIMENTAL, use at your own risk" notification when mounting a
filesystem in dax mode.

It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.

---

Dan Williams (4):
  mm: introduce get_user_pages_longterm
  mm: fail get_vaddr_frames() for filesystem-dax mappings
  [media] v4l2: disable filesystem-dax mapping support
  IB/core: disable memory registration of fileystem-dax vmas


 drivers/infiniband/core/umem.c|2 -
 drivers/media/v4l2-core/videobuf-dma-sg.c |5 +-
 include/linux/fs.h|   14 ++
 include/linux/mm.h|   13 ++
 mm/frame_vector.c |   12 +
 mm/gup.c  |   64 +
 6 files changed, 107 insertions(+), 3 deletions(-)


[PATCH v3 2/4] mm: fail get_vaddr_frames() for filesystem-dax mappings

2017-11-29 Thread Dan Williams
Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow V4L2, Exynos, and other frame vector users to create
long standing / irrevocable memory registrations against filesytem-dax
vmas.

Cc: Inki Dae 
Cc: Seung-Woo Kim 
Cc: Joonyoung Shim 
Cc: Kyungmin Park 
Cc: Mauro Carvalho Chehab 
Cc: linux-media@vger.kernel.org
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: Andrew Morton 
Cc: 
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Reviewed-by: Jan Kara 
Signed-off-by: Dan Williams 
---
 mm/frame_vector.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 2f98df0d460e..297c7238f7d4 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -53,6 +53,18 @@ int get_vaddr_frames(unsigned long start, unsigned int 
nr_frames,
ret = -EFAULT;
goto out;
}
+
+   /*
+* While get_vaddr_frames() could be used for transient (kernel
+* controlled lifetime) pinning of memory pages all current
+* users establish long term (userspace controlled lifetime)
+* page pinning. Treat get_vaddr_frames() like
+* get_user_pages_longterm() and disallow it for filesystem-dax
+* mappings.
+*/
+   if (vma_is_fsdax(vma))
+   return -EOPNOTSUPP;
+
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
vec->got_ref = true;
vec->is_pfns = false;



[PATCH v3 3/4] [media] v4l2: disable filesystem-dax mapping support

2017-11-29 Thread Dan Williams
V4L2 memory registrations are incompatible with filesystem-dax that
needs the ability to revoke dma access to a mapping at will, or
otherwise allow the kernel to wait for completion of DMA. The
filesystem-dax implementation breaks the traditional solution of
truncate of active file backed mappings since there is no page-cache
page we can orphan to sustain ongoing DMA.

If v4l2 wants to support long lived DMA mappings it needs to arrange to
hold a file lease or use some other mechanism so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.

Reported-by: Jan Kara 
Cc: Mauro Carvalho Chehab 
Cc: linux-media@vger.kernel.org
Cc: 
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Reviewed-by: Jan Kara 
Signed-off-by: Dan Williams 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 0b5c43f7e020..f412429cf5ba 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked(struct 
videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+   err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
 flags, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages);
+   dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
+   dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
return 0;



Re: [PATCH v2 2/4] mm: fail get_vaddr_frames() for filesystem-dax mappings

2017-11-27 Thread Dan Williams
On Mon, Nov 27, 2017 at 8:15 AM, Jan Kara  wrote:
> On Tue 14-11-17 11:56:39, Dan Williams wrote:
>> Until there is a solution to the dma-to-dax vs truncate problem it is
>> not safe to allow V4L2, Exynos, and other frame vector users to create
>> long standing / irrevocable memory registrations against filesytem-dax
>> vmas.
>>
>> Cc: Inki Dae 
>> Cc: Seung-Woo Kim 
>> Cc: Joonyoung Shim 
>> Cc: Kyungmin Park 
>> Cc: Mauro Carvalho Chehab 
>> Cc: linux-media@vger.kernel.org
>> Cc: Jan Kara 
>> Cc: Mel Gorman 
>> Cc: Vlastimil Babka 
>> Cc: Andrew Morton 
>> Cc: 
>> Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
>> Signed-off-by: Dan Williams 
>
> Makes sense. I'd just note that in principle get_vaddr_frames() is no more
> long-term than get_user_pages(). It is just so that all the users of
> get_vaddr_frames() currently want a long-term reference. Maybe could you
> add here also a comment that the vma_is_fsdax() check is there because all
> users of this function want a long term page reference? With that you can
> add:

Ok, will do.

> Reviewed-by: Jan Kara 

Thanks.


[PATCH v2 2/4] mm: fail get_vaddr_frames() for filesystem-dax mappings

2017-11-14 Thread Dan Williams
Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow V4L2, Exynos, and other frame vector users to create
long standing / irrevocable memory registrations against filesytem-dax
vmas.

Cc: Inki Dae 
Cc: Seung-Woo Kim 
Cc: Joonyoung Shim 
Cc: Kyungmin Park 
Cc: Mauro Carvalho Chehab 
Cc: linux-media@vger.kernel.org
Cc: Jan Kara 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: Andrew Morton 
Cc: 
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams 
---
 mm/frame_vector.c |4 
 1 file changed, 4 insertions(+)

diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 72ebec18629c..d2fdbeaadc8b 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -52,6 +52,10 @@ int get_vaddr_frames(unsigned long start, unsigned int 
nr_frames,
ret = -EFAULT;
goto out;
}
+
+   if (vma_is_fsdax(vma))
+   return -EOPNOTSUPP;
+
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
vec->got_ref = true;
vec->is_pfns = false;



[PATCH v2 3/4] [media] v4l2: disable filesystem-dax mapping support

2017-11-14 Thread Dan Williams
V4L2 memory registrations are incompatible with filesystem-dax that
needs the ability to revoke dma access to a mapping at will, or
otherwise allow the kernel to wait for completion of DMA. The
filesystem-dax implementation breaks the traditional solution of
truncate of active file backed mappings since there is no page-cache
page we can orphan to sustain ongoing DMA.

If v4l2 wants to support long lived DMA mappings it needs to arrange to
hold a file lease or use some other mechanism so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.

Reported-by: Jan Kara 
Cc: Mauro Carvalho Chehab 
Cc: linux-media@vger.kernel.org
Cc: 
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 0b5c43f7e020..f412429cf5ba 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked(struct 
videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+   err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
 flags, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages);
+   dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
+   dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
return 0;



[PATCH v2 0/4] introduce get_user_pages_longterm()

2017-11-14 Thread Dan Williams
Changes since v1 [1]:
* Cleanup local 'vmas' argument (Christoph)
* Replace inline IS_ENABLED(CONFIG_FS_DAX) in C code with ifdef
  versions of get_user_pages_longterm() for the FS_DAX on/off cases
  (Christoph)
* Add a new patch for the get_vaddr_frames() case, this impacts users
  like V4L2, and the Exynos driver.
* Collect Christoph's reviewed-by for the rdma change

[1]: https://lwn.net/Articles/738323/

---

Andrew,

Here is a new get_user_pages api for cases where a driver intends to
keep an elevated page count indefinitely. This is distinct from usages
like iov_iter_get_pages where the elevated page counts are transient.
The iov_iter_get_pages cases immediately turn around and submit the
pages to a device driver which will put_page when the i/o operation
completes (under kernel control).

In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future. This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime
of the block / page and needs reasonable limits on how long it can wait
for pages in a mapping to become idle.

Fixing filesystems to actually wait for dax pages to be idle before
blocks from a truncate/hole-punch operation are repurposed is saved for
a later patch series.

Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.

I have also tagged these for -stable to purposely break cases that might
assume that longterm memory registrations for filesystem-dax mappings
were supported by the kernel. The behavior regression this policy change
implies is one of the reasons we maintain the "dax enabled. Warning:
EXPERIMENTAL, use at your own risk" notification when mounting a
filesystem in dax mode.

It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.

---

Dan Williams (4):
  mm: introduce get_user_pages_longterm
  mm: fail get_vaddr_frames() for filesystem-dax mappings
  [media] v4l2: disable filesystem-dax mapping support
  IB/core: disable memory registration of fileystem-dax vmas


 drivers/infiniband/core/umem.c|2 -
 drivers/media/v4l2-core/videobuf-dma-sg.c |5 +-
 include/linux/fs.h|   14 ++
 include/linux/mm.h|   13 ++
 mm/frame_vector.c |4 ++
 mm/gup.c  |   64 +
 6 files changed, 99 insertions(+), 3 deletions(-)


Re: [PATCH 3/3] [media] v4l2: disable filesystem-dax mapping support

2017-11-07 Thread Dan Williams
On Tue, Nov 7, 2017 at 12:39 PM, Mauro Carvalho Chehab
 wrote:
> Em Tue, 7 Nov 2017 09:43:41 -0800
> Dan Williams  escreveu:
>
>> On Tue, Nov 7, 2017 at 12:33 AM, Mauro Carvalho Chehab
>>  wrote:
>> > Em Mon, 06 Nov 2017 16:57:28 -0800
>> > Dan Williams  escreveu:
>> >
>> >> V4L2 memory registrations are incompatible with filesystem-dax that
>> >> needs the ability to revoke dma access to a mapping at will, or
>> >> otherwise allow the kernel to wait for completion of DMA. The
>> >> filesystem-dax implementation breaks the traditional solution of
>> >> truncate of active file backed mappings since there is no page-cache
>> >> page we can orphan to sustain ongoing DMA.
>> >>
>> >> If v4l2 wants to support long lived DMA mappings it needs to arrange to
>> >> hold a file lease or use some other mechanism so that the kernel can
>> >> coordinate revoking DMA access when the filesystem needs to truncate
>> >> mappings.
>> >
>> >
>> > Not sure if I understand this your comment here... what happens
>> > if FS_DAX is enabled? The new err = get_user_pages_longterm()
>> > would cause DMA allocation to fail?
>>
>> Correct, any attempt to specify a filesystem-dax mapping range to
>> get_user_pages_longterm will fail with EOPNOTSUPP. In the future we
>> want to add something like a 'struct file_lock *' argument to
>> get_user_pages_longterm so that the kernel has a handle to revoke
>> access to the returned pages. Once we have a safe way for the kernel
>> to undo elevated page counts we can stop failing the longterm vs
>> filesystem-dax case.
>
> Argh! Perhaps we should make it depend on BROKEN while not fixed :-/

Small consolation, but we do warn that filesystem-dax is still
considered experimental when mounting a filesystem with "-o dax"

>> Here is more background on why _longterm gup is a problem for filesystem-dax:
>>
>> https://lwn.net/Articles/737273/
>>
>> > If so, that doesn't sound
>> > right. Instead, mm should somehow mark this mapping to be out
>> > of FS_DAX control range.
>>
>> DAX is currently global setting for the entire backing device of the
>> filesystem, so any mapping of any file when the "-o dax" mount option
>> is set is in the "FS_DAX control range". In other words there's
>> currently no way to prevent FS_DAX mappings from being exposed to V4L2
>> outside of CONFIG_FS_DAX=n.
>
> Grrr...
>
>> > Also, it is not only videobuf-dma-sg.c that does long lived
>> > DMA mappings. VB2 also does that (and videobuf-vmalloc).
>>
>> Without finding the code videobuf-vmalloc sounds like it should be ok
>> if the kernel is allocating memory separate from a file-backed DAX
>> mapping.
>
> videobuf-vmalloc do DMA mapping for pages allocated via vmalloc(),
> via vmalloc_user()/remap_vmalloc_range().

Ok, that's completely safe since filesystem-dax mappings are not
involved in a vmalloc backed virtual address range.

> There aren't much drivers using VB1 anymore, but a change at VB2
> will likely break support for almost all webcams if fs DAX is
> in usage.

Yes, unless / until we can switch userspace to using a new memory
registration api that includes a way for the kernel to revoke access
to a dax mapping. Another mitigation is following through on support
for moving dax support from a global mount flag to a per-inode flag to
at least prevent dax from leaking to use cases that need explicit
coordination.

>> Where is the VB2 get_user_pages call?
>
> Before changeset 3336c24f25ec, the logic for get_user_pages() were
> at drivers/media/v4l2-core/videobuf2-dma-sg.c. Now, the logic
> it uses is inside mm/frame_vector.c.

Ok, I'll take a look.


Re: [PATCH 3/3] [media] v4l2: disable filesystem-dax mapping support

2017-11-07 Thread Dan Williams
On Tue, Nov 7, 2017 at 12:33 AM, Mauro Carvalho Chehab
 wrote:
> Em Mon, 06 Nov 2017 16:57:28 -0800
> Dan Williams  escreveu:
>
>> V4L2 memory registrations are incompatible with filesystem-dax that
>> needs the ability to revoke dma access to a mapping at will, or
>> otherwise allow the kernel to wait for completion of DMA. The
>> filesystem-dax implementation breaks the traditional solution of
>> truncate of active file backed mappings since there is no page-cache
>> page we can orphan to sustain ongoing DMA.
>>
>> If v4l2 wants to support long lived DMA mappings it needs to arrange to
>> hold a file lease or use some other mechanism so that the kernel can
>> coordinate revoking DMA access when the filesystem needs to truncate
>> mappings.
>
>
> Not sure if I understand this your comment here... what happens
> if FS_DAX is enabled? The new err = get_user_pages_longterm()
> would cause DMA allocation to fail?

Correct, any attempt to specify a filesystem-dax mapping range to
get_user_pages_longterm will fail with EOPNOTSUPP. In the future we
want to add something like a 'struct file_lock *' argument to
get_user_pages_longterm so that the kernel has a handle to revoke
access to the returned pages. Once we have a safe way for the kernel
to undo elevated page counts we can stop failing the longterm vs
filesystem-dax case.

Here is more background on why _longterm gup is a problem for filesystem-dax:

https://lwn.net/Articles/737273/

> If so, that doesn't sound
> right. Instead, mm should somehow mark this mapping to be out
> of FS_DAX control range.

DAX is currently global setting for the entire backing device of the
filesystem, so any mapping of any file when the "-o dax" mount option
is set is in the "FS_DAX control range". In other words there's
currently no way to prevent FS_DAX mappings from being exposed to V4L2
outside of CONFIG_FS_DAX=n.

> Also, it is not only videobuf-dma-sg.c that does long lived
> DMA mappings. VB2 also does that (and videobuf-vmalloc).

Without finding the code videobuf-vmalloc sounds like it should be ok
if the kernel is allocating memory separate from a file-backed DAX
mapping. Where is the VB2 get_user_pages call?


[PATCH 0/3] introduce get_user_pages_longterm()

2017-11-06 Thread Dan Williams
Andrew,

Here is a new get_user_pages api for cases where a driver intends to
keep an elevated page count indefinitely. This is distinct from usages
like iov_iter_get_pages where the elevated page counts are transient.
The iov_iter_get_pages cases immediately turn around and submit the
pages to a device driver which will put_page when the i/o operation
completes (under kernel control).

In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future. This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime
of the block / page and needs reasonable limits on how long it can wait
for pages in a mapping to become idle.

Fixing filesystems to actually wait for dax pages to be idle before
blocks from a truncate/hole-punch operation are repurposed is saved for
a later patch series.

Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.

I have also tagged these for -stable to purposely break cases that might
assume that longterm memory registrations for filesystem-dax mappings
were supported by the kernel. The behavior regression this policy change
implies is one of the reasons we maintain the "dax enabled. Warning:
EXPERIMENTAL, use at your own risk" notification when mounting a
filesystem in dax mode.

It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.

---

Dan Williams (3):
  mm: introduce get_user_pages_longterm
  IB/core: disable memory registration of fileystem-dax vmas
  [media] v4l2: disable filesystem-dax mapping support


 drivers/infiniband/core/umem.c|2 -
 drivers/media/v4l2-core/videobuf-dma-sg.c |5 +-
 include/linux/mm.h|3 +
 mm/gup.c  |   75 +
 4 files changed, 82 insertions(+), 3 deletions(-)


[PATCH 3/3] [media] v4l2: disable filesystem-dax mapping support

2017-11-06 Thread Dan Williams
V4L2 memory registrations are incompatible with filesystem-dax that
needs the ability to revoke dma access to a mapping at will, or
otherwise allow the kernel to wait for completion of DMA. The
filesystem-dax implementation breaks the traditional solution of
truncate of active file backed mappings since there is no page-cache
page we can orphan to sustain ongoing DMA.

If v4l2 wants to support long lived DMA mappings it needs to arrange to
hold a file lease or use some other mechanism so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.

Reported-by: Jan Kara 
Cc: Mauro Carvalho Chehab 
Cc: linux-media@vger.kernel.org
Cc: 
Signed-off-by: Dan Williams 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 0b5c43f7e020..f412429cf5ba 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked(struct 
videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+   err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
 flags, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages);
+   dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
+   dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
return 0;



[PATCH 00/15] dax: prep work for fixing dax-dma vs truncate collisions

2017-10-31 Thread Dan Williams
This is hopefully the uncontroversial lead-in set of changes that lay
the groundwork for solving the dax-dma vs truncate problem. The overview
of the changes is:

1/ Disable DAX when we do not have struct page entries backing dax
   mappings, or otherwise allow limited DAX support for axonram and
   dcssblk. Is anyone actually using the DAX capability of axonram
   dcssblk?

2/ Disable code paths that establish potentially long lived DMA
   access to a filesystem-dax memory mapping, i.e. RDMA and V4L2. In the
   4.16 timeframe the plan is to introduce a "register memory for DMA
   with a lease" mechanism for userspace to establish mappings but also
   be responsible for tearing down the mapping when the kernel needs to
   invalidate the mapping due to truncate or hole-punch.

3/ Add a wakeup mechanism for awaiting for DAX pages to be released
   from DMA access.

This overall effort started when Christoph noted during the review of
the MAP_DIRECT proposal:

get_user_pages on DAX doesn't give the same guarantees as on
pagecache or anonymous memory, and that is the problem we need to
fix. In fact I'm pretty sure if we try hard enough (and we might
have to try very hard) we can see the same problem with plain direct
I/O and without any RDMA involved, e.g. do a larger direct I/O write
to memory that is mmap()ed from a DAX file, then truncate the DAX
file and reallocate the blocks, and we might corrupt that new file.
We'll probably need a special setup where there is little other
chance but to reallocate those used blocks.

So what we need to do first is to fix get_user_pages vs unmapping
DAX mmap()ed blocks, be that from a hole punch, truncate, COW
operation, etc.

Included in the changes is a nfit_test mechanism to trivially trigger
this collision by delaying the put_page() that the block layer performs
after performing direct-I/O to a filesystem-DAX page.

Given the ongoing coordination of this set across multiple sub-systems
and the dax core my proposal is to manage this as a branch in the nvdimm
tree with acks from mm, rdma, v4l2, ext4, and xfs.

---

Dan Williams (15):
  dax: quiet bdev_dax_supported()
  mm, dax: introduce pfn_t_special()
  dax: require 'struct page' by default for filesystem dax
  brd: remove dax support
  dax: stop using VM_MIXEDMAP for dax
  dax: stop using VM_HUGEPAGE for dax
  dax: stop requiring a live device for dax_flush()
  dax: store pfns in the radix
  tools/testing/nvdimm: add 'bio_delay' mechanism
  IB/core: disable memory registration of fileystem-dax vmas
  [media] v4l2: disable filesystem-dax mapping support
  mm, dax: enable filesystems to trigger page-idle callbacks
  mm, devmap: introduce CONFIG_DEVMAP_MANAGED_PAGES
  dax: associate mappings with inodes, and warn if dma collides with 
truncate
  wait_bit: introduce {wait_on,wake_up}_devmap_idle


 arch/powerpc/platforms/Kconfig|1 
 arch/powerpc/sysdev/axonram.c |3 -
 drivers/block/Kconfig |   12 ---
 drivers/block/brd.c   |   65 --
 drivers/dax/device.c  |1 
 drivers/dax/super.c   |  113 +
 drivers/infiniband/core/umem.c|   49 ---
 drivers/media/v4l2-core/videobuf-dma-sg.c |   39 -
 drivers/nvdimm/pmem.c |   13 +++
 drivers/s390/block/Kconfig|1 
 drivers/s390/block/dcssblk.c  |4 +
 fs/Kconfig|8 ++
 fs/dax.c  |  131 +++--
 fs/ext2/file.c|1 
 fs/ext2/super.c   |6 +
 fs/ext4/file.c|1 
 fs/ext4/super.c   |6 +
 fs/xfs/xfs_file.c |2 
 fs/xfs/xfs_super.c|   20 ++--
 include/linux/dax.h   |   17 ++--
 include/linux/memremap.h  |   24 +
 include/linux/mm.h|   47 ++
 include/linux/mm_types.h  |   20 +++-
 include/linux/pfn_t.h |   13 +++
 include/linux/vma.h   |   33 +++
 include/linux/wait_bit.h  |   10 ++
 kernel/memremap.c |   36 ++--
 kernel/sched/wait_bit.c   |   64 --
 mm/Kconfig|5 +
 mm/hmm.c  |   13 ---
 mm/huge_memory.c  |8 +-
 mm/ksm.c  |3 +
 mm/madvise.c  |2 
 mm/memory.c   |   22 -
 mm/migrate.c  |3 -
 mm/mlock.c  

[PATCH 11/15] [media] v4l2: disable filesystem-dax mapping support

2017-10-31 Thread Dan Williams
V4L2 memory registrations are incompatible with filesystem-dax that
needs the ability to revoke dma access to a mapping at will, or
otherwise allow the kernel to wait for completion of DMA. The
filesystem-dax implementation breaks the traditional solution of
truncate of active file backed mappings since there is no page-cache
page we can orphan to sustain ongoing DMA.

If v4l2 wants to support long lived DMA mappings it needs to arrange to
hold a file lease or use some other mechanism so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.

Reported-by: Jan Kara 
Cc: Mauro Carvalho Chehab 
Cc: linux-media@vger.kernel.org
Cc: 
Signed-off-by: Dan Williams 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c |   39 -
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 0b5c43f7e020..37a4ae61b2c0 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -155,8 +155,9 @@ static int videobuf_dma_init_user_locked(struct 
videobuf_dmabuf *dma,
int direction, unsigned long data, unsigned long size)
 {
unsigned long first, last;
-   int err, rw = 0;
+   int err, rw = 0, i, nr_pages;
unsigned int flags = FOLL_FORCE;
+   struct vm_area_struct **vmas = NULL;
 
dma->direction = direction;
switch (dma->direction) {
@@ -179,6 +180,16 @@ static int videobuf_dma_init_user_locked(struct 
videobuf_dmabuf *dma,
if (NULL == dma->pages)
return -ENOMEM;
 
+   if (IS_ENABLED(CONFIG_FS_DAX)) {
+   vmas = kmalloc(dma->nr_pages * sizeof(struct vm_area_struct *),
+   GFP_KERNEL);
+   if (NULL == vmas) {
+   kfree(dma->pages);
+   dma->pages = NULL;
+   return -ENOMEM;
+   }
+   }
+
if (rw == READ)
flags |= FOLL_WRITE;
 
@@ -186,7 +197,31 @@ static int videobuf_dma_init_user_locked(struct 
videobuf_dmabuf *dma,
data, size, dma->nr_pages);
 
err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
-flags, dma->pages, NULL);
+flags, dma->pages, vmas);
+   nr_pages = err;
+
+   for (i = 0; vmas && i < nr_pages; i++) {
+   struct vm_area_struct *vma = vmas[i];
+   struct inode *inode;
+
+   if (!vma_is_dax(vma))
+   continue;
+
+   /* device-dax is safe for long-lived v4l2 mappings... */
+   inode = file_inode(vma->vm_file);
+   if (inode->i_mode == S_IFCHR)
+   continue;
+
+   /* ...filesystem-dax is not. */
+   err = -EOPNOTSUPP;
+   break;
+
+   /*
+* FIXME: add a 'with lease' mechanism for v4l2 to
+* obtain time bounded access to filesytem-dax mappings
+*/
+   }
+   kfree(vmas);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;



[RFC PATCH v8 0/2] mmap: safely enable support for new flags

2017-09-08 Thread Dan Williams
Changes since v7 [1]:
* rebase on the mid-merge-window state of the tree to pick up new mmap
  implementations.

* expand the mmap operation handler conversion beyond 'struct
  file_operations' to include, 'struct etnaviv_gem_ops', 'struct
  dma_buf_ops', 'struct drm_driver', 'struct fb_ops', and 'struct
  v4l2_file_operations'

* pass 'map_flags' through to all sub-handlers (Christoph)

* rework the mmap flag validation mechanism to the MAP_SHARED_VALIDATE
  scheme (Linus)

[1]: https://lwn.net/Articles/732886/

---

In order to safely add new mmap flags we want all mmap implementations
to both opt-in to the new flags they support and have the ability to
reject flags on a per-mmap-call basis. An alternative to all the churn
in patch1 is to add a new ->map_flags attribute to 'struct
vma_area_struct', but that bloats the runtime state everywhere for the
few mmap implementations that will care about new flags. Of course, this
also assumes that there are no general objections to the plans to
eventually add MAP_SYNC and/or MAP_DIRECT for DAX mappings [2].

The current request is to merge the final version of patch1 next week,
right before -rc1, i.e. before new ->mmap() handlers start landing in
-next. Given that the drm, media, and sound pull requests are already
merged only some small tweaks are expected from here on out. Patch2 is
included for review, but it can wait and go in with the new MAP_ flags.

Please holler if anything does not look right.

[2]: "Two more approaches to persistent-memory writes"
 https://lwn.net/Articles/731706/

---

Dan Williams (2):
  vfs: add flags parameter to all ->mmap() handlers
  mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap 
flags


 arch/alpha/include/uapi/asm/mman.h |1 
 arch/arc/kernel/arc_hostlink.c |3 +
 arch/mips/include/uapi/asm/mman.h  |1 
 arch/mips/kernel/vdso.c|2 -
 arch/parisc/include/uapi/asm/mman.h|1 
 arch/powerpc/kernel/proc_powerpc.c |3 +
 arch/powerpc/kvm/book3s_64_vio.c   |3 +
 arch/powerpc/platforms/cell/spufs/file.c   |   21 ++
 arch/powerpc/platforms/powernv/memtrace.c  |3 +
 arch/powerpc/platforms/powernv/opal-prd.c  |3 +
 arch/tile/mm/elf.c |3 +
 arch/um/drivers/mmapper_kern.c |3 +
 arch/xtensa/include/uapi/asm/mman.h|1 
 drivers/android/binder.c   |3 +
 drivers/auxdisplay/cfag12864bfb.c  |3 +
 drivers/auxdisplay/ht16k33.c   |3 +
 drivers/char/agp/frontend.c|3 +
 drivers/char/bsr.c |3 +
 drivers/char/hpet.c|6 ++-
 drivers/char/mbcs.c|3 +
 drivers/char/mbcs.h|3 +
 drivers/char/mem.c |   11 +++--
 drivers/char/mspec.c   |9 +++-
 drivers/char/uv_mmtimer.c  |6 ++-
 drivers/dax/device.c   |3 +
 drivers/dma-buf/dma-buf.c  |   11 +++--
 drivers/firewire/core-cdev.c   |3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h|3 +
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |5 +-
 drivers/gpu/drm/armada/armada_gem.c|3 +
 drivers/gpu/drm/ast/ast_drv.h  |3 +
 drivers/gpu/drm/ast/ast_ttm.c  |3 +
 drivers/gpu/drm/bochs/bochs.h  |3 +
 drivers/gpu/drm/bochs/bochs_fbdev.c|2 -
 drivers/gpu/drm/bochs/bochs_mm.c   |3 +
 drivers/gpu/drm/cirrus/cirrus_drv.h|3 +
 drivers/gpu/drm/cirrus/cirrus_ttm.c|3 +
 drivers/gpu/drm/drm_fb_cma_helper.c|8 ++--
 drivers/gpu/drm/drm_gem.c  |3 +
 drivers/gpu/drm/drm_gem_cma_helper.c   |8 ++--
 drivers/gpu/drm/drm_prime.c|5 +-
 drivers/gpu/drm/drm_vm.c   |3 +
 drivers/gpu/drm/etnaviv/etnaviv_drv.h  |6 ++-
 drivers/gpu/drm/etnaviv/etnaviv_gem.c  |   11 +++--
 drivers/gpu/drm/etnaviv/etnaviv_gem.h  |3 +
 drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c|9 ++--
 drivers/gpu/drm/exynos/exynos_drm_fbdev.c  |2 -
 drivers/gpu/drm/exynos/exynos_drm_gem.c|   10 +++--
 drivers/gpu/drm/exynos/exynos_drm_gem.h|6 ++-
 drivers/gpu/drm/gma500/framebuffer.c   |3 +
 drive

Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Dan Williams
On Tue, Dec 6, 2016 at 1:47 PM, Logan Gunthorpe  wrote:
> Hey,
>
>> Okay, so clearly this needs a kernel side NVMe specific allocator
>> and locking so users don't step on each other..
>
> Yup, ideally. That's why device dax isn't ideal for this application: it
> doesn't provide any way to prevent users from stepping on each other.

On this particular point I'm in the process of posting patches that
allow device-dax sub-division, so you could carve up a bar into
multiple devices of various sizes.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Dan Williams
On Mon, Dec 5, 2016 at 10:39 AM, Logan Gunthorpe  wrote:
> On 05/12/16 11:08 AM, Dan Williams wrote:
>>
>> I've already recommended that iopmem not be a block device and instead
>> be a device-dax instance. I also don't think it should claim the PCI
>> ID, rather the driver that wants to map one of its bars this way can
>> register the memory region with the device-dax core.
>>
>> I'm not sure there are enough device drivers that want to do this to
>> have it be a generic /sys/.../resource_dmableX capability. It still
>> seems to be an exotic one-off type of configuration.
>
>
> Yes, this is essentially my thinking. Except I think the userspace interface
> should really depend on the device itself. Device dax is a good  choice for
> many and I agree the block device approach wouldn't be ideal.
>
> Specifically for NVME CMB: I think it would make a lot of sense to just hand
> out these mappings with an mmap call on /dev/nvmeX. I expect CMB buffers
> would be volatile and thus you wouldn't need to keep track of where in the
> BAR the region came from. Thus, the mmap call would just be an allocator
> from BAR memory. If device-dax were used, userspace would need to lookup
> which device-dax instance corresponds to which nvme drive.
>

I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
to accomplish in sysfs through /sys/dev/char to find the sysfs path of
the device-dax instance under the nvme device, or if you already have
the nvme sysfs path the dax instance(s) will appear under the "dax"
sub-directory.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Dan Williams
On Mon, Dec 5, 2016 at 10:02 AM, Jason Gunthorpe
 wrote:
> On Mon, Dec 05, 2016 at 09:40:38AM -0800, Dan Williams wrote:
>
>> > If it is kernel only with physical addresess we don't need a uAPI for
>> > it, so I'm not sure #1 is at all related to iopmem.
>> >
>> > Most people who want #1 probably can just mmap
>> > /sys/../pci/../resourceX to get a user handle to it, or pass around
>> > __iomem pointers in the kernel. This has been asked for before with
>> > RDMA.
>> >
>> > I'm still not really clear what iopmem is for, or why DAX should ever
>> > be involved in this..
>>
>> Right, by default remap_pfn_range() does not establish DMA capable
>> mappings. You can think of iopmem as remap_pfn_range() converted to
>> use devm_memremap_pages(). Given the extra constraints of
>> devm_memremap_pages() it seems reasonable to have those DMA capable
>> mappings be optionally established via a separate driver.
>
> Except the iopmem driver claims the PCI ID, and presents a block
> interface which is really *NOT* what people who have asked for this in
> the past have wanted. IIRC it was embedded stuff eg RDMA video
> directly out of a capture card or a similar kind of thinking.
>
> It is a good point about devm_memremap_pages limitations, but maybe
> that just says to create a /sys/.../resource_dmableX ?
>
> Or is there some reason why people want a filesystem on top of BAR
> memory? That does not seem to have been covered yet..
>

I've already recommended that iopmem not be a block device and instead
be a device-dax instance. I also don't think it should claim the PCI
ID, rather the driver that wants to map one of its bars this way can
register the memory region with the device-dax core.

I'm not sure there are enough device drivers that want to do this to
have it be a generic /sys/.../resource_dmableX capability. It still
seems to be an exotic one-off type of configuration.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Dan Williams
On Mon, Dec 5, 2016 at 9:18 AM, Jason Gunthorpe
 wrote:
> On Sun, Dec 04, 2016 at 07:23:00AM -0600, Stephen Bates wrote:
>> Hi All
>>
>> This has been a great thread (thanks to Alex for kicking it off) and I
>> wanted to jump in and maybe try and put some summary around the
>> discussion. I also wanted to propose we include this as a topic for LFS/MM
>> because I think we need more discussion on the best way to add this
>> functionality to the kernel.
>>
>> As far as I can tell the people looking for P2P support in the kernel fall
>> into two main camps:
>>
>> 1. Those who simply want to expose static BARs on PCIe devices that can be
>> used as the source/destination for DMAs from another PCIe device. This
>> group has no need for memory invalidation and are happy to use
>> physical/bus addresses and not virtual addresses.
>
> I didn't think there was much on this topic except for the CMB
> thing.. Even that is really a mapped kernel address..
>
>> I think something like the iopmem patches Logan and I submitted recently
>> come close to addressing use case 1. There are some issues around
>> routability but based on feedback to date that does not seem to be a
>> show-stopper for an initial inclusion.
>
> If it is kernel only with physical addresess we don't need a uAPI for
> it, so I'm not sure #1 is at all related to iopmem.
>
> Most people who want #1 probably can just mmap
> /sys/../pci/../resourceX to get a user handle to it, or pass around
> __iomem pointers in the kernel. This has been asked for before with
> RDMA.
>
> I'm still not really clear what iopmem is for, or why DAX should ever
> be involved in this..

Right, by default remap_pfn_range() does not establish DMA capable
mappings. You can think of iopmem as remap_pfn_range() converted to
use devm_memremap_pages(). Given the extra constraints of
devm_memremap_pages() it seems reasonable to have those DMA capable
mappings be optionally established via a separate driver.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Dan Williams
On Wed, Nov 23, 2016 at 1:55 PM, Jason Gunthorpe
 wrote:
> On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
>> > As I said, there is no possible special handling. Standard IB hardware
>> > does not support changing the DMA address once a MR is created. Forget
>> > about doing that.
>>
>> Yeah, that's essentially the point I was trying to make. Not to mention
>> all the other unrelated hardware that can't DMA to an address that might
>> disappear mid-transfer.
>
> Right, it is impossible to ask for generic page migration with ongoing
> DMA. That is simply not supported by any of the hardware at all.
>
>> > Only ODP hardware allows changing the DMA address on the fly, and it
>> > works at the page table level. We do not need special handling for
>> > RDMA.
>>
>> I am aware of ODP but, noted by others, it doesn't provide a general
>> solution to the points above.
>
> How do you mean?
>
> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.
>
> Page table mirroring *is* the general solution for this problem. The
> GPU driver controls the VMA and the DMA driver mirrors that VMA.
>
> Do you know of another option that doesn't just degenerate to page
> table mirroring??
>
> Remember, there are two facets to the RDMA ODP implementation, I feel
> there is some confusion here..
>
> The crucial part for this discussion is the ability to fence and block
> DMA for a specific range. This is the hardware capability that lets
> page migration happen: fence&block DMA, migrate page, update page
> table in HCA, unblock DMA.

Wait, ODP requires migratable pages, ZONE_DEVICE pages are not
migratable. You can't replace a PCIe mapping with just any other
System RAM physical address, right? At least not without a filesystem
recording where things went, but at point we're no longer talking
about the base P2P-DMA mapping mechanism and are instead talking about
something like pnfs-rdma to a DAX filesystem.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Dan Williams
On Wed, Nov 23, 2016 at 9:27 AM, Bart Van Assche
 wrote:
> On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
>>
>> IMO any memory that has been registered for a P2P transaction should be
>> locked from being evicted. So if there's a get_user_pages call it needs
>> to be pinned until the put_page. The main issue being with the RDMA
>> case: handling an eviction when a chunk of memory has been registered as
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid then
>> user-space would have to tell the remote application.
>
>
> Hello Logan,
>
> Are you aware that the Linux kernel already supports ODP (On Demand Paging)?
> See also the output of git grep -nHi on.demand.paging. See also
> https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.
>

I don't think that was designed for the case where the backing memory
is a special/static physical address range rather than anonymous
"System RAM", right?

I think we should handle the graphics P2P concerns separately from the
general P2P-DMA case since the latter does not require the higher
order memory management facilities. Using ZONE_DEVICE/DAX mappings to
avoid changes to every driver that wants to support P2P-DMA separately
from typical DMA still seems the path of least resistance.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Dan Williams
On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter  wrote:
> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
>  wrote:
>>
>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>
>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams 
>>> wrote:
>>>>
>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>>  wrote:
>>>>>
>>>>> I personally like "device-DAX" idea but my concerns are:
>>>>>
>>>>> -  How well it will co-exists with the  DRM infrastructure /
>>>>> implementations
>>>>> in part dealing with CPU pointers?
>>>>
>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>>> not migratable. To be honest I do not know how well that co-exists
>>>> with drm infrastructure.
>>>>
>>>>> -  How well we will be able to handle case when we need to
>>>>> "move"/"evict"
>>>>> memory/data to the new location so CPU pointer should point to the
>>>>> new
>>>>> physical location/address
>>>>>  (and may be not in PCI device memory at all)?
>>>>
>>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>>> interface is for cases where all that is needed is a direct-mapping to
>>>> a statically-allocated physical-address range be it persistent memory
>>>> or some other special reserved memory range.
>>>
>>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>>> pull off) I think we want all the magic in core mm, i.e. migration and
>>> overcommit. At least that seems to be the very strong drive in all
>>> general-purpose gpu abstractions and implementations, where memory is
>>> allocated with malloc, and then mapped/moved into vram/gpu address
>>> space through some magic,
>>
>> It is possible that there is other way around: memory is requested to be
>> allocated and should be kept in vram for  performance reason but due
>> to possible overcommit case we need at least temporally to "move" such
>> allocation to system memory.
>
> With migration I meant migrating both ways of course. And with stuff
> like numactl we can also influence where exactly the malloc'ed memory
> is allocated originally, at least if we'd expose the vram range as a
> very special numa node that happens to be far away and not hold any
> cpu cores.

I don't think we should be using numa distance to reverse engineer a
certain allocation behavior.  The latency data should be truthful, but
you're right we'll need a mechanism to keep general purpose
allocations out of that range by default. Btw, strict isolation is
another design point of device-dax, but I think in this case we're
describing something between the two extremes of full isolation and
full compatibility with existing numactl apis.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Dan Williams
On Tue, Nov 22, 2016 at 12:10 PM, Daniel Vetter  wrote:
> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams  
> wrote:
>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>  wrote:
>>> I personally like "device-DAX" idea but my concerns are:
>>>
>>> -  How well it will co-exists with the  DRM infrastructure / implementations
>>>in part dealing with CPU pointers?
>>
>> Inside the kernel a device-DAX range is "just memory" in the sense
>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>> not migratable. To be honest I do not know how well that co-exists
>> with drm infrastructure.
>>
>>> -  How well we will be able to handle case when we need to "move"/"evict"
>>>memory/data to the new location so CPU pointer should point to the new
>>> physical location/address
>>> (and may be not in PCI device memory at all)?
>>
>> So, device-DAX deliberately avoids support for in-kernel migration or
>> overcommit. Those cases are left to the core mm or drm. The device-dax
>> interface is for cases where all that is needed is a direct-mapping to
>> a statically-allocated physical-address range be it persistent memory
>> or some other special reserved memory range.
>
> For some of the fancy use-cases (e.g. to be comparable to what HMM can
> pull off) I think we want all the magic in core mm, i.e. migration and
> overcommit. At least that seems to be the very strong drive in all
> general-purpose gpu abstractions and implementations, where memory is
> allocated with malloc, and then mapped/moved into vram/gpu address
> space through some magic, but still visible on both the cpu and gpu
> side in some form. Special device to allocate memory, and not being
> able to migrate stuff around sound like misfeatures from that pov.

Agreed. For general purpose P2P use cases where all you want is
direct-I/O to a memory range that happens to be on a PCIe device then
I think a special device fits the bill. For gpu P2P use cases that
already have migration/overcommit expectations then it is not a good
fit.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Dan Williams
On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
 wrote:
> Dan,
>
> I personally like "device-DAX" idea but my concerns are:
>
> -  How well it will co-exists with the  DRM infrastructure / implementations
>in part dealing with CPU pointers?

Inside the kernel a device-DAX range is "just memory" in the sense
that you can perform pfn_to_page() on it and issue I/O, but the vma is
not migratable. To be honest I do not know how well that co-exists
with drm infrastructure.

> -  How well we will be able to handle case when we need to "move"/"evict"
>memory/data to the new location so CPU pointer should point to the new
> physical location/address
> (and may be not in PCI device memory at all)?

So, device-DAX deliberately avoids support for in-kernel migration or
overcommit. Those cases are left to the core mm or drm. The device-dax
interface is for cases where all that is needed is a direct-mapping to
a statically-allocated physical-address range be it persistent memory
or some other special reserved memory range.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Dan Williams
On Mon, Nov 21, 2016 at 12:36 PM, Deucher, Alexander
 wrote:
> This is certainly not the first time this has been brought up, but I'd like 
> to try and get some consensus on the best way to move this forward.  Allowing 
> devices to talk directly improves performance and reduces latency by avoiding 
> the use of staging buffers in system memory.  Also in cases where both 
> devices are behind a switch, it avoids the CPU entirely.  Most current APIs 
> (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based.  
> Ideally we'd be able to take a CPU virtual address and be able to get to a 
> physical address taking into account IOMMUs, etc.  Having struct pages for 
> the memory would allow it to work more generally and wouldn't require as much 
> explicit support in drivers that wanted to use it.
>
> Some use cases:
> 1. Storage devices streaming directly to GPU device memory
> 2. GPU device memory to GPU device memory streaming
> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
> 4. DVB/V4L/SDI devices streaming directly to storage devices
>
> Here is a relatively simple example of how this could work for testing.  This 
> is obviously not a complete solution.
> - Device memory will be registered with Linux memory sub-system by created 
> corresponding struct page structures for device memory
> - get_user_pages_fast() will  return corresponding struct pages when CPU 
> address points to the device memory
> - put_page() will deal with struct pages for device memory
>
[..]
> 4. iopmem
> iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)

The change I suggest for this particular approach is to switch to
"device-DAX" [1]. I.e. a character device for establishing DAX
mappings rather than a block device plus a DAX filesystem. The pro of
this approach is standard user pointers and struct pages rather than a
new construct. The con is that this is done via an interface separate
from the existing gpu and storage device. For example it would require
a /dev/dax instance alongside a /dev/nvme interface, but I don't see
that as a significant blocking concern.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-October/007496.html
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 29/31] parisc: handle page-less SG entries

2015-08-14 Thread Dan Williams
On Thu, Aug 13, 2015 at 9:11 PM, David Miller  wrote:
> From: James Bottomley 
>> At least on some PA architectures, you have to be very careful.
>> Improperly managed, multiple aliases will cause the system to crash
>> (actually a machine check in the cache chequerboard). For the most
>> temperamental systems, we need the cache line flushed and the alias
>> mapping ejected from the TLB cache before we access the same page at an
>> inequivalent alias.
>
> Also, I want to mention that on sparc64 we manage the cache aliasing
> state in the page struct.
>
> Until a page is mapped into userspace, we just record the most recent
> cpu to store into that page with kernel side mappings.  Once the page
> ends up being mapped or the cpu doing kernel side stores changes, we
> actually perform the cache flush.
>
> Generally speaking, I think that all actual physical memory the kernel
> operates on should have a struct page backing it.  So this whole
> discussion of operating on physical memory in scatter lists without
> backing page structs feels really foreign to me.

So the only way for page-less pfns to enter the system is through the
->direct_access() method provided by a pmem device's struct
block_device_operations.  Architectures that require struct page for
cache management to must disable ->direct_access() in this case.

If an arch still wants to support pmem+DAX then it needs something
like this patchset (feedback welcome) to map pmem pfns:

https://lkml.org/lkml/2015/8/12/970

Effectively this would disable ->direct_access() on /dev/pmem0, but
permit ->direct_access() on /dev/pmem0m.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 29/31] parisc: handle page-less SG entries

2015-08-13 Thread Dan Williams
On Thu, Aug 13, 2015 at 7:31 AM, Christoph Hellwig  wrote:
> On Wed, Aug 12, 2015 at 09:01:02AM -0700, Linus Torvalds wrote:
>> I'm assuming that anybody who wants to use the page-less
>> scatter-gather lists always does so on memory that isn't actually
>> virtually mapped at all, or only does so on sane architectures that
>> are cache coherent at a physical level, but I'd like that assumption
>> *documented* somewhere.
>
> It's temporarily mapped by kmap-like helpers.  That code isn't in
> this series. The most recent version of it is here:
>
> https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/commit/?h=pfn&id=de8237c99fdb4352be2193f3a7610e902b9bb2f0
>
> note that it's not doing the cache flushing it would have to do yet, but
> it's also only enabled for x86 at the moment.

For virtually tagged caches I assume we would temporarily map with
kmap_atomic_pfn_t(), similar to how drm_clflush_pages() implements
powerpc support.  However with DAX we could end up with multiple
virtual aliases for a page-less pfn.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2 v2] dma: ipu_idmac: do not lose valid received data in the irq handler

2011-02-14 Thread Dan Williams
On Mon, Feb 7, 2011 at 8:49 AM, Guennadi Liakhovetski
 wrote:
> Ok, I've found the reason. Buffer number repeats, when there is an
> underrun, which is happening in my tests, when frames are arriving quickly
> enough, but the user-space is not fast enough to process them, e.g., when
> it is writing them to files over NFS or even just displaying on the LCD.
> Without your patch these underruns happen just as well, they just don't
> get recognised, because there's always one buffer delayed, so, the queue
> is never empty.
>
> Dan, please add my
>
> Reviewed-(and-tested-)by: Guennadi Liakhovetski 

Thanks, applied v2.

--
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] dma: ipu_idmac: do not lose valid received data in the irq handler

2011-01-30 Thread Dan Williams
On Thu, Jan 27, 2011 at 12:22 AM, Anatolij Gustschin  wrote:
> Reading the commit message again I now realize that there is
> a mistake.
>
> On Wed, 26 Jan 2011 09:49:49 +0100
> Anatolij Gustschin  wrote:
> ...
>> received data. DMA_BUFx_RDY won't be set by the IPU, so waiting
>> for this event in the interrupt handler is wrong.
>
> This should read 'DMAIC_x_CUR_BUF flag won't be flipped here by
> the IPU, so waiting for this event in the EOF interrupt handler
> is wrong.'
>
> I'll submit another patch fixing the commit message.
>

Ok, also looking for a reviewed/acked by Guennadi.

--
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] dma: fix ipu_idmac.c to not discard the last queued buffer

2009-05-12 Thread Dan Williams
On Tue, May 12, 2009 at 5:14 AM, Agustin  wrote:
>
> On Tue, 12 May 2009, Guennadi Liakhovetski wrote:
>
>>
>> This also fixes the case of a single queued buffer, for example, when taking 
>> a
>> single frame snapshot with the mx3_camera driver.
>>
>> Reported-by: Agustin
>> Signed-off-by: Guennadi Liakhovetski
>
> Signed-off-by: Agustin Ferrin Pozuelo

Applied.
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html