RE: [PATCH v2 3/3] arch/*/: remove CONFIG_VIRT_TO_BUS

2022-06-30 Thread David Laight
From: Christophe Leroy
> Sent: 30 June 2022 10:40
> 
> Le 30/06/2022 à 10:04, David Laight a écrit :
> > From: Michael Schmitz
> >> Sent: 29 June 2022 00:09
> >>
> >> Hi Arnd,
> >>
> >> On 29/06/22 09:50, Arnd Bergmann wrote:
> >>> On Tue, Jun 28, 2022 at 11:03 PM Michael Schmitz  
> >>> wrote:
> >>>> On 28/06/22 19:03, Geert Uytterhoeven wrote:
> >>>>>> The driver allocates bounce buffers using kmalloc if it hits an
> >>>>>> unaligned data buffer - can such buffers still even happen these days?
> >>>>> No idea.
> >>>> Hmmm - I think I'll stick a WARN_ONCE() in there so we know whether this
> >>>> code path is still being used.
> >>> kmalloc() guarantees alignment to the next power-of-two size or
> >>> KMALLOC_MIN_ALIGN, whichever is bigger. On m68k this means it
> >>> is cacheline aligned.
> >>
> >> And all SCSI buffers are allocated using kmalloc? No way at all for user
> >> space to pass unaligned data?
> >
> > I didn't think kmalloc() gave any such guarantee about alignment.
> 
> It does since commit 59bb47985c1d ("mm, sl[aou]b: guarantee natural
> alignment for kmalloc(power-of-two)")

Looks like it is done for 'power-of-two' less than PAGE_SIZE.
This may not help scsi tape writes which could easily be (say) 47 bytes.
I think that only guarantees 2 byte alignment on m68k.
(Although increasing the min-alignment on m68k to 4 (or even 8)
will probably make no measurable difference.)

What happens above PAGE_SIZE?
Any structure with a trailing [] field could easily request
'64k + a_bit' bytes.
You don't really want to extend this to 128k - but I suspect
that is what happens.
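
Something like the check Michael mentioned, just as a sketch (the 4-byte
DMA_ALIGN and the helper are made up, not the real driver's code):

	#define DMA_ALIGN	4	/* assumed controller requirement */

	static void *maybe_bounce(void *buf, size_t len, gfp_t gfp)
	{
		if (IS_ALIGNED((unsigned long)buf, DMA_ALIGN))
			return buf;

		/* warn once so we learn whether this path is still hit */
		WARN_ONCE(1, "bouncing unaligned buffer %p len %zu\n", buf, len);

		/*
		 * kmalloc() only guarantees the extra alignment for
		 * power-of-two sizes, so round the bounce buffer size up.
		 */
		return kmalloc(roundup_pow_of_two(len), gfp);
	}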

David
 

> 
> Christophe
> 
> > There are cache-line alignment requirements on systems with non-coherent
> > dma, but otherwise the alignment can be much smaller.
> >
> > One of the allocators adds a header to each item, IIRC that can
> > lead to 'unexpected' alignments - especially on m68k.
> >
> > dma_alloc_coherent() does align to next 'power of 2'.
> > And sometimes you need (eg) 16k allocates that are 16k aligned.
> >
> > David
> >


RE: [PATCH v2 3/3] arch/*/: remove CONFIG_VIRT_TO_BUS

2022-06-30 Thread David Laight
From: Michael Schmitz
> Sent: 29 June 2022 00:09
> 
> Hi Arnd,
> 
> On 29/06/22 09:50, Arnd Bergmann wrote:
> > On Tue, Jun 28, 2022 at 11:03 PM Michael Schmitz  
> > wrote:
> >> On 28/06/22 19:03, Geert Uytterhoeven wrote:
>  The driver allocates bounce buffers using kmalloc if it hits an
>  unaligned data buffer - can such buffers still even happen these days?
> >>> No idea.
> >> Hmmm - I think I'll stick a WARN_ONCE() in there so we know whether this
> >> code path is still being used.
> > kmalloc() guarantees alignment to the next power-of-two size or
> > KMALLOC_MIN_ALIGN, whichever is bigger. On m68k this means it
> > is cacheline aligned.
> 
> And all SCSI buffers are allocated using kmalloc? No way at all for user
> space to pass unaligned data?

I didn't think kmalloc() gave any such guarantee about alignment.
There are cache-line alignment requirements on systems with non-coherent
dma, but otherwise the alignment can be much smaller.

One of the allocators adds a header to each item, IIRC that can
lead to 'unexpected' alignments - especially on m68k.

dma_alloc_coherent() does align to next 'power of 2'.
And sometimes you need (eg) 16k allocates that are 16k aligned.
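
e.g. (sketch only, 'ring' is a made-up driver structure):

	#define RING_BYTES	SZ_16K	/* must not cross a 16k boundary */

	ring->desc = dma_alloc_coherent(dev, RING_BYTES,
					&ring->dma_handle, GFP_KERNEL);
	if (!ring->desc)
		return -ENOMEM;
	/*
	 * Relies on dma_alloc_coherent() rounding up to the next power
	 * of 2, so the 16k block is 16k aligned and can't straddle the
	 * boundary.
	 */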

David



RE: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-28 Thread David Laight
From: Christoph Hellwig
> Sent: 28 March 2022 07:37
> 
> On Fri, Mar 25, 2022 at 11:46:09AM -0700, Linus Torvalds wrote:
> > I think my list of three different sync cases (not just two! It's not
> > just about whether to sync for the CPU or the device, it's also about
> > what direction the data itself is taking) is correct.
> >
> > But maybe I'm wrong.
> 
> At the high level you are correct.  It is all about which direction
> the data is taking.  That is the direction argument that all the
> map/unmap/sync call take.  The sync calls then just toggle the ownership.
> You seem to hate that ownership concept, but I don't see how things
> could work without that ownership concept as we're going to be in
> trouble without having that.  And yes, a peek operation could work in
> some cases, but it would have to be at the cache line granularity.

I don't think it is really 'ownership' but more about who has
write access.
Only one side can have write access (to a cache line [1]) at any
one time.

Read access is different.
You need a 'synchronise' action to pick up newly written data.
This might be a data copy, cache flush or cache invalidate.
It need only affect the area that needs to be read - not the
full buffer.
Partial cache flush/invalidate will almost certainly speed
up receipt of short network packets that are copied into a
new skb - leaving the old one mapped for another receive.

[1] The cache line size might be a property of the device
and dma subsystem, not just the cpu.
I have used hardware where the effective size was 1kB.
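
e.g. the usual rx 'copybreak' pattern, roughly (hypothetical rx/napi
context, buffer mapped DMA_FROM_DEVICE):

	/* only sync (invalidate) the bytes the device actually wrote */
	dma_sync_single_for_cpu(dev, rx->dma_addr, pkt_len, DMA_FROM_DEVICE);

	skb = napi_alloc_skb(napi, pkt_len);
	if (skb)
		skb_put_data(skb, rx->buf, pkt_len);

	/* hand the original buffer straight back to the hardware */
	dma_sync_single_for_device(dev, rx->dma_addr, pkt_len, DMA_FROM_DEVICE);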

David



RE: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-27 Thread David Laight
From: Linus Torvalds
> Sent: 27 March 2022 06:21
> 
> On Sat, Mar 26, 2022 at 10:06 PM Linus Torvalds
>  wrote:
> >
> > On Sat, Mar 26, 2022 at 8:49 PM Halil Pasic  wrote:
> > >
> > > I agree it CPU modified buffers *concurrently* with DMA can never work,
> > > and I believe the ownership model was conceived to prevent this
> > > situation.
> >
> > But that just means that the "ownership" model is garbage, and cannot
> > handle this REAL LIFE situation.
> 
> Just to clarify: I obviously agree that the "both sides modify
> concurrently" obviously cannot work with bounce buffers.

Aren't bounce buffers just a more extreme case of non-coherent
memory accesses?
They just need explicit memory copies rather than just cache
writeback and invalidate operations.

So 'both sides modify concurrently' just has the same issue
as it does with non-coherent memory in that the locations
need to be in separate (dma) cache lines.
Indeed, if the bounce buffers are actually coherent then
arbitrary concurrent updates are possible.

One issue is that the driver needs to indicate which parts
of any buffer are dirty.
Whereas any 'cache writeback' request will only write
dirty data.

Get everything right and you can even support hardware where
the 'bounce buffers' are actually on the card and the copies
are MMIO (or better, especially on PCIe, synchronous host
initiated dma transfers).

David



RE: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-26 Thread David Laight
From: Linus Torvalds
> Sent: 26 March 2022 18:39
> 
> On Sat, Mar 26, 2022 at 9:06 AM Toke Høiland-Jørgensen  wrote:
> >
> > I was also toying with the idea of having a copy-based peek helper like:
> >
> > u32 data = dma_peek_word(buf, offset)
> 
> I really don't think you can or want to have a word-based one.
> 
> That said, I like the *name* of that thing.
> 
> I think a lot of confusion comes from the very subtle naming of
> fundamentally having a lot of odd conditions with
> 
>  - two different "directions of the sync" - ie who it is that cares:
> 
>dma_sync_single_for_{cpu,device}
> 
>  - three different "direction of the data" - ie who it is that writes the 
> data:
> 
> DMA_FROM_DEVICE / DMA_TO_DEVICE / DMA_BIDIRECTIONAL
> 
> so you have six possible combinations, three of which seem insane and
> not useful, and of the three that are actually possible, some are very
> unusual (it exactly that "device is the one writing, but we want to
> sync the dma area for the device").

Another 2c :-)

Is the idea of 'buffer ownership' even a good one?
Perhaps the whole thing would be better described in terms of
what happens when bounce buffers are used.
So there are notionally two buffers.
One accessed by the cpu, the other by the device.

There are then just 3 things that happen:
1) Dirty data may get moved to the other buffer at any time.
   So the driver must not dirty buffers (cache lines) that the
   device might write to.
2) The cpu has to request data be copied to the device buffer
   after updating the cpu buffer.
   This makes the cpu buffer 'not dirty' so copies (1) can no
   longer happen.
3) The cpu has to request data be copied from the device buffer
   before looking at the data.
All copies affect a dma-cache-line sized block of data (which
might be device dependent).
An optimised version of (2) that doesn't actually do the copy
can be implemented for use prior to read requests.

For cache-coherent memory only (1) happens and (2) and (3)
are no operations.
For non-coherent memory (2) is write-back-and-invalidate and
(3) might just be an invalidate.
For bounce buffers all are actual copies - and additional
operations might be needed for device access to the bounce
buffer itself.

For security reasons bounce buffers may need initialising.
But this would be done when they are allocated.
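
In terms of the existing calls that is roughly (sketch, single
streaming buffer):

	dma = dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL);

	/* (1) happens behind our back from now on: don't dirty lines
	 *     that the device is going to write.
	 */

	/* (2) cpu has updated buf - push it to the device's buffer */
	dma_sync_single_for_device(dev, dma, len, DMA_BIDIRECTIONAL);

	/* (3) device has written - pull its buffer back before reading */
	dma_sync_single_for_cpu(dev, dma, len, DMA_BIDIRECTIONAL);

	dma_unmap_single(dev, dma, len, DMA_BIDIRECTIONAL);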

David


RE: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-25 Thread David Laight
From: Linus Torvalds
> Sent: 25 March 2022 21:57
> 
> On Fri, Mar 25, 2022 at 2:13 PM Johannes Berg  
> wrote:
> >
> > Well I see now that you said 'cache "writeback"' in (1), and 'flush' in
> > (2), so perhaps you were thinking of the same, and I'm just calling it
> > "flush" and "invalidate" respectively?
> 
> Yeah, so I mentally tend to think of the operations as just
> "writeback" (which doesn't invalidate) and "flush" (which is a
> writeback-invalidate).

It almost certainly doesn't matter whether the "writeback"
invalidates or not.
You have to assume that all sorts of operations might cause
the cpu to read in a cacheline.
This includes, but is not limited to, speculative execution
and cache line prefetch.

But you definitely need an "invalidate" to force a cache line
be read after the hardware has accessed it.
Now such lines must not be dirty; because the cpu can write
back a dirty cache line at any time - which would break things.
So this can also be "write back if dirty" and "invalidate".

Bounce buffers and cache probably work much the same way.
But for bounce buffers I guess you want to ensure the initially
allocated buffer doesn't contain old data (belonging to
someone else).
So you might decide to zero them on allocation or always copy
from the driver buffer on the first request.

Then you get the really annoying cpus that don't have a
"write back dirty line and invalidate" opcode.
And the only way is to read enough other memory areas
to displace all the existing cache line data.
You probably might as well give up and use PIO :-)

David



RE: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-25 Thread David Laight
I've been thinking of the case where a descriptor ring has
to be in non-coherent memory (eg because that is all there is).

The receive ring processing isn't actually that difficult.

The driver has to fill a cache line full of new buffer
descriptors in memory but without assigning the first
buffer to the hardware.
Then it has to do a cache line write of just that line.
Then it can assign ownership of the first buffer and
finally do a second cache line write.
(The first explicit write can be skipped if the cache
writes are known to be atomic.)
It then must not dirty that cache line.

To check for new frames it must invalidate the cache
line that contains the 'next to be filled' descriptor
and then read that cache line.
This will contain info about one or more receive frames.
But the hardware is still doing updates.

But both these operations can be happening at the same
time on different parts of the buffer.

So you need to know a 'cache line size' for the mapping
and be able to do writebacks and invalidates for parts
of the buffer, not just all of it.

The transmit side is harder.
It either requires waiting for all pending transmits to
finish or splitting a single transmit into enough fragments
that its descriptors end on a cache line boundary.
But again, and if the interface is busy, you want the cpu
to be able to update one cache line of transmit descriptors
while the device is writing transmit completion status
to the previous cache line.

I don't think that is materially different for non-coherent
memory or bounce buffers.
But partial flush/invalidate is needed.
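
Roughly, for the receive side (sketch only; the ring layout,
init_descriptors() and DESC_OWN_HW are made up, ring mapped
DMA_BIDIRECTIONAL):

	unsigned long off = (idx * sizeof(struct rx_desc)) &
			    ~(unsigned long)(L1_CACHE_BYTES - 1);

	/* fill the whole cache line, but don't hand over the first one yet */
	init_descriptors(ring, idx, descs_per_line);
	dma_sync_single_range_for_device(dev, ring->dma, off,
					 L1_CACHE_BYTES, DMA_BIDIRECTIONAL);

	/* now give the first descriptor to the hardware and push it out */
	ring->desc[idx].flags |= cpu_to_le32(DESC_OWN_HW);
	dma_sync_single_range_for_device(dev, ring->dma, off,
					 L1_CACHE_BYTES, DMA_BIDIRECTIONAL);

	/* ... and to poll: invalidate just that line, then look at it */
	dma_sync_single_range_for_cpu(dev, ring->dma, off,
				      L1_CACHE_BYTES, DMA_BIDIRECTIONAL);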

David



RE: [PATCH 2/6] dma-mapping: add a dma_mmap_pages helper

2021-01-28 Thread David Laight
From: Christoph Hellwig
> Sent: 28 January 2021 14:59
> 
> Add a helper to map memory allocated using dma_alloc_pages into
> a user address space, similar to the dma_alloc_attrs function for
> coherent allocations.
> 
...
> +::
> +
> + int
> + dma_mmap_pages(struct device *dev, struct vm_area_struct *vma,
> +size_t size, struct page *page)
> +
> +Map an allocation returned from dma_alloc_pages() into a user address space.
> +dev and size must be the same as those passed into dma_alloc_pages().
> +page must be the pointer returned by dma_alloc_pages().

To be useful this needs to specify the offset into the user address area.
(ie the offset in the mmap() buffer.)

For example we have an fpga based PCIe card that converts internal
addresses that refer to one of 512 16k 'pages' to 64bit PCIe bus
master addresses.
So it (sort of) contains its own iommu.

So we can allocate (aligned) 16k kernel memory buffers with
dma_alloc_coherent() and make them appear contiguous to the
on-board PCIe bus master users.
We then mmap() them into contiguous user addresses.
So both 'ends' see contiguous addresses without requiring
contiguous physical memory or requiring all the memory be 
allocated at the same time.

Clearly in-kernel users have to allow for the 16k boundaries.
But the large structures are accessed from user space.
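
i.e. what I'd want to be able to write is something like this
(entirely hypothetical extra 'offset' argument, not what the patch
adds):

	for (i = 0; i < nr_chunks; i++) {
		err = dma_mmap_pages(dev, vma, i * SZ_16K /* offset */,
				     SZ_16K, chunk_page[i]);
		if (err)
			return err;
	}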

David



RE: [PATCH v3 4/7] iommu: Switch gather->end to unsigned long long

2020-12-16 Thread David Laight
From: Yong Wu
> Sent: 16 December 2020 10:36
> 
> Currently gather->end is "unsigned long" which may be overflow in
> arch32 in the corner case: 0xfff0 + 0x10(iova + size).
> Although it doesn't affect the size(end - start), it affects the checking
> "gather->end < end"
> 
> Fixes: a7d20dc19d9e ("iommu: Introduce struct iommu_iotlb_gather for batching 
> TLB flushes")
> Signed-off-by: Yong Wu 
> ---
>  include/linux/iommu.h | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 794d4085edd3..6e907a95d981 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -178,7 +178,7 @@ enum iommu_dev_features {
>   */
>  struct iommu_iotlb_gather {
>   unsigned long   start;
> - unsigned long   end;
> + unsigned long long  end;
>   size_t  pgsize;
>  };

Doesn't that add two pad words on many 32bit systems?
You probably ought to re-order the structure to keep the fields
on their natural boundaries.

I'm not sure what is being mapped here, but could it make sense
to just avoid using the highest addresses?
Then you never hit the problem.

David



RE: [PATCH] [v2] x86: apic: avoid -Wshadow warning in header

2020-10-29 Thread David Laight
From: Arvind Sankar
> Sent: 29 October 2020 21:35
> 
> On Thu, Oct 29, 2020 at 09:41:13PM +0100, Thomas Gleixner wrote:
> > On Thu, Oct 29 2020 at 17:59, Paolo Bonzini wrote:
> > > On 29/10/20 17:56, Arvind Sankar wrote:
> > >>> For those two just add:
> > >>> struct apic *apic = x86_system_apic;
> > >>> before all the assignments.
> > >>> Less churn and much better code.
> > >>>
> > >> Why would it be better code?
> > >>
> > >
> > > I think he means the compiler produces better code, because it won't
> > > read the global variable repeatedly.  Not sure if that's true,(*) but I
> > > think I do prefer that version if Arnd wants to do that tweak.
> >
> > It's not true.
> >
> >  foo *p = bar;
> >
> >  p->a = 1;
> >  p->b = 2;
> >
> > The compiler is free to reload bar after accessing p->a and with
> >
> > bar->a = 1;
> > bar->b = 1;
> >
> > it can either cache bar in a register or reread it after bar->a
> >
> > The generated code is the same as long as there is no reason to reload,
> > e.g. register pressure.
> >
> > Thanks,
> >
> > tglx
> 
> It's not quite the same.
> 
> https://godbolt.org/z/4dzPbM
> 
> With -fno-strict-aliasing, the compiler reloads the pointer if you write
> to the start of what it points to, but not if you write to later
> elements.

I guess it assumes that global data doesn't overlap.

But in general they are sort of opposites:

With the local variable it can reload if it knows the write
cannot have affected the global - but is unlikely to do so.

Using the global it must reload if it is possible the write
might have affected the global.
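
i.e. the two forms being compared are (cut down):

	struct apic { int a; int b; };
	extern struct apic *apic;

	void set_direct(void)
	{
		/* with -fno-strict-aliasing the store to ->a might alias
		 * 'apic' itself, so the compiler may reload it here.
		 */
		apic->a = 1;
		apic->b = 2;
	}

	void set_via_local(void)
	{
		struct apic *p = apic;	/* the stores cannot change 'p' */

		p->a = 1;
		p->b = 2;
	}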

David



RE: [PATCH] [v2] x86: apic: avoid -Wshadow warning in header

2020-10-29 Thread David Laight
From: Arnd Bergmann
> Sent: 29 October 2020 09:51
...
> I think ideally there would be no global variable, withall accesses
> encapsulated in function calls, possibly using static_call() optimizations
> if any of them are performance critical.

There isn't really a massive difference between global variables
and global access functions.

David



RE: [PATCH] [v2] x86: apic: avoid -Wshadow warning in header

2020-10-29 Thread David Laight
From: Arnd Bergmann
> Sent: 28 October 2020 21:21
> 
> From: Arnd Bergmann 
> 
> There are hundreds of warnings in a W=2 build about a local
> variable shadowing the global 'apic' definition:
> 
> arch/x86/kvm/lapic.h:149:65: warning: declaration of 'apic' shadows a global 
> declaration [-Wshadow]
> 
> Avoid this by renaming the global 'apic' variable to the more descriptive
> 'x86_system_apic'. It was originally called 'genapic', but both that
> and the current 'apic' seem to be a little overly generic for a global
> variable.
> 
> Fixes: c48f14966cc4 ("KVM: inline kvm_apic_present() and kvm_lapic_enabled()")
> Fixes: c8d46cf06dc2 ("x86: rename 'genapic' to 'apic'")
> Signed-off-by: Arnd Bergmann 
> ---
> v2: rename the global instead of the local variable in the header
...
> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> index 284e73661a18..33e2dc78ca11 100644
> --- a/arch/x86/hyperv/hv_apic.c
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -259,14 +259,14 @@ void __init hv_apic_init(void)
>   /*
>* Set the IPI entry points.
>*/
> - orig_apic = *apic;
> -
> - apic->send_IPI = hv_send_ipi;
> - apic->send_IPI_mask = hv_send_ipi_mask;
> - apic->send_IPI_mask_allbutself = hv_send_ipi_mask_allbutself;
> - apic->send_IPI_allbutself = hv_send_ipi_allbutself;
> - apic->send_IPI_all = hv_send_ipi_all;
> - apic->send_IPI_self = hv_send_ipi_self;
> + orig_apic = *x86_system_apic;
> +
> + x86_system_apic->send_IPI = hv_send_ipi;
> + x86_system_apic->send_IPI_mask = hv_send_ipi_mask;
> + x86_system_apic->send_IPI_mask_allbutself = 
> hv_send_ipi_mask_allbutself;
> + x86_system_apic->send_IPI_allbutself = hv_send_ipi_allbutself;
> + x86_system_apic->send_IPI_all = hv_send_ipi_all;
> + x86_system_apic->send_IPI_self = hv_send_ipi_self;
>   }
> 
>   if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
> @@ -285,10 +285,10 @@ void __init hv_apic_init(void)
>*/
>   apic_set_eoi_write(hv_apic_eoi_write);
>   if (!x2apic_enabled()) {
> - apic->read  = hv_apic_read;
> - apic->write = hv_apic_write;
> - apic->icr_write = hv_apic_icr_write;
> - apic->icr_read  = hv_apic_icr_read;
> + x86_system_apic->read  = hv_apic_read;
> + x86_system_apic->write = hv_apic_write;
> + x86_system_apic->icr_write = hv_apic_icr_write;
> + x86_system_apic->icr_read  = hv_apic_icr_read;
>   }

For those two just add:
struct apic *apic = x86_system_apic;
before all the assignments.
Less churn and much better code.
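
i.e. the first hunk above would end up something like:

	struct apic *apic = x86_system_apic;

	orig_apic = *apic;

	apic->send_IPI = hv_send_ipi;
	apic->send_IPI_mask = hv_send_ipi_mask;
	apic->send_IPI_mask_allbutself = hv_send_ipi_mask_allbutself;
	apic->send_IPI_allbutself = hv_send_ipi_allbutself;
	apic->send_IPI_all = hv_send_ipi_all;
	apic->send_IPI_self = hv_send_ipi_self;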

David



RE: [PATCH v3 17/35] x86/pci/xen: Use msi_msg shadow structs

2020-10-25 Thread David Laight
From: David Woodhouse
> Sent: 25 October 2020 10:26
> To: David Laight ; x...@kernel.org
> 
> On Sun, 2020-10-25 at 09:49 +, David Laight wrote:
> > Just looking at a random one of these patches...
> >
> > Does the compiler manage to optimise that reasonably?
> > Or does it generate a lot of shifts and masks as each
> > bitfield is set?
> >
> > The code generation for bitfields is often a lot worse
> than that for |= setting bits in a word.
> 
> Indeed, it appears to be utterly appalling. That was one of my
> motivations for doing it with masks and shifts in the first place.

I thought it would be.

I'm not even sure using bitfields to map hardware registers
makes the code any more readable (or fool proof).

I suspect using #define constants and explicit |= and &= ~
is actually best - provided the names are descriptive enough.

If you set all the fields the compiler will merge the values
and do a single write - provided nothing you read might
alias the target.

The only way to get strongly typed integers is to cast them
to a structure pointer type.
Together with a static inline function to remove the casts
and | the values together it might make things fool proof.
But I've never tried it.

ISTR once doing something like that with error codes, but
it didn't ever make source control.
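
Something along these lines (untried, names made up):

	struct ctrl_val;	/* dummy type, never defined */

	#define CTRL_ENABLE	((const struct ctrl_val *)0x0001u)
	#define CTRL_IRQ_EN	((const struct ctrl_val *)0x0010u)

	static inline unsigned int ctrl_bits(const struct ctrl_val *a,
					     const struct ctrl_val *b)
	{
		/* strip the casts and | the values together */
		return (unsigned int)(unsigned long)a |
		       (unsigned int)(unsigned long)b;
	}

	/* writel(ctrl_bits(CTRL_ENABLE, CTRL_IRQ_EN), base + CTRL_REG); */

Give each register its own dummy type and passing a value that belongs
to a different register won't compile.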

David



RE: [PATCH v3 17/35] x86/pci/xen: Use msi_msg shadow structs

2020-10-25 Thread David Laight
From: David Woodhouse
> Sent: 24 October 2020 22:35
> 
> From: Thomas Gleixner 
> 
> Use the msi_msg shadow structs and compose the message with named bitfields
> instead of the unreadable macro maze.
> 
> Signed-off-by: Thomas Gleixner 
> Signed-off-by: David Woodhouse 
> ---
>  arch/x86/pci/xen.c | 26 +++---
>  1 file changed, 11 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> index c552cd2d0632..3d41a09c2c14 100644
> --- a/arch/x86/pci/xen.c
> +++ b/arch/x86/pci/xen.c
> @@ -152,7 +152,6 @@ static int acpi_register_gsi_xen(struct device *dev, u32 
> gsi,
> 
>  #if defined(CONFIG_PCI_MSI)
>  #include 
> -#include 
> 
>  struct xen_pci_frontend_ops *xen_pci_frontend;
>  EXPORT_SYMBOL_GPL(xen_pci_frontend);
> @@ -210,23 +209,20 @@ static int xen_setup_msi_irqs(struct pci_dev *dev, int 
> nvec, int type)
>   return ret;
>  }
> 
> -#define XEN_PIRQ_MSI_DATA  (MSI_DATA_TRIGGER_EDGE | \
> - MSI_DATA_LEVEL_ASSERT | (3 << 8) | MSI_DATA_VECTOR(0))
> -
>  static void xen_msi_compose_msg(struct pci_dev *pdev, unsigned int pirq,
>   struct msi_msg *msg)
>  {
> - /* We set vector == 0 to tell the hypervisor we don't care about it,
> -  * but we want a pirq setup instead.
> -  * We use the dest_id field to pass the pirq that we want. */
> - msg->address_hi = MSI_ADDR_BASE_HI | MSI_ADDR_EXT_DEST_ID(pirq);
> - msg->address_lo =
> - MSI_ADDR_BASE_LO |
> - MSI_ADDR_DEST_MODE_PHYSICAL |
> - MSI_ADDR_REDIRECTION_CPU |
> - MSI_ADDR_DEST_ID(pirq);
> -
> - msg->data = XEN_PIRQ_MSI_DATA;
> + /*
> +  * We set vector == 0 to tell the hypervisor we don't care about
> +  * it, but we want a pirq setup instead.  We use the dest_id fields
> +  * to pass the pirq that we want.
> +  */
> + memset(msg, 0, sizeof(*msg));
> + msg->address_hi = X86_MSI_BASE_ADDRESS_HIGH;
> + msg->arch_addr_hi.destid_8_31 = pirq >> 8;
> + msg->arch_addr_lo.destid_0_7 = pirq & 0xFF;
> + msg->arch_addr_lo.base_address = X86_MSI_BASE_ADDRESS_LOW;
> + msg->arch_data.delivery_mode = APIC_DELIVERY_MODE_EXTINT;
>  }

Just looking at a random one of these patches...

Does the compiler manage to optimise that reasonably?
Or does it generate a lot of shifts and masks as each
bitfield is set?

The code generation for bitfields is often a lot worse
than that for |= setting bits in a word.
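
ie comparing (both lifted from the hunk above):

	/* bitfield form - each assignment is typically a read-modify-write */
	msg->arch_addr_lo.base_address = X86_MSI_BASE_ADDRESS_LOW;
	msg->arch_addr_lo.destid_0_7 = pirq & 0xFF;

	/* mask form - the constants just get |'d together in a register */
	msg->address_lo = MSI_ADDR_BASE_LO | MSI_ADDR_DEST_ID(pirq);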

David



RE: [PATCH 8/8] WIP: add a dma_alloc_contiguous API

2020-10-14 Thread David Laight
> On Wed, Sep 30, 2020 at 6:09 PM Christoph Hellwig  wrote:
> >
> > Add a new API that returns a virtually non-contigous array of pages
> > and dma address.  This API is only implemented for dma-iommu and will
> > not be implemented for non-iommu DMA API instances that have to allocate
> > contiguous memory.  It is up to the caller to check if the API is
> > available.

Isn't there already a flag that is only implemented for ARM
systems with iommu that forces pages to actually be physically
contiguous?

So isn't this some kind of strange opposite?

David



RE: [PATCH 3/3] dma-mapping: better document dma_addr_t and DMA_MAPPING_ERROR

2020-09-22 Thread David Laight
From: Christoph Hellwig
> Sent: 22 September 2020 14:40
...
> @@ -131,6 +125,16 @@ struct dma_map_ops {
>   unsigned long (*get_merge_boundary)(struct device *dev);
>  };
> 
> +/*
> + * A dma_addr_t can hold any valid DMA or bus address for the platform.  It 
> can
> + * be given to a device to use as a DMA source or target.  A CPU cannot
> + * reference a dma_addr_t directly because there may be translation between 
> its
> + * physical address space and the bus address space.

The CPU can't access it 'directly' because it isn't a virtual address.

> + *
> + * DMA_MAPPING_ERROR is the magic error code if a mapping failed.  It should 
> not
> + * be used directly in drivers, but checked for using dma_mapping_error()
> + * instead.
> + */

I think it might be worth adding:

A dma_addr_t value may be device dependent and differ from the
'physical address' of the memory.

David



RE: [PATCH v8 7/7] iommu/vt-d: Use bounce buffer for untrusted devices

2019-08-30 Thread David Laight
From: Lu Baolu
> Sent: 30 August 2019 08:17

> The Intel VT-d hardware uses paging for DMA remapping.
> The minimum mapped window is a page size. The device
> drivers may map buffers not filling the whole IOMMU
> window. This allows the device to access to possibly
> unrelated memory and a malicious device could exploit
> this to perform DMA attacks. To address this, the
> Intel IOMMU driver will use bounce pages for those
> buffers which don't fill whole IOMMU pages.

Won't this completely kill performance?

I'd expect to see something for dma_alloc_coherent() (etc)
that tries to give the driver page sized buffers.

Either that or the driver could allocate page sized buffers
even though it only passes fragments of these buffers to
the dma functions (to avoid excessive cache invalidates).

Since you have to trust the driver, why not actually trust it?

David




RE: [PATCH 16/16] dma-mapping: use exact allocation in dma_alloc_contiguous

2019-06-14 Thread David Laight
From: Robin Murphy
> Sent: 14 June 2019 16:06
...
> Well, apart from the bit in DMA-API-HOWTO which has said this since
> forever (well, before Git history, at least):
> 
> "The CPU virtual address and the DMA address are both
> guaranteed to be aligned to the smallest PAGE_SIZE order which
> is greater than or equal to the requested size.  This invariant
> exists (for example) to guarantee that if you allocate a chunk
> which is smaller than or equal to 64 kilobytes, the extent of the
> buffer you receive will not cross a 64K boundary."

I knew it was somewhere :-)
Interestingly that also implies that the address returned for a size
of (say) 128 will also be page aligned.
In that case 128 byte alignment should probably be ok - but it is still
an API change that could have horrid consequences.

David



RE: [PATCH 16/16] dma-mapping: use exact allocation in dma_alloc_contiguous

2019-06-14 Thread David Laight
From: 'Christoph Hellwig'
> Sent: 14 June 2019 15:50
> To: David Laight
> On Fri, Jun 14, 2019 at 02:15:44PM +, David Laight wrote:
> > Does this still guarantee that requests for 16k will not cross a 16k 
> > boundary?
> > It looks like you are losing the alignment parameter.
> 
> The DMA API never gave you alignment guarantees to start with,
> and you can get not naturally aligned memory from many of our
> current implementations.

Hmmm...
I thought that was even documented.

I'm pretty sure there is a lot of code out there that makes that assumption.
Without it many drivers will have to allocate almost double the
amount of memory they actually need in order to get the required alignment.
So instead of saving memory you'll actually cause more to be used.

David



RE: [PATCH 16/16] dma-mapping: use exact allocation in dma_alloc_contiguous

2019-06-14 Thread David Laight
From: Christoph Hellwig
> Sent: 14 June 2019 14:47
> 
> Many architectures (e.g. arm, m68 and sh) have always used exact
> allocation in their dma coherent allocator, which avoids a lot of
> memory waste especially for larger allocations.  Lift this behavior
> into the generic allocator so that dma-direct and the generic IOMMU
> code benefit from this behavior as well.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/dma-contiguous.h |  8 +---
>  kernel/dma/contiguous.c| 17 +++--
>  2 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
> index c05d4e661489..2e542e314acf 100644
> --- a/include/linux/dma-contiguous.h
> +++ b/include/linux/dma-contiguous.h
> @@ -161,15 +161,17 @@ static inline struct page *dma_alloc_contiguous(struct 
> device *dev, size_t size,
>   gfp_t gfp)
>  {
>   int node = dev ? dev_to_node(dev) : NUMA_NO_NODE;
> - size_t align = get_order(PAGE_ALIGN(size));
> + void *cpu_addr = alloc_pages_exact_node(node, size, gfp);
> 
> - return alloc_pages_node(node, gfp, align);
> + if (!cpu_addr)
> + return NULL;
> + return virt_to_page(cpu_addr);
>  }

Does this still guarantee that requests for 16k will not cross a 16k boundary?
It looks like you are losing the alignment parameter.

There may be drivers and hardware that also require 12k allocations
to not cross 16k boundaries (etc).

David



RE: [PATCH v4 0/3] PCIe Host request to reserve IOVA

2019-05-02 Thread David Laight
From: Srinath Mannam
> Sent: 01 May 2019 16:23
...
> > > On Fri, Apr 12, 2019 at 08:43:32AM +0530, Srinath Mannam wrote:
> > > > Few SOCs have limitation that their PCIe host can't allow few inbound
> > > > address ranges. Allowed inbound address ranges are listed in dma-ranges
> > > > DT property and this address ranges are required to do IOVA mapping.
> > > > Remaining address ranges have to be reserved in IOVA mapping.

You probably want to fix the English in the first sentence.
The sense is reversed.

David



RE: [PATCH v2] dma-debug: Kconfig for PREALLOC_DMA_DEBUG_ENTRIES

2018-12-04 Thread David Laight
From: Qian Cai
> Sent: 30 November 2018 21:48
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com
> Cc: yisen.zhu...@huawei.com; salil.me...@huawei.com; john.ga...@huawei.com; 
> linux...@huawei.com;
> iommu@lists.linux-foundation.org; net...@vger.kernel.org; 
> linux-ker...@vger.kernel.org; Qian Cai
> Subject: [PATCH v2] dma-debug: Kconfig for PREALLOC_DMA_DEBUG_ENTRIES
> 
> The amount of DMA mappings from Hisilicon HNS ethernet devices is huge,
> so it could trigger "DMA-API: debugging out of memory - disabling".
> 
> hnae_get_handle [1]
>   hnae_init_queue
> hnae_init_ring
>   hnae_alloc_buffers [2]
> debug_dma_map_page
>   dma_entry_alloc
> 
> [1] for (i = 0; i < handle->q_num; i++)
> [2] for (i = 0; i < ring->desc_num; i++)
> 
> Also, "#define HNS_DSAF_MAX_DESC_CNT 1024"
> 
> On this Huawei TaiShan 2280 aarch64 server, it has reached the limit
> already,
> 
> 4 (NICs) x 16 (queues) x 1024 (port description numbers) = 65536
> 
> Added a Kconfig entry for PREALLOC_DMA_DEBUG_ENTRIES, so make it easier
> for users to deal with special cases like this.

Ugg. That is worse than a module parameter.

The driver needs to automatically reduce the number of mapping requests.

David



RE: [RFC v2] dma-coherent: introduce no-align to avoid allocation failure and save memory

2017-11-24 Thread David Laight
From: Jaewon Kim
> Sent: 24 November 2017 05:59
> 
> dma-coherent uses bitmap APIs which internally consider align based on the
> requested size. If most of allocations are small size like KBs, using
> alignment scheme seems to be good for anti-fragmentation. But if large
> allocation are commonly used, then an allocation could be failed because
> of the alignment. To avoid the allocation failure, we had to increase total
> size.
> 
> This is a example, total size is 30MB, only few memory at front is being
> used, and 9MB is being requested. Then 9MB will be aligned to 16MB. The
> first try on offset 0MB will be failed because others already are using
> them. The second try on offset 16MB will be failed because it is out of bounds.
> 
> So if the alignment is not necessary on a specific dma-coherent memory
> region, we can set no-align property. Then dma-coherent will ignore the
> alignment only for the memory region.

ISTM that the alignment needs to be a property of the request, not of the
device. Certainly the device driver code is most likely to know the specific
alignment requirements of any specific allocation.

We've some hardware that would need large allocations to be 16k aligned.
We actually use multiple 16k allocations because any large buffers are
accessed directly from userspace (mmap and vm_iomap_memory) and the
card has its own page tables (with 16k pages).

David



RE: [PATCH 5/9] PCI: host: brcmstb: add dma-ranges for inbound traffic

2017-10-25 Thread David Laight
From: Jim Quinlan
> Sent: 24 October 2017 19:08
...
> Hi David,  Christoph was also concerned about this:
> 
> "For the block world take a look at __blk_segment_map_sg which does the 
> merging
> of contiguous pages into a single SG segment.  You'd have to override
> BIOVEC_PHYS_MERGEABLE to prevent this from happening in your supported
> architectures for the block layer."
> 
> Do you consider this a non-issue as well or can this happen and span
> memory regions?

Probably easiest to have an architecture limit on the addresses that
can be merged.

My guess is that this code only really ever merges pages that were
allocated contiguously, but have got converted to a list of VA.
So stopping merging over even 1MB boundaries would have minimal effect
(even if done unconditionally).

I even wonder if the merging is actually worthwhile at all?
Maybe if any code has limits on the SG list length.
(or reducing the number of cache lines that get dirtied...)

David





RE: [PATCH 5/9] PCI: host: brcmstb: add dma-ranges for inbound traffic

2017-10-23 Thread David Laight
From: Jim Quinlan
> Sent: 20 October 2017 16:28
> On Fri, Oct 20, 2017 at 10:57 AM, Christoph Hellwig  wrote:
> > On Fri, Oct 20, 2017 at 10:41:56AM -0400, Jim Quinlan wrote:
> >> I am not sure I understand your comment -- the size of the request
> >> shouldn't be a factor.  Let's look at your example of the DMA request
> >> of 3f00 to 400f (physical memory).  Lets say it is for 15
> >> pages.  If we block out  the last page [0x3000..0x3fff] from
> >> what is available, there is no 15 page span that can happen across the
> >> 0x4000 boundary.  For SG, there can be no merge that connects a
> >> page from one region to another region.  Can you give an example of
> >> the scenario you are thinking of?
> >
> > What prevents a merge from say the regions of
> > 03fff and 40007fff?
> 
> Huh? [0x3000...x3ff] is not available to be used. Drawing from
> the original example, we now have to tell Linux that these are now our
> effective memory regions:
> 
>   memc0-a@[03fffefff] <=> pci@[03fffefff]
>   memc0-b@[1...13fffefff] <=> pci@[ 40007fffefff]
>   memc1-a@[ 40007fffefff] <=> pci@[ 8000bfffefff]
>   memc1-b@[3...33fffefff] <=> pci@[ c000efff]
>   memc2-a@[ 8000bfffefff] <=> pci@[1...13fffefff]
>   memc2-b@[c...c3fff] <=> pci@[14000...17fff]
> 
> This leaves a one-page gap between phsyical memory regions which would
> normally be contiguous. One cannot have a dma alloc that spans any two
> regions.  This is a drastic step, but I don't see an alternative.
> Perhaps  I may be missing what you are saying...

Isn't this all unnecessary?
Both kmalloc() and dma_alloc() are constrained to allocate memory
that doesn't cross an address boundary that is larger than the size.
So if you allocate 16k it won't cross a 16k physical address boundary.

David



RE: the imfamous asix ax88179 iommu error

2017-10-06 Thread David Laight
> >> It appears that my ax88179 is working just fine now with the vendor
> >> driver. So perhaps it's possible to revert the old commit in the linux
> >> kernel and allow the use of scatter gather ? (perhaps for non-intel
> >> hosts ? I'm not sure if this device is effected by intel xhci errata)
> >
> > Thanks for the tip-off - as luck would have it I have an AX88179 dongle
> > to hand! Does the out-of-tree driver behave significantly differently to
> > the mainline driver (CONFIG_USB_NET_AX88179_178A)?
> 
> To answer myself - yes, the out-of-tree driver produces all manner of
> >PAGE_SIZE offsets for dma_map_sg():
> 
> ...
> [  571.019335] xhci_hcd :04:00.0: ### sg->offset = 0x7770
> [  571.025497] xhci_hcd :04:00.0: ### sg->offset = 0x7a60
> [  571.031402] xhci_hcd :04:00.0: ### sg->offset = 0x1100
> [  571.037353] xhci_hcd :04:00.0: ### sg->offset = 0x1c50
> [  571.043254] xhci_hcd :04:00.0: ### sg->offset = 0x27a0

I remember a load of XHCI patches I had that were needed to get the
AX88179 working reliably
I think most of the ones to do with the TRB boundaries and ZLP have
been fixed now.
There was a problem with an ASMedia host controller (1b21:1042) that
it needed the command ring doorbell rung in the command completion
code in order to process a second command.
Not sure I've seen a fix for that.
I've still got the host system, but not the dongle.

David



RE: [PATCH 04/11] ia64: make dma_cache_sync a no-op

2017-10-04 Thread David Laight
From: Christoph Hellwig
> Sent: 03 October 2017 11:43
>
> ia64 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
> make any sense to do any work in dma_cache_sync given that it must be a
> no-op when dma_alloc_attrs returns coherent memory.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/ia64/include/asm/dma-mapping.h | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/dma-mapping.h 
> b/arch/ia64/include/asm/dma-mapping.h
> index 3ce5ab4339f3..99dfc1aa9d3c 100644
> --- a/arch/ia64/include/asm/dma-mapping.h
> +++ b/arch/ia64/include/asm/dma-mapping.h
> @@ -48,11 +48,6 @@ static inline void
>  dma_cache_sync (struct device *dev, void *vaddr, size_t size,
>   enum dma_data_direction dir)
>  {
> - /*
> -  * IA-64 is cache-coherent, so this is mostly a no-op.  However, we do 
> need to
> -  * ensure that dma_cache_sync() enforces order, hence the mb().
> -  */
> - mb();
>  }

Are you sure about this one?
It looks as though you are doing a mechanical change for all architectures.
Some of them are probably stranger than you realise.

Even with cache coherent memory any cpu 'store/write buffer' may not
be snooped by dma reads.

Something needs to flush the store buffer between the last cpu write
to the dma buffer and the write (probably a device register) that
tells the device it can read the memory.

My guess from the comment is that dma_cache_sync() is expected to
include that barrier - and it might not be anywhere else.

David



RE: [PATCH 02/11] x86: make dma_cache_sync a no-op

2017-10-03 Thread David Laight
From: Christoph Hellwig
> Sent: 03 October 2017 11:43
> x86 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
> make any sense to do any work in dma_cache_sync given that it must be a
> no-op when dma_alloc_attrs returns coherent memory.

I believe it is just about possible to require an explicit
write flush on x86.
ISTR this can happen with something like write combining.

Whether this can actually happen is the kernel is another matter.

David



RE: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-17 Thread David Laight
From: Alex Williamson
> Sent: 16 August 2017 17:56
...
> Firmware pissing match...  Processors running with 8k or less page size
> fall within the recommendations of the PCI spec for register alignment
> of MMIO regions of the device and this whole problem becomes less of an
> issue.

Actually if qemu is causing the MSI-X table accesses to fault, why doesn't
it just lie to the guest about the physical address of the MSI-X table?
Then mmio access to anything in the same physical page will just work.

It has already been pointed out that you can't actually police the
interrupts that are raised without host hardware support.

Actually, putting other vectors in the MSI-X table is boring; most
drivers will ignore unexpected interrupts.
Much more interesting are physical memory addresses and accessible IO
addresses.
Of course, a lot of boards have PCI master capability and can probably
be persuaded to do writes to specific location anyway.

David



RE: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-15 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 15 August 2017 02:34
> On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > Taking a step back, though, why does vfio-pci perform this check in the
> > > first place? If a malicious guest already has control of a device, any
> > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > message address/data it could simply do with a DMA write anyway, so the
> > > security argument doesn't stand up in general (sure, not all PCIe
> > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > tenuous security-by-obscurity angle to me).
> 
> I tried to make that point for years, thanks for re-iterating it :-)

Indeed, we have an FPGA based PCIe card where the MSI-X table is just a
piece of PCIe accessible memory.
The device driver has to read the MSI-X table and write the address+data
values to other registers which are then used to raise the interrupt.
(Ok, I've written a better interrupt generator so we don't do that
any more.)
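
Roughly what the driver had to do (the FPGA register names here are
made up):

	for (i = 0; i < nvec; i++) {
		/* MSI-X table entries are 16 bytes: addr lo, addr hi, data, ctrl */
		void __iomem *entry = msix_base + i * 16;

		writel(readl(entry + 0), regs + FPGA_MSI_ADDR_LO(i));
		writel(readl(entry + 4), regs + FPGA_MSI_ADDR_HI(i));
		writel(readl(entry + 8), regs + FPGA_MSI_DATA(i));
	}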

David



RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported

2016-05-13 Thread David Laight
From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: 13 May 2016 06:33
...
> Simply denying direct writes to the vector table or preventing mapping
> of the vector table into the user address space does not provide any
> tangible form of protection.  Many devices make use of window registers
> that allow backdoors to arbitrary device registers.  Some drivers even
> use this as the primary means for configuring MSI-X, which makes them
> incompatible with device assignment without device specific quirks to
> enable virtualization of these paths.

We have one fgpa based PCIe slave where the device driver has to read
the MSI-X table and then write the value to other fpga registers so
that the logic can generate the correct PCIe write cycle when an
interrupt is requested.
The MSI-X table itself is only accessible as a PCIe slave.

We also have host accessible DMA controllers that the device driver
uses to copy data to kernel memory.
These could easily be used to generate arbitrary MSI-X requests.
As I've said earlier it is almost certainly possible to get any
ethernet hardware to perform something similar.

So without hardware that is able to limit the memory and MSI-X
that each PCIe endpoint can access I believe that if a virtualisation
system gives a guest kernel direct access to a PCIe devices it gives
the guest kernel the ability to raise an MSI-X interrupt and read/write
any physical memory.
(I've not looked at the cpu virtualisation support, but do know what
the PCIe devices can do.)

More interestingly, probably the 'worst' thing (from a security point of view)
that changing the MSI-X table lets you do is a write to an arbitrary
physical memory address.

David




RE: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported

2016-05-05 Thread David Laight
From: Tian, Kevin
> Sent: 05 May 2016 10:37
...
> > Acutually, we are not aimed at accessing MSI-X table from
> > guest. So I think it's safe to passthrough MSI-X table if we
> > can make sure guest kernel would not touch MSI-X table in
> > normal code path such as para-virtualized guest kernel on PPC64.
> >
> 
> Then how do you prevent malicious guest kernel accessing it?

Or a malicious guest driver for an ethernet card setting up
the receive buffer ring to contain a single word entry that
contains the address associated with an MSI-X interrupt and
then using a loopback mode to cause a specific packet to be
received that writes the required word through that address.

Remember the PCIe cycle for an interrupt is a normal memory write
cycle.

David



RE: [RFC v6 06/10] PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag

2016-04-18 Thread David Laight
From: Yongji Xie
> Sent: 18 April 2016 11:59
> We introduce a new pci_bus_flags, PCI_BUS_FLAGS_MSI_REMAP
> which indicates all devices on the bus are protected by the
> hardware which supports IRQ remapping(intel naming).
> 
> This flag will be used to know whether it's safe to expose
> MSI-X tables of PCI BARs to userspace. Because the capability
> of IRQ remapping can guarantee the PCI device cannot trigger
> MSIs that correspond to interrupt IDs of other devices.

I'm worried that this entire series is going to break drivers
for existing hardware.

I understand some of the reasoning for 'vm pass through' configurations,
but there will be PCIe devices out there that have the MSI-X tables
in the same BAR as other device registers.
If you are lucky nothing else is in the same 4k area, but I wouldn't
assume it.

In any case, if the hardware can't police the card's master transfers
there is nothing to stop a different bus master block on the card
from raising MSI-X interrupts - they are just a PCIe write.
So all you are doing is raising the bar slightly and giving a very false
sense of security.

David



RE: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread David Laight
From: James Bottomley 
> Sent: 28 September 2015 16:12
> > > > The x86 cpus will also do 32bit wide rmw cycles for the 'bit' 
> > > > operations.
> > >
> > > That's different: it's an atomic RMW operation.  The problem with the
> > > alpha was that the operation wasn't atomic (meaning that it can't be
> > > interrupted and no intermediate output states are visible).
> >
> > It is only atomic if prefixed by the 'lock' prefix.
> > Normally the read and write are separate bus cycles.
> 
> The essential point is that x86 has atomic bit ops and byte writes.
> Early alpha did not.

Early alpha didn't have any byte accesses.

On x86 if you have the following:
	struct {
		char  a;
		volatile char b;
	} *foo;

	foo->a |= 4;

The compiler is likely to generate a 'bis #4, 0(rbx)' (or similar)
and the cpu will do two 32bit memory cycles that read and write
the 'volatile' field 'b'.
(gcc definitely used to do this...)

A lot of fields were made 32bit (and probably not bitfields) in the linux
kernel tree a year or two ago to avoid this very problem.

> > > > You still have to ensure the compiler doesn't do wider rmw cycles.
> > > > I believe the recent versions of gcc won't do wider accesses for 
> > > > volatile data.
> > >
> > > I don't understand this comment.  You seem to be implying gcc would do a
> > > 64 bit RMW for a 32 bit store ... that would be daft when a single
> > > instruction exists to perform the operation on all architectures.
> >
> > Read the object code and weep...
> > It is most likely to happen for operations that are rmw (eg bit set).
> > For instance the arm cpu has limited offsets for 16bit accesses, for
> > normal structures the compiler is likely to use a 32bit rmw sequence
> > for a 16bit field that has a large offset.
> > The C language allows the compiler to do it for any access (IIRC including
> > volatiles).
> 
> I think you might be confusing different things.  Most RISC CPUs can't
> do 32 bit store immediates because there aren't enough bits in their
> arsenal, so they tend to split 32 bit loads into a left and right part
> (first the top then the offset).  This (and other things) are mostly
> what you see in code.  However, 32 bit register stores are still atomic,
> which is all we require.  It's not really the compiler's fault, it's
> mostly an architectural limitation.

No, I'm not talking about how 32bit constants are generated.
I'm talking about structure offsets.

David



RE: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread David Laight
From: James Bottomley [mailto:james.bottom...@hansenpartnership.com]
> Sent: 28 September 2015 15:27
> On Mon, 2015-09-28 at 08:58 +, David Laight wrote:
> > From: Rafael J. Wysocki
> > > Sent: 27 September 2015 15:09
> > ...
> > > > > Say you have three adjacent fields in a structure, x, y, z, each one 
> > > > > byte long.
> > > > > Initially, all of them are equal to 0.
> > > > >
> > > > > CPU A writes 1 to x and CPU B writes 2 to y at the same time.
> > > > >
> > > > > What's the result?
> > > >
> > > > I think every CPU's cache architecture guarantees adjacent store
> > > > integrity, even in the face of SMP, so it's x==1 and y==2.  If you're
> > > > thinking of old alpha SMP system where the lowest store width is 32 bits
> > > > and thus you have to do RMW to update a byte, this was usually fixed by
> > > > padding (assuming the structure is not packed).  However, it was such a
> > > > problem that even the later alpha chips had byte extensions.
> >
> > Does linux still support those old Alphas?
> >
> > The x86 cpus will also do 32bit wide rmw cycles for the 'bit' operations.
> 
> That's different: it's an atomic RMW operation.  The problem with the
> alpha was that the operation wasn't atomic (meaning that it can't be
> interrupted and no intermediate output states are visible).

It is only atomic if prefixed by the 'lock' prefix.
Normally the read and write are separate bus cycles.
 
> > You still have to ensure the compiler doesn't do wider rmw cycles.
> > I believe the recent versions of gcc won't do wider accesses for volatile 
> > data.
> 
> I don't understand this comment.  You seem to be implying gcc would do a
> 64 bit RMW for a 32 bit store ... that would be daft when a single
> instruction exists to perform the operation on all architectures.

Read the object code and weep...
It is most likely to happen for operations that are rmw (eg bit set).
For instance the arm cpu has limited offsets for 16bit accesses, for
normal structures the compiler is likely to use a 32bit rmw sequence
for a 16bit field that has a large offset.
The C language allows the compiler to do it for any access (IIRC including
volatiles).

David



RE: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread David Laight
From: Rafael J. Wysocki
> Sent: 27 September 2015 15:09
...
> > > Say you have three adjacent fields in a structure, x, y, z, each one byte 
> > > long.
> > > Initially, all of them are equal to 0.
> > >
> > > CPU A writes 1 to x and CPU B writes 2 to y at the same time.
> > >
> > > What's the result?
> >
> > I think every CPU's cache architecture guarantees adjacent store
> > integrity, even in the face of SMP, so it's x==1 and y==2.  If you're
> > thinking of old alpha SMP system where the lowest store width is 32 bits
> > and thus you have to do RMW to update a byte, this was usually fixed by
> > padding (assuming the structure is not packed).  However, it was such a
> > problem that even the later alpha chips had byte extensions.

Does linux still support those old Alphas?

The x86 cpus will also do 32bit wide rmw cycles for the 'bit' operations.

> OK, thanks!

You still have to ensure the compiler doesn't do wider rmw cycles.
I believe the recent versions of gcc won't do wider accesses for volatile data.

David



RE: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-07-08 Thread David Laight
From: Bjorn Helgaas
...
> >> Even if you do that, you ought to write valid interrupt information
> >> into the 4th slot (maybe replicating one of the earlier interrupts).
> >> Then, if the device does raise the 'unexpected' interrupt you don't
> >> get a write to a random kernel location.
> >
> > I might be missing something, but we are talking of MSI address space
> > here, aren't we? I am not getting how we could end up with a 'write'
> > to a random kernel location when a unclaimed MSI vector sent. We could
> > only expect a spurious interrupt at worst, which is handled and reported.
> 
> Yes, that's how I understand it.  With MSI, the OS specifies the a
> single Message Address, e.g., a LAPIC address, and a single Message
> Data value, e.g., a vector number that will be written to the LAPIC.
> The device is permitted to modify some low-order bits of the Message
> Data to send one of several vector numbers (the MME value tells the
> device how many bits it can modify).
> 
> Bottom line, I think a spurious interrupt is the failure we'd expect
> if a device used more vectors than the OS expects it to.

So you need to tell the device where to write in order to raise the
'spurious interrupt'.

David



RE: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-07-04 Thread David Laight
From: Alexander Gordeev
...
> > Even if you do that, you ought to write valid interrupt information
> > into the 4th slot (maybe replicating one of the earlier interrupts).
> > Then, if the device does raise the 'unexpected' interrupt you don't
> > get a write to a random kernel location.
> 
> I might be missing something, but we are talking of MSI address space
> here, aren't we? I am not getting how we could end up with a 'write'
> to a random kernel location when a unclaimed MSI vector sent. We could
> only expect a spurious interrupt at worst, which is handled and reported.
> 
> Anyway, as I described in my reply to Bjorn, this is not a concern IMO.

I'm thinking of the following - which might be MSI-X ?
1) Hardware requests some interrupts and tells the host the BAR (and offset)
   where the 'vectors' should be written.
2) To raise an interrupt the hardware uses the 'vector' as the address
   of a normal PCIe write cycle.

So if the hardware requests 4 interrupts, but the driver (believing it
will only use 3) only writes 3 vectors, and then the hardware uses the
4th vector it can write to a random location.

Debugging that would be hard!

David





RE: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-07-03 Thread David Laight
From: Bjorn Helgaas
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register while aligned on
> > power of 2 boundary value of actually used MSI vectors 'nvec'
> > is a lesser of that MME value:
> >
> > roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> >
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> >
> > 'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round up that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.
> 
> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.
...

Even if you do that, you ought to write valid interrupt information
into the 4th slot (maybe replicating one of the earlier interrupts).
Then, if the device does raise the 'unexpected' interrupt you don't
get a write to a random kernel location.

Plausibly something similar should be done when a smaller number of
interrupts is assigned.

David
