Re: [syzbot] WARNING in __dma_map_sg_attrs

2021-12-01 Thread Christoph Hellwig
This means the virtgpu driver uses the DMA mapping helpers but has not set up
a DMA mask (which most likely suggests it is some kind of virtual device).
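
For illustration, a minimal sketch of what a driver that calls the DMA mapping
helpers (as the DRM PRIME import path does here) is expected to do first; the
probe function name and the 64-bit mask are assumptions for the example, not
the actual virtgpu fix:

#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Sketch only: declare the device's DMA addressing capability during probe,
 * before any dma_map_sg()/dma_map_sgtable() call is made on its behalf.
 * The 64-bit mask below is an assumption for illustration. */
static int example_probe(struct device *dev)
{
	int ret;

	ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64));
	if (ret)
		return ret;	/* the platform cannot satisfy the requested mask */

	/* dma_map_sgtable() and friends are only legitimate from here on */
	return 0;
}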

On Wed, Dec 01, 2021 at 10:18:21AM -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:c5c17547b778 Merge tag 'net-5.16-rc3' of git://git.kernel...
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=13a73609b0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=bf85c53718a1e697
> dashboard link: https://syzkaller.appspot.com/bug?extid=10e27961f4da37c443b2
> compiler:   gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils 
> for Debian) 2.35.2
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+10e27961f4da37c44...@syzkaller.appspotmail.com
> 
> [ cut here ]
> WARNING: CPU: 2 PID: 17169 at kernel/dma/mapping.c:188 
> __dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188
> Modules linked in:
> CPU: 0 PID: 17169 Comm: syz-executor.3 Not tainted 5.16.0-rc2-syzkaller #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> RIP: 0010:__dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188
> Code: 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 10 00 75 71 4c 8b 3d 70 6d b1 
> 0d e9 db fe ff ff e8 86 ff 12 00 0f 0b e8 7f ff 12 00 <0f> 0b 45 31 e4 e9 54 
> ff ff ff e8 70 ff 12 00 49 8d 7f 50 48 b8 00
> RSP: 0018:c90002c0fb20 EFLAGS: 00010216
> RAX: 00013018 RBX: 0020 RCX: c900037d4000
> RDX: 0004 RSI: 8163d361 RDI: 8880182ae4d0
> RBP: 8880182ae088 R08: 0002 R09: 888017ba054f
> R10: 8163d242 R11: 0008808a R12: 
> R13: 888024ca5700 R14: 0001 R15: 
> FS:  7fa269e34700() GS:88802cb0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0040c120 CR3: 6c77c000 CR4: 00150ee0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  dma_map_sgtable+0x70/0xf0 kernel/dma/mapping.c:264
>  drm_gem_map_dma_buf+0x12a/0x1e0 drivers/gpu/drm/drm_prime.c:633
>  __map_dma_buf drivers/dma-buf/dma-buf.c:675 [inline]
>  dma_buf_map_attachment+0x39a/0x5b0 drivers/dma-buf/dma-buf.c:954
>  drm_gem_prime_import_dev.part.0+0x85/0x220 drivers/gpu/drm/drm_prime.c:939
>  drm_gem_prime_import_dev drivers/gpu/drm/drm_prime.c:982 [inline]
>  drm_gem_prime_import+0xc8/0x200 drivers/gpu/drm/drm_prime.c:982
>  virtgpu_gem_prime_import+0x49/0x150 
> drivers/gpu/drm/virtio/virtgpu_prime.c:166
>  drm_gem_prime_fd_to_handle+0x21d/0x550 drivers/gpu/drm/drm_prime.c:318
>  drm_prime_fd_to_handle_ioctl+0x9b/0xd0 drivers/gpu/drm/drm_prime.c:374
>  drm_ioctl_kernel+0x27d/0x4e0 drivers/gpu/drm/drm_ioctl.c:782
>  drm_ioctl+0x51e/0x9d0 drivers/gpu/drm/drm_ioctl.c:885
>  vfs_ioctl fs/ioctl.c:51 [inline]
>  __do_sys_ioctl fs/ioctl.c:874 [inline]
>  __se_sys_ioctl fs/ioctl.c:860 [inline]
>  __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7fa26c8beae9
> Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7fa269e34188 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 7fa26c9d1f60 RCX: 7fa26c8beae9
> RDX: 24c0 RSI: c00c642e RDI: 0005
> RBP: 7fa26c918f6d R08:  R09: 
> R10:  R11: 0246 R12: 
> R13: 7ffc0019c51f R14: 7fa269e34300 R15: 00022000
>  
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
---end quoted text---


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
Jason,

On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> Which in turn is consistent all over the place and does not require any
> special case for anything. Neither for interrupts nor for anything else.

That said, feel free to tell me that I'm getting it all wrong.

The reason I'm harping on this is that we are creating ABIs on several
ends and we all know that getting that wrong is a major pain.

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
Dave,

On Wed, Dec 01 2021 at 15:53, Dave Jiang wrote:
> On 12/1/2021 3:03 PM, Thomas Gleixner wrote:
>> This still depends on how this overall discussion about representation
>> of all of this stuff is resolved.
>>
 What needs a subdevice to expose?
>> Can you answer that too please?
>
> Sorry. So the initial version of the IDXD sub-device is represented with a
> single queue. It needs a command irq (emulated) and a completion irq
> that is backed by a device vector (currently IMS).

Thanks for the clarification! Let me think about it some more.

   tglx


Re: [PATCH 0/5] iommu/amd: fixes for suspend/resume

2021-12-01 Thread Maxim Levitsky
On Tue, 2021-11-23 at 18:10 +0200, Maxim Levitsky wrote:
> As I sadly found out, an S3 cycle makes the AMD IOMMU stop sending interrupts
> until the system is rebooted.
> 
> I only noticed it now because the IOMMU otherwise works, and these interrupts
> are only used for errors and for the GA log, which I tend not to use since
> I make my VMs do mwait/pause/etc. in the guest (cpu-pm=on).
> 
> There are two issues here that prevent interrupts from being generated after
> an S3 cycle:
> 
> 1. The GA log base address was not restored after resume; it was all zeroed
> (by the BIOS or such).
> 
> In theory, if the BIOS writes some junk to it, that can even cause memory
> corruption.
> Patch 2 fixes that.
> 
> 2. INTX (aka x2apic mode) settings were not restored after resume.
> That mode is used regardless of whether the host uses/supports x2apic; it is
> used when the IOMMU supports it, and mine does.
> Patches 3-4 fix that.
> 
> Note that there is still one slight (userspace) bug remaining:
> During suspend all but the boot CPU are offlined and then after resume
> are onlined again.
> 
> The offlining moves all non-affinity-managed interrupts to CPU0, and
> later, when all other CPUs are onlined, there is nothing in the kernel
> to spread the interrupts back over the cores.
> 
> The userspace 'irqbalance' daemon does fix this, but it seems to ignore
> the IOMMU interrupts in INTX mode since they are not attached to any
> PCI device, and thus they remain on CPU0 after an S3 cycle,
> which is suboptimal when the system has multiple IOMMUs
> (mine has 4 of them).
> 
> Setting the IRQ affinity manually via /proc/irq/ does work.
> 
> This was tested on my 3970X with both INTX and regular MSI mode (the latter
> was enabled by patching out INTX detection), by running a guest with AVIC
> enabled and with a PCI assigned device (network card), and observing
> interrupts from the IOMMU while the guest is mostly idle.
> 
> This was also tested on my AMD laptop with a 4650U (which has the same issue);
> there I tested only INTX mode.
> 
> Patch 1 is a small refactoring to remove an unused struct field.
> 
> Best regards,
>Maxim Levitsky
> 
> Maxim Levitsky (5):
>   iommu/amd: restore GA log/tail pointer on host resume
>   iommu/amd: x2apic mode: re-enable after resume
>   iommu/amd: x2apic mode: setup the INTX registers on mask/unmask
>   iommu/amd: x2apic mode: mask/unmask interrupts on suspend/resume
>   iommu/amd: remove useless irq affinity notifier
> 
>  drivers/iommu/amd/amd_iommu_types.h |   2 -
>  drivers/iommu/amd/init.c| 107 +++-
>  2 files changed, 58 insertions(+), 51 deletions(-)
> 
> -- 
> 2.26.3
> 
> 

Polite ping on these patches.
Best regards,
Maxim Levitsky



Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 3:03 PM, Thomas Gleixner wrote:

On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:

On 12/1/2021 2:44 PM, Thomas Gleixner wrote:

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with a INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors, which I assume is the limitation of the
physical device, right?

I think I was confused with what you were asking and was thinking you
are saying why can't we just have MSIX on guest backed by the MSIX on
the physical device and thought there would not be enough vectors to
service the many guests. I think I understand what your position is now
with the clarification above.

This still depends on how this overall discussion about representation
of all of this stuff is resolved.


What needs a subdevice to expose?

Can you answer that too please?


Sorry. So initial version of the IDXD sub-device is represented with a 
single queue. It needs a command irq (emulated) and a completion irq 
that is backed by a device vector (currently IMS).





Thanks,

 tglx



Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:
> On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
>> How that is backed on the host does not really matter. You can expose
>> MSI-X to the guest with a INTx backing as well.
>>
>> I'm still failing to see the connection between the 9 MSIX vectors and
>> the 2048 IMS vectors, which I assume is the limitation of the
>> physical device, right?
>
> I think I was confused with what you were asking and was thinking you 
> are saying why can't we just have MSIX on guest backed by the MSIX on 
> the physical device and thought there would not be enough vectors to 
> service the many guests. I think I understand what your position is now 
> with the clarification above.

This still depends on how this overall discussion about representation
of all of this stuff is resolved.

>> What needs a subdevice to expose?

Can you answer that too please?

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 2:44 PM, Thomas Gleixner wrote:

On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:

On 12/1/2021 1:25 PM, Thomas Gleixner wrote:

The hardware implementation does not have enough MSIX vectors for
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
vectors. So if we are to do MSI-X for all of them, then we need to do
the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?

I'm talking about carving out a subdevice. I had the impression of you
wanting IMS passed through for all variations. But it sounds like for a
sub-device, you are ok with the implementation of MSIX backed by IMS?

I don't see anything wrong with that. A subdevice is its own entity and
VFIO can choose the most convenient representation of it to the guest
obviously.

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with a INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors, which I assume is the limitation of the
physical device, right?


I think I was confused with what you were asking and was thinking you 
are saying why can't we just have MSIX on guest backed by the MSIX on 
the physical device and thought there would not be enough vectors to 
service the many guests. I think I understand what your position is now 
with the clarification above.





What needs a subdevice to expose?

Thanks,

 tglx






Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
> On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
>>> The hardware implementation does not have enough MSIX vectors for
>>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
>>> vectors. So if we are to do MSI-X for all of them, then we need to do
>>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
>> Confused. Are you talking about passing a full IDXD device to the guest
>> or about passing a carved out subdevice, aka. queue?
>
> I'm talking about carving out a subdevice. I had the impression of you 
> wanting IMS passed through for all variations. But it sounds like for a 
> sub-device, you are ok with the implementation of MSIX backed by IMS?

I don't see anything wrong with that. A subdevice is its own entity and
VFIO can choose the most convenient representation of it to the guest
obviously.

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with a INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors, which I assume is the limitation of the
physical device, right?

What needs a subdevice to expose?

Thanks,

tglx





Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 1:25 PM, Thomas Gleixner wrote:

On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:

On 12/1/2021 11:41 AM, Thomas Gleixner wrote:

Hi Thomas. This is actually the IDXD usage for a mediated device passed
to a guest kernel when we plumb the pass through of IMS to the guest
rather than doing previous implementation of having a MSIX vector on
guest backed by IMS.

Which makes a lot of sense.


The control block for the mediated device is emulated and therefore an
emulated MSIX vector will be surfaced as vector 0. However the queues
will be backed by IMS vectors. So we end up needing MSIX and IMS to coexist
on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?

The hardware implementation does not have enough MSIX vectors for
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
vectors. So if we are to do MSI-X for all of them, then we need to do
the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?


I'm talking about carving out a subdevice. I had the impression of you 
wanting IMS passed through for all variations. But it sounds like for a 
sub-device, you are ok with the implementation of MSIX backed by IMS?





Thanks,

 tglx




[syzbot] WARNING in __dma_map_sg_attrs

2021-12-01 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:c5c17547b778 Merge tag 'net-5.16-rc3' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=13a73609b0
kernel config:  https://syzkaller.appspot.com/x/.config?x=bf85c53718a1e697
dashboard link: https://syzkaller.appspot.com/bug?extid=10e27961f4da37c443b2
compiler:   gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for 
Debian) 2.35.2

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+10e27961f4da37c44...@syzkaller.appspotmail.com

[ cut here ]
WARNING: CPU: 2 PID: 17169 at kernel/dma/mapping.c:188 
__dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188
Modules linked in:
CPU: 0 PID: 17169 Comm: syz-executor.3 Not tainted 5.16.0-rc2-syzkaller #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
RIP: 0010:__dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188
Code: 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 10 00 75 71 4c 8b 3d 70 6d b1 
0d e9 db fe ff ff e8 86 ff 12 00 0f 0b e8 7f ff 12 00 <0f> 0b 45 31 e4 e9 54 ff 
ff ff e8 70 ff 12 00 49 8d 7f 50 48 b8 00
RSP: 0018:c90002c0fb20 EFLAGS: 00010216
RAX: 00013018 RBX: 0020 RCX: c900037d4000
RDX: 0004 RSI: 8163d361 RDI: 8880182ae4d0
RBP: 8880182ae088 R08: 0002 R09: 888017ba054f
R10: 8163d242 R11: 0008808a R12: 
R13: 888024ca5700 R14: 0001 R15: 
FS:  7fa269e34700() GS:88802cb0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0040c120 CR3: 6c77c000 CR4: 00150ee0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 
 dma_map_sgtable+0x70/0xf0 kernel/dma/mapping.c:264
 drm_gem_map_dma_buf+0x12a/0x1e0 drivers/gpu/drm/drm_prime.c:633
 __map_dma_buf drivers/dma-buf/dma-buf.c:675 [inline]
 dma_buf_map_attachment+0x39a/0x5b0 drivers/dma-buf/dma-buf.c:954
 drm_gem_prime_import_dev.part.0+0x85/0x220 drivers/gpu/drm/drm_prime.c:939
 drm_gem_prime_import_dev drivers/gpu/drm/drm_prime.c:982 [inline]
 drm_gem_prime_import+0xc8/0x200 drivers/gpu/drm/drm_prime.c:982
 virtgpu_gem_prime_import+0x49/0x150 drivers/gpu/drm/virtio/virtgpu_prime.c:166
 drm_gem_prime_fd_to_handle+0x21d/0x550 drivers/gpu/drm/drm_prime.c:318
 drm_prime_fd_to_handle_ioctl+0x9b/0xd0 drivers/gpu/drm/drm_prime.c:374
 drm_ioctl_kernel+0x27d/0x4e0 drivers/gpu/drm/drm_ioctl.c:782
 drm_ioctl+0x51e/0x9d0 drivers/gpu/drm/drm_ioctl.c:885
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:874 [inline]
 __se_sys_ioctl fs/ioctl.c:860 [inline]
 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fa26c8beae9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 
c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:7fa269e34188 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7fa26c9d1f60 RCX: 7fa26c8beae9
RDX: 24c0 RSI: c00c642e RDI: 0005
RBP: 7fa26c918f6d R08:  R09: 
R10:  R11: 0246 R12: 
R13: 7ffc0019c51f R14: 7fa269e34300 R15: 00022000
 


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:
> On 12/1/2021 11:41 AM, Thomas Gleixner wrote:
>>> Hi Thomas. This is actually the IDXD usage for a mediated device passed
>>> to a guest kernel when we plumb the pass through of IMS to the guest
>>> rather than doing previous implementation of having a MSIX vector on
>>> guest backed by IMS.
>> Which makes a lot of sense.
>>
>>> The control block for the mediated device is emulated and therefore an
>>> emulated MSIX vector will be surfaced as vector 0. However the queues
>>> will be backed by IMS vectors. So we end up needing MSIX and IMS to coexist
>>> on the guest kernel for the same device.
>> Why? What's wrong with using straight MSI-X for all of them?
>
> The hardware implementation does not have enough MSIX vectors for 
> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS 
> vectors. So if we are to do MSI-X for all of them, then we need to do 
> the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
Jason,

On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
>> But NTB is operating through an abstraction layer and is not a direct
>> PCIe device driver.
>
> I'm not sure exactly how NTB seems to be split between switchtec and
> the ntb code, but since the ntb code seems to be doing MMIO touches,
> it feels like part of a PCIe driver?

It's a maze of magic PCIe driver with callbacks left and right into the
underlying NTB hardware driver which itself does some stuff on its own
and also calls back into the underlying PCIe driver.

Decomposing that thing feels like being trapped in the Labyrinth of
Knossos. But let's ignore that for a moment.

>> The VFIO driver does not own the irq chip ever. The irq chip is of
>> course part of the underlying infrastructure. I never asked for that.
>
> That isn't quite what I meant. I meant the PCIe driver cannot create
> the domain or make use of the irq_chip until the VFIO layer comes
> along and provides the struct device. To me this is backwards
> layering, the interrupts come from the PCIe layer and should exist
> independently from VFIO.

See below.

>>  When it allocates a slice for whatever usage then it also
>>  allocates the IMS interrupts (though the VFIO people want to
>>  have only one and do the allocations later on demand).
>> 
>>  That allocation cannot be part of the PCI/MSIx interrupt
>>  domain as we already agreed on.
>
> Yes, it is just an open question of where the new irq_domain need to
> reside

The irqdomain is created by and part of the physical device. But that
does not mean that the interrupts which are allocated from that irq
domain are stored in the physical device representation.

Going by that logic, the PCI/MSI domain would store all MSI[X]
interrupts which are allocated on some root bridge or wherever.

They are obviously stored per PCI device, but the irqdomain is owned
e.g. by the underlying IOMMU zone.

The irqdomain is managing and handing out resources. Like any other
resource manager does.

See?

>> 1) Storage
>> 
>>A) Having "subdevices" solves the storage problem nicely and
>>   makes everything just fall in place. Even for a purely
>>   physical multiqueue device one can argue that each queue is a
>>   "subdevice" of the physical device. The fact that we lump them
>>   all together today is not an argument against that.
>
> I don't like the idea that a queue is a device, that is trying to force
> a struct device centric world onto a queue which doesn't really want
> it..

Here we are at the point where we agree to disagree.

Of course a queue is a resource, but what prevents us from representing a
queue as a carved-out subdevice of the physical device?

Look at it from the VF point of view. If VFs are disabled then all
resources belong to the physical device. If VFs are enabled then the
hardware/firmware splits the resources into separate subdevices which
have their own device representation, interrupt storage etc.

As VFs have a scalability limitation due to the underlying PCIe
restrictions, the step we are talking about now is to split the queues up
in software, which means nothing else than creating a software
representation of finer-grained and more scalable subdevices.

So why would we want to pretend that these are not devices?

They are from a conceptual and topology view a subdevice of the physical
device. Just because they are named queues does not make it any
different.

Let's look at VFIO again. If VFIO passes a VF through then it builds a
wrapper around the VF device.

So if a queue is represented as a subdevice, then VFIO can just build
a wrapper around that subdevice.

But with your model, VFIO has to create a device, request a queue,
wrestle the interrupts in place, etc. Which is exactly the opposite of
how VFs are handled.

So again, why would we want to make software managed subdevices look
exactly the opposite way like hardware/firmware managed subdevices?

Let me also look at the cdev which is exposed by the physical device.

The cdev is nothing else than a software vehicle to create a /dev/
node. (ignore the switchtec case which (ab)uses the cdev to connect to
NTB). The cdev allows user space to allocate/release a resource.

Of course it can just allocate a queue and stick the necessary
interrupts into the physical device MSI descriptor storage.

But there is no reason why the queue allocation cannot allocate a
subdevice representing the queue.

That makes the VFIO and the cdev case just using the same underlying
representation which can expose its properties via the underlying
device.

Which in turn is consistent all over the place and does not require any
special case for anything. Neither for interrupts nor for anything else.

Thanks,

tglx




Re: [PATCH v2 29/33] iommu: Use put_pages_list

2021-12-01 Thread Robin Murphy

On 2021-12-01 19:07, Matthew Wilcox wrote:

On Wed, Dec 01, 2021 at 07:15:06PM +0100, Vlastimil Babka wrote:

From: "Matthew Wilcox (Oracle)" 

page->freelist is for the use of slab.  We already have the ability
to free a list of pages in the core mm, but it requires the use of a
list_head and for the pages to be chained together through page->lru.
Switch the iommu code over to using put_pages_list().


FYI, this is going to have conflicts with
https://lore.kernel.org/lkml/cover.1637671820.git.robin.mur...@arm.com/

I'm not sure what the appropriate resolution is going to be here;
maybe hold back part of this patch series to the following merge
window to give the iommu people time to merge their own patches?


More than that, this version is subtly but catastrophically broken - we 
can't simply pass &gather->freelist through the current IOVA entry_dtor 
machinery, since gather is a stack variable from a few call frames up so 
the actual list head will be long gone by the time 
iommu_dma_entry_dtor() tries to dereference it. It took until I was 
elbow-deep in refactoring the RFC to notice that one :)
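
For illustration, a minimal userspace sketch of that lifetime hazard (made-up
names, not the iommu code): an entry linked to a stack-allocated list head
keeps pointing back at that head after the owning function returns, so a
deferred destructor walking the list ends up touching dead stack memory.

#include <stdio.h>
#include <stdlib.h>

/* Minimal stand-in for struct list_head. */
struct node {
	struct node *prev, *next;
};

static struct node *deferred;	/* stands in for the deferred-free machinery */

static void gather(void)
{
	struct node head = { &head, &head };	/* list head lives on the stack */
	struct node *e = malloc(sizeof(*e));

	if (!e)
		return;
	/* link the entry after the head, as list_add() would */
	e->next = head.next;
	e->prev = &head;
	head.next->prev = e;
	head.next = e;

	deferred = e;	/* the entry (and its pointer to 'head') outlives this frame */
}

int main(void)
{
	gather();
	/* A destructor running later and walking back to the list head, e.g.
	 * via deferred->prev->next, would dereference gather()'s dead stack frame. */
	printf("stale head pointer: %p\n", (void *)deferred->prev);
	free(deferred);
	return 0;
}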


Thanks,
Robin.


Re: [PATCH v2 29/33] iommu: Use put_pages_list

2021-12-01 Thread Matthew Wilcox
On Wed, Dec 01, 2021 at 07:15:06PM +0100, Vlastimil Babka wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> page->freelist is for the use of slab.  We already have the ability
> to free a list of pages in the core mm, but it requires the use of a
> list_head and for the pages to be chained together through page->lru.
> Switch the iommu code over to using put_pages_list().
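
For illustration, a rough sketch of the pattern being converted to, using
generic kernel helpers (the surrounding function is made up; the real call
sites are in the patch itself): pages are collected on a list_head through
page->lru and released in one call, rather than walked and freed one by one
through page->freelist.

#include <linux/list.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Sketch only: gather pages on a local list, then free them in bulk. */
static void example_collect_and_free(void)
{
	LIST_HEAD(freelist);
	struct page *p;
	int i;

	for (i = 0; i < 4; i++) {
		p = alloc_page(GFP_KERNEL);
		if (!p)
			break;
		list_add(&p->lru, &freelist);	/* chain through page->lru */
	}

	put_pages_list(&freelist);	/* drops a reference on each page, freeing them */
}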

FYI, this is going to have conflicts with
https://lore.kernel.org/lkml/cover.1637671820.git.robin.mur...@arm.com/

I'm not sure what the appropriate resolution is going to be here;
maybe hold back part of this patch series to the following merge
window to give the iommu people time to merge their own patches?


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 11:41 AM, Thomas Gleixner wrote:

Dave,

please trim your replies.

On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:


On 12/1/2021 3:16 AM, Thomas Gleixner wrote:

Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:

On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as physical function device as a whole, but that's
conceptually a different case.

Hi Thomas. This is actually the IDXD usage for a mediated device passed
to a guest kernel when we plumb the pass through of IMS to the guest
rather than doing previous implementation of having a MSIX vector on
guest backed by IMS.

Which makes a lot of sense.


The control block for the mediated device is emulated and therefore an
emulated MSIX vector will be surfaced as vector 0. However the queues
will be backed by IMS vectors. So we end up needing MSIX and IMS to coexist
on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?


The hardware implementation does not have enough MSIX vectors for 
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS 
vectors. So if we are to do MSI-X for all of them, then we need to do 
the IMS backed MSIX scheme rather than passthrough IMS to guests.





Thanks,

 tglx




Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Logan Gunthorpe



On 2021-12-01 11:14 a.m., 'Jason Gunthorpe' via linux-ntb wrote:
> On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
>>> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
 Looking at the device slices as subdevices with their own struct device
 makes a lot of sense from the conceptual level.
>>>
>>> Except IMS is not just for subdevices, it should be usable for any
>>> driver in any case as a general interrupt mechanism, as you alluded to
>>> below about ethernet queues. ntb seems to be the current example of
>>> this need..
>>
>> But NTB is operating through an abstraction layer and is not a direct
>> PCIe device driver.
> 
> I'm not sure exactly how NTB seems to be split between switchtec and
> the ntb code, but since the ntb code seems to be doing MMIO touches,
> it feels like part of a PCIe driver?

Eh, sort of. NTB has lots of layers. At the top you'll find ntb_netdev
which is a network interface. Below that is ntb_transport() which is a
generic queue pair. Below that is the hardware driver itself (ie
switchtec) through the abstraction layer. The switchtec driver is split
in two: the main driver which just allows for information and
administration of the switch itself and switchtec_ntb which is the
module that provides the NTB abstractions to twiddle its registers.

ntb_transport() doesn't directly do MMIO touches (as it doesn't know
what the underlying hardware registers are), except for the memory
windows, which are usually set up to be a specific BAR (or parts of a
BAR). ntb_transport will do MMIO writes to those specific BAR addresses,
which correspond to writing into buffers on the peer.

Logan


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
Dave,

please trim your replies.

On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:

> On 12/1/2021 3:16 AM, Thomas Gleixner wrote:
>> Jason,
>>
>> CC+ IOMMU folks
>>
>> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>
>> Though I fear there is also a use case for MSI-X and IMS tied to the
>> same device. That network card you are talking about might end up using
>> MSI-X for a control block and then IMS for the actual network queues
>> when it is used as physical function device as a whole, but that's
>> conceptually a different case.
>
> Hi Thomas. This is actually the IDXD usage for a mediated device passed 
> to a guest kernel when we plumb the pass through of IMS to the guest 
> rather than doing previous implementation of having a MSIX vector on 
> guest backed by IMS.

Which makes a lot of sense.

> The control block for the mediated device is emulated and therefore an
> emulated MSIX vector will be surfaced as vector 0. However the queues
> will be backed by IMS vectors. So we end up needing MSIX and IMS to coexist
> on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?

Thanks,

tglx


[PATCH v2 29/33] iommu: Use put_pages_list

2021-12-01 Thread Vlastimil Babka
From: "Matthew Wilcox (Oracle)" 

page->freelist is for the use of slab.  We already have the ability
to free a list of pages in the core mm, but it requires the use of a
list_head and for the pages to be chained together through page->lru.
Switch the iommu code over to using put_pages_list().

Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Vlastimil Babka 
Cc: Joerg Roedel 
Cc: Suravee Suthikulpanit 
Cc: Will Deacon 
Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: 
---
 drivers/iommu/amd/io_pgtable.c | 59 +-
 drivers/iommu/dma-iommu.c  | 11 +
 drivers/iommu/intel/iommu.c| 89 --
 include/linux/iommu.h  |  3 +-
 4 files changed, 57 insertions(+), 105 deletions(-)

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 182c93a43efd..08ea6a02cda9 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -74,27 +74,15 @@ static u64 *first_pte_l7(u64 *pte, unsigned long *page_size,
  *
  /
 
-static void free_page_list(struct page *freelist)
-{
-   while (freelist != NULL) {
-   unsigned long p = (unsigned long)page_address(freelist);
-
-   freelist = freelist->freelist;
-   free_page(p);
-   }
-}
-
-static struct page *free_pt_page(unsigned long pt, struct page *freelist)
+static void free_pt_page(unsigned long pt, struct list_head *list)
 {
struct page *p = virt_to_page((void *)pt);
 
-   p->freelist = freelist;
-
-   return p;
+   list_add(&p->lru, list);
 }
 
 #define DEFINE_FREE_PT_FN(LVL, FN) 
\
-static struct page *free_pt_##LVL (unsigned long __pt, struct page *freelist)  
\
+static void free_pt_##LVL (unsigned long __pt, struct list_head *list) 
\
 {  
\
unsigned long p;
\
u64 *pt;
\
@@ -113,10 +101,10 @@ static struct page *free_pt_##LVL (unsigned long __pt, 
struct page *freelist) \
continue;   
\

\
p = (unsigned long)IOMMU_PTE_PAGE(pt[i]);   
\
-   freelist = FN(p, freelist); 
\
+   FN(p, list);
\
}   
\

\
-   return free_pt_page((unsigned long)pt, freelist);   
\
+   free_pt_page((unsigned long)pt, list);  
\
 }
 
 DEFINE_FREE_PT_FN(l2, free_pt_page)
@@ -125,36 +113,33 @@ DEFINE_FREE_PT_FN(l4, free_pt_l3)
 DEFINE_FREE_PT_FN(l5, free_pt_l4)
 DEFINE_FREE_PT_FN(l6, free_pt_l5)
 
-static struct page *free_sub_pt(unsigned long root, int mode,
-   struct page *freelist)
+static void free_sub_pt(unsigned long root, int mode, struct list_head *list)
 {
switch (mode) {
case PAGE_MODE_NONE:
case PAGE_MODE_7_LEVEL:
break;
case PAGE_MODE_1_LEVEL:
-   freelist = free_pt_page(root, freelist);
+   free_pt_page(root, list);
break;
case PAGE_MODE_2_LEVEL:
-   freelist = free_pt_l2(root, freelist);
+   free_pt_l2(root, list);
break;
case PAGE_MODE_3_LEVEL:
-   freelist = free_pt_l3(root, freelist);
+   free_pt_l3(root, list);
break;
case PAGE_MODE_4_LEVEL:
-   freelist = free_pt_l4(root, freelist);
+   free_pt_l4(root, list);
break;
case PAGE_MODE_5_LEVEL:
-   freelist = free_pt_l5(root, freelist);
+   free_pt_l5(root, list);
break;
case PAGE_MODE_6_LEVEL:
-   freelist = free_pt_l6(root, freelist);
+   free_pt_l6(root, list);
break;
default:
BUG();
}
-
-   return freelist;
 }
 
 void amd_iommu_domain_set_pgtable(struct protection_domain *domain,
@@ -362,7 +347,7 @@ static u64 *fetch_pte(struct amd_io_pgtable *pgtable,
return pte;
 }
 
-static struct page *free_clear_pte(u64 *pte, u64 pteval, struct page *freelist)
+static void free_clear_pte(u64 *pte, u64 pteval, struct list_head *list)
 {
unsigned long pt;
int mode;
@@ -373,12 +358,12 @@ static struct page *free_clear_pte(u64 *pte, u64 pteval, 
struct page *freelist)
}
 
if (!IOMMU_PTE_PRESENT(pteval))
-   

[PATCH v2 00/33] Separate struct slab from struct page

2021-12-01 Thread Vlastimil Babka
Folks from non-slab subsystems are Cc'd only to patches affecting them, and
this cover letter.

Series also available in git, based on 5.16-rc3:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-struct_slab-v2r2

The plan: as with my SLUB PREEMPT_RT series in 5.15, I would prefer to go again
with the git pull request way of eventually merging this, as it's also not a
small series. I will thus reply to this mail asking to include my branch in
linux-next.

As stated in the v1/RFC cover letter, I wouldn't mind then continuing to
maintain a git tree for all slab patches in general. It was apparently
already done that way before, by Pekka:
https://lore.kernel.org/linux-mm/alpine.DEB.2.00.1107221108190.2996@tiger/

Changes from v1/RFC:
https://lore.kernel.org/all/2026001628.24216-1-vba...@suse.cz/
- Added virt_to_folio() and folio_address() in the new Patch 1.
- Addressed feedback from Andrey Konovalov and Matthew Wilcox (Thanks!)
- Added Tested-by: Marco Elver for the KFENCE parts (Thanks!)

Previous version from Matthew Wilcox:
https://lore.kernel.org/all/20211004134650.4031813-1-wi...@infradead.org/

LWN coverage of the above:
https://lwn.net/Articles/871982/

This is originally an offshoot of the folio work by Matthew. One of the more
complex parts of the struct page definition is the set of fields used by the slab
allocators. It would be good for the MM in general if struct slab were its own
data type, and it also helps to prevent tail pages from slipping in anywhere.
As Matthew requested in his proof-of-concept series, I have taken over the
development of this series, so it's a mix of patches from him (often modified
by me) and my own.

One big difference is the use of coccinelle to perform the relatively trivial
parts of the conversions automatically and at once, instead of a larger number
of smaller incremental reviewable steps. Thanks to Julia Lawall and Luis
Chamberlain for all their help!

Another notable difference (based also on review feedback) is that I don't
represent with a struct slab the large kmalloc allocations, which are not really
a slab but use the page allocator directly. When going from an object address to
a struct slab, the code first tests the folio slab flag, and only if it is set
does it convert to struct slab. This makes the struct slab type stronger.
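
A rough sketch of that lookup flow, using the helpers this series adds
(virt_to_folio() from patch 1, plus the existing folio slab-flag test); an
illustration of the idea rather than the exact final code:

/* Sketch: resolve an object address to a struct slab only when the backing
 * folio really is slab memory; large kmalloc allocations are left alone. */
static struct slab *example_virt_to_slab(const void *obj)
{
	struct folio *folio = virt_to_folio(obj);

	if (!folio_test_slab(folio))
		return NULL;	/* large kmalloc: plain page allocator memory */

	return folio_slab(folio);	/* guaranteed to be an actual slab page */
}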

Finally, although Matthew's version didn't use any of the folio work, the
initial support has been merged in the meantime, so my version builds on top of
it where appropriate. This eliminates some of the redundant compound_head()
calls being performed, e.g. when testing the slab flag.

To sum up, after this series, struct page fields used by slab allocators are
moved from struct page to a new struct slab, that uses the same physical
storage. The availability of the fields is further distinguished by the
selected slab allocator implementation. The advantages include:

- Similar to folios, if the slab is of order > 0, struct slab always is
  guaranteed to be the head page. Additionally it's guaranteed to be an actual
  slab page, not a large kmalloc. This removes uncertainty and potential for
  bugs.
- It's not possible to accidentally use fields of the slab implementation that's
  not configured.
- Other subsystems cannot use slab's fields in struct page anymore (some
  existing non-slab usages had to be adjusted in this series), so slab
  implementations have more freedom in rearranging them in the struct slab.

Matthew Wilcox (Oracle) (16):
  mm: Split slab into its own type
  mm: Add account_slab() and unaccount_slab()
  mm: Convert virt_to_cache() to use struct slab
  mm: Convert __ksize() to struct slab
  mm: Use struct slab in kmem_obj_info()
  mm: Convert check_heap_object() to use struct slab
  mm/slub: Convert detached_freelist to use a struct slab
  mm/slub: Convert kfree() to use a struct slab
  mm/slub: Convert print_page_info() to print_slab_info()
  mm/slub: Convert pfmemalloc_match() to take a struct slab
  mm/slob: Convert SLOB to use struct slab
  mm/kasan: Convert to struct folio and struct slab
  zsmalloc: Stop using slab fields in struct page
  bootmem: Use page->index instead of page->freelist
  iommu: Use put_pages_list
  mm: Remove slab from struct page

Vlastimil Babka (17):
  mm: add virt_to_folio() and folio_address()
  mm/slab: Dissolve slab_map_pages() in its caller
  mm/slub: Make object_err() static
  mm/slub: Convert __slab_lock() and __slab_unlock() to struct slab
  mm/slub: Convert alloc_slab_page() to return a struct slab
  mm/slub: Convert __free_slab() to use struct slab
  mm/slub: Convert most struct page to struct slab by spatch
  mm/slub: Finish struct page to struct slab conversion
  mm/slab: Convert kmem_getpages() and kmem_freepages() to struct slab
  mm/slab: Convert most struct page to struct slab by spatch
  mm/slab: Finish struct page to struct slab conversion
  mm: Convert struct page to struct slab in functions used by other
subsystems
  mm/memcg: Convert slab objcgs from struct page 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Jason Gunthorpe via iommu
On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> > On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
> >> Looking at the device slices as subdevices with their own struct device
> >> makes a lot of sense from the conceptual level.
> >
> > Except IMS is not just for subdevices, it should be usable for any
> > driver in any case as a general interrupt mechanism, as you alluded to
> > below about ethernet queues. ntb seems to be the current example of
> > this need..
> 
> But NTB is operating through an abstraction layer and is not a direct
> PCIe device driver.

I'm not sure exactly how NTB seems to be split between switchtec and
the ntb code, but since the ntb code seems to be doing MMIO touches,
it feels like part of a PCIe driver?

> > IDXD is not so much making device "slices", but virtualizing and
> > sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> > described and the VFIO driver is simply allocating queues from a PCI
> > device for specific usages and assigning them interrupts.
> 
> Right.
> 
> But what is the representation for that resulting device? Some VFIO
> specific homebrewn muck or something which is based on struct device?

Why is there a device? A queue is a queue, not a device.

If the task is to make some struct device (eg mdev, or cdev, or
whatever) then queues may be allocated to do this, but the queue is
logically a resource handed out by the PCIe driver and there should
not be a requirement to have an external struct device just to create
a queue.

> Right now with VF passthrough, I can see the interrupts which are
> associated to the device.
> 
> How is that going to be with something which is just made up? Does that
> expose its own msi properties then somewhere hidden in the VFIO layer?

For sysfs, I think all interrupts should be on the PCI directory.

> > There is already a char dev interface that equally allocates queues
> > from the same IDXD device, why shouldn't it be able to access IMS
> > interrupt pools too?
> 
> Why wouldn't it be able to do so?

The only 'struct device' there is a cdev and I really don't think
cdevs should have interrupts. It is a bit hacky as an in-kernel thing
and downright wrong as a sysfs ABI.

> The VFIO driver does not own the irq chip ever. The irq chip is of
> course part of the underlying infrastructure. I never asked for that.

That isn't quite what I meant. I meant the PCIe driver cannot create
the domain or make use of the irq_chip until the VFIO layer comes
along and provides the struct device. To me this is backwards
layering, the interrupts come from the PCIe layer and should exist
independently from VFIO.

>  When it allocates a slice for whatever usage then it also
>  allocates the IMS interrupts (though the VFIO people want to
>  have only one and do the allocations later on demand).
> 
>  That allocation cannot be part of the PCI/MSIx interrupt
>  domain as we already agreed on.

Yes, it is just an open question of where the new irq_domain need to
reside

> 1) Storage
> 
>A) Having "subdevices" solves the storage problem nicely and
>   makes everything just fall in place. Even for a purely
>   physical multiqueue device one can argue that each queue is a
>   "subdevice" of the physical device. The fact that we lump them
>   all together today is not an argument against that.

I don't like the idea that a queue is a device, that is trying to force
a struct device centric world onto a queue which doesn't really want
it..
 
>B) Requires extra storage in the PCIe device and extra storage
>   per subdevice, queue to keep track of the interrupts which
>   are associated to it.

Yes

> 2) Exposure of VFIO interrupts via sysfs
> 
>A) Just works

I would say this is flawed, in sysfs I expect all the interrupts for
the PCIe device to be in the PCIe sysfs, not strewn over subsystem
owned sub-directories.

For instance, today in mlx5, when a subdevice allocates a queue for a
slice (which is modeled as an aux device) the queue's assigned MSI-X
interrupt shows up on the PCIe sysfs, not the aux.

It should be uniform, if I assign a queue a legacy INT, MSI or an IMS
it should show in sysfs in the same way. Leaking this kernel
implementation detail as sysfs ABI does not seem good.

> 3) On demand expansion of the vectors for VFIO
> 
>A) Just works because the device has an irqdomain assigned.
> 
>B) Requires extra indirections to do that

Yes.
 
> 4) PASID
> 
>    While an Intel IDXD specific issue, it wants to be solved
>without any nasty hacks.
> 
>A) Having a "subdevice" allows to associate the PASID with the
>   underlying struct device which makes IOMMU integration trivial
> 
>B) Needs some other custom hackery to get that solved

Yes

> > Any possibility that the 'IMS' xarray could 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
>> Looking at the device slices as subdevices with their own struct device
>> makes a lot of sense from the conceptual level.
>
> Except IMS is not just for subdevices, it should be usable for any
> driver in any case as a general interrupt mechanism, as you alluded to
> below about ethernet queues. ntb seems to be the current example of
> this need..

But NTB is operating through an abstraction layer and is not a direct
PCIe device driver.

> IDXD is not so much making device "slices", but virtualizing and
> sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> described and the VFIO driver is simply allocating queues from a PCI
> device for specific usages and assigning them interrupts.

Right.

But what is the representation for that resulting device? Some VFIO
specific homebrewn muck or something which is based on struct device?

Right now with VF passthrough, I can see the interrupts which are
associated to the device.

How is that going to be with something which is just made up? Does that
expose its own msi properties then somewhere hidden in the VFIO layer?

See below.

> There is already a char dev interface that equally allocates queues
> from the same IDXD device, why shouldn't it be able to access IMS
> interrupt pools too?

Why wouldn't it be able to do so?

> IMHO a well designed IDXD driver should put all the PCI MMIO, queue
> mangement, interrupts, etc in the PCI driver layer, and the VFIO
> driver layer should only deal with the MMIO trapping and VFIO
> interfacing.
>
> From this view it is conceptually wrong to have the VFIO driver
> directly talking to MMIO registers in the PCI device or owning the
> irq_chip.

The VFIO driver does not own the irq chip ever. The irq chip is of
course part of the underlying infrastructure. I never asked for that.

PCIe driver
 Owns the PCI/MSI[x] interrupts for the control block

 Has a control mechanism which handles queues or whatever the
 device is partitioned into, that's what I called slice.

 The irqdomain is part of that control mechanism and the irqchip
 implementation belongs to that as well. It has to because the
 message store is device specific.

 Whether the doamin and chip implementation is in a separate
 drivers/irqchip/foo.c file for sharing or directly built into the
 PCIe driver itself does not matter.

 When it allocates a slice for whatever usage then it also
 allocates the IMS interrupts (though the VFIO people want to
 have only one and do the allocations later on demand).

 That allocation cannot be part of the PCI/MSIx interrupt
 domain as we already agreed on.

We have several problems to solve. Let me look at it from both point of
views:

1) Storage

   A) Having "subdevices" solves the storage problem nicely and
  makes everything just fall in place. Even for a purely
  physical multiqueue device one can argue that each queue is a
  "subdevice" of the physical device. The fact that we lump them
  all together today is not an argument against that.

   B) Requires extra storage in the PCIe device and extra storage
  per subdevice, queue to keep track of the interrupts which
  are associated to it.

2) Exposure of VFIO interrupts via sysfs

   A) Just works

   B) Requires extra mechanisms to expose it

3) On demand expansion of the vectors for VFIO

   A) Just works because the device has an irqdomain assigned.

   B) Requires extra indirections to do that

4) IOMMU msi_desc::dev

   A) I think that's reasonably simple to address by having the
  relationship to the underlying PCIe device stored in struct
  device, which is not really adding any complexity to the IOMMU
  code.

  Quite the contrary it could be used to make the DMA aliasing a
  problem of device setup time and not a lookup per interrupt
  allocation in the IOMMU code.

   B) No change required.

4) PASID

   While an Intel IDXD specific issue, it wants to be solved
   without any nasty hacks.

   A) Having a "subdevice" allows to associate the PASID with the
  underlying struct device which makes IOMMU integration trivial

   B) Needs some other custom hackery to get that solved

So both variants come with their ups and downs.

IMO A) is the right thing to do when looking at all the involved moving
pieces.

> It would be very odd for the PCI driver to allocate interrupts from
> some random external struct device when it is creating queues on the
> PCI device.

No. That's not what I want.

>> and then have a store index for each allocation domain. With the
>> proposed encapsulation of the xarray handling that's definitely
>> feasible. Whether that buys much is a different question. Let me think
>> about i

[PATCH v3 2/5] iommu/virtio: Support bypass domains

2021-12-01 Thread Jean-Philippe Brucker
The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a new flag to the ATTACH
request, that creates a bypass domain. Use it to enable identity
domains.

When VIRTIO_IOMMU_F_BYPASS_CONFIG is not supported by the device, we
currently fail attaching to an identity domain. Future patches will
instead create identity mappings in this case.

Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 80930ce04a16..14dfee76fd19 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -71,6 +71,7 @@ struct viommu_domain {
struct rb_root_cached   mappings;
 
unsigned long   nr_endpoints;
+   boolbypass;
 };
 
 struct viommu_endpoint {
@@ -587,7 +588,9 @@ static struct iommu_domain *viommu_domain_alloc(unsigned 
type)
 {
struct viommu_domain *vdomain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
+   if (type != IOMMU_DOMAIN_UNMANAGED &&
+   type != IOMMU_DOMAIN_DMA &&
+   type != IOMMU_DOMAIN_IDENTITY)
return NULL;
 
vdomain = kzalloc(sizeof(*vdomain), GFP_KERNEL);
@@ -630,6 +633,17 @@ static int viommu_domain_finalise(struct viommu_endpoint 
*vdev,
vdomain->map_flags  = viommu->map_flags;
vdomain->viommu = viommu;
 
+   if (domain->type == IOMMU_DOMAIN_IDENTITY) {
+   if (!virtio_has_feature(viommu->vdev,
+   VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
+   ida_free(&viommu->domain_ids, vdomain->id);
+   vdomain->viommu = NULL;
+   return -EOPNOTSUPP;
+   }
+
+   vdomain->bypass = true;
+   }
+
return 0;
 }
 
@@ -691,6 +705,9 @@ static int viommu_attach_dev(struct iommu_domain *domain, 
struct device *dev)
.domain = cpu_to_le32(vdomain->id),
};
 
+   if (vdomain->bypass)
+   req.flags |= cpu_to_le32(VIRTIO_IOMMU_ATTACH_F_BYPASS);
+
for (i = 0; i < fwspec->num_ids; i++) {
req.endpoint = cpu_to_le32(fwspec->ids[i]);
 
@@ -1132,6 +1149,7 @@ static unsigned int features[] = {
VIRTIO_IOMMU_F_DOMAIN_RANGE,
VIRTIO_IOMMU_F_PROBE,
VIRTIO_IOMMU_F_MMIO,
+   VIRTIO_IOMMU_F_BYPASS_CONFIG,
 };
 
 static struct virtio_device_id id_table[] = {
-- 
2.34.0



[PATCH v3 3/5] iommu/virtio: Sort reserved regions

2021-12-01 Thread Jean-Philippe Brucker
To ease identity mapping support, keep the list of reserved regions
sorted.

Reviewed-by: Eric Auger 
Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 14dfee76fd19..1b3c1f2741c6 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -423,7 +423,7 @@ static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
size_t size;
u64 start64, end64;
phys_addr_t start, end;
-   struct iommu_resv_region *region = NULL;
+   struct iommu_resv_region *region = NULL, *next;
unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
 
start = start64 = le64_to_cpu(mem->start);
@@ -454,7 +454,12 @@ static int viommu_add_resv_mem(struct viommu_endpoint 
*vdev,
if (!region)
return -ENOMEM;
 
-   list_add(&region->list, &vdev->resv_regions);
+   /* Keep the list sorted */
+   list_for_each_entry(next, &vdev->resv_regions, list) {
+   if (next->start > region->start)
+   break;
+   }
+   list_add_tail(&region->list, &next->list);
return 0;
 }
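
The insertion above relies on a property of list_for_each_entry(): when the
loop runs to completion without hitting the break, &next->list is the list
head itself, so list_add_tail() appends at the end; when the break hits, the
new region is inserted just before the first entry with a larger start. A
standalone sketch of the same idiom (made-up type names, not the virtio-iommu
code):

#include <linux/list.h>
#include <linux/types.h>

struct example_region {
	struct list_head list;
	u64 start;
};

/* Sketch: insert @reg into @head, keeping entries sorted by ->start. */
static void example_insert_sorted(struct list_head *head,
				  struct example_region *reg)
{
	struct example_region *next;

	list_for_each_entry(next, head, list) {
		if (next->start > reg->start)
			break;
	}
	/* If nothing bigger was found, &next->list == head and this appends. */
	list_add_tail(&reg->list, &next->list);
}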
 
-- 
2.34.0



[PATCH v3 4/5] iommu/virtio: Pass end address to viommu_add_mapping()

2021-12-01 Thread Jean-Philippe Brucker
To support identity mappings, the virtio-iommu driver must be able to
represent full 64-bit ranges internally. Pass (start, end) instead of
(start, size) to viommu_add/del_mapping().

Clean comments. The one about the returned size was never true: when
sweeping the whole address space the returned size will most certainly
be smaller than 2^64.
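
The arithmetic behind that: an inclusive (start, end) pair can describe the
whole 64-bit space as (0, ULLONG_MAX), whereas with (start, size) the same
span would need size = 2^64, which does not fit in a u64; hence the old
special case of size 0. A tiny sketch of the difference (illustrative only):

#include <linux/limits.h>
#include <linux/types.h>

/* Sketch: the whole 64-bit address space expressed as an inclusive range. */
static void example_whole_space(void)
{
	u64 start = 0;
	u64 end = ULLONG_MAX;		/* inclusive: covers 2^64 addresses */
	u64 size = end - start + 1;	/* wraps to 0, the old "whole space" hack */

	(void)size;
}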

Reviewed-by: Eric Auger 
Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 1b3c1f2741c6..2fa370c2659c 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -311,8 +311,8 @@ static int viommu_send_req_sync(struct viommu_dev *viommu, 
void *buf,
  *
  * On success, return the new mapping. Otherwise return NULL.
  */
-static int viommu_add_mapping(struct viommu_domain *vdomain, unsigned long 
iova,
- phys_addr_t paddr, size_t size, u32 flags)
+static int viommu_add_mapping(struct viommu_domain *vdomain, u64 iova, u64 end,
+ phys_addr_t paddr, u32 flags)
 {
unsigned long irqflags;
struct viommu_mapping *mapping;
@@ -323,7 +323,7 @@ static int viommu_add_mapping(struct viommu_domain 
*vdomain, unsigned long iova,
 
mapping->paddr  = paddr;
mapping->iova.start = iova;
-   mapping->iova.last  = iova + size - 1;
+   mapping->iova.last  = end;
mapping->flags  = flags;
 
spin_lock_irqsave(&vdomain->mappings_lock, irqflags);
@@ -338,26 +338,24 @@ static int viommu_add_mapping(struct viommu_domain *vdomain, unsigned long iova,
  *
  * @vdomain: the domain
  * @iova: start of the range
- * @size: size of the range. A size of 0 corresponds to the entire address
- * space.
+ * @end: end of the range
  *
- * On success, returns the number of unmapped bytes (>= size)
+ * On success, returns the number of unmapped bytes
  */
 static size_t viommu_del_mappings(struct viommu_domain *vdomain,
- unsigned long iova, size_t size)
+ u64 iova, u64 end)
 {
size_t unmapped = 0;
unsigned long flags;
-   unsigned long last = iova + size - 1;
struct viommu_mapping *mapping = NULL;
struct interval_tree_node *node, *next;
 
spin_lock_irqsave(&vdomain->mappings_lock, flags);
-   next = interval_tree_iter_first(&vdomain->mappings, iova, last);
+   next = interval_tree_iter_first(&vdomain->mappings, iova, end);
while (next) {
node = next;
mapping = container_of(node, struct viommu_mapping, iova);
-   next = interval_tree_iter_next(node, iova, last);
+   next = interval_tree_iter_next(node, iova, end);
 
/* Trying to split a mapping? */
if (mapping->iova.start < iova)
@@ -656,8 +654,8 @@ static void viommu_domain_free(struct iommu_domain *domain)
 {
struct viommu_domain *vdomain = to_viommu_domain(domain);
 
-   /* Free all remaining mappings (size 2^64) */
-   viommu_del_mappings(vdomain, 0, 0);
+   /* Free all remaining mappings */
+   viommu_del_mappings(vdomain, 0, ULLONG_MAX);
 
if (vdomain->viommu)
ida_free(&vdomain->viommu->domain_ids, vdomain->id);
@@ -742,6 +740,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 {
int ret;
u32 flags;
+   u64 end = iova + size - 1;
struct virtio_iommu_req_map map;
struct viommu_domain *vdomain = to_viommu_domain(domain);
 
@@ -752,7 +751,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
if (flags & ~vdomain->map_flags)
return -EINVAL;
 
-   ret = viommu_add_mapping(vdomain, iova, paddr, size, flags);
+   ret = viommu_add_mapping(vdomain, iova, end, paddr, flags);
if (ret)
return ret;
 
@@ -761,7 +760,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
.domain = cpu_to_le32(vdomain->id),
.virt_start = cpu_to_le64(iova),
.phys_start = cpu_to_le64(paddr),
-   .virt_end   = cpu_to_le64(iova + size - 1),
+   .virt_end   = cpu_to_le64(end),
.flags  = cpu_to_le32(flags),
};
 
@@ -770,7 +769,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 
ret = viommu_send_req_sync(vdomain->viommu, &map, sizeof(map));
if (ret)
-   viommu_del_mappings(vdomain, iova, size);
+   viommu_del_mappings(vdomain, iova, end);
 
return ret;
 }
@@ -783,7 +782,7 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
struct virtio_iommu_req_unmap unmap;
s

[PATCH v3 1/5] iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG

2021-12-01 Thread Jean-Philippe Brucker
Add definitions for the VIRTIO_IOMMU_F_BYPASS_CONFIG, which supersedes
VIRTIO_IOMMU_F_BYPASS.

Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 include/uapi/linux/virtio_iommu.h | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
index 237e36a280cb..1ff357f0d72e 100644
--- a/include/uapi/linux/virtio_iommu.h
+++ b/include/uapi/linux/virtio_iommu.h
@@ -16,6 +16,7 @@
 #define VIRTIO_IOMMU_F_BYPASS  3
 #define VIRTIO_IOMMU_F_PROBE   4
 #define VIRTIO_IOMMU_F_MMIO5
+#define VIRTIO_IOMMU_F_BYPASS_CONFIG   6
 
 struct virtio_iommu_range_64 {
__le64  start;
@@ -36,6 +37,8 @@ struct virtio_iommu_config {
struct virtio_iommu_range_32domain_range;
/* Probe buffer size */
__le32  probe_size;
+   __u8bypass;
+   __u8reserved[3];
 };
 
 /* Request types */
@@ -66,11 +69,14 @@ struct virtio_iommu_req_tail {
__u8reserved[3];
 };
 
+#define VIRTIO_IOMMU_ATTACH_F_BYPASS   (1 << 0)
+
 struct virtio_iommu_req_attach {
struct virtio_iommu_req_headhead;
__le32  domain;
__le32  endpoint;
-   __u8reserved[8];
+   __le32  flags;
+   __u8reserved[4];
struct virtio_iommu_req_tailtail;
 };
 
-- 
2.34.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 5/5] iommu/virtio: Support identity-mapped domains

2021-12-01 Thread Jean-Philippe Brucker
Support identity domains for devices that do not offer the
VIRTIO_IOMMU_F_BYPASS_CONFIG feature, by creating 1:1 mappings between
the virtual and physical address space. Identity domains created this
way still perform noticeably better than DMA domains, because they don't
have the overhead of setting up and tearing down mappings at runtime.
The performance difference between this and bypass is minimal in
comparison.

It does not matter that the physical addresses in the identity mappings
do not all correspond to memory. By enabling passthrough we are trusting
the device driver and the device itself to only perform DMA to suitable
locations. In some cases it may even be desirable to perform DMA to MMIO
regions.
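
To make the identity-mapping loop below concrete, here is a worked example
with made-up addresses (not taken from the patch): assume a 4kB granule, an
aperture of [0x0, 0xffffffff] and one reserved region [0xfee00000, 0xfeefffff]
(say, an MSI doorbell). viommu_domain_map_identity() then issues two MAP
requests, one on each side of the hole:

	viommu_add_mapping(vdomain, 0x0,        0xfedfffff, 0x0,        flags);
	viommu_add_mapping(vdomain, 0xfef00000, 0xffffffff, 0xfef00000, flags);

with flags = VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE, and nothing
is mapped inside the reserved region itself.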

Reviewed-by: Eric Auger 
Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 61 +---
 1 file changed, 57 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 2fa370c2659c..6a8a52b4297b 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -375,6 +375,55 @@ static size_t viommu_del_mappings(struct viommu_domain *vdomain,
return unmapped;
 }
 
+/*
+ * Fill the domain with identity mappings, skipping the device's reserved
+ * regions.
+ */
+static int viommu_domain_map_identity(struct viommu_endpoint *vdev,
+ struct viommu_domain *vdomain)
+{
+   int ret;
+   struct iommu_resv_region *resv;
+   u64 iova = vdomain->domain.geometry.aperture_start;
+   u64 limit = vdomain->domain.geometry.aperture_end;
+   u32 flags = VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE;
+   unsigned long granule = 1UL << __ffs(vdomain->domain.pgsize_bitmap);
+
+   iova = ALIGN(iova, granule);
+   limit = ALIGN_DOWN(limit + 1, granule) - 1;
+
+   list_for_each_entry(resv, &vdev->resv_regions, list) {
+   u64 resv_start = ALIGN_DOWN(resv->start, granule);
+   u64 resv_end = ALIGN(resv->start + resv->length, granule) - 1;
+
+   if (resv_end < iova || resv_start > limit)
+   /* No overlap */
+   continue;
+
+   if (resv_start > iova) {
+   ret = viommu_add_mapping(vdomain, iova, resv_start - 1,
+(phys_addr_t)iova, flags);
+   if (ret)
+   goto err_unmap;
+   }
+
+   if (resv_end >= limit)
+   return 0;
+
+   iova = resv_end + 1;
+   }
+
+   ret = viommu_add_mapping(vdomain, iova, limit, (phys_addr_t)iova,
+flags);
+   if (ret)
+   goto err_unmap;
+   return 0;
+
+err_unmap:
+   viommu_del_mappings(vdomain, 0, iova);
+   return ret;
+}
+
 /*
  * viommu_replay_mappings - re-send MAP requests
  *
@@ -637,14 +686,18 @@ static int viommu_domain_finalise(struct viommu_endpoint *vdev,
vdomain->viommu = viommu;
 
if (domain->type == IOMMU_DOMAIN_IDENTITY) {
-   if (!virtio_has_feature(viommu->vdev,
-   VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
+   if (virtio_has_feature(viommu->vdev,
+  VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
+   vdomain->bypass = true;
+   return 0;
+   }
+
+   ret = viommu_domain_map_identity(vdev, vdomain);
+   if (ret) {
ida_free(&viommu->domain_ids, vdomain->id);
vdomain->viommu = NULL;
return -EOPNOTSUPP;
}
-
-   vdomain->bypass = true;
}
 
return 0;
-- 
2.34.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 0/5] iommu/virtio: Add identity domains

2021-12-01 Thread Jean-Philippe Brucker
Support identity domains, allowing IOMMU protection to be enabled for only
a subset of endpoints (those assigned to userspace, for example). Users
may enable identity domains at compile time
(CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time (iommu.passthrough=1) or
runtime (/sys/kernel/iommu_groups/*/type = identity).

Since v2 [1] I fixed the padding in patch 1 and a rebase error in patch
5, reported by Eric.

Patches 1-2 support identity domains using the optional
VIRTIO_IOMMU_F_BYPASS_CONFIG feature, which was accepted into the spec
[2]. Patches 3-5 add a fallback to identity mappings, when the feature
is not supported.

QEMU patches are on my virtio-iommu/bypass branch [3], and depend on the
UAPI update.

[1] https://lore.kernel.org/linux-iommu/20211123155301.1047943-1-jean-phili...@linaro.org/
[2] https://github.com/oasis-tcs/virtio-spec/issues/119
[3] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/bypass

Jean-Philippe Brucker (5):
  iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG
  iommu/virtio: Support bypass domains
  iommu/virtio: Sort reserved regions
  iommu/virtio: Pass end address to viommu_add_mapping()
  iommu/virtio: Support identity-mapped domains

 include/uapi/linux/virtio_iommu.h |   8 ++-
 drivers/iommu/virtio-iommu.c  | 113 +-
 2 files changed, 101 insertions(+), 20 deletions(-)

-- 
2.34.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 3:16 AM, Thomas Gleixner wrote:

Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:

On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

The real problem is where to store the MSI descriptors because the PCI
device has its own real PCI/MSI-X interrupts which means it still shares
the storage space.

Er.. I never realized that just looking at the patches :|

That is relevant to all real "IMS" users. IDXD escaped this because
it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
a messy artifact of VFIO, it should not be required to make the IRQ
layers work.
I don't think it makes sense that the msi_desc would point to a mdev,
the iommu layer consumes the msi_desc_to_dev(), it really should point
to the physical device that originates the message with a proper
iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device
makes a lot of sense from the conceptual level. That makes it pretty
much obvious to manage the MSIs of those devices at this level like we
do for any other device.

Whether mdev is the right encapsulation for these subdevices is an
orthogonal problem.

I surely agree that msi_desc::dev is an interesting question, but we
already have this disconnect of msi_desc::dev and DMA today due to DMA
aliasing. I haven't looked at that in detail yet, but of course the
alias handling is substantially different across the various IOMMU
implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as physical function device as a whole, but that's
conceptually a different case.


Hi Thomas. This is actually the IDXD usage for a mediated device passed 
to a guest kernel when we plumb the pass-through of IMS to the guest, 
rather than the previous implementation of having an MSI-X vector on the 
guest backed by IMS. The control block for the mediated device is 
emulated and therefore an emulated MSI-X vector will be surfaced as 
vector 0. However, the queues will be backed by IMS vectors. So we end 
up needing MSI-X and IMS to coexist on the guest kernel for the same 
device.


DJ




I'm currently tending to partition the index space in the xarray:

  0x - 0x  PCI/MSI-X
  0x0001 - 0x0001  NTB

It is OK, with some xarray work it can be range allocating & reserving
so that the msi_domain_alloc_irqs() flows can carve out chunks of the
number space..

Another view is the msi_domain_alloc_irqs() flows should have their
own xarrays..

Yes, I was thinking about that as well. The trivial way would be:

 struct xarray store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the
proposed encapsulation of the xarray handling that's definitely
feasible. Whether that buys much is a different question. Let me think
about it some more.


which is feasible now with the range modifications and way simpler to do
with xarray than with the linked list.

Indeed!

I'm glad you like the approach.

Thanks,

 tglx



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-12-01 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V Isolation VMs require bounce buffer support to copy data
from/to encrypted memory, so enable swiotlb force mode to use the
swiotlb bounce buffer for DMA transactions.

In Isolation VM with AMD SEV, the bounce buffer needs to be accessed
via an extra address space which is above shared_gpa_boundary (e.g. the
39-bit address line) reported by the Hyper-V CPUID ISOLATION_CONFIG
leaf. The physical address used for the access will be the original
physical address + shared_gpa_boundary. The shared_gpa_boundary in the
AMD SEV-SNP spec is called the virtual top of memory (vTOM). Memory
addresses below vTOM are automatically treated as private while memory
above vTOM is treated as shared.

Hyper-V initializes its own swiotlb bounce buffer and the default
swiotlb needs to be disabled. pci_swiotlb_detect_override() and
pci_swiotlb_detect_4gb() enable the default one. To override the
setting, hyperv_swiotlb_detect() needs to run before these detect
functions, which depend on pci_xen_swiotlb_init(). Make
pci_xen_swiotlb_init() depend on hyperv_swiotlb_detect() to keep the
order.

The swiotlb bounce buffer code calls set_memory_decrypted() to mark
the bounce buffer visible to the host and maps it in the extra address
space via memremap(). Populate the shared_gpa_boundary (vTOM) via the
swiotlb_unencrypted_base variable.

The map function memremap() can't be used as early as
hyperv_iommu_swiotlb_init(), so call swiotlb_update_mem_attributes()
in hyperv_iommu_swiotlb_later_init() instead.
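
For orientation, a sketch of how the hooks named above would plausibly be
wired together; the real wiring is in the diff below (truncated in this
archive), so treat the exact arguments as an assumption:

	/* Sketch only: the late hook remaps once memremap() is usable. */
	static void __init hyperv_iommu_swiotlb_later_init(void)
	{
		swiotlb_update_mem_attributes();
	}

	/*
	 * Sketch only: register detect/init/late-init hooks.  Ordering ahead
	 * of the default detectors comes from pci-swiotlb-xen.c, whose table
	 * entry now depends on hyperv_swiotlb_detect().
	 */
	IOMMU_INIT_FINISH(hyperv_swiotlb_detect,
			  NULL,
			  hyperv_iommu_swiotlb_init,
			  hyperv_iommu_swiotlb_later_init);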

Signed-off-by: Tianyu Lan 
---
 arch/x86/xen/pci-swiotlb-xen.c |  3 +-
 drivers/hv/vmbus_drv.c |  3 ++
 drivers/iommu/hyperv-iommu.c   | 56 ++
 include/linux/hyperv.h |  8 +
 4 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 46df59aeaa06..30fd0600b008 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
 EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
 
 IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
- NULL,
+ hyperv_swiotlb_detect,
  pci_xen_swiotlb_init,
  NULL);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 392c1ac4f819..0a64ccfafb8b 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "hyperv_vmbus.h"
 
@@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t *type,
return child_device_obj;
 }
 
+static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
 /*
  * vmbus_device_register - Register the child device
  */
@@ -2118,6 +2120,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
}
hv_debug_add_dev_dir(child_device_obj);
 
+   child_device_obj->device.dma_mask = &vmbus_dma_mask;
return 0;
 
 err_kset_unregister:
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e285a220c913..dd729d49a1eb 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -13,14 +13,20 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "irq_remapping.h"
 
@@ -337,4 +343,54 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
.free = hyperv_root_irq_remapping_free,
 };
 
+static void __init hyperv_iommu_swiotlb_init(void)
+{
+   unsigned long hyperv_io_tlb_size;
+   void *hyperv_io_tlb_start;
+
+   /*
+* Allocate Hyper-V swiotlb bounce buffer at early place
+* to reserve large contiguous memory.
+*/
+   hyperv_io_tlb_size = swiotlb_size_or_default();
+   hyperv_io_tlb_start = memblock_alloc(hyperv_io_tlb_size, PAGE_SIZE);
+
+   if (!hyperv_io_tlb_start)
+   pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");
+
+   swiotlb_init_with_tbl(hyperv_io_tlb_start,
+ hyperv_io_tlb_size >> IO_TLB_SHIFT, true);
+}
+
+int __init hyperv_swiotlb_detect(void)
+{
+   if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
+   return 0;
+
+   if (!hv_is_isolation_supported())
+   return 0;
+
+   /*
+* Enable swiotlb force mode in Isolation VM to
+* use swiotlb bounce buffer for dma transaction.
+*/
+   if (hv_isolation_type_snp())
+   swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
+   swiotlb_force = SWIOTLB_FORCE;
+   return 1;
+}
+
+static void __init hyperv_iommu_swiotlb_later_init(void)
+{
+   /*
+* Swiotlb bounce buffer needs to be mapped in extra address
+* space. Map function doesn't work in the early place and so
+* call swiotlb_update_mem_attributes() here.
+*/
+ 

[PATCH V3 5/5] hv_netvsc: Add Isolation VM support for netvsc driver

2021-12-01 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM, all memory shared with the host needs to be marked
visible to the host via a hvcall. vmbus_establish_gpadl() has already
done this for the netvsc rx/tx ring buffers. The page buffers used by
vmbus_sendpacket_pagebuffer() still need to be handled. Use the DMA API
to map/unmap this memory when sending/receiving packets, and the
Hyper-V swiotlb bounce buffer DMA address will be returned. The swiotlb
bounce buffer has been marked visible to the host during boot.

The rx/tx ring buffers are allocated via vzalloc() and need to be
mapped into the unencrypted address space (above vTOM) before being
shared with the host and accessed. Add hv_map/unmap_memory() to
map/unmap the rx/tx ring buffers.
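
As a rough sketch of the mapping step described above (the real
netvsc_dma_map() is in the truncated diff below; the hv_dma_range field
names are assumptions based on this series):

	/*
	 * Sketch only: map each page buffer handed to
	 * vmbus_sendpacket_pagebuffer() and substitute the bounce-buffer DMA
	 * address for the original PFN/offset.  Allocation of
	 * packet->dma_range (kcalloc) is omitted for brevity.
	 */
	static int netvsc_dma_map_sketch(struct hv_device *hv_dev,
					 struct hv_netvsc_packet *packet,
					 struct hv_page_buffer *pb,
					 u32 page_count)
	{
		dma_addr_t dma;
		u32 i;

		for (i = 0; i < page_count; i++) {
			void *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT) +
						 pb[i].offset);

			dma = dma_map_single(&hv_dev->device, src, pb[i].len,
					     DMA_TO_DEVICE);
			if (dma_mapping_error(&hv_dev->device, dma))
				return -ENOMEM;

			packet->dma_range[i].dma = dma;
			packet->dma_range[i].mapping_size = pb[i].len;
			pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
			pb[i].offset = offset_in_hvpage(dma);
		}

		return 0;
	}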

Signed-off-by: Tianyu Lan 
---
Change since v2:
   * Add hv_map/unmap_memory() to map/umap rx/tx ring buffer.
---
 arch/x86/hyperv/ivm.c |  28 ++
 drivers/hv/hv_common.c|  11 +++
 drivers/net/hyperv/hyperv_net.h   |   5 ++
 drivers/net/hyperv/netvsc.c   | 136 +-
 drivers/net/hyperv/netvsc_drv.c   |   1 +
 drivers/net/hyperv/rndis_filter.c |   2 +
 include/asm-generic/mshyperv.h|   2 +
 include/linux/hyperv.h|   5 ++
 8 files changed, 187 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 69c7a57f3307..9f78d8f67ea3 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -287,3 +287,31 @@ int hv_set_mem_host_visibility(unsigned long kbuffer, int pagecount, bool visibl
kfree(pfn_array);
return ret;
 }
+
+/*
+ * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
+ */
+void *hv_map_memory(void *addr, unsigned long size)
+{
+   unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,
+ sizeof(unsigned long), GFP_KERNEL);
+   void *vaddr;
+   int i;
+
+   if (!pfns)
+   return NULL;
+
+   for (i = 0; i < size / PAGE_SIZE; i++)
+   pfns[i] = virt_to_hvpfn(addr + i * PAGE_SIZE) +
+   (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
+
+   vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
+   kfree(pfns);
+
+   return vaddr;
+}
+
+void hv_unmap_memory(void *addr)
+{
+   vunmap(addr);
+}
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 7be173a99f27..3c5cb1f70319 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -295,3 +295,14 @@ u64 __weak hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_s
return HV_STATUS_INVALID_PARAMETER;
 }
 EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
+
+void __weak *hv_map_memory(void *addr, unsigned long size)
+{
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(hv_map_memory);
+
+void __weak hv_unmap_memory(void *addr)
+{
+}
+EXPORT_SYMBOL_GPL(hv_unmap_memory);
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 315278a7cf88..cf69da0e296c 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -164,6 +164,7 @@ struct hv_netvsc_packet {
u32 total_bytes;
u32 send_buf_index;
u32 total_data_buflen;
+   struct hv_dma_range *dma_range;
 };
 
 #define NETVSC_HASH_KEYLEN 40
@@ -1074,6 +1075,7 @@ struct netvsc_device {
 
/* Receive buffer allocated by us but manages by NetVSP */
void *recv_buf;
+   void *recv_original_buf;
u32 recv_buf_size; /* allocated bytes */
struct vmbus_gpadl recv_buf_gpadl_handle;
u32 recv_section_cnt;
@@ -1082,6 +1084,7 @@ struct netvsc_device {
 
/* Send buffer allocated by us */
void *send_buf;
+   void *send_original_buf;
u32 send_buf_size;
struct vmbus_gpadl send_buf_gpadl_handle;
u32 send_section_cnt;
@@ -1731,4 +1734,6 @@ struct rndis_message {
 #define RETRY_US_HI1
 #define RETRY_MAX  2000/* >10 sec */
 
+void netvsc_dma_unmap(struct hv_device *hv_dev,
+ struct hv_netvsc_packet *packet);
 #endif /* _HYPERV_NET_H */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 396bc1c204e6..b7ade735a806 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
int i;
 
kfree(nvdev->extension);
-   vfree(nvdev->recv_buf);
-   vfree(nvdev->send_buf);
+
+   if (nvdev->recv_original_buf) {
+   hv_unmap_memory(nvdev->recv_buf);
+   vfree(nvdev->recv_original_buf);
+   } else {
+   vfree(nvdev->recv_buf);
+   }
+
+   if (nvdev->send_original_buf) {
+   hv_unmap_memory(nvdev->send_buf);
+   vfree(nvdev->send_original_buf);
+   } else {
+   vfree(nvdev->send_buf);
+   }
+
kfree(nvdev->send_section_map);
 
for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
@@ -338,6 +351,7 @@ static int netvsc_init_buf(struct hv_device *device

[PATCH V3 4/5] scsi: storvsc: Add Isolation VM support for storvsc driver

2021-12-01 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM, all memory shared with the host needs to be marked
visible to the host via a hvcall. vmbus_establish_gpadl() has already
done this for the storvsc rx/tx ring buffers. The page buffer used by
vmbus_sendpacket_mpb_desc() still needs to be handled. Use the DMA API
(scsi_dma_map/unmap) to map this memory when sending/receiving packets
and return the swiotlb bounce buffer DMA address. In Isolation VM, the
swiotlb bounce buffer is marked visible to the host and swiotlb force
mode is enabled.

Set device's dma min align mask to HV_HYP_PAGE_SIZE - 1 in order to
keep the original data offset in the bounce buffer.
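
For reference, the min align mask mentioned above is a single call at probe
time; dma_set_min_align_mask() is the generic DMA API helper, though the
exact placement in storvsc_probe() is an assumption here:

	/* Preserve the Hyper-V page offset of the data when it is bounced. */
	dma_set_min_align_mask(&device->device, HV_HYP_PAGE_SIZE - 1);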

Signed-off-by: Tianyu Lan 
---
 drivers/hv/vmbus_drv.c |  1 +
 drivers/scsi/storvsc_drv.c | 37 +
 include/linux/hyperv.h |  1 +
 3 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 0a64ccfafb8b..ae6ec503399a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2121,6 +2121,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
hv_debug_add_dev_dir(child_device_obj);
 
child_device_obj->device.dma_mask = &vmbus_dma_mask;
+   child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
return 0;
 
 err_kset_unregister:
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 20595c0ba0ae..ae293600d799 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+
 #include 
 #include 
 #include 
@@ -1336,6 +1338,7 @@ static void storvsc_on_channel_callback(void *context)
continue;
}
		request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
+   scsi_dma_unmap(scmnd);
}
 
storvsc_on_receive(stor_device, packet, request);
@@ -1749,7 +1752,6 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
struct hv_host_device *host_dev = shost_priv(host);
struct hv_device *dev = host_dev->dev;
struct storvsc_cmd_request *cmd_request = scsi_cmd_priv(scmnd);
-   int i;
struct scatterlist *sgl;
unsigned int sg_count;
struct vmscsi_request *vm_srb;
@@ -1831,10 +1833,11 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
payload_sz = sizeof(cmd_request->mpb);
 
if (sg_count) {
-   unsigned int hvpgoff, hvpfns_to_add;
unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
-   u64 hvpfn;
+   struct scatterlist *sg;
+   unsigned long hvpfn, hvpfns_to_add;
+   int j, i = 0;
 
if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
 
@@ -1848,21 +1851,22 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
payload->range.len = length;
payload->range.offset = offset_in_hvpg;
 
+   sg_count = scsi_dma_map(scmnd);
+   if (sg_count < 0)
+   return SCSI_MLQUEUE_DEVICE_BUSY;
 
-   for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
+   for_each_sg(sgl, sg, sg_count, j) {
/*
-* Init values for the current sgl entry. hvpgoff
-* and hvpfns_to_add are in units of Hyper-V size
-* pages. Handling the PAGE_SIZE != HV_HYP_PAGE_SIZE
-* case also handles values of sgl->offset that are
-* larger than PAGE_SIZE. Such offsets are handled
-* even on other than the first sgl entry, provided
-* they are a multiple of PAGE_SIZE.
+* Init values for the current sgl entry. hvpfns_to_add
+* is in units of Hyper-V size pages. Handling the
+* PAGE_SIZE != HV_HYP_PAGE_SIZE case also handles
+* values of sgl->offset that are larger than PAGE_SIZE.
+* Such offsets are handled even on other than the first
+* sgl entry, provided they are a multiple of PAGE_SIZE.
 */
-   hvpgoff = HVPFN_DOWN(sgl->offset);
-   hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff;
-   hvpfns_to_add = HVPFN_UP(sgl->offset + sgl->length) -
-   hvpgoff;
+   hvpfn = HVPFN_DOWN(sg_dma_address(sg));
+   hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
+sg_dma_len(sg)) - hvpfn;
 
 

[PATCH V3 2/5] x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()

2021-12-01 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides Isolation VMs which have memory encryption support.
Add hyperv_cc_platform_has() and return true for the check of the
GUEST_MEM_ENCRYPT attribute.
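
A hedged usage illustration (hypothetical helper, not part of this patch):
generic code can now key off the attribute instead of vendor checks, roughly:

	static bool dma_needs_bounce_example(void)
	{
		/* True in Hyper-V Isolation VMs once this patch is applied. */
		return cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT);
	}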

Signed-off-by: Tianyu Lan 
---
 arch/x86/kernel/cc_platform.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 03bb2f343ddb..f3bb0431f5c5 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 static bool __maybe_unused intel_cc_platform_has(enum cc_attr attr)
@@ -58,9 +59,23 @@ static bool amd_cc_platform_has(enum cc_attr attr)
 #endif
 }
 
+static bool hyperv_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_HYPERV
+   if (attr == CC_ATTR_GUEST_MEM_ENCRYPT)
+   return true;
+   else
+   return false;
+#else
+   return false;
+#endif
+}
 
 bool cc_platform_has(enum cc_attr attr)
 {
+   if (hv_is_isolation_supported())
+   return hyperv_cc_platform_has(attr);
+
if (sme_me_mask)
return amd_cc_platform_has(attr);
 
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM

2021-12-01 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM with AMD SEV, the bounce buffer needs to be accessed
via an extra address space which is above shared_gpa_boundary (e.g. the
39-bit address line) reported by the Hyper-V CPUID ISOLATION_CONFIG
leaf. The physical address used for the access will be the original
physical address + shared_gpa_boundary. The shared_gpa_boundary in the
AMD SEV-SNP spec is called the virtual top of memory (vTOM). Memory
addresses below vTOM are automatically treated as private while memory
above vTOM is treated as shared.

Expose swiotlb_unencrypted_base for platforms to set the unencrypted
memory base offset, and have the platform call
swiotlb_update_mem_attributes() to remap the swiotlb memory to the
unencrypted address space. memremap() cannot be called in the early
stage, so put the remapping code into swiotlb_update_mem_attributes().
Store the remap address and use it to copy data from/to the swiotlb
bounce buffer.
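
To make the address math concrete, with a made-up boundary value:

	/*
	 * Example only: with shared_gpa_boundary = 1UL << 39, a bounce page at
	 * physical address 0x1000 is accessed through the unencrypted alias
	 *
	 *	0x1000 + (1UL << 39) = 0x8000001000
	 *
	 * which is the address swiotlb_mem_remap() hands to memremap() once
	 * the platform has set swiotlb_unencrypted_base.
	 */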

Signed-off-by: Tianyu Lan 
---
Change since v2:
* Leave mem->vaddr with phys_to_virt(mem->start) when fail
  to remap swiotlb memory.

Change since v1:
* Rework comment in the swiotlb_init_io_tlb_mem()
* Make swiotlb_init_io_tlb_mem() back to return void.
---
 include/linux/swiotlb.h |  6 ++
 kernel/dma/swiotlb.c| 47 -
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 569272871375..f6c3638255d5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
  * @end:   The end address of the swiotlb memory pool. Used to do a quick
  * range check to see if the memory was in fact allocated by this
  * API.
+ * @vaddr: The vaddr of the swiotlb memory pool. The swiotlb memory pool
+ * may be remapped in the memory encrypted case and store virtual
+ * address for bounce buffer operation.
  * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and
  * @end. For default swiotlb, this is command line adjustable via
  * setup_io_tlb_npages.
@@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
 struct io_tlb_mem {
phys_addr_t start;
phys_addr_t end;
+   void *vaddr;
unsigned long nslabs;
unsigned long used;
unsigned int index;
@@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
 }
 #endif /* CONFIG_DMA_RESTRICTED_POOL */
 
+extern phys_addr_t swiotlb_unencrypted_base;
+
 #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c..adb9d06af5c8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;
 
 struct io_tlb_mem io_tlb_default_mem;
 
+phys_addr_t swiotlb_unencrypted_base;
+
 /*
  * Max segment that we can provide which (if pages are contingous) will
  * not be bounced (unless SWIOTLB_FORCE is set).
@@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
return DIV_ROUND_UP(val, IO_TLB_SIZE);
 }
 
+/*
+ * Remap swioltb memory in the unencrypted physical address space
+ * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
+ * Isolation VMs).
+ */
+void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
+{
+   void *vaddr = NULL;
+
+   if (swiotlb_unencrypted_base) {
+   phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
+
+   vaddr = memremap(paddr, bytes, MEMREMAP_WB);
+   if (!vaddr)
+   pr_err("Failed to map the unencrypted memory %llx size %lx.\n",
+  paddr, bytes);
+   }
+
+   return vaddr;
+}
+
 /*
  * Early SWIOTLB allocation may be too early to allow an architecture to
  * perform the desired operations.  This function allows the architecture to
@@ -172,7 +196,12 @@ void __init swiotlb_update_mem_attributes(void)
vaddr = phys_to_virt(mem->start);
bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
-   memset(vaddr, 0, bytes);
+
+   mem->vaddr = swiotlb_mem_remap(mem, bytes);
+   if (!mem->vaddr)
+   mem->vaddr = vaddr;
+
+   memset(mem->vaddr, 0, bytes);
 }
 
 static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
@@ -196,7 +225,18 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
mem->slots[i].alloc_size = 0;
}
+
+   /*
+* If swiotlb_unencrypted_base is set, the bounce buffer memory will
+* be remapped and cleared in swiotlb_update_mem_attributes.
+*/
+   if (swiotlb_unencrypted_base)
+   return;
+
+   set_memory_decrypted(

[PATCH V3 0/5] x86/Hyper-V: Add Hyper-V Isolation VM support(Second part)

2021-12-01 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides two kinds of Isolation VMs: VBS (Virtualization-Based
Security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset
adds support for these Isolation VMs in Linux.

The memory of these VMs is encrypted and the host can't access guest
memory directly. Hyper-V provides a new host-visibility hvcall, and the
guest needs to call it to mark memory visible to the host before
sharing that memory with the host. For security, network/storage stack
memory should not be shared with the host, hence the need for bounce
buffers.

The VMBus channel ring buffer already plays the bounce buffer role
because all data from/to the host is copied between the ring buffer and
IO stack memory. So mark the VMBus channel ring buffer visible.

For an SNP Isolation VM, the guest needs to access the shared memory
via an extra address space, which is specified by the Hyper-V CPUID
leaf HYPERV_CPUID_ISOLATION_CONFIG. The physical address used to access
the shared memory should be the bounce buffer memory GPA plus the
shared_gpa_boundary reported by CPUID.

This patchset enables the swiotlb bounce buffer for netvsc/storvsc in
Isolation VMs.

This version follows Michael Kelley suggestion in the following link.
https://lkml.org/lkml/2021/11/24/2044

Change since v2:
 * Remove Hyper-V dma ops and dma_alloc/free_noncontiguous. Add
   hv_map/unmap_memory() to map/umap netvsc rx/tx ring into extra
   address space.
 * Leave mem->vaddr in swiotlb code with phys_to_virt(mem->start)
   when fail to remap swiotlb memory.

Change since v1:
 * Add Hyper-V Isolation support check in the cc_platform_has()
   and return true for guest memory encrypt attr.
 * Remove hv isolation check in the sev_setup_arch()

Tianyu Lan (5):
  Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
  x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()
  hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM
  scsi: storvsc: Add Isolation VM support for storvsc driver
  hv_netvsc: Add Isolation VM support for netvsc driver

 arch/x86/hyperv/ivm.c |  28 ++
 arch/x86/kernel/cc_platform.c |  15 
 arch/x86/xen/pci-swiotlb-xen.c|   3 +-
 drivers/hv/hv_common.c|  11 +++
 drivers/hv/vmbus_drv.c|   4 +
 drivers/iommu/hyperv-iommu.c  |  56 
 drivers/net/hyperv/hyperv_net.h   |   5 ++
 drivers/net/hyperv/netvsc.c   | 136 +-
 drivers/net/hyperv/netvsc_drv.c   |   1 +
 drivers/net/hyperv/rndis_filter.c |   2 +
 drivers/scsi/storvsc_drv.c|  37 
 include/asm-generic/mshyperv.h|   2 +
 include/linux/hyperv.h|  14 +++
 include/linux/swiotlb.h   |   6 ++
 kernel/dma/swiotlb.c  |  47 +--
 15 files changed, 342 insertions(+), 25 deletions(-)

-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 2/2] arm64: tegra: Add ISO SMMU controller for Tegra194

2021-12-01 Thread Jon Hunter via iommu
The display controllers are attached to a separate ARM SMMU instance
that is dedicated to servicing isochronous memory clients. Add this ISO
instance of the ARM SMMU to device tree.

Please note that the display controllers are not hooked up to this SMMU
yet, because we are still missing a means to transition framebuffers
used by the bootloader to the kernel.

This is based upon an initial patch by Thierry Reding.

Signed-off-by: Jon Hunter 
---
 arch/arm64/boot/dts/nvidia/tegra194.dtsi | 76 
 1 file changed, 76 insertions(+)

diff --git a/arch/arm64/boot/dts/nvidia/tegra194.dtsi b/arch/arm64/boot/dts/nvidia/tegra194.dtsi
index d78c9ed08c47..496e31b5c637 100644
--- a/arch/arm64/boot/dts/nvidia/tegra194.dtsi
+++ b/arch/arm64/boot/dts/nvidia/tegra194.dtsi
@@ -1474,6 +1474,82 @@ pmc: pmc@c36 {
interrupt-controller;
};
 
+   iommu@1000 {
+   compatible = "nvidia,tegra194-smmu", "nvidia,smmu-500";
+   reg = <0x1000 0x80>;
+   interrupts = ,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+;
+   stream-match-mask = <0x7f80>;
+   #global-interrupts = <1>;
+   #iommu-cells = <1>;
+
+   nvidia,memory-controller = <&mc>;
+   status = "okay";
+   };
+
smmu: iommu@1200 {
compatible = "nvidia,tegra194-smmu", "nvidia,smmu-500";
reg = <0x1200 0x80>,
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/2] dt-bindings: arm-smmu: Fix json-schema for Tegra

2021-12-01 Thread Jon Hunter via iommu
The dt_binding_check currently issues the following warnings for the
Tegra186 and Tegra194 SMMUs ...

 arch/arm64/boot/dts/nvidia/tegra186-p2771-.dt.yaml: iommu@1200:
  'nvidia,memory-controller' does not match any of the regexes: 'pinctrl-[0-9]+'
From schema: Documentation/devicetree/bindings/iommu/arm,smmu.yaml
  DTC arch/arm64/boot/dts/nvidia/tegra186-p3509-+p3636-0001.dt.yaml
  CHECK   arch/arm64/boot/dts/nvidia/tegra186-p3509-+p3636-0001.dt.yaml

Add the missing 'nvidia,memory-controller' property to fix the above
warning.

Signed-off-by: Jon Hunter 
---
 Documentation/devicetree/bindings/iommu/arm,smmu.yaml | 4 
 1 file changed, 4 insertions(+)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
index f66a3effba73..5dce07b12cd7 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
@@ -155,6 +155,10 @@ properties:
   power-domains:
 maxItems: 1
 
+  nvidia,memory-controller:
+$ref: /schemas/types.yaml#/definitions/phandle
+description: phandle of the memory controller node
+
 required:
   - compatible
   - reg
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/2] iommu: arm-smmu-impl: Add SM8450 qcom iommu implementation

2021-12-01 Thread Konrad Dybcio


On 01.12.2021 08:39, Vinod Koul wrote:
> Add SM8450 qcom iommu implementation to the table of
> qcom_smmu_impl_of_match table which brings in iommu support for
> SM8450 SoC
>
> Signed-off-by: Vinod Koul 
> Tested-by: Dmitry Baryshkov 
> ---

With deep pain, as we've had to deal with this for a few generations now..

Acked-by: Konrad Dybcio 



Konrad

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Jason Gunthorpe via iommu
On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:

> Looking at the device slices as subdevices with their own struct device
> makes a lot of sense from the conceptual level.

Except IMS is not just for subdevices, it should be usable for any
driver in any case as a general interrupt mechanism, as you alluded to
below about ethernet queues. ntb seems to be the current example of
this need..

If it works properly on the physical PCI dev there is no reason to try
to also make it work on the mdev and add complexity in iommu land.

> Though I fear there is also a use case for MSI-X and IMS tied to the
> same device. That network card you are talking about might end up using
> MSI-X for a control block and then IMS for the actual network queues
> when it is used as physical function device as a whole, but that's
> conceptually a different case.

Is it? Why?

IDXD is not so much making device "slices", but virtualizing and
sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
described and the VFIO driver is simply allocating queues from a PCI
device for specific usages and assigning them interrupts.

There is already a char dev interface that equally allocates queues
from the same IDXD device, why shouldn't it be able to access IMS
interrupt pools too?

IMHO a well designed IDXD driver should put all the PCI MMIO, queue
management, interrupts, etc. in the PCI driver layer, and the VFIO
driver layer should only deal with the MMIO trapping and VFIO
interfacing.

From this view it is conceptually wrong to have the VFIO driver
directly talking to MMIO registers in the PCI device or owning the
irq_chip. It would be very odd for the PCI driver to allocate
interrupts from some random external struct device when it is creating
queues on the PCI device.

> > Another view is the msi_domain_alloc_irqs() flows should have their
> > own xarrays..
> 
> Yes, I was thinking about that as well. The trivial way would be:
> 
> struct xarray store[MSI_MAX_STORES];
> 
> and then have a store index for each allocation domain. With the
> proposed encapsulation of the xarray handling that's definitely
> feasible. Whether that buys much is a different question. Let me think
> about it some more.

Any possibility that the 'IMS' xarray could be outside the struct
device?

Thanks,
Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/2] iommu: arm-smmu-impl: Add SM8450 qcom iommu implementation

2021-12-01 Thread Dmitry Baryshkov

On 01/12/2021 10:39, Vinod Koul wrote:

Add SM8450 qcom iommu implementation to the table of
qcom_smmu_impl_of_match table which brings in iommu support for
SM8450 SoC

Signed-off-by: Vinod Koul 


Tested-by: Dmitry Baryshkov 



---
  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index ca736b065dd0..4aee83330629 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -415,6 +415,7 @@ static const struct of_device_id __maybe_unused qcom_smmu_impl_of_match[] = {
{ .compatible = "qcom,sm8150-smmu-500" },
{ .compatible = "qcom,sm8250-smmu-500" },
{ .compatible = "qcom,sm8350-smmu-500" },
+   { .compatible = "qcom,sm8450-smmu-500" },
{ }
  };
  




--
With best wishes
Dmitry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>> The real problem is where to store the MSI descriptors because the PCI
>> device has its own real PCI/MSI-X interrupts which means it still shares
>> the storage space.
>
> Er.. I never realized that just looking at the patches :|
>
> That is relevant to all real "IMS" users. IDXD escaped this because
> it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
> a messy artifact of VFIO, it should not be required to make the IRQ
> layers work.

> I don't think it makes sense that the msi_desc would point to a mdev,
> the iommu layer consumes the msi_desc_to_dev(), it really should point
> to the physical device that originates the message with a proper
> iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device
makes a lot of sense from the conceptual level. That makes it pretty
much obvious to manage the MSIs of those devices at this level like we
do for any other device.

Whether mdev is the right encapsulation for these subdevices is an
orthogonal problem.

I surely agree that msi_desc::dev is an interesting question, but we
already have this disconnect of msi_desc::dev and DMA today due to DMA
aliasing. I haven't looked at that in detail yet, but of course the
alias handling is substantially different across the various IOMMU
implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as physical function device as a whole, but that's
conceptually a different case.

>> I'm currently tending to partition the index space in the xarray:
>> 
>>  0x - 0x  PCI/MSI-X
>>  0x0001 - 0x0001  NTB
>
> It is OK, with some xarray work it can be range allocating & reserving
> so that the msi_domain_alloc_irqs() flows can carve out chunks of the
> number space..
>
> Another view is the msi_domain_alloc_irqs() flows should have their
> own xarrays..

Yes, I was thinking about that as well. The trivial way would be:

struct xarray store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the
proposed encapsulation of the xarray handling that's definitely
feasible. Whether that buys much is a different question. Let me think
about it some more.
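
To illustrate the per-domain carving idea (the range size and helper name
below are made up for the example, not a proposal of actual values):

	#include <linux/msi.h>
	#include <linux/xarray.h>

	#define MSI_DOMAIN_INDEX_SPAN	0x10000		/* made-up slice size */

	/*
	 * Hypothetical helper: each allocation domain gets its own slice of
	 * the per-device index space, so descriptors can never collide.  The
	 * xarray must have been initialised with XA_FLAGS_ALLOC.
	 */
	static int msi_store_desc(struct xarray *store, unsigned int domid,
				  struct msi_desc *desc)
	{
		struct xa_limit limit = XA_LIMIT(domid * MSI_DOMAIN_INDEX_SPAN,
						 (domid + 1) * MSI_DOMAIN_INDEX_SPAN - 1);
		u32 index;

		return xa_alloc(store, &index, desc, limit, GFP_KERNEL);
	}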

>> which is feasible now with the range modifications and way simpler to do
>> with xarray than with the linked list.
>
> Indeed!

I'm glad you like the approach.

Thanks,

tglx


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu