Re: [syzbot] WARNING in __dma_map_sg_attrs
This means the virtgpu driver uses dma mapping helpers but has not set up a DMA mask (which most likely suggests it is some kind of virtual device). On Wed, Dec 01, 2021 at 10:18:21AM -0800, syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit:c5c17547b778 Merge tag 'net-5.16-rc3' of git://git.kernel... > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=13a73609b0 > kernel config: https://syzkaller.appspot.com/x/.config?x=bf85c53718a1e697 > dashboard link: https://syzkaller.appspot.com/bug?extid=10e27961f4da37c443b2 > compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils > for Debian) 2.35.2 > > Unfortunately, I don't have any reproducer for this issue yet. > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+10e27961f4da37c44...@syzkaller.appspotmail.com > > [ cut here ] > WARNING: CPU: 2 PID: 17169 at kernel/dma/mapping.c:188 > __dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188 > Modules linked in: > CPU: 0 PID: 17169 Comm: syz-executor.3 Not tainted 5.16.0-rc2-syzkaller #0 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 > RIP: 0010:__dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188 > Code: 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 10 00 75 71 4c 8b 3d 70 6d b1 > 0d e9 db fe ff ff e8 86 ff 12 00 0f 0b e8 7f ff 12 00 <0f> 0b 45 31 e4 e9 54 > ff ff ff e8 70 ff 12 00 49 8d 7f 50 48 b8 00 > RSP: 0018:c90002c0fb20 EFLAGS: 00010216 > RAX: 00013018 RBX: 0020 RCX: c900037d4000 > RDX: 0004 RSI: 8163d361 RDI: 8880182ae4d0 > RBP: 8880182ae088 R08: 0002 R09: 888017ba054f > R10: 8163d242 R11: 0008808a R12: > R13: 888024ca5700 R14: 0001 R15: > FS: 7fa269e34700() GS:88802cb0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 0040c120 CR3: 6c77c000 CR4: 00150ee0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > > dma_map_sgtable+0x70/0xf0 kernel/dma/mapping.c:264 > drm_gem_map_dma_buf+0x12a/0x1e0 
drivers/gpu/drm/drm_prime.c:633 > __map_dma_buf drivers/dma-buf/dma-buf.c:675 [inline] > dma_buf_map_attachment+0x39a/0x5b0 drivers/dma-buf/dma-buf.c:954 > drm_gem_prime_import_dev.part.0+0x85/0x220 drivers/gpu/drm/drm_prime.c:939 > drm_gem_prime_import_dev drivers/gpu/drm/drm_prime.c:982 [inline] > drm_gem_prime_import+0xc8/0x200 drivers/gpu/drm/drm_prime.c:982 > virtgpu_gem_prime_import+0x49/0x150 > drivers/gpu/drm/virtio/virtgpu_prime.c:166 > drm_gem_prime_fd_to_handle+0x21d/0x550 drivers/gpu/drm/drm_prime.c:318 > drm_prime_fd_to_handle_ioctl+0x9b/0xd0 drivers/gpu/drm/drm_prime.c:374 > drm_ioctl_kernel+0x27d/0x4e0 drivers/gpu/drm/drm_ioctl.c:782 > drm_ioctl+0x51e/0x9d0 drivers/gpu/drm/drm_ioctl.c:885 > vfs_ioctl fs/ioctl.c:51 [inline] > __do_sys_ioctl fs/ioctl.c:874 [inline] > __se_sys_ioctl fs/ioctl.c:860 [inline] > __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 > do_syscall_x64 arch/x86/entry/common.c:50 [inline] > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 > entry_SYSCALL_64_after_hwframe+0x44/0xae > RIP: 0033:0x7fa26c8beae9 > Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 > 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 > 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 > RSP: 002b:7fa269e34188 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 7fa26c9d1f60 RCX: 7fa26c8beae9 > RDX: 24c0 RSI: c00c642e RDI: 0005 > RBP: 7fa26c918f6d R08: R09: > R10: R11: 0246 R12: > R13: 7ffc0019c51f R14: 7fa269e34300 R15: 00022000 > > > > --- > This report is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this issue. See: > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. ---end quoted text--- ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
Jason,

On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> Which in turn is consistent all over the place and does not require any
> special case for anything. Neither for interrupts nor for anything else.

That said, feel free to tell me that I'm getting it all wrong. The
reason I'm harping on this is that we are creating ABIs on several ends,
and we all know that getting that wrong is a major pain.

Thanks,

        tglx
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
Dave,

On Wed, Dec 01 2021 at 15:53, Dave Jiang wrote:
> On 12/1/2021 3:03 PM, Thomas Gleixner wrote:
>> This still depends on how this overall discussion about representation
>> of all of this stuff is resolved.
>> What needs a subdevice to expose?
>> Can you answer that too please?
>
> Sorry. So initial version of the IDXD sub-device is represented with a
> single queue. It needs a command irq (emulated) and a completion irq
> that is backed by a device vector (currently IMS).

Thanks for the clarification! Let me think about it some more.

        tglx
Re: [PATCH 0/5] iommu/amd: fixes for suspend/resume
On Tue, 2021-11-23 at 18:10 +0200, Maxim Levitsky wrote: > As I sadly found out, a s3 cycle makes the AMD's iommu stop sending interrupts > until the system is rebooted. > > I only noticed it now because otherwise the IOMMU works, and these interrupts > are only used for errors and for GA log which I tend not to use by > making my VMs do mwait/pause/etc in guest (cpu-pm=on). > > There are two issues here that prevent interrupts from being generated after > s3 cycle: > > 1. GA log base address was not restored after resume, and was all zeroed > after resume (by BIOS or such). > > In theory if BIOS writes some junk to it, that can even cause a memory > corruption. > Patch 2 fixes that. > > 2. INTX (aka x2apic mode) settings were not restored after resume. > That mode is used regardless if the host uses/supports x2apic, but rather when > the IOMMU supports it, and mine does. > Patches 3-4 fix that. > > Note that there is still one slight (userspace) bug remaining: > During suspend all but the boot CPU are offlined and then after resume > are onlined again. > > The offlining moves all non-affinity managed interrupts to CPU0, and > later when all other CPUs are onlined, there is nothing in the kernel > to spread back the interrupts over the cores. > > The userspace 'irqbalance' daemon does fix this but it seems to ignore > the IOMMU interrupts in INTX mode since they are not attached to any > PCI device, and thus they remain on CPU0 after a s3 cycle, > which is suboptimal when the system has multiple IOMMUs > (mine has 4 of them). > > Setting the IRQ affinity manually via /proc/irq/ does work. > > This was tested on my 3970X with both INTX and regular MSI mode (later was > enabled > by patching out INTX detection), by running a guest with AVIC enabled and with > a PCI assigned device (network card), and observing interrupts from > IOMMU while guest is mostly idle. 
> > This was also tested on my AMD laptop with 4650U (which has the same issue) > (I tested only INTX mode) > > Patch 1 is a small refactoring to remove an unused struct field. > > Best regards, >Maxim Levitsky > > Maxim Levitsky (5): > iommu/amd: restore GA log/tail pointer on host resume > iommu/amd: x2apic mode: re-enable after resume > iommu/amd: x2apic mode: setup the INTX registers on mask/unmask > iommu/amd: x2apic mode: mask/unmask interrupts on suspend/resume > iommu/amd: remove useless irq affinity notifier > > drivers/iommu/amd/amd_iommu_types.h | 2 - > drivers/iommu/amd/init.c| 107 +++- > 2 files changed, 58 insertions(+), 51 deletions(-) > > -- > 2.26.3 > > Polite ping on these patches. Best regards, Maxim Levitsky ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 3:03 PM, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:
>> On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
>>> How that is backed on the host does not really matter. You can expose
>>> MSI-X to the guest with an INTx backing as well.
>>>
>>> I'm still failing to see the connection between the 9 MSIX vectors and
>>> the 2048 IMS vectors which I assume that this is the limitation of the
>>> physical device, right?
>>
>> I think I was confused with what you were asking and was thinking you
>> are saying why can't we just have MSIX on guest backed by the MSIX on
>> the physical device and thought there would not be enough vectors to
>> service the many guests. I think I understand what your position is now
>> with the clarification above.
>
> This still depends on how this overall discussion about representation
> of all of this stuff is resolved.
>
>>> What needs a subdevice to expose?
>
> Can you answer that too please?

Sorry. So initial version of the IDXD sub-device is represented with a
single queue. It needs a command irq (emulated) and a completion irq
that is backed by a device vector (currently IMS).

> Thanks,
>
>         tglx
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:
> On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
>> How that is backed on the host does not really matter. You can expose
>> MSI-X to the guest with an INTx backing as well.
>>
>> I'm still failing to see the connection between the 9 MSIX vectors and
>> the 2048 IMS vectors which I assume that this is the limitation of the
>> physical device, right?
>
> I think I was confused with what you were asking and was thinking you
> are saying why can't we just have MSIX on guest backed by the MSIX on
> the physical device and thought there would not be enough vectors to
> service the many guests. I think I understand what your position is now
> with the clarification above.

This still depends on how this overall discussion about representation
of all of this stuff is resolved.

>> What needs a subdevice to expose?

Can you answer that too please?

Thanks,

        tglx
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
>> On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
>>>> The hardware implementation does not have enough MSIX vectors for
>>>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048
>>>> IMS vectors. So if we are to do MSI-X for all of them, then we need
>>>> to do the IMS backed MSIX scheme rather than passthrough IMS to
>>>> guests.
>>> Confused. Are you talking about passing a full IDXD device to the
>>> guest or about passing a carved out subdevice, aka. queue?
>> I'm talking about carving out a subdevice. I had the impression of you
>> wanting IMS passed through for all variations. But it sounds like for
>> a sub-device, you are ok with the implementation of MSIX backed by IMS?
> I don't see anything wrong with that. A subdevice is its own entity and
> VFIO can choose the most convenient representation of it to the guest
> obviously.
>
> How that is backed on the host does not really matter. You can expose
> MSI-X to the guest with an INTx backing as well.
>
> I'm still failing to see the connection between the 9 MSIX vectors and
> the 2048 IMS vectors which I assume that this is the limitation of the
> physical device, right?

I think I was confused with what you were asking and was thinking you
are saying why can't we just have MSIX on guest backed by the MSIX on
the physical device and thought there would not be enough vectors to
service the many guests. I think I understand what your position is now
with the clarification above.

> What needs a subdevice to expose?
>
> Thanks,
>
>         tglx
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
> On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
>>> The hardware implementation does not have enough MSIX vectors for
>>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
>>> vectors. So if we are to do MSI-X for all of them, then we need to do
>>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
>> Confused. Are you talking about passing a full IDXD device to the guest
>> or about passing a carved out subdevice, aka. queue?
>
> I'm talking about carving out a subdevice. I had the impression of you
> wanting IMS passed through for all variations. But it sounds like for a
> sub-device, you are ok with the implementation of MSIX backed by IMS?

I don't see anything wrong with that. A subdevice is its own entity and
VFIO can choose the most convenient representation of it to the guest
obviously.

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with an INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors which I assume that this is the limitation of the
physical device, right?

What needs a subdevice to expose?

Thanks,

        tglx
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:
>> On 12/1/2021 11:41 AM, Thomas Gleixner wrote:
>>>> Hi Thomas. This is actually the IDXD usage for a mediated device
>>>> passed to a guest kernel when we plumb the pass through of IMS to
>>>> the guest rather than doing previous implementation of having a
>>>> MSIX vector on guest backed by IMS.
>>> Which makes a lot of sense.
>>>> The control block for the mediated device is emulated and therefore
>>>> an emulated MSIX vector will be surfaced as vector 0. However the
>>>> queues will be backed by IMS vectors. So we end up needing MSIX and
>>>> IMS coexist running on the guest kernel for the same device.
>>> Why? What's wrong with using straight MSI-X for all of them?
>> The hardware implementation does not have enough MSIX vectors for
>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048
>> IMS vectors. So if we are to do MSI-X for all of them, then we need to
>> do the IMS backed MSIX scheme rather than passthrough IMS to guests.
> Confused. Are you talking about passing a full IDXD device to the guest
> or about passing a carved out subdevice, aka. queue?

I'm talking about carving out a subdevice. I had the impression of you
wanting IMS passed through for all variations. But it sounds like for a
sub-device, you are ok with the implementation of MSIX backed by IMS?

> Thanks,
>
>         tglx
[syzbot] WARNING in __dma_map_sg_attrs
Hello, syzbot found the following issue on: HEAD commit:c5c17547b778 Merge tag 'net-5.16-rc3' of git://git.kernel... git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=13a73609b0 kernel config: https://syzkaller.appspot.com/x/.config?x=bf85c53718a1e697 dashboard link: https://syzkaller.appspot.com/bug?extid=10e27961f4da37c443b2 compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2 Unfortunately, I don't have any reproducer for this issue yet. IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+10e27961f4da37c44...@syzkaller.appspotmail.com [ cut here ] WARNING: CPU: 2 PID: 17169 at kernel/dma/mapping.c:188 __dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188 Modules linked in: CPU: 0 PID: 17169 Comm: syz-executor.3 Not tainted 5.16.0-rc2-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 RIP: 0010:__dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188 Code: 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 10 00 75 71 4c 8b 3d 70 6d b1 0d e9 db fe ff ff e8 86 ff 12 00 0f 0b e8 7f ff 12 00 <0f> 0b 45 31 e4 e9 54 ff ff ff e8 70 ff 12 00 49 8d 7f 50 48 b8 00 RSP: 0018:c90002c0fb20 EFLAGS: 00010216 RAX: 00013018 RBX: 0020 RCX: c900037d4000 RDX: 0004 RSI: 8163d361 RDI: 8880182ae4d0 RBP: 8880182ae088 R08: 0002 R09: 888017ba054f R10: 8163d242 R11: 0008808a R12: R13: 888024ca5700 R14: 0001 R15: FS: 7fa269e34700() GS:88802cb0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0040c120 CR3: 6c77c000 CR4: 00150ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: dma_map_sgtable+0x70/0xf0 kernel/dma/mapping.c:264 drm_gem_map_dma_buf+0x12a/0x1e0 drivers/gpu/drm/drm_prime.c:633 __map_dma_buf drivers/dma-buf/dma-buf.c:675 [inline] dma_buf_map_attachment+0x39a/0x5b0 drivers/dma-buf/dma-buf.c:954 drm_gem_prime_import_dev.part.0+0x85/0x220 drivers/gpu/drm/drm_prime.c:939 drm_gem_prime_import_dev drivers/gpu/drm/drm_prime.c:982 [inline] 
drm_gem_prime_import+0xc8/0x200 drivers/gpu/drm/drm_prime.c:982 virtgpu_gem_prime_import+0x49/0x150 drivers/gpu/drm/virtio/virtgpu_prime.c:166 drm_gem_prime_fd_to_handle+0x21d/0x550 drivers/gpu/drm/drm_prime.c:318 drm_prime_fd_to_handle_ioctl+0x9b/0xd0 drivers/gpu/drm/drm_prime.c:374 drm_ioctl_kernel+0x27d/0x4e0 drivers/gpu/drm/drm_ioctl.c:782 drm_ioctl+0x51e/0x9d0 drivers/gpu/drm/drm_ioctl.c:885 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fa26c8beae9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:7fa269e34188 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 7fa26c9d1f60 RCX: 7fa26c8beae9 RDX: 24c0 RSI: c00c642e RDI: 0005 RBP: 7fa26c918f6d R08: R09: R10: R11: 0246 R12: R13: 7ffc0019c51f R14: 7fa269e34300 R15: 00022000 --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegroups.com. syzbot will keep track of this issue. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote: > On 12/1/2021 11:41 AM, Thomas Gleixner wrote: >>> Hi Thomas. This is actually the IDXD usage for a mediated device passed >>> to a guest kernel when we plumb the pass through of IMS to the guest >>> rather than doing previous implementation of having a MSIX vector on >>> guest backed by IMS. >> Which makes a lot of sense. >> >>> The control block for the mediated device is emulated and therefore an >>> emulated MSIX vector will be surfaced as vector 0. However the queues >>> will backed by IMS vectors. So we end up needing MSIX and IMS coexist >>> running on the guest kernel for the same device. >> Why? What's wrong with using straight MSI-X for all of them? > > The hardware implementation does not have enough MSIX vectors for > guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS > vectors. So if we are to do MSI-X for all of them, then we need to do > the IMS backed MSIX scheme rather than passthrough IMS to guests. Confused. Are you talking about passing a full IDXD device to the guest or about passing a carved out subdevice, aka. queue? Thanks, tglx ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
Jason,

On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
>> But NTB is operating through an abstraction layer and is not a direct
>> PCIe device driver.
>
> I'm not sure exactly how NTB seems to be split between switchtec and
> the ntb code, but since the ntbd code seems to be doing MMIO touches,
> it feels like part of a PCIe driver?

It's a maze of magic PCIe driver with callbacks left and right into the
underlying NTB hardware driver which itself does some stuff on its own
and also calls back into the underlying PCIe driver. Decomposing that
thing feels like being trapped in the Labyrinth of Knossos. But let's
ignore that for a moment.

>> The VFIO driver does not own the irq chip ever. The irq chip is of
>> course part of the underlying infrastructure. I never asked for that.
>
> That isn't quite what I meant.. I meant the PCIe driver cannot create
> the domain or make use of the irq_chip until the VFIO layer comes
> along and provides the struct device. To me this is backwards
> layering, the interrupts come from the PCIe layer and should exist
> independently from VFIO.

See below.

>> When it allocates a slice for whatever usage then it also
>> allocates the IMS interrupts (though the VFIO people want to
>> have only one and do the allocations later on demand).
>>
>> That allocation cannot be part of the PCI/MSIx interrupt
>> domain as we already agreed on.
>
> Yes, it is just an open question of where the new irq_domain needs to
> reside

The irqdomain is created by and part of the physical device. But that
does not mean that the interrupts which are allocated from that irq
domain are stored in the physical device representation.

Going by that logic, the PCI/MSI domain would store all MSI[X]
interrupts which are allocated on some root bridge or wherever. They
are obviously stored per PCI device, but the irqdomain is owned e.g. by
the underlying IOMMU zone.

The irqdomain is managing and handing out resources. Like any other
resource manager does. See?

>> 1) Storage
>>
>>    A) Having "subdevices" solves the storage problem nicely and
>>       makes everything just fall in place. Even for a purely
>>       physical multiqueue device one can argue that each queue is a
>>       "subdevice" of the physical device. The fact that we lump them
>>       all together today is not an argument against that.
>
> I don't like the idea that queue is a device, that is trying to force
> a struct device centric world onto a queue which doesn't really want
> it..

Here we are at the point where we agree to disagree. Of course a queue
is a resource, but what prevents us from representing a queue as a
carved out subdevice of the physical device?

Look at it from the VF point of view. If VFs are disabled then all
resources belong to the physical device. If VFs are enabled then the
hardware/firmware splits the resources into separate subdevices which
have their own device representation, interrupt storage etc.

As VFs have a scalability limitation due to the underlying PCIe
restrictions, the step we are talking about now is to split the queues
up in software, which means nothing else than creating a software
representation of finer grained and more scalable subdevices.

So why would we want to pretend that these are not devices? They are,
from a conceptual and topology view, a subdevice of the physical
device. Just because they are named queues does not make it any
different.

Let's look at VFIO again. If VFIO passes a VF through then it builds a
wrapper around the VF device. So if a queue is represented as a
subdevice, then VFIO can just build a wrapper around that subdevice.
But with your model, VFIO has to create a device, request a queue,
wrestle the interrupts in place, etc. Which is exactly the opposite of
how VFs are handled.

So again, why would we want to make software managed subdevices look
exactly the opposite way from hardware/firmware managed subdevices?

Let me also look at the cdev which is exposed by the physical device.
The cdev is nothing else than a software vehicle to create a /dev/ node
(ignore the switchtec case which (ab)uses the cdev to connect to NTB).
The cdev allows user space to allocate/release a resource. Of course it
can just allocate a queue and stick the necessary interrupts into the
physical device MSI descriptor storage. But there is no reason why the
queue allocation cannot allocate a subdevice representing the queue.

That makes the VFIO and the cdev case just use the same underlying
representation which can expose its properties via the underlying
device. Which in turn is consistent all over the place and does not
require any special case for anything. Neither for interrupts nor for
anything else.

Thanks,

        tglx
Re: [PATCH v2 29/33] iommu: Use put_pages_list
On 2021-12-01 19:07, Matthew Wilcox wrote:
> On Wed, Dec 01, 2021 at 07:15:06PM +0100, Vlastimil Babka wrote:
>> From: "Matthew Wilcox (Oracle)"
>>
>> page->freelist is for the use of slab. We already have the ability
>> to free a list of pages in the core mm, but it requires the use of a
>> list_head and for the pages to be chained together through page->lru.
>> Switch the iommu code over to using put_pages_list().
>
> FYI, this is going to have conflicts with
>
> https://lore.kernel.org/lkml/cover.1637671820.git.robin.mur...@arm.com/
>
> I'm not sure what the appropriate resolution is going to be here;
> maybe hold back part of this patch series to the following merge
> window to give the iommu people time to merge their own patches?

More than that, this version is subtly but catastrophically broken - we
can't simply pass &gather->freelist through the current IOVA entry_dtor
machinery, since gather is a stack variable from a few call frames up
so the actual list head will be long gone by the time
iommu_dma_entry_dtor() tries to dereference it. It took until I was
elbow-deep in refactoring the RFC to notice that one :)

Thanks,
Robin.
Re: [PATCH v2 29/33] iommu: Use put_pages_list
On Wed, Dec 01, 2021 at 07:15:06PM +0100, Vlastimil Babka wrote:
> From: "Matthew Wilcox (Oracle)"
>
> page->freelist is for the use of slab. We already have the ability
> to free a list of pages in the core mm, but it requires the use of a
> list_head and for the pages to be chained together through page->lru.
> Switch the iommu code over to using put_pages_list().

FYI, this is going to have conflicts with

https://lore.kernel.org/lkml/cover.1637671820.git.robin.mur...@arm.com/

I'm not sure what the appropriate resolution is going to be here;
maybe hold back part of this patch series to the following merge
window to give the iommu people time to merge their own patches?
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 11:41 AM, Thomas Gleixner wrote: Dave, please trim your replies. On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote: On 12/1/2021 3:16 AM, Thomas Gleixner wrote: Jason, CC+ IOMMU folks On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote: On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote: Though I fear there is also a use case for MSI-X and IMS tied to the same device. That network card you are talking about might end up using MSI-X for a control block and then IMS for the actual network queues when it is used as physical function device as a whole, but that's conceptually a different case. Hi Thomas. This is actually the IDXD usage for a mediated device passed to a guest kernel when we plumb the pass through of IMS to the guest rather than doing previous implementation of having a MSIX vector on guest backed by IMS. Which makes a lot of sense. The control block for the mediated device is emulated and therefore an emulated MSIX vector will be surfaced as vector 0. However the queues will backed by IMS vectors. So we end up needing MSIX and IMS coexist running on the guest kernel for the same device. Why? What's wrong with using straight MSI-X for all of them? The hardware implementation does not have enough MSIX vectors for guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS vectors. So if we are to do MSI-X for all of them, then we need to do the IMS backed MSIX scheme rather than passthrough IMS to guests. Thanks, tglx ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 2021-12-01 11:14 a.m., 'Jason Gunthorpe' via linux-ntb wrote: > On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote: >> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote: >>> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote: Looking at the device slices as subdevices with their own struct device makes a lot of sense from the conceptual level. >>> >>> Except IMS is not just for subdevices, it should be usable for any >>> driver in any case as a general interrupt mechiansm, as you alluded to >>> below about ethernet queues. ntb seems to be the current example of >>> this need.. >> >> But NTB is operating through an abstraction layer and is not a direct >> PCIe device driver. > > I'm not sure exactly how NTB seems to be split between switchtec and > the ntb code, but since the ntbd code seems to be doing MMIO touches, > it feels like part of a PCIe driver? Eh, sort of. NTB has lots of layers. At the top you'll find ntb_netdev which is an network interface. Below that is ntb_transport() which is a generic queue pair. Below that is the hardware driver itself (ie switchtec) through the abstraction layer. The switchtec driver is split in two: the main driver which just allows for information and administration of the switch itself and switchtec_ntb which is the module that provides the NTB abstractions to twiddle its registers. ntb_transport() doesn't directly do MMIO touches (as it doesn't know what the underlying hardware registers are). Except for the memory windows which are usually setup to be a specific BAR (or parts of a BAR). ntb_transport will do MMIO writes to those specific BAR address which correspond to writing into buffers on the peer. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
Dave, please trim your replies. On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote: > On 12/1/2021 3:16 AM, Thomas Gleixner wrote: >> Jason, >> >> CC+ IOMMU folks >> >> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote: >>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote: >> >> Though I fear there is also a use case for MSI-X and IMS tied to the >> same device. That network card you are talking about might end up using >> MSI-X for a control block and then IMS for the actual network queues >> when it is used as physical function device as a whole, but that's >> conceptually a different case. > > Hi Thomas. This is actually the IDXD usage for a mediated device passed > to a guest kernel when we plumb the pass through of IMS to the guest > rather than doing previous implementation of having a MSIX vector on > guest backed by IMS. Which makes a lot of sense. > The control block for the mediated device is emulated and therefore an > emulated MSIX vector will be surfaced as vector 0. However the queues > will backed by IMS vectors. So we end up needing MSIX and IMS coexist > running on the guest kernel for the same device. Why? What's wrong with using straight MSI-X for all of them? Thanks, tglx ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 29/33] iommu: Use put_pages_list
From: "Matthew Wilcox (Oracle)" page->freelist is for the use of slab. We already have the ability to free a list of pages in the core mm, but it requires the use of a list_head and for the pages to be chained together through page->lru. Switch the iommu code over to using put_pages_list(). Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Vlastimil Babka Cc: Joerg Roedel Cc: Suravee Suthikulpanit Cc: Will Deacon Cc: David Woodhouse Cc: Lu Baolu Cc: --- drivers/iommu/amd/io_pgtable.c | 59 +- drivers/iommu/dma-iommu.c | 11 + drivers/iommu/intel/iommu.c| 89 -- include/linux/iommu.h | 3 +- 4 files changed, 57 insertions(+), 105 deletions(-) diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c index 182c93a43efd..08ea6a02cda9 100644 --- a/drivers/iommu/amd/io_pgtable.c +++ b/drivers/iommu/amd/io_pgtable.c @@ -74,27 +74,15 @@ static u64 *first_pte_l7(u64 *pte, unsigned long *page_size, * / -static void free_page_list(struct page *freelist) -{ - while (freelist != NULL) { - unsigned long p = (unsigned long)page_address(freelist); - - freelist = freelist->freelist; - free_page(p); - } -} - -static struct page *free_pt_page(unsigned long pt, struct page *freelist) +static void free_pt_page(unsigned long pt, struct list_head *list) { struct page *p = virt_to_page((void *)pt); - p->freelist = freelist; - - return p; + list_add(&p->lru, list); } #define DEFINE_FREE_PT_FN(LVL, FN) \ -static struct page *free_pt_##LVL (unsigned long __pt, struct page *freelist) \ +static void free_pt_##LVL (unsigned long __pt, struct list_head *list) \ { \ unsigned long p; \ u64 *pt; \ @@ -113,10 +101,10 @@ static struct page *free_pt_##LVL (unsigned long __pt, struct page *freelist) \ continue; \ \ p = (unsigned long)IOMMU_PTE_PAGE(pt[i]); \ - freelist = FN(p, freelist); \ + FN(p, list); \ } \ \ - return free_pt_page((unsigned long)pt, freelist); \ + free_pt_page((unsigned long)pt, list); \ } DEFINE_FREE_PT_FN(l2, free_pt_page) @@ -125,36 +113,33 @@ 
DEFINE_FREE_PT_FN(l4, free_pt_l3) DEFINE_FREE_PT_FN(l5, free_pt_l4) DEFINE_FREE_PT_FN(l6, free_pt_l5) -static struct page *free_sub_pt(unsigned long root, int mode, - struct page *freelist) +static void free_sub_pt(unsigned long root, int mode, struct list_head *list) { switch (mode) { case PAGE_MODE_NONE: case PAGE_MODE_7_LEVEL: break; case PAGE_MODE_1_LEVEL: - freelist = free_pt_page(root, freelist); + free_pt_page(root, list); break; case PAGE_MODE_2_LEVEL: - freelist = free_pt_l2(root, freelist); + free_pt_l2(root, list); break; case PAGE_MODE_3_LEVEL: - freelist = free_pt_l3(root, freelist); + free_pt_l3(root, list); break; case PAGE_MODE_4_LEVEL: - freelist = free_pt_l4(root, freelist); + free_pt_l4(root, list); break; case PAGE_MODE_5_LEVEL: - freelist = free_pt_l5(root, freelist); + free_pt_l5(root, list); break; case PAGE_MODE_6_LEVEL: - freelist = free_pt_l6(root, freelist); + free_pt_l6(root, list); break; default: BUG(); } - - return freelist; } void amd_iommu_domain_set_pgtable(struct protection_domain *domain, @@ -362,7 +347,7 @@ static u64 *fetch_pte(struct amd_io_pgtable *pgtable, return pte; } -static struct page *free_clear_pte(u64 *pte, u64 pteval, struct page *freelist) +static void free_clear_pte(u64 *pte, u64 pteval, struct list_head *list) { unsigned long pt; int mode; @@ -373,12 +358,12 @@ static struct page *free_clear_pte(u64 *pte, u64 pteval, struct page *freelist) } if (!IOMMU_PTE_PRESENT(pteval)) -
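The conversion above replaces the hand-rolled page->freelist chain with the kernel's list_head plumbing. As a rough userspace sketch of the pattern (stand-in types and helpers, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the kernel's list_head and struct page; this only
 * illustrates the shape of the conversion, not the real API. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

struct page {
	struct list_head lru;	/* replaces chaining via page->freelist */
	void *addr;
};

/* Equivalent of free_pt_page(): instead of returning a new chain head,
 * the page is simply added to the caller's list. */
static void free_pt_page(struct page *p, struct list_head *list)
{
	list_add(&p->lru, list);
}

/* Equivalent of put_pages_list(): release every page on the list. */
static int put_pages_list(struct list_head *pages)
{
	int freed = 0;
	struct list_head *pos = pages->next;

	while (pos != pages) {
		struct list_head *next = pos->next;
		struct page *p = (struct page *)((char *)pos -
						 offsetof(struct page, lru));

		free(p->addr);
		free(p);
		freed++;
		pos = next;
	}
	list_init(pages);
	return freed;
}

int collect_and_free(int npages)
{
	struct list_head freelist;
	int i;

	list_init(&freelist);
	for (i = 0; i < npages; i++) {
		struct page *p = malloc(sizeof(*p));

		p->addr = malloc(16);
		free_pt_page(p, &freelist);
	}
	return put_pages_list(&freelist);
}
```

The callers thus no longer thread a return value through every level of the page-table walk; they pass one list_head down and free the whole batch at the end.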
[PATCH v2 00/33] Separate struct slab from struct page
Folks from non-slab subsystems are Cc'd only on patches affecting them, and on this cover letter.

Series also available in git, based on 5.16-rc3:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-struct_slab-v2r2

The plan: as with my SLUB PREEMPT_RT series in 5.15, I would prefer to go again with the git pull request way of eventually merging this, as it's also not a small series. I will thus reply to this mail asking to include my branch in linux-next. As stated in the v1/RFC cover letter, I wouldn't mind then continuing to maintain a git tree for all slab patches in general. It was apparently already done that way before, by Pekka:
https://lore.kernel.org/linux-mm/alpine.DEB.2.00.1107221108190.2996@tiger/

Changes from v1/RFC:
https://lore.kernel.org/all/2026001628.24216-1-vba...@suse.cz/
- Added virt_to_folio() and folio_address() in the new Patch 1.
- Addressed feedback from Andrey Konovalov and Matthew Wilcox (Thanks!)
- Added Tested-by: Marco Elver for the KFENCE parts (Thanks!)

Previous version from Matthew Wilcox:
https://lore.kernel.org/all/20211004134650.4031813-1-wi...@infradead.org/
LWN coverage of the above:
https://lwn.net/Articles/871982/

This is originally an offshoot of the folio work by Matthew. One of the more complex parts of the struct page definition is the set of fields used by the slab allocators. It would be good for the MM in general if struct slab were its own data type, and it also helps to prevent tail pages from slipping in anywhere. As Matthew requested in his proof of concept series, I have taken over the development of this series, so it's a mix of patches from him (often modified by me) and my own.

One big difference is the use of coccinelle to perform the relatively trivial parts of the conversions automatically and at once, instead of a larger number of smaller incremental reviewable steps. Thanks to Julia Lawall and Luis Chamberlain for all their help!
Another notable difference (based also on review feedback) is that I don't represent the large kmalloc allocations with a struct slab; they are not really slabs, but use the page allocator directly. When going from an object address to a struct slab, the code first tests the folio slab flag, and only if it is set converts to a struct slab. This makes the struct slab type stronger.

Finally, although Matthew's version didn't use any of the folio work, the initial support has been merged in the meantime, so my version builds on top of it where appropriate. This eliminates some of the redundant compound_head() calls, e.g. when testing the slab flag.

To sum up: after this series, struct page fields used by the slab allocators are moved from struct page to a new struct slab that uses the same physical storage. The availability of the fields is further distinguished by the selected slab allocator implementation. The advantages include:

- Similar to folios, if the slab is of order > 0, struct slab is always guaranteed to be the head page. Additionally it's guaranteed to be an actual slab page, not a large kmalloc. This removes uncertainty and potential for bugs.

- It's not possible to accidentally use fields of a slab implementation that's not configured.

- Other subsystems cannot use slab's fields in struct page anymore (some existing non-slab usages had to be adjusted in this series), so slab implementations have more freedom in rearranging them in the struct slab.
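The "stronger type" point can be pictured with a small mock (hypothetical names and layouts, not the kernel code): conversion from a folio only succeeds when the slab flag is set, so a large kmalloc folio can never masquerade as a slab.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock types: in the kernel, struct slab shares physical storage with
 * struct page/folio; here a flag simply marks whether the folio backs
 * a slab or a large kmalloc allocation. */
struct folio { bool slab_flag; };
struct slab { struct folio folio; };	/* same storage, stronger type */

/* Mock of the lookup described above: first test the slab flag, and
 * only then convert, so large kmalloc folios yield NULL rather than a
 * bogus struct slab. */
static struct slab *folio_slab(struct folio *folio)
{
	if (!folio->slab_flag)
		return NULL;
	return (struct slab *)folio;
}
```

Code that receives a struct slab pointer can then rely on the invariant that it really is a slab (and a head page), instead of re-checking at every use.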
Matthew Wilcox (Oracle) (16): mm: Split slab into its own type mm: Add account_slab() and unaccount_slab() mm: Convert virt_to_cache() to use struct slab mm: Convert __ksize() to struct slab mm: Use struct slab in kmem_obj_info() mm: Convert check_heap_object() to use struct slab mm/slub: Convert detached_freelist to use a struct slab mm/slub: Convert kfree() to use a struct slab mm/slub: Convert print_page_info() to print_slab_info() mm/slub: Convert pfmemalloc_match() to take a struct slab mm/slob: Convert SLOB to use struct slab mm/kasan: Convert to struct folio and struct slab zsmalloc: Stop using slab fields in struct page bootmem: Use page->index instead of page->freelist iommu: Use put_pages_list mm: Remove slab from struct page Vlastimil Babka (17): mm: add virt_to_folio() and folio_address() mm/slab: Dissolve slab_map_pages() in its caller mm/slub: Make object_err() static mm/slub: Convert __slab_lock() and __slab_unlock() to struct slab mm/slub: Convert alloc_slab_page() to return a struct slab mm/slub: Convert __free_slab() to use struct slab mm/slub: Convert most struct page to struct slab by spatch mm/slub: Finish struct page to struct slab conversion mm/slab: Convert kmem_getpages() and kmem_freepages() to struct slab mm/slab: Convert most struct page to struct slab by spatch mm/slab: Finish struct page to struct slab conversion mm: Convert struct page to struct slab in functions used by other subsystems mm/memcg: Convert slab objcgs from struct page
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> > On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
> >> Looking at the device slices as subdevices with their own struct device
> >> makes a lot of sense from the conceptual level.
> >
> > Except IMS is not just for subdevices, it should be usable for any
> > driver in any case as a general interrupt mechanism, as you alluded to
> > below about ethernet queues. ntb seems to be the current example of
> > this need..
>
> But NTB is operating through an abstraction layer and is not a direct
> PCIe device driver.

I'm not sure exactly how NTB seems to be split between switchtec and the ntb code, but since the ntb code seems to be doing MMIO touches, it feels like part of a PCIe driver?

> > IDXD is not so much making device "slices", but virtualizing and
> > sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> > described and the VFIO driver is simply allocating queues from a PCI
> > device for specific usages and assigning them interrupts.
>
> Right.
>
> But what is the representation for that resulting device? Some VFIO
> specific homebrewn muck or something which is based on struct device?

Why is there a device? A queue is a queue, not a device. If the task is to make some struct device (eg mdev, or cdev, or whatever) then queues may be allocated to do this, but the queue is logically a resource handed out by the PCIe driver and there should not be a requirement to have an external struct device just to create a queue.

> Right now with VF passthrough, I can see the interrupts which are
> associated to the device.
>
> How is that going to be with something which is just made up? Does that
> expose it's own msi properties then somewhere hidden in the VFIO layer?

For sysfs, I think all interrupts should be on the PCI directory.
> > There is already a char dev interface that equally allocates queues
> > from the same IDXD device, why shouldn't it be able to access IMS
> > interrupt pools too?
>
> Why wouldn't it be able to do so?

The only 'struct device' there is a cdev and I really don't think cdevs should have interrupts. It is a bit hacky as an in-kernel thing and downright wrong as a sysfs ABI.

> The VFIO driver does not own the irq chip ever. The irq chip is of
> course part of the underlying infrastructure. I never asked for that.

That isn't quite what I meant.. I meant the PCIe driver cannot create the domain or make use of the irq_chip until the VFIO layer comes along and provides the struct device. To me this is backwards layering, the interrupts come from the PCIe layer and should exist independently from VFIO.

> When it allocates a slice for whatever usage then it also
> allocates the IMS interrupts (though the VFIO people want to
> have only one and do the allocations later on demand).
>
> That allocation cannot be part of the PCI/MSIx interrupt
> domain as we already agreed on.

Yes, it is just an open question of where the new irq_domain needs to reside

> 1) Storage
>
>    A) Having "subdevices" solves the storage problem nicely and
>       makes everything just fall in place. Even for a purely
>       physical multiqueue device one can argue that each queue is a
>       "subdevice" of the physical device. The fact that we lump them
>       all together today is not an argument against that.

I don't like the idea that queue is a device, that is trying to force a struct device centric world onto a queue which doesn't really want it..

>    B) Requires extra storage in the PCIe device and extra storage
>       per subdevice, queue to keep track of the interrupts which
>       are associated to it.

Yes

> 2) Exposure of VFIO interrupts via sysfs
>
>    A) Just works

I would say this is flawed, in sysfs I expect all the interrupts for the PCIe device to be in the PCIe sysfs, not strewn over subsystem owned sub-directories.
For instance, today in mlx5, when a subdevice allocates a queue for a slice (which is modeled as an aux device) the queue's assigned MSI-X interrupt shows up on the PCIe sysfs, not the aux. It should be uniform, if I assign a queue a legacy INT, MSI or an IMS it should show in sysfs in the same way. Leaking this kernel implementation detail as sysfs ABI does not seem good. > 3) On demand expansion of the vectors for VFIO > >A) Just works because the device has an irqdomain assigned. > >B) Requires extra indirections to do that Yes. > 4) PASID > >While an Intel IDXD specific issue, it want's to be solved >without any nasty hacks. > >A) Having a "subdevice" allows to associate the PASID with the > underlying struct device which makes IOMMU integration trivial > >B) Needs some other custom hackery to get that solved Yes > > Any possibility that the 'IMS' xarray could
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
>> Looking at the device slices as subdevices with their own struct device
>> makes a lot of sense from the conceptual level.
>
> Except IMS is not just for subdevices, it should be usable for any
> driver in any case as a general interrupt mechanism, as you alluded to
> below about ethernet queues. ntb seems to be the current example of
> this need..

But NTB is operating through an abstraction layer and is not a direct PCIe device driver.

> IDXD is not so much making device "slices", but virtualizing and
> sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> described and the VFIO driver is simply allocating queues from a PCI
> device for specific usages and assigning them interrupts.

Right.

But what is the representation for that resulting device? Some VFIO specific homebrewn muck or something which is based on struct device?

Right now with VF passthrough, I can see the interrupts which are associated to the device.

How is that going to be with something which is just made up? Does that expose it's own msi properties then somewhere hidden in the VFIO layer?

See below.

> There is already a char dev interface that equally allocates queues
> from the same IDXD device, why shouldn't it be able to access IMS
> interrupt pools too?

Why wouldn't it be able to do so?

> IMHO a well designed IDXD driver should put all the PCI MMIO, queue
> management, interrupts, etc in the PCI driver layer, and the VFIO
> driver layer should only deal with the MMIO trapping and VFIO
> interfacing.
>
> From this view it is conceptually wrong to have the VFIO driver
> directly talking to MMIO registers in the PCI device or owning the
> irq_chip.

The VFIO driver does not own the irq chip ever. The irq chip is of course part of the underlying infrastructure. I never asked for that.
PCIe driver

   Owns the PCI/MSI[x] interrupts for the control block

   Has a control mechanism which handles queues or whatever the
   device is partitioned into, that's what I called slice.

   The irqdomain is part of that control mechanism and the irqchip
   implementation belongs to that as well. It has to because the
   message store is device specific.

   Whether the domain and chip implementation is in a separate
   drivers/irqchip/foo.c file for sharing or directly built into
   the PCIe driver itself does not matter.

   When it allocates a slice for whatever usage then it also
   allocates the IMS interrupts (though the VFIO people want to
   have only one and do the allocations later on demand).

   That allocation cannot be part of the PCI/MSIx interrupt
   domain as we already agreed on.

We have several problems to solve. Let me look at it from both points of view:

1) Storage

   A) Having "subdevices" solves the storage problem nicely and
      makes everything just fall in place. Even for a purely
      physical multiqueue device one can argue that each queue is a
      "subdevice" of the physical device. The fact that we lump them
      all together today is not an argument against that.

   B) Requires extra storage in the PCIe device and extra storage
      per subdevice, queue to keep track of the interrupts which
      are associated to it.

2) Exposure of VFIO interrupts via sysfs

   A) Just works

   B) Requires extra mechanisms to expose it

3) On demand expansion of the vectors for VFIO

   A) Just works because the device has an irqdomain assigned.

   B) Requires extra indirections to do that

4) IOMMU msi_desc::dev

   A) I think that's reasonably simple to address by having the
      relationship to the underlying PCIe device stored in struct
      device, which is not really adding any complexity to the
      IOMMU code. Quite the contrary it could be used to make the
      DMA aliasing a problem of device setup time and not a lookup
      per interrupt allocation in the IOMMU code.

   B) No change required.
4) PASID

   While an Intel IDXD specific issue, it wants to be solved
   without any nasty hacks.

   A) Having a "subdevice" allows to associate the PASID with the
      underlying struct device which makes IOMMU integration trivial

   B) Needs some other custom hackery to get that solved

So both variants come with their ups and downs. IMO A) is the right thing to do when looking at all the involved moving pieces.

> It would be very odd for the PCI driver to allocate interrupts from
> some random external struct device when it is creating queues on the
> PCI device.

No. That's not what I want.

>> and then have a store index for each allocation domain. With the
>> proposed encapsulation of the xarray handling that's definitely
>> feasible. Whether that buys much is a different question. Let me think
>> about i
[PATCH v3 2/5] iommu/virtio: Support bypass domains
The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a new flag to the ATTACH request, that creates a bypass domain. Use it to enable identity domains. When VIRTIO_IOMMU_F_BYPASS_CONFIG is not supported by the device, we currently fail attaching to an identity domain. Future patches will instead create identity mappings in this case. Reviewed-by: Kevin Tian Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 80930ce04a16..14dfee76fd19 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -71,6 +71,7 @@ struct viommu_domain { struct rb_root_cached mappings; unsigned long nr_endpoints; + boolbypass; }; struct viommu_endpoint { @@ -587,7 +588,9 @@ static struct iommu_domain *viommu_domain_alloc(unsigned type) { struct viommu_domain *vdomain; - if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA) + if (type != IOMMU_DOMAIN_UNMANAGED && + type != IOMMU_DOMAIN_DMA && + type != IOMMU_DOMAIN_IDENTITY) return NULL; vdomain = kzalloc(sizeof(*vdomain), GFP_KERNEL); @@ -630,6 +633,17 @@ static int viommu_domain_finalise(struct viommu_endpoint *vdev, vdomain->map_flags = viommu->map_flags; vdomain->viommu = viommu; + if (domain->type == IOMMU_DOMAIN_IDENTITY) { + if (!virtio_has_feature(viommu->vdev, + VIRTIO_IOMMU_F_BYPASS_CONFIG)) { + ida_free(&viommu->domain_ids, vdomain->id); + vdomain->viommu = NULL; + return -EOPNOTSUPP; + } + + vdomain->bypass = true; + } + return 0; } @@ -691,6 +705,9 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev) .domain = cpu_to_le32(vdomain->id), }; + if (vdomain->bypass) + req.flags |= cpu_to_le32(VIRTIO_IOMMU_ATTACH_F_BYPASS); + for (i = 0; i < fwspec->num_ids; i++) { req.endpoint = cpu_to_le32(fwspec->ids[i]); @@ -1132,6 +1149,7 @@ static unsigned int features[] = { VIRTIO_IOMMU_F_DOMAIN_RANGE, VIRTIO_IOMMU_F_PROBE, 
VIRTIO_IOMMU_F_MMIO, + VIRTIO_IOMMU_F_BYPASS_CONFIG, }; static struct virtio_device_id id_table[] = { -- 2.34.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v3 3/5] iommu/virtio: Sort reserved regions
To ease identity mapping support, keep the list of reserved regions sorted.

Reviewed-by: Eric Auger
Reviewed-by: Kevin Tian
Signed-off-by: Jean-Philippe Brucker
---
 drivers/iommu/virtio-iommu.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 14dfee76fd19..1b3c1f2741c6 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -423,7 +423,7 @@ static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
 	size_t size;
 	u64 start64, end64;
 	phys_addr_t start, end;
-	struct iommu_resv_region *region = NULL;
+	struct iommu_resv_region *region = NULL, *next;
 	unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
 
 	start = start64 = le64_to_cpu(mem->start);
@@ -454,7 +454,12 @@ static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
 	if (!region)
 		return -ENOMEM;
 
-	list_add(&region->list, &vdev->resv_regions);
+	/* Keep the list sorted */
+	list_for_each_entry(next, &vdev->resv_regions, list) {
+		if (next->start > region->start)
+			break;
+	}
+	list_add_tail(&region->list, &next->list);
 	return 0;
 }
-- 
2.34.0
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
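The sorted insertion above relies on a kernel list idiom worth spelling out: when list_for_each_entry() runs to completion without breaking, the iterator's list pointer ends up aliasing the list head, so list_add_tail() against it appends at the tail. A userspace sketch with a minimal list implementation (not the kernel headers):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked circular list, mirroring the kernel idiom. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

/* Insert 'n' before 'pos' -- what the kernel's list_add_tail() does. */
static void list_add_tail(struct list_head *n, struct list_head *pos)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

struct resv_region {
	struct list_head list;
	unsigned long start;
};

/* Sketch of the insertion in viommu_add_resv_mem(): find the first
 * entry with a larger start and insert before it. If no such entry
 * exists, 'pos' is the list head itself and the region lands at the
 * tail, keeping the list sorted. */
static void insert_sorted(struct resv_region *region, struct list_head *head)
{
	struct list_head *pos;

	for (pos = head->next; pos != head; pos = pos->next) {
		struct resv_region *next =
			container_of(pos, struct resv_region, list);

		if (next->start > region->start)
			break;
	}
	list_add_tail(&region->list, pos);
}
```

Inserting regions with starts 30, 10, 20 in that order yields a list traversing 10, 20, 30.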
[PATCH v3 4/5] iommu/virtio: Pass end address to viommu_add_mapping()
To support identity mappings, the virtio-iommu driver must be able to represent full 64-bit ranges internally. Pass (start, end) instead of (start, size) to viommu_add/del_mapping(). Clean comments. The one about the returned size was never true: when sweeping the whole address space the returned size will most certainly be smaller than 2^64. Reviewed-by: Eric Auger Reviewed-by: Kevin Tian Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 31 +++ 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 1b3c1f2741c6..2fa370c2659c 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -311,8 +311,8 @@ static int viommu_send_req_sync(struct viommu_dev *viommu, void *buf, * * On success, return the new mapping. Otherwise return NULL. */ -static int viommu_add_mapping(struct viommu_domain *vdomain, unsigned long iova, - phys_addr_t paddr, size_t size, u32 flags) +static int viommu_add_mapping(struct viommu_domain *vdomain, u64 iova, u64 end, + phys_addr_t paddr, u32 flags) { unsigned long irqflags; struct viommu_mapping *mapping; @@ -323,7 +323,7 @@ static int viommu_add_mapping(struct viommu_domain *vdomain, unsigned long iova, mapping->paddr = paddr; mapping->iova.start = iova; - mapping->iova.last = iova + size - 1; + mapping->iova.last = end; mapping->flags = flags; spin_lock_irqsave(&vdomain->mappings_lock, irqflags); @@ -338,26 +338,24 @@ static int viommu_add_mapping(struct viommu_domain *vdomain, unsigned long iova, * * @vdomain: the domain * @iova: start of the range - * @size: size of the range. A size of 0 corresponds to the entire address - * space. 
+ * @end: end of the range * - * On success, returns the number of unmapped bytes (>= size) + * On success, returns the number of unmapped bytes */ static size_t viommu_del_mappings(struct viommu_domain *vdomain, - unsigned long iova, size_t size) + u64 iova, u64 end) { size_t unmapped = 0; unsigned long flags; - unsigned long last = iova + size - 1; struct viommu_mapping *mapping = NULL; struct interval_tree_node *node, *next; spin_lock_irqsave(&vdomain->mappings_lock, flags); - next = interval_tree_iter_first(&vdomain->mappings, iova, last); + next = interval_tree_iter_first(&vdomain->mappings, iova, end); while (next) { node = next; mapping = container_of(node, struct viommu_mapping, iova); - next = interval_tree_iter_next(node, iova, last); + next = interval_tree_iter_next(node, iova, end); /* Trying to split a mapping? */ if (mapping->iova.start < iova) @@ -656,8 +654,8 @@ static void viommu_domain_free(struct iommu_domain *domain) { struct viommu_domain *vdomain = to_viommu_domain(domain); - /* Free all remaining mappings (size 2^64) */ - viommu_del_mappings(vdomain, 0, 0); + /* Free all remaining mappings */ + viommu_del_mappings(vdomain, 0, ULLONG_MAX); if (vdomain->viommu) ida_free(&vdomain->viommu->domain_ids, vdomain->id); @@ -742,6 +740,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova, { int ret; u32 flags; + u64 end = iova + size - 1; struct virtio_iommu_req_map map; struct viommu_domain *vdomain = to_viommu_domain(domain); @@ -752,7 +751,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova, if (flags & ~vdomain->map_flags) return -EINVAL; - ret = viommu_add_mapping(vdomain, iova, paddr, size, flags); + ret = viommu_add_mapping(vdomain, iova, end, paddr, flags); if (ret) return ret; @@ -761,7 +760,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova, .domain = cpu_to_le32(vdomain->id), .virt_start = cpu_to_le64(iova), .phys_start = cpu_to_le64(paddr), - .virt_end = cpu_to_le64(iova + 
size - 1), + .virt_end = cpu_to_le64(end), .flags = cpu_to_le32(flags), }; @@ -770,7 +769,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova, ret = viommu_send_req_sync(vdomain->viommu, &map, sizeof(map)); if (ret) - viommu_del_mappings(vdomain, iova, size); + viommu_del_mappings(vdomain, iova, end); return ret; } @@ -783,7 +782,7 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova, struct virtio_iommu_req_unmap unmap; s
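The motivation for inclusive (start, end) ranges is arithmetic: with 64-bit addresses, the size of the full address space (2^64) does not fit in a u64, and start + size - 1 wraps. A tiny standalone illustration:

```c
#include <assert.h>
#include <stdint.h>

/* With (start, size), covering the whole 64-bit space needs a special
 * "size == 0 means 2^64" convention, and start + size - 1 wraps. */
static uint64_t last_from_size(uint64_t start, uint64_t size)
{
	return start + size - 1;	/* wraps when size == 0 */
}

/* With an inclusive (start, end) pair, the whole space is simply
 * (0, UINT64_MAX) and containment checks need no special case. */
static int range_contains(uint64_t start, uint64_t end, uint64_t addr)
{
	return addr >= start && addr <= end;
}
```

This is why viommu_del_mappings(vdomain, 0, ULLONG_MAX) can replace the old "size 0 means everything" convention in viommu_domain_free().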
[PATCH v3 1/5] iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG
Add definitions for the VIRTIO_IOMMU_F_BYPASS_CONFIG, which supersedes VIRTIO_IOMMU_F_BYPASS. Reviewed-by: Kevin Tian Signed-off-by: Jean-Philippe Brucker --- include/uapi/linux/virtio_iommu.h | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h index 237e36a280cb..1ff357f0d72e 100644 --- a/include/uapi/linux/virtio_iommu.h +++ b/include/uapi/linux/virtio_iommu.h @@ -16,6 +16,7 @@ #define VIRTIO_IOMMU_F_BYPASS 3 #define VIRTIO_IOMMU_F_PROBE 4 #define VIRTIO_IOMMU_F_MMIO5 +#define VIRTIO_IOMMU_F_BYPASS_CONFIG 6 struct virtio_iommu_range_64 { __le64 start; @@ -36,6 +37,8 @@ struct virtio_iommu_config { struct virtio_iommu_range_32domain_range; /* Probe buffer size */ __le32 probe_size; + __u8bypass; + __u8reserved[3]; }; /* Request types */ @@ -66,11 +69,14 @@ struct virtio_iommu_req_tail { __u8reserved[3]; }; +#define VIRTIO_IOMMU_ATTACH_F_BYPASS (1 << 0) + struct virtio_iommu_req_attach { struct virtio_iommu_req_headhead; __le32 domain; __le32 endpoint; - __u8reserved[8]; + __le32 flags; + __u8reserved[4]; struct virtio_iommu_req_tailtail; }; -- 2.34.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v3 5/5] iommu/virtio: Support identity-mapped domains
Support identity domains for devices that do not offer the VIRTIO_IOMMU_F_BYPASS_CONFIG feature, by creating 1:1 mappings between the virtual and physical address space. Identity domains created this way still perform noticeably better than DMA domains, because they don't have the overhead of setting up and tearing down mappings at runtime. The performance difference between this and bypass is minimal in comparison. It does not matter that the physical addresses in the identity mappings do not all correspond to memory. By enabling passthrough we are trusting the device driver and the device itself to only perform DMA to suitable locations. In some cases it may even be desirable to perform DMA to MMIO regions. Reviewed-by: Eric Auger Reviewed-by: Kevin Tian Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 61 +--- 1 file changed, 57 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 2fa370c2659c..6a8a52b4297b 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -375,6 +375,55 @@ static size_t viommu_del_mappings(struct viommu_domain *vdomain, return unmapped; } +/* + * Fill the domain with identity mappings, skipping the device's reserved + * regions. 
+ */ +static int viommu_domain_map_identity(struct viommu_endpoint *vdev, + struct viommu_domain *vdomain) +{ + int ret; + struct iommu_resv_region *resv; + u64 iova = vdomain->domain.geometry.aperture_start; + u64 limit = vdomain->domain.geometry.aperture_end; + u32 flags = VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE; + unsigned long granule = 1UL << __ffs(vdomain->domain.pgsize_bitmap); + + iova = ALIGN(iova, granule); + limit = ALIGN_DOWN(limit + 1, granule) - 1; + + list_for_each_entry(resv, &vdev->resv_regions, list) { + u64 resv_start = ALIGN_DOWN(resv->start, granule); + u64 resv_end = ALIGN(resv->start + resv->length, granule) - 1; + + if (resv_end < iova || resv_start > limit) + /* No overlap */ + continue; + + if (resv_start > iova) { + ret = viommu_add_mapping(vdomain, iova, resv_start - 1, +(phys_addr_t)iova, flags); + if (ret) + goto err_unmap; + } + + if (resv_end >= limit) + return 0; + + iova = resv_end + 1; + } + + ret = viommu_add_mapping(vdomain, iova, limit, (phys_addr_t)iova, +flags); + if (ret) + goto err_unmap; + return 0; + +err_unmap: + viommu_del_mappings(vdomain, 0, iova); + return ret; +} + /* * viommu_replay_mappings - re-send MAP requests * @@ -637,14 +686,18 @@ static int viommu_domain_finalise(struct viommu_endpoint *vdev, vdomain->viommu = viommu; if (domain->type == IOMMU_DOMAIN_IDENTITY) { - if (!virtio_has_feature(viommu->vdev, - VIRTIO_IOMMU_F_BYPASS_CONFIG)) { + if (virtio_has_feature(viommu->vdev, + VIRTIO_IOMMU_F_BYPASS_CONFIG)) { + vdomain->bypass = true; + return 0; + } + + ret = viommu_domain_map_identity(vdev, vdomain); + if (ret) { ida_free(&viommu->domain_ids, vdomain->id); vdomain->viommu = NULL; return -EOPNOTSUPP; } - - vdomain->bypass = true; } return 0; -- 2.34.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
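The carving loop in viommu_domain_map_identity() can be exercised in isolation. Below is a userspace sketch of the same algorithm (inclusive ends, sorted reserved regions; helper names and the output array are illustrative, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_DOWN(x, a)	((x) & ~((uint64_t)(a) - 1))
#define ALIGN_UP(x, a)		ALIGN_DOWN((x) + (a) - 1, (a))

struct resv { uint64_t start, end; };	/* inclusive, sorted by start */
struct map { uint64_t start, end; };

/* Walk the sorted reserved regions and emit identity mappings for the
 * gaps between them, aligned to the page granule. Returns the number
 * of mappings written to 'out'. */
static int map_identity(uint64_t iova, uint64_t limit, uint64_t granule,
			const struct resv *resv, int nresv, struct map *out)
{
	int i, n = 0;

	iova = ALIGN_UP(iova, granule);
	limit = ALIGN_DOWN(limit + 1, granule) - 1;

	for (i = 0; i < nresv; i++) {
		uint64_t rs = ALIGN_DOWN(resv[i].start, granule);
		uint64_t re = ALIGN_UP(resv[i].end + 1, granule) - 1;

		if (re < iova || rs > limit)
			continue;	/* no overlap with the aperture */

		if (rs > iova) {	/* map the gap before this region */
			out[n].start = iova;
			out[n].end = rs - 1;
			n++;
		}

		if (re >= limit)	/* region swallows the rest */
			return n;

		iova = re + 1;
	}

	/* map the tail after the last reserved region */
	out[n].start = iova;
	out[n].end = limit;
	return n + 1;
}
```

For an aperture 0x0-0xFFFF with a 4 KiB granule and one reserved region 0x3000-0x3FFF, this produces two identity mappings, 0x0-0x2FFF and 0x4000-0xFFFF.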
[PATCH v3 0/5] iommu/virtio: Add identity domains
Support identity domains, allowing IOMMU protection to be enabled for only a subset of endpoints (those assigned to userspace, for example). Users may enable identity domains at compile time (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time (iommu.passthrough=1) or runtime (/sys/kernel/iommu_groups/*/type = identity).

Since v2 [1] I fixed the padding in patch 1 and a rebase error in patch 5, reported by Eric.

Patches 1-2 support identity domains using the optional VIRTIO_IOMMU_F_BYPASS_CONFIG feature, which was accepted into the spec [2]. Patches 3-5 add a fallback to identity mappings, when the feature is not supported.

QEMU patches are on my virtio-iommu/bypass branch [3], and depend on the UAPI update.

[1] https://lore.kernel.org/linux-iommu/20211123155301.1047943-1-jean-phili...@linaro.org/
[2] https://github.com/oasis-tcs/virtio-spec/issues/119
[3] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/bypass

Jean-Philippe Brucker (5):
  iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG
  iommu/virtio: Support bypass domains
  iommu/virtio: Sort reserved regions
  iommu/virtio: Pass end address to viommu_add_mapping()
  iommu/virtio: Support identity-mapped domains

 include/uapi/linux/virtio_iommu.h |   8 ++-
 drivers/iommu/virtio-iommu.c      | 113 +-
 2 files changed, 101 insertions(+), 20 deletions(-)

-- 
2.34.0
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 3:16 AM, Thomas Gleixner wrote:

Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:

On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

The real problem is where to store the MSI descriptors because the PCI device has its own real PCI/MSI-X interrupts which means it still shares the storage space.

Er.. I never realized that just looking at the patches :| That is relevant to all real "IMS" users. IDXD escaped this because it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely a messy artifact of VFIO, it should not be required to make the IRQ layers work. I don't think it makes sense that the msi_desc would point to a mdev, the iommu layer consumes the msi_desc_to_dev(), it really should point to the physical device that originates the message with a proper iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device makes a lot of sense from the conceptual level. That makes it pretty much obvious to manage the MSIs of those devices at this level like we do for any other device.

Whether mdev is the right encapsulation for these subdevices is an orthogonal problem.

I surely agree that msi_desc::dev is an interesting question, but we already have this disconnect of msi_desc::dev and DMA today due to DMA aliasing. I haven't looked at that in detail yet, but of course the alias handling is substantially different across the various IOMMU implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the same device. That network card you are talking about might end up using MSI-X for a control block and then IMS for the actual network queues when it is used as physical function device as a whole, but that's conceptually a different case.

Hi Thomas.
This is actually the IDXD usage for a mediated device passed to a guest kernel, when we plumb the passthrough of IMS to the guest rather than the previous implementation of having an MSI-X vector on the guest backed by IMS. The control block for the mediated device is emulated, and therefore an emulated MSI-X vector will be surfaced as vector 0. However the queues will be backed by IMS vectors. So we end up needing MSI-X and IMS to coexist running on the guest kernel for the same device.

DJ

I'm currently tending to partition the index space in the xarray:

  0x - 0x  PCI/MSI-X
  0x0001 - 0x0001  NTB

It is OK, with some xarray work it can be range allocating & reserving so that the msi_domain_alloc_irqs() flows can carve out chunks of the number space.. Another view is the msi_domain_alloc_irqs() flows should have their own xarrays..

Yes, I was thinking about that as well. The trivial way would be:

  struct xarray store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the proposed encapsulation of the xarray handling that's definitely feasible. Whether that buys much is a different question. Let me think about it some more.

which is feasible now with the range modifications and way simpler to do with xarray than with the linked list.

Indeed!

I'm glad you like the approach.

Thanks,
tglx
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM
From: Tianyu Lan

A Hyper-V Isolation VM requires bounce buffer support to copy data
from/to encrypted memory, so enable swiotlb force mode to use the
swiotlb bounce buffer for DMA transactions.

In an Isolation VM with AMD SEV, the bounce buffer needs to be accessed
via an extra address space which is above shared_gpa_boundary (e.g. the
39-bit address line) reported by the Hyper-V CPUID ISOLATION_CONFIG
leaf. The accessed physical address will be the original physical
address + shared_gpa_boundary. The shared_gpa_boundary in the AMD
SEV-SNP spec is called the virtual top of memory (vTOM). Memory
addresses below vTOM are automatically treated as private while memory
above vTOM is treated as shared.

Hyper-V initializes the swiotlb bounce buffer, so the default swiotlb
needs to be disabled. pci_swiotlb_detect_override() and
pci_swiotlb_detect_4gb() enable the default one. To override the
setting, hyperv_swiotlb_detect() needs to run before these detect
functions; make pci_xen_swiotlb_init() depend on hyperv_swiotlb_detect()
to keep that order.

The swiotlb bounce buffer code calls set_memory_decrypted() to mark the
bounce buffer visible to the host, and maps it into the extra address
space via memremap(). Populate the shared_gpa_boundary (vTOM) via the
swiotlb_unencrypted_base variable.

memremap() cannot be used at the early stage where
hyperv_iommu_swiotlb_init() runs, so call
swiotlb_update_mem_attributes() in hyperv_iommu_swiotlb_later_init().
Signed-off-by: Tianyu Lan
---
 arch/x86/xen/pci-swiotlb-xen.c |  3 +-
 drivers/hv/vmbus_drv.c         |  3 ++
 drivers/iommu/hyperv-iommu.c   | 56 ++
 include/linux/hyperv.h         |  8 +
 4 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 46df59aeaa06..30fd0600b008 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -4,6 +4,7 @@
 #include
 #include
+#include
 #include
 #include

@@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
 EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);

 IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
-                  NULL,
+                  hyperv_swiotlb_detect,
                   pci_xen_swiotlb_init,
                   NULL);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 392c1ac4f819..0a64ccfafb8b 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -33,6 +33,7 @@
 #include
 #include
 #include
+#include
 #include

 #include "hyperv_vmbus.h"
@@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t *type,
 	return child_device_obj;
 }

+static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
 /*
  * vmbus_device_register - Register the child device
  */
@@ -2118,6 +2120,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
 	}
 	hv_debug_add_dev_dir(child_device_obj);

+	child_device_obj->device.dma_mask = &vmbus_dma_mask;
 	return 0;

 err_kset_unregister:
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e285a220c913..dd729d49a1eb 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -13,14 +13,20 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
+#include
+#include

 #include "irq_remapping.h"

@@ -337,4 +343,54 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
 	.free = hyperv_root_irq_remapping_free,
 };

+static void __init hyperv_iommu_swiotlb_init(void)
+{
+	unsigned long hyperv_io_tlb_size;
+	void *hyperv_io_tlb_start;
+
+	/*
+	 * Allocate Hyper-V swiotlb bounce buffer at early place
+	 * to reserve large contiguous memory.
+	 */
+	hyperv_io_tlb_size = swiotlb_size_or_default();
+	hyperv_io_tlb_start = memblock_alloc(hyperv_io_tlb_size, PAGE_SIZE);
+
+	if (!hyperv_io_tlb_start)
+		pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");
+
+	swiotlb_init_with_tbl(hyperv_io_tlb_start,
+			      hyperv_io_tlb_size >> IO_TLB_SHIFT, true);
+}
+
+int __init hyperv_swiotlb_detect(void)
+{
+	if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
+		return 0;
+
+	if (!hv_is_isolation_supported())
+		return 0;
+
+	/*
+	 * Enable swiotlb force mode in Isolation VM to
+	 * use swiotlb bounce buffer for dma transaction.
+	 */
+	if (hv_isolation_type_snp())
+		swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
+	swiotlb_force = SWIOTLB_FORCE;
+	return 1;
+}
+
+static void __init hyperv_iommu_swiotlb_later_init(void)
+{
+	/*
+	 * Swiotlb bounce buffer needs to be mapped in extra address
+	 * space. Map function doesn't work in the early place and so
+	 * call swiotlb_update_mem_attributes() here.
+	 */
+
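The "original physical address + shared_gpa_boundary" rule from the commit message can be sketched in plain C. The boundary value below is a hypothetical example taken from the text (a 39-bit address line); the real value is reported by the Hyper-V CPUID ISOLATION_CONFIG leaf and reaches the kernel as ms_hyperv.shared_gpa_boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical vTOM position: bit 39, as in the commit message above. */
#define SHARED_GPA_BOUNDARY (1ULL << 39)

/*
 * Addresses below vTOM are private (encrypted); the shared alias that
 * the bounce buffer must be accessed through is the original physical
 * address plus the boundary.
 */
static inline uint64_t hv_shared_access_addr(uint64_t paddr)
{
	return paddr + SHARED_GPA_BOUNDARY;
}

static inline int hv_paddr_is_private(uint64_t paddr)
{
	return paddr < SHARED_GPA_BOUNDARY;
}
```

This is only a model of the address arithmetic, not of the hvcall/CPUID plumbing the patch relies on.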
[PATCH V3 5/5] hv_netvsc: Add Isolation VM support for netvsc driver
From: Tianyu Lan

In an Isolation VM, all memory shared with the host needs to be marked
visible to the host via hvcall. vmbus_establish_gpadl() has already done
this for the netvsc rx/tx ring buffers. The page buffers used by
vmbus_sendpacket_pagebuffer() still need to be handled. Use the DMA API
to map/unmap this memory when sending/receiving packets, and the Hyper-V
swiotlb bounce buffer DMA address will be returned. The swiotlb bounce
buffer has been marked visible to the host during boot-up.

The rx/tx ring buffers are allocated via vzalloc() and need to be mapped
into the unencrypted address space (above vTOM) before being shared with
the host and accessed. Add hv_map/unmap_memory() to map/unmap the rx/tx
ring buffers.

Signed-off-by: Tianyu Lan
---
Change since v2:
       * Add hv_map/unmap_memory() to map/unmap rx/tx ring buffer.
---
 arch/x86/hyperv/ivm.c             |  28 ++
 drivers/hv/hv_common.c            |  11 +++
 drivers/net/hyperv/hyperv_net.h   |   5 ++
 drivers/net/hyperv/netvsc.c       | 136 +-
 drivers/net/hyperv/netvsc_drv.c   |   1 +
 drivers/net/hyperv/rndis_filter.c |   2 +
 include/asm-generic/mshyperv.h    |   2 +
 include/linux/hyperv.h            |   5 ++
 8 files changed, 187 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 69c7a57f3307..9f78d8f67ea3 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -287,3 +287,31 @@ int hv_set_mem_host_visibility(unsigned long kbuffer, int pagecount, bool visibl
 	kfree(pfn_array);
 	return ret;
 }
+
+/*
+ * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
+ */
+void *hv_map_memory(void *addr, unsigned long size)
+{
+	unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,
+				      sizeof(unsigned long), GFP_KERNEL);
+	void *vaddr;
+	int i;
+
+	if (!pfns)
+		return NULL;
+
+	for (i = 0; i < size / PAGE_SIZE; i++)
+		pfns[i] = virt_to_hvpfn(addr + i * PAGE_SIZE) +
+			(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
+
+	vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
+	kfree(pfns);
+
+	return vaddr;
+}
+
+void hv_unmap_memory(void *addr)
+{
+	vunmap(addr);
+}
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 7be173a99f27..3c5cb1f70319 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -295,3 +295,14 @@ u64 __weak hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_s
 	return HV_STATUS_INVALID_PARAMETER;
 }
 EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
+
+void __weak *hv_map_memory(void *addr, unsigned long size)
+{
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(hv_map_memory);
+
+void __weak hv_unmap_memory(void *addr)
+{
+}
+EXPORT_SYMBOL_GPL(hv_unmap_memory);
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 315278a7cf88..cf69da0e296c 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -164,6 +164,7 @@ struct hv_netvsc_packet {
 	u32 total_bytes;
 	u32 send_buf_index;
 	u32 total_data_buflen;
+	struct hv_dma_range *dma_range;
 };

 #define NETVSC_HASH_KEYLEN 40
@@ -1074,6 +1075,7 @@ struct netvsc_device {
 	/* Receive buffer allocated by us but manages by NetVSP */
 	void *recv_buf;
+	void *recv_original_buf;
 	u32 recv_buf_size; /* allocated bytes */
 	struct vmbus_gpadl recv_buf_gpadl_handle;
 	u32 recv_section_cnt;
@@ -1082,6 +1084,7 @@ struct netvsc_device {
 	/* Send buffer allocated by us */
 	void *send_buf;
+	void *send_original_buf;
 	u32 send_buf_size;
 	struct vmbus_gpadl send_buf_gpadl_handle;
 	u32 send_section_cnt;
@@ -1731,4 +1734,6 @@ struct rndis_message {
 #define RETRY_US_HI1
 #define RETRY_MAX 2000/* >10 sec */

+void netvsc_dma_unmap(struct hv_device *hv_dev,
+		      struct hv_netvsc_packet *packet);
 #endif /* _HYPERV_NET_H */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 396bc1c204e6..b7ade735a806 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
 	int i;

 	kfree(nvdev->extension);
-	vfree(nvdev->recv_buf);
-	vfree(nvdev->send_buf);
+
+	if (nvdev->recv_original_buf) {
+		hv_unmap_memory(nvdev->recv_buf);
+		vfree(nvdev->recv_original_buf);
+	} else {
+		vfree(nvdev->recv_buf);
+	}
+
+	if (nvdev->send_original_buf) {
+		hv_unmap_memory(nvdev->send_buf);
+		vfree(nvdev->send_original_buf);
+	} else {
+		vfree(nvdev->send_buf);
+	}
+
 	kfree(nvdev->send_section_map);

 	for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
@@ -338,6 +351,7 @@ static int netvsc_init_buf(struct hv_device *device
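The PFN arithmetic inside hv_map_memory() can be checked in isolation. The sketch below assumes a 4K page size and the same hypothetical 39-bit shared_gpa_boundary as in the cover text; the real code derives the PFN with virt_to_hvpfn() and builds the mapping with vmap_pfn().

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Hypothetical boundary; the real value comes from Hyper-V CPUID. */
#define SHARED_GPA_BOUNDARY (1ULL << 39)

/*
 * Each page of the ring buffer is offset by the boundary expressed in
 * pages, so that the resulting vmap_pfn() mapping lands in the shared
 * (unencrypted) address space above vTOM.
 */
static inline uint64_t hv_shared_pfn(uint64_t page_paddr)
{
	return (page_paddr >> PAGE_SHIFT) +
	       (SHARED_GPA_BOUNDARY >> PAGE_SHIFT);
}
```

Consecutive pages stay consecutive after the shift, which is why a simple per-page loop in hv_map_memory() is enough.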
[PATCH V3 4/5] scsi: storvsc: Add Isolation VM support for storvsc driver
From: Tianyu Lan

In an Isolation VM, all memory shared with the host needs to be marked
visible to the host via hvcall. vmbus_establish_gpadl() has already done
this for the storvsc rx/tx ring buffers. The page buffer used by
vmbus_sendpacket_mpb_desc() still needs to be handled. Use the DMA API
(scsi_dma_map/unmap) to map this memory when sending/receiving packets
and return the swiotlb bounce buffer DMA address. In an Isolation VM,
the swiotlb bounce buffer is marked visible to the host and swiotlb
force mode is enabled.

Set the device's dma min align mask to HV_HYP_PAGE_SIZE - 1 in order to
keep the original data offset in the bounce buffer.

Signed-off-by: Tianyu Lan
---
 drivers/hv/vmbus_drv.c     |  1 +
 drivers/scsi/storvsc_drv.c | 37 +
 include/linux/hyperv.h     |  1 +
 3 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 0a64ccfafb8b..ae6ec503399a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2121,6 +2121,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
 	hv_debug_add_dev_dir(child_device_obj);

 	child_device_obj->device.dma_mask = &vmbus_dma_mask;
+	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;

 	return 0;

 err_kset_unregister:
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 20595c0ba0ae..ae293600d799 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -21,6 +21,8 @@
 #include
 #include
 #include
+#include
+
 #include
 #include
 #include
@@ -1336,6 +1338,7 @@ static void storvsc_on_channel_callback(void *context)
 				continue;
 			}
 			request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
+			scsi_dma_unmap(scmnd);
 		}

 		storvsc_on_receive(stor_device, packet, request);
@@ -1749,7 +1752,6 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
 	struct hv_host_device *host_dev = shost_priv(host);
 	struct hv_device *dev = host_dev->dev;
 	struct storvsc_cmd_request *cmd_request = scsi_cmd_priv(scmnd);
-	int i;
 	struct scatterlist *sgl;
 	unsigned int sg_count;
 	struct vmscsi_request *vm_srb;
@@ -1831,10 +1833,11 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
 	payload_sz = sizeof(cmd_request->mpb);

 	if (sg_count) {
-		unsigned int hvpgoff, hvpfns_to_add;
 		unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
 		unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
-		u64 hvpfn;
+		struct scatterlist *sg;
+		unsigned long hvpfn, hvpfns_to_add;
+		int j, i = 0;

 		if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {

@@ -1848,21 +1851,22 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
 		payload->range.len = length;
 		payload->range.offset = offset_in_hvpg;

+		sg_count = scsi_dma_map(scmnd);
+		if (sg_count < 0)
+			return SCSI_MLQUEUE_DEVICE_BUSY;

-		for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
+		for_each_sg(sgl, sg, sg_count, j) {
 			/*
-			 * Init values for the current sgl entry. hvpgoff
-			 * and hvpfns_to_add are in units of Hyper-V size
-			 * pages. Handling the PAGE_SIZE != HV_HYP_PAGE_SIZE
-			 * case also handles values of sgl->offset that are
-			 * larger than PAGE_SIZE. Such offsets are handled
-			 * even on other than the first sgl entry, provided
-			 * they are a multiple of PAGE_SIZE.
+			 * Init values for the current sgl entry. hvpfns_to_add
+			 * is in units of Hyper-V size pages. Handling the
+			 * PAGE_SIZE != HV_HYP_PAGE_SIZE case also handles
+			 * values of sgl->offset that are larger than PAGE_SIZE.
+			 * Such offsets are handled even on other than the first
+			 * sgl entry, provided they are a multiple of PAGE_SIZE.
 			 */
-			hvpgoff = HVPFN_DOWN(sgl->offset);
-			hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff;
-			hvpfns_to_add = HVPFN_UP(sgl->offset + sgl->length) -
-					hvpgoff;
+			hvpfn = HVPFN_DOWN(sg_dma_address(sg));
+			hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
+						 sg_dma_len(sg)) - hvpfn;
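The rewritten loop derives Hyper-V page frame numbers from the DMA-mapped address of each sg entry. The page-count arithmetic it relies on can be checked standalone (HV_HYP_PAGE_SIZE assumed to be 4K, as on x86; the macros mirror the kernel's HVPFN_UP/HVPFN_DOWN):

```c
#include <assert.h>
#include <stdint.h>

#define HV_HYP_PAGE_SHIFT 12
#define HV_HYP_PAGE_SIZE  (1ULL << HV_HYP_PAGE_SHIFT)
#define HVPFN_DOWN(x) ((x) >> HV_HYP_PAGE_SHIFT)
#define HVPFN_UP(x)   (((x) + HV_HYP_PAGE_SIZE - 1) >> HV_HYP_PAGE_SHIFT)

/*
 * Number of Hyper-V size pages spanned by one DMA-mapped sg entry:
 * mirrors "HVPFN_UP(sg_dma_address(sg) + sg_dma_len(sg)) - hvpfn"
 * from the patch above.
 */
static inline uint64_t hv_hvpfns_to_add(uint64_t dma_addr, uint64_t dma_len)
{
	return HVPFN_UP(dma_addr + dma_len) - HVPFN_DOWN(dma_addr);
}
```

Note how an unaligned start address can push a buffer across an extra page boundary, which is exactly why the driver also sets the dma min align mask.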
[PATCH V3 2/5] x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()
From: Tianyu Lan

Hyper-V provides Isolation VMs which have memory encryption support.
Add hyperv_cc_platform_has() and return true for the
GUEST_MEM_ENCRYPT attribute check.

Signed-off-by: Tianyu Lan
---
 arch/x86/kernel/cc_platform.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 03bb2f343ddb..f3bb0431f5c5 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -11,6 +11,7 @@
 #include
 #include
+#include
 #include

 static bool __maybe_unused intel_cc_platform_has(enum cc_attr attr)
@@ -58,9 +59,23 @@ static bool amd_cc_platform_has(enum cc_attr attr)
 #endif
 }

+static bool hyperv_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_HYPERV
+	if (attr == CC_ATTR_GUEST_MEM_ENCRYPT)
+		return true;
+	else
+		return false;
+#else
+	return false;
+#endif
+}

 bool cc_platform_has(enum cc_attr attr)
 {
+	if (hv_is_isolation_supported())
+		return hyperv_cc_platform_has(attr);
+
 	if (sme_me_mask)
 		return amd_cc_platform_has(attr);
-- 
2.25.1
[PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
From: Tianyu Lan

In an Isolation VM with AMD SEV, the bounce buffer needs to be accessed
via an extra address space which is above shared_gpa_boundary (e.g. the
39-bit address line) reported by the Hyper-V CPUID ISOLATION_CONFIG
leaf. The accessed physical address will be the original physical
address + shared_gpa_boundary. The shared_gpa_boundary in the AMD
SEV-SNP spec is called the virtual top of memory (vTOM). Memory
addresses below vTOM are automatically treated as private while memory
above vTOM is treated as shared.

Expose swiotlb_unencrypted_base for platforms to set the unencrypted
memory base offset, and have the platform call
swiotlb_update_mem_attributes() to remap the swiotlb memory to the
unencrypted address space. memremap() cannot be called at the early
stage, so put the remapping code into swiotlb_update_mem_attributes().
Store the remap address and use it to copy data from/to the swiotlb
bounce buffer.

Signed-off-by: Tianyu Lan
---
Change since v2:
       * Leave mem->vaddr with phys_to_virt(mem->start) when fail to
         remap swiotlb memory.

Change since v1:
       * Rework comment in the swiotlb_init_io_tlb_mem()
       * Make swiotlb_init_io_tlb_mem() back to return void.
---
 include/linux/swiotlb.h |  6 ++
 kernel/dma/swiotlb.c    | 47 -
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 569272871375..f6c3638255d5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
  * @end:	The end address of the swiotlb memory pool. Used to do a quick
  *		range check to see if the memory was in fact allocated by this
  *		API.
+ * @vaddr:	The vaddr of the swiotlb memory pool. The swiotlb memory pool
+ *		may be remapped in the memory encrypted case and store virtual
+ *		address for bounce buffer operation.
  * @nslabs:	The number of IO TLB blocks (in groups of 64) between @start and
  *		@end. For default swiotlb, this is command line adjustable via
  *		setup_io_tlb_npages.
@@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
 struct io_tlb_mem {
 	phys_addr_t start;
 	phys_addr_t end;
+	void *vaddr;
 	unsigned long nslabs;
 	unsigned long used;
 	unsigned int index;
@@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
 }
 #endif /* CONFIG_DMA_RESTRICTED_POOL */

+extern phys_addr_t swiotlb_unencrypted_base;
+
 #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c..adb9d06af5c8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -50,6 +50,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;

 struct io_tlb_mem io_tlb_default_mem;

+phys_addr_t swiotlb_unencrypted_base;
+
 /*
  * Max segment that we can provide which (if pages are contingous) will
  * not be bounced (unless SWIOTLB_FORCE is set).
@@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
 	return DIV_ROUND_UP(val, IO_TLB_SIZE);
 }

+/*
+ * Remap swioltb memory in the unencrypted physical address space
+ * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
+ * Isolation VMs).
+ */
+void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
+{
+	void *vaddr = NULL;
+
+	if (swiotlb_unencrypted_base) {
+		phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
+
+		vaddr = memremap(paddr, bytes, MEMREMAP_WB);
+		if (!vaddr)
+			pr_err("Failed to map the unencrypted memory %llx size %lx.\n",
+			       paddr, bytes);
+	}
+
+	return vaddr;
+}
+
 /*
  * Early SWIOTLB allocation may be too early to allow an architecture to
  * perform the desired operations. This function allows the architecture to
@@ -172,7 +196,12 @@ void __init swiotlb_update_mem_attributes(void)
 	vaddr = phys_to_virt(mem->start);
 	bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
 	set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
-	memset(vaddr, 0, bytes);
+
+	mem->vaddr = swiotlb_mem_remap(mem, bytes);
+	if (!mem->vaddr)
+		mem->vaddr = vaddr;
+
+	memset(mem->vaddr, 0, bytes);
 }

 static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
@@ -196,7 +225,18 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
 	}
+
+	/*
+	 * If swiotlb_unencrypted_base is set, the bounce buffer memory will
+	 * be remapped and cleared in swiotlb_update_mem_attributes.
+	 */
+	if (swiotlb_unencrypted_base)
+		return;
+
+	set_memory_decrypted(
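A small userspace model of what storing the remap address buys: once mem->vaddr points at the remapped (shared) alias, bounce copies compute an offset against mem->start and go through that base, instead of calling phys_to_virt() on the slot address. Field names mirror the patch; this is a sketch of the idea only, not the kernel code path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for the fields of struct io_tlb_mem used here. */
struct io_tlb_mem_model {
	uint64_t start;  /* physical base of the bounce pool */
	char *vaddr;     /* remapped (or fallback direct-map) base */
};

/* Offset of a bounce slot inside the pool. */
static inline size_t tlb_offset(const struct io_tlb_mem_model *mem,
				uint64_t tlb_addr)
{
	return (size_t)(tlb_addr - mem->start);
}

/* Copy toward the device through the remapped base. */
static void bounce_to_device(struct io_tlb_mem_model *mem, uint64_t tlb_addr,
			     const void *src, size_t len)
{
	memcpy(mem->vaddr + tlb_offset(mem, tlb_addr), src, len);
}
```

When memremap() fails, the patch falls back to mem->vaddr = phys_to_virt(mem->start), so this offset-based copy works in both cases.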
[PATCH V3 0/5] x86/Hyper-V: Add Hyper-V Isolation VM support(Second part)
From: Tianyu Lan

Hyper-V provides two kinds of Isolation VMs: VBS (Virtualization-Based
Security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset
adds support for these Isolation VMs in Linux.

The memory of these VMs is encrypted and the host can't access guest
memory directly. Hyper-V provides a new host-visibility hvcall, and the
guest needs to call this hvcall to mark memory visible to the host
before sharing it with the host. For security, no network/storage stack
memory should be shared with the host, hence the bounce buffer
requirement. The vmbus channel ring buffer already plays the bounce
buffer role because all data from/to the host needs to be copied between
the ring buffer and IO stack memory. So mark the vmbus channel ring
buffer visible.

For an SNP Isolation VM, the guest needs to access the shared memory via
an extra address space which is specified by the Hyper-V CPUID
HYPERV_CPUID_ISOLATION_CONFIG leaf. The access physical address of the
shared memory should be the bounce buffer memory GPA plus the
shared_gpa_boundary reported by CPUID.

This patchset enables the swiotlb bounce buffer for netvsc/storvsc in
Isolation VMs. This version follows Michael Kelley's suggestion in the
following link:

https://lkml.org/lkml/2021/11/24/2044

Change since v2:
       * Remove Hyper-V dma ops and dma_alloc/free_noncontiguous. Add
         hv_map/unmap_memory() to map/unmap netvsc rx/tx ring into extra
         address space.
       * Leave mem->vaddr in swiotlb code with phys_to_virt(mem->start)
         when fail to remap swiotlb memory.

Change since v1:
       * Add Hyper-V Isolation support check in the cc_platform_has()
         and return true for guest memory encrypt attr.
       * Remove hv isolation check in the sev_setup_arch()

Tianyu Lan (5):
  Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
  x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()
  hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM
  scsi: storvsc: Add Isolation VM support for storvsc driver
  hv_netvsc: Add Isolation VM support for netvsc driver

 arch/x86/hyperv/ivm.c             |  28 ++
 arch/x86/kernel/cc_platform.c     |  15 
 arch/x86/xen/pci-swiotlb-xen.c    |   3 +-
 drivers/hv/hv_common.c            |  11 +++
 drivers/hv/vmbus_drv.c            |   4 +
 drivers/iommu/hyperv-iommu.c      |  56 
 drivers/net/hyperv/hyperv_net.h   |   5 ++
 drivers/net/hyperv/netvsc.c       | 136 +-
 drivers/net/hyperv/netvsc_drv.c   |   1 +
 drivers/net/hyperv/rndis_filter.c |   2 +
 drivers/scsi/storvsc_drv.c        |  37 
 include/asm-generic/mshyperv.h    |   2 +
 include/linux/hyperv.h            |  14 +++
 include/linux/swiotlb.h           |   6 ++
 kernel/dma/swiotlb.c              |  47 +-
 15 files changed, 342 insertions(+), 25 deletions(-)

-- 
2.25.1
[PATCH 2/2] arm64: tegra: Add ISO SMMU controller for Tegra194
The display controllers are attached to a separate ARM SMMU instance that is dedicated to servicing isochronous memory clients. Add this ISO instance of the ARM SMMU to device tree. Please note that the display controllers are not hooked up to this SMMU yet, because we are still missing a means to transition framebuffers used by the bootloader to the kernel. This based upon an initial patch by Thierry Reding . Signed-off-by: Jon Hunter --- arch/arm64/boot/dts/nvidia/tegra194.dtsi | 76 1 file changed, 76 insertions(+) diff --git a/arch/arm64/boot/dts/nvidia/tegra194.dtsi b/arch/arm64/boot/dts/nvidia/tegra194.dtsi index d78c9ed08c47..496e31b5c637 100644 --- a/arch/arm64/boot/dts/nvidia/tegra194.dtsi +++ b/arch/arm64/boot/dts/nvidia/tegra194.dtsi @@ -1474,6 +1474,82 @@ pmc: pmc@c36 { interrupt-controller; }; + iommu@1000 { + compatible = "nvidia,tegra194-smmu", "nvidia,smmu-500"; + reg = <0x1000 0x80>; + interrupts = , +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +, +; + stream-match-mask = <0x7f80>; + #global-interrupts = <1>; + #iommu-cells = <1>; + + nvidia,memory-controller = <&mc>; + status = "okay"; + }; + smmu: iommu@1200 { compatible = "nvidia,tegra194-smmu", "nvidia,smmu-500"; reg = <0x1200 0x80>, -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 1/2] dt-bindings: arm-smmu: Fix json-schema for Tegra
The dt_binding_check currently issues the following warnings for the
Tegra186 and Tegra194 SMMUs ...

 arch/arm64/boot/dts/nvidia/tegra186-p2771-.dt.yaml: iommu@1200:
 'nvidia,memory-controller' does not match any of the regexes: 'pinctrl-[0-9]+'
 From schema: Documentation/devicetree/bindings/iommu/arm,smmu.yaml
 DTC arch/arm64/boot/dts/nvidia/tegra186-p3509-+p3636-0001.dt.yaml
 CHECK arch/arm64/boot/dts/nvidia/tegra186-p3509-+p3636-0001.dt.yaml

Add the missing 'nvidia,memory-controller' property to fix the above
warning.

Signed-off-by: Jon Hunter
---
 Documentation/devicetree/bindings/iommu/arm,smmu.yaml | 4 
 1 file changed, 4 insertions(+)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
index f66a3effba73..5dce07b12cd7 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
@@ -155,6 +155,10 @@ properties:
   power-domains:
     maxItems: 1

+  nvidia,memory-controller:
+    $ref: /schemas/types.yaml#/definitions/phandle
+    description: phandle of the memory controller node
+
 required:
   - compatible
   - reg
-- 
2.25.1
Re: [PATCH 2/2] iommu: arm-smmu-impl: Add SM8450 qcom iommu implementation
On 01.12.2021 08:39, Vinod Koul wrote:
> Add SM8450 qcom iommu implementation to the table of
> qcom_smmu_impl_of_match table which brings in iommu support for
> SM8450 SoC
>
> Signed-off-by: Vinod Koul
> Tested-by: Dmitry Baryshkov
> ---

With deep pain, as we've had to deal with this for a few generations
now..

Acked-by: Konrad Dybcio

Konrad
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
> Looking at the device slices as subdevices with their own struct device
> makes a lot of sense from the conceptual level.

Except IMS is not just for subdevices, it should be usable for any
driver in any case as a general interrupt mechanism, as you alluded to
below about ethernet queues. ntb seems to be the current example of this
need..

If it works properly on the physical PCI dev there is no reason to try
to also make it work on the mdev and add complexity in iommu land.

> Though I fear there is also a use case for MSI-X and IMS tied to the
> same device. That network card you are talking about might end up using
> MSI-X for a control block and then IMS for the actual network queues
> when it is used as physical function device as a whole, but that's
> conceptually a different case.

Is it? Why?

IDXD is not so much making device "slices", but virtualizing and sharing
a PCI device. The IDXD hardware is multi-queue like the NIC I described,
and the VFIO driver is simply allocating queues from a PCI device for
specific usages and assigning them interrupts.

There is already a char dev interface that equally allocates queues from
the same IDXD device, why shouldn't it be able to access IMS interrupt
pools too?

IMHO a well designed IDXD driver should put all the PCI MMIO, queue
management, interrupts, etc in the PCI driver layer, and the VFIO driver
layer should only deal with the MMIO trapping and VFIO interfacing.

From this view it is conceptually wrong to have the VFIO driver directly
talking to MMIO registers in the PCI device or owning the irq_chip.

It would be very odd for the PCI driver to allocate interrupts from some
random external struct device when it is creating queues on the PCI
device.

> > Another view is the msi_domain_alloc_irqs() flows should have their
> > own xarrays..
>
> Yes, I was thinking about that as well.
> The trivial way would be:
>
>     struct xarray store[MSI_MAX_STORES];
>
> and then have a store index for each allocation domain. With the
> proposed encapsulation of the xarray handling that's definitely
> feasible. Whether that buys much is a different question. Let me think
> about it some more.

Any possibility that the 'IMS' xarray could be outside the struct
device?

Thanks,
Jason
Re: [PATCH 2/2] iommu: arm-smmu-impl: Add SM8450 qcom iommu implementation
On 01/12/2021 10:39, Vinod Koul wrote:
> Add SM8450 qcom iommu implementation to the table of
> qcom_smmu_impl_of_match table which brings in iommu support for
> SM8450 SoC
>
> Signed-off-by: Vinod Koul
> Tested-by: Dmitry Baryshkov
> ---
>  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index ca736b065dd0..4aee83330629 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -415,6 +415,7 @@ static const struct of_device_id __maybe_unused qcom_smmu_impl_of_match[] = {
>  	{ .compatible = "qcom,sm8150-smmu-500" },
>  	{ .compatible = "qcom,sm8250-smmu-500" },
>  	{ .compatible = "qcom,sm8350-smmu-500" },
> +	{ .compatible = "qcom,sm8450-smmu-500" },
>  	{ }
>  };

-- 
With best wishes
Dmitry
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>> The real problem is where to store the MSI descriptors because the PCI
>> device has its own real PCI/MSI-X interrupts which means it still shares
>> the storage space.
>
> Er.. I never realized that just looking at the patches :|
>
> That is relevant to all real "IMS" users. IDXD escaped this because
> it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
> a messy artifact of VFIO, it should not be required to make the IRQ
> layers work.
>
> I don't think it makes sense that the msi_desc would point to a mdev,
> the iommu layer consumes the msi_desc_to_dev(), it really should point
> to the physical device that originates the message with a proper
> iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device
makes a lot of sense from the conceptual level. That makes it pretty
much obvious to manage the MSIs of those devices at this level like we
do for any other device.

Whether mdev is the right encapsulation for these subdevices is an
orthogonal problem. I surely agree that msi_desc::dev is an interesting
question, but we already have this disconnect of msi_desc::dev and DMA
today due to DMA aliasing. I haven't looked at that in detail yet, but
of course the alias handling is substantially different across the
various IOMMU implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as physical function device as a whole, but that's
conceptually a different case.
>> I'm currently tending to partition the index space in the xarray:
>>
>> 0x - 0x  PCI/MSI-X
>> 0x0001 - 0x0001  NTB
>
> It is OK, with some xarray work it can be range allocating & reserving
> so that the msi_domain_alloc_irqs() flows can carve out chunks of the
> number space..
>
> Another view is the msi_domain_alloc_irqs() flows should have their
> own xarrays..

Yes, I was thinking about that as well. The trivial way would be:

    struct xarray store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the
proposed encapsulation of the xarray handling that's definitely
feasible. Whether that buys much is a different question. Let me think
about it some more.

>> which is feasible now with the range modifications and way simpler to
>> do with xarray than with the linked list.
>
> Indeed!

I'm glad you like the approach.

Thanks,

        tglx
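The partitioned-index idea discussed above can be sketched in a few lines of plain C, assuming 64K-entry ranges per allocation domain. The names and the exact split are illustrative assumptions for the sketch, not the eventual kernel API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: one per-device MSI descriptor index space, partitioned so
 * that each allocation domain (PCI/MSI-X, NTB, ...) owns a contiguous
 * 64K-entry range inside a single xarray-like store.
 */
#define MSI_RANGE_SHIFT 16
enum msi_range { MSI_RANGE_PCI_MSIX = 0, MSI_RANGE_NTB = 1 };

/* Compose a store index from a domain and a per-domain index. */
static inline uint32_t msi_store_index(enum msi_range domain, uint16_t index)
{
	return ((uint32_t)domain << MSI_RANGE_SHIFT) | index;
}

/* Recover the owning domain from a store index. */
static inline enum msi_range msi_store_domain(uint32_t store_index)
{
	return (enum msi_range)(store_index >> MSI_RANGE_SHIFT);
}
```

The alternative floated in the thread, a `struct xarray store[MSI_MAX_STORES]` with one xarray per domain, makes the domain the array subscript instead of the high bits of the index; the two layouts carry the same information.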