Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On Wed, Apr 22, 2020 at 4:49 PM Alexey Kardashevskiy wrote:
>
> 32bit MMIO is what puzzles me in this picture, how does it work?

For devices with no m64 we allocate a PE number as described above. In
the 32bit MMIO window we have a segment-to-PE remapping table, so any
m32 segment can be assigned to any PE. As a result the slave PE concept
isn't really needed. If the BARs of a device span multiple m32 segments
then we can set up the remapping table so that all the segments point
to the same PE.

> > I was thinking we should try to minimise the number of DMA-only PEs
> > since it complicates the EEH freeze handling. When MMIO and DMA are
> > mapped to the same PE an error on either will cause the hardware to
> > stop both. When separate PEs are used for DMA and MMIO you lose that
> > atomicity. It's not a big deal if DMA is stopped and MMIO allowed
> > since PAPR (sort-of) allows that, but having MMIO frozen with DMA
> > unfrozen is a bit sketchy.
>
> You suggested using slave PEs for crippled functions - won't we have
> the same problem then?

Yes, but I think it's probably worth doing in that case. You get
slightly janky EEH in exchange for better DMA performance.

> And is this "slave PE" something the hardware supports or is it a
> software concept?

It's all in software. The hardware does have the PELT-V, which allows
you to specify a group of PEs to additionally freeze when a PE is
frozen, but the PELT-V is only used when handling AER messages. All
other error sources (DMAs, MMIOs, etc.) will only freeze one PE (or all
of them in very rare cases).

> > There have been no official FW releases with a skiboot that supports
> > the phb get/set option opal calls so the only systems that can
> > actually take advantage of it are our lab systems. It might still be
> > useful for future systems, but I'd rather something that doesn't
> > depend on FW support.
>
> Pensando folks use it ;)

the what folks?

Oliver
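[As an illustration of the segment-to-PE remapping table described
above, here is a minimal C sketch. The table size and all names
(m32_seg_to_pe, m32_assign_segments) are assumptions for illustration,
not the real PHB registers or kernel structures.]

    /*
     * Every 32-bit MMIO segment has an entry that can point at any PE,
     * so a device whose BARs span several m32 segments can still live
     * in a single PE.
     */
    #include <stdint.h>

    #define M32_SEGMENTS 256

    static uint8_t m32_seg_to_pe[M32_SEGMENTS];  /* segment -> PE number */

    /* Point every m32 segment covered by a device's BARs at one PE. */
    static void m32_assign_segments(unsigned int first, unsigned int last,
                                    uint8_t pe)
    {
            for (unsigned int seg = first; seg <= last && seg < M32_SEGMENTS; seg++)
                    m32_seg_to_pe[seg] = pe;
    }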
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On 21/04/2020 16:35, Oliver O'Halloran wrote:
> On Tue, Apr 21, 2020 at 3:11 PM Alexey Kardashevskiy wrote:
>>
>> One example of a problem device is an AMD GPU with a 64bit video PCI
>> function and a 32bit audio function, no?
>>
>> What PEs will they get assigned to now? Where will audio's MMIO go?
>> It cannot be the same 64bit MMIO segment, right? If so, it is a
>> separate PE already. If not, then I do not understand "we're free to
>> assign whatever PE number we want".
>
> The BARs stay in the same place and as far as MMIO is concerned
> nothing has changed. For MMIO the PHB uses the MMIO address to find a
> PE via the M64 BAR table, but for DMA it uses a *completely* different
> mechanism. Instead it takes the BDFN (included in the DMA packet
> header) and the Requester Translation Table (RTT) to map the BDFN to
> a PE. Normally you would configure the PHB so the same PE is used for
> MMIO and DMA, but you don't have to.

32bit MMIO is what puzzles me in this picture, how does it work?

>>> I think the key thing to realise is that we'd only be using the
>>> DMA-PE when a crippled DMA mask is set by the driver. In all other
>>> cases we can just use the "native PE" and when the driver unbinds we
>>> can de-allocate our DMA-PE and return the device to the PE
>>> containing its MMIO BARs. I think we can keep things relatively sane
>>> that way and the real issue is detecting EEH events on the DMA-PE.
>>
>> Oooor we could just have 1 DMA window (or, more precisely, a single
>> "TVE" as it is either window or bypass) per PE and give every
>> function its own PE and create a window or a table when a device sets
>> a DMA mask. I feel I am missing something here though.
>
> Yes, we could do that, but do we want to?
>
> I was thinking we should try to minimise the number of DMA-only PEs
> since it complicates the EEH freeze handling. When MMIO and DMA are
> mapped to the same PE an error on either will cause the hardware to
> stop both. When separate PEs are used for DMA and MMIO you lose that
> atomicity. It's not a big deal if DMA is stopped and MMIO allowed
> since PAPR (sort-of) allows that, but having MMIO frozen with DMA
> unfrozen is a bit sketchy.

You suggested using slave PEs for crippled functions - won't we have
the same problem then?

And is this "slave PE" something the hardware supports or is it a
software concept?

For the time being, this patchset is good for:
1. weird hardware which has limited DMA mask (this is why the patchset
was written in the first place)
2. debug DMA by routing it via IOMMU (even when the 4GB hack is not
enabled).

>>> Sure, but it's still dependent on having firmware which supports
>>> the 4GB hack and I don't think that's in any official firmware
>>> releases yet.
>>
>> It's been a while :-/
>
> There have been no official FW releases with a skiboot that supports
> the phb get/set option opal calls so the only systems that can
> actually take advantage of it are our lab systems. It might still be
> useful for future systems, but I'd rather something that doesn't
> depend on FW support.

Pensando folks use it ;)

-- 
Alexey
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On Tue, Apr 21, 2020 at 3:11 PM Alexey Kardashevskiy wrote:
>
> One example of a problem device is an AMD GPU with a 64bit video PCI
> function and a 32bit audio function, no?
>
> What PEs will they get assigned to now? Where will audio's MMIO go? It
> cannot be the same 64bit MMIO segment, right? If so, it is a separate
> PE already. If not, then I do not understand "we're free to assign
> whatever PE number we want".

The BARs stay in the same place and as far as MMIO is concerned nothing
has changed. For MMIO the PHB uses the MMIO address to find a PE via
the M64 BAR table, but for DMA it uses a *completely* different
mechanism. Instead it takes the BDFN (included in the DMA packet
header) and the Requester Translation Table (RTT) to map the BDFN to a
PE. Normally you would configure the PHB so the same PE is used for
MMIO and DMA, but you don't have to.

> > I think the key thing to realise is that we'd only be using the
> > DMA-PE when a crippled DMA mask is set by the driver. In all other
> > cases we can just use the "native PE" and when the driver unbinds we
> > can de-allocate our DMA-PE and return the device to the PE
> > containing its MMIO BARs. I think we can keep things relatively sane
> > that way and the real issue is detecting EEH events on the DMA-PE.
>
> Oooor we could just have 1 DMA window (or, more precisely, a single
> "TVE" as it is either window or bypass) per PE and give every function
> its own PE and create a window or a table when a device sets a DMA
> mask. I feel I am missing something here though.

Yes, we could do that, but do we want to?

I was thinking we should try to minimise the number of DMA-only PEs
since it complicates the EEH freeze handling. When MMIO and DMA are
mapped to the same PE an error on either will cause the hardware to
stop both. When separate PEs are used for DMA and MMIO you lose that
atomicity. It's not a big deal if DMA is stopped and MMIO allowed since
PAPR (sort-of) allows that, but having MMIO frozen with DMA unfrozen is
a bit sketchy.

> >> For the time being, this patchset is good for:
> >> 1. weird hardware which has limited DMA mask (this is why the
> >> patchset was written in the first place)
> >> 2. debug DMA by routing it via IOMMU (even when the 4GB hack is not
> >> enabled).
> >
> > Sure, but it's still dependent on having firmware which supports the
> > 4GB hack and I don't think that's in any official firmware releases
> > yet.
>
> It's been a while :-/

There have been no official FW releases with a skiboot that supports
the phb get/set option opal calls so the only systems that can actually
take advantage of it are our lab systems. It might still be useful for
future systems, but I'd rather something that doesn't depend on FW
support.

Oliver
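[For illustration, a minimal sketch of the DMA lookup path Oliver
describes: DMA resolves a PE from the requester's BDFN via the RTT,
independently of the MMIO-side M64 BAR table. The table layout and the
names (rtt, pe_for_dma) are assumptions, not the real hardware or
kernel structures.]

    #include <stdint.h>

    #define RTT_ENTRIES (1 << 16)        /* one entry per bus/dev/fn */

    static uint16_t rtt[RTT_ENTRIES];    /* BDFN -> PE number */

    static uint16_t pe_for_dma(uint16_t bdfn)
    {
            /* The BDFN comes straight from the DMA packet header. */
            return rtt[bdfn];
    }

[Because this table is independent of the MMIO path, it is possible to
point a function's DMA at a PE that owns no MMIO at all, which is what
makes a DMA-only PE possible.]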
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On 21/04/2020 00:04, Oliver O'Halloran wrote:
> On Fri, 2020-04-17 at 15:47 +1000, Alexey Kardashevskiy wrote:
>>
>> On 17/04/2020 11:26, Russell Currey wrote:
>>>
>>> For what it's worth this sounds like a good idea to me, it just
>>> sounds tricky to implement. You're adding another layer of
>>> complexity on top of EEH (well, making things look simple to the EEH
>>> core and doing your own freezing on top of it) in addition to the
>>> DMA handling.
>>>
>>> If it works then great, it just has a high potential to become a new
>>> bug haven.
>>
>> imho putting every PCI function into a separate PE is the right thing
>> to do here but I've been told it is not that simple, and I believe
>> that. Reusing slave PEs seems unreliable - the configuration will
>> depend on whether a PE occupied enough segments to give a unique PE
>> to a PCI function and my little brain explodes.
>
> You're overthinking it.
>
> If a bus has no m64 MMIO space we're free to assign whatever PE number
> we want since the PE for the bus isn't fixed by the m64 segment its
> BARs were placed in. For those buses we assign a PE number starting
> from the max and counting down (0xff, 0xfe, etc). For example, with
> this PHB:
>
> # lspci -s 1:: -v | egrep '0001:|Memory at'
> 0001:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
> 0001:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
>         Memory at 600c08100 (32-bit, non-prefetchable) [size=256K]
> 0001:02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:03:00.0 Non-Volatile memory controller: PMC-Sierra Inc. Device
> f117 (rev 06) (prog-if 02 [NVM Express])
>         Memory at 600c08000 (64-bit, non-prefetchable) [size=16K]
>         Memory at 60040 (64-bit, prefetchable) [size=1M]
> 0001:09:00.0 Ethernet controller: Intel Corporation Ethernet
> Controller X710/X557-AT 10GBASE-T (rev 02)
>         Memory at 600404800 (64-bit, prefetchable) [size=8M]
>         Memory at 600404a00 (64-bit, prefetchable) [size=32K]
> (redundant functions removed)
>
> We get these PE assignments:
>
> 0001:00:00.0 - 0xfe # Root port
> 0001:01:00.0 - 0xfc # upstream port
> 0001:02:01.0 - 0xfd # downstream port bus
> 0001:02:08.0 - 0xfd
> 0001:02:09.0 - 0xfd
> 0001:03:00.0 - 0x0  # NVMe
> 0001:09:00.0 - 0x1  # Ethernet
>
> All the switch ports either have 32bit BARs or no BARs so they get
> assigned PEs starting from the top. The Ethernet and the NVMe have
> some prefetchable 64bit BARs so they have to be in PE 0x0 and 0x1
> respectively since that's where their m64 BARs are located. For our
> DMA-only slave PEs any MMIO space would remain in their master PE so
> we can allocate a PE number for the DMA-PE (our iommu context).

One example of a problem device is an AMD GPU with a 64bit video PCI
function and a 32bit audio function, no?

What PEs will they get assigned to now? Where will audio's MMIO go? It
cannot be the same 64bit MMIO segment, right? If so, it is a separate
PE already. If not, then I do not understand "we're free to assign
whatever PE number we want".

> I think the key thing to realise is that we'd only be using the DMA-PE
> when a crippled DMA mask is set by the driver. In all other cases we
> can just use the "native PE" and when the driver unbinds we can
> de-allocate our DMA-PE and return the device to the PE containing its
> MMIO BARs. I think we can keep things relatively sane that way and the
> real issue is detecting EEH events on the DMA-PE.

Oooor we could just have 1 DMA window (or, more precisely, a single
"TVE" as it is either window or bypass) per PE and give every function
its own PE and create a window or a table when a device sets a DMA
mask. I feel I am missing something here though.

> On P9 we don't have PHB error interrupts enabled in firmware so we're
> completely reliant on seeing a 0xFF response to an MMIO and manually
> checking the PE status to see if it's due to a PE freeze. For our
> DMA-PE it could be frozen (due to a bad DMA) and we'd never notice
> unless we explicitly check the status of the DMA-PE since there's no
> corresponding MMIO space to freeze.
>
> On P8 we had PHB Error interrupts so you would notice that *something*
> happened, then go check for frozen PEs, at which point the
> master-slave grouping logic would see that one PE in the group was
> frozen and freeze the rest of them. We can re-use that grouping logic
> here, but we still need something to actually notice a freeze
> occurred. A background poller checking for freezes on each PE might do
> the trick.
>
>> So this is not happening soon.
>
> Oh ye of little faith.
>
>> For the time being, this patchset is good for:
>> 1. weird hardware which has limited DMA mask (this is why the
>> patchset was written in the first place)
>> 2. debug DMA by routing it via IOMMU (even when the 4GB hack is not
>> enabled).
>
> Sure, but it's still dependent on having firmware which supports the
> 4GB hack and I don't think that's in any official firmware releases
> yet.

It's been a while :-/

-- 
Alexey
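[To illustrate the one-TVE-per-PE constraint Alexey mentions above -
a TVE is configured as either a translated window or bypass, not both -
here is a sketch. All names (pe_tve, pe_tve_configure) are invented
for illustration and are not the real hardware or kernel structures.]

    #include <stdint.h>

    enum tve_mode { TVE_NONE, TVE_WINDOW, TVE_BYPASS };

    struct pe_tve {
            enum tve_mode mode;
            uint64_t window_size;   /* only meaningful for TVE_WINDOW */
    };

    /* Called when a driver sets a DMA mask: pick one mode for the PE. */
    static void pe_tve_configure(struct pe_tve *tve, uint64_t dma_mask)
    {
            if (dma_mask == UINT64_MAX) {
                    tve->mode = TVE_BYPASS;          /* full 64bit device */
            } else {
                    tve->mode = TVE_WINDOW;          /* crippled mask: TCE window */
                    tve->window_size = dma_mask + 1; /* sized to what it can address */
            }
    }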
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On Fri, 2020-04-17 at 15:47 +1000, Alexey Kardashevskiy wrote:
>
> On 17/04/2020 11:26, Russell Currey wrote:
> >
> > For what it's worth this sounds like a good idea to me, it just
> > sounds tricky to implement. You're adding another layer of
> > complexity on top of EEH (well, making things look simple to the
> > EEH core and doing your own freezing on top of it) in addition to
> > the DMA handling.
> >
> > If it works then great, it just has a high potential to become a
> > new bug haven.
>
> imho putting every PCI function into a separate PE is the right thing
> to do here but I've been told it is not that simple, and I believe
> that. Reusing slave PEs seems unreliable - the configuration will
> depend on whether a PE occupied enough segments to give a unique PE
> to a PCI function and my little brain explodes.

You're overthinking it.

If a bus has no m64 MMIO space we're free to assign whatever PE number
we want since the PE for the bus isn't fixed by the m64 segment its
BARs were placed in. For those buses we assign a PE number starting
from the max and counting down (0xff, 0xfe, etc). For example, with
this PHB:

# lspci -s 1:: -v | egrep '0001:|Memory at'
0001:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
0001:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
        Memory at 600c08100 (32-bit, non-prefetchable) [size=256K]
0001:02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
0001:02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
0001:02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
0001:03:00.0 Non-Volatile memory controller: PMC-Sierra Inc. Device f117 (rev 06) (prog-if 02 [NVM Express])
        Memory at 600c08000 (64-bit, non-prefetchable) [size=16K]
        Memory at 60040 (64-bit, prefetchable) [size=1M]
0001:09:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
        Memory at 600404800 (64-bit, prefetchable) [size=8M]
        Memory at 600404a00 (64-bit, prefetchable) [size=32K]
(redundant functions removed)

We get these PE assignments:

0001:00:00.0 - 0xfe # Root port
0001:01:00.0 - 0xfc # upstream port
0001:02:01.0 - 0xfd # downstream port bus
0001:02:08.0 - 0xfd
0001:02:09.0 - 0xfd
0001:03:00.0 - 0x0  # NVMe
0001:09:00.0 - 0x1  # Ethernet

All the switch ports either have 32bit BARs or no BARs so they get
assigned PEs starting from the top. The Ethernet and the NVMe have some
prefetchable 64bit BARs so they have to be in PE 0x0 and 0x1
respectively since that's where their m64 BARs are located. For our
DMA-only slave PEs any MMIO space would remain in their master PE so we
can allocate a PE number for the DMA-PE (our iommu context).

I think the key thing to realise is that we'd only be using the DMA-PE
when a crippled DMA mask is set by the driver. In all other cases we
can just use the "native PE" and when the driver unbinds we can
de-allocate our DMA-PE and return the device to the PE containing its
MMIO BARs. I think we can keep things relatively sane that way and the
real issue is detecting EEH events on the DMA-PE.

On P9 we don't have PHB error interrupts enabled in firmware so we're
completely reliant on seeing a 0xFF response to an MMIO and manually
checking the PE status to see if it's due to a PE freeze. For our
DMA-PE it could be frozen (due to a bad DMA) and we'd never notice
unless we explicitly check the status of the DMA-PE since there's no
corresponding MMIO space to freeze.

On P8 we had PHB Error interrupts so you would notice that *something*
happened, then go check for frozen PEs, at which point the master-slave
grouping logic would see that one PE in the group was frozen and freeze
the rest of them. We can re-use that grouping logic here, but we still
need something to actually notice a freeze occurred. A background
poller checking for freezes on each PE might do the trick.

> So this is not happening soon.

Oh ye of little faith.

> For the time being, this patchset is good for:
> 1. weird hardware which has limited DMA mask (this is why the patchset
> was written in the first place)
> 2. debug DMA by routing it via IOMMU (even when the 4GB hack is not
> enabled).

Sure, but it's still dependent on having firmware which supports the
4GB hack and I don't think that's in any official firmware releases
yet.

Oliver
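[A minimal sketch of the background freeze poller floated above,
assuming hypothetical helpers: pe_is_frozen() and eeh_handle_freeze()
stand in for whatever firmware query and EEH entry point would really
be used, and the interval and PE count are arbitrary.]

    #include <linux/workqueue.h>
    #include <linux/jiffies.h>
    #include <linux/types.h>

    #define NUM_PES        256
    #define POLL_INTERVAL  (5 * HZ)

    extern bool pe_is_frozen(int pe);        /* hypothetical */
    extern void eeh_handle_freeze(int pe);   /* hypothetical */

    static struct delayed_work freeze_poller;

    /* Walk every PE on the PHB, kick EEH for any frozen one, re-arm.
     * Initialised elsewhere with INIT_DELAYED_WORK() and kicked off
     * with an initial schedule_delayed_work(). */
    static void freeze_poll_fn(struct work_struct *work)
    {
            int pe;

            for (pe = 0; pe < NUM_PES; pe++)
                    if (pe_is_frozen(pe))
                            eeh_handle_freeze(pe);

            schedule_delayed_work(&freeze_poller, POLL_INTERVAL);
    }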
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On 17/04/2020 11:26, Russell Currey wrote:
> On Thu, 2020-04-16 at 12:53 +1000, Oliver O'Halloran wrote:
>> On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran wrote:
>>> On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy
>>> <a...@ozlabs.ru> wrote:
>>>>
>>>> Anyone? Is it totally useless or a wrong approach? Thanks,
>>>
>>> I wouldn't say it's either, but I still hate it.
>>>
>>> The 4GB mode being per-PHB makes it difficult to use unless we force
>>> that mode on 100% of the time, which I'd prefer not to do. Ideally
>>> devices that actually support 64bit addressing (which is most of
>>> them) should be able to use no-translate mode when possible since
>>> a) it's faster, and b) it frees up room in the TCE cache for devices
>>> that actually need it. I know you've done some testing with 100G
>>> NICs and found the overhead was fine, but IMO that's a bad test
>>> since it's pretty much the best-case scenario: all the devices on
>>> the PHB are in the same PE. The PHB's TCE cache only hits when the
>>> TCE matches the DMA bus address and the PE number for the device, so
>>> in a multi-PE environment there's a lot of potential for TCE cache
>>> thrashing. If there were one or two PEs under that PHB it's probably
>>> not going to matter, but if you have an NVMe rack with 20 drives it
>>> starts to look a bit ugly.
>>>
>>> That all said, it might be worth doing this anyway since we probably
>>> want the software infrastructure in place to take advantage of it.
>>> Maybe expand the command line parameters to allow it to be enabled
>>> on a per-PHB basis rather than globally.
>>
>> Since we're on the topic
>>
>> I've been thinking the real issue we have is that we're trying to
>> pick an "optimal" IOMMU config at a point where we don't have enough
>> information to work out what's actually optimal. The IOMMU config is
>> done on a per-PE basis, but since PEs may contain devices with
>> different DMA masks (looking at you, weird AMD audio function) we're
>> always going to have to pick something conservative as the default
>> config for TVE#0 (64k, no bypass mapping) since the driver will tell
>> us what the device actually supports long after the IOMMU
>> configuration is done. What we really want is to be able to have
>> separate IOMMU contexts for each device, or at the very least a
>> separate context for the crippled devices.
>>
>> We could allow a per-device IOMMU context by extending the
>> Master/Slave PE thing to cover DMA in addition to MMIO. Right now we
>> only use slave PEs when a device's MMIO BARs extend over multiple m64
>> segments. When that happens an MMIO error causes the PHB to freeze
>> the PE corresponding to one of those segments, but not any of the
>> others. To present a single "PE" to the EEH core we check the freeze
>> status of each of the slave PEs when the EEH core does a PE status
>> check and if any of them are frozen, we freeze the rest of them too.
>> When a driver sets a limited DMA mask we could move that device to a
>> separate slave PE so that it has its own IOMMU context tailored to
>> its DMA addressing limits.
>>
>> Thoughts?
>
> For what it's worth this sounds like a good idea to me, it just sounds
> tricky to implement. You're adding another layer of complexity on top
> of EEH (well, making things look simple to the EEH core and doing your
> own freezing on top of it) in addition to the DMA handling.
>
> If it works then great, it just has a high potential to become a new
> bug haven.

imho putting every PCI function into a separate PE is the right thing
to do here but I've been told it is not that simple, and I believe
that. Reusing slave PEs seems unreliable - the configuration will
depend on whether a PE occupied enough segments to give a unique PE to
a PCI function and my little brain explodes.

So this is not happening soon.

For the time being, this patchset is good for:
1. weird hardware which has limited DMA mask (this is why the patchset
was written in the first place)
2. debug DMA by routing it via IOMMU (even when the 4GB hack is not
enabled).

-- 
Alexey
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On Thu, 2020-04-16 at 12:53 +1000, Oliver O'Halloran wrote:
> On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran wrote:
> > On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy
> > <a...@ozlabs.ru> wrote:
> > >
> > > Anyone? Is it totally useless or a wrong approach? Thanks,
> >
> > I wouldn't say it's either, but I still hate it.
> >
> > The 4GB mode being per-PHB makes it difficult to use unless we
> > force that mode on 100% of the time, which I'd prefer not to do.
> > Ideally devices that actually support 64bit addressing (which is
> > most of them) should be able to use no-translate mode when possible
> > since a) it's faster, and b) it frees up room in the TCE cache for
> > devices that actually need it. I know you've done some testing with
> > 100G NICs and found the overhead was fine, but IMO that's a bad
> > test since it's pretty much the best-case scenario: all the devices
> > on the PHB are in the same PE. The PHB's TCE cache only hits when
> > the TCE matches the DMA bus address and the PE number for the
> > device, so in a multi-PE environment there's a lot of potential for
> > TCE cache thrashing. If there were one or two PEs under that PHB
> > it's probably not going to matter, but if you have an NVMe rack
> > with 20 drives it starts to look a bit ugly.
> >
> > That all said, it might be worth doing this anyway since we
> > probably want the software infrastructure in place to take
> > advantage of it. Maybe expand the command line parameters to allow
> > it to be enabled on a per-PHB basis rather than globally.
>
> Since we're on the topic
>
> I've been thinking the real issue we have is that we're trying to
> pick an "optimal" IOMMU config at a point where we don't have enough
> information to work out what's actually optimal. The IOMMU config is
> done on a per-PE basis, but since PEs may contain devices with
> different DMA masks (looking at you, weird AMD audio function) we're
> always going to have to pick something conservative as the default
> config for TVE#0 (64k, no bypass mapping) since the driver will tell
> us what the device actually supports long after the IOMMU
> configuration is done. What we really want is to be able to have
> separate IOMMU contexts for each device, or at the very least a
> separate context for the crippled devices.
>
> We could allow a per-device IOMMU context by extending the
> Master/Slave PE thing to cover DMA in addition to MMIO. Right now we
> only use slave PEs when a device's MMIO BARs extend over multiple m64
> segments. When that happens an MMIO error causes the PHB to freeze
> the PE corresponding to one of those segments, but not any of the
> others. To present a single "PE" to the EEH core we check the freeze
> status of each of the slave PEs when the EEH core does a PE status
> check and if any of them are frozen, we freeze the rest of them too.
> When a driver sets a limited DMA mask we could move that device to a
> separate slave PE so that it has its own IOMMU context tailored to
> its DMA addressing limits.
>
> Thoughts?

For what it's worth this sounds like a good idea to me, it just sounds
tricky to implement. You're adding another layer of complexity on top
of EEH (well, making things look simple to the EEH core and doing your
own freezing on top of it) in addition to the DMA handling.

If it works then great, it just has a high potential to become a new
bug haven.

>
> Oliver
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran wrote:
> On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy wrote:
> >
> > Anyone? Is it totally useless or a wrong approach? Thanks,
>
> I wouldn't say it's either, but I still hate it.
>
> The 4GB mode being per-PHB makes it difficult to use unless we force
> that mode on 100% of the time, which I'd prefer not to do. Ideally
> devices that actually support 64bit addressing (which is most of them)
> should be able to use no-translate mode when possible since a) it's
> faster, and b) it frees up room in the TCE cache for devices that
> actually need it. I know you've done some testing with 100G NICs and
> found the overhead was fine, but IMO that's a bad test since it's
> pretty much the best-case scenario: all the devices on the PHB are in
> the same PE. The PHB's TCE cache only hits when the TCE matches the
> DMA bus address and the PE number for the device, so in a multi-PE
> environment there's a lot of potential for TCE cache thrashing. If
> there were one or two PEs under that PHB it's probably not going to
> matter, but if you have an NVMe rack with 20 drives it starts to look
> a bit ugly.
>
> That all said, it might be worth doing this anyway since we probably
> want the software infrastructure in place to take advantage of it.
> Maybe expand the command line parameters to allow it to be enabled on
> a per-PHB basis rather than globally.

Since we're on the topic

I've been thinking the real issue we have is that we're trying to pick
an "optimal" IOMMU config at a point where we don't have enough
information to work out what's actually optimal. The IOMMU config is
done on a per-PE basis, but since PEs may contain devices with
different DMA masks (looking at you, weird AMD audio function) we're
always going to have to pick something conservative as the default
config for TVE#0 (64k, no bypass mapping) since the driver will tell us
what the device actually supports long after the IOMMU configuration is
done. What we really want is to be able to have separate IOMMU contexts
for each device, or at the very least a separate context for the
crippled devices.

We could allow a per-device IOMMU context by extending the Master/Slave
PE thing to cover DMA in addition to MMIO. Right now we only use slave
PEs when a device's MMIO BARs extend over multiple m64 segments. When
that happens an MMIO error causes the PHB to freeze the PE
corresponding to one of those segments, but not any of the others. To
present a single "PE" to the EEH core we check the freeze status of
each of the slave PEs when the EEH core does a PE status check and if
any of them are frozen, we freeze the rest of them too. When a driver
sets a limited DMA mask we could move that device to a separate slave
PE so that it has its own IOMMU context tailored to its DMA addressing
limits.

Thoughts?

Oliver
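[A sketch of the proposed flow, under assumed helper names:
pnv_alloc_slave_pe, pnv_rtt_map and pnv_setup_tce_window are inventions
for illustration, not the real pci-ioda.c API.]

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    extern int pnv_alloc_slave_pe(struct pci_dev *pdev);    /* hypothetical */
    extern void pnv_rtt_map(struct pci_dev *pdev, int pe);  /* hypothetical */
    extern int pnv_setup_tce_window(int pe, u64 mask);      /* hypothetical */

    static int pnv_dma_set_mask_sketch(struct pci_dev *pdev, u64 mask)
    {
            int pe;

            /* 64bit-capable devices stay in their native PE with bypass. */
            if (mask == DMA_BIT_MASK(64))
                    return 0;

            /* Crippled mask: give the device its own DMA-only slave PE
             * with a TCE window sized to what it can address. */
            pe = pnv_alloc_slave_pe(pdev);
            if (pe < 0)
                    return pe;

            pnv_rtt_map(pdev, pe);
            return pnv_setup_tce_window(pe, mask);
    }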
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy wrote:
>
> Anyone? Is it totally useless or a wrong approach? Thanks,

I wouldn't say it's either, but I still hate it.

The 4GB mode being per-PHB makes it difficult to use unless we force
that mode on 100% of the time, which I'd prefer not to do. Ideally
devices that actually support 64bit addressing (which is most of them)
should be able to use no-translate mode when possible since a) it's
faster, and b) it frees up room in the TCE cache for devices that
actually need it. I know you've done some testing with 100G NICs and
found the overhead was fine, but IMO that's a bad test since it's
pretty much the best-case scenario: all the devices on the PHB are in
the same PE. The PHB's TCE cache only hits when the TCE matches the DMA
bus address and the PE number for the device, so in a multi-PE
environment there's a lot of potential for TCE cache thrashing. If
there were one or two PEs under that PHB it's probably not going to
matter, but if you have an NVMe rack with 20 drives it starts to look a
bit ugly.

That all said, it might be worth doing this anyway since we probably
want the software infrastructure in place to take advantage of it.
Maybe expand the command line parameters to allow it to be enabled on a
per-PHB basis rather than globally.

Oliver
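[To illustrate the TCE cache behaviour described above, a minimal
sketch with a guessed entry layout: an entry only hits when both the
DMA bus address and the requesting PE match, so PEs sharing a PHB can
evict each other's translations. Purely illustrative, not the real
hardware structure.]

    #include <stdbool.h>
    #include <stdint.h>

    struct tce_cache_entry {
            uint64_t dma_addr;   /* tag: DMA bus address */
            uint16_t pe;         /* tag: PE number */
            uint64_t tce;        /* cached translation */
            bool     valid;
    };

    static bool tce_cache_hit(const struct tce_cache_entry *e,
                              uint64_t dma_addr, uint16_t pe)
    {
            return e->valid && e->dma_addr == dma_addr && e->pe == pe;
    }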
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
Anyone? Is it totally useless or a wrong approach? Thanks,

On 08/04/2020 19:43, Alexey Kardashevskiy wrote:
>
> On 23/03/2020 18:53, Alexey Kardashevskiy wrote:
>> Here is an attempt to support a bigger DMA space for devices
>> supporting DMA masks less than 59 bits (GPUs come to mind first).
>> POWER9 PHBs have an option to map 2 windows at 0 and select a window
>> based on the DMA address being below or above 4GB.
>>
>> This adds the "iommu=iommu_bypass" kernel parameter and supports the
>> VFIO+pseries machine - currently this requires telling
>> upstream+unmodified QEMU about this via
>> -global spapr-pci-host-bridge.dma64_win_addr=0x1
>> or a per-phb property. 4/4 advertises the new option but there is no
>> automation around it in QEMU (should there be?).
>>
>> For now it is either 1<<59 or 4GB mode; dynamic switching is not
>> supported (could be via sysfs).
>>
>> This is a rebased version of
>> https://lore.kernel.org/kvm/20191202015953.127902-1-...@ozlabs.ru/
>>
>> The main change since v1 is that now it is 7 patches with clearer
>> separation of steps.
>>
>> This is based on 6c90b86a745a "Merge tag 'mmc-v5.6-rc6' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc"
>>
>> Please comment. Thanks.
>
> Ping?
>
>> Alexey Kardashevskiy (7):
>>   powerpc/powernv/ioda: Move TCE bypass base to PE
>>   powerpc/powernv/ioda: Rework for huge DMA window at 4GB
>>   powerpc/powernv/ioda: Allow smaller TCE table levels
>>   powerpc/powernv/phb4: Use IOMMU instead of bypassing
>>   powerpc/iommu: Add a window number to
>>     iommu_table_group_ops::get_table_size
>>   powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
>>   vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB
>>
>>  arch/powerpc/include/asm/iommu.h              |   3 +
>>  arch/powerpc/include/asm/opal-api.h           |   9 +-
>>  arch/powerpc/include/asm/opal.h               |   2 +
>>  arch/powerpc/platforms/powernv/pci.h          |   4 +-
>>  include/uapi/linux/vfio.h                     |   2 +
>>  arch/powerpc/platforms/powernv/npu-dma.c      |   1 +
>>  arch/powerpc/platforms/powernv/opal-call.c    |   2 +
>>  arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
>>  arch/powerpc/platforms/powernv/pci-ioda.c     | 234 ++
>>  drivers/vfio/vfio_iommu_spapr_tce.c           |  17 +-
>>  10 files changed, 213 insertions(+), 65 deletions(-)

-- 
Alexey
Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
On 23/03/2020 18:53, Alexey Kardashevskiy wrote:
> Here is an attempt to support a bigger DMA space for devices
> supporting DMA masks less than 59 bits (GPUs come to mind first).
> POWER9 PHBs have an option to map 2 windows at 0 and select a window
> based on the DMA address being below or above 4GB.
>
> This adds the "iommu=iommu_bypass" kernel parameter and supports the
> VFIO+pseries machine - currently this requires telling
> upstream+unmodified QEMU about this via
> -global spapr-pci-host-bridge.dma64_win_addr=0x1
> or a per-phb property. 4/4 advertises the new option but there is no
> automation around it in QEMU (should there be?).
>
> For now it is either 1<<59 or 4GB mode; dynamic switching is not
> supported (could be via sysfs).
>
> This is a rebased version of
> https://lore.kernel.org/kvm/20191202015953.127902-1-...@ozlabs.ru/
>
> The main change since v1 is that now it is 7 patches with clearer
> separation of steps.
>
> This is based on 6c90b86a745a "Merge tag 'mmc-v5.6-rc6' of
> git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc"
>
> Please comment. Thanks.

Ping?

> Alexey Kardashevskiy (7):
>   powerpc/powernv/ioda: Move TCE bypass base to PE
>   powerpc/powernv/ioda: Rework for huge DMA window at 4GB
>   powerpc/powernv/ioda: Allow smaller TCE table levels
>   powerpc/powernv/phb4: Use IOMMU instead of bypassing
>   powerpc/iommu: Add a window number to
>     iommu_table_group_ops::get_table_size
>   powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
>   vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB
>
>  arch/powerpc/include/asm/iommu.h              |   3 +
>  arch/powerpc/include/asm/opal-api.h           |   9 +-
>  arch/powerpc/include/asm/opal.h               |   2 +
>  arch/powerpc/platforms/powernv/pci.h          |   4 +-
>  include/uapi/linux/vfio.h                     |   2 +
>  arch/powerpc/platforms/powernv/npu-dma.c      |   1 +
>  arch/powerpc/platforms/powernv/opal-call.c    |   2 +
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
>  arch/powerpc/platforms/powernv/pci-ioda.c     | 234 ++
>  drivers/vfio/vfio_iommu_spapr_tce.c           |  17 +-
>  10 files changed, 213 insertions(+), 65 deletions(-)

-- 
Alexey
[PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
Here is an attempt to support a bigger DMA space for devices supporting
DMA masks less than 59 bits (GPUs come to mind first). POWER9 PHBs have
an option to map 2 windows at 0 and select a window based on the DMA
address being below or above 4GB.

This adds the "iommu=iommu_bypass" kernel parameter and supports the
VFIO+pseries machine - currently this requires telling
upstream+unmodified QEMU about this via
-global spapr-pci-host-bridge.dma64_win_addr=0x1
or a per-phb property. 4/4 advertises the new option but there is no
automation around it in QEMU (should there be?).

For now it is either 1<<59 or 4GB mode; dynamic switching is not
supported (could be via sysfs).

This is a rebased version of
https://lore.kernel.org/kvm/20191202015953.127902-1-...@ozlabs.ru/

The main change since v1 is that now it is 7 patches with clearer
separation of steps.

This is based on 6c90b86a745a "Merge tag 'mmc-v5.6-rc6' of
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc"

Please comment. Thanks.

Alexey Kardashevskiy (7):
  powerpc/powernv/ioda: Move TCE bypass base to PE
  powerpc/powernv/ioda: Rework for huge DMA window at 4GB
  powerpc/powernv/ioda: Allow smaller TCE table levels
  powerpc/powernv/phb4: Use IOMMU instead of bypassing
  powerpc/iommu: Add a window number to
    iommu_table_group_ops::get_table_size
  powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
  vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB

 arch/powerpc/include/asm/iommu.h              |   3 +
 arch/powerpc/include/asm/opal-api.h           |   9 +-
 arch/powerpc/include/asm/opal.h               |   2 +
 arch/powerpc/platforms/powernv/pci.h          |   4 +-
 include/uapi/linux/vfio.h                     |   2 +
 arch/powerpc/platforms/powernv/npu-dma.c      |   1 +
 arch/powerpc/platforms/powernv/opal-call.c    |   2 +
 arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
 arch/powerpc/platforms/powernv/pci-ioda.c     | 234 ++
 drivers/vfio/vfio_iommu_spapr_tce.c           |  17 +-
 10 files changed, 213 insertions(+), 65 deletions(-)

-- 
2.17.1
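[To illustrate the window-selection rule from the cover letter, a
minimal sketch: with the 4GB mode enabled the PHB picks between the two
windows purely on the DMA address. Which TVE index maps to which window
is an assumption, and tve_for_addr is an invented name.]

    #include <stdint.h>

    #define SZ_4G (1ULL << 32)

    /* Which TVE/window would serve a given DMA address in 4GB mode. */
    static int tve_for_addr(uint64_t dma_addr)
    {
            return dma_addr < SZ_4G ? 0 : 1;
    }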