Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows

2022-03-23 Thread Marc Zyngier
On Tue, 22 Mar 2022 17:27:36 +,
Robin Murphy  wrote:
> 
> Originally, creating the dma_ranges resource list in pre-sorted fashion
> was the simplest and most efficient way to enforce the order required by
> iova_reserve_pci_windows(). However since then at least one PCI host
> driver is now re-sorting the list for its own probe-time processing,
> which doesn't seem entirely unreasonable, so that basic assumption no
> longer holds. Make iommu-dma robust and get the sort order it needs by
> explicitly sorting, which means we can also save the effort at creation
> time and just build the list in whatever natural order the DT had.
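For reference, the explicit sort boils down to a list_sort() over the host
bridge's dma_ranges; a minimal sketch along those lines follows
(resource_entry is the PCI core's resource-list element; the authoritative
change is of course the patch itself):

```c
#include <linux/list_sort.h>
#include <linux/pci.h>
#include <linux/resource_ext.h>

/* Order PCI DMA windows by ascending base address, the order that
 * iova_reserve_pci_windows() relies on. */
static int iommu_dma_ranges_sort(void *priv, const struct list_head *a,
				 const struct list_head *b)
{
	struct resource_entry *res_a = list_entry(a, typeof(*res_a), node);
	struct resource_entry *res_b = list_entry(b, typeof(*res_b), node);

	return res_a->res->start > res_b->res->start;
}

static void sort_pci_dma_windows(struct pci_host_bridge *bridge)
{
	list_sort(NULL, &bridge->dma_ranges, iommu_dma_ranges_sort);
}
```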
> 
> Signed-off-by: Robin Murphy 
> ---
> 
> Looking at this area off the back of the XGene thread[1] made me realise
> that we need to do it anyway, regardless of whether it might also happen
> to restore the previous XGene behaviour or not. Presumably nobody's
> tried to use pcie-cadence-host behind an IOMMU yet...

This definitely restores PCIe functionality on my Mustang (XGene-1).
Hopefully dann can comment on whether this addresses his own issue, as
his firmware is significantly different.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-23 Thread Kalle Valo
Adding regressions list so that this can be tracked properly, including
the full report below.

Oleksandr Natalenko  writes:

> Hello.
>
> The following upstream commits:
>
> aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
> ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
>
> break the ath9k-based Wi-Fi access point for me. The AP emits beacons, but no
> client can connect to it, either from the very beginning, or shortly after
> start. These are the only symptoms I've noticed (i.e., no BUG/WARNING
> messages in `dmesg` etc).
>
> The hardware is:
>
> ```
> $ dmesg | grep -i swiotlb
> [0.426785] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>
> BIOS Information
> Vendor: American Megatrends Inc.
> Version: P1.50
> Release Date: 04/16/2018
>
> Base Board Information
> Manufacturer: ASRock
> Product Name: J3710-ITX
>
> 02:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network Adapter (rev 01)
>   Subsystem: Lite-On Communications Inc Device 6621
>   Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>   Latency: 0, Cache Line Size: 64 bytes
>   Interrupt: pin A routed to IRQ 17
>   Region 0: Memory at 8140 (64-bit, non-prefetchable) [size=512K]
>   Expansion ROM at 8148 [disabled] [size=64K]
>   Capabilities: [40] Power Management version 2
>     Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
>     Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>   Capabilities: [50] MSI: Enable- Count=1/4 Maskable+ 64bit+
>     Address:   Data: 
>     Masking:   Pending: 
>   Capabilities: [70] Express (v2) Endpoint, MSI 00
>     DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
>       ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
>     DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>       RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
>       MaxPayload 128 bytes, MaxReadReq 512 bytes
>     DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>     LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
>       ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>     LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
>       ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>     LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
>       TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>     DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR-
>       10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
>       EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>       FRS- TPHComp- ExtTPHComp-
>       AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>     DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
>       AtomicOpsCtl: ReqEn-
>     LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
>     LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>       Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>       Compliance De-emphasis: -6dB
>     LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
>       EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
>       Retimer- 2Retimers- CrosslinkRes: unsupported
>   Capabilities: [100 v1] Advanced Error Reporting
>     UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>     UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>     UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>     CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>     CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>     AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>       MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>     HeaderLog:    
>   Capabilities: [140 v1] Virtual Channel
>     Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
>     Arb:    Fixed- WRR32- WRR64- WRR128-
>     Ctrl:   ArbSelect=Fixed
> 


[PATCH] iommu/arm: Expose ARM SMMUv3 related registers via sysfs

2022-03-23 Thread Miguel Luis
Allows userspace to check for SMMUv3 features.

Expose the following RO registers related to ARM SMMUv3 via sysfs:
SMMU_IDR0
SMMU_IDR1
SMMU_IDR2
SMMU_IDR3
SMMU_IDR4
SMMU_IDR5
SMMU_IDR6
SMMU_IIDR
SMMU_AIDR

 # find /sys | grep arm-iommu
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr5
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr3
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr1
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_aidr
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr6
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr4
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_iidr
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr2
/sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu/smmu_idr0
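
For reference, wiring up one of these read-only attributes looks roughly like
the following - a minimal sketch, assuming the register value is cached in
struct arm_smmu_device (as the diff below does for idr0); the attribute-group
plumbing is elided:

```c
static ssize_t smmu_idr0_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	struct arm_smmu_device *smmu = dev_get_drvdata(dev);

	return sysfs_emit(buf, "0x%08x\n", smmu->idr0);
}
static DEVICE_ATTR_RO(smmu_idr0);
```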

Signed-off-by: Miguel Luis 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 199 
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  27 +++
 2 files changed, 191 insertions(+), 35 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6dc6d8b6b368..7f779d3f88f2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3424,17 +3424,16 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
-   u32 reg;
bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
 
/* IDR0 */
-   reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
+   smmu->idr0 = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
 
/* 2-level structures */
-   if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
+   if (FIELD_GET(IDR0_ST_LVL, smmu->idr0) == IDR0_ST_LVL_2LVL)
smmu->features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
 
-   if (reg & IDR0_CD2L)
+   if (smmu->idr0 & IDR0_CD2L)
smmu->features |= ARM_SMMU_FEAT_2_LVL_CDTAB;
 
/*
@@ -3442,7 +3441,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 * We currently require the same endianness as the CPU, but this
 * could be changed later by adding a new IO_PGTABLE_QUIRK.
 */
-   switch (FIELD_GET(IDR0_TTENDIAN, reg)) {
+   switch (FIELD_GET(IDR0_TTENDIAN, smmu->idr0)) {
case IDR0_TTENDIAN_MIXED:
smmu->features |= ARM_SMMU_FEAT_TT_LE | ARM_SMMU_FEAT_TT_BE;
break;
@@ -3461,22 +3460,22 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
}
 
/* Boolean feature flags */
-   if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
+   if (IS_ENABLED(CONFIG_PCI_PRI) && smmu->idr0 & IDR0_PRI)
smmu->features |= ARM_SMMU_FEAT_PRI;
 
-   if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
+   if (IS_ENABLED(CONFIG_PCI_ATS) && smmu->idr0 & IDR0_ATS)
smmu->features |= ARM_SMMU_FEAT_ATS;
 
-   if (reg & IDR0_SEV)
+   if (smmu->idr0 & IDR0_SEV)
smmu->features |= ARM_SMMU_FEAT_SEV;
 
-   if (reg & IDR0_MSI) {
+   if (smmu->idr0 & IDR0_MSI) {
smmu->features |= ARM_SMMU_FEAT_MSI;
if (coherent && !disable_msipolling)
smmu->options |= ARM_SMMU_OPT_MSIPOLL;
}
 
-   if (reg & IDR0_HYP) {
+   if (smmu->idr0 & IDR0_HYP) {
smmu->features |= ARM_SMMU_FEAT_HYP;
if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
smmu->features |= ARM_SMMU_FEAT_E2H;
@@ -3486,11 +3485,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 * The coherency feature as set by FW is used in preference to the ID
 * register, but warn on mismatch.
 */
-   if (!!(reg & IDR0_COHACC) != coherent)
+   if (!!(smmu->idr0 & IDR0_COHACC) != coherent)
dev_warn(smmu->dev, "IDR0.COHACC overridden by FW configuration (%s)\n",
 coherent ? "true" : "false");
 
-   switch (FIELD_GET(IDR0_STALL_MODEL, reg)) {
+   switch (FIELD_GET(IDR0_STALL_MODEL, smmu->idr0)) {
case IDR0_STALL_MODEL_FORCE:
smmu->features |= ARM_SMMU_FEAT_STALL_FORCE;
fallthrough;
@@ -3498,19 +3497,19 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
smmu->features |= ARM_SMMU_FEAT_STALLS;
}
 
-   if (reg & IDR0_S1P)
+   if (smmu->idr0 & IDR0_S1P)
smmu->features |= ARM_SMMU_FEAT_TRANS_S1;
 
-   if (reg & IDR0_S2P)
+   if (smmu->idr0 & IDR0_S2P)
smmu->features 

Re: [PATCH] iommu/arm: Expose ARM SMMUv3 related registers via sysfs

2022-03-23 Thread Robin Murphy

On 2022-03-23 12:54, Miguel Luis wrote:
> Allows userspace to check for SMMUv3 features.

What will userspace do with that information?

It hardly matters what fancy new features might be present, if the
kernel and/or the abstracted interfaces available to userspace aren't
using them. Any functionality which is supported by a usable interface
should ideally be represented via that interface itself.

Furthermore, many of the raw SMMU features depend on other system
components and/or firmware, so the ID registers alone don't tell the
full story anyway.

> Expose the following RO registers related to ARM SMMUv3 via sysfs:
> SMMU_IDR0
> SMMU_IDR1
> SMMU_IDR2
> SMMU_IDR3
> SMMU_IDR4
> SMMU_IDR5
> SMMU_IDR6
> SMMU_IIDR
> SMMU_AIDR
>
>  # find /sys | grep arm-iommu
> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu

Nit: my main comment above notwithstanding, is this level of hierarchy
meaningful or useful? "arm-iommu" isn't an established name for anything
as far as I'm aware.

Robin.


Re: [PATCH] iommu/arm: Expose ARM SMMUv3 related registers via sysfs

2022-03-23 Thread Miguel Luis


> On 23 Mar 2022, at 12:40, Robin Murphy  wrote:
> 
> On 2022-03-23 12:54, Miguel Luis wrote:
>> Allows userspace to check for SMMUv3 features.
> 
> What will userspace do with that information?
> 
> It hardly matters what fancy new features might be present, if the kernel 
> and/or the abstracted interfaces available to userspace aren't using them. 
> Any functionality which is supported by a usable interface should ideally be 
> represented via that interface itself.
> 

The inspiration was the same as what Intel (cap/ecap) and AMD (cap/features)
did in exposing their IOMMU feature registers via sysfs. It's an easy way to
see which features the hardware supports, regardless of what the kernel
supports.

For example, I could print smmu->features, and that would cover the
kernel-supported features, but it wouldn't help with knowing which features
the hardware itself supports when new hardware arrives.

> Furthermore many of the raw SMMU features depend on other system components 
> and/or firmware, so the ID registers alone don't tell the full story anyway.
> 

Would you mind elaborating a bit more on that, please? Does that mean that
even if a feature bit is set, it doesn't really tell us that the feature is
supported?

>> Expose the following RO registers related to ARM SMMUv3 via sysfs:
>> SMMU_IDR0
>> SMMU_IDR1
>> SMMU_IDR2
>> SMMU_IDR3
>> SMMU_IDR4
>> SMMU_IDR5
>> SMMU_IDR6
>> SMMU_IIDR
>> SMMU_AIDR
>>  # find /sys | grep arm-iommu
>> /sys/devices/platform/905.smmuv3/iommu/smmu3.0x0905/arm-iommu
> 
> Nit: my main comment above notwithstanding, is this level of hierarchy
> meaningful or useful? "arm-iommu" isn't an established name for anything as
> far as I'm aware.
> 

I've followed the existing convention in other IOMMUs, but I'm totally open
to alternatives/suggestions.

Miguel

> Robin.
> 

Re: [PATCH RFC 05/12] iommufd: PFN handling for iopt_pages

2022-03-23 Thread Niklas Schnelle
On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> The top of the data structure provides an IO Address Space (IOAS) that is
> similar to a VFIO container. The IOAS allows map/unmap of memory into
> ranges of IOVA called iopt_areas. Domains and in-kernel users (like VFIO
> mdevs) can be attached to the IOAS to access the PFNs that those IOVA
> areas cover.
> 
> The IO Address Space (IOAS) data structure is composed of:
>  - struct io_pagetable holding the IOVA map
>  - struct iopt_areas representing populated portions of IOVA
>  - struct iopt_pages representing the storage of PFNs
>  - struct iommu_domain representing the IO page table in the system IOMMU
>  - struct iopt_pages_user representing in-kernel users of PFNs (ie VFIO
>mdevs)
>  - struct xarray pinned_pfns holding a list of pages pinned by in-kernel
>users
> 
> This patch introduces the lowest part of the data structure - the movement
> of PFNs in a tiered storage scheme:
>  1) iopt_pages::pinned_pfns xarray
>  2) An iommu_domain
>  3) The origin of the PFNs, i.e. the userspace pointer
> 
> PFNs have to be copied between all combinations of tiers, depending on the
> configuration.
> 
> The interface is an iterator called a 'pfn_reader' which determines which
> tier each PFN is stored in and loads it into a list of PFNs held in a struct
> pfn_batch.
> 
> Each step of the iterator will fill up the pfn_batch, then the caller can
> use the pfn_batch to send the PFNs to the required destination. Repeating
> this loop will read all the PFNs in an IOVA range.
> 
> The pfn_reader and pfn_batch also keep track of the pinned page accounting.
> 
> While PFNs are always stored and accessed as full PAGE_SIZE units the
> iommu_domain tier can store with a sub-page offset/length to support
> IOMMUs with a smaller IOPTE size than PAGE_SIZE.
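
The loop shape that implies is roughly the following - a sketch based on the
description above; the helper names mirror the prose and may not match the
RFC's exact API:

```c
	struct pfn_reader pfns;
	int rc;

	/* Hypothetical names per the cover text: iterate the index range in
	 * batches, loading each batch from whichever tier holds the PFNs. */
	for (rc = pfn_reader_first(&pfns, pages, start_index, last_index);
	     !rc && !pfn_reader_done(&pfns); rc = pfn_reader_next(&pfns)) {
		/* pfns.batch now holds a run of PFNs; copy it to the
		 * destination tier (pinned_pfns xarray, iommu_domain, or
		 * an in-kernel user). */
	}
	pfn_reader_destroy(&pfns);
	return rc;
```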
> 
> Signed-off-by: Jason Gunthorpe 
> ---
>  drivers/iommu/iommufd/Makefile  |   3 +-
>  drivers/iommu/iommufd/io_pagetable.h| 101 
>  drivers/iommu/iommufd/iommufd_private.h |  20 +
>  drivers/iommu/iommufd/pages.c   | 723 
>  4 files changed, 846 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/iommu/iommufd/io_pagetable.h
>  create mode 100644 drivers/iommu/iommufd/pages.c
> 
> 
---8<---
> +
> +/*
> + * This holds a pinned page list for multiple areas of IO address space. The
> + * pages always originate from a linear chunk of userspace VA. Multiple
> + * io_pagetable's, through their iopt_area's, can share a single iopt_pages
> + * which avoids multi-pinning and double accounting of page consumption.
> + *
> + * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
> + * the start of the uptr and extend to npages. pages are pinned dynamically
> + * according to the intervals in the users_itree and domains_itree, npages
> + * records the current number of pages pinned.

This sounds wrong or at least badly named. If npages records the
current number of pages pinned then what does npinned record?

> + */
> +struct iopt_pages {
> + struct kref kref;
> + struct mutex mutex;
> + size_t npages;
> + size_t npinned;
> + size_t last_npinned;
> + struct task_struct *source_task;
> + struct mm_struct *source_mm;
> + struct user_struct *source_user;
> + void __user *uptr;
> + bool writable:1;
> + bool has_cap_ipc_lock:1;
> +
> + struct xarray pinned_pfns;
> + /* Of iopt_pages_user::node */
> + struct rb_root_cached users_itree;
> + /* Of iopt_area::pages_node */
> + struct rb_root_cached domains_itree;
> +};
> +
---8<---



RE: [PATCH v8 02/11] iommu: Introduce a union to struct iommu_resv_region

2022-03-23 Thread Shameerali Kolothum Thodi via iommu



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: 22 March 2022 18:27
> To: Shameerali Kolothum Thodi ;
> linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org;
> iommu@lists.linux-foundation.org
> Cc: Linuxarm ; lorenzo.pieral...@arm.com;
> j...@8bytes.org; w...@kernel.org; wanghuiqiang
> ; Guohanjun (Hanjun Guo)
> ; steven.pr...@arm.com; sami.muja...@arm.com;
> j...@solid-run.com; eric.au...@redhat.com; yangyicong
> 
> Subject: Re: [PATCH v8 02/11] iommu: Introduce a union to struct
> iommu_resv_region
> 
> On 2022-02-21 15:43, Shameer Kolothum wrote:
> > A union is introduced to struct iommu_resv_region to hold
> > any firmware specific data. This is in preparation to add
> > support for IORT RMR reserve regions and the union now holds
> > the RMR specific information.
> >
> > Signed-off-by: Shameer Kolothum
> 
> > ---
> >   include/linux/iommu.h | 9 +
> >   1 file changed, 9 insertions(+)
> >
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index de0c57a567c8..b06952a75f95 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -126,6 +126,11 @@ enum iommu_resv_type {
> > IOMMU_RESV_SW_MSI,
> >   };
> >
> > +struct iommu_iort_rmr_data {
> > +   u32 *sids;  /* Stream Ids associated with IORT RMR entry */
> 
> Please make this const.
> 
> Further nit: capitalisation of "IDs" in the comment, otherwise I might
> worry about the possibility of Stream Egos too :P

True :). Will do that.

Thanks,
Shameer 

> 
> > +   u32 num_sids;
> > +};
> > +
> >   /**
> >* struct iommu_resv_region - descriptor for a reserved memory region
> >* @list: Linked list pointers
> > @@ -133,6 +138,7 @@ enum iommu_resv_type {
> >* @length: Length of the region in bytes
> >* @prot: IOMMU Protection flags (READ/WRITE/...)
> >* @type: Type of the reserved region
> > + * @fw_data: FW specific reserved region data
> 
> Nit: we've got plenty of room to spell out "Firmware-specific", and it
> never hurts to make documentation as easy to read as possible.
> 
> Thanks,
> Robin.
> 
> >*/
> >   struct iommu_resv_region {
> > struct list_headlist;
> > @@ -140,6 +146,9 @@ struct iommu_resv_region {
> > size_t  length;
> > int prot;
> > enum iommu_resv_typetype;
> > +   union {
> > +   struct iommu_iort_rmr_data rmr;
> > +   } fw_data;
> >   };
> >
> >   /**


RE: [PATCH v8 05/11] ACPI/IORT: Add a helper to retrieve RMR memory regions

2022-03-23 Thread Shameerali Kolothum Thodi via iommu



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: 22 March 2022 19:09
> To: Shameerali Kolothum Thodi ;
> linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org;
> iommu@lists.linux-foundation.org
> Cc: j...@solid-run.com; Linuxarm ;
> steven.pr...@arm.com; Guohanjun (Hanjun Guo) ;
> yangyicong ; sami.muja...@arm.com;
> w...@kernel.org; wanghuiqiang 
> Subject: Re: [PATCH v8 05/11] ACPI/IORT: Add a helper to retrieve RMR
> memory regions
> 
> On 2022-02-21 15:43, Shameer Kolothum via iommu wrote:
> > Add helper functions (iort_iommu_get/put_rmrs()) that
> > retrieves/releases RMR memory descriptors associated
> > with a given IOMMU. This will be used by IOMMU drivers
> > to set up necessary mappings.
> >
> > Invoke it from the generic iommu helper functions.
> 
> iommu_dma_get_resv_regions() already exists - please extend that rather
> than adding a parallel implementation of the same thing but different.
> IORT should export a single get_resv_regions helper which combines the
> new RMRs with the existing MSI workaround, and a separate "do I need to
> bypass this StreamID" helper for the SMMU drivers to call directly at
> reset time, since the latter isn't really an iommu-dma responsibility.

Right. I actually went back and forth a couple of times on the interfaces and
settled on this mainly because it requires just a single interface from IORT
for both get_resv_regions() and the SMMU driver reset bypass path,
i.e. iort_iommu_get/put_rmrs().

But I agree the above comment is also valid.

> I'm happy to do that just by shuffling wrappers around for now - we can
> come back and streamline the code properly afterwards - but the sheer
> amount of indirection currently at play here is so hard to follow that
> it's not even all that easy to see how it's crossing abstraction levels
> improperly.

Please find below the revised ones. Please take a look and let me know.

Thanks,
Shameer

iommu-dma:

void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list)
{
	if (!is_of_node(dev_iommu_fwspec_get(dev)->iommu_fwnode))
		iort_iommu_get_resv_regions(dev, list);
}

void iommu_dma_put_resv_regions(struct device *dev, struct list_head *list)
{
	if (!is_of_node(dev_iommu_fwspec_get(dev)->iommu_fwnode))
		iort_iommu_put_resv_regions(dev, list);
	generic_iommu_put_resv_regions(dev, list);
}

iort:

void iort_iommu_get_resv_regions(struct device *dev, struct list_head *head)
{
	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);

	iort_iommu_msi_get_resv_regions(dev, head);
	iort_iommu_get_rmrs(fwspec->iommu_fwnode, dev, head);
}

void iort_iommu_put_resv_regions(struct device *dev, struct list_head *head)
{
	/* Free both RMRs and HW MSI ones */
}

/* The below ones will be directly called from SMMU drivers during reset */
void iort_get_rmr_sids(struct fwnode_handle *iommu_fwnode, struct list_head *head)
{
	iort_iommu_get_rmrs(iommu_fwnode, NULL, head);
}

void iort_put_rmr_sids(struct fwnode_handle *iommu_fwnode, struct list_head *head)
{
	iort_iommu_put_resv_regions(NULL, head);
}


Re: [PATCH RFC 05/12] iommufd: PFN handling for iopt_pages

2022-03-23 Thread Jason Gunthorpe via iommu
On Wed, Mar 23, 2022 at 04:37:50PM +0100, Niklas Schnelle wrote:
> > +/*
> > + * This holds a pinned page list for multiple areas of IO address space. The
> > + * pages always originate from a linear chunk of userspace VA. Multiple
> > + * io_pagetable's, through their iopt_area's, can share a single iopt_pages
> > + * which avoids multi-pinning and double accounting of page consumption.
> > + *
> > + * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
> > + * the start of the uptr and extend to npages. pages are pinned dynamically
> > + * according to the intervals in the users_itree and domains_itree, npages
> > + * records the current number of pages pinned.
> 
> This sounds wrong or at least badly named. If npages records the
> current number of pages pinned then what does npinned record?

Oh, it is a typo, the 2nd npages should be npinned, thanks

Jason


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-23 Thread Linus Torvalds
On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko
 wrote:
>
> The following upstream commits:
>
> aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
> ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
>
> break ath9k-based Wi-Fi access point for me. The AP emits beacons, but
> no client can connect to it, either from the very beginning, or
> shortly after start. These are the only symptoms I've noticed (i.e.,
> no BUG/WARNING messages in `dmesg` etc).

Funky, but clearly true:

> These commits appeared in v5.17 and v5.16.15, and both kernels are
> broken for me. I'm pretty confident these commits make the difference
> since I've built both v5.17 and v5.16.15 without them, and it fixed
> the issue.

Can you double-check (or just explicitly confirm if you already did
that test) that you need to revert *both* of those commits, and it's
the later "rework" fix that triggers it?

> So, I do understand this might be an issue with regard to SG I/O
> handling in ath9k, hence relevant people in Cc.

Yeah, almost certainly an ath9k bug, but a regression is a regression,
so if people can't find the issue in ath9k, we'll have to revert those
commits.

Honestly, I personally think they were a bit draconian to begin with,
and didn't limit their effects sufficiently.

I'm assuming that the ath9k issue is that it gives DMA mapping a big
enough area to handle any possible packet size, and just expects -
quite reasonably - smaller packets to only fill the part they need.

Which that "info leak" patch obviously breaks entirely.

So my expectation is that this is something we'll just revert, but it
would be really good to have the ath9k people double-check.

Linus


Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers

2022-03-23 Thread Alex Williamson
On Fri, 18 Mar 2022 14:27:35 -0300
Jason Gunthorpe  wrote:
> +/**
> + * iommufd_bind_pci_device - Bind a physical device to an iommu fd
> + * @fd: iommufd file descriptor.
> + * @pdev: Pointer to a physical PCI device struct
> + * @id: Output ID number to return to userspace for this device
> + *
> + * A successful bind establishes an ownership over the device and returns
> + * struct iommufd_device pointer, otherwise returns error pointer.
> + *
> + * A driver using this API must set driver_managed_dma and must not touch
> + * the device until this routine succeeds and establishes ownership.
> + *
> + * Binding a PCI device places the entire RID under iommufd control.
> + *
> + * The caller must undo this with iommufd_unbind_device()
> + */
> +struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
> +u32 *id)
> +{
> + struct iommufd_device *idev;
> + struct iommufd_ctx *ictx;
> + struct iommu_group *group;
> + int rc;
> +
> + ictx = iommufd_fget(fd);
> + if (!ictx)
> + return ERR_PTR(-EINVAL);
> +
> + group = iommu_group_get(&pdev->dev);
> + if (!group) {
> + rc = -ENODEV;
> + goto out_file_put;
> + }
> +
> + /*
> +  * FIXME: Use a device-centric iommu api and this won't work with
> +  * multi-device groups
> +  */
> + rc = iommu_group_claim_dma_owner(group, ictx->filp);
> + if (rc)
> + goto out_group_put;
> +
> + idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE);
> + if (IS_ERR(idev)) {
> + rc = PTR_ERR(idev);
> + goto out_release_owner;
> + }
> + idev->ictx = ictx;
> + idev->dev = &pdev->dev;
> + /* The calling driver is a user until iommufd_unbind_device() */
> + refcount_inc(&idev->obj.users);
> + /* group refcount moves into iommufd_device */
> + idev->group = group;
> +
> + /*
> +  * If the caller fails after this success it must call
> +  * iommufd_unbind_device() which is safe since we hold this refcount.
> +  * This also means the device is a leaf in the graph and no other object
> +  * can take a reference on it.
> +  */
> + iommufd_object_finalize(ictx, &idev->obj);
> + *id = idev->obj.id;
> + return idev;
> +
> +out_release_owner:
> + iommu_group_release_dma_owner(group);
> +out_group_put:
> + iommu_group_put(group);
> +out_file_put:
> + fput(ictx->filp);
> + return ERR_PTR(rc);
> +}
> +EXPORT_SYMBOL_GPL(iommufd_bind_pci_device);
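
For context, the intended calling convention from a driver would look
something like this - a sketch derived from the kdoc above; error handling is
trimmed and the surrounding driver is hypothetical:

```c
	struct iommufd_device *idev;
	u32 device_id;

	/* Establish iommufd ownership over the whole RID. */
	idev = iommufd_bind_pci_device(iommufd_fd, pdev, &device_id);
	if (IS_ERR(idev))
		return PTR_ERR(idev);

	/* ... report device_id to userspace, attach to an IOAS, do DMA ... */

	/* Undo the bind when the driver is done with the device. */
	iommufd_unbind_device(idev);
```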

I'm stumped why this needs to be PCI specific.  Anything beyond the RID
comment?  Please enlighten.  Thanks,

Alex



Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers

2022-03-23 Thread Jason Gunthorpe via iommu
On Wed, Mar 23, 2022 at 12:10:01PM -0600, Alex Williamson wrote:
> > +EXPORT_SYMBOL_GPL(iommufd_bind_pci_device);
> 
> I'm stumped why this needs to be PCI specific.  Anything beyond the RID
> comment?  Please enlighten.  Thanks,

The way it turned out in the end, there is no longer a good reason for it.
I'll change it.

Jason


Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping

2022-03-23 Thread Jason Gunthorpe via iommu
On Tue, Mar 22, 2022 at 04:15:44PM -0600, Alex Williamson wrote:

> > +static struct iopt_area *
> > +iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages,
> > +   unsigned long iova, unsigned long start_byte,
> > +   unsigned long length, int iommu_prot, unsigned int flags)
> > +{
> > +   struct iopt_area *area;
> > +   int rc;
> > +
> > +   area = kzalloc(sizeof(*area), GFP_KERNEL);
> > +   if (!area)
> > +   return ERR_PTR(-ENOMEM);
> > +
> > +   area->iopt = iopt;
> > +   area->iommu_prot = iommu_prot;
> > +   area->page_offset = start_byte % PAGE_SIZE;
> > +   area->pages_node.start = start_byte / PAGE_SIZE;
> > +   if (check_add_overflow(start_byte, length - 1, &area->pages_node.last))
> > +   return ERR_PTR(-EOVERFLOW);
> > +   area->pages_node.last = area->pages_node.last / PAGE_SIZE;
> > +   if (WARN_ON(area->pages_node.last >= pages->npages))
> > +   return ERR_PTR(-EOVERFLOW);
> 
> @area leaked in the above two error cases.

Yes, thanks
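
(For clarity, the fix is just to free @area on those error paths - a sketch:)

```c
	if (check_add_overflow(start_byte, length - 1, &area->pages_node.last)) {
		kfree(area);		/* don't leak @area on overflow */
		return ERR_PTR(-EOVERFLOW);
	}
	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
	if (WARN_ON(area->pages_node.last >= pages->npages)) {
		kfree(area);
		return ERR_PTR(-EOVERFLOW);
	}
```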

> > +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> > + unsigned long length, struct page **out_pages, bool write)
> > +{
> > +   unsigned long cur_iova = iova;
> > +   unsigned long last_iova;
> > +   struct iopt_area *area;
> > +   int rc;
> > +
> > +   if (!length)
> > +   return -EINVAL;
> > +   if (check_add_overflow(iova, length - 1, &last_iova))
> > +   return -EOVERFLOW;
> > +
> > +   down_read(&iopt->iova_rwsem);
> > +   for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> > +area = iopt_area_iter_next(area, iova, last_iova)) {
> > +   unsigned long last = min(last_iova, iopt_area_last_iova(area));
> > +   unsigned long last_index;
> > +   unsigned long index;
> > +
> > +   /* Need contiguous areas in the access */
> > +   if (iopt_area_iova(area) < cur_iova || !area->pages) {
> ^^^
> Should this be (cur_iova != iova && iopt_area_iova(area) < cur_iova)?

Oh good eye

That is a typo: the < should be a >:

if (iopt_area_iova(area) > cur_iova || !area->pages) {

There are three boundary conditions here to worry about
 - interval trees return any nodes that intersect the queried range
   so the first found node can start after iova

 - There can be a gap in the intervals

 - The final area can end short of last_iova

The last one is botched too and needs this:
for ... { ...
}
+   if (cur_iova != last_iova)
+   goto out_remove;

The test suite isn't covering the boundary cases here yet; I added a
FIXME for this for now.

Thanks,
Jason


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-23 Thread Robin Murphy

On 2022-03-23 17:27, Linus Torvalds wrote:

> On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko
>  wrote:
> >
> > The following upstream commits:
> >
> > aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
> > ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
> >
> > break ath9k-based Wi-Fi access point for me. The AP emits beacons, but
> > no client can connect to it, either from the very beginning, or
> > shortly after start. These are the only symptoms I've noticed (i.e.,
> > no BUG/WARNING messages in `dmesg` etc).
>
> Funky, but clearly true:
>
> > These commits appeared in v5.17 and v5.16.15, and both kernels are
> > broken for me. I'm pretty confident these commits make the difference
> > since I've built both v5.17 and v5.16.15 without them, and it fixed
> > the issue.
>
> Can you double-check (or just explicitly confirm if you already did
> that test) that you need to revert *both* of those commits, and it's
> the later "rework" fix that triggers it?
>
> > So, I do understand this might be an issue with regard to SG I/O
> > handling in ath9k, hence relevant people in Cc.
>
> Yeah, almost certainly an ath9k bug, but a regression is a regression,
> so if people can't find the issue in ath9k, we'll have to revert those
> commits.
>
> Honestly, I personally think they were a bit draconian to begin with,
> and didn't limit their effects sufficiently.
>
> I'm assuming that the ath9k issue is that it gives DMA mapping a big
> enough area to handle any possible packet size, and just expects -
> quite reasonably - smaller packets to only fill the part they need.
>
> Which that "info leak" patch obviously breaks entirely.


Except that's the exact case which the new patch is addressing - by 
copying the whole original area into the SWIOTLB bounce buffer to begin 
with, if we bounce the whole lot back after the device has only updated 
part of it, the non-updated parts now get overwritten with the same 
original contents, rather than whatever random crap happened to be left 
in the SWIOTLB buffer by its previous user. I'm extremely puzzled how 
any driver could somehow be dependent on non-device-written data getting 
replaced with random crap, given that it wouldn't happen with a real 
IOMMU, or if SWIOTLB just didn't need to bounce, and the data would 
hardly be deterministic either.


I think I can see how aa6f8dcbab47 might increase the severity of a 
driver bug where it calls dma_sync_*_for_device() on part of a 
DMA_FROM_DEVICE mapping that the device *has* written to, without having 
called a corresponding dma_sync_*_for_cpu() first - previously that 
would have had no effect, but now SWIOTLB will effectively behave more 
like an eagerly-prefetching non-coherent cache and write back old data 
over new - but if ddbd89deb7d3 alone makes a difference then something 
really weird must be going on.
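
(For reference, the sync discipline for a streaming DMA_FROM_DEVICE buffer is
the following - a generic DMA API sketch, not ath9k code:)

```c
	dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* Device may DMA into buf; the CPU must not touch it yet. */

	dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);
	/* CPU now owns buf and may read it. */

	dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);
	/* Ownership returns to the device; after aa6f8dcbab47, SWIOTLB may
	 * re-copy the original buffer into the bounce slot here, so any
	 * device-written data not consumed above is effectively undone. */
```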


Has anyone run a sanity check with CONFIG_DMA_API_DEBUG enabled to see 
if that flags anything up?


Robin.


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Alex Williamson
On Fri, 18 Mar 2022 14:27:33 -0300
Jason Gunthorpe  wrote:

> +static int conv_iommu_prot(u32 map_flags)
> +{
> + int iommu_prot;
> +
> + /*
> +  * We provide no manual cache coherency ioctls to userspace and most
> +  * architectures make the CPU ops for cache flushing privileged.
> +  * Therefore we require the underlying IOMMU to support CPU coherent
> +  * operation.
> +  */
> + iommu_prot = IOMMU_CACHE;

Where is this requirement enforced?  AIUI we'd need to test
IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
intel_iommu_map() simply drop the flag when not supported by HW.

This also seems like an issue relative to vfio compatibility that I
don't see mentioned in that patch.  Thanks,

Alex

> + if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> + iommu_prot |= IOMMU_WRITE;
> + if (map_flags & IOMMU_IOAS_MAP_READABLE)
> + iommu_prot |= IOMMU_READ;
> + return iommu_prot;
> +}




Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-23 Thread Linus Torvalds
On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy  wrote:
>
> On 2022-03-23 17:27, Linus Torvalds wrote:
> >
> > I'm assuming that the ath9k issue is that it gives DMA mapping a big
> > enough area to handle any possible packet size, and just expects -
> > quite reasonably - smaller packets to only fill the part they need.
> >
> > Which that "info leak" patch obviously breaks entirely.
>
> Except that's the exact case which the new patch is addressing

Not "addressing". Breaking.

Which is why it will almost certainly get reverted.

Not doing DMA to the whole area seems to be quite the sane thing to do
for things like network packets, and overwriting the part that didn't
get DMA'd with zeroes seems to be exactly the wrong thing here.

So the SG_IO - and other random untrusted block command sources - data
leak will almost certainly have to be addressed differently. Possibly
by simply allocating the area with GFP_ZERO to begin with.

  Linus


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Jason Gunthorpe via iommu
On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> On Fri, 18 Mar 2022 14:27:33 -0300
> Jason Gunthorpe  wrote:
> 
> > +static int conv_iommu_prot(u32 map_flags)
> > +{
> > +   int iommu_prot;
> > +
> > +   /*
> > +* We provide no manual cache coherency ioctls to userspace and most
> > +* architectures make the CPU ops for cache flushing privileged.
> > +* Therefore we require the underlying IOMMU to support CPU coherent
> > +* operation.
> > +*/
> > +   iommu_prot = IOMMU_CACHE;
> 
> Where is this requirement enforced?  AIUI we'd need to test
> IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> intel_iommu_map() simply drop the flag when not supported by HW.

You are right, the correct thing to do is to fail device
binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
however we can't do that because Intel abuses the meaning of
IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
supported.

I want Intel to split out their special no-snoop from IOMMU_CACHE and
IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
all iommu drivers. Once this is done vfio and iommufd should both
always set IOMMU_CACHE and refuse to work without
IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
arch that does in fact work today with vfio, somehow, but I don't..)

I added a fixme about this.
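
(The enforcement itself would be a one-liner at bind/attach time - a sketch
using the bus-based iommu_capable() interface as it exists today:)

```c
	/* Sketch: refuse the device outright if coherent DMA cannot be
	 * guaranteed, instead of silently dropping IOMMU_CACHE later. */
	if (!iommu_capable(idev->dev->bus, IOMMU_CAP_CACHE_COHERENCY))
		return -EINVAL;
```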

> This also seems like an issue relative to vfio compatibility that I
> don't see mentioned in that patch.  Thanks,

Yes, it was missed in the notes for vfio compat that Intel no-snoop is
not working currently, I fixed it.

Thanks,
Jason


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Alex Williamson
On Wed, 23 Mar 2022 16:34:39 -0300
Jason Gunthorpe  wrote:

> On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > On Fri, 18 Mar 2022 14:27:33 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > +static int conv_iommu_prot(u32 map_flags)
> > > +{
> > > + int iommu_prot;
> > > +
> > > + /*
> > > +  * We provide no manual cache coherency ioctls to userspace and most
> > > +  * architectures make the CPU ops for cache flushing privileged.
> > > +  * Therefore we require the underlying IOMMU to support CPU coherent
> > > +  * operation.
> > > +  */
> > > + iommu_prot = IOMMU_CACHE;  
> > 
> > Where is this requirement enforced?  AIUI we'd need to test
> > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > intel_iommu_map() simply drop the flag when not supported by HW.  
> 
> You are right, the correct thing to do is to fail device
> binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> however we can't do that because Intel abuses the meaning of
> IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> supported.
> 
> I want Intel to split out their special no-snoop from IOMMU_CACHE and
> IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
> all iommu drivers. Once this is done vfio and iommufd should both
> always set IOMMU_CACHE and refuse to work without
> IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> arch that does in fact work today with vfio, somehow, but I don't..)

IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
lack of snoop-control support, causing us to have mixed coherent and
non-coherent domains.  I don't recall if you go back far enough in VT-d
history if the primary IOMMU might have lacked this support.  So I
think there are systems we care about with IOMMUs that can't enforce
DMA coherency.

As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
are coherent regardless of mapping protection flags?  What's the point
of IOMMU_CACHE at that point?

> I added a fixme about this.
> 
> > This also seems like an issue relative to vfio compatibility that I
> > don't see mentioned in that patch.  Thanks,  
> 
> Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> not working currently, I fixed it.

Right, I see it in the comments relative to extensions, but missed in
the commit log.  Thanks,

Alex



[PATCH v2 0/2] Fix coherence for VMbus and PCI pass-thru devices in Hyper-V VM

2022-03-23 Thread Michael Kelley via iommu
Hyper-V VMs have VMbus synthetic devices and PCI pass-thru devices that are
added dynamically via the VMbus protocol and are not represented in the ACPI
DSDT. Only the top level VMbus node exists in the DSDT. As such, on ARM64
these devices don't pick up coherence information and default to not hardware
coherent. This results in extra software coherence management overhead since
the synthetic devices are always hardware coherent. PCI pass-thru devices are
also hardware coherent in all current usage scenarios.

Fix this by propagating coherence information from the top level VMbus node
in the DSDT to all VMbus synthetic devices and PCI pass-thru devices. While
smaller granularity of control would be better, basing it on the VMbus node
in the DSDT gives an escape path if a future scenario arises with devices
that are not hardware coherent.

Robin Murphy -- I'm not ignoring your feedback about pci_dma_configure(), but
I wanted to try this alternate approach where pci_dma_configure() works as is.
If reviewers prefer modifying pci_dma_configure() to handle the Hyper-V
specifics, I can go back to that.

Changes since v1:
* Use device_get_dma_attr() instead of acpi_get_dma_attr(), eliminating the
  need to export acpi_get_dma_attr() [Robin Murphy]
* Use arch_setup_dma_ops() to set device coherence [Robin Murphy]
* Move handling of missing _CCA to vmbus_acpi_add() so it is only done once
* Rework handling of PCI devices so existing code in pci_dma_configure()
  just works

Michael Kelley (2):
  Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device
  PCI: hv: Propagate coherence from VMbus device to PCI device

 drivers/hv/hv_common.c  | 11 +++
 drivers/hv/vmbus_drv.c  | 23 +++
 drivers/pci/controller/pci-hyperv.c |  9 +
 include/asm-generic/mshyperv.h  |  1 +
 4 files changed, 44 insertions(+)

-- 
1.8.3.1



[PATCH v2 1/2] Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device

2022-03-23 Thread Michael Kelley via iommu
VMbus synthetic devices are not represented in the ACPI DSDT -- only
the top level VMbus device is represented. As a result, on ARM64
coherence information in the _CCA method is not specified for
synthetic devices, so they default to not hardware coherent.
Drivers for some of these synthetic devices have been recently
updated to use the standard DMA APIs, and they are incurring extra
overhead of unneeded software coherence management.

Fix this by propagating coherence information from the VMbus node
in ACPI to the individual synthetic devices. There's no effect on
x86/x64 where devices are always hardware coherent.

Signed-off-by: Michael Kelley 
---
 drivers/hv/hv_common.c | 11 +++
 drivers/hv/vmbus_drv.c | 23 +++
 include/asm-generic/mshyperv.h |  1 +
 3 files changed, 35 insertions(+)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 181d16b..820e814 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -216,6 +217,16 @@ bool hv_query_ext_cap(u64 cap_query)
 }
 EXPORT_SYMBOL_GPL(hv_query_ext_cap);
 
+void hv_setup_dma_ops(struct device *dev, bool coherent)
+{
+   /*
+* Hyper-V does not offer a vIOMMU in the guest
+* VM, so pass 0/NULL for the IOMMU settings
+*/
+   arch_setup_dma_ops(dev, 0, 0, NULL, coherent);
+}
+EXPORT_SYMBOL_GPL(hv_setup_dma_ops);
+
 bool hv_is_hibernation_supported(void)
 {
return !hv_root_partition && acpi_sleep_state_supported(ACPI_STATE_S4);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 12a2b37..2d2c54c 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -905,6 +905,14 @@ static int vmbus_probe(struct device *child_device)
struct hv_device *dev = device_to_hv_device(child_device);
const struct hv_vmbus_device_id *dev_id;
 
+   /*
+* On ARM64, propagate the DMA coherence setting from the top level
+* VMbus ACPI device to the child VMbus device being added here.
+* On x86/x64 coherence is assumed and these calls have no effect.
+*/
+   hv_setup_dma_ops(child_device,
+   device_get_dma_attr(&hv_acpi_dev->dev) == DEV_DMA_COHERENT);
+
dev_id = hv_vmbus_get_id(drv, dev);
if (drv->probe) {
ret = drv->probe(dev, dev_id);
@@ -2428,6 +2436,21 @@ static int vmbus_acpi_add(struct acpi_device *device)
 
hv_acpi_dev = device;
 
+   /*
+* Older versions of Hyper-V for ARM64 fail to include the _CCA
+* method on the top level VMbus device in the DSDT. But devices
+* are hardware coherent in all current Hyper-V use cases, so fix
+* up the ACPI device to behave as if _CCA is present and indicates
+* hardware coherence.
+*/
+   ACPI_COMPANION_SET(&device->dev, device);
+   if (IS_ENABLED(CONFIG_ACPI_CCA_REQUIRED) &&
+   device_get_dma_attr(&device->dev) == DEV_DMA_NOT_SUPPORTED) {
+   pr_info("No ACPI _CCA found; assuming coherent device I/O\n");
+   device->flags.cca_seen = true;
+   device->flags.coherent_dma = true;
+   }
+
result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
vmbus_walk_resources, NULL);
 
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index c08758b..c05d2ce 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -269,6 +269,7 @@ static inline int cpumask_to_vpset_noself(struct hv_vpset *vpset,
 u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size);
 void hyperv_cleanup(void);
 bool hv_query_ext_cap(u64 cap_query);
+void hv_setup_dma_ops(struct device *dev, bool coherent);
 void *hv_map_memory(void *addr, unsigned long size);
 void hv_unmap_memory(void *addr);
 #else /* CONFIG_HYPERV */
-- 
1.8.3.1



[PATCH v2 2/2] PCI: hv: Propagate coherence from VMbus device to PCI device

2022-03-23 Thread Michael Kelley via iommu
PCI pass-thru devices in a Hyper-V VM are represented as a VMBus
device and as a PCI device.  The coherence of the VMbus device is
set based on the VMbus node in ACPI, but the PCI device has no
ACPI node and defaults to not hardware coherent.  This results
in extra software coherence management overhead on ARM64 when
devices are hardware coherent.

Fix this by setting up the PCI host bus so that normal
PCI mechanisms will propagate the coherence of the VMbus
device to the PCI device. There's no effect on x86/x64 where
devices are always hardware coherent.

Signed-off-by: Michael Kelley 
---
 drivers/pci/controller/pci-hyperv.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index ae0bc2f..88b3b56 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3404,6 +3404,15 @@ static int hv_pci_probe(struct hv_device *hdev,
hbus->bridge->domain_nr = dom;
 #ifdef CONFIG_X86
hbus->sysdata.domain = dom;
+#elif defined(CONFIG_ARM64)
+   /*
+* Set the PCI bus parent to be the corresponding VMbus
+* device. Then the VMbus device will be assigned as the
+* ACPI companion in pcibios_root_bridge_prepare() and
+* pci_dma_configure() will propagate device coherence
+* information to devices created on the bus.
+*/
+   hbus->sysdata.parent = hdev->device.parent;
 #endif
 
hbus->hdev = hdev;
-- 
1.8.3.1



Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Jason Gunthorpe via iommu
On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 16:34:39 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > > On Fri, 18 Mar 2022 14:27:33 -0300
> > > Jason Gunthorpe  wrote:
> > >   
> > > > +static int conv_iommu_prot(u32 map_flags)
> > > > +{
> > > > +   int iommu_prot;
> > > > +
> > > > +   /*
> > > > +* We provide no manual cache coherency ioctls to userspace and 
> > > > most
> > > > +* architectures make the CPU ops for cache flushing privileged.
> > > > +* Therefore we require the underlying IOMMU to support CPU 
> > > > coherent
> > > > +* operation.
> > > > +*/
> > > > +   iommu_prot = IOMMU_CACHE;  
> > > 
> > > Where is this requirement enforced?  AIUI we'd need to test
> > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > > intel_iommu_map() simply drop the flag when not supported by HW.  
> > 
> > You are right, the correct thing to do is to fail device
> > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> > however we can't do that because Intel abuses the meaning of
> > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> > supported.
> > 
> > I want Intel to split out their special no-snoop from IOMMU_CACHE and
> > IOMMU_CAP_CACHE_COHERENCY so these things have a consisent meaning in
> > all iommu drivers. Once this is done vfio and iommufd should both
> > always set IOMMU_CACHE and refuse to work without
> > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> > arch that does in fact work today with vfio, somehow, but I don't..)
> 
> IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
> lack of snoop-control support, causing us to have mixed coherent and
> non-coherent domains.  I don't recall if you go back far enough in VT-d
> history if the primary IOMMU might have lacked this support.  So I
> think there are systems we care about with IOMMUs that can't enforce
> DMA coherency.
> 
> As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
> mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
> suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
> are coherent regardless of mapping protection flags?  What's the point
> of IOMMU_CACHE at that point?

IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's
change.

It only means normal DMAs issued in a normal way are coherent with the
CPU and do not require special cache flushing instructions, i.e. DMA
issued by a kernel driver using the DMA API.
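
For illustration, "issued in a normal way" means the usual streaming
DMA API pattern, as in the sketch below (not from any real driver; on a
dev_is_dma_coherent() platform the map/unmap imply no cache
maintenance, on a non-coherent platform they do the flush/invalidate):

#include <linux/dma-mapping.h>

static int sketch_normal_dma(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle;

	handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, handle))
		return -ENOMEM;

	/* ... program the device to write up to len bytes at 'handle' ... */

	dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
	return 0;
}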

It does not mean that non-coherent DMA does not exist, or that
platform or device specific ways to trigger non-coherence are blocked.

Stated another way, any platform that wires dev_is_dma_coherent() to
true, like all x86 does, must support IOMMU_CACHE and report
IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
supports. The platform obviously declares it supports this in order to
support the in-kernel DMA API.

Thus, a new cap indicating that 'all dma is coherent' or 'no-snoop
blocking' should be created to cover Intel's special need. From what I
know it is only implemented in the Intel driver and apparently only
for some IOMMUs connected to IGD.

> > Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> > not working currently, I fixed it.
> 
> Right, I see it in the comments relative to extensions, but missed in
> the commit log.  Thanks,

Oh good, I remembered it was someplace..

Jason


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-23 Thread Robin Murphy

On 2022-03-23 19:16, Linus Torvalds wrote:
> On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy  wrote:
> >
> > On 2022-03-23 17:27, Linus Torvalds wrote:
> > >
> > > I'm assuming that the ath9k issue is that it gives DMA mapping a big
> > > enough area to handle any possible packet size, and just expects -
> > > quite reasonably - smaller packets to only fill the part they need.
> > >
> > > Which that "info leak" patch obviously breaks entirely.
> >
> > Except that's the exact case which the new patch is addressing
> 
> Not "addressing". Breaking.
> 
> Which is why it will almost certainly get reverted.
> 
> Not doing DMA to the whole area seems to be quite the sane thing to do
> for things like network packets, and overwriting the part that didn't
> get DMA'd with zeroes seems to be exactly the wrong thing here.
> 
> So the SG_IO - and other random untrusted block command sources - data
> leak will almost certainly have to be addressed differently. Possibly
> by simply allocating the area with GFP_ZERO to begin with.


Er, the point of the block layer case is that the whole area *is* zeroed to 
begin with, and a latent memory corruption problem in SWIOTLB itself 
replaces those zeros with random other kernel data unexpectedly. Let me 
try illustrating some sequences for clarity...


Expected behaviour/without SWIOTLB:
 Memory
---
start12345678
dma_map(DMA_FROM_DEVICE)  no-op
device writes partial data   12ABC678 <- ABC
dma_unmap(DMA_FROM_DEVICE)   12ABC678


SWIOTLB previously:
 Memory  Bounce buffer
---
start12345678
dma_map(DMA_FROM_DEVICE) no-op
device writes partial data   12345678xxABCxxx <- ABC
dma_unmap(DMA_FROM_DEVICE)   xxABCxxx <- xxABCxxx


SWIOTLB Now:
 Memory  Bounce buffer
---
start12345678
dma_map(DMA_FROM_DEVICE) 12345678 -> 12345678
device writes partial data   1234567812ABC678 <- ABC
dma_unmap(DMA_FROM_DEVICE)   12ABC678 <- 12ABC678


Now, sure we can prevent any actual information leakage by initialising 
the bounce buffer slot with zeros, but then we're just corrupting the 
not-written-to parts of the mapping with zeros instead of anyone else's 
old data. That's still fundamentally not OK. The only thing SWIOTLB can 
do to be correct is treat DMA_FROM_DEVICE as a read-modify-write of the 
entire mapping, because it has no way to know how much of it is actually 
going to be modified.
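
In code terms that read-modify-write amounts to bouncing the whole
original buffer into the bounce slot at map time. A minimal sketch of
the idea (swiotlb_alloc_slot() is a made-up placeholder for the slot
allocation done in swiotlb_tbl_map_single(); swiotlb_bounce() is the
real helper):

static phys_addr_t sketch_map_from_device(struct device *dev,
					  phys_addr_t orig_addr, size_t size)
{
	phys_addr_t tlb_addr = swiotlb_alloc_slot(dev, orig_addr, size);

	/*
	 * Copy the entire original buffer into the bounce slot even for
	 * DMA_FROM_DEVICE, so that bytes the device never writes are
	 * copied back unchanged at unmap time instead of being replaced
	 * with stale bounce-buffer contents.
	 */
	swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
	return tlb_addr;
}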


I'll admit I still never quite grasped the reason for also adding the 
override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I 
think by that point we were increasingly tired and confused and starting 
to second-guess ourselves (well, I was, at least). I don't think it's 
wrong per se, but as I said I do think it can bite anyone who's been 
doing dma_sync_*() wrong but getting away with it until now. If 
ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a 
partial revert of just that one hunk.


Thanks,
Robin.


Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows

2022-03-23 Thread dann frazier
On Wed, Mar 23, 2022 at 09:49:04AM +, Marc Zyngier wrote:
> On Tue, 22 Mar 2022 17:27:36 +,
> Robin Murphy  wrote:
> > 
> > Originally, creating the dma_ranges resource list in pre-sorted fashion
> > was the simplest and most efficient way to enforce the order required by
> > iova_reserve_pci_windows(). However since then at least one PCI host
> > driver is now re-sorting the list for its own probe-time processing,
> > which doesn't seem entirely unreasonable, so that basic assumption no
> > longer holds. Make iommu-dma robust and get the sort order it needs by
> > explicitly sorting, which means we can also save the effort at creation
> > time and just build the list in whatever natural order the DT had.
> > 
> > Signed-off-by: Robin Murphy 
> > ---
> > 
> > Looking at this area off the back of the XGene thread[1] made me realise
> > that we need to do it anyway, regardless of whether it might also happen
> > to restore the previous XGene behaviour or not. Presumably nobody's
> > tried to use pcie-cadence-host behind an IOMMU yet...
> 
> This definitely restores PCIe functionality on my Mustang (XGene-1).
> Hopefully dann can comment on whether this addresses his own issue, as
> his firmware is significantly different.

Robin, Marc,

Adding just this patch on top of v5.17 (w/ vendor dtb) isn't enough to
fix m400 networking:

  https://paste.ubuntu.com/p/H5ZNbRvP8V/

  -dann


Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-03-23 Thread Alex Williamson
On Fri, 18 Mar 2022 14:27:36 -0300
Jason Gunthorpe  wrote:

> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
> mapping them into io_pagetable operations. Doing so allows the use of
> iommufd by symliking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
> SET_CONTAINER using a iommufd instead of a container fd is a followup
> series.
> 
> Internally the compatibility API uses a normal IOAS object that, like
> vfio, is automatically allocated when the first device is
> attached.
> 
> Userspace can also query or set this IOAS object directly using the
> IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
> features while still using the VFIO style map/unmap ioctls.
> 
> While this is enough to operate qemu, it is still a bit of a WIP with a
> few gaps to be resolved:
> 
>  - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
>split areas. The old mode can be implemented with a new operation to
>split an iopt_area into two without disturbing the iopt_pages or the
>domains, then unmapping a whole area as normal.
> 
>  - Resource limits rely on memory cgroups to bound what userspace can do
>instead of the module parameter dma_entry_limit.
> 
>  - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
>require some additional work to properly expose PFN lifecycle between
>VFIO and iommfd
> 
>  - Various components of the mdev API are not completed yet
>
>  - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
>implemented.
> 
>  - The 'dirty tracking' is not implemented
> 
>  - A full audit for pedantic compatibility details (eg errnos, etc) has
>not yet been done
> 
>  - powerpc SPAPR is left out, as it is not connected to the iommu_domain
>framework. My hope is that SPAPR will be moved into the iommu_domain
>framework as a special HW specific type and would expect power to
>support the generic interface through a normal iommu_domain.

My overall question here would be whether we can actually achieve a
compatibility interface that has sufficient feature transparency that we
can dump vfio code in favor of this interface, or will there be enough
niche use cases that we need to keep type1 and vfio containers around
through a deprecation process?

The locked memory differences for one seem like something that libvirt
wouldn't want hidden and we have questions regarding support for vaddr
hijacking and different ideas how to implement dirty page tracking, not
to mention the missing features that are currently well used, like p2p
mappings, coherency tracking, mdev, etc.

It seems like quite an endeavor to fill all these gaps, while at the
same time QEMU will be working to move to use iommufd directly in order
to gain all the new features.

Where do we focus attention?  Is symlinking device files our proposal
to userspace and is that something achievable, or do we want to use
this compatibility interface as a means to test the interface and
allow userspace to make use of it for transition, if their use cases
allow it, perhaps eventually performing the symlink after deprecation
and eventual removal of the vfio container and type1 code?  Thanks,

Alex



Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Jason Gunthorpe via iommu
On Wed, Mar 23, 2022 at 05:34:18PM -0300, Jason Gunthorpe wrote:

> Stated another way, any platform that wires dev_is_dma_coherent() to
> true, like all x86 does, must support IOMMU_CACHE and report
> IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> supports. The platform obviously declares it supports this in order to
> support the in-kernel DMA API.

That gives me a nice simple idea:

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 3c6b95ad026829..8366884df4a030 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "iommufd_private.h"
 
@@ -61,6 +62,10 @@ struct iommufd_device *iommufd_bind_pci_device(int fd, 
struct pci_dev *pdev,
struct iommu_group *group;
int rc;
 
+   /* iommufd always uses IOMMU_CACHE */
+   if (!dev_is_dma_coherent(&pdev->dev))
+   return ERR_PTR(-EINVAL);
+
ictx = iommufd_fget(fd);
if (!ictx)
return ERR_PTR(-EINVAL);
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84bbc..3d6df1ffbf93e6 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -129,7 +129,8 @@ static int conv_iommu_prot(u32 map_flags)
 * We provide no manual cache coherency ioctls to userspace and most
 * architectures make the CPU ops for cache flushing privileged.
 * Therefore we require the underlying IOMMU to support CPU coherent
-* operation.
+* operation. Support for IOMMU_CACHE is enforced by the
+* dev_is_dma_coherent() test during bind.
 */
iommu_prot = IOMMU_CACHE;
if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)

Looking at it I would say all the places that test
IOMMU_CAP_CACHE_COHERENCY can be replaced with dev_is_dma_coherent()
except for the one call in VFIO that is supporting the Intel no-snoop
behavior.

Then we can rename IOMMU_CAP_CACHE_COHERENCY to something like
IOMMU_CAP_ENFORCE_CACHE_COHERENCY and just create a
IOMMU_ENFORCE_CACHE prot flag for Intel IOMMU to use instead of
abusing IOMMU_CACHE.
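
Roughly, a mapping path under that split might look like the sketch
below. IOMMU_CAP_ENFORCE_CACHE_COHERENCY and IOMMU_ENFORCE_CACHE are
the proposed names from above and do not exist yet; the rest is the
current iommu API:

static int sketch_map_enforced_coherent(struct iommu_domain *domain,
					struct bus_type *bus,
					unsigned long iova,
					phys_addr_t paddr, size_t size)
{
	int prot = IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE;

	/*
	 * Only ask for no-snoop blocking where the IOMMU can actually
	 * enforce it, instead of overloading IOMMU_CACHE for that.
	 */
	if (iommu_capable(bus, IOMMU_CAP_ENFORCE_CACHE_COHERENCY))
		prot |= IOMMU_ENFORCE_CACHE;

	return iommu_map(domain, iova, paddr, size, prot);
}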

Jason


Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-03-23 Thread Jason Gunthorpe via iommu
On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:

> My overall question here would be whether we can actually achieve a
> compatibility interface that has sufficient feature transparency that we
> can dump vfio code in favor of this interface, or will there be enough
> niche use cases that we need to keep type1 and vfio containers around
> through a deprecation process?

Other than SPAPR, I think we can.

> The locked memory differences for one seem like something that
> libvirt wouldn't want hidden

I'm first interested in understanding how this change becomes
a real problem in practice that requires libvirt to do something
different for vfio or iommufd. We can discuss that in the other thread.

If this is the make or break point then I think we can deal with it
either by going back to what vfio does now or perhaps some other
friendly compat approach..

> and we have questions regarding support for vaddr hijacking

I'm not sure what vaddr hijacking is? Do you mean
VFIO_DMA_MAP_FLAG_VADDR? There is a comment that outlines my plan to
implement it in a functionally compatible way without the deadlock
problem. I estimate this as a small project.

> and different ideas how to implement dirty page tracking, 

I don't think this is compatibility. No kernel today triggers qemu to
use this feature as no kernel supports live migration. No existing
qemu will trigger this feature with new kernels that support live
migration v2. Therefore we can adjust qemu's dirty tracking at the
same time we enable migration v2 in qemu.

With Joao's work we are close to having a solid RFC to come with
something that can be fully implemented.

Hopefully we can agree to this soon enough that qemu can come with a
full package of migration v2 support including the dirty tracking
solution.

> not to mention the missing features that are currently well used,
> like p2p mappings, coherency tracking, mdev, etc.

I consider these all mandatory things, they won't be left out.

The reason they are not in the RFC is mostly because supporting them
requires work outside just this iommufd area, and I'd like this series
to remain self-contained.

I've already got a draft to add DMABUF support to VFIO PCI, which
nicely solves the follow_pfn security problem; we want to do this for
another reason already. I'm waiting for some testing feedback before
posting it. I need some help from Daniel to implement the DMABUF revoke
semantics he and I have been talking about. In the worst case we can
copy the follow_pfn approach.

Intel no-snoop is simple enough, just needs some Intel cleanup parts.

mdev will come along with the final VFIO integration, all the really
hard parts are done already. The VFIO integration is a medium sized
task overall.

So, I'm not ready to give up yet :)

> Where do we focus attention?  Is symlinking device files our proposal
> to userspace and is that something achievable, or do we want to use
> this compatibility interface as a means to test the interface and
> allow userspace to make use of it for transition, if their use cases
> allow it, perhaps eventually performing the symlink after deprecation
> and eventual removal of the vfio container and type1 code?  Thanks,

symlinking device files is definitely just a suggested way to expedite
testing.

Things like qemu that are learning to use iommufd-only features should
learn to directly open iommufd instead of vfio container to activate
those features.

Looking long down the road I don't think we want to have type 1 and
iommufd code forever. So, I would like to make an option to compile
out vfio container support entirely and have that option arrange for
iommufd to provide the container device node itself.

I think we can get there pretty quickly, or at least I haven't got
anything that is scaring me a lot (beyond SPAPR of course)

For the dpdk/etcs of the world I think we are already there.

Jason


Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows

2022-03-23 Thread Rob Herring
On Wed, Mar 23, 2022 at 5:15 PM dann frazier  wrote:
>
> On Wed, Mar 23, 2022 at 09:49:04AM +, Marc Zyngier wrote:
> > On Tue, 22 Mar 2022 17:27:36 +,
> > Robin Murphy  wrote:
> > >
> > > Originally, creating the dma_ranges resource list in pre-sorted fashion
> > > was the simplest and most efficient way to enforce the order required by
> > > iova_reserve_pci_windows(). However since then at least one PCI host
> > > driver is now re-sorting the list for its own probe-time processing,
> > > which doesn't seem entirely unreasonable, so that basic assumption no
> > > longer holds. Make iommu-dma robust and get the sort order it needs by
> > > explicitly sorting, which means we can also save the effort at creation
> > > time and just build the list in whatever natural order the DT had.
> > >
> > > Signed-off-by: Robin Murphy 
> > > ---
> > >
> > > Looking at this area off the back of the XGene thread[1] made me realise
> > > that we need to do it anyway, regardless of whether it might also happen
> > > to restore the previous XGene behaviour or not. Presumably nobody's
> > > tried to use pcie-cadence-host behind an IOMMU yet...
> >
> > This definitely restores PCIe functionality on my Mustang (XGene-1).
> > Hopefully dann can comment on whether this addresses his own issue, as
> > his firmware is significantly different.
>
> Robin, Marc,
>
> Adding just this patch on top of v5.17 (w/ vendor dtb) isn't enough to
> fix m400 networking:

I wouldn't expect it to given both the IB register selection changed
and the 2nd dma-ranges entry is ignored.

Can you (and others) try out this branch:

git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git xgene-pci-fix

It should maintain the same IB register usage for both cases and
handle the error in 'dma-ranges'.

Rob


Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows

2022-03-23 Thread Rob Herring
On Tue, Mar 22, 2022 at 12:27 PM Robin Murphy  wrote:
>
> Originally, creating the dma_ranges resource list in pre-sorted fashion
> was the simplest and most efficient way to enforce the order required by
> iova_reserve_pci_windows(). However since then at least one PCI host
> driver is now re-sorting the list for its own probe-time processing,
> which doesn't seem entirely unreasonable, so that basic assumption no
> longer holds. Make iommu-dma robust and get the sort order it needs by
> explicitly sorting, which means we can also save the effort at creation
> time and just build the list in whatever natural order the DT had.
>
> Signed-off-by: Robin Murphy 
> ---
>
> Looking at this area off the back of the XGene thread[1] made me realise
> that we need to do it anyway, regardless of whether it might also happen
> to restore the previous XGene behaviour or not. Presumably nobody's
> tried to use pcie-cadence-host behind an IOMMU yet...
>
> Boot-tested on Juno to make sure I hadn't got the sort comparison
> backwards.
>
> Robin.
>
> [1] https://lore.kernel.org/linux-pci/20220321104843.949645-1-...@kernel.org/
>
>  drivers/iommu/dma-iommu.c | 13 -
>  drivers/pci/of.c  |  7 +--
>  2 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index b22034975301..91d134c0c9b1 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -414,6 +415,15 @@ static int cookie_init_hw_msi_region(struct 
> iommu_dma_cookie *cookie,
> return 0;
>  }
>
> +static int iommu_dma_ranges_sort(void *priv, const struct list_head *a,
> +   const struct list_head *b)
> +{
> +   struct resource_entry *res_a = list_entry(a, typeof(*res_a), node);
> +   struct resource_entry *res_b = list_entry(b, typeof(*res_b), node);
> +
> +   return res_a->res->start > res_b->res->start;
> +}
> +
>  static int iova_reserve_pci_windows(struct pci_dev *dev,
> struct iova_domain *iovad)
>  {
> @@ -432,6 +442,7 @@ static int iova_reserve_pci_windows(struct pci_dev *dev,
> }
>
> /* Get reserved DMA windows from host bridge */
> +   list_sort(NULL, &bridge->dma_ranges, iommu_dma_ranges_sort);
> resource_list_for_each_entry(window, &bridge->dma_ranges) {
> end = window->res->start - window->offset;
>  resv_iova:
> @@ -440,7 +451,7 @@ static int iova_reserve_pci_windows(struct pci_dev *dev,
> hi = iova_pfn(iovad, end);
> reserve_iova(iovad, lo, hi);
> } else if (end < start) {
> -   /* dma_ranges list should be sorted */
> +   /* DMA ranges should be non-overlapping */
> dev_err(&dev->dev,
> "Failed to reserve IOVA [%pa-%pa]\n",
> &start, &end);
> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> index cb2e8351c2cc..d176b4bc6193 100644
> --- a/drivers/pci/of.c
> +++ b/drivers/pci/of.c
> @@ -393,12 +393,7 @@ static int devm_of_pci_get_host_bridge_resources(struct 
> device *dev,
> goto failed;
> }
>
> -   /* Keep the resource list sorted */
> -   resource_list_for_each_entry(entry, ib_resources)
> -   if (entry->res->start > res->start)
> -   break;
> -
> -   pci_add_resource_offset(&entry->node, res,

entry is now unused and causes a warning.

> +   pci_add_resource_offset(ib_resources, res,
> res->start - range.pci_addr);
> }
>
> --
> 2.28.0.dirty
>
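
As an aside on the comparator in the hunk above: list_sort() only
inspects the sign of the comparator's return value, where greater than
zero means "a" sorts after "b", so returning that boolean gives an
ascending sort. An equivalent explicit three-way form, purely as a
sketch:

static int cmp_res_start(void *priv, const struct list_head *a,
			 const struct list_head *b)
{
	struct resource_entry *res_a = list_entry(a, typeof(*res_a), node);
	struct resource_entry *res_b = list_entry(b, typeof(*res_b), node);

	/* -1, 0 or 1: sort by ascending resource start address */
	if (res_a->res->start < res_b->res->start)
		return -1;
	return res_a->res->start > res_b->res->start;
}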


Re: [PATCH v2 2/2] PCI: hv: Propagate coherence from VMbus device to PCI device

2022-03-23 Thread Boqun Feng
On Wed, Mar 23, 2022 at 01:31:12PM -0700, Michael Kelley wrote:
> PCI pass-thru devices in a Hyper-V VM are represented as a VMBus
> device and as a PCI device.  The coherence of the VMbus device is
> set based on the VMbus node in ACPI, but the PCI device has no
> ACPI node and defaults to not hardware coherent.  This results
> in extra software coherence management overhead on ARM64 when
> devices are hardware coherent.
> 
> Fix this by setting up the PCI host bus so that normal
> PCI mechanisms will propagate the coherence of the VMbus
> device to the PCI device. There's no effect on x86/x64 where
> devices are always hardware coherent.
> 
> Signed-off-by: Michael Kelley 

Acked-by: Boqun Feng 

Regards,
Boqun

> ---
>  drivers/pci/controller/pci-hyperv.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c 
> b/drivers/pci/controller/pci-hyperv.c
> index ae0bc2f..88b3b56 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3404,6 +3404,15 @@ static int hv_pci_probe(struct hv_device *hdev,
>   hbus->bridge->domain_nr = dom;
>  #ifdef CONFIG_X86
>   hbus->sysdata.domain = dom;
> +#elif defined(CONFIG_ARM64)
> + /*
> +  * Set the PCI bus parent to be the corresponding VMbus
> +  * device. Then the VMbus device will be assigned as the
> +  * ACPI companion in pcibios_root_bridge_prepare() and
> +  * pci_dma_configure() will propagate device coherence
> +  * information to devices created on the bus.
> +  */
> + hbus->sysdata.parent = hdev->device.parent;
>  #endif
>  
>   hbus->hdev = hdev;
> -- 
> 1.8.3.1
> 
> 


RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, March 23, 2022 12:15 AM
> 
> On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> 
> > I'm still picking my way through the series, but the later compat
> > interface doesn't mention this difference as an outstanding issue.
> > Doesn't this difference need to be accounted in how libvirt manages VM
> > resource limits?
> 
> AFAICT, no, but it should be checked.
> 
> > AIUI libvirt uses some form of prlimit(2) to set process locked
> > memory limits.
> 
> Yes, and ulimit does work fully. prlimit adjusts the value:
> 
> int do_prlimit(struct task_struct *tsk, unsigned int resource,
>   struct rlimit *new_rlim, struct rlimit *old_rlim)
> {
>   rlim = tsk->signal->rlim + resource;
> [..]
>   if (new_rlim)
>   *rlim = *new_rlim;
> 
> Which vfio reads back here:
> 
> drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 
> And iommufd does the same read back:
> 
>   lock_limit =
>   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> PAGE_SHIFT;
>   npages = pages->npinned - pages->last_npinned;
>   do {
>   cur_pages = atomic_long_read(&pages->source_user-
> >locked_vm);
>   new_pages = cur_pages + npages;
>   if (new_pages > lock_limit)
>   return -ENOMEM;
>   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> cur_pages,
>new_pages) != cur_pages);
> 
> So it does work essentially the same.
> 
> The difference is more subtle, iouring/etc puts the charge in the user
> so it is additive with things like iouring and additively spans all
> the users processes.
> 
> However vfio is accounting only per-process and only for itself - no
> other subsystem uses locked as the charge variable for DMA pins.
> 
> The user visible difference will be that a limit X that worked with
> VFIO may start to fail after a kernel upgrade as the charge accounting
> is now cross user and additive with things like iommufd.
> 
> This whole area is a bit peculiar (eg mlock itself works differently),
> IMHO, but with most of the places doing pins voting to use
> user->locked_vm as the charge it seems the right path in today's
> kernel.
> 
> Ceratinly having qemu concurrently using three different subsystems
> (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> RLIMIT_MEMLOCK differently cannot be sane or correct.
> 
> I plan to fix RDMA like this as well so at least we can have
> consistency within qemu.
> 

I have an impression that iommufd and vfio type1 must use
the same accounting scheme given the management stack
has no insight into qemu on which one is actually used thus
cannot adapt to the subtle difference in between. in this
regard either we start fixing vfio type1 to use user->locked_vm
now or have iommufd follow vfio type1 for upward compatibility
and then change them together at a later point.

I prefer to the former as IMHO I don't know when will be a later
point w/o certain kernel changes to actually break the userspace
policy built on a wrong accounting scheme... 

Thanks
Kevin


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Jason Wang
On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin  wrote:
>
> > From: Jason Gunthorpe 
> > Sent: Wednesday, March 23, 2022 12:15 AM
> >
> > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> >
> > > I'm still picking my way through the series, but the later compat
> > > interface doesn't mention this difference as an outstanding issue.
> > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > resource limits?
> >
> > AFAICT, no, but it should be checked.
> >
> > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > memory limits.
> >
> > Yes, and ulimit does work fully. prlimit adjusts the value:
> >
> > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > {
> >   rlim = tsk->signal->rlim + resource;
> > [..]
> >   if (new_rlim)
> >   *rlim = *new_rlim;
> >
> > Which vfio reads back here:
> >
> > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >
> > And iommufd does the same read back:
> >
> >   lock_limit =
> >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > PAGE_SHIFT;
> >   npages = pages->npinned - pages->last_npinned;
> >   do {
> >   cur_pages = atomic_long_read(&pages->source_user-
> > >locked_vm);
> >   new_pages = cur_pages + npages;
> >   if (new_pages > lock_limit)
> >   return -ENOMEM;
> >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > cur_pages,
> >new_pages) != cur_pages);
> >
> > So it does work essentially the same.
> >
> > The difference is more subtle, iouring/etc puts the charge in the user
> > so it is additive with things like iouring and additively spans all
> > the users processes.
> >
> > However vfio is accounting only per-process and only for itself - no
> > other subsystem uses locked as the charge variable for DMA pins.
> >
> > The user visible difference will be that a limit X that worked with
> > VFIO may start to fail after a kernel upgrade as the charge accounting
> > is now cross user and additive with things like iommufd.
> >
> > This whole area is a bit peculiar (eg mlock itself works differently),
> > IMHO, but with most of the places doing pins voting to use
> > user->locked_vm as the charge it seems the right path in today's
> > kernel.
> >
> > Certainly having qemu concurrently using three different subsystems
> > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > RLIMIT_MEMLOCK differently cannot be sane or correct.
> >
> > I plan to fix RDMA like this as well so at least we can have
> > consistency within qemu.
> >
>
> I have an impression that iommufd and vfio type1 must use
> the same accounting scheme given the management stack
> has no insight into qemu on which one is actually used thus
> cannot adapt to the subtle difference in between. in this
> regard either we start fixing vfio type1 to use user->locked_vm
> now or have iommufd follow vfio type1 for upward compatibility
> and then change them together at a later point.
>
> I prefer to the former as IMHO I don't know when will be a later
> point w/o certain kernel changes to actually break the userspace
> policy built on a wrong accounting scheme...

I wonder if the kernel is the right place to do this. We have new uAPI
so management layer can know the difference of the accounting in
advance by

-device vfio-pci,iommufd=on

?

>
> Thanks
> Kevin
>



RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Thursday, March 24, 2022 10:28 AM
> 
> On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin  wrote:
> >
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, March 23, 2022 12:15 AM
> > >
> > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > >
> > > > I'm still picking my way through the series, but the later compat
> > > > interface doesn't mention this difference as an outstanding issue.
> > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > resource limits?
> > >
> > > AFAICT, no, but it should be checked.
> > >
> > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > memory limits.
> > >
> > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > >
> > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > {
> > >   rlim = tsk->signal->rlim + resource;
> > > [..]
> > >   if (new_rlim)
> > >   *rlim = *new_rlim;
> > >
> > > Which vfio reads back here:
> > >
> > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > >
> > > And iommufd does the same read back:
> > >
> > >   lock_limit =
> > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > PAGE_SHIFT;
> > >   npages = pages->npinned - pages->last_npinned;
> > >   do {
> > >   cur_pages = atomic_long_read(&pages->source_user-
> > > >locked_vm);
> > >   new_pages = cur_pages + npages;
> > >   if (new_pages > lock_limit)
> > >   return -ENOMEM;
> > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > cur_pages,
> > >new_pages) != cur_pages);
> > >
> > > So it does work essentially the same.
> > >
> > > The difference is more subtle, iouring/etc puts the charge in the user
> > > so it is additive with things like iouring and additively spans all
> > > the users processes.
> > >
> > > However vfio is accounting only per-process and only for itself - no
> > > other subsystem uses locked as the charge variable for DMA pins.
> > >
> > > The user visible difference will be that a limit X that worked with
> > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > is now cross user and additive with things like iommufd.
> > >
> > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > IMHO, but with most of the places doing pins voting to use
> > > user->locked_vm as the charge it seems the right path in today's
> > > kernel.
> > >
> > > Certainly having qemu concurrently using three different subsystems
> > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > >
> > > I plan to fix RDMA like this as well so at least we can have
> > > consistency within qemu.
> > >
> >
> > I have an impression that iommufd and vfio type1 must use
> > the same accounting scheme given the management stack
> > has no insight into qemu on which one is actually used thus
> > cannot adapt to the subtle difference in between. in this
> > regard either we start fixing vfio type1 to use user->locked_vm
> > now or have iommufd follow vfio type1 for upward compatibility
> > and then change them together at a later point.
> >
> > I prefer to the former as IMHO I don't know when will be a later
> > point w/o certain kernel changes to actually break the userspace
> > policy built on a wrong accounting scheme...
> 
> I wonder if the kernel is the right place to do this. We have new uAPI

I didn't get this. This thread is about that VFIO uses a wrong accounting
scheme and then the discussion is about the impact of fixing it to the
userspace. I didn't see the question on the right place part.

> so management layer can know the difference of the accounting in
> advance by
> 
> -device vfio-pci,iommufd=on
> 

I suppose iommufd will be used once Qemu supports it, as long as
the compatibility opens that Jason/Alex discussed in another thread
are well addressed. It is not necessarily to be a control knob exposed
to the caller.

Thanks
Kevin


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Jason Wang
On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin  wrote:
>
> > From: Jason Wang 
> > Sent: Thursday, March 24, 2022 10:28 AM
> >
> > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin  wrote:
> > >
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > >
> > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > >
> > > > > I'm still picking my way through the series, but the later compat
> > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > > resource limits?
> > > >
> > > > AFAICT, no, but it should be checked.
> > > >
> > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > memory limits.
> > > >
> > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > >
> > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > {
> > > >   rlim = tsk->signal->rlim + resource;
> > > > [..]
> > > >   if (new_rlim)
> > > >   *rlim = *new_rlim;
> > > >
> > > > Which vfio reads back here:
> > > >
> > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > >
> > > > And iommufd does the same read back:
> > > >
> > > >   lock_limit =
> > > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > > PAGE_SHIFT;
> > > >   npages = pages->npinned - pages->last_npinned;
> > > >   do {
> > > >   cur_pages = atomic_long_read(&pages->source_user-
> > > > >locked_vm);
> > > >   new_pages = cur_pages + npages;
> > > >   if (new_pages > lock_limit)
> > > >   return -ENOMEM;
> > > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > cur_pages,
> > > >new_pages) != cur_pages);
> > > >
> > > > So it does work essentially the same.
> > > >
> > > > The difference is more subtle, iouring/etc puts the charge in the user
> > > > so it is additive with things like iouring and additively spans all
> > > > the users processes.
> > > >
> > > > However vfio is accounting only per-process and only for itself - no
> > > > other subsystem uses locked as the charge variable for DMA pins.
> > > >
> > > > The user visible difference will be that a limit X that worked with
> > > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > > is now cross user and additive with things like iommufd.
> > > >
> > > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > > IMHO, but with most of the places doing pins voting to use
> > > > user->locked_vm as the charge it seems the right path in today's
> > > > kernel.
> > > >
> > > > Certainly having qemu concurrently using three different subsystems
> > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > >
> > > > I plan to fix RDMA like this as well so at least we can have
> > > > consistency within qemu.
> > > >
> > >
> > > I have an impression that iommufd and vfio type1 must use
> > > the same accounting scheme given the management stack
> > > has no insight into qemu on which one is actually used thus
> > > cannot adapt to the subtle difference in between. in this
> > > regard either we start fixing vfio type1 to use user->locked_vm
> > > now or have iommufd follow vfio type1 for upward compatibility
> > > and then change them together at a later point.
> > >
> > > I prefer to the former as IMHO I don't know when will be a later
> > > point w/o certain kernel changes to actually break the userspace
> > > policy built on a wrong accounting scheme...
> >
> > I wonder if the kernel is the right place to do this. We have new uAPI
>
> I didn't get this. This thread is about that VFIO uses a wrong accounting
> scheme and then the discussion is about the impact of fixing it to the
> userspace.

It's probably too late to fix the kernel considering it may break the userspace.

> I didn't see the question on the right place part.

I meant it would be easier to let userspace know the difference than
trying to fix or workaround in the kernel.

>
> > so management layer can know the difference of the accounting in
> > advance by
> >
> > -device vfio-pci,iommufd=on
> >
>
> I suppose iommufd will be used once Qemu supports it, as long as
> the compatibility opens that Jason/Alex discussed in another thread
> are well addressed. It is not necessarily to be a control knob exposed
> to the caller.

It has a lot of implications if we do this: it means iommufd needs to
inherit all the userspace-noticeable behaviour, as well as the "bugs",
of VFIO.

We know it's e

Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows

2022-03-23 Thread dann frazier
On Wed, Mar 23, 2022 at 07:55:23PM -0500, Rob Herring wrote:
> On Wed, Mar 23, 2022 at 5:15 PM dann frazier  
> wrote:
> >
> > On Wed, Mar 23, 2022 at 09:49:04AM +, Marc Zyngier wrote:
> > > On Tue, 22 Mar 2022 17:27:36 +,
> > > Robin Murphy  wrote:
> > > >
> > > > Originally, creating the dma_ranges resource list in pre-sorted fashion
> > > > was the simplest and most efficient way to enforce the order required by
> > > > iova_reserve_pci_windows(). However since then at least one PCI host
> > > > driver is now re-sorting the list for its own probe-time processing,
> > > > which doesn't seem entirely unreasonable, so that basic assumption no
> > > > longer holds. Make iommu-dma robust and get the sort order it needs by
> > > > explicitly sorting, which means we can also save the effort at creation
> > > > time and just build the list in whatever natural order the DT had.
> > > >
> > > > Signed-off-by: Robin Murphy 
> > > > ---
> > > >
> > > > Looking at this area off the back of the XGene thread[1] made me realise
> > > > that we need to do it anyway, regardless of whether it might also happen
> > > > to restore the previous XGene behaviour or not. Presumably nobody's
> > > > tried to use pcie-cadence-host behind an IOMMU yet...
> > >
> > > This definitely restores PCIe functionality on my Mustang (XGene-1).
> > > Hopefully dann can comment on whether this addresses his own issue, as
> > > his firmware is significantly different.
> >
> > Robin, Marc,
> >
> > Adding just this patch on top of v5.17 (w/ vendor dtb) isn't enough to
> > fix m400 networking:
> 
> I wouldn't expect it to given both the IB register selection changed
> and the 2nd dma-ranges entry is ignored.
> 
> Can you (and others) try out this branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git xgene-pci-fix
> 
> It should maintain the same IB register usage for both cases and
> handle the error in 'dma-ranges'.

Looks good Rob :)

https://paste.ubuntu.com/p/zJF9PKhQpS/


  -dann



RE: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping

2022-03-23 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, March 24, 2022 2:16 AM
> 
> On Tue, Mar 22, 2022 at 04:15:44PM -0600, Alex Williamson wrote:
> 
> > > +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> > > +   unsigned long length, struct page **out_pages, bool write)
> > > +{
> > > + unsigned long cur_iova = iova;
> > > + unsigned long last_iova;
> > > + struct iopt_area *area;
> > > + int rc;
> > > +
> > > + if (!length)
> > > + return -EINVAL;
> > > + if (check_add_overflow(iova, length - 1, &last_iova))
> > > + return -EOVERFLOW;
> > > +
> > > + down_read(&iopt->iova_rwsem);
> > > + for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> > > +  area = iopt_area_iter_next(area, iova, last_iova)) {
> > > + unsigned long last = min(last_iova, iopt_area_last_iova(area));
> > > + unsigned long last_index;
> > > + unsigned long index;
> > > +
> > > + /* Need contiguous areas in the access */
> > > + if (iopt_area_iova(area) < cur_iova || !area->pages) {
> > ^^^
> > Should this be (cur_iova != iova && iopt_area_iova(area) < cur_iova)?
> 
> Oh good eye
> 
> That is a typo < should be >:
> 
>   if (iopt_area_iova(area) > cur_iova || !area->pages) {
> 
> There are three boundary conditions here to worry about
>  - interval trees return any nodes that intersect the queried range
>so the first found node can start after iova
> 
>  - There can be a gap in the intervals
> 
>  - The final area can end short of last_iova
> 
> The last one is botched too and needs this:
> for ... { ...
>   }
> + if (cur_iova != last_iova)
> + goto out_remove;
> 
> The test suite isn't covering the boundary cases here yet, I added a
> FIXME for this for now.
> 

Another nit about the code below:

+   /*
+* Can't cross areas that are not aligned to the system page
+* size with this API.
+*/
+   if (cur_iova % PAGE_SIZE) {
+   rc = -EINVAL;
+   goto out_remove;
+   }

Currently it's done after iopt_pages_add_user() but before cur_iova
is adjusted, which implies the last add_user() will not be reverted if
the check fails here.

I suppose this should be checked at the start of the loop.
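
i.e. something along these lines, as a sketch against the RFC's loop
(untested):

	down_read(&iopt->iova_rwsem);
	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
	     area = iopt_area_iter_next(area, iova, last_iova)) {
		/* Need contiguous areas in the access */
		if (iopt_area_iova(area) > cur_iova || !area->pages) {
			rc = -EINVAL;
			goto out_remove;
		}

		/*
		 * Check the alignment up front, before
		 * iopt_pages_add_user() takes page references that would
		 * otherwise need unwinding on failure.
		 */
		if (cur_iova % PAGE_SIZE) {
			rc = -EINVAL;
			goto out_remove;
		}

		/* ... iopt_pages_add_user() and cur_iova advance as before ... */
	}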

Thanks
Kevin


RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Thursday, March 24, 2022 10:57 AM
> 
> On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin  wrote:
> >
> > > From: Jason Wang 
> > > Sent: Thursday, March 24, 2022 10:28 AM
> > >
> > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin 
> wrote:
> > > >
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > > >
> > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > > >
> > > > > > I'm still picking my way through the series, but the later compat
> > > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > > Doesn't this difference need to be accounted in how libvirt manages
> VM
> > > > > > resource limits?
> > > > >
> > > > > AFAICT, no, but it should be checked.
> > > > >
> > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > > memory limits.
> > > > >
> > > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > > >
> > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > > {
> > > > >   rlim = tsk->signal->rlim + resource;
> > > > > [..]
> > > > >   if (new_rlim)
> > > > >   *rlim = *new_rlim;
> > > > >
> > > > > Which vfio reads back here:
> > > > >
> > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > >
> > > > > And iommufd does the same read back:
> > > > >
> > > > >   lock_limit =
> > > > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > > > PAGE_SHIFT;
> > > > >   npages = pages->npinned - pages->last_npinned;
> > > > >   do {
> > > > >   cur_pages = atomic_long_read(&pages->source_user-
> > > > > >locked_vm);
> > > > >   new_pages = cur_pages + npages;
> > > > >   if (new_pages > lock_limit)
> > > > >   return -ENOMEM;
> > > > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > > cur_pages,
> > > > >new_pages) != cur_pages);
> > > > >
> > > > > So it does work essentially the same.
> > > > >
> > > > > The difference is more subtle, iouring/etc puts the charge in the user
> > > > > so it is additive with things like iouring and additively spans all
> > > > > the users processes.
> > > > >
> > > > > However vfio is accounting only per-process and only for itself - no
> > > > > other subsystem uses locked as the charge variable for DMA pins.
> > > > >
> > > > > The user visible difference will be that a limit X that worked with
> > > > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > > > is now cross user and additive with things like iommufd.
> > > > >
> > > > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > > > IMHO, but with most of the places doing pins voting to use
> > > > > user->locked_vm as the charge it seems the right path in today's
> > > > > kernel.
> > > > >
> > > > > Certainly having qemu concurrently using three different subsystems
> > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > > >
> > > > > I plan to fix RDMA like this as well so at least we can have
> > > > > consistency within qemu.
> > > > >
> > > >
> > > > I have an impression that iommufd and vfio type1 must use
> > > > the same accounting scheme given the management stack
> > > > has no insight into qemu on which one is actually used thus
> > > > cannot adapt to the subtle difference in between. in this
> > > > regard either we start fixing vfio type1 to use user->locked_vm
> > > > now or have iommufd follow vfio type1 for upward compatibility
> > > > and then change them together at a later point.
> > > >
> > > > I prefer to the former as IMHO I don't know when will be a later
> > > > point w/o certain kernel changes to actually break the userspace
> > > > policy built on a wrong accounting scheme...
> > >
> > > I wonder if the kernel is the right place to do this. We have new uAPI
> >
> > I didn't get this. This thread is about that VFIO uses a wrong accounting
> > scheme and then the discussion is about the impact of fixing it to the
> > userspace.
> 
> It's probably too late to fix the kernel considering it may break the 
> userspace.
> 
> > I didn't see the question on the right place part.
> 
> I meant it would be easier to let userspace know the difference than
> trying to fix or workaround in the kernel.

Jason already plans to fix RDMA in the same way, which will also lead
to user-visible impact when Qemu uses both VFIO and RDMA. Why is VFIO
so special that it should be left behind to carry the legacy misdesign?

> 
> >
> > > so management layer can know the difference

Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Jason Wang
On Thu, Mar 24, 2022 at 11:15 AM Tian, Kevin  wrote:
>
> > From: Jason Wang 
> > Sent: Thursday, March 24, 2022 10:57 AM
> >
> > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin  wrote:
> > >
> > > > From: Jason Wang 
> > > > Sent: Thursday, March 24, 2022 10:28 AM
> > > >
> > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin 
> > wrote:
> > > > >
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > > > >
> > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > > > >
> > > > > > > I'm still picking my way through the series, but the later compat
> > > > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > > > Doesn't this difference need to be accounted in how libvirt 
> > > > > > > manages
> > VM
> > > > > > > resource limits?
> > > > > >
> > > > > > AFAICT, no, but it should be checked.
> > > > > >
> > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > > > memory limits.
> > > > > >
> > > > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > > > >
> > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > > > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > > > {
> > > > > >   rlim = tsk->signal->rlim + resource;
> > > > > > [..]
> > > > > >   if (new_rlim)
> > > > > >   *rlim = *new_rlim;
> > > > > >
> > > > > > Which vfio reads back here:
> > > > > >
> > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > >
> > > > > > And iommufd does the same read back:
> > > > > >
> > > > > >   lock_limit =
> > > > > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > > > > PAGE_SHIFT;
> > > > > >   npages = pages->npinned - pages->last_npinned;
> > > > > >   do {
> > > > > >   cur_pages = atomic_long_read(&pages->source_user-
> > > > > > >locked_vm);
> > > > > >   new_pages = cur_pages + npages;
> > > > > >   if (new_pages > lock_limit)
> > > > > >   return -ENOMEM;
> > > > > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > > > cur_pages,
> > > > > >new_pages) != cur_pages);
> > > > > >
> > > > > > So it does work essentially the same.
> > > > > >
> > > > > > The difference is more subtle, iouring/etc puts the charge in the 
> > > > > > user
> > > > > > so it is additive with things like iouring and additively spans all
> > > > > > the users processes.
> > > > > >
> > > > > > However vfio is accounting only per-process and only for itself - no
> > > > > > other subsystem uses locked as the charge variable for DMA pins.
> > > > > >
> > > > > > The user visible difference will be that a limit X that worked with
> > > > > > VFIO may start to fail after a kernel upgrade as the charge 
> > > > > > accounting
> > > > > > is now cross user and additive with things like iommufd.
> > > > > >
> > > > > > This whole area is a bit peculiar (eg mlock itself works 
> > > > > > differently),
> > > > > > IMHO, but with most of the places doing pins voting to use
> > > > > > user->locked_vm as the charge it seems the right path in today's
> > > > > > kernel.
> > > > > >
> > > > > > Certainly having qemu concurrently using three different subsystems
> > > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > > > >
> > > > > > I plan to fix RDMA like this as well so at least we can have
> > > > > > consistency within qemu.
> > > > > >
> > > > >
> > > > > I have an impression that iommufd and vfio type1 must use
> > > > > the same accounting scheme given the management stack
> > > > > has no insight into qemu on which one is actually used thus
> > > > > cannot adapt to the subtle difference in between. in this
> > > > > regard either we start fixing vfio type1 to use user->locked_vm
> > > > > now or have iommufd follow vfio type1 for upward compatibility
> > > > > and then change them together at a later point.
> > > > >
> > > > > I prefer to the former as IMHO I don't know when will be a later
> > > > > point w/o certain kernel changes to actually break the userspace
> > > > > policy built on a wrong accounting scheme...
> > > >
> > > > I wonder if the kernel is the right place to do this. We have new uAPI
> > >
> > > I didn't get this. This thread is about that VFIO uses a wrong accounting
> > > scheme and then the discussion is about the impact of fixing it to the
> > > userspace.
> >
> > It's probably too late to fix the kernel considering it may break the 
> > userspace.
> >
> > > I didn't see the question on the right place part.
> >
> > I meant it would be easier 

RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Thursday, March 24, 2022 11:51 AM
> 
> > >
> >
> > In the end vfio type1 will be replaced by iommufd compat layer. With
> > that goal in mind iommufd has to inherit type1 behaviors.
> 
> So the compatibility should be provided by the compat layer instead of
> the core iommufd.
> 
> And I wonder what happens if iommufd is used by other subsystems like
> vDPA. Does it mean vDPA needs to inherit type1 behaviours? If yes, do
> we need a per subsystem new uAPI to expose this capability? If yes,
> why can't VFIO have such an API then we don't even need the compat
> layer at all?
> 

No, the compat layer is just for vfio. Other subsystems, including vdpa,
are expected to use the iommufd uAPI directly, except for having their
own bind_iommufd and attach_ioas uAPIs to build the connection between
the device and iommufd/ioas.

And the compat layer for vfio is only for transition purposes. Yi has
demonstrated how vfio can follow what other subsystems are expected to
do here:

https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6
(specifically commits related to "cover-letter: Adapting vfio-pci to iommufd")
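For illustration, a rough userspace flow of that pattern. Only
/dev/iommu and IOMMU_IOAS_ALLOC approximate the RFC's actual uAPI; the
device-side struct and ioctl names here are purely hypothetical
placeholders, and error handling is omitted for brevity:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>      /* the RFC's uAPI header */

struct hypothetical_bind { __u32 iommufd; };
struct hypothetical_attach { __u32 ioas_id; };
#define HYPOTHETICAL_BIND_IOMMUFD _IOW('x', 0, struct hypothetical_bind)
#define HYPOTHETICAL_ATTACH_IOAS  _IOW('x', 1, struct hypothetical_attach)

static int bind_and_attach(int device_fd)
{
        int iommufd = open("/dev/iommu", O_RDWR);
        struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };

        /* Allocate an IO address space straight from the iommufd uAPI. */
        ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);

        /* The subsystem's own uAPI connects its device to the iommufd... */
        struct hypothetical_bind bind = { .iommufd = iommufd };
        ioctl(device_fd, HYPOTHETICAL_BIND_IOMMUFD, &bind);

        /* ...and attaches it to the IOAS so DMA mappings take effect. */
        struct hypothetical_attach at = { .ioas_id = alloc.out_ioas_id };
        ioctl(device_fd, HYPOTHETICAL_ATTACH_IOAS, &at);
        return iommufd;
}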

Thanks
Kevin


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-23 Thread Christoph Hellwig
On Wed, Mar 23, 2022 at 08:54:08PM +, Robin Murphy wrote:
> I'll admit I still never quite grasped the reason for also adding the 
> override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think 
> by that point we were increasingly tired and confused and starting to 
> second-guess ourselves (well, I was, at least). I don't think it's wrong 
> per se, but as I said I do think it can bite anyone who's been doing 
> dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3 
> alone turns out to work OK then I'd be inclined to try a partial revert of 
> just that one hunk.

Agreed.  Let's try that first.

Oleksandr, can you try the patch below:


diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 6db1c475ec827..6c350555e5a1c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir)
 {
-   /*
-* Unconditional bounce is necessary to avoid corruption on
-* sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
-* the whole lengt of the bounce buffer.
-*/
-   swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
-   BUG_ON(!valid_dma_direction(dir));
+   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+   swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
+   else
+   BUG_ON(dir != DMA_FROM_DEVICE);
 }
 
 void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
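For reference, a sketch of the streaming-DMA pattern this hunk affects
(a generic driver shape, not ath9k's actual code). With the hunk
reverted, a DMA_FROM_DEVICE sync_for_device no longer copies the CPU's
view into the swiotlb bounce buffer, so a driver that writes to such a
buffer from the CPU and expects the write to survive must map it
DMA_BIDIRECTIONAL instead:

        dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dma))
                return -ENOMEM;

        /* ... device DMAs into buf ... */

        /* The CPU must sync before reading what the device wrote: */
        dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);
        /* ... CPU reads buf; per the DMA API it should not write to it ... */

        /* Hand the buffer back for the device's next write: */
        dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);

        /* ... */
        dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);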


RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-23 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, March 24, 2022 4:34 AM
> 
> On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote:
> > On Wed, 23 Mar 2022 16:34:39 -0300
> > Jason Gunthorpe  wrote:
> >
> > > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > > > On Fri, 18 Mar 2022 14:27:33 -0300
> > > > Jason Gunthorpe  wrote:
> > > >
> > > > > +static int conv_iommu_prot(u32 map_flags)
> > > > > +{
> > > > > + int iommu_prot;
> > > > > +
> > > > > + /*
> > > > > +  * We provide no manual cache coherency ioctls to userspace and most
> > > > > +  * architectures make the CPU ops for cache flushing privileged.
> > > > > +  * Therefore we require the underlying IOMMU to support CPU coherent
> > > > > +  * operation.
> > > > > +  */
> > > > > + iommu_prot = IOMMU_CACHE;
> > > >
> > > > Where is this requirement enforced?  AIUI we'd need to test
> > > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > > > intel_iommu_map() simply drop the flag when not supported by HW.
> > >
> > > You are right, the correct thing to do is to fail device
> > > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> > > however we can't do that because Intel abuses the meaning of
> > > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> > > supported.
> > >
> > > I want Intel to split out their special no-snoop from IOMMU_CACHE and
> > > IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
> > > all iommu drivers. Once this is done vfio and iommufd should both
> > > always set IOMMU_CACHE and refuse to work without
> > > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> > > arch that does in fact work today with vfio, somehow, but I don't..)
> >
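(A minimal sketch of that attach-time enforcement, assuming a
hypothetical iommufd helper -- this is not the RFC's actual code:

#include <linux/iommu.h>

/* Hypothetical: refuse device binding/attach outright when the IOMMU
 * cannot guarantee CPU-coherent DMA, rather than silently dropping
 * IOMMU_CACHE the way intel_iommu_map() does today. */
static int check_cache_coherency(struct device *dev)
{
        if (!iommu_capable(dev->bus, IOMMU_CAP_CACHE_COHERENCY))
                return -EINVAL;
        return 0;
}
)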
> > IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
> > lack of snoop-control support, causing us to have mixed coherent and
> > non-coherent domains.  I don't recall if you go back far enough in VT-d
> > history if the primary IOMMU might have lacked this support.  So I
> > think there are systems we care about with IOMMUs that can't enforce
> > DMA coherency.
> >
> > As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
> > mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
> > suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
> > are coherent regardless of mapping protection flags?  What's the point
> > of IOMMU_CACHE at that point?
> 
> IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's
> change.

One nit (as I explained in the previous v1 discussion): it is not that
Intel abuses this capability, since it was first introduced for Intel's
force-snoop requirement. It is rather that when its meaning was later
changed to match what you describe below, Intel's original use was not
caught and fixed properly. 😊

> 
> It only means that normal DMAs issued in a normal way are coherent with
> the CPU and do not require special cache flushing instructions, i.e. DMA
> issued by a kernel driver using the DMA API.
> 
> It does not mean that non-coherent DMA does not exist, or that
> platform- or device-specific ways to trigger non-coherence are blocked.
> 
> Stated another way, any platform that wires dev_is_dma_coherent() to
> true, as all of x86 does, must support IOMMU_CACHE and report
> IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> supports. The platform obviously declares it supports this in order to
> support the in-kernel DMA API.

This is a good explanation of IOMMU_CACHE. Following that, the
intel-iommu driver should always report this capability and do nothing
with IOMMU_CACHE, since coherency is already guaranteed by the arch.
Actually, this is exactly what the AMD iommu driver does today.

Then we'll introduce a new cap/prot specifically for enforcing snoop,
as you suggested below.
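To make that concrete, a sketch of the kind of split being discussed.
The capability and prot-flag names below are hypothetical placeholders,
not anything that exists in the tree at this point:

/* Hypothetical additions to include/linux/iommu.h: */

/* New capability: the IOMMU can force no-snoop TLPs to be snooped
 * (would extend enum iommu_cap, hence the hand-picked value). */
#define IOMMU_CAP_ENFORCE_SNOOP         3

/* New prot flag for iommu_map(): request snoop enforcement per mapping
 * (1 << 6 chosen only to avoid the existing IOMMU_* flag bits). */
#define IOMMU_ENFORCE_SNOOP             (1 << 6)

With such a split, IOMMU_CACHE and IOMMU_CAP_CACHE_COHERENCY keep one
meaning across all drivers, and only users that need to block no-snoop
traffic ask for the new capability.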

> 
> Thus, a new cap indicating that 'all dma is coherent' or 'no-snoop
> blocking' should be created to cover Intel's special need. From what I
> know it is only implemented in the Intel driver and apparently only
> for some IOMMUs connected to IGD.
> 
> > > Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> > > not working currently; I fixed it.
> >
> > Right, I see it in the comments relative to extensions, but missed in
> > the commit log.  Thanks,
> 
> Oh good, I remembered it was someplace..
> 

Thanks
Kevin