Re: Device specific pass through in host systems - discuss user interface

2019-06-07 Thread Sai Praneeth Prakhya


> It's interesting to see this from a fresh angle which isn't clouded by 
> other SoC GPU details - thanks for the proposal! A couple more thoughts 
> jump out immediately...
> 

Thanks a lot for jumping in and providing your valuable insights :)
Sorry for taking a while to reply; I did some quick testing of the
unbinding/binding flow.

> > Since this feature could be generic across architectures, we thought it
> > would be better if the user interface were discussed in the community first.
> > We envision this being used both at boot time and at runtime, so both a
> > kernel command line argument and a sysfs entry are needed. Please pour in
> > your suggestions on how the user interface should look to make it
> > architecture agnostic.
> > 
> > 
> > 1.  Have a kernel command line argument that takes a list of BDFs as
> > an input and puts them in pass-through mode
> > 
> > a.  Accepting a BDF as an input has a downside - a BDF is dynamic and
> > could change if the BIOS/OS enumerates a new device on the next reboot
> > 
> > b.  Accepting a <vendor ID, device ID> pair as an input has a downside -
> > what to do when there are multiple such devices and the user would like
> > to put only some of them in PT mode
> > 
> 
> c. Not all devices are PCI in the first place.

Agreed, maybe we could limit this feature only to PCIe devices :(

> 
> > 2.  Have a sysfs file which takes 1 or 0 as an input to enable/disable
> > pass-through mode. Some places that seem reasonable are
> > 
> > a.  /sys/class/iommu/dmar0/devices/<BDF>
> > 
> > b.  /sys/kernel/iommu_groups/<group>/devices
> 
> Note that this works out a bit tricky to actually use, since 
> changing the meaning of DMA addresses under the device's feet would be 
> disastrous.

Yes, that makes sense.

> It can only safely take effect by unbinding and rebinding 
> the driver and/or resetting the device (as a side note, this starts to 
> overlap with the IOMMU-drivers-as-modules concept, and whatever solution 
> to this we end up with may be helpful in making that work too). In most 
> cases that's probably viable, but not every driver supports unbinding, 
> and either way what if the device in question is the one hosting the 
> root filesystem... :/

We would like to propose something like this:

1. The user first unbinds the device driver:
echo <BDF> > /sys/bus/pci/drivers/<driver>/unbind

2. The user sets the pass-through bit (recording the intent that, upon the
next rebind, this device be put in pass-through mode):
echo 1 > /sys/class/iommu/dmar0/devices/<BDF>/pt

3. The user re-binds the driver:
echo <BDF> > /sys/bus/pci/drivers/<driver>/bind

So, if the driver doesn't support unbind, the user cannot put the device
in pass-through mode at runtime and has to use the kernel command line
argument instead.
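
For illustration, a minimal C sketch of this flow (the "pt" attribute and
its path follow the proposal above and do not exist yet; the driver name
and BDF are just examples):

#include <stdio.h>
#include <stdlib.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s", val);
	fclose(f);
}

int main(void)
{
	const char *bdf = "0000:00:14.0";	/* example device */

	/* 1. Unbind the driver (fails if the driver lacks unbind support) */
	write_str("/sys/bus/pci/drivers/xhci_hcd/unbind", bdf);
	/* 2. Record the intent to use pass-through on the next rebind */
	write_str("/sys/class/iommu/dmar0/devices/0000:00:14.0/pt", "1");
	/* 3. Rebind; the IOMMU driver would now attach an identity domain */
	write_str("/sys/bus/pci/drivers/xhci_hcd/bind", bdf);
	return 0;
}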

I did an unbind on the device hosting my root file system and after that I
was unable to run any command (like cat or echo), which made sense. So it's
up to the user to decide which devices can safely go through unbind and
hence be put in pass-through mode. Since this feature requires root
privileges, I think we can assume the admin does the right thing.

Please let me know what you think about this, and please do suggest any
better ways to deal with it.

Regards,
Sai



[RFC PATCH 1/2] dt-bindings: soc: ti: Add TI PAT bindings

2019-06-07 Thread Andrew F. Davis via iommu
This patch adds the bindings for the Page-based Address Translator (PAT)
present on various TI SoCs. A Page-based Address Translator (PAT) device
performs address translation using tables stored in an internal SRAM.
Each PAT supports a set number of pages, each occupying a programmable
4KB, 16KB, 64KB, or 1MB of addresses in a window for which an incoming
transaction will be translated.

Signed-off-by: Andrew F. Davis 
---
 .../devicetree/bindings/misc/ti,pat.txt   | 34 +++
 1 file changed, 34 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/misc/ti,pat.txt

diff --git a/Documentation/devicetree/bindings/misc/ti,pat.txt 
b/Documentation/devicetree/bindings/misc/ti,pat.txt
new file mode 100644
index ..fac20d45ad4c
--- /dev/null
+++ b/Documentation/devicetree/bindings/misc/ti,pat.txt
@@ -0,0 +1,34 @@
+Texas Instruments Page-based Address Translator (PAT) driver binding
+=====================================================================
+
+A Page-based Address Translator (PAT) device performs address translation
+using tables stored in an internal SRAM. Each PAT supports a set number of
+pages, each occupying a programmable 4KB, 16KB, 64KB, or 1MB of addresses
+in a window for which an incoming transaction will be translated.
+
+TI-PAT controller Device Node
+=============================
+
+The TI-PAT node describes the Texas Instrument's Page-based Address
+Translator (PAT).
+
+Required properties:
+-------------------
+- compatible: should be "ti,j721e-pat"
+- reg-names:
+   mmrs - Memory mapped registers region
+   table - Location of the table of translation pages
+   window - Window of memory addresses translated by this PAT
+- reg: register addresses corresponding to the above
+
+Example:
+
+navss_pat0: pat@3101 {
+   compatible = "ti,j721e-pat";
+   reg = <0x00 0x3101 0x00 0x0100>,
+ <0x00 0x3640 0x00 0x0004>,
+ <0x48 0x 0x00 0x4000>;
+   reg-names = "mmrs",
+   "table",
+   "window";
+};
-- 
2.17.1



[RFC PATCH 0/2] Support for TI Page-based Address Translator

2019-06-07 Thread Andrew F. Davis via iommu
Hello all,

So I've got a new IP on our new SoC I'm looking to make use of and would
like some help figuring out what framework best matches its function. The
IP is called a "Page-based Address Translator" or PAT. A PAT instance
(there are 5 of these things on our J721e device[0]) is basically a
really simple IOMMU sitting on the interconnect between the device bus
and what is effectively our northbridge called
MSMC (DRAM/SRAM/L3-Cache/Coherency controller).

Simplified it looks about like this:

 CPUs
  |
DRAM --- MSMC --- SRAM/L3
  |
NAVSS - (PATs)
  |
  --- Device Bus -
 |  |  |  |
Device  Device  Device   etc..

Each PAT has a window in high memory (around the 0x48_0000_0000 area);
any transaction with an address targeting its window will be routed into
that PAT. The PAT then does a simple calculation based on how far into
the window the address is and the current page size, looks up an
internally held table of translations, and sends the transaction back
out on the interconnect with a new address. Usually this address points
somewhere in DRAM, but it can be some other device or even loop back
into the PAT (I'm not sure there is a valid use-case for this, but it's
a point of interest).
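
For illustration, a toy C model (not driver code) of the translation
described above, with made-up names: the offset into the PAT window
selects a table entry by page index, and the low bits carry over into
the translated page.

#include <stdint.h>

uint64_t pat_translate(uint64_t addr, uint64_t window_base,
		       uint64_t page_size, const uint64_t *table)
{
	uint64_t offset = addr - window_base;	/* how far into the window */
	uint64_t index  = offset / page_size;	/* which translation entry */

	/* new page base from the internal SRAM table, plus page offset */
	return table[index] + (offset % page_size);
}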

My gut reaction is that this is an IOMMU which belongs in the IOMMU
subsystem. But there are a couple of oddities that make me less sure it
is really suitable for the IOMMU framework. First, it doesn't sit in
front of any particular device; it sits in front of *all* devices, which
means every device would claim it as an IOMMU parent even though many
devices also have a traditional IOMMU connected. Second, there is only a
limited window of address space per PAT, so we will get fragmentation
and allocation failures on occasion; in this way it looks to me more
like an AGP GART. Third, the window is in high memory, so unlike some
IOMMU devices which can be used to allow DMA to high-mem from low-mem-only
devices, PAT can't be used for that. Lastly, it doesn't provide any
isolation: if an access does not target the PAT window, the PAT is simply
not used (that is not to say we don't have isolation, just that it is
taken care of by other parts of the interconnect).

This means, to me, that PAT has one main purpose: making
physically-contiguous views of scattered pages in system memory for DMA.
But it does that really well: the whole translation table is held in
PAT-internal SRAM, giving one-bus-cycle latency at full bus bandwidth.

So what are my options here, is IOMMU the right way to go or not?

Looking around the kernel I also see the char dev AGP/GART interface,
which looks like a good fit, but also looks quite dated and, my guess,
deprecated at this point. Moving right along...

Another thing I saw is that we have upstream support for the DMM device[1]
available in some OMAPx/AM57x SoCs. I'm a little more familiar with this
device. The DMM is a bundle of IPs, and in fact one of them is called
"PAT" and does basically the same thing this incarnation of "PAT" does.
Its upstream integration design is unfortunately a bit questionable: the
DMM support was integrated into the OMAPDRM display driver, which does
make some sense given its support for rotation (using the TILER IP
contained in the DMM). The issue with this was that the DMM/TILER/PAT IP
was not part of our display IP, but instead sat out at the end of the
shared device bus, inside the external memory controller. As with this
new PAT, any IP could make use of it, but only the display framework
could actually provide buffers backed by it. This meant, for instance,
that if we wanted to decode some video buffer using our video decoder we
would have to allocate from the DRM framework and then pass that over to
the V4L2 system. This doesn't make much sense and required user-space to
know about this odd situation and allocate from the right spot, or else
use up valuable CMA space or waste memory with dedicated carveouts.

Another idea would be to have this as a special central allocator
(exposed through DMA-BUF heaps[2] or ION) that would give out normal
system memory as a DMA-BUF but remap it with PAT if a device that only
supports contiguous memory tries to attach/map that DMA-BUF.

One last option would be to allow user-space to choose to make the buffer
contiguous when it needs to. That's what the driver in this series allows.
We expose a remapping device; user-space passes it a non-contiguous
DMA-BUF handle and it passes a contiguous one back. Simple as that.
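
A hypothetical user-space sketch of that flow (the device node name,
struct layout, and ioctl number here are illustrative assumptions; the
real UAPI is in include/uapi/linux/ti-pat.h in patch 2/2):

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct ti_pat_export_data {
	uint32_t fd;	/* in: non-contiguous DMA-BUF; out: contiguous one */
};

#define TI_PAT_IOC_EXPORT _IOWR('P', 0, struct ti_pat_export_data)

int remap_to_contiguous(int scattered_fd)
{
	struct ti_pat_export_data data = { .fd = scattered_fd };
	int pat = open("/dev/ti-pat", O_RDWR);
	int ret;

	if (pat < 0)
		return -1;
	ret = ioctl(pat, TI_PAT_IOC_EXPORT, &data);
	close(pat);
	return ret < 0 ? -1 : (int)data.fd;	/* contiguous DMA-BUF fd */
}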

So how do we use this? Let's take Android, for example: we don't know at
allocation time if a rendering buffer will end up going back into the GPU
for further processing or will be consumed directly by the display. This
is a problem for us, as our GPU can work with non-contiguous buffers but
our display cannot, so any buffer that could possibly go to the display
at some point currently needs to be allocated as contiguous memory.

[RFC PATCH 2/2] soc: ti: Add Support for the TI Page-based Address Translator (PAT)

2019-06-07 Thread Andrew F. Davis via iommu
This patch adds a driver for the Page-based Address Translator (PAT)
present on various TI SoCs. A PAT device performs address translation
using tables stored in an internal SRAM. Each PAT supports a set number
of pages, each occupying a programmable 4KB, 16KB, 64KB, or 1MB of
addresses in a window for which an incoming transaction will be
translated.

Signed-off-by: Andrew F. Davis 
---
 drivers/soc/ti/Kconfig  |   9 +
 drivers/soc/ti/Makefile |   1 +
 drivers/soc/ti/ti-pat.c | 569 
 include/uapi/linux/ti-pat.h |  44 +++
 4 files changed, 623 insertions(+)
 create mode 100644 drivers/soc/ti/ti-pat.c
 create mode 100644 include/uapi/linux/ti-pat.h

diff --git a/drivers/soc/ti/Kconfig b/drivers/soc/ti/Kconfig
index f0be35d3dcba..b838ae74d01f 100644
--- a/drivers/soc/ti/Kconfig
+++ b/drivers/soc/ti/Kconfig
@@ -86,4 +86,13 @@ config TI_SCI_INTA_MSI_DOMAIN
help
  Driver to enable Interrupt Aggregator specific MSI Domain.
 
+config TI_PAT
+   tristate "TI PAT DMA-BUF exporter"
+   select REGMAP
+   help
+	  Driver for the TI Page-based Address Translator (PAT). This driver
+	  provides an API allowing the remapping of a non-contiguous
+	  DMA-BUF into a contiguous one that is suitable for devices needing
+	  contiguous memory.
+
 endif # SOC_TI
diff --git a/drivers/soc/ti/Makefile b/drivers/soc/ti/Makefile
index b3868d392d4f..1369642b40c3 100644
--- a/drivers/soc/ti/Makefile
+++ b/drivers/soc/ti/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_AMX3_PM)   += pm33xx.o
 obj-$(CONFIG_WKUP_M3_IPC)  += wkup_m3_ipc.o
 obj-$(CONFIG_TI_SCI_PM_DOMAINS)+= ti_sci_pm_domains.o
 obj-$(CONFIG_TI_SCI_INTA_MSI_DOMAIN)   += ti_sci_inta_msi.o
+obj-$(CONFIG_TI_PAT)   += ti-pat.o
diff --git a/drivers/soc/ti/ti-pat.c b/drivers/soc/ti/ti-pat.c
new file mode 100644
index ..7359ea0f7ccf
--- /dev/null
+++ b/drivers/soc/ti/ti-pat.c
@@ -0,0 +1,569 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * TI PAT mapped DMA-BUF memory re-exporter
+ *
+ * Copyright (C) 2018-2019 Texas Instruments Incorporated - http://www.ti.com/
+ * Andrew F. Davis 
+ */
+
+#include <linux/dma-buf.h>
+#include <linux/fs.h>
+#include <linux/genalloc.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/platform_device.h>
+#include <linux/regmap.h>
+#include <linux/scatterlist.h>
+#include <linux/slab.h>
+
+#include <uapi/linux/ti-pat.h>
+
+#define TI_PAT_DRIVER_NAME "ti-pat"
+
+/* TI PAT MMRS registers */
+#define TI_PAT_MMRS_PID		0x0  /* Revision Register */
+#define TI_PAT_MMRS_CONFIG	0x4  /* Config Register */
+#define TI_PAT_MMRS_CONTROL	0x10 /* Control Register */
+
+/* TI PAT CONTROL register field values */
+#define TI_PAT_CONTROL_ARB_MODE_UF 0x0 /* Updates first */
+#define TI_PAT_CONTROL_ARB_MODE_RR 0x2 /* Round-robin */
+
+#define TI_PAT_CONTROL_PAGE_SIZE_4KB   0x0
+#define TI_PAT_CONTROL_PAGE_SIZE_16KB  0x1
+#define TI_PAT_CONTROL_PAGE_SIZE_64KB  0x2
+#define TI_PAT_CONTROL_PAGE_SIZE_1MB   0x3
+
+static unsigned int ti_pat_page_sizes[] = {
+   [TI_PAT_CONTROL_PAGE_SIZE_4KB]  = 4 * 1024,
+   [TI_PAT_CONTROL_PAGE_SIZE_16KB] = 16 * 1024,
+   [TI_PAT_CONTROL_PAGE_SIZE_64KB] = 64 * 1024,
+   [TI_PAT_CONTROL_PAGE_SIZE_1MB]  = 1024 * 1024,
+};
+
+enum ti_pat_mmrs_fields {
+   /* Revision */
+   F_PID_MAJOR,
+   F_PID_MINOR,
+
+   /* Controls */
+   F_CONTROL_ARB_MODE,
+   F_CONTROL_PAGE_SIZE,
+   F_CONTROL_REPLACE_OID_EN,
+   F_CONTROL_EN,
+
+   /* sentinel */
+   F_MAX_FIELDS
+};
+
+static const struct reg_field ti_pat_mmrs_reg_fields[] = {
+   /* Revision */
+   [F_PID_MAJOR]   = REG_FIELD(TI_PAT_MMRS_PID, 8, 10),
+   [F_PID_MINOR]   = REG_FIELD(TI_PAT_MMRS_PID, 0, 5),
+   /* Controls */
+   [F_CONTROL_ARB_MODE]= REG_FIELD(TI_PAT_MMRS_CONTROL, 6, 7),
+   [F_CONTROL_PAGE_SIZE]   = REG_FIELD(TI_PAT_MMRS_CONTROL, 4, 5),
+   [F_CONTROL_REPLACE_OID_EN]  = REG_FIELD(TI_PAT_MMRS_CONTROL, 1, 1),
+   [F_CONTROL_EN]  = REG_FIELD(TI_PAT_MMRS_CONTROL, 0, 0),
+};
+
+/**
+ * struct ti_pat_data - PAT device instance data
+ * @dev: PAT device structure
+ * @mdev: misc device
+ * @mmrs_map: Register map of MMRS region
+ * @mmrs_fields: regmap fields of the MMRS registers
+ * @table_base: Base address of TABLE region
+ * @page_count: number of translation pages supported by this PAT
+ * @page_size: currently programmed page size, in bytes
+ * @window_base: physical base address of the translation window
+ * @pool: allocator for page slots within the window
+ */
+struct ti_pat_data {
+   struct device *dev;
+   struct miscdevice mdev;
+   struct regmap *mmrs_map;
+   struct regmap_field *mmrs_fields[F_MAX_FIELDS];
+   void __iomem *table_base;
+   unsigned int page_count;
+   unsigned int page_size;
+   phys_addr_t window_base;
+   struct gen_pool *pool;
+};
+
+struct ti_pat_dma_buf_attachment {
+   struct device *dev;
+   struct sg_table *table;
+   struct ti_pat_buffer *buffer;
+   struct list_head list;
+};
+
+struct ti_pat_buffer {
+   struct ti_pat_data *pat;
+   struct dma_buf *i_dma_buf;
+   size_t size;
+   unsigned long offset;
+   struct dma_buf *e_dma_buf;
+
+   struct 

"iommu/vt-d: Delegate DMA domain to generic iommu" series breaks megaraid_sas

2019-06-07 Thread Qian Cai
The linux-next series "iommu/vt-d: Delegate DMA domain to generic iommu" [1]
leaves a system with the rootfs on a megaraid_sas card unable to boot.

Reverting the whole series on top of linux-next (next-20190607) fixed the
issue.

The information regarding this storage card is:

[  116.466810][  T324] megaraid_sas 0000:06:00.0: FW provided supportMaxExtLDs: 0 max_lds: 32
[  116.476052][  T324] megaraid_sas 0000:06:00.0: controller type   : iMR(0MB)
[  116.483646][  T324] megaraid_sas 0000:06:00.0: Online Controller Reset(OCR)  : Enabled
[  116.492403][  T324] megaraid_sas 0000:06:00.0: Secure JBOD support   : Yes
[  116.499887][  T324] megaraid_sas 0000:06:00.0: NVMe passthru support : No
[  116.507480][  T324] megaraid_sas 0000:06:00.0: FW provided
[  116.612523][  T324] megaraid_sas 0000:06:00.0: NVME page size: (0)
[  116.629991][  T324] megaraid_sas 0000:06:00.0: INIT adapter done
[  116.714789][  T324] megaraid_sas 0000:06:00.0: pci id: (0x1000)/(0x0017)/(0x1d49)/(0x0500)
[  116.724228][  T324] megaraid_sas 0000:06:00.0: unevenspan support: no
[  116.731518][  T324] megaraid_sas 0000:06:00.0: firmware crash dump   : no
[  116.738981][  T324] megaraid_sas 0000:06:00.0: jbod sync map : yes
[  116.787433][  T324] scsi host0: Avago SAS based MegaRAID driver
[  117.081088][  T324] scsi 0:0:0:0: Direct-Access LENOVO   ST900MM0168  L587 PQ: 0 ANSI: 6

[1] https://lore.kernel.org/patchwork/cover/1078960/


Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-07 Thread Jacob Pan
On Fri, 7 Jun 2019 11:28:13 +0100
Jean-Philippe Brucker  wrote:

> On 06/06/2019 21:29, Jacob Pan wrote:
> >> 	iommu_unregister_device_fault_handler(&vdev->pdev->dev);
> >
> >
> > But this can fail if there are pending faults which leaves a
> > device reference and then the system is broken :(  
>  This series only features unrecoverable errors and for those the
>  unregistration cannot fail. Now unrecoverable errors were added I
>  admit this is confusing. We need to sort this out or clean the
>  dependencies.
> >>> As Alex pointed out in 4/29, we can make
> >>> iommu_unregister_device_fault_handler() never fail and clean up
> >>> all the pending faults in the host IOMMU belong to that device.
> >>> But the problem is that if a fault, such as PRQ, has already been
> >>> injected into the guest, the page response may come back after
> >>> handler is unregistered and registered again.
> >>
> >> I'm trying to figure out if that would be harmful in any way. I
> >> guess it can be a bit nasty if we handle the page response right
> >> after having injected a new page request that uses the same PRGI.
> >> In any other case we discard the page response, but here we
> >> forward it to the endpoint and:
> >>
> >> * If the response status is success, endpoint retries the
> >> translation. The guest probably hasn't had time to handle the new
> >> page request and translation will fail, which may lead the endpoint
> >> to give up (two unsuccessful translation requests). Or send a new
> >> request
> >>  
> > Good point, there shouldn't be any harm if the page response is a
> > "fake" success. In fact it could happen in the normal operation when
> > PRQs to two devices share the same non-leaf translation structure.
> > The worst case is just a retry. I am not aware of the retry limit,
> > is it in the PCIe spec? I cannot find it.  
> 
> I don't think so, it's the implementation's choice. In general I don't
> think devices will have a retry limit, but it doesn't seem like the
> PCI spec prevents them from implementing one either. It could be
> useful to stop retrying after a certain number of faults, for
> preventing livelocks when the OS doesn't fix up the page tables and
> the device would just repeat the fault indefinitely.
> 
> > I think we should just document it, similar to having a spurious
> > interrupt. The PRQ trace event should capture that as well.
> >   
> >> * otherwise the endpoint won't retry the access, and could also
> >> disable PRI if the status is failure.
> >>  
> > That would be true regardless this race condition with handler
> > registration. So should be fine.  
> 
> We do give an invalid response for the old PRG (because of
> unregistering), but also for the new one, which has a different
> address that the guest might be able to page in and would normally
> return success.
> 
> >>> We need a way to reject such page response belong
> >>> to the previous life of the handler. Perhaps a sync call to the
> >>> guest with your fault queue eventfd? I am not sure.
> >>
> >> We could simply expect the device driver not to send any page
> >> response after unregistering the fault handler. Is there any
> >> reason VFIO would need to unregister and re-register the fault
> >> handler on a live guest? 
> > There is no reason for VFIO to unregister and register again, I was
> > just thinking from security perspective. Someone could write a VFIO
> > app do this attack. But I agree the damage is within the device,
> > may get PRI disabled as a result.  
> 
> Yes I think the damage would always be contained within the
> misbehaving software
> 
> > So it seems we agree on the following:
> > - iommu_unregister_device_fault_handler() will never fail
> > - iommu driver cleans up all pending faults when handler is
> > unregistered
> > - assume device driver or guest not sending more page response
> > _after_ handler is unregistered.
> > - system will tolerate rare spurious response
> > 
> > Sounds right?  
> 
> Yes, I'll add that to the fault series
Hold on a second please, I think we need more clarification. Ashok
pointed out to me that a spurious response can be harmful to other
devices in the mdev case, where the PRQ group id is not per PASID: the
device may reuse the group number, and receiving a spurious page response
can confuse the entire PF. Sending spurious page responses also does not
exactly abide by the PCIe spec.

We have two options here:
1. unregister handler gets -EBUSY if outstanding faults exist.
   - PROs: blocks only the offending device's unbind; an eventual timeout
     will clear it.
   - CONs: flooded faults can prevent clearing.
2. unregister handler blocks until all faults are cleared in the host;
   unregistration never fails.
   - PROs: simple flow for VFIO, no need to worry about the device
     holding a reference.
   - CONs: spurious page responses may come from a
     misbehaving/malicious guest if the guest does unregister and
Re: [PATCH] Documentation: DMA-API: fix a function name of max_mapping_size

2019-06-07 Thread Jonathan Corbet
On Fri,  7 Jun 2019 16:47:13 +0900
Yoshihiro Shimoda  wrote:

> The exported function name is dma_max_mapping_size(), not
> dma_direct_max_mapping_size() so that this patch fixes
> the function name in the documentation.
> 
> Fixes: 133d624b1cee ("dma: Introduce dma_max_mapping_size()")
> Signed-off-by: Yoshihiro Shimoda 
> ---
>  Documentation/DMA-API.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
> index 0076150..e47c63b 100644
> --- a/Documentation/DMA-API.txt
> +++ b/Documentation/DMA-API.txt
> @@ -198,7 +198,7 @@ call to set the mask to the value returned.
>  ::
>  
>   size_t
> - dma_direct_max_mapping_size(struct device *dev);
> + dma_max_mapping_size(struct device *dev);
>  
>  Returns the maximum size of a mapping for the device. The size parameter
>  of the mapping functions like dma_map_single(), dma_map_page() and

Applied, thanks.

jon


Re: [PATCH v8 25/29] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type

2019-06-07 Thread Alex Williamson
On Fri, 7 Jun 2019 10:28:06 +0200
Auger Eric  wrote:

> Hi Alex,
> 
> On 6/4/19 12:31 AM, Alex Williamson wrote:
> > On Sun, 26 May 2019 18:10:00 +0200
> > Eric Auger  wrote:
> >   
> >> This patch adds two new regions aiming to handle nested mode
> >> translation faults.
> >>
> >> The first region (two host kernel pages) is read-only from the
> >> user-space perspective. The first page contains a header
> >> that provides information about the circular buffer located in the
> >> second page. The circular buffer is put in a different page in
> >> the prospect of being mmappable.
> >>
> >> The max user API version supported by the kernel is returned
> >> through a dedicated fault region capability.
> >>
> >> The prod header contains
> >> - the user API version in use (potentially inferior to the one
> >>   returned in the capability),
> >> - the offset of the queue within the region,
> >> - the producer index relative to the start of the queue
> >> - the max number of fault records,
> >> - the size of each record.
> >>
> >> The second region is write-only from the user perspective. It
> >> contains the version of the requested fault ABI and the consumer
> >> index that is updated by the userspace each time this latter has
> >> consumed fault records.
> >>
> >> The natural order of operation for the userspace is:
> >> - retrieve the highest supported fault ABI version
> >> - set the requested fault ABI version in the consumer region
> >>
> >> Until the ABI version is set by the userspace, the kernel
> >> cannot return a comprehensive set of information inside the
> >> prod header (entry size and number of entries in the fault queue).  
> > 
> > It's not clear to me why two regions are required for this.  If the
> > first page is not mmap capable, why does it need to be read-only?  If
> > it were not read-only couldn't the fields of the second region also fit
> > within this first page?  If you wanted to deal with an mmap capable
> > writeable region, it could just be yet a 3rd page in the first region.  
> I thought it would be clearer for the userspace to have 2 separate
> regions, one for the producer and one for the consumer. Otherwise I will
> need to specify which fields are read-only or write-only. But this may
> be more self-contained in a single region.

We need to figure out read vs write anyway, but separating them to
separate regions just for that seems unnecessary.  How many regions do
we expect to require for a single feature?

> >   
> >>
> >> Signed-off-by: Eric Auger 
> >>
> >> ---
> >>
> >> v4 -> v5
> >> - check cons is not null in vfio_pci_check_cons_fault
> >>
> >> v3 -> v4:
> >> - use 2 separate regions, respectively in read and write modes
> >> - add the version capability
> >> ---
> >>  drivers/vfio/pci/vfio_pci.c | 105 
> >>  drivers/vfio/pci/vfio_pci_private.h |  17 +
> >>  drivers/vfio/pci/vfio_pci_rdwr.c|  73 +++
> >>  include/uapi/linux/vfio.h   |  42 +++
> >>  4 files changed, 237 insertions(+)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index cab71da46f4a..f75f61127277 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -261,6 +261,106 @@ int vfio_pci_set_power_state(struct vfio_pci_device 
> >> *vdev, pci_power_t state)
> >>return ret;
> >>  }
> >>  
> >> +void vfio_pci_fault_release(struct vfio_pci_device *vdev,
> >> +  struct vfio_pci_region *region)
> >> +{
> >> +}
> >> +
> >> +static const struct vfio_pci_fault_abi fault_abi_versions[] = {
> >> +  [0] = {
> >> +  .entry_size = sizeof(struct iommu_fault),
> >> +  },
> >> +};
> >> +
> >> +#define NR_FAULT_ABIS ARRAY_SIZE(fault_abi_versions)  
> > 
> > This looks like it's leading to some dangerous complicated code to
> > support multiple user selected ABIs.  How many ABIs do we plan to
> > support?  The region capability also exposes a type, sub-type, and
> > version.  How much of this could be exposed that way?  ie. if we need
> > to support multiple versions, expose multiple regions.  
> 
> This is something that was discussed earlier and suggested by
> Jean-Philippe that we may need to support several versions of the ABI
> (typicallu when adding PRI support).
> Exposing multiple region is an interesting idea and I will explore that
> direction.
> >   
> >> +
> >> +static int vfio_pci_fault_prod_add_capability(struct vfio_pci_device *vdev,
> >> +	struct vfio_pci_region *region, struct vfio_info_cap *caps)
> >> +{
> >> +	struct vfio_region_info_cap_fault cap = {
> >> +		.header.id = VFIO_REGION_INFO_CAP_PRODUCER_FAULT,
> >> +		.header.version = 1,
> >> +		.version = NR_FAULT_ABIS,
> >> +	};
> >> +	return vfio_info_add_capability(caps, &cap, sizeof(cap));
> >> +}
> >> +
> >> +static const struct vfio_pci_regops vfio_pci_fault_cons_regops = {
> >> +  .rw = vfio_pci_fault_cons_rw,
> >> +  

Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-07 Thread Auger Eric
Hi Jean,

On 6/7/19 2:48 PM, Jean-Philippe Brucker wrote:
> On 26/05/2019 17:10, Eric Auger wrote:
>> +int vfio_pci_iommu_dev_fault_handler(struct iommu_fault_event *evt, void *data)
>> +{
>> +	struct vfio_pci_device *vdev = (struct vfio_pci_device *) data;
>> +	struct vfio_region_fault_prod *prod_region =
>> +		(struct vfio_region_fault_prod *)vdev->fault_pages;
>> +	struct vfio_region_fault_cons *cons_region =
>> +		(struct vfio_region_fault_cons *)(vdev->fault_pages + 2 * PAGE_SIZE);
>> +	struct iommu_fault *new =
>> +		(struct iommu_fault *)(vdev->fault_pages + prod_region->offset +
>> +			prod_region->prod * prod_region->entry_size);
>> +	int prod, cons, size;
>> +
>> +	mutex_lock(&vdev->fault_queue_lock);
>> +
>> +	if (!vdev->fault_abi)
>> +		goto unlock;
>> +
>> +	prod = prod_region->prod;
>> +	cons = cons_region->cons;
>> +	size = prod_region->nb_entries;
>> +
>> +	if (CIRC_SPACE(prod, cons, size) < 1)
>> +		goto unlock;
>> +
>> +	*new = evt->fault;
> 
> Could you check fault.type and return an error if it's not UNRECOV here?
> If the fault is recoverable (very unlikely since the PRI capability is
> disabled, but allowed) and we return an error here, then the caller
> takes care of completing the fault. If we forward it to the guest
> instead, the producer will wait indefinitely for a response.
Sure I will add that check in the next version.

Thanks

Eric
> 
> Thanks,
> Jean
> 
>> +	prod = (prod + 1) % size;
>> +	prod_region->prod = prod;
>> +	mutex_unlock(&vdev->fault_queue_lock);
>> +
>> +	mutex_lock(&vdev->igate);
>> +	if (vdev->dma_fault_trigger)
>> +		eventfd_signal(vdev->dma_fault_trigger, 1);
>> +	mutex_unlock(&vdev->igate);
>> +	return 0;
>> +
>> +unlock:
>> +	mutex_unlock(&vdev->fault_queue_lock);
>> +	return -EINVAL;
>> +}


Re: [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable

2019-06-07 Thread Ricardo Neri
On Thu, Jun 06, 2019 at 05:35:51PM -0700, Stephane Eranian wrote:
> Hi Ricardo,

Hi Stephane,

> Thanks for your contribution here. It is very important to move the
> watchdog out of the PMU wherever possible.

Indeed, using the PMU for the hardlockup detector is still the default
option. This patch series proposes a new kernel command line to switch
to use the HPET.

> 
> On Thu, May 23, 2019 at 6:17 PM Ricardo Neri
>  wrote:
> >
> > The HPET-based hardlockup detector relies on the TSC to determine if an
> > observed NMI interrupt was originated by HPET timer. Hence, this detector
> > can no longer be used with an unstable TSC.
> >
> > In such case, permanently stop the HPET-based hardlockup detector and
> > start the perf-based detector.
> >
> > Signed-off-by: Ricardo Neri 
> > ---
> >  arch/x86/include/asm/hpet.h| 2 ++
> >  arch/x86/kernel/tsc.c  | 2 ++
> >  arch/x86/kernel/watchdog_hld.c | 7 +++
> >  3 files changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
> > index fd99f2390714..a82cbe17479d 100644
> > --- a/arch/x86/include/asm/hpet.h
> > +++ b/arch/x86/include/asm/hpet.h
> > @@ -128,6 +128,7 @@ extern int hardlockup_detector_hpet_init(void);
> >  extern void hardlockup_detector_hpet_stop(void);
> >  extern void hardlockup_detector_hpet_enable(unsigned int cpu);
> >  extern void hardlockup_detector_hpet_disable(unsigned int cpu);
> > +extern void hardlockup_detector_switch_to_perf(void);
> >  #else
> >  static inline struct hpet_hld_data 
> > *hpet_hardlockup_detector_assign_timer(void)
> >  { return NULL; }
> > @@ -136,6 +137,7 @@ static inline int hardlockup_detector_hpet_init(void)
> >  static inline void hardlockup_detector_hpet_stop(void) {}
> >  static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
> >  static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
> > +static void harrdlockup_detector_switch_to_perf(void) {}
> >  #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
> >
> This does not compile for me when CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
> is not enabled, because:
> 1. you have a typo in the function name
> 2. you are missing the inline keyword

I am sorry. This was an oversight on my side. I have corrected this in
preparation for a v5.
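
For reference, the corrected stub implied by the review comments above
(for the !CONFIG_X86_HARDLOCKUP_DETECTOR_HPET case) would be:

static inline void hardlockup_detector_switch_to_perf(void) {}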

Thanks and BR,
Ricardo


Re: Device specific pass through in host systems - discuss user interface

2019-06-07 Thread Robin Murphy

On 07/06/2019 03:24, Prakhya, Sai Praneeth wrote:

Hi All,

I am working on an IOMMU driver feature that allows a user to specify whether
the DMA from a device should be translated by the IOMMU or not. Presently, we
support only an all-or-none mode, i.e. if the user specifies "iommu=pt" [X86]
or "iommu.passthrough" [ARM64] on the kernel command line, all devices are put
in pass-through mode; there is no per-device granularity. However, we were
asked by a customer to selectively put devices in pass-through mode, not all
of them.


It's interesting to see this from a fresh angle which isn't clouded by 
other SoC GPU details - thanks for the proposal! A couple more thoughts 
jump out immediately...



Since this feature could be generic across architectures, we thought it would
be better if the user interface were discussed in the community first. We
envision this being used both at boot time and at runtime, so both a kernel
command line argument and a sysfs entry are needed. Please pour in your
suggestions on how the user interface should look to make it architecture
agnostic.


1.  Have a kernel command line argument that takes a list of BDFs as an
input and puts them in pass-through mode

a.  Accepting a BDF as an input has a downside - a BDF is dynamic and could
change if the BIOS/OS enumerates a new device on the next reboot

b.  Accepting a <vendor ID, device ID> pair as an input has a downside - what
to do when there are multiple such devices and the user would like to put only
some of them in PT mode



c. Not all devices are PCI in the first place.


2.  Have a sysfs file which takes 1 or 0 as an input to enable/disable
pass-through mode. Some places that seem reasonable are

a.  /sys/class/iommu/dmar0/devices/<BDF>

b.  /sys/kernel/iommu_groups/<group>/devices


Note that this works out a bit tricky to actually use, since 
changing the meaning of DMA addresses under the device's feet would be 
disastrous. It can only safely take effect by unbinding and rebinding 
the driver and/or resetting the device (as a side note, this starts to 
overlap with the IOMMU-drivers-as-modules concept, and whatever solution 
to this we end up with may be helpful in making that work too). In most 
cases that's probably viable, but not every driver supports unbinding, 
and either way what if the device in question is the one hosting the 
root filesystem... :/


Robin.



I am looking for a consensus on *how the kernel command line argument should
look and what the path for the sysfs entry should be*. Also, please note that
if a device is put in pass-through mode it won't be available for the guest,
and that's OK.

Regards,
Sai

PS: Idea credits: Ashok Raj




Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-07 Thread Jean-Philippe Brucker
On 26/05/2019 17:10, Eric Auger wrote:
> +int vfio_pci_iommu_dev_fault_handler(struct iommu_fault_event *evt, void *data)
> +{
> +	struct vfio_pci_device *vdev = (struct vfio_pci_device *) data;
> +	struct vfio_region_fault_prod *prod_region =
> +		(struct vfio_region_fault_prod *)vdev->fault_pages;
> +	struct vfio_region_fault_cons *cons_region =
> +		(struct vfio_region_fault_cons *)(vdev->fault_pages + 2 * PAGE_SIZE);
> +	struct iommu_fault *new =
> +		(struct iommu_fault *)(vdev->fault_pages + prod_region->offset +
> +			prod_region->prod * prod_region->entry_size);
> +	int prod, cons, size;
> +
> +	mutex_lock(&vdev->fault_queue_lock);
> +
> +	if (!vdev->fault_abi)
> +		goto unlock;
> +
> +	prod = prod_region->prod;
> +	cons = cons_region->cons;
> +	size = prod_region->nb_entries;
> +
> +	if (CIRC_SPACE(prod, cons, size) < 1)
> +		goto unlock;
> +
> +	*new = evt->fault;

Could you check fault.type and return an error if it's not UNRECOV here?
If the fault is recoverable (very unlikely since the PRI capability is
disabled, but allowed) and we return an error here, then the caller
takes care of completing the fault. If we forward it to the guest
instead, the producer will wait indefinitely for a response.

Thanks,
Jean

> +	prod = (prod + 1) % size;
> +	prod_region->prod = prod;
> +	mutex_unlock(&vdev->fault_queue_lock);
> +
> +	mutex_lock(&vdev->igate);
> +	if (vdev->dma_fault_trigger)
> +		eventfd_signal(vdev->dma_fault_trigger, 1);
> +	mutex_unlock(&vdev->igate);
> +	return 0;
> +
> +unlock:
> +	mutex_unlock(&vdev->fault_queue_lock);
> +	return -EINVAL;
> +}
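
For illustration, a user-space consumer sketch matching the producer
logic above, assuming the two-region layout from patch 25 (structure
and field names mirror the kernel code; the helpers for reading the
regions are left out):

#include <stdint.h>

struct fault_prod_header {
	uint32_t version;
	uint32_t offset;	/* queue offset within the region */
	uint32_t prod;		/* producer index (kernel-owned) */
	uint32_t nb_entries;
	uint32_t entry_size;
};

/* Consume all pending records; returns how many were handled. The
 * updated cons value would then be written back through the
 * write-only consumer region. */
int drain_fault_queue(const struct fault_prod_header *hdr,
		      const void *queue, uint32_t *cons,
		      void (*handle)(const void *record))
{
	int handled = 0;

	while (*cons != hdr->prod) {	/* queue not empty */
		handle((const char *)queue + *cons * hdr->entry_size);
		*cons = (*cons + 1) % hdr->nb_entries;
		handled++;
	}
	return handled;
}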


Re: [PATCH v8 25/29] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type

2019-06-07 Thread Jean-Philippe Brucker
On 07/06/2019 09:28, Auger Eric wrote:
>>> +static const struct vfio_pci_fault_abi fault_abi_versions[] = {
>>> +   [0] = {
>>> +   .entry_size = sizeof(struct iommu_fault),
>>> +   },
>>> +};
>>> +
>>> +#define NR_FAULT_ABIS ARRAY_SIZE(fault_abi_versions)
>>
>> This looks like it's leading to some dangerous complicated code to
>> support multiple user selected ABIs.  How many ABIs do we plan to
>> support?  The region capability also exposes a type, sub-type, and
>> version.  How much of this could be exposed that way?  ie. if we need
>> to support multiple versions, expose multiple regions.
> 
> This is something that was discussed earlier and suggested by
> Jean-Philippe that we may need to support several versions of the ABI
> (typicallu when adding PRI support).
> Exposing multiple region is an interesting idea and I will explore that
> direction.

At the moment the ABI supports errors and PRI. We're considering setting
the fault report structure to 64 or 128 bytes (see "[PATCH v2 2/4]
iommu: Introduce device fault data"). 64 bytes allow for 2 additional
fields before we have to introduce a new ABI version, while 128 bytes
should last us a while.
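
A toy illustration of that sizing argument (field names made up): if
today's fault record occupies 48 bytes, padding it out to a fixed 64
bytes leaves room for two more 64-bit fields before the structure size,
and hence the ABI version, has to change.

#include <stdint.h>

struct fault_record {
	uint32_t type;
	uint32_t flags;
	uint64_t address;
	uint8_t  defined[32];	/* other currently-defined fields */
	uint64_t reserved[2];	/* space for two future fields */
};

_Static_assert(sizeof(struct fault_record) == 64,
	       "fault record must keep its 64-byte ABI footprint");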

But that's for adding new fields to existing fault types. It's probably
a good idea to have different region types in VFIO for different fault
types, since userspace isn't necessarily prepared to deal with them. For
example right now userspace doesn't have a method to complete
recoverable faults, so we can't add them to the queue.

Thanks,
Jean


RE: [PATCH] Documentation: DMA-API: fix a function name of max_mapping_size

2019-06-07 Thread Yoshihiro Shimoda
Hi Christoph,

> From: Christoph Hellwig, Sent: Friday, June 7, 2019 5:35 PM
> 
> On Fri, Jun 07, 2019 at 08:19:08AM +, Yoshihiro Shimoda wrote:
> > Hi Christoph,
> >
> > > From: Christoph Hellwig, Sent: Friday, June 7, 2019 5:08 PM
> > >
> > > Looks good.  And it seems like you've also found the solution to
> > > your usb storage problem, but I'm going to post the variant I just
> > > hacked up nevertheless.
> >
> > Thank you for your reply! I think this API is related to my problem,
> > but I don't have any actual solution (a patch) for now. So, I'll wait
> > for your patch!
> 
> Turns out it isn't as simple as I thought, as there doesn't seem to
> be an easy way to get to the struct device used for DMA mapping
> from USB drivers.  I'll need to think a bit more how to handle that
> best.

Thank you for your reply. I sent an email on the original report as below.
https://marc.info/?l=linux-block&m=155990883224615&w=2

Best regards,
Yoshihiro Shimoda



RE: How to resolve an issue in swiotlb environment?

2019-06-07 Thread Yoshihiro Shimoda
Hi Christoph,

I think we should continue to discuss this on this email thread instead of
on the fixed DMA-API.txt patch [1].

[1]
https://marc.info/?t=15598941221&r=1&w=2

> From: Yoshihiro Shimoda, Sent: Monday, June 3, 2019 3:42 PM
> 
> Hi linux-block and iommu mailing lists,
> 
> I have an issue where a USB SSD with xHCI on R-Car H3 causes "swiotlb is
> full" errors like the one below.
> 
> [   36.745286] xhci-hcd ee00.usb: swiotlb buffer is full (sz: 524288 
> bytes), total 32768 (slots), used 1338 (slots)
> 
> I have investigated this issue by using git bisect, and then I found the 
> following commit:
> 
> ---
> commit 09324d32d2a0843e66652a087da6f77924358e62
> Author: Christoph Hellwig 
> Date:   Tue May 21 09:01:41 2019 +0200
> 
> block: force an unlimited segment size on queues with a virt boundary
> ---

Thank you for your comment on other email thread [2] like below:
---
Turns out it isn't as simple as I thought, as there doesn't seem to
be an easy way to get to the struct device used for DMA mapping
from USB drivers.  I'll need to think a bit more how to handle that
best.
---

[2]
https://marc.info/?l=linux-doc&m=155989651620473&w=2

I'm not sure this is the correct way, but the issue disappears if I apply
the patch below to the USB storage driver. Notably, a WARNING was
triggered in blk_queue_max_segment_size().
Maybe we need to add a "struct device *" argument to
blk_queue_virt_boundary() so that it can call dma_max_mapping_size()?
---
diff --git a/drivers/usb/storage/scsiglue.c b/drivers/usb/storage/scsiglue.c
index 59190d8..fa37b39 100644
--- a/drivers/usb/storage/scsiglue.c
+++ b/drivers/usb/storage/scsiglue.c
@@ -28,6 +28,7 @@
  * status of a command.
  */
 
+#include <linux/dma-mapping.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
 
@@ -83,6 +84,15 @@ static int slave_alloc (struct scsi_device *sdev)
maxp = usb_maxpacket(us->pusb_dev, us->recv_bulk_pipe, 0);
blk_queue_virt_boundary(sdev->request_queue, maxp - 1);
 
+{
+   struct device *dev = us->pusb_dev->bus->controller;
+
+   dev_info(dev, "%s: size = %zu\n", __func__, dma_max_mapping_size(dev));
+   blk_queue_max_segment_size(sdev->request_queue,
+  dma_max_mapping_size(dev));
+}
+
+
/*
 * Some host controllers may have alignment requirements.
 * We'll play it safe by requiring 512-byte alignment always.
---

Best regards,
Yoshihiro Shimoda



Re: [PATCH v3] iommu/arm-smmu: Avoid constant zero in TLBI writes

2019-06-07 Thread Marc Gonzalez
On 05/06/2019 14:19, Will Deacon wrote:

> On Mon, Jun 03, 2019 at 02:15:37PM +0200, Marc Gonzalez wrote:
>
>> From: Robin Murphy 
>>
>> Apparently, some Qualcomm arm64 platforms which appear to expose their
>> SMMU global register space are still, in fact, using a hypervisor to
>> mediate it by trapping and emulating register accesses. Sadly, some
>> deployed versions of said trapping code have bugs wherein they go
>> horribly wrong for stores using r31 (i.e. XZR/WZR) as the source
>> register.
>>
>> While this can be mitigated for GCC today by tweaking the constraints
>> for the implementation of writel_relaxed(), to avoid any potential
>> arms race with future compilers more aggressively optimising register
>> allocation, the simple way is to just remove all the problematic
>> constant zeros. For the write-only TLB operations, the actual value is
>> irrelevant anyway and any old nearby variable will provide a suitable
>> GPR to encode. The one point at which we really do need a zero to clear
>> a context bank happens before any of the TLB maintenance where crashes
>> have been reported, so is apparently not a problem... :/
>>
>> Reported-by: AngeloGioacchino Del Regno 
>> Tested-by: Marc Gonzalez 
>> Signed-off-by: Robin Murphy 
>> Signed-off-by: Marc Gonzalez 
> 
> Acked-by: Will Deacon 
> 
> Joerg -- Please can you take this as a fix for 5.2, with a Cc stable?

Hello Joerg,

Can you ping this thread once this patch hits linux-next, so I can
ask Bjorn to pick up the 8998 ANOC1 DT node, and the PCIe DT node
that requires ANOC1.

Bjorn: for ANOC1, a small fixup: s/arm,smmu/iommu/

https://patchwork.kernel.org/project/linux-arm-msm/list/?series=99701
https://patchwork.kernel.org/patch/10895341/

Regards.


Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-07 Thread Jean-Philippe Brucker
On 06/06/2019 21:29, Jacob Pan wrote:
>> 	iommu_unregister_device_fault_handler(&vdev->pdev->dev);
>
>
> But this can fail if there are pending faults which leaves a
> device reference and then the system is broken :(
 This series only features unrecoverable errors and for those the
 unregistration cannot fail. Now unrecoverable errors were added I
 admit this is confusing. We need to sort this out or clean the
 dependencies.  
>>> As Alex pointed out in 4/29, we can make
>>> iommu_unregister_device_fault_handler() never fail and clean up all
>>> the pending faults in the host IOMMU belong to that device. But the
>>> problem is that if a fault, such as PRQ, has already been injected
>>> into the guest, the page response may come back after handler is
>>> unregistered and registered again.  
>>
>> I'm trying to figure out if that would be harmful in any way. I guess
>> it can be a bit nasty if we handle the page response right after
>> having injected a new page request that uses the same PRGI. In any
>> other case we discard the page response, but here we forward it to
>> the endpoint and:
>>
>> * If the response status is success, endpoint retries the
>> translation. The guest probably hasn't had time to handle the new
>> page request and translation will fail, which may lead the endpoint
>> to give up (two unsuccessful translation requests). Or send a new
>> request
>>
> Good point, there shouldn't be any harm if the page response is a
> "fake" success. In fact it could happen in the normal operation when
> PRQs to two devices share the same non-leaf translation structure. The
> worst case is just a retry. I am not aware of the retry limit, is it in
> the PCIe spec? I cannot find it.

I don't think so, it's the implementation's choice. In general I don't
think devices will have a retry limit, but it doesn't seem like the PCI
spec prevents them from implementing one either. It could be useful to
stop retrying after a certain number of faults, for preventing livelocks
when the OS doesn't fix up the page tables and the device would just
repeat the fault indefinitely.

> I think we should just document it, similar to having a spurious
> interrupt. The PRQ trace event should capture that as well.
> 
>> * otherwise the endpoint won't retry the access, and could also
>> disable PRI if the status is failure.
>>
> That would be true regardless this race condition with handler
> registration. So should be fine.

We do give an invalid response for the old PRG (because of unregistering),
but also for the new one, which has a different address that the guest
might be able to page in and would normally return success.

>>> We need a way to reject such page response belong
>>> to the previous life of the handler. Perhaps a sync call to the
>>> guest with your fault queue eventfd? I am not sure.  
>>
>> We could simply expect the device driver not to send any page response
>> after unregistering the fault handler. Is there any reason VFIO would
>> need to unregister and re-register the fault handler on a live guest?
>>
> There is no reason for VFIO to unregister and register again, I was
> just thinking from security perspective. Someone could write a VFIO app
> do this attack. But I agree the damage is within the device, may get
> PRI disabled as a result.

Yes I think the damage would always be contained within the misbehaving
software

> So it seems we agree on the following:
> - iommu_unregister_device_fault_handler() will never fail
> - iommu driver cleans up all pending faults when handler is unregistered
> - assume device driver or guest not sending more page response _after_
>   handler is unregistered.
> - system will tolerate rare spurious response
> 
> Sounds right?

Yes, I'll add that to the fault series

Thanks,
Jean


Re: [PATCH] driver: core: Allow subsystems to continue deferring probe

2019-06-07 Thread Ulf Hansson
+ Rob

On Wed, 5 Jun 2019 at 16:23, Thierry Reding  wrote:
>
> From: Thierry Reding 
>
> Some subsystems, such as pinctrl, allow continuing to defer probe
> indefinitely. This is useful for devices that depend on resources
> provided by devices that are only probed after the init stage.
>
> One example of this can be seen on Tegra, where the DPAUX hardware
> contains pinmuxing controls for pins that it shares with an I2C
> controller. The I2C controller is typically used for communication
> with a monitor over HDMI (DDC). However, other instances of the I2C
> controller are used to access system critical components, such as a
> PMIC. The I2C controller driver will therefore usually be a builtin
> driver, whereas the DPAUX driver is part of the display driver that
> is loaded from a module to avoid bloating the kernel image with all
> of the DRM/KMS subsystem.
>
> In this particular case the pins used by this I2C/DDC controller
> become accessible very late in the boot process. However, since the
> controller is only used in conjunction with display, that's not an
> issue.
>
> Unfortunately the driver core currently outputs a warning message
> when a device fails to get the pinctrl before the end of the init
> stage. That can be confusing for the user because it may sound like
> an unwanted error occurred, whereas it's really an expected and
> harmless situation.
>
> In order to eliminate this warning, this patch allows callers of the
> driver_deferred_probe_check_state() helper to specify that they want
> to continue deferring probe, regardless of whether we're past the
> init stage or not. All of the callers of that function are updated
> for the new signature, but only the pinctrl subsystem passes a true
> value in the new persist parameter if appropriate.
>
> Signed-off-by: Thierry Reding 

Overall this looks good to me. I guess Greg prefers a separate
function, which sets a flag for the device to switch to this new
behavior. That seems like a reasonable change to make and avoids
changing calls to driver_deferred_probe_check_state().

Kind regards
Uffe
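
A toy model of that flag-based alternative (names and types are
hypothetical, not the actual driver-core API): the subsystem marks the
device once, and the check helper keeps a single-argument form.

#include <errno.h>
#include <stdbool.h>

#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517	/* kernel-internal errno, absent from userspace headers */
#endif

struct toy_device {
	bool defer_probe_persist;	/* set once by e.g. pinctrl */
	bool initcalls_done;
};

static int toy_deferred_probe_check_state(struct toy_device *dev)
{
	if (dev->initcalls_done && !dev->defer_probe_persist)
		return -ENODEV;		/* built-in drivers had their chance */
	return -EPROBE_DEFER;		/* keep deferring */
}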

> ---
>  drivers/base/dd.c| 17 -
>  drivers/base/power/domain.c  |  2 +-
>  drivers/iommu/of_iommu.c |  2 +-
>  drivers/pinctrl/devicetree.c | 10 ++
>  include/linux/device.h   |  2 +-
>  5 files changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 0df9b4461766..25ffbadf4187 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -238,23 +238,30 @@ __setup("deferred_probe_timeout=", 
> deferred_probe_timeout_setup);
>  /**
>   * driver_deferred_probe_check_state() - Check deferred probe state
>   * @dev: device to check
> + * @persist: Boolean flag indicating whether drivers should keep trying to
> + *   probe after built-in drivers have had a chance to probe. This is useful
> + *   for built-in drivers that rely on resources provided by modular drivers.
>   *
>   * Returns -ENODEV if init is done and all built-in drivers have had a chance
> - * to probe (i.e. initcalls are done), -ETIMEDOUT if deferred probe debug
> - * timeout has expired, or -EPROBE_DEFER if none of those conditions are met.
> + * to probe (i.e. initcalls are done) and unless persist is set, -ETIMEDOUT if
> + * deferred probe debug timeout has expired, or -EPROBE_DEFER if none of those
> + * conditions are met.
>   *
>   * Drivers or subsystems can opt-in to calling this function instead of 
> directly
>   * returning -EPROBE_DEFER.
>   */
> -int driver_deferred_probe_check_state(struct device *dev)
> +int driver_deferred_probe_check_state(struct device *dev, bool persist)
>  {
> if (initcalls_done) {
> if (!deferred_probe_timeout) {
> 			dev_WARN(dev, "deferred probe timeout, ignoring dependency");
> return -ETIMEDOUT;
> }
> -		dev_warn(dev, "ignoring dependency for device, assuming no driver");
> -		return -ENODEV;
> +
> +		if (!persist) {
> +			dev_warn(dev, "ignoring dependency for device, assuming no driver");
> +			return -ENODEV;
> +		}
> }
> return -EPROBE_DEFER;
>  }
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index 33c30c1e6a30..effa5276a773 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -2423,7 +2423,7 @@ static int __genpd_dev_pm_attach(struct device *dev, 
> struct device *base_dev,
> 		mutex_unlock(&gpd_list_lock);
> dev_dbg(dev, "%s() failed to find PM domain: %ld\n",
> __func__, PTR_ERR(pd));
> -   return driver_deferred_probe_check_state(base_dev);
> +   return driver_deferred_probe_check_state(base_dev, false);
> }
>
> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
> diff --git 

Re: [PATCH] Documentation: DMA-API: fix a function name of max_mapping_size

2019-06-07 Thread Christoph Hellwig
On Fri, Jun 07, 2019 at 08:19:08AM +, Yoshihiro Shimoda wrote:
> Hi Christoph,
> 
> > From: Christoph Hellwig, Sent: Friday, June 7, 2019 5:08 PM
> > 
> > Looks good.  And it seems like you've also found the solution to
> > your usb storage problem, but I'm going to post the variant I just
> > hacked up nevertheless.
> 
> Thank you for your reply! I think this API is related to my problem,
> but I don't have any actual solution (a patch) for now. So, I'll wait
> for your patch!

Turns out it isn't as simple as I thought, as there doesn't seem to
be an easy way to get to the struct device used for DMA mapping
from USB drivers.  I'll need to think a bit more how to handle that
best.


Re: [PATCH v8 24/29] vfio: VFIO_IOMMU_BIND/UNBIND_MSI

2019-06-07 Thread Auger Eric
Hi Alex,

On 6/4/19 12:32 AM, Alex Williamson wrote:
> On Sun, 26 May 2019 18:09:59 +0200
> Eric Auger  wrote:
> 
>> This patch adds the VFIO_IOMMU_BIND/UNBIND_MSI ioctls which aim
>> to pass/withdraw the guest MSI binding to/from the host.
>>
>> Signed-off-by: Eric Auger 
>>
>> ---
>> v6 -> v7:
>> - removed the dev arg
>>
>> v3 -> v4:
>> - add UNBIND
>> - unwind on BIND error
>>
>> v2 -> v3:
>> - adapt to new proto of bind_guest_msi
>> - directly use vfio_iommu_for_each_dev
>>
>> v1 -> v2:
>> - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 64 +
>>  include/uapi/linux/vfio.h   | 29 +++
>>  2 files changed, 93 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 6fda4fbc9bfa..18142cb078a3 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -1832,6 +1832,42 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
>>  	return iommu_cache_invalidate(dc->domain, dev, &dc->info);
>>  }
>>  
>> +static int
>> +vfio_bind_msi(struct vfio_iommu *iommu,
>> +	      dma_addr_t giova, phys_addr_t gpa, size_t size)
>> +{
>> +	struct vfio_domain *d;
>> +	int ret = 0;
>> +
>> +	mutex_lock(&iommu->lock);
>> +
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		ret = iommu_bind_guest_msi(d->domain, giova, gpa, size);
>> +		if (ret)
>> +			goto unwind;
>> +	}
>> +	goto unlock;
>> +unwind:
>> +	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) {
>> +		iommu_unbind_guest_msi(d->domain, giova);
>> +	}
>> +unlock:
>> +	mutex_unlock(&iommu->lock);
>> +	return ret;
>> +}
>> +
>> +static void
>> +vfio_unbind_msi(struct vfio_iommu *iommu, dma_addr_t giova)
>> +{
>> +	struct vfio_domain *d;
>> +
>> +	mutex_lock(&iommu->lock);
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		iommu_unbind_guest_msi(d->domain, giova);
>> +	}
>> +	mutex_unlock(&iommu->lock);
>> +}
>> +
>>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>> unsigned int cmd, unsigned long arg)
>>  {
>> @@ -1936,6 +1972,34 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  );
>>  mutex_unlock(>lock);
>>  return ret;
>> +	} else if (cmd == VFIO_IOMMU_BIND_MSI) {
>> +		struct vfio_iommu_type1_bind_msi ustruct;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_bind_msi,
>> +				    size);
>> +
>> +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (ustruct.argsz < minsz || ustruct.flags)
>> +			return -EINVAL;
>> +
>> +		return vfio_bind_msi(iommu, ustruct.iova, ustruct.gpa,
>> +				     ustruct.size);
>> +	} else if (cmd == VFIO_IOMMU_UNBIND_MSI) {
>> +		struct vfio_iommu_type1_unbind_msi ustruct;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_unbind_msi,
>> +				    iova);
>> +
>> +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (ustruct.argsz < minsz || ustruct.flags)
>> +			return -EINVAL;
>> +
>> +		vfio_unbind_msi(iommu, ustruct.iova);
>> +		return 0;
>>  }
>>  
>>  return -ENOTTY;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 055aa9b9745a..2774a1ab37ae 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -798,6 +798,35 @@ struct vfio_iommu_type1_cache_invalidate {
>>  };
>>  #define VFIO_IOMMU_CACHE_INVALIDATE  _IO(VFIO_TYPE, VFIO_BASE + 24)
>>  
>> +/**
>> + * VFIO_IOMMU_BIND_MSI - _IOWR(VFIO_TYPE, VFIO_BASE + 25,
>> + *  struct vfio_iommu_type1_bind_msi)
>> + *
>> + * Pass a stage 1 MSI doorbell mapping to the host so that this
>> + * latter can build a nested stage2 mapping
>> + */
>> +struct vfio_iommu_type1_bind_msi {
>> +__u32   argsz;
>> +__u32   flags;
>> +__u64   iova;
>> +__u64   gpa;
>> +__u64   size;
>> +};
>> +#define VFIO_IOMMU_BIND_MSI  _IO(VFIO_TYPE, VFIO_BASE + 25)
>> +
>> +/**
>> + * VFIO_IOMMU_UNBIND_MSI - _IOWR(VFIO_TYPE, VFIO_BASE + 26,
>> + *  struct vfio_iommu_type1_unbind_msi)
>> + *
>> + * Unregister an MSI mapping
>> + */
>> +struct vfio_iommu_type1_unbind_msi {
>> +__u32   argsz;
>> +__u32   flags;
>> +__u64   iova;
>> +};
>> +#define VFIO_IOMMU_UNBIND_MSI  _IO(VFIO_TYPE, VFIO_BASE + 26)
>> +
>>  /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
>>  
>>  /*
> 
> And another pair of ioctls.  Maybe think about how we can reduce the
> ioctl bloat of this series.  I don't want to impose an awkward
> interface 
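
For reference, a minimal sketch of how a VMM might drive the two proposed
ioctls from userspace. The container fd and the doorbell addresses are
illustrative assumptions, not part of the series:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Bind the guest's MSI doorbell (giova -> gpa) through the proposed
 * ioctl; flags must be zero per the argsz/flags checks above. */
static int bind_msi_doorbell(int container, __u64 giova, __u64 gpa,
                             __u64 size)
{
        struct vfio_iommu_type1_bind_msi bind;

        memset(&bind, 0, sizeof(bind));
        bind.argsz = sizeof(bind);
        bind.iova = giova;
        bind.gpa = gpa;
        bind.size = size;

        return ioctl(container, VFIO_IOMMU_BIND_MSI, &bind);
}

/* Tear the mapping down again, e.g. when the guest reprograms its MSIs. */
static int unbind_msi_doorbell(int container, __u64 giova)
{
        struct vfio_iommu_type1_unbind_msi unbind;

        memset(&unbind, 0, sizeof(unbind));
        unbind.argsz = sizeof(unbind);
        unbind.iova = giova;

        return ioctl(container, VFIO_IOMMU_UNBIND_MSI, &unbind);
}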

Re: [PATCH v8 25/29] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type

2019-06-07 Thread Auger Eric
Hi Alex,

On 6/4/19 12:31 AM, Alex Williamson wrote:
> On Sun, 26 May 2019 18:10:00 +0200
> Eric Auger  wrote:
> 
>> This patch adds two new regions aiming to handle nested mode
>> translation faults.
>>
>> The first region (two host kernel pages) is read-only from the
>> user-space perspective. The first page contains a header
>> that provides information about the circular buffer located in the
>> second page. The circular buffer is placed in a separate page so
>> that it can be mmapped.
>>
>> The max user API version supported by the kernel is returned
>> through a dedicated fault region capability.
>>
>> The prod header contains
>> - the user API version in use (potentially inferior to the one
>>   returned in the capability),
>> - the offset of the queue within the region,
>> - the producer index relative to the start of the queue
>> - the max number of fault records,
>> - the size of each record.
>>
>> The second region is write-only from the user perspective. It
>> contains the version of the requested fault ABI and the consumer
>> index, which the userspace updates each time it has
>> consumed fault records.
>>
>> The natural order of operation for the userspace is:
>> - retrieve the highest supported fault ABI version
>> - set the requested fault ABI version in the consumer region
>>
>> Until the ABI version has been set by the userspace, the kernel
>> cannot return a comprehensive set of information inside the
>> prod header (entry size and number of entries in the fault queue).
> 
> It's not clear to me why two regions are required for this.  If the
> first page is not mmap capable, why does it need to be read-only?  If
> it were not read-only couldn't the fields of the second region also fit
> within this first page?  If you wanted to deal with an mmap capable
> writeable region, it could just be yet a 3rd page in the first region.
I thought it would be clearer for the userspace to have 2 separate
regions, one for the producer and one for the consumer. Otherwise I will
need to specify which fields are read-only or write-only. But this may
be more self-contained in a single region.
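
To make the layout concrete, here is a reconstruction of what the two
headers could look like; the field names follow the way the fault handler
code later in this digest uses them, but the exact uapi definitions live
in the (unquoted) include/uapi/linux/vfio.h hunk of the patch:

/* Producer page (read-only for userspace), filled by the kernel. */
struct vfio_region_fault_prod {
        __u32   version;        /* fault ABI version in use */
        __u32   nb_entries;     /* max number of fault records */
        __u32   entry_size;     /* size of each fault record */
        __u32   offset;         /* offset of the fault queue in the region */
        __u32   prod;           /* producer index, relative to queue start */
};

/* Consumer page (write-only for userspace), updated by userspace. */
struct vfio_region_fault_cons {
        __u32   version;        /* requested fault ABI version */
        __u32   cons;           /* consumer index, advanced after reading */
};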
> 
>>
>> Signed-off-by: Eric Auger 
>>
>> ---
>>
>> v4 -> v5
>> - check cons is not null in vfio_pci_check_cons_fault
>>
>> v3 -> v4:
>> - use 2 separate regions, respectively in read and write modes
>> - add the version capability
>> ---
>>  drivers/vfio/pci/vfio_pci.c | 105 
>>  drivers/vfio/pci/vfio_pci_private.h |  17 +
>>  drivers/vfio/pci/vfio_pci_rdwr.c|  73 +++
>>  include/uapi/linux/vfio.h   |  42 +++
>>  4 files changed, 237 insertions(+)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index cab71da46f4a..f75f61127277 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -261,6 +261,106 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
>>  return ret;
>>  }
>>  
>> +void vfio_pci_fault_release(struct vfio_pci_device *vdev,
>> +struct vfio_pci_region *region)
>> +{
>> +}
>> +
>> +static const struct vfio_pci_fault_abi fault_abi_versions[] = {
>> +[0] = {
>> +.entry_size = sizeof(struct iommu_fault),
>> +},
>> +};
>> +
>> +#define NR_FAULT_ABIS ARRAY_SIZE(fault_abi_versions)
> 
> This looks like it's leading to some dangerous complicated code to
> support multiple user selected ABIs.  How many ABIs do we plan to
> support?  The region capability also exposes a type, sub-type, and
> version.  How much of this could be exposed that way?  ie. if we need
> to support multiple versions, expose multiple regions.

This was discussed earlier: Jean-Philippe suggested that we may need to
support several versions of the ABI (typically when adding PRI support).
Exposing multiple regions is an interesting idea and I will explore that
direction.
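
For illustration only, the multiple-regions alternative could be expressed
with one region sub-type per fault ABI version, so that userspace
negotiates the version by picking a region rather than writing a version
into the consumer header. The sub-type names and values below are
hypothetical:

/* Hypothetical uapi sketch: one fault region sub-type per ABI version. */
#define VFIO_REGION_SUBTYPE_NESTED_FAULT_V1     (1)
#define VFIO_REGION_SUBTYPE_NESTED_FAULT_V2     (2)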
> 
>> +
>> +static int vfio_pci_fault_prod_add_capability(struct vfio_pci_device *vdev,
>> +struct vfio_pci_region *region, struct vfio_info_cap *caps)
>> +{
>> +struct vfio_region_info_cap_fault cap = {
>> +.header.id = VFIO_REGION_INFO_CAP_PRODUCER_FAULT,
>> +.header.version = 1,
>> +.version = NR_FAULT_ABIS,
>> +};
>> +return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
>> +}
>> +
>> +static const struct vfio_pci_regops vfio_pci_fault_cons_regops = {
>> +.rw = vfio_pci_fault_cons_rw,
>> +.release= vfio_pci_fault_release,
>> +};
>> +
>> +static const struct vfio_pci_regops vfio_pci_fault_prod_regops = {
>> +.rw = vfio_pci_fault_prod_rw,
>> +.release= vfio_pci_fault_release,
>> +.add_capability = vfio_pci_fault_prod_add_capability,
>> +};
>> +
>> +static int vfio_pci_init_fault_region(struct vfio_pci_device *vdev)
>> +{
>> +struct vfio_region_fault_prod *header;
>> +int ret;
>> +
>> +
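
The quoted hunk is cut off here. For orientation, discovering the
capability added by vfio_pci_fault_prod_add_capability() above could look
roughly like this in userspace; VFIO_REGION_INFO_CAP_PRODUCER_FAULT and
struct vfio_region_info_cap_fault come from the series, the rest is
standard VFIO region probing:

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Return the max fault ABI version advertised for a device region,
 * or 0 if the capability is absent or the lookup fails. */
static __u32 fault_abi_max_version(int device, __u32 index)
{
        struct vfio_region_info probe = {
                .argsz = sizeof(probe),
                .index = index,
        };
        struct vfio_region_info *info;
        __u32 off, version = 0;

        /* First call only learns the full size, capabilities included. */
        if (ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &probe) ||
            !(probe.flags & VFIO_REGION_INFO_FLAG_CAPS))
                return 0;

        info = calloc(1, probe.argsz);
        if (!info)
                return 0;
        info->argsz = probe.argsz;
        info->index = index;
        if (ioctl(device, VFIO_DEVICE_GET_REGION_INFO, info))
                goto out;

        /* Walk the capability chain; 'next' offsets are info-relative. */
        for (off = info->cap_offset; off;) {
                struct vfio_info_cap_header *hdr =
                        (struct vfio_info_cap_header *)((char *)info + off);

                if (hdr->id == VFIO_REGION_INFO_CAP_PRODUCER_FAULT) {
                        version =
                            ((struct vfio_region_info_cap_fault *)hdr)->version;
                        break;
                }
                off = hdr->next;
        }
out:
        free(info);
        return version;
}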

RE: [PATCH] Documentation: DMA-API: fix a function name of max_mapping_size

2019-06-07 Thread Yoshihiro Shimoda
Hi Christoph,

> From: Christoph Hellwig, Sent: Friday, June 7, 2019 5:08 PM
> 
> Looks good.  And it seems like you've also found the solution to
> your usb storage problem, but I'm going to post the variant I just
> hacked up nevertheless.

Thank you for your reply! I think this API is related to my problem,
but I don't have any actual solution (a patch) for now. So, I'll wait
for your patch!

Best regards,
Yoshihiro Shimoda



Re: [PATCH] Documentation: DMA-API: fix a function name of max_mapping_size

2019-06-07 Thread Christoph Hellwig
Looks good.  And it seems like you've also found the solution to
your usb storage problem, but I'm going to post the variant I just
hacked up nevertheless.


[PATCH] Documentation: DMA-API: fix a function name of max_mapping_size

2019-06-07 Thread Yoshihiro Shimoda
The exported function name is dma_max_mapping_size(), not
dma_direct_max_mapping_size(), so fix the function name in
the documentation.

Fixes: 133d624b1cee ("dma: Introduce dma_max_mapping_size()")
Signed-off-by: Yoshihiro Shimoda 
---
 Documentation/DMA-API.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index 0076150..e47c63b 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -198,7 +198,7 @@ call to set the mask to the value returned.
 ::
 
size_t
-   dma_direct_max_mapping_size(struct device *dev);
+   dma_max_mapping_size(struct device *dev);
 
 Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and
-- 
2.7.4
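
As a usage illustration (editorial, not part of the patch), a driver
would typically query the helper once and clamp its own per-request
limit; swiotlb bounce buffering is the case where the returned value can
be smaller than the device's native limit. foo_max_transfer() is a
hypothetical helper name:

#include <linux/dma-mapping.h>
#include <linux/kernel.h>

/* Clamp a driver's transfer limit to what the DMA layer can map. */
static unsigned int foo_max_transfer(struct device *dev,
                                     unsigned int driver_max)
{
        size_t dma_max = dma_max_mapping_size(dev);

        return min_t(unsigned int, driver_max, dma_max);
}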



Re: [PATCH v8 26/29] vfio-pci: Register an iommu fault handler

2019-06-07 Thread Auger Eric
Hi Jean, Jacob,

On 6/6/19 10:29 PM, Jacob Pan wrote:
> On Thu, 6 Jun 2019 19:54:05 +0100
> Jean-Philippe Brucker  wrote:
> 
>> On 05/06/2019 23:45, Jacob Pan wrote:
>>> On Tue, 4 Jun 2019 18:11:08 +0200
>>> Auger Eric  wrote:
>>>   
 Hi Alex,

 On 6/4/19 12:31 AM, Alex Williamson wrote:  
> On Sun, 26 May 2019 18:10:01 +0200
> Eric Auger  wrote:
> 
>> This patch registers a fault handler which records faults in
>> a circular buffer and then signals an eventfd. This buffer is
>> exposed within the fault region.
>>
>> Signed-off-by: Eric Auger 
>>
>> ---
>>
>> v3 -> v4:
>> - move iommu_unregister_device_fault_handler to vfio_pci_release
>> ---
>>  drivers/vfio/pci/vfio_pci.c         | 49 +
>>  drivers/vfio/pci/vfio_pci_private.h |  1 +
>>  2 files changed, 50 insertions(+)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index f75f61127277..52094ba8 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -30,6 +30,7 @@
>>  #include <linux/vfio.h>
>>  #include <linux/vgaarb.h>
>>  #include <linux/nospec.h>
>> +#include <linux/circ_buf.h>
>>  
>>  #include "vfio_pci_private.h"
>>  
>> @@ -296,6 +297,46 @@ static const struct vfio_pci_regops vfio_pci_fault_prod_regops = {
>>  	.add_capability = vfio_pci_fault_prod_add_capability,
>>  };
>>  
>> +int vfio_pci_iommu_dev_fault_handler(struct iommu_fault_event *evt, void *data)
>> +{
>> +	struct vfio_pci_device *vdev = (struct vfio_pci_device *) data;
>> +	struct vfio_region_fault_prod *prod_region =
>> +		(struct vfio_region_fault_prod *)vdev->fault_pages;
>> +	struct vfio_region_fault_cons *cons_region =
>> +		(struct vfio_region_fault_cons *)(vdev->fault_pages + 2 * PAGE_SIZE);
>> +	struct iommu_fault *new =
>> +		(struct iommu_fault *)(vdev->fault_pages + prod_region->offset +
>> +				       prod_region->prod * prod_region->entry_size);
>> +	int prod, cons, size;
>> +
>> +	mutex_lock(&vdev->fault_queue_lock);
>> +
>> +	if (!vdev->fault_abi)
>> +		goto unlock;
>> +
>> +	prod = prod_region->prod;
>> +	cons = cons_region->cons;
>> +	size = prod_region->nb_entries;
>> +
>> +	if (CIRC_SPACE(prod, cons, size) < 1)
>> +		goto unlock;
>> +
>> +	*new = evt->fault;
>> +	prod = (prod + 1) % size;
>> +	prod_region->prod = prod;
>> +	mutex_unlock(&vdev->fault_queue_lock);
>> +
>> +	mutex_lock(&vdev->igate);
>> +	if (vdev->dma_fault_trigger)
>> +		eventfd_signal(vdev->dma_fault_trigger, 1);
>> +	mutex_unlock(&vdev->igate);
>> +	return 0;
>> +
>> +unlock:
>> +	mutex_unlock(&vdev->fault_queue_lock);
>> +	return -EINVAL;
>> +}
>> +
>>  static int vfio_pci_init_fault_region(struct vfio_pci_device *vdev)
>>  {
>>  	struct vfio_region_fault_prod *header;
>> @@ -328,6 +369,13 @@ static int vfio_pci_init_fault_region(struct vfio_pci_device *vdev)
>>  	header = (struct vfio_region_fault_prod *)vdev->fault_pages;
>>  	header->version = -1;
>>  	header->offset = PAGE_SIZE;
>> +
>> +	ret = iommu_register_device_fault_handler(&vdev->pdev->dev,
>> +						  vfio_pci_iommu_dev_fault_handler,
>> +						  vdev);
>> +	if (ret)
>> +		goto out;
>> +
>>  	return 0;
>>  out:
>>  	kfree(vdev->fault_pages);
>> @@ -570,6 +618,7 @@ static void vfio_pci_release(void *device_data)
>>  	if (!(--vdev->refcnt)) {
>>  		vfio_spapr_pci_eeh_release(vdev->pdev);
>>  		vfio_pci_disable(vdev);
>> +		iommu_unregister_device_fault_handler(&vdev->pdev->dev);
>
>
> But this can fail if there are pending faults which leaves a
> device reference and then the system is broken :(
 This series only features unrecoverable errors and for those the
 unregistration cannot fail. Now that unrecoverable errors were added, I
 admit this is confusing. We need to sort this out or clean up the
 dependencies.  
>>> As Alex pointed out in 4/29, we can make
>>> iommu_unregister_device_fault_handler() never fail and clean up all
>>> the pending faults in the host IOMMU belonging to that device. But the
>>> problem is that if a fault, such as PRQ, has already been injected
>>> into the guest, the page response may come back after the handler is
>>> unregistered and registered again.  
>>
>> I'm trying to figure out if that would be harmful in any way. I guess
>> it can be a bit nasty if we handle the page response right after
>> having 
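
For orientation, the userspace consumer loop that pairs with the producer
above might look like this; the header layouts follow the reconstruction
given earlier in this digest, and consume_record() is a hypothetical
callback:

#include <linux/types.h>

extern void consume_record(void *record);       /* hypothetical */

/* Drain every record the kernel has produced since the last call, then
 * publish the new consumer index so the slots can be reused. */
static void drain_fault_queue(struct vfio_region_fault_prod *prod_hdr,
                              struct vfio_region_fault_cons *cons_hdr,
                              char *queue)
{
        __u32 prod = prod_hdr->prod;    /* snapshot the producer index */
        __u32 cons = cons_hdr->cons;
        __u32 size = prod_hdr->nb_entries;

        while (cons != prod) {
                consume_record(queue + cons * prod_hdr->entry_size);
                cons = (cons + 1) % size;
        }
        cons_hdr->cons = cons;
}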

RE: [RFC PATCH v5 3/8] iommu: add a new capable IOMMU_CAP_MERGING

2019-06-07 Thread Yoshihiro Shimoda
Hi Christoph,

> From: Christoph Hellwig, Sent: Friday, June 7, 2019 2:50 PM
> 
> On Fri, Jun 07, 2019 at 05:41:56AM +0000, Yoshihiro Shimoda wrote:
> > > bool blk_can_use_iommu_merging(struct request_queue *q, struct device 
> > > *dev)
> > > {
> > >   if (!IOMMU_CAN_MERGE_SEGMENTS(dev))
> > >   return false;
> >
> > As Robin mentioned, all IOMMUs can merge segments so that we don't need
> > this condition, IIUC. However, this should check whether the device is 
> > mapped
> > on iommu by using device_iommu_mapped().
> 
> There are plenty of dma_map_ops based drivers that can't merge segments.
> Examples:
> 
>  - arch/ia64/sn/pci/pci_dma.c
>  - arch/mips/jazz/jazzdma.c
>  - arch/sparc/mm/io-unit.c
>  - arch/sparc/mm/iommu.c
>  - arch/x86/kernel/pci-calgary_64.c

Thank you for the pointers. I'll check those files.

> Nevermind the direct mapping, swiotlb and other weirdos.

I got it.

> > >   blk_queue_virt_boundary(q, IOMMU_PAGE_SIZE(dev));
> > >   blk_queue_max_segment_size(q, IOMMU_MAX_SEGMENT_SIZE(dev));
> >
> > By the way, I reported an issue [1] and I'm thinking the dma_is_direct()
> > environment (especially swiotlb) also needs such max_segment_size
> > changes somehow.
> > What do you think?
> >
> > [1]
> > https://marc.info/?l=linux-block&m=155954415603356&w=2
> 
> That doesn't seem to be related to the segment merging.  I'll take
> a look, but next time please Cc the author of a suspect commit if
> you already bisect things.

Oops. I'll Cc the author next time.

Best regards,
Yoshihiro Shimoda
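
Pulling the thread together, a hedged sketch of the helper under
discussion follows. Note that device_iommu_mapped() alone is not
sufficient, per the dma_map_ops based drivers listed above that cannot
merge segments, and IOMMU_PAGE_SIZE()/IOMMU_MAX_SEGMENT_SIZE() remain
placeholders from the pseudocode, not existing kernel APIs:

#include <linux/blkdev.h>
#include <linux/iommu.h>

/* Sketch only: advertise merging to the block layer solely when the
 * device sits behind an IOMMU that can actually merge segments. */
bool blk_can_use_iommu_merging(struct request_queue *q, struct device *dev)
{
        if (!device_iommu_mapped(dev))
                return false;

        /* blk_queue_virt_boundary() takes a mask, hence the "- 1". */
        blk_queue_virt_boundary(q, IOMMU_PAGE_SIZE(dev) - 1);
        blk_queue_max_segment_size(q, IOMMU_MAX_SEGMENT_SIZE(dev));
        return true;
}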