Re: [PATCH v3 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-06-20 Thread Oza Oza
On Tue, Jun 20, 2017 at 4:09 AM, Bjorn Helgaas  wrote:
> On Tue, Jun 13, 2017 at 09:58:22AM +0530, Oza Oza wrote:
>> On Tue, Jun 13, 2017 at 5:00 AM, Bjorn Helgaas  wrote:
>> > Please wrap your changelogs to use 75 columns.  "git log" indents the
>> > changelog by four spaces, so if your text is 75 wide, it will still
>> > fit without wrapping.
>> >
>> > On Sun, Jun 11, 2017 at 09:35:37AM +0530, Oza Pawandeep wrote:
>> >> For Configuration Requests only, following reset
>> >> it is possible for a device to terminate the request
>> >> but indicate that it is temporarily unable to process
>> >> the Request, but will be able to process the Request
>> >> in the future – in this case, the Configuration Request
>> >> Retry Status 10 (CRS) Completion Status is used
>> >
>> > How does this relate to the CRS support we already have in the core,
>> > e.g., pci_bus_read_dev_vendor_id()?  It looks like your root complex
>> > already returns 0xffff0001 (CFG_RETRY_STATUS) in some cases.
>> >
>> > Also, per spec (PCIe r3.1, sec 2.3.2), CRS Software Visibility only
>> > affects config reads of the Vendor ID, but you call
>> > iproc_pcie_cfg_retry() for all config offsets.
>>
>> Yes, as per Spec, CRS Software Visibility only affects config read of
>> the Vendor ID.
>> For config write or any other config read the Root must automatically
>> re-issue configuration
>> request again as a new request, and our PCIe RC fails to do so.
>
> OK, if this is a workaround for a hardware defect, let's make that
> explicit in the changelog (and probably a comment in the code, too).
>
> I'm actually not sure the spec *requires* the CRS retries to be done
> directly in hardware, so it's conceivable the hardware could be
> working as designed.  But a comment would go a long way toward making
> this understandable by differentiating it from the generic CRS
> handling in the core.
>
> Bjorn

Sure, I will update the commit message and also add an explicit comment
in the code.
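
For reference, this is roughly the shape of the retry path being discussed.
CFG_RETRY_STATUS is the 0xffff0001 pattern mentioned above and the helper
name matches the one Bjorn quoted, but the retry cap and the give-up value
here are placeholders, not the final numbers:

#define CFG_RETRY_STATUS        0xffff0001      /* controller-specific pattern */

static unsigned int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
{
        int timeout = 500000;           /* arbitrary cap, in 1 us steps */
        unsigned int data;

        /*
         * The EP answered with CRS, and our RC does not re-issue the
         * request in hardware; it hands this magic value back to software
         * instead, so keep re-reading until real data shows up.
         */
        data = readl(cfg_data_p);
        while (data == CFG_RETRY_STATUS && timeout--) {
                udelay(1);
                data = readl(cfg_data_p);
        }

        if (data == CFG_RETRY_STATUS)
                data = 0xffffffff;      /* give up: report as no device */

        return data;
}

The comment in the final version would spell out that this is a workaround
for the RC not re-issuing configuration requests on CRS, as discussed above.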

Regards,
Oza.


Re: [PATCH v3 2/2] PCI: iproc: add device shutdown for PCI RC

2017-06-21 Thread Oza Oza
On Thu, Jun 15, 2017 at 7:11 PM, Bjorn Helgaas  wrote:
> On Wed, Jun 14, 2017 at 10:24:11AM +0530, Oza Oza wrote:
>> On Tue, Jun 13, 2017 at 5:13 AM, Bjorn Helgaas  wrote:
>> > On Sun, Jun 11, 2017 at 09:35:38AM +0530, Oza Pawandeep wrote:
>> >> PERST# must be asserted around ~500ms before
>> >> the reboot is applied.
>> >>
>> >> During soft reset (e.g., "reboot" from Linux) on some iProc based SoCs
>> >> LCPLL clock and PERST both goes off simultaneously.
>> >> This will cause certain Endpoints Intel NVMe not get detected, upon
>> >> next boot sequence.
>> >>
>> >> This happens because; Endpoint is expecting the clock for some amount of
>> >> time after PERST is asserted, which is not happening in our case
>> >> (Compare to Intel X86 boards, will have clocks running).
>> >> this cause NVMe to behave in undefined way.
>> >
>> > This doesn't smell right.  The description makes this sound like a
>> > band-aid that happens to "solve" a problem with one particular device.
>> > It doesn't sound like this is making iProc adhere to something in the
>> > PCIe spec.
>> >
>> > Is there some PCIe spec timing requirement that tells you about this
>> > 500ms number?  Please include the reference if so.
>>
>> On chip PLL which sources the reference clock to the device gets reset
>> when soft reset is applied; the soft reset also asserts PERST#.
>> This simultaneous assertion of reset and clock being lost is a problem
>> with some NVME cards.
>> The delay is added to keep clock alive while PERST gets asserted.
>> The time delay number can be set to a number that will allow the NVME
>> device to go into reset state.
>> 500 ms delay is picked for that reason only, which is long enough to
>> get EP into reset correctly.
>
> This doesn't really tell me anything new.
>
> We're talking about the protocol on the link between the RC and the
> endpoint.  That *should* be completely specified by the PCIe spec.  If
> there's some requirement that the clock keep running after PERST is
> asserted, that should be in the spec.
>
> If it's not in the spec, and an endpoint relies on it, the endpoint is
> non-compliant, and we'll likely see similar issues with that endpoint
> on other platforms.
>
> If we need a workaround for a specific endpoint, that might be OK; the
> world isn't perfect.  But if that's the case, we should include more
> specifics about the device, e.g., vendor/device IDs and "lspci -vv"
> output.
>

This is specifically happening with the Intel P3700 400 GB series:
http://ark.intel.com/products/79625/Intel-SSD-DC-P3700-Series-400GB-2_5in-PCIe-3_0-20nm-MLC

On an Intel board you will not see this problem, because the reference
clock is supplied by an on-board oscillator.

Our boards do not have an on-board clock source; the reference clock is
derived from a PLL inside the SOC.

Besides, the PCI spec does not stipulate such timings. In fact it does not
say whether a PCIe device may assume the refclk is still available after
PERST# is asserted. The 500 ms is based purely on observation and is taken
as a safe margin.

So this is partly due to the endpoint's behavior and partly due to the way
our boards are designed.

With the same SSD we do not see this issue on x86 boards, or on any other
ARM based board that supplies the reference clock from an on-board
oscillator, because there the refclk does not go away across a reboot.

Please also refer to the link below, which discusses keeping the clock
active after PERST# is asserted:
http://www.eetimes.com/document.asp?doc_id=1279299

Product: Intel P3700 400 GB SSD
Vendor ID: 8086
Device ID: 0953

I am sorry, but we have already placed a replacement order for those disks,
so I do not have the "lspci -vv" output; I hope the information above is
sufficient.
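
To make the intended sequence concrete, the RC side of the shutdown would
look roughly like the sketch below. The helper name for the PERST# control
and the use of a fixed msleep() are illustrative; the 500 ms figure is the
empirically chosen margin discussed above:

/*
 * Sketch: assert PERST# while the reference clock is still running, then
 * hold it for ~500 ms so the endpoint can enter reset cleanly before the
 * SoC-level reset takes the LCPLL (and hence the refclk) away.
 */
static void iproc_pcie_platform_shutdown(struct platform_device *pdev)
{
        struct iproc_pcie *pcie = platform_get_drvdata(pdev);

        /* assert PERST# (register and polarity handled by the helper) */
        iproc_pcie_perst_ctrl(pcie, true);

        /* keep the refclk alive with PERST# = 0 for ~500 ms */
        msleep(500);
}

Hooked up through the platform driver's .shutdown callback, this runs after
the endpoint drivers' own shutdown (e.g. nvme_shutdown), giving the ordering
described in the changelog.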

>> >> Essentially clock will remain alive for 500ms with PERST# = 0
>> >> before reboot.
>> >
>> >> This patch adds platform shutdown where it should be
>> >> called in device_shutdown while reboot command is issued.
>> >>
>> >> So in sequence first Endpoint Shutdown (e.g. nvme_shutdown)
>> >> followed by RC shutdown is called, which issues safe PERST
>> >> assertion.
>> >>
>> >> Signed-off-by: Oza Pawandeep 
>> >> Reviewed-by: Ray Jui 
>> >> Reviewed-by: Scott Branden 
>> >>
>> >> diff --git a/drivers/pci/host/pcie-iproc-platform.c 
>> >> b/drivers/pci/host/pcie-iproc-platform.c
>> >> index 90d2bdd..9512960 100644
>> >> --- a/drivers/pci/host/pcie-iproc

[RFC] NVMe Configuration using sysctl

2017-05-15 Thread Oza Oza
Hi,

We are configuring interrupt coalescing for NVMe, but right now it uses a
module parameter, so the same interrupt coalescing settings get applied to
all the NVMe devices connected to different RCs.

Ideally this should be done through sysctl; for example, sysctl could
provide an interface to change the per-CPU IO queue pairs, the interrupt
coalescing settings, etc.

Please suggest whether we could have/implement a sysctl module for NVMe.

Regards,
Oza.


Re: [RFC] NVMe Configuration using sysctl

2017-05-15 Thread Oza Oza
On Mon, May 15, 2017 at 2:04 PM, Oza Oza  wrote:
> Hi,
>
> we are configuring interrupt coalesce for NVMe, but right now, it uses
> module param.
> so the same interrupt coalesce settings get applied for all the NVMEs
> connected to different RCs.
>
> ideally it should be with sysctl.
> for e.g.
> sysctl should provide interface to change
> Per-CPU IO queue pairs, interrupt coalesce settings etc..
>
> please suggest if we could have/implement sysctl module for NVMe ?
>
> Regards,
> Oza.

+ linux-n...@lists.infradead.org


Re: [RFC] NVMe Configuration using sysctl

2017-05-15 Thread Oza Oza
On Mon, May 15, 2017 at 2:45 PM, Sagi Grimberg  wrote:
>
>>> Hi,
>
>
> Hi Oza,
>
>>> we are configuring interrupt coalesce for NVMe, but right now, it uses
>>> module param.
>>> so the same interrupt coalesce settings get applied for all the NVMEs
>>> connected to different RCs.
>>>
>>> ideally it should be with sysctl.
>
>
> If at all, I would place this in nvme-cli (via ioctl) instead of
> sysctl.
>
>>> for e.g.
>>> sysctl should provide interface to change
>>> Per-CPU IO queue pairs, interrupt coalesce settings etc..
>
>
> My personal feeling is that percpu granularity is a lot to take in for
> the user, and also can yield some unexpected performance
> characteristics. But I might be wrong here..
>

I thought of nvme_ioctl, but was not sure whether sysctl or an ioctl was
the right place. We are only interested in introducing interrupt
coalescing, because that is what brings the improvement.

>>> please suggest if we could have/implement sysctl module for NVMe ?
>
>
> I have asked this before, but interrupt coalescing has very little
> merit without being able to be adaptive. net drivers maintain online
> stats and schedule interrupt coalescing modifications.
>
> Should work in theory, but having said that, interrupt coalescing as a
> whole is essentially unusable in nvme since the coalescing time limit
> is in units of 100us increments...

Surprisingly, it brings a 20% improvement in CPU utilization for us, so it
saves a lot of CPU cycles and frees them up to do something else. The value
has to be tuned, but that is all there is to it, so we are keen on having
this as a tunable.

So your suggestion is to use an ioctl instead of sysctl, right?
As of now we are only interested in interrupt coalescing.
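
To make the knob concrete: in the NVMe spec, interrupt coalescing is Set
Features feature id 08h, with the aggregation threshold in bits 7:0 of
dword11 and the aggregation time (in the 100 us increments Sagi mentions)
in bits 15:8. A small sketch of the encoding; the helper name and the
example values are mine, not part of any existing interface:

/* Build Set Features dword11 for the Interrupt Coalescing feature (08h). */
static inline u32 nvme_irq_coalesce_dword11(u8 threshold, u8 time_100us)
{
        /* bits 7:0 = aggregation threshold (completions),
         * bits 15:8 = aggregation time in 100 us increments */
        return ((u32)time_100us << 8) | threshold;
}

For example, nvme_irq_coalesce_dword11(8, 2) asks for an interrupt after 8
completions or 200 us, whichever comes first. Driving this per controller
through the admin passthrough path (via nvme-cli, as you suggest, rather
than a module parameter) would avoid applying one setting to every NVMe
device in the system.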

Regards,
Oza.


RE: [RFC PATCH] iommu/dma: account pci host bridge dma_mask for IOVA allocation

2017-03-17 Thread Oza Oza
Hi Robin,

This patch currently touches multiple frameworks, so I have coalesced the
changes into a single patch to present the whole picture as one RFC.

It involves

1) PCIe OF framework changes
2) IOMMU ops
3) the PCI dma-ranges discussion
4) a bug in the device tree framework's dma-ranges handling (discussed only
in the commit message)

There are some minor glitches remaining, which are open for feedback and
improvement. This could be split into separate patches later, once you see
it as having the potential to solve the problem statement end to end.

Regards,
Oza.
-Original Message-
From: Oza Oza [mailto:oza@broadcom.com]
Sent: Friday, March 17, 2017 11:45 AM
To: 'Joerg Roedel'; 'Robin Murphy'
Cc: 'io...@lists.linux-foundation.org'; 'linux-kernel@vger.kernel.org';
'linux-arm-ker...@lists.infradead.org'; 'devicet...@vger.kernel.org';
'bcm-kernel-feedback-l...@broadcom.com'
Subject: RE: [RFC PATCH] iommu/dma: account pci host bridge dma_mask for
IOVA allocation

Hi,

There are certain areas which require contemplation. This problem needs
more attention from both the PCI OF framework and the IOMMU framework, and
from the integration of the two.

Regards,
Oza.

-Original Message-
From: Oza Pawandeep [mailto:oza@broadcom.com]
Sent: Friday, March 17, 2017 11:41 AM
To: Joerg Roedel; Robin Murphy
Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
linux-arm-ker...@lists.infradead.org; devicet...@vger.kernel.org;
bcm-kernel-feedback-l...@broadcom.com; Oza Pawandeep
Subject: [RFC PATCH] iommu/dma: account pci host bridge dma_mask for IOVA
allocation

It is possible that a PCI device supports 64-bit DMA addressing, and thus
its driver sets the device's dma_mask to DMA_BIT_MASK(64), while the PCI
host bridge has limitations on the inbound transaction addressing. As an
example, consider an NVME SSD device connected to the iproc-PCIe controller.

Currently, the IOMMU DMA ops only consider the PCI device dma_mask when
allocating an IOVA. This is particularly problematic on ARM/ARM64 SOCs
where the IOMMU (i.e. SMMU) translates IOVA to PA for in-bound transactions
only after the PCI host has forwarded these transactions on the SOC IO bus.
This means that on such ARM/ARM64 SOCs the IOVA of in-bound transactions
has to honor the addressing restrictions of the PCI host.

This patch is inspired by
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1306545.html
http://www.spinics.net/lists/arm-kernel/msg566947.html

but the above only solves half of the problem; the rest of the problem,
described below, is what we face on iproc based SOCs.

The current integration of the PCIe and OF frameworks assumes dma-ranges in
the form used by memory-mapped devices:
dma-ranges: (child-bus-address, parent-bus-address, length).

But iproc based SOCs, and even Rcar based SOCs, have PCI-style dma-ranges:
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;

of_dma_configure() is written specifically for memory-mapped devices; no
implementation exists for PCI to handle PCI-style memory ranges. In fact,
the PCI world does not seem to define a standard dma-ranges binding, and in
its absence the dma_mask used to remain 32-bit, because of_dma_configure()
parsed a size of 0.

This patch also implements of_pci_get_dma_ranges() to cater to PCI-style
dma-ranges, so that the returned size yields the best possible (largest)
dma_mask. For example, with
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>; we should get
dev->coherent_dma_mask=0x7fffffffff.

Conclusion: there are the following problems
1) The Linux PCI and IOMMU framework integration has glitches with respect
to dma-ranges.
2) The Linux PCI framework looks very uncertain about dma-ranges; the
binding is not defined
   the way it is defined for memory-mapped devices.
   rcar and iproc based SOCs use their own custom dma-ranges
   (which could instead be standard).
3) Even the default parser, of_dma_get_range(), throws the error
   "no dma-ranges found for node"
   because of an existing bug:
   the following lines should be moved to the end of while(1)
839 node = of_get_next_parent(node);
840 if (!node)
841 break;

Reviewed-by: Anup Patel 
Reviewed-by: Scott Branden 
Signed-off-by: Oza Pawandeep 

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 8c7c244..20cfff7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -217,6 +217,9 @@ config NEED_DMA_MAP_STATE
 config NEED_SG_DMA_LENGTH
def_bool y

+config ARCH_HAS_DMA_SET_COHERENT_MASK
+   def_bool y
+
 config SMP
def_bool y

diff --git a/arch/arm64/include/asm/device.h b/arch/arm64/include/asm/device.h
index 73d5bab..64b4dc3 100644
--- a/arch/arm64/include/asm/device.h
+++ b/arch/arm64/include/asm/device.h
@@ -20,6 +20,7 @@ struct dev_archdata {
 #ifdef CONFIG_IOMMU_API
void *iommu;

[RFC PATCH] iommu/dma/pci: account pci host bridge dma_mask for IOVA allocation

2017-03-20 Thread Oza Oza
+  linux-pci

Regards,
Oza.

-Original Message-
From: Oza Pawandeep [mailto:oza@broadcom.com]
Sent: Friday, March 17, 2017 11:41 AM
To: Joerg Roedel; Robin Murphy
Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
linux-arm-ker...@lists.infradead.org; devicet...@vger.kernel.org;
bcm-kernel-feedback-l...@broadcom.com; Oza Pawandeep
Subject: [RFC PATCH] iommu/dma: account pci host bridge dma_mask for IOVA
allocation

It is possible that a PCI device supports 64-bit DMA addressing, and thus
its driver sets the device's dma_mask to DMA_BIT_MASK(64), while the PCI
host bridge has limitations on the inbound transaction addressing. As an
example, consider an NVME SSD device connected to the iproc-PCIe controller.

Currently, the IOMMU DMA ops only consider the PCI device dma_mask when
allocating an IOVA. This is particularly problematic on ARM/ARM64 SOCs
where the IOMMU (i.e. SMMU) translates IOVA to PA for in-bound transactions
only after the PCI host has forwarded these transactions on the SOC IO bus.
This means that on such ARM/ARM64 SOCs the IOVA of in-bound transactions
has to honor the addressing restrictions of the PCI host.

This patch is inspired by
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1306545.html
http://www.spinics.net/lists/arm-kernel/msg566947.html

but the above only solves half of the problem; the rest of the problem,
described below, is what we face on iproc based SOCs.

The current integration of the PCIe and OF frameworks assumes dma-ranges in
the form used by memory-mapped devices:
dma-ranges: (child-bus-address, parent-bus-address, length).

But iproc based SOCs, and even Rcar based SOCs, have PCI-style dma-ranges:
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;

of_dma_configure() is written specifically for memory-mapped devices; no
implementation exists for PCI to handle PCI-style memory ranges. In fact,
the PCI world does not seem to define a standard dma-ranges binding, and in
its absence the dma_mask used to remain 32-bit, because of_dma_configure()
parsed a size of 0.

This patch also implements of_pci_get_dma_ranges() to cater to PCI-style
dma-ranges, so that the returned size yields the best possible (largest)
dma_mask. For example, with
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>; we should get
dev->coherent_dma_mask=0x7fffffffff.

Conclusion: there are the following problems
1) The Linux PCI and IOMMU framework integration has glitches with respect
to dma-ranges.
2) The Linux PCI framework looks very uncertain about dma-ranges; the
binding is not defined
   the way it is defined for memory-mapped devices.
   rcar and iproc based SOCs use their own custom dma-ranges
   (which could instead be standard).
3) Even the default parser, of_dma_get_range(), throws the error
   "no dma-ranges found for node"
   because of an existing bug:
   the following lines should be moved to the end of while(1)
839 node = of_get_next_parent(node);
840 if (!node)
841 break;

Reviewed-by: Anup Patel 
Reviewed-by: Scott Branden 
Signed-off-by: Oza Pawandeep 

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 8c7c244..20cfff7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -217,6 +217,9 @@ config NEED_DMA_MAP_STATE
 config NEED_SG_DMA_LENGTH
def_bool y

+config ARCH_HAS_DMA_SET_COHERENT_MASK
+   def_bool y
+
 config SMP
def_bool y

diff --git a/arch/arm64/include/asm/device.h b/arch/arm64/include/asm/device.h
index 73d5bab..64b4dc3 100644
--- a/arch/arm64/include/asm/device.h
+++ b/arch/arm64/include/asm/device.h
@@ -20,6 +20,7 @@ struct dev_archdata {
 #ifdef CONFIG_IOMMU_API
void *iommu;/* private IOMMU data */
 #endif
+   u64 parent_dma_mask;
bool dma_coherent;
 };

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 81cdb2e..5845ecd 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -564,6 +564,7 @@ static void flush_page(struct device *dev, const void
*virt, phys_addr_t phys)
__dma_flush_area(virt, PAGE_SIZE);
 }

+
 static void *__iommu_alloc_attrs(struct device *dev, size_t size,
 dma_addr_t *handle, gfp_t gfp,
 unsigned long attrs)
@@ -795,6 +796,20 @@ static void __iommu_unmap_sg_attrs(struct device *dev,
 	iommu_dma_unmap_sg(dev, sgl, nelems, dir, attrs);
 }

+static int __iommu_set_dma_mask(struct device *dev, u64 mask)
+{
+   /* device is not DMA capable */
+   if (!dev->dma_mask)
+   return -EIO;
+
+   if (mask > dev->archdata.parent_dma_mask)
+   mask = dev->archdata.parent_dma_mask;
+
+   *dev->dma_mask = mask;
+
+   return 0;
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
.alloc = __iommu_alloc_attrs,
.free = __iommu_free_attrs,
@@ -811,8 +826,21 @@ static void __iomm

RE: [RFC PATCH] iommu/dma/pci: account pci host bridge dma_mask for IOVA allocation

2017-03-20 Thread Oza Oza
Hi Robin,

Please find my comments inline.

Regards,
Oza.

-Original Message-
From: Robin Murphy [mailto:robin.mur...@arm.com]
Sent: Monday, March 20, 2017 9:14 PM
To: Oza Oza
Cc: Joerg Roedel; linux-...@vger.kernel.org;
io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
linux-arm-ker...@lists.infradead.org; devicet...@vger.kernel.org;
bcm-kernel-feedback-l...@broadcom.com
Subject: Re: [RFC PATCH] iommu/dma/pci: account pci host bridge dma_mask for
IOVA allocation

On 20/03/17 08:57, Oza Oza wrote:
> +  linux-pci
>
> Regards,
> Oza.
>
> -Original Message-
> From: Oza Pawandeep [mailto:oza@broadcom.com]
> Sent: Friday, March 17, 2017 11:41 AM
> To: Joerg Roedel; Robin Murphy
> Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; devicet...@vger.kernel.org;
> bcm-kernel-feedback-l...@broadcom.com; Oza Pawandeep
> Subject: [RFC PATCH] iommu/dma: account pci host bridge dma_mask for
> IOVA allocation
>
> It is possible that PCI device supports 64-bit DMA addressing, and
> thus it's driver sets device's dma_mask to DMA_BIT_MASK(64), however
> PCI host bridge may have limitations on the inbound transaction
> addressing. As an example, consider NVME SSD device connected to
> iproc-PCIe controller.
>
> Currently, the IOMMU DMA ops only considers PCI device dma_mask when
> allocating an IOVA. This is particularly problematic on
> ARM/ARM64 SOCs where the IOMMU (i.e. SMMU) translates IOVA to PA for
> in-bound transactions only after PCI Host has forwarded these
> transactions on SOC IO bus. This means on such ARM/ARM64 SOCs the IOVA
> of in-bound transactions has to honor the addressing restrictions of the
> PCI Host.
>
> this patch is inspired by
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1306545.ht
> ml http://www.spinics.net/lists/arm-kernel/msg566947.html
>
> but above inspiraiton solves the half of the problem.
> the rest of the problem is descrbied below, what we face on iproc
> based SOCs.
>
> current pcie frmework and of framework integration assumes dma-ranges
> in a way where memory-mapped devices define their dma-ranges.
> dma-ranges: (child-bus-address, parent-bus-address, length).
>
> but iproc based SOCs and even Rcar based SOCs has PCI world dma-ranges.
> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>
> of_dma_configure is specifically witten to take care of memory mapped
> devices.
> but no implementation exists for pci to take care of pcie based memory
> ranges.
> in fact pci world doesnt seem to define standard dma-ranges since
> there is an absense of the same, the dma_mask used to remain 32bit
> because of
> 0 size return (parsed by of_dma_configure())
>
> this patch also implements of_pci_get_dma_ranges to cater to pci world
> dma-ranges.
> so then the returned size get best possible (largest) dma_mask.
> for e.g.
> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>; we should get
> dev->coherent_dma_mask=0x7fffffffff.
>
> conclusion: there are following problems
> 1) linux pci and iommu framework integration has glitches with respect
> to dma-ranges
> 2) pci linux framework look very uncertain about dma-ranges, rather
> binding is not defined
>the way it is defined for memory mapped devices.
>rcar and iproc based SOCs use their custom one dma-ranges
>(rather can be standard)
> 3) even if in case of default parser of_dma_get_ranges,:
>it throws and erro"
>"no dma-ranges found for node"
>because of the bug which exists.
>following lines should be moved to the end of while(1)
>   839 node = of_get_next_parent(node);
>   840 if (!node)
>   841 break;

Right, having made sense of this and looked into things myself I think I
understand now; what this boils down to is that the existing implementation
of of_dma_get_range() expects always to be given a leaf device_node, and
doesn't cope with being given a device_node for the given device's parent
bus directly. That's really all there is; it's not specific to PCI (there
are other probeable and DMA-capable buses whose children aren't described in
DT, like the fsl-mc thing), and it definitely doesn't have anything to do
with IOMMUs.

>Oza: I think it's the other way around; or rather, it is given the leaf
>device node correctly, at least in this case.
>The problem is that of_dma_get_range() jumps to the parent node (via
>of_get_next_parent(node)) without examining the child.
>Although I tried to fix that, in that case the dma-ranges parsing code
>does not really parse PCI ranges, and the size returned is 0.
>Rather it parses memory mapped devices dm

RE: [RFC PATCH] iommu/dma/pci: account pci host bridge dma_mask for IOVA allocation

2017-03-24 Thread Oza Oza
Hi Robin,

I have made 3 separate patches now, which gives clear idea about the
changes.
we can have discussion there.

Regards,
Oza.

-Original Message-
From: Robin Murphy [mailto:robin.mur...@arm.com]
Sent: Monday, March 20, 2017 9:14 PM
To: Oza Oza
Cc: Joerg Roedel; linux-...@vger.kernel.org;
io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
linux-arm-ker...@lists.infradead.org; devicet...@vger.kernel.org;
bcm-kernel-feedback-l...@broadcom.com
Subject: Re: [RFC PATCH] iommu/dma/pci: account pci host bridge dma_mask for
IOVA allocation

On 20/03/17 08:57, Oza Oza wrote:
> +  linux-pci
>
> Regards,
> Oza.
>
> -Original Message-
> From: Oza Pawandeep [mailto:oza@broadcom.com]
> Sent: Friday, March 17, 2017 11:41 AM
> To: Joerg Roedel; Robin Murphy
> Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; devicet...@vger.kernel.org;
> bcm-kernel-feedback-l...@broadcom.com; Oza Pawandeep
> Subject: [RFC PATCH] iommu/dma: account pci host bridge dma_mask for
> IOVA allocation
>
> It is possible that PCI device supports 64-bit DMA addressing, and
> thus it's driver sets device's dma_mask to DMA_BIT_MASK(64), however
> PCI host bridge may have limitations on the inbound transaction
> addressing. As an example, consider NVME SSD device connected to
> iproc-PCIe controller.
>
> Currently, the IOMMU DMA ops only considers PCI device dma_mask when
> allocating an IOVA. This is particularly problematic on
> ARM/ARM64 SOCs where the IOMMU (i.e. SMMU) translates IOVA to PA for
> in-bound transactions only after PCI Host has forwarded these
> transactions on SOC IO bus. This means on such ARM/ARM64 SOCs the IOVA
> of in-bound transactions has to honor the addressing restrictions of the
> PCI Host.
>
> this patch is inspired by
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1306545.ht
> ml http://www.spinics.net/lists/arm-kernel/msg566947.html
>
> but above inspiraiton solves the half of the problem.
> the rest of the problem is descrbied below, what we face on iproc
> based SOCs.
>
> current pcie frmework and of framework integration assumes dma-ranges
> in a way where memory-mapped devices define their dma-ranges.
> dma-ranges: (child-bus-address, parent-bus-address, length).
>
> but iproc based SOCs and even Rcar based SOCs has PCI world dma-ranges.
> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>
> of_dma_configure is specifically witten to take care of memory mapped
> devices.
> but no implementation exists for pci to take care of pcie based memory
> ranges.
> in fact pci world doesnt seem to define standard dma-ranges since
> there is an absense of the same, the dma_mask used to remain 32bit
> because of
> 0 size return (parsed by of_dma_configure())
>
> this patch also implements of_pci_get_dma_ranges to cater to pci world
> dma-ranges.
> so then the returned size get best possible (largest) dma_mask.
> for e.g.
> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>; we should get
> dev->coherent_dma_mask=0x7fffffffff.
>
> conclusion: there are following problems
> 1) linux pci and iommu framework integration has glitches with respect
> to dma-ranges
> 2) pci linux framework look very uncertain about dma-ranges, rather
> binding is not defined
>the way it is defined for memory mapped devices.
>rcar and iproc based SOCs use their custom one dma-ranges
>(rather can be standard)
> 3) even if in case of default parser of_dma_get_ranges,:
>it throws and erro"
>"no dma-ranges found for node"
>because of the bug which exists.
>following lines should be moved to the end of while(1)
>   839 node = of_get_next_parent(node);
>   840 if (!node)
>   841 break;

Right, having made sense of this and looked into things myself I think I
understand now; what this boils down to is that the existing implementation
of of_dma_get_range() expects always to be given a leaf device_node, and
doesn't cope with being given a device_node for the given device's parent
bus directly. That's really all there is; it's not specific to PCI (there
are other probeable and DMA-capable buses whose children aren't described in
DT, like the fsl-mc thing), and it definitely doesn't have anything to do
with IOMMUs.

Now, that's certainly something to fix, but AFAICS this patch doesn't do
that, only adds some PCI-specific code which is never called.

DMA mask inheritance for arm64 is another issue, which again is general, but
does tend to be more visible in the IOMMU case. That still needs some work
on the ACPI side - all the DT-centric approach

RE: [RFC PATCH] iommu/dma: check pci host bridge dma_mask for IOVA allocation

2017-03-15 Thread Oza Oza
Hi Robin,

I tried applying

[1]:http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1306545.html
[3]:http://www.spinics.net/lists/arm-kernel/msg566947.html

With [3] applied, it crashes on our platform (SDHCI, with the IOMMU both
enabled and disabled):

[   19.925018] PC is at sdhci_send_command+0x648/0xb30
[   19.930048] LR is at sdhci_send_command+0x588/0xb30
[   19.935078] pc : [] lr : [] pstate:
a1c5
[   19.942707] sp : 80097ff1ede0
[   19.946123] x29: 80097ff1ede0 x28: 
[   19.951623] x27: 0001 x26: 08923ca8
[   19.957124] x25: 8009750dc0d8 x24: 8009750dacc0
[   19.962625] x23: 00418958 x22: 0003
[   19.968125] x21: 8009750dc118 x20: 8009750dc198
[   19.973626] x19: 8009750dac00 x18: 0400
[   19.979126] x17:  x16: 
[   19.984627] x15: 8009764e4880 x14: 800976811548
[   19.990128] x13: 0008 x12: 7e0025ff9580
[   19.995628] x11: 800976d28080 x10: 0840
[   20.001129] x9 : 0040 x8 : 800976801020
[   20.006629] x7 : 0009d000 x6 : 08441358
[   20.012130] x5 :  x4 : 
[   20.017631] x3 : 800976465080 x2 : 
[   20.023130] x1 : 7e0025ce7802 x0 : ffe4
[   20.028630]
[   20.030165] ---[ end trace cd394f1ca2a1b19b ]---
[   20.034925] Call trace:
[   20.037446] Exception stack(0x80097ff1ec10 to 0x80097ff1ed40)
[   20.044089] ec00:   8009750dac00
0001
[   20.052165] ec20: 80097ff1ede0 085c9cb0 
0001
[   20.060242] ec40: 80097ff1ec90 084402b4 800976421000
800976465040
[   20.068318] ec60: 800976465040  80097659c850
80097653d810
[   20.076393] ec80: 0001 800973cb9300 80097ff1ecb0
08440324
[   20.084470] eca0: 800976421000 01c0 ffe4
7e0025ce7802
[   20.092547] ecc0:  800976465080 

[   20.100623] ece0: 08441358 0009d000 800976801020
0040
[   20.108699] ed00: 0840 800976d28080 7e0025ff9580
0008
[   20.116776] ed20: 800976811548 8009764e4880 

[   20.124853] [] sdhci_send_command+0x648/0xb30
[   20.130959] [] sdhci_irq+0x9e8/0xa20
[   20.136259] [] __handle_irq_event_percpu+0x9c/0x128
[   20.142901] [] handle_irq_event_percpu+0x1c/0x58
[   20.149275] [] handle_irq_event+0x48/0x78
[   20.155022] [] handle_fasteoi_irq+0xb8/0x1a0
[   20.161036] [] generic_handle_irq+0x24/0x38
[   20.166962] [] __handle_domain_irq+0x5c/0xb8
[   20.172977] [] gic_handle_irq+0xbc/0x168

Regards,
Oza.
-Original Message-
From: Oza Oza [mailto:oza@broadcom.com]
Sent: Tuesday, March 14, 2017 5:16 PM
To: 'Robin Murphy'; 'Joerg Roedel'
Cc: 'io...@lists.linux-foundation.org'; 'linux-kernel@vger.kernel.org';
'linux-arm-ker...@lists.infradead.org';
'bcm-kernel-feedback-l...@broadcom.com'; 'a...@arndb.de'; 'Nikita
Yushchenko'
Subject: RE: [RFC PATCH] iommu/dma: check pci host bridge dma_mask for
IOVA allocation

My responses inline:

-Original Message-
From: Robin Murphy [mailto:robin.mur...@arm.com]
Sent: Tuesday, March 14, 2017 4:27 PM
To: Oza Pawandeep; Joerg Roedel
Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
linux-arm-ker...@lists.infradead.org;
bcm-kernel-feedback-l...@broadcom.com; a...@arndb.de; Nikita Yushchenko
Subject: Re: [RFC PATCH] iommu/dma: check pci host bridge dma_mask for
IOVA allocation

On 14/03/17 08:48, Oza Pawandeep wrote:
> It is possible that PCI device supports 64-bit DMA addressing, and
> thus it's driver sets device's dma_mask to DMA_BIT_MASK(64), however
> PCI host bridge may have limitations on the inbound transaction
> addressing. As an example, consider NVME SSD device connected to
> iproc-PCIe controller.

Aw heck, not another one :(

> Currently, the IOMMU DMA ops only considers PCI device dma_mask when
> allocating an IOVA. This is particularly problematic on
> ARM/ARM64 SOCs where the IOMMU (i.e. SMMU) translates IOVA to PA for
> in-bound transactions only after PCI Host has forwarded these
> transactions on SOC IO bus. This means on such ARM/ARM64 SOCs the IOVA
> of in-bound transactions has to honor the addressing restrictions of
> the PCI Host.

Depending on whether this most closely matches the R-Car situation[1] or
the X-Gene situation[2], this may not address the real problem at all (if
the non-IOMMU case is also affected). Either way it also fails to help
non-PCI devices which can face the exact same problem.

I am not fully aware of R-car and X-gene situation, but for iproc based
SOCs, our pcie host floats a

RE: [RFC PATCH] iommu/dma: account pci host bridge dma_mask for IOVA allocation

2017-03-16 Thread Oza Oza
Hi,

There are certain areas which require contemplation. This problem needs
more attention from both the PCI OF framework and the IOMMU framework, and
from the integration of the two.

Regards,
Oza.

-Original Message-
From: Oza Pawandeep [mailto:oza@broadcom.com]
Sent: Friday, March 17, 2017 11:41 AM
To: Joerg Roedel; Robin Murphy
Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
linux-arm-ker...@lists.infradead.org; devicet...@vger.kernel.org;
bcm-kernel-feedback-l...@broadcom.com; Oza Pawandeep
Subject: [RFC PATCH] iommu/dma: account pci host bridge dma_mask for IOVA
allocation

It is possible that a PCI device supports 64-bit DMA addressing, and thus
its driver sets the device's dma_mask to DMA_BIT_MASK(64), while the PCI
host bridge has limitations on the inbound transaction addressing. As an
example, consider an NVME SSD device connected to the iproc-PCIe controller.

Currently, the IOMMU DMA ops only consider the PCI device dma_mask when
allocating an IOVA. This is particularly problematic on ARM/ARM64 SOCs
where the IOMMU (i.e. SMMU) translates IOVA to PA for in-bound transactions
only after the PCI host has forwarded these transactions on the SOC IO bus.
This means that on such ARM/ARM64 SOCs the IOVA of in-bound transactions
has to honor the addressing restrictions of the PCI host.

This patch is inspired by
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1306545.html
http://www.spinics.net/lists/arm-kernel/msg566947.html

but the above only solves half of the problem; the rest of the problem,
described below, is what we face on iproc based SOCs.

The current integration of the PCIe and OF frameworks assumes dma-ranges in
the form used by memory-mapped devices:
dma-ranges: (child-bus-address, parent-bus-address, length).

But iproc based SOCs, and even Rcar based SOCs, have PCI-style dma-ranges:
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;

of_dma_configure() is written specifically for memory-mapped devices; no
implementation exists for PCI to handle PCI-style memory ranges. In fact,
the PCI world does not seem to define a standard dma-ranges binding, and in
its absence the dma_mask used to remain 32-bit, because of_dma_configure()
parsed a size of 0.

This patch also implements of_pci_get_dma_ranges() to cater to PCI-style
dma-ranges, so that the returned size yields the best possible (largest)
dma_mask. For example, with
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>; we should get
dev->coherent_dma_mask=0x7fffffffff.

Conclusion: there are the following problems
1) The Linux PCI and IOMMU framework integration has glitches with respect
to dma-ranges.
2) The Linux PCI framework looks very uncertain about dma-ranges; the
binding is not defined
   the way it is defined for memory-mapped devices.
   rcar and iproc based SOCs use their own custom dma-ranges
   (which could instead be standard).
3) Even the default parser, of_dma_get_range(), throws the error
   "no dma-ranges found for node"
   because of an existing bug:
   the following lines should be moved to the end of while(1)
839 node = of_get_next_parent(node);
840 if (!node)
841 break;

Reviewed-by: Anup Patel 
Reviewed-by: Scott Branden 
Signed-off-by: Oza Pawandeep 

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 8c7c244..20cfff7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -217,6 +217,9 @@ config NEED_DMA_MAP_STATE
 config NEED_SG_DMA_LENGTH
def_bool y

+config ARCH_HAS_DMA_SET_COHERENT_MASK
+   def_bool y
+
 config SMP
def_bool y

diff --git a/arch/arm64/include/asm/device.h b/arch/arm64/include/asm/device.h
index 73d5bab..64b4dc3 100644
--- a/arch/arm64/include/asm/device.h
+++ b/arch/arm64/include/asm/device.h
@@ -20,6 +20,7 @@ struct dev_archdata {
 #ifdef CONFIG_IOMMU_API
void *iommu;/* private IOMMU data */
 #endif
+   u64 parent_dma_mask;
bool dma_coherent;
 };

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 81cdb2e..5845ecd 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -564,6 +564,7 @@ static void flush_page(struct device *dev, const void
*virt, phys_addr_t phys)
__dma_flush_area(virt, PAGE_SIZE);
 }

+
 static void *__iommu_alloc_attrs(struct device *dev, size_t size,
 dma_addr_t *handle, gfp_t gfp,
 unsigned long attrs)
@@ -795,6 +796,20 @@ static void __iommu_unmap_sg_attrs(struct device *dev,
 	iommu_dma_unmap_sg(dev, sgl, nelems, dir, attrs);
 }

+static int __iommu_set_dma_mask(struct device *dev, u64 mask)
+{
+   /* device is not DMA capable */
+   if (!dev->dma_mask)
+   return -EIO;
+
+   if (mask > dev->archdata.parent_dma_mask)
+   mask = dev->archdata.parent_dma_mask;
+
+   *dev->dma_mask = mask;
+
+   return 0;
+}
+
 static const struc

RE: [RFC PATCH] iommu/dma: check pci host bridge dma_mask for IOVA allocation

2017-03-14 Thread Oza Oza
My responses inline:

-Original Message-
From: Robin Murphy [mailto:robin.mur...@arm.com]
Sent: Tuesday, March 14, 2017 4:27 PM
To: Oza Pawandeep; Joerg Roedel
Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
linux-arm-ker...@lists.infradead.org;
bcm-kernel-feedback-l...@broadcom.com; a...@arndb.de; Nikita Yushchenko
Subject: Re: [RFC PATCH] iommu/dma: check pci host bridge dma_mask for
IOVA allocation

On 14/03/17 08:48, Oza Pawandeep wrote:
> It is possible that PCI device supports 64-bit DMA addressing, and
> thus it's driver sets device's dma_mask to DMA_BIT_MASK(64), however
> PCI host bridge may have limitations on the inbound transaction
> addressing. As an example, consider NVME SSD device connected to
> iproc-PCIe controller.

Aw heck, not another one :(

> Currently, the IOMMU DMA ops only considers PCI device dma_mask when
> allocating an IOVA. This is particularly problematic on
> ARM/ARM64 SOCs where the IOMMU (i.e. SMMU) translates IOVA to PA for
> in-bound transactions only after PCI Host has forwarded these
> transactions on SOC IO bus. This means on such ARM/ARM64 SOCs the IOVA
> of in-bound transactions has to honor the addressing restrictions of
> the PCI Host.

Depending on whether this most closely matches the R-Car situation[1] or
the X-Gene situation[2], this may not address the real problem at all (if
the non-IOMMU case is also affected). Either way it also fails to help
non-PCI devices which can face the exact same problem.

I am not fully aware of the R-Car and X-Gene situations, but for iproc based
SOCs, our PCIe host floats addresses in the higher-order range once a 64-bit
dma_mask is set.

> This patch tries to solve above described IOVA allocation problem by:
> 1. Adding iommu_get_dma_mask() to get dma_mask of any device 2. For
> PCI device, iommu_get_dma_mask() compare dma_mask of PCI device and
> corresponding PCI Host dma_mask (if set).
> 3. Use iommu_get_dma_mask() in IOMMU DMA ops implementation instead of
> dma_get_mask()

Sorry, but NAK to this approach - not only is the implementation rather
ugly, and incomplete as above, but the fact that it starts with a clear
description of the underlying problem and then goes on to bodge around it
down the line instead of actually fixing it is the kicker.

:)
That is why this is an RFC patch; I had something similar in mind to go and
poke around with, along the lines of what you have provided in [1] and [3].
Let me try those on our platform and see if they solve the issue.

I've got a simple arm64 patch for proper DMA mask inheritance based on the
various discussions at [1] which I will try to clean up and send out for
4.12 if nobody else sends anything soon, but it does depend on getting [3]
merged first to avoid regressing certain platforms.

Robin.

[1]:http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1306545.ht
ml
[2]:http://www.spinics.net/lists/arm-kernel/msg552422.html
[3]:http://www.spinics.net/lists/arm-kernel/msg566947.html



> Signed-off-by: Oza Pawandeep 
> Reviewed-by: Anup Patel 
> Reviewed-by: Scott Branden 
> ---
>  drivers/iommu/dma-iommu.c | 44
> 
>  1 file changed, 40 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 48d36ce..e93e536 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -108,6 +108,42 @@ int iommu_get_dma_cookie(struct iommu_domain
> *domain)  }  EXPORT_SYMBOL(iommu_get_dma_cookie);
>
> +static u64 __iommu_dma_mask(struct device *dev, bool is_coherent) {
> +#ifdef CONFIG_PCI
> + u64 pci_hb_dma_mask;
> +
> + if (dev_is_pci(dev)) {
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct pci_host_bridge *br =
pci_find_host_bridge(pdev->bus);
> +
> + if ((!is_coherent) && !(br->dev.dma_mask))
> + goto default_dev_dma_mask;
> +
> + /* pci host bridge dma-mask. */
> + pci_hb_dma_mask = (!is_coherent) ? *br->dev.dma_mask :
> +br->dev.coherent_dma_mask;
> +
> + if (pci_hb_dma_mask && ((pci_hb_dma_mask) <
(*dev->dma_mask)))
> + return pci_hb_dma_mask;
> + }
> +default_dev_dma_mask:
> +#endif
> + return (!is_coherent) ? dma_get_mask(dev) :
> + dev->coherent_dma_mask;
> +}
> +
> +static u64 __iommu_dma_get_coherent_mask(struct device *dev) {
> + return __iommu_dma_mask(dev, true);
> +}
> +
> +static u64 __iommu_dma_get_mask(struct device *dev) {
> + return __iommu_dma_mask(dev, false); }
> +
> +
>  /**
>   * iommu_get_msi_cookie - Acquire just MSI remapping resources
>   * @domain: IOMMU domain to prepare
> @@ -461,7 +497,7 @@ struct page **iommu_dma_alloc(struct device *dev,
size_t size, gfp_t gfp,
>   if (!pages)
>   return NULL;
>
> - iova = __alloc_iova(domain, size, dev->coherent_dma_mask, dev);
> + iova = __alloc_iova(domain, size,
> +__iommu_dma_get_coherent_mask(dev), dev);
>   if (

Re: [RFC PATCH 3/3] of: fix node traversing in of_dma_get_range

2017-03-27 Thread Oza Oza
please find my comments inline.

On Mon, Mar 27, 2017 at 8:15 PM, Robin Murphy  wrote:
> Hi Rob,
>
> On 27/03/17 15:34, Rob Herring wrote:
>> On Sat, Mar 25, 2017 at 12:31 AM, Oza Pawandeep  wrote:
>>> it jumps to the parent node without examining the child node.
>>> also with that, it throws "no dma-ranges found for node"
>>> for pci dma-ranges.
>>>
>>> this patch fixes device node traversing for dma-ranges.
>>
>> What's the DT look like that doesn't work?
>
> The problem is the bodge in pci_dma_configure() where we don't have an
> OF node for the actual device itself, so pass in the host bridge's OF
> node instead. This happens to work well enough for dma-coherent, but I
> don't think dma-ranges was even considered at the time.
>
> As it happens I'm currently halfway through writing an experiment
> wherein pci_dma_configure() creates a temporary child node for the
> of_dma_configure() call if no other suitable alternative (e.g. some
> intermediate bridge node) exists. How hard are you likely to NAK that
> approach? ;)
>
>> dma-ranges is supposed to be a bus property, not a device's property.
>> So looking at the parent is correct behavior generally.
>
> Indeed, this patch as-is will break currently correct DTs (because we
> won't find dma-ranges on the device, so will bail before even looking at
> the parent as we should).

The current parsing assumes that dma-ranges is always to be found in the
parent node.

Based on that, my thinking is the following; there are a couple of options:

1)
Instead of while(1), use a meaningful condition such as while(node); then
the following bail-out is not required:
  if (!ranges)
   break;

2)
Add a check based on a DT property to distinguish PCI root bridges and
handle their dma-ranges separately.

But again, these changes neither solve the entire problem of choosing the
right dma_mask, nor do they actually address root bridge PCI dma-ranges
correctly.

Hence I have written
https://lkml.org/lkml/2017/3/27/540

My final take is: this function does not need to change; let it keep
parsing memory-mapped dma-ranges as it does today.

I am more inclined to have generic PCI dma-ranges parsing, which the
following already addresses (a rough sketch of option 1 is below for
reference):
https://lkml.org/lkml/2017/3/27/540
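
For clarity, option 1 would make the walk in of_dma_get_range() look roughly
like the sketch below (the surrounding code is paraphrased, so treat this as
a sketch of the intent rather than a ready patch):

        while (node) {
                naddr = of_n_addr_cells(node);
                nsize = of_n_size_cells(node);

                ranges = of_get_property(node, "dma-ranges", &len);
                if (ranges && len > 0)
                        break;          /* usable dma-ranges found here */

                /*
                 * No early bail-out on a missing property; walk up to the
                 * parent only after the current node has been examined,
                 * i.e. lines 839-841 today move to the bottom of the loop.
                 */
                node = of_get_next_parent(node);
        }

This keeps the child node in the picture, which is the step the current loop
skips; whether dropping the bail-out is acceptable for existing DTs is
exactly the concern raised above.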

Regards,
Oza.


Re: [RFC PATCH 1/3] of/pci: dma-ranges to account highest possible host bridge dma_mask

2017-03-27 Thread Oza Oza
On Mon, Mar 27, 2017 at 8:16 PM, Rob Herring  wrote:
> On Sat, Mar 25, 2017 at 12:31 AM, Oza Pawandeep  wrote:
>> it is possible that PCI device supports 64-bit DMA addressing,
>> and thus it's driver sets device's dma_mask to DMA_BIT_MASK(64),
>> however PCI host bridge may have limitations on the inbound
>> transaction addressing. As an example, consider NVME SSD device
>> connected to iproc-PCIe controller.
>>
>> Currently, the IOMMU DMA ops only considers PCI device dma_mask
>> when allocating an IOVA. This is particularly problematic on
>> ARM/ARM64 SOCs where the IOMMU (i.e. SMMU) translates IOVA to
>> PA for in-bound transactions only after PCI Host has forwarded
>> these transactions on SOC IO bus. This means on such ARM/ARM64
>> SOCs the IOVA of in-bound transactions has to honor the addressing
>> restrictions of the PCI Host.
>>
>> current pcie frmework and of framework integration assumes dma-ranges
>> in a way where memory-mapped devices define their dma-ranges.
>> dma-ranges: (child-bus-address, parent-bus-address, length).
>>
>> but iproc based SOCs and even Rcar based SOCs has PCI world dma-ranges.
>> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>
> If you implement a common function, then I expect to see other users
> converted to use it. There's also PCI hosts in arch/powerpc that parse
> dma-ranges.

The common function should be similar to what
of_pci_get_host_bridge_resources() does today; that one parses the "ranges"
property.

The new function would look like the following:

of_pci_get_dma_ranges(struct device_node *dev, struct list_head *resources)
where resources would return the dma-ranges.

But right now, if you look at the patch, of_dma_configure() calls the new
function, which actually returns the largest possible size. So this new
function has to be generic enough for other PCI hosts to use; iproc
(Broadcom SOC) and Rcar based SOCs can certainly use it.

Having powerpc use it is a separate exercise, since I do not have access to
other PCI hosts such as powerpc, but we can work that out with them on this
forum if required.

So overall, of_pci_get_dma_ranges() has to serve the following two purposes:

1) return the largest possible size to of_dma_configure(), so that it can
generate the largest possible dma_mask.

2) return the parsed dma-ranges resources to its callers.

To address both needs:

of_pci_get_dma_ranges(struct device_node *dev, struct list_head
*resources, u64 *size)

dev -> device node.
resources -> dma-ranges returned as an allocated list.
size -> largest size, used by of_dma_configure() to generate the dma_mask.

Let me know how this sounds; a rough sketch of the parsing side follows.
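
A minimal sketch of the size side of that helper, assuming the PCI-style
7-cell dma-ranges shown earlier (3 child address cells, parent address cells
taken from the parent bus, 2 size cells). The resources list handling and
most error checking are left out, so the name and details are illustrative
rather than final:

static int of_pci_get_dma_ranges(struct device_node *np, u64 *size)
{
        const __be32 *ranges;
        const int naddr = 3, nsize = 2; /* PCI child address and size cells */
        int len, pna, entry;
        u64 largest = 0;

        ranges = of_get_property(np, "dma-ranges", &len);
        if (!ranges || len <= 0)
                return -EINVAL;

        pna = of_n_addr_cells(np);      /* parent bus #address-cells */
        entry = naddr + pna + nsize;    /* cells per dma-ranges entry */

        for (len /= 4; len >= entry; len -= entry, ranges += entry) {
                u64 sz = of_read_number(ranges + naddr + pna, nsize);

                /* the largest inbound window decides the usable dma_mask */
                if (sz > largest)
                        largest = sz;
        }

        *size = largest;
        return 0;
}

For the dma-ranges example above (size cells <0x80 0x00>), 'largest' comes
out as 0x8000000000, from which of_dma_configure() can derive the
0x7fffffffff coherent_dma_mask mentioned in the changelog.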

Regards,
Oza.


Re: [RFC PATCH 1/3] of/pci: dma-ranges to account highest possible host bridge dma_mask

2017-03-28 Thread Oza Oza
On Tue, Mar 28, 2017 at 7:59 PM, Robin Murphy  wrote:
> On 28/03/17 06:27, Oza Oza wrote:
>> On Mon, Mar 27, 2017 at 8:16 PM, Rob Herring  wrote:
>>> On Sat, Mar 25, 2017 at 12:31 AM, Oza Pawandeep  
>>> wrote:
>>>> it is possible that PCI device supports 64-bit DMA addressing,
>>>> and thus it's driver sets device's dma_mask to DMA_BIT_MASK(64),
>>>> however PCI host bridge may have limitations on the inbound
>>>> transaction addressing. As an example, consider NVME SSD device
>>>> connected to iproc-PCIe controller.
>>>>
>>>> Currently, the IOMMU DMA ops only considers PCI device dma_mask
>>>> when allocating an IOVA. This is particularly problematic on
>>>> ARM/ARM64 SOCs where the IOMMU (i.e. SMMU) translates IOVA to
>>>> PA for in-bound transactions only after PCI Host has forwarded
>>>> these transactions on SOC IO bus. This means on such ARM/ARM64
>>>> SOCs the IOVA of in-bound transactions has to honor the addressing
>>>> restrictions of the PCI Host.
>>>>
>>>> current pcie frmework and of framework integration assumes dma-ranges
>>>> in a way where memory-mapped devices define their dma-ranges.
>>>> dma-ranges: (child-bus-address, parent-bus-address, length).
>>>>
>>>> but iproc based SOCs and even Rcar based SOCs has PCI world dma-ranges.
>>>> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>>>
>>> If you implement a common function, then I expect to see other users
>>> converted to use it. There's also PCI hosts in arch/powerpc that parse
>>> dma-ranges.
>>
>> the common function should be similar to what
>> of_pci_get_host_bridge_resources is doing right now.
>> it parses ranges property right now.
>>
>> the new function would look look following.
>>
>> of_pci_get_dma_ranges(struct device_node *dev, struct list_head *resources)
>> where resources would return the dma-ranges.
>>
>> but right now if you see the patch, of_dma_configure calls the new
>> function, which actually returns the largest possible size.
>> so this new function has to be generic in a way where other PCI hosts
>> can use it. but certainly iproc(Broadcom SOC) , rcar based SOCs can
>> use it for sure.
>>
>> although having powerpc using it;  is a separate exercise, since I do
>> not have any access to other PCI hosts such as powerpc. but we can
>> workout with them on thsi forum if required.
>>
>> so overall, of_pci_get_dma_ranges has to serve following 2 purposes.
>>
>> 1) it has to return largest possible size to of_dma_configure to
>> generate largest possible dma_mask.
>>
>> 2) it also has to return resources (dma-ranges) parsed, to the users.
>>
>> so to address above needs
>>
>> of_pci_get_dma_ranges(struct device_node *dev, struct list_head
>> *resources, u64 *size)
>>
>> dev -> device node.
>> resources -> dma-ranges in allocated list.
>> size -> highest possible size to generate possible dma_mask for
>> of_dma_configure.
>>
>> let em know how this sounds.
>
> Note that the point of passing PCI host bridges into of_dma_configure()
> in the first place was to avoid having some separate PCI-specific path
> for DMA configuration. I worry that introducing bus-specific dma-ranges
> parsing largely defeats that, since we end up with the worst of both
> worlds; effectively-duplicated code, and/or a load of extra complexity
> to then attempt to reconverge the divergent paths (there really
> shouldn't be any need to allocate a list of anything). Given that
> of_translate_dma_address() is already bus-agnostic, it hardly seems
> justifiable for its caller not to be so as well, especially when it
> mostly just comes down to getting the right #address-cells value.
>
> The patch below is actually enough to make typical cases work, but is
> vile, so I'm not seriously considering it (hence I've not bothered
> making IOMMU configuration handle all circumstances). What it has served
> to do, though, is give me a clear idea of how to properly sort out the
> not-quite-right device/parent assumptions between of_dma_configure() and
> of_dma_get_range() rather than bodging around them any further - stay tuned.
>
> Robin.
>
> ->8-
> From: Robin Murphy 
> Subject: [PATCH] of/pci: Use child node for DMA configuration
>
> of_dma_configure() expects to be passed an OF node representing the
> device being configured - for PCI devices we currently pass the node of
> the a

Re: [RFC PATCH 1/3] of/pci: dma-ranges to account highest possible host bridge dma_mask

2017-03-29 Thread Oza Oza
On Wed, Mar 29, 2017 at 10:13 AM, Oza Oza  wrote:
> On Tue, Mar 28, 2017 at 7:59 PM, Robin Murphy  wrote:
>> On 28/03/17 06:27, Oza Oza wrote:
>>> On Mon, Mar 27, 2017 at 8:16 PM, Rob Herring  wrote:
>>>> On Sat, Mar 25, 2017 at 12:31 AM, Oza Pawandeep  
>>>> wrote:
>>>>> it is possible that PCI device supports 64-bit DMA addressing,
>>>>> and thus it's driver sets device's dma_mask to DMA_BIT_MASK(64),
>>>>> however PCI host bridge may have limitations on the inbound
>>>>> transaction addressing. As an example, consider NVME SSD device
>>>>> connected to iproc-PCIe controller.
>>>>>
>>>>> Currently, the IOMMU DMA ops only considers PCI device dma_mask
>>>>> when allocating an IOVA. This is particularly problematic on
>>>>> ARM/ARM64 SOCs where the IOMMU (i.e. SMMU) translates IOVA to
>>>>> PA for in-bound transactions only after PCI Host has forwarded
>>>>> these transactions on SOC IO bus. This means on such ARM/ARM64
>>>>> SOCs the IOVA of in-bound transactions has to honor the addressing
>>>>> restrictions of the PCI Host.
>>>>>
>>>>> current pcie frmework and of framework integration assumes dma-ranges
>>>>> in a way where memory-mapped devices define their dma-ranges.
>>>>> dma-ranges: (child-bus-address, parent-bus-address, length).
>>>>>
>>>>> but iproc based SOCs and even Rcar based SOCs has PCI world dma-ranges.
>>>>> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>>>>
>>>> If you implement a common function, then I expect to see other users
>>>> converted to use it. There's also PCI hosts in arch/powerpc that parse
>>>> dma-ranges.
>>>
>>> the common function should be similar to what
>>> of_pci_get_host_bridge_resources is doing right now.
>>> it parses ranges property right now.
>>>
>>> the new function would look look following.
>>>
>>> of_pci_get_dma_ranges(struct device_node *dev, struct list_head *resources)
>>> where resources would return the dma-ranges.
>>>
>>> but right now if you see the patch, of_dma_configure calls the new
>>> function, which actually returns the largest possible size.
>>> so this new function has to be generic in a way where other PCI hosts
>>> can use it. but certainly iproc(Broadcom SOC) , rcar based SOCs can
>>> use it for sure.
>>>
>>> although having powerpc using it;  is a separate exercise, since I do
>>> not have any access to other PCI hosts such as powerpc. but we can
>>> workout with them on thsi forum if required.
>>>
>>> so overall, of_pci_get_dma_ranges has to serve following 2 purposes.
>>>
>>> 1) it has to return largest possible size to of_dma_configure to
>>> generate largest possible dma_mask.
>>>
>>> 2) it also has to return resources (dma-ranges) parsed, to the users.
>>>
>>> so to address above needs
>>>
>>> of_pci_get_dma_ranges(struct device_node *dev, struct list_head
>>> *resources, u64 *size)
>>>
>>> dev -> device node.
>>> resources -> dma-ranges in allocated list.
>>> size -> highest possible size to generate possible dma_mask for
>>> of_dma_configure.
>>>
>>> let em know how this sounds.
>>
>> Note that the point of passing PCI host bridges into of_dma_configure()
>> in the first place was to avoid having some separate PCI-specific path
>> for DMA configuration. I worry that introducing bus-specific dma-ranges
>> parsing largely defeats that, since we end up with the worst of both
>> worlds; effectively-duplicated code, and/or a load of extra complexity
>> to then attempt to reconverge the divergent paths (there really
>> shouldn't be any need to allocate a list of anything). Given that
>> of_translate_dma_address() is already bus-agnostic, it hardly seems
>> justifiable for its caller not to be so as well, especially when it
>> mostly just comes down to getting the right #address-cells value.
>>
>> The patch below is actually enough to make typical cases work, but is
>> vile, so I'm not seriously considering it (hence I've not bothered
>> making IOMMU configuration handle all circumstances). What it has served
>> to do, though, is give me a clear idea of how to properly sort out the
>> not-quite-right device/parent assumptions between of_dma_configure() and
>

Re: [RFC PATCH 1/3] of/pci: dma-ranges to account highest possible host bridge dma_mask

2017-03-30 Thread Oza Oza
On Tue, Mar 28, 2017 at 7:43 PM, Rob Herring  wrote:
> On Tue, Mar 28, 2017 at 12:27 AM, Oza Oza  wrote:
>> On Mon, Mar 27, 2017 at 8:16 PM, Rob Herring  wrote:
>>> On Sat, Mar 25, 2017 at 12:31 AM, Oza Pawandeep  
>>> wrote:
>>>> it is possible that PCI device supports 64-bit DMA addressing,
>>>> and thus it's driver sets device's dma_mask to DMA_BIT_MASK(64),
>>>> however PCI host bridge may have limitations on the inbound
>>>> transaction addressing. As an example, consider NVME SSD device
>>>> connected to iproc-PCIe controller.
>>>>
>>>> Currently, the IOMMU DMA ops only considers PCI device dma_mask
>>>> when allocating an IOVA. This is particularly problematic on
>>>> ARM/ARM64 SOCs where the IOMMU (i.e. SMMU) translates IOVA to
>>>> PA for in-bound transactions only after PCI Host has forwarded
>>>> these transactions on SOC IO bus. This means on such ARM/ARM64
>>>> SOCs the IOVA of in-bound transactions has to honor the addressing
>>>> restrictions of the PCI Host.
>>>>
>>>> current pcie frmework and of framework integration assumes dma-ranges
>>>> in a way where memory-mapped devices define their dma-ranges.
>>>> dma-ranges: (child-bus-address, parent-bus-address, length).
>>>>
>>>> but iproc based SOCs and even Rcar based SOCs has PCI world dma-ranges.
>>>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>>>
>>> If you implement a common function, then I expect to see other users
>>> converted to use it. There's also PCI hosts in arch/powerpc that parse
>>> dma-ranges.
>>
>> the common function should be similar to what
>> of_pci_get_host_bridge_resources is doing right now.
>> it parses ranges property right now.
>>
>> the new function would look look following.
>>
>> of_pci_get_dma_ranges(struct device_node *dev, struct list_head *resources)
>> where resources would return the dma-ranges.
>>
>> but right now if you see the patch, of_dma_configure calls the new
>> function, which actually returns the largest possible size.
>> so this new function has to be generic in a way where other PCI hosts
>> can use it. but certainly iproc(Broadcom SOC) , rcar based SOCs can
>> use it for sure.
>>
>> although having powerpc using it;  is a separate exercise, since I do
>> not have any access to other PCI hosts such as powerpc. but we can
>> workout with them on thsi forum if required.
>
> You don't need h/w. You can analyze what parts are common, write
> patches to convert to common implementation, and build test. The PPC
> and rcar folks can test on h/w.
>
> Rob


Hi Rob,

I have addressed your comment and made a generic function.
Gentle request to have a look at the following approach and patches.

[RFC PATCH 2/2] of/pci: call pci specific dma-ranges instead of memory-mapped.
[RFC PATCH 1/2] of/pci: implement inbound dma-ranges for PCI

I have tested this on our platform, with and without the IOMMU, and it seems to work.

Let me know your view on this.

Regards,
Oza.


Re: [PATCH v6 2/3] iommu/pci: reserve IOVA for PCI masters

2017-05-18 Thread Oza Oza
On Wed, May 17, 2017 at 10:41 PM, Bjorn Helgaas  wrote:
> On Tue, May 16, 2017 at 10:52:06AM +0530, Oza Pawandeep wrote:
>> this patch reserves the IOVA for PCI masters.
>> ARM64 based SOCs may have scattered memory banks.
>> such as iproc based SOC has
>>
>> <0x 0x8000 0x0 0x8000>, /* 2G @ 2G */
>> <0x0008 0x8000 0x3 0x8000>, /* 14G @ 34G */
>> <0x0090 0x 0x4 0x>, /* 16G @ 576G */
>> <0x00a0 0x 0x4 0x>; /* 16G @ 640G */
>>
>> but incoming PCI transcation addressing capability is limited
>
> s/transcation/transaction/
>
>> by host bridge, for example if max incoming window capability
>> is 512 GB, then 0x0090 and 0x00a0 will fall beyond it.
>>
>> to address this problem, iommu has to avoid allocating IOVA which
>
> s/iommu/IOMMU/
>
>> are reserved. which inturn does not allocate IOVA if it falls into hole.
>
> s/inturn/in turn/

Hi Bjorn,

Thank you for the comments.
Will take care of all your comments.

Regards,
Oza.


Re: [PATCH v6 1/3] of/pci/dma: fix DMA configuration for PCI masters

2017-05-18 Thread Oza Oza
On Wed, May 17, 2017 at 10:40 PM, Bjorn Helgaas  wrote:
> On Tue, May 16, 2017 at 10:52:05AM +0530, Oza Pawandeep wrote:
>> current device framework and OF framework integration assumes
>
> s/current/The current/
>
>> dma-ranges in a way where memory-mapped devices define their
>> dma-ranges. (child-bus-address, parent-bus-address, length).
>>
>> of_dma_configure is specifically written to take care of memory
>> mapped devices. but no implementation exists for pci to take
>> care of pcie based memory ranges.
>
> s/pci/PCI/  (also other occurrences below)
> s/pcie/PCIe/
>
> I don't see how PCIe is relevant here.  The bridge might support PCIe,
> but I don't think anything here is actually specific to PCIe.  If
> that's the case, I think it's confusing to mention PCIe.

It attempts to fix of_dma_get_range for PCI masters, because it
currently returns *size as 0 to its caller (of_dma_configure),
resulting in the largest possible dma_mask, which is a 64-bit mask on armv8.
That has usually worked so far because other SOCs' PCI RCs do not have
the limitation of the Broadcom iproc-based PCI RC: our RC drops 64-bit
IOVAs because it is not capable of addressing the entire 64-bit range.

In fact there are two real problems; please allow me to explain.
Please also refer to my next mail in reply to Arnd Bergmann.
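As a minimal sketch of what should happen (assuming the usual
mask-from-size computation in of_dma_configure, a single inbound window
starting at 0, and dev being the PCI device), the 512GB window would give:

	u64 dma_addr = 0;		/* window base */
	u64 size = 0x8000000000ULL;	/* 512GB from the dma-ranges */

	/* of_dma_configure()-style mask derivation */
	dev->coherent_dma_mask = DMA_BIT_MASK(ilog2(dma_addr + size));
	/* i.e. 0x7fffffffff, instead of the full 64-bit mask we get today */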

>
>> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
>> world dma-ranges.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>>
>> this patch serves following:
>>
>> 1) exposes interface to the pci host driver for their
>> inbound memory ranges
>>
>> 2) provide an interface to callers such as of_dma_get_ranges.
>> so then the returned size get best possible (largest) dma_mask.
>> because PCI RC drivers do not call APIs such as
>> dma_set_coherent_mask() and hence rather it shows its addressing
>> capabilities based on dma-ranges.
>> for e.g.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>> we should get dev->coherent_dma_mask=0x7f.
>>
>> 3) this patch handles multiple inbound windows and dma-ranges.
>> it is left to the caller, how it wants to use them.
>> the new function returns the resources in a standard and unform way
>>
>> 4) this way the callers of for e.g. of_dma_get_ranges
>> does not need to change.
>
> Please start sentences with a capital letter.

Will take care of your comments.
Thanks,
Oza.


Re: [PATCH v6 1/3] of/pci/dma: fix DMA configuration for PCI masters

2017-05-18 Thread Oza Oza
On Thu, May 18, 2017 at 12:43 AM, Arnd Bergmann  wrote:
> On Tue, May 16, 2017 at 7:22 AM, Oza Pawandeep  wrote:
>> current device framework and OF framework integration assumes
>> dma-ranges in a way where memory-mapped devices define their
>> dma-ranges. (child-bus-address, parent-bus-address, length).
>>
>> of_dma_configure is specifically written to take care of memory
>> mapped devices. but no implementation exists for pci to take
>> care of pcie based memory ranges.
>
> Hi Oza,
>
> I'm trying to make sense of this, but am still rather puzzled. I have
> no idea what the distinction between memory-mapped devices and
> pcie based devices is in your description, as PCIe is usually memory
> mapped, and Linux doesn't actually support other kinds of PCIe
> devices on most architectures.
>
There are two problems which I am trying to address here.

Problem 1:
Let me explain our PCI RC's limitations first.

IOVA allocation honours the device's coherent_dma_mask/dma_mask.
In the PCI case, the current code honours the DMA mask set by the EP;
there is no concept of a PCI host bridge dma-mask, which should exist
and would truly reflect the limitation of the PCI host bridge.

Having said that, we have
dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
which means we can only address 512GB.

Now, because of the broken of_dma_get_range, we end up with a 64-bit dma_mask.
Please check the code in of_dma_configure():
if (ret < 0) {
dma_addr = offset = 0;
size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);

In this process I figured out the following problems in of_dma_get_range, hence the fix:
1) it returns a wrong size of 0 because of a parsing problem.
2) it does not handle the absence of dma-ranges, which is valid for a PCI master.
3) it does not handle multiple inbound windows.
4) in order to get the largest possible dma_mask, the patch also returns
the largest possible size based on dma-ranges.

Please have a look at
[PATCH v6 3/3] PCI/of fix of_dma_get_range; get PCI specific dma-ranges
I just made it bus-specific, leaving the original of_dma_get_range unmodified
and defining a new PCI handler, of_bus_pci_get_dma_ranges
(a bare-bones sketch of the idea follows below).
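The sketch below is an illustration only, not the posted patch; it
assumes the PCI child address is 3 cells and the length is 2 cells (as in
the dma-ranges example above) and only reports the largest window size:

static int of_pci_dma_ranges_largest_size(struct device_node *np, u64 *size)
{
	const int na = 3, ns = 2;	/* PCI child address/size cells (assumed) */
	int pna = of_n_addr_cells(np);	/* parent #address-cells */
	int entry = na + pna + ns;	/* cells per dma-ranges entry */
	const __be32 *ranges;
	int rlen;
	u64 largest = 0;

	ranges = of_get_property(np, "dma-ranges", &rlen);
	if (!ranges)
		return -ENODEV;

	for (rlen /= 4; rlen >= entry; rlen -= entry, ranges += entry) {
		u64 sz = of_read_number(ranges + na + pna, ns);

		if (sz > largest)
			largest = sz;
	}

	*size = largest;
	return largest ? 0 : -EINVAL;
}

The real patch additionally hands the parsed windows back to the caller on
a resource list; the above only shows how the size that feeds the dma_mask
would be derived.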

Also, when I say memory-mapped and PCI device, I only mean it with respect
to the dma-ranges format (of course PCI is memory mapped as well).
Probably my commit description is misleading; sorry about that.

So Problem 1 is just a bug fix, nothing else.


Problem 2: [PATCH v6 2/3] iommu/pci: reserve IOVA for PCI masters

We have the following memory banks:

<0x 0x8000 0x0 0x8000>, /* 2G @ 2G */
<0x0008 0x8000 0x3 0x8000>, /* 14G @ 34G */
<0x0090 0x 0x4 0x>, /* 16G @ 576G */
<0x00a0 0x 0x4 0x>; /* 16G @ 640G */

When I run SPDK (user space), which internally uses vfio to access the PCI
endpoint directly, vfio uses huge pages which could come from the bank at
640G/0x00a0, and the way vfio maps the huge page to user space and
generates the IOVA is different from the way the kernel allocates IOVAs.

vfio just maps the IOVA one-to-one to the physical address;
it directly calls remap_pfn_range.

So the way the kernel allocates IOVAs (where it honours the device dma_mask)
and the way user space gets IOVAs are totally different.

So dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
will not work.
Instead we have to go for scattered dma-ranges, leaving holes.

Having said that, we have to reserve IOVA allocations for the inbound memory.
I am in the process of addressing Robin Murphy's comments on that and rebasing
my patch on rc12.

This problem statement is more important to us,
because it makes both kernel and user-space IOVA allocations work when the
IOMMU is enabled.
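To illustrate with the numbers above (a sketch only, using the same
resv-region API as the patch under discussion, with `list` being the head
passed to the resv-regions callback): with one inbound window of 512GB
starting at 0, the only hole to reserve is everything above the window,
so the logic boils down to:

	/* memory at 576GB/640GB is unreachable: reserve IOVAs above 512GB */
	phys_addr_t hole_start = 0x8000000000ULL;	/* 512GB */
	size_t hole_len = DMA_BIT_MASK(sizeof(dma_addr_t) * 8) - hole_start;
	struct iommu_resv_region *region;

	region = iommu_alloc_resv_region(hole_start, hole_len, 0,
					 IOMMU_RESV_RESERVED);
	if (region)
		list_add_tail(&region->list, list);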

Probably things look confusing because I am clubbing my patches to address
both problems. Going forward I should perhaps first send out the patch for
Problem 2 alone (not sure), because my next patch-set would bring some
changes in pci/probe.c as well.

>> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
>> world dma-ranges.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>>
>> this patch serves following:
>>
>> 1) exposes interface to the pci host driver for their
>> inbound memory ranges
>>
>> 2) provide an interface to callers such as of_dma_get_ranges.
>> so then the returned size get best possible (largest) dma_mask.
>> because PCI RC drivers do not call APIs such as
>> dma_set_coherent_mask() and hence rather it shows its addressing
>> capabilities based on dma-ranges.
>>
>> for e.g.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>> we should get dev->coherent_dma_mask=0x7f.
>
> do you mean the coherent_dma_mask of the PCI host bridge
> or an attached device here?
>
> If you require PCI devices to come up with an initial
> coherent_dma_mask other than 0xff, there are other
> problems involved. In particular, you will need to use
> swiotlb, which is not supported on arm32 at the moment,
> and the dma_set_mask()/dma_set_coherent_mask()
> functions need to be modified.

Even without this patch, it comes up with coherent_

Re: [PATCH v4 2/2] PCI: iproc: add device shutdown for PCI RC

2017-07-05 Thread Oza Oza
On Wed, Jul 5, 2017 at 9:21 AM, Ray Jui  wrote:
> Hi Oza,
>
> It looks like you missed my comments during the internal review. See my
> comments inline below.
>
>
>
> On 7/4/2017 8:08 PM, Oza Pawandeep wrote:
>>
>> PERST must be asserted around ~500ms before the reboot is applied.
>>
>> During soft reset (e.g., "reboot" from Linux) on some iproc based SOCs
>> LCPLL clock and PERST both goes off simultaneously.
>> This will cause certain Endpoints Intel NVMe not get detected, upon
>> next boot sequence.
>>
>> This is specifically happening with Intel P3700 400GB series.
>> Endpoint is expecting the clock for some amount of time after PERST is
>> asserted, which is not happening in Stingray (iproc based SOC).
>> This causes NVMe to behave in undefined way.
>>
>> On the contrary Intel x86 boards, will have ref clock running, so we
>> do not see this behavior there.
>>
>> Besides, PCI spec does not stipulate about such timings.
>> In-fact it does not tell us, whether PCIe device should consider
>> refclk to be available or not to be.
>>
>> It is completely up to vendor to design their EP the way they want
>> with respect to ref clock availability.
>>
>> 500ms is just based on the observation and taken as safe margin.
>> This patch adds platform shutdown where it should be
>> called in device_shutdown while reboot command is issued.
>> So in sequence first Endpoint Shutdown (e.g. nvme_shutdown)
>> followed by RC shutdown, which issues safe PERST assertion.
>>
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-by: Ray Jui 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/pci/host/pcie-iproc-platform.c
>> b/drivers/pci/host/pcie-iproc-platform.c
>> index 90d2bdd..9512960 100644
>> --- a/drivers/pci/host/pcie-iproc-platform.c
>> +++ b/drivers/pci/host/pcie-iproc-platform.c
>> @@ -131,6 +131,13 @@ static int iproc_pcie_pltfm_remove(struct
>> platform_device *pdev)
>> return iproc_pcie_remove(pcie);
>>   }
>>   +static void iproc_pcie_pltfm_shutdown(struct platform_device *pdev)
>> +{
>> +   struct iproc_pcie *pcie = platform_get_drvdata(pdev);
>> +
>> +   iproc_pcie_shutdown(pcie);
>> +}
>> +
>>   static struct platform_driver iproc_pcie_pltfm_driver = {
>> .driver = {
>> .name = "iproc-pcie",
>> @@ -138,6 +145,7 @@ static int iproc_pcie_pltfm_remove(struct
>> platform_device *pdev)
>> },
>> .probe = iproc_pcie_pltfm_probe,
>> .remove = iproc_pcie_pltfm_remove,
>> +   .shutdown = iproc_pcie_pltfm_shutdown,
>>   };
>>   module_platform_driver(iproc_pcie_pltfm_driver);
>>   diff --git a/drivers/pci/host/pcie-iproc.c
>> b/drivers/pci/host/pcie-iproc.c
>> index b0abcd7..d1bcdc9 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -616,32 +616,40 @@ static int iproc_pcie_config_write32(struct pci_bus
>> *bus, unsigned int devfn,
>> .write = iproc_pcie_config_write32,
>>   };
>>   -static void iproc_pcie_reset(struct iproc_pcie *pcie)
>> +static void iproc_pcie_perst_ctrl(struct iproc_pcie *pcie, bool assert)
>>   {
>> u32 val;
>> /*
>> -* PAXC and the internal emulated endpoint device downstream
>> should not
>> -* be reset.  If firmware has been loaded on the endpoint device
>> at an
>> -* earlier boot stage, reset here causes issues.
>> +* The internal emulated endpoints (such as PAXC) device
>> downstream
>> +* should not be reset. If firmware has been loaded on the
>> endpoint
>> +* device at an earlier boot stage, reset here causes issues.
>>  */
>
>
> Why is the above comment modified? Is it even related to this patch that
> implements the shutdown routine for PAXB? In addition, the original
> comment is more correct. The new comment by stating the "internal
> emulated endpoints (such as PAXC)" is wrong. PAXC is not an endpoint.
> PAXC is the root complex interface. The endpoint is Nitro.
>
> Thanks,
>
> Ray

Thanks, Ray, for pointing this out. I did not intend to change this, but
for some reason the comment ended up modified.

Regards,
Oza.

>
>> if (pcie->ep_is_internal)
>> return;
>>   - /*
>> -* Select perst_b signal as reset source. Put the device into
>> reset,
>> -* and then bring it out of reset
>> -*/
>> -   val = iproc_pcie_read_reg(pcie, IPROC_PCIE_CLK_CTRL);
>> -   val &= ~EP_PERST_SOURCE_SELECT & ~EP_MODE_SURVIVE_PERST &
>> -   ~RC_PCIE_RST_OUTPUT;
>> -   iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>> -   udelay(250);
>> -
>> -   val |= RC_PCIE_RST_OUTPUT;
>> -   iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>> -   msleep(100);
>> +   if (assert) {
>> +   val = iproc_pcie_read_reg(pcie, IPROC_PCIE_CLK_CTRL);
>> +   val &= ~EP_PERST_SOURCE_SELECT & ~EP_MODE_SURVIVE_PERST &
>> +   ~RC_PCIE_RST_OUTPUT;
>> +   iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>>

Re: [PATCH v7 3/3] IOMMU/PCI: Reserve IOVA for inbound memory for PCI masters

2017-07-19 Thread Oza Oza
Hi Robin,

My apologies for the noise.

I have taken care of your comments,
but this whole patch-set (especially the PCI patch-set adding the inbound
memory windows) depends on Lorenzo's patch-set.
So I will be posting the version 8 patches for IOVA reservation soon after
Lorenzo's patches are merged.

Regards,
Oza.

On Mon, May 22, 2017 at 10:09 PM, Oza Pawandeep  wrote:
> This patch reserves the inbound memory holes for PCI masters.
> ARM64 based SOCs may have scattered memory banks.
> For e.g as iproc based SOC has
>
> <0x 0x8000 0x0 0x8000>, /* 2G @ 2G */
> <0x0008 0x8000 0x3 0x8000>, /* 14G @ 34G */
> <0x0090 0x 0x4 0x>, /* 16G @ 576G */
> <0x00a0 0x 0x4 0x>; /* 16G @ 640G */
>
> But incoming PCI transaction addressing capability is limited
> by host bridge, for example if max incoming window capability
> is 512 GB, then 0x0090 and 0x00a0 will fall beyond it.
>
> To address this problem, iommu has to avoid allocating IOVA which
> are reserved.
>
> Which in turn does not allocate IOVA if it falls into hole.
> and the holes should be reserved before any of the IOVA allocations
> can happen.
>
> Signed-off-by: Oza Pawandeep 
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 8348f366..efe3d07 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -171,16 +171,15 @@ void iommu_dma_get_resv_regions(struct device *dev, 
> struct list_head *list)
>  {
> struct pci_host_bridge *bridge;
> struct resource_entry *window;
> +   struct iommu_resv_region *region;
> +   phys_addr_t start, end;
> +   size_t length;
>
> if (!dev_is_pci(dev))
> return;
>
> bridge = pci_find_host_bridge(to_pci_dev(dev)->bus);
> resource_list_for_each_entry(window, &bridge->windows) {
> -   struct iommu_resv_region *region;
> -   phys_addr_t start;
> -   size_t length;
> -
> if (resource_type(window->res) != IORESOURCE_MEM)
> continue;
>
> @@ -193,6 +192,43 @@ void iommu_dma_get_resv_regions(struct device *dev, 
> struct list_head *list)
>
> list_add_tail(®ion->list, list);
> }
> +
> +   /* PCI inbound memory reservation. */
> +   start = length = 0;
> +   resource_list_for_each_entry(window, &bridge->inbound_windows) {
> +   end = window->res->start - window->offset;
> +
> +   if (start > end) {
> +   /* multiple ranges assumed sorted. */
> +   pr_warn("PCI: failed to reserve iovas\n");
> +   return;
> +   }
> +
> +   if (start != end) {
> +   length = end - start - 1;
> +   region = iommu_alloc_resv_region(start, length, 0,
> +   IOMMU_RESV_RESERVED);
> +   if (!region)
> +   return;
> +
> +   list_add_tail(®ion->list, list);
> +   }
> +
> +   start += end + length + 1;
> +   }
> +   /*
> +* the last dma-range should honour based on the
> +* 32/64-bit dma addresses.
> +*/
> +   if ((start) && (start < DMA_BIT_MASK(sizeof(dma_addr_t) * 8))) {
> +   length = DMA_BIT_MASK((sizeof(dma_addr_t) * 8)) - 1;
> +   region = iommu_alloc_resv_region(start, length, 0,
> +   IOMMU_RESV_RESERVED);
> +   if (!region)
> +   return;
> +
> +   list_add_tail(®ion->list, list);
> +   }
>  }
>  EXPORT_SYMBOL(iommu_dma_get_resv_regions);
>
> --
> 1.9.1
>


Re: [PATCH v8 2/3] PCI: iproc: retry request when CRS returned from EP

2017-08-28 Thread Oza Oza
On Tue, Aug 29, 2017 at 12:47 AM, Bjorn Helgaas  wrote:
> On Thu, Aug 24, 2017 at 10:34:25AM +0530, Oza Pawandeep wrote:
>> PCIe spec r3.1, sec 2.3.2
>> If CRS software visibility is not enabled, the RC must reissue the
>> config request as a new request.
>>
>> - If CRS software visibility is enabled,
>> - for a config read of Vendor ID, the RC must return 0x0001 data
>> - for all other config reads/writes, the RC must reissue the
>>   request
>>
>> iproc PCIe Controller spec:
>> 4.7.3.3. Retry Status On Configuration Cycle
>> Endpoints are allowed to generate retry status on configuration
>> cycles. In this case, the RC needs to re-issue the request. The IP
>> does not handle this because the number of configuration cycles needed
>> will probably be less than the total number of non-posted operations
>> needed.
>>
>> When a retry status is received on the User RX interface for a
>> configuration request that was sent on the User TX interface,
>> it will be indicated with a completion with the CMPL_STATUS field set
>> to 2=CRS, and the user will have to find the address and data values
>> and send a new transaction on the User TX interface.
>> When the internal configuration space returns a retry status during a
>> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> CRS status.
>> When the CRS Software Visibility Enable register in the Root Control
>> register is enabled, the IP will return the data value to 0x0001 for
>> the Vendor ID value and 0x  (all 1’s) for the rest of the data in
>> the request for reads of offset 0 that return with CRS status.  This
>> is true for both the User RX Interface and for the Command/Status
>> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> field of the completion on the User RX Interface will not be 2=CRS and
>> the pcie_cscrs signal will not assert on the Command/Status interface.
>>
>> Per PCIe r3.1, sec 2.3.2, config requests that receive completions
>> with Configuration Request Retry Status (CRS) should be reissued by
>> the hardware except reads of the Vendor ID when CRS Software
>> Visibility is enabled.
>>
>> This hardware never reissues configuration requests when it receives
>> CRS completions.
>> Note that, neither PCIe host bridge nor PCIe core re-issues the
>> request for any configuration offset.
>>
>> For config reads, this hardware returns CFG_RETRY_STATUS data when
>> it receives a CRS completion for a config read, regardless of the
>> address of the read or the CRS Software Visibility Enable bit.
>
> I can't remember how Stingray handles the CRS Software Visibility Enable
> bit.  Is it a read-only zero?  Is it writable?  Does the hardware look at
> it all (I think not)?

The HW doesn't look at it at all; it is a "don't care" bit.

Regards,
Oza.


Re: [PATCH v8 2/3] PCI: iproc: retry request when CRS returned from EP

2017-08-28 Thread Oza Oza
On Tue, Aug 29, 2017 at 1:24 AM, Bjorn Helgaas  wrote:
> On Tue, Aug 29, 2017 at 01:09:53AM +0530, Oza Oza wrote:
>> On Tue, Aug 29, 2017 at 12:47 AM, Bjorn Helgaas  wrote:
>> > On Thu, Aug 24, 2017 at 10:34:25AM +0530, Oza Pawandeep wrote:
>> >> PCIe spec r3.1, sec 2.3.2
>> >> If CRS software visibility is not enabled, the RC must reissue the
>> >> config request as a new request.
>> >>
>> >> - If CRS software visibility is enabled,
>> >> - for a config read of Vendor ID, the RC must return 0x0001 data
>> >> - for all other config reads/writes, the RC must reissue the
>> >>   request
>> >>
>> >> iproc PCIe Controller spec:
>> >> 4.7.3.3. Retry Status On Configuration Cycle
>> >> Endpoints are allowed to generate retry status on configuration
>> >> cycles. In this case, the RC needs to re-issue the request. The IP
>> >> does not handle this because the number of configuration cycles needed
>> >> will probably be less than the total number of non-posted operations
>> >> needed.
>> >>
>> >> When a retry status is received on the User RX interface for a
>> >> configuration request that was sent on the User TX interface,
>> >> it will be indicated with a completion with the CMPL_STATUS field set
>> >> to 2=CRS, and the user will have to find the address and data values
>> >> and send a new transaction on the User TX interface.
>> >> When the internal configuration space returns a retry status during a
>> >> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> >> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> >> CRS status.
>> >> When the CRS Software Visibility Enable register in the Root Control
>> >> register is enabled, the IP will return the data value to 0x0001 for
>> >> the Vendor ID value and 0x  (all 1’s) for the rest of the data in
>> >> the request for reads of offset 0 that return with CRS status.  This
>> >> is true for both the User RX Interface and for the Command/Status
>> >> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> >> field of the completion on the User RX Interface will not be 2=CRS and
>> >> the pcie_cscrs signal will not assert on the Command/Status interface.
>> >>
>> >> Per PCIe r3.1, sec 2.3.2, config requests that receive completions
>> >> with Configuration Request Retry Status (CRS) should be reissued by
>> >> the hardware except reads of the Vendor ID when CRS Software
>> >> Visibility is enabled.
>> >>
>> >> This hardware never reissues configuration requests when it receives
>> >> CRS completions.
>> >> Note that, neither PCIe host bridge nor PCIe core re-issues the
>> >> request for any configuration offset.
>> >>
>> >> For config reads, this hardware returns CFG_RETRY_STATUS data when
>> >> it receives a CRS completion for a config read, regardless of the
>> >> address of the read or the CRS Software Visibility Enable bit.
>> >
>> > I can't remember how Stingray handles the CRS Software Visibility Enable
>> > bit.  Is it a read-only zero?  Is it writable?  Does the hardware look at
>> > it all (I think not)?
>>
>> HW doesnt look it at all, it is "dont care" bit.
>
> Sigh, I made the classic mistake of asking more than one question, and
> I guess I'm about to do it again :)  It'll save time if you can answer
> all the questions at once.
>
> Linux enables PCI_EXP_RTCTL_CRSSVE if PCI_EXP_RTCAP says it is
> supported (see pci_enable_crs()).
>
>   - What does PCI_EXP_RTCAP say?
It is read as set: value 0x1.
>
>   - Is PCI_EXP_RTCTL_CRSSVE writable?
Yes, it is:
it is read as 0 and written as 1.
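
(For reference, the core-side enable that Bjorn refers to, pci_enable_crs(),
does roughly the following; quoted from memory, so treat it as a sketch,
with pdev being the Root Port device:

	u16 root_cap = 0;

	/* Enable CRS Software Visibility if the Root Port advertises it */
	pcie_capability_read_word(pdev, PCI_EXP_RTCAP, &root_cap);
	if (root_cap & PCI_EXP_RTCAP_CRSVIS)
		pcie_capability_set_word(pdev, PCI_EXP_RTCTL,
					 PCI_EXP_RTCTL_CRSSVE);
)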
>
> With all the CRS-related work going on, I suspect we may someday want to
> read PCI_EXP_RTCTL_CRSSVE to see if CRS SV is enabled.
>
> Bjorn


Re: [PATCH v8 0/3] PCI: iproc: SOC specific fixes

2017-08-28 Thread Oza Oza
On Tue, Aug 29, 2017 at 3:23 AM, Bjorn Helgaas  wrote:
> On Thu, Aug 24, 2017 at 10:34:23AM +0530, Oza Pawandeep wrote:
>> PCI: iproc: Retry request when CRS returned from EP Above patch adds
>> support for CRS in PCI RC driver, otherwise if not handled at lower
>> level, the user space PMD (poll mode drivers) can timeout.
>>
>> PCI: iproc: add device shutdown for PCI RC This fixes the issue where
>> certian PCI endpoints are not getting detected on Stingray SOC after
>> reboot.
>>
>> Changes Since v7:
>> Factor out the ep config access code.
>>
>> Changes Since v6:
>> Rebased patches on top of Lorenzo's patches.
>> Bjorn's comments addressed.
>> now the confg retry returns 0x as data.
>> Added reference to PCIe spec and iproc Controller spec in Changelog.
>>
>> Changes Since v5:
>> Ray's comments addressed.
>>
>> Changes Since v4:
>> Bjorn's comments addressed.
>>
>> Changes Since v3:
>> [re-send]
>>
>> Changes Since v2:
>> Fix compilation errors for pcie-iproc-platform.ko which was caught
>> by kbuild.
>>
>> Oza Pawandeep (3):
>>   PCI: iproc: factor-out ep configuration access
>>   PCI: iproc: Retry request when CRS returned from EP
>>   PCI: iproc: add device shutdown for PCI RC
>>
>>  drivers/pci/host/pcie-iproc-platform.c |   8 ++
>>  drivers/pci/host/pcie-iproc.c  | 143 
>> ++---
>>  drivers/pci/host/pcie-iproc.h  |   1 +
>>  3 files changed, 124 insertions(+), 28 deletions(-)
>
> I applied these to pci/host-iproc for v4.14.  Man, this is ugly.
>
> I reworked the changelog to try to make it more readable.  I also tried to
> disable the PCI_EXP_RTCAP_CRSVIS bit, which advertises CRS SV support.  And
> I removed what looked like a duplicate pci_generic_config_read32() call.
> And I added a warning about the fact that we corrupt reads of config
> registers that happen to contain 0x0001.
>
> I'm pretty sure I broke something, so please take a look.

I appreciate your time on the PCI_EXP_RTCAP_CRSVIS change and the other changes.
I just tested the patch, and it works fine,
which tells us that the CRS visibility bit has no effect.

So things look okay to me.

Regards,
Oza.
>
> Incremental diff from your v8 to what's on pci/host-iproc:
>
> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
> index cbdabe8a073e..8bd5e544b1c1 100644
> --- a/drivers/pci/host/pcie-iproc.c
> +++ b/drivers/pci/host/pcie-iproc.c
> @@ -69,7 +69,7 @@
>  #define APB_ERR_EN   BIT(APB_ERR_EN_SHIFT)
>
>  #define CFG_RETRY_STATUS 0x0001
> -#define CFG_RETRY_STATUS_TIMEOUT_US  50 /* 500 milli-seconds. */
> +#define CFG_RETRY_STATUS_TIMEOUT_US  50 /* 500 milliseconds */
>
>  /* derive the enum index of the outbound/inbound mapping registers */
>  #define MAP_REG(base_reg, index)  ((base_reg) + (index) * 2)
> @@ -482,17 +482,21 @@ static unsigned int iproc_pcie_cfg_retry(void __iomem 
> *cfg_data_p)
> unsigned int data;
>
> /*
> -* As per PCIe spec r3.1, sec 2.3.2, CRS Software
> -* Visibility only affects config read of the Vendor ID.
> -* For config write or any other config read the Root must
> -* automatically re-issue configuration request again as a
> -* new request.
> +* As per PCIe spec r3.1, sec 2.3.2, CRS Software Visibility only
> +* affects config reads of the Vendor ID.  For config writes or any
> +* other config reads, the Root may automatically reissue the
> +* configuration request again as a new request.
>  *
> -* For config reads, this hardware returns CFG_RETRY_STATUS data when
> -* it receives a CRS completion for a config read, regardless of the
> -* address of the read or the CRS Software Visibility Enable bit. As a
> +* For config reads, this hardware returns CFG_RETRY_STATUS data
> +* when it receives a CRS completion, regardless of the address of
> +* the read or the CRS Software Visibility Enable bit.  As a
>  * partial workaround for this, we retry in software any read that
>  * returns CFG_RETRY_STATUS.
> +*
> +* Note that a non-Vendor ID config register may have a value of
> +* CFG_RETRY_STATUS.  If we read that, we can't distinguish it from
> +* a CRS completion, so we will incorrectly retry the read and
> +* eventually return the wrong data (0x).
>  */
> data = readl(cfg_data_p);
> while (data == CFG_RETRY_STATUS && timeout--) {
> @@ -515,10 +519,19 @@ static int iproc_pcie_config_read(struct pci_bus *bus, 
> unsigned int devfn,
> unsigned int busno = bus->number;
> void __iomem *cfg_data_p;
> unsigned int data;
> +   int ret;
>
> -   /* root complex access. */
> -   if (busno == 0)
> -   return pci_generic_config_read32(bus, devfn, where, size, 
> val);
> +   /* root complex access */
> +   if (busno == 0) {
> + 

Re: [PATCH v8 2/3] PCI: iproc: retry request when CRS returned from EP

2017-08-28 Thread Oza Oza
On Tue, Aug 29, 2017 at 3:17 AM, Bjorn Helgaas  wrote:
> On Thu, Aug 24, 2017 at 10:34:25AM +0530, Oza Pawandeep wrote:
>> PCIe spec r3.1, sec 2.3.2
>> If CRS software visibility is not enabled, the RC must reissue the
>> config request as a new request.
>>
>> - If CRS software visibility is enabled,
>> - for a config read of Vendor ID, the RC must return 0x0001 data
>> - for all other config reads/writes, the RC must reissue the
>>   request
>>
>> iproc PCIe Controller spec:
>> 4.7.3.3. Retry Status On Configuration Cycle
>> Endpoints are allowed to generate retry status on configuration
>> cycles. In this case, the RC needs to re-issue the request. The IP
>> does not handle this because the number of configuration cycles needed
>> will probably be less than the total number of non-posted operations
>> needed.
>>
>> When a retry status is received on the User RX interface for a
>> configuration request that was sent on the User TX interface,
>> it will be indicated with a completion with the CMPL_STATUS field set
>> to 2=CRS, and the user will have to find the address and data values
>> and send a new transaction on the User TX interface.
>> When the internal configuration space returns a retry status during a
>> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> CRS status.
>> When the CRS Software Visibility Enable register in the Root Control
>> register is enabled, the IP will return the data value to 0x0001 for
>> the Vendor ID value and 0x  (all 1’s) for the rest of the data in
>> the request for reads of offset 0 that return with CRS status.  This
>> is true for both the User RX Interface and for the Command/Status
>> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> field of the completion on the User RX Interface will not be 2=CRS and
>> the pcie_cscrs signal will not assert on the Command/Status interface.
>>
>> Per PCIe r3.1, sec 2.3.2, config requests that receive completions
>> with Configuration Request Retry Status (CRS) should be reissued by
>> the hardware except reads of the Vendor ID when CRS Software
>> Visibility is enabled.
>>
>> This hardware never reissues configuration requests when it receives
>> CRS completions.
>> Note that, neither PCIe host bridge nor PCIe core re-issues the
>> request for any configuration offset.
>>
>> For config reads, this hardware returns CFG_RETRY_STATUS data when
>> it receives a CRS completion for a config read, regardless of the
>> address of the read or the CRS Software Visibility Enable bit.
>>
>> This patch implements iproc_pcie_config_read which gets called for
>> Stingray, if it receives a CRS completion, it retries reading it again.
>> In case of timeout, it returns 0x.
>> For other iproc based SOC, it falls back to PCI generic APIs.
>>
>> Signed-off-by: Oza Pawandeep 
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index 61d9be6..37f4adf 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -68,6 +68,9 @@
>>  #define APB_ERR_EN_SHIFT 0
>>  #define APB_ERR_EN   BIT(APB_ERR_EN_SHIFT)
>>
>> +#define CFG_RETRY_STATUS 0x0001
>> +#define CFG_RETRY_STATUS_TIMEOUT_US  50 /* 500 milli-seconds. */
>> +
>>  /* derive the enum index of the outbound/inbound mapping registers */
>>  #define MAP_REG(base_reg, index)  ((base_reg) + (index) * 2)
>>
>> @@ -473,6 +476,64 @@ static void __iomem *iproc_pcie_map_ep_cfg_reg(struct 
>> iproc_pcie *pcie,
>>   return (pcie->base + offset);
>>  }
>>
>> +static unsigned int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
>> +{
>> + int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>> + unsigned int data;
>> +
>> + /*
>> +  * As per PCIe spec r3.1, sec 2.3.2, CRS Software
>> +  * Visibility only affects config read of the Vendor ID.
>> +  * For config write or any other config read the Root must
>> +  * automatically re-issue configuration request again as a
>> +  * new request.
>> +  *
>> +  * For config reads, this hardware returns CFG_RETRY_STATUS data when
>> +  * it receives a CRS completion for a config read, regardless of the
>> +  * address of the read or the CRS Software Visibility Enable bit. As a
>> +  * partial workaround for this, we retry in software any read that
>> +  * returns CFG_RETRY_STATUS.
>> +  */
>> + data = readl(cfg_data_p);
>> + while (data == CFG_RETRY_STATUS && timeout--) {
>> + udelay(1);
>> + data = readl(cfg_data_p);
>> + }
>> +
>> + if (data == CFG_RETRY_STATUS)
>> + data = 0x;
>> +
>> + return data;
>> +}
>> +
>> +static int iproc_pcie_config_read(struct pci_bus *bus, unsigned int devfn,
>> + int where, int size, u32 *val)
>> +{
>> + struct iproc_pcie *pcie = iproc_data(bus);
>> + u

Re: [PATCH v8 2/3] PCI: iproc: retry request when CRS returned from EP

2017-08-30 Thread Oza Oza
On Wed, Aug 30, 2017 at 1:32 AM, Bjorn Helgaas  wrote:
> On Tue, Aug 29, 2017 at 11:02:23AM +0530, Oza Oza wrote:
>> On Tue, Aug 29, 2017 at 3:17 AM, Bjorn Helgaas  wrote:
>> > On Thu, Aug 24, 2017 at 10:34:25AM +0530, Oza Pawandeep wrote:
>> >> PCIe spec r3.1, sec 2.3.2
>> >> If CRS software visibility is not enabled, the RC must reissue the
>> >> config request as a new request.
>> >>
>> >> - If CRS software visibility is enabled,
>> >> - for a config read of Vendor ID, the RC must return 0x0001 data
>> >> - for all other config reads/writes, the RC must reissue the
>> >>   request
>> >>
>> >> iproc PCIe Controller spec:
>> >> 4.7.3.3. Retry Status On Configuration Cycle
>> >> Endpoints are allowed to generate retry status on configuration
>> >> cycles. In this case, the RC needs to re-issue the request. The IP
>> >> does not handle this because the number of configuration cycles needed
>> >> will probably be less than the total number of non-posted operations
>> >> needed.
>> >>
>> >> When a retry status is received on the User RX interface for a
>> >> configuration request that was sent on the User TX interface,
>> >> it will be indicated with a completion with the CMPL_STATUS field set
>> >> to 2=CRS, and the user will have to find the address and data values
>> >> and send a new transaction on the User TX interface.
>> >> When the internal configuration space returns a retry status during a
>> >> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> >> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> >> CRS status.
>> >> When the CRS Software Visibility Enable register in the Root Control
>> >> register is enabled, the IP will return the data value to 0x0001 for
>> >> the Vendor ID value and 0x  (all 1’s) for the rest of the data in
>> >> the request for reads of offset 0 that return with CRS status.  This
>> >> is true for both the User RX Interface and for the Command/Status
>> >> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> >> field of the completion on the User RX Interface will not be 2=CRS and
>> >> the pcie_cscrs signal will not assert on the Command/Status interface.
>> >>
>> >> Per PCIe r3.1, sec 2.3.2, config requests that receive completions
>> >> with Configuration Request Retry Status (CRS) should be reissued by
>> >> the hardware except reads of the Vendor ID when CRS Software
>> >> Visibility is enabled.
>> >>
>> >> This hardware never reissues configuration requests when it receives
>> >> CRS completions.
>> >> Note that, neither PCIe host bridge nor PCIe core re-issues the
>> >> request for any configuration offset.
>> >>
>> >> For config reads, this hardware returns CFG_RETRY_STATUS data when
>> >> it receives a CRS completion for a config read, regardless of the
>> >> address of the read or the CRS Software Visibility Enable bit.
>> >>
>> >> This patch implements iproc_pcie_config_read which gets called for
>> >> Stingray, if it receives a CRS completion, it retries reading it again.
>> >> In case of timeout, it returns 0x.
>> >> For other iproc based SOC, it falls back to PCI generic APIs.
>> >>
>> >> Signed-off-by: Oza Pawandeep 
>> >>
>> >> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> >> index 61d9be6..37f4adf 100644
>> >> --- a/drivers/pci/host/pcie-iproc.c
>> >> +++ b/drivers/pci/host/pcie-iproc.c
>> >> @@ -68,6 +68,9 @@
>> >>  #define APB_ERR_EN_SHIFT 0
>> >>  #define APB_ERR_EN   BIT(APB_ERR_EN_SHIFT)
>> >>
>> >> +#define CFG_RETRY_STATUS 0x0001
>> >> +#define CFG_RETRY_STATUS_TIMEOUT_US  50 /* 500 milli-seconds. */
>> >> +
>> >>  /* derive the enum index of the outbound/inbound mapping registers */
>> >>  #define MAP_REG(base_reg, index)  ((base_reg) + (index) * 2)
>> >>
>> >> @@ -473,6 +476,64 @@ static void __iomem 
>> >> *iproc_pcie_map_ep_cfg_reg(struct iproc_pcie *pcie,
>> >>   return (pcie->base + offset);
>> >>  }
>> >>
>> >> +static unsigned int iproc_pcie_cfg_retry(void __iomem *c

Re: [PATCH v7 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-23 Thread Oza Oza
On Wed, Aug 23, 2017 at 2:03 AM, Bjorn Helgaas  wrote:
> [+cc Sinan, Timur, Alex]
>
> Hi Oza,
>
> On Mon, Aug 21, 2017 at 09:28:43PM +0530, Oza Pawandeep wrote:
>> PCIe spec r3.1, sec 2.3.2
>> If CRS software visibility is not enabled, the RC must reissue the
>> config request as a new request.
>>
>> - If CRS software visibility is enabled,
>> - for a config read of Vendor ID, the RC must return 0x0001 data
>> - for all other config reads/writes, the RC must reissue the
>>   request
>
> You mentioned early on that this fixes an issue with NVMe after a
> reset.  Is that reset an FLR?  I'm wondering if this overlaps with the
> work Sinan is doing:
>
>   
> http://lkml.kernel.org/r/20170818212310.15145.21732.st...@bhelgaas-glaptop.roam.corp.google.com
>
> With the current code in the tree, pci_flr_wait() probably waits only
> 100ms on your system.  It currently reads PCI_COMMAND and probably
> sees 0x0001 because the NVMe device returns a CRS completion
> (which this iproc RC incorrectly turns into 0x0001 data), so we
> exit the loop after a single msleep(100).
>
> Sinan is extending pci_flr_wait() to read the Vendor ID and look for
> the CRS SV indication.  If this is where you're seeing the NVMe issue,
> I bet we can fix this by simply changing iproc_pcie_config_read() to
> return the correct data when it sees a CRS completion.  Then the
> retry loop in pci_flr_wait() will work as intended.
>

The issue is with user-space poll mode drivers resulting in CRS.


root@bcm958742k:~# /usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe
traddr::01:00.0' -q 128 -s 4096 -w read -d 2048 -t 30 -c 0x1
Starting DPDK 16.11.1 initialization...
[ DPDK EAL parameters: perf -c 1 -m 2048 --file-prefix=spdk_pid2202 ]
EAL: Detected 8 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: cannot open /proc/self/numa_maps, consider that all memory is in
socket_id 0
Initializing NVMe Controllers
EAL: PCI device :01:00.0 on NUMA socket 0
EAL:   probe driver: 8086:953 spdk_nvme
EAL:   using IOMMU type 1 (Type 1)
[   45.811353] vfio_ecap_init: :01:00.0 hiding ecap 0x19@0x2a0
Attaching to NVMe Controller at :01:00.0 [8086:0953]
[   46.486601] vfio_bar_restore: :01:00.0 reset recovery - restoring bars
nvme_ctrlr.c:1237:nvme_ctrlr_process_init: ***ERROR*** Initialization
timed out in state 3
nvme_ctrlr.c: 414:nvme_ctrlr_shutdown: ***ERROR*** did not shutdown
within 5 seconds
EAL: Hotplug doesn't support vfio yet
spdk_nvme_probe() failed for transport address ':01:00.0'
/usr/share/spdk/examples/nvme/perf: errors occured

ok, so this is in the context of
[   52.500579] [] dump_backtrace+0x0/0x208
[   52.506147] [] show_stack+0x14/0x20
[   52.511357] [] dump_stack+0x8c/0xb0
[   52.516567] [] pci_flr_wait+0x18/0xc8
[   52.521955] [] pcie_flr+0x34/0x68
[   52.526986] [] __pci_dev_reset+0x100/0x2c0
[   52.532823] [] pci_try_reset_function+0x64/0x80
[   52.539108] [] vfio_pci_ioctl+0x398/0xb78
[   52.544857] [] vfio_device_fops_unl_ioctl+0x20/0x30
[   52.551501] [] do_vfs_ioctl+0xa4/0x738
[   52.556980] [] SyS_ioctl+0x8c/0xa0
[   52.562100] [] __sys_trace_return+0x0/0x4


So, I have two things to do here.

1) Remove the retry function and just do the following:
   data = readl(cfg_data_p);
   if (data == CFG_RETRY_STATUS)	/* the CRS magic value */
           data = 0xffffffff;

2) Refactor the code in a separate patch.

But for that, Sinan's patch is required;
then with both patches I think things will work out correctly for the
iproc SOC.

Let me know how this sounds; if you agree with the above two points, I
will post the final set of patches.

Also let me know if you want me to change anything in the commit description.
In my opinion I should remove the config-write description, which is irrelevant here.

Regards,
Oza.

> Sec 2.3.2 says the RC may limit the number of retries, and it doesn't
> say anything about how many retries the RC should do or what interval
> should be between them.  So I think it should be legal to say "this RC
> does zero retries".
>
> I think an RC that does zero retries and doesn't support CRS SV would
> always handle a CRS completion as a "failed transaction" so it would
> synthesize 0x data (and, I assume, set an error bit like
> Received Master Abort).  If we treated this controller as though it
> doesn't support CRS SV, the workaround in iproc_pcie_config_read()
> would be simply:
>
>   data = readl(cfg_data_p);
>   if (data == CFG_RETRY_STATUS)/* 0x0001 */
> data = 0x;
>
> That would make the loop in pci_flr_wait() that reads PCI_COMMAND work
> correctly.  It would get 0x data as long as the device
> returned CRS completions.
>
> It wouldn't make Sinan's new CRS SV code work, but that's OK: that all
> has to work correctly even on systems that don't support CRS SV.
>
> The only problem is that when we read a non-Vendor ID register that
> happens to really be 0x0001, we *should* return 0x0001, but we
> incorrectly return 0x.  But we already kn

Re: [PATCH v7 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-23 Thread Oza Oza
On Wed, Aug 23, 2017 at 7:21 PM, Bjorn Helgaas  wrote:
> On Wed, Aug 23, 2017 at 03:57:02PM +0530, Oza Oza wrote:
>> On Wed, Aug 23, 2017 at 2:03 AM, Bjorn Helgaas  wrote:
>> > [+cc Sinan, Timur, Alex]
>> >
>> > Hi Oza,
>> >
>> > On Mon, Aug 21, 2017 at 09:28:43PM +0530, Oza Pawandeep wrote:
>> >> PCIe spec r3.1, sec 2.3.2
>> >> If CRS software visibility is not enabled, the RC must reissue the
>> >> config request as a new request.
>> >>
>> >> - If CRS software visibility is enabled,
>> >> - for a config read of Vendor ID, the RC must return 0x0001 data
>> >> - for all other config reads/writes, the RC must reissue the
>> >>   request
>> >
>> > You mentioned early on that this fixes an issue with NVMe after a
>> > reset.  Is that reset an FLR?  I'm wondering if this overlaps with the
>> > work Sinan is doing:
>> >
>> >   
>> > http://lkml.kernel.org/r/20170818212310.15145.21732.st...@bhelgaas-glaptop.roam.corp.google.com
>> >
>> > With the current code in the tree, pci_flr_wait() probably waits only
>> > 100ms on your system.  It currently reads PCI_COMMAND and probably
>> > sees 0x0001 because the NVMe device returns a CRS completion
>> > (which this iproc RC incorrectly turns into 0x0001 data), so we
>> > exit the loop after a single msleep(100).
>> >
>> > Sinan is extending pci_flr_wait() to read the Vendor ID and look for
>> > the CRS SV indication.  If this is where you're seeing the NVMe issue,
>> > I bet we can fix this by simply changing iproc_pcie_config_read() to
>> > return the correct data when it sees a CRS completion.  Then the
>> > retry loop in pci_flr_wait() will work as intended.
>> >
>>
>> The issue with User Space poll mode drivers resulting into CRS.
>>
>>
>> root@bcm958742k:~# /usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe
>> traddr::01:00.0' -q 128 -s 4096 -w read -d 2048 -t 30 -c 0x1
>> Starting DPDK 16.11.1 initialization...
>> [ DPDK EAL parameters: perf -c 1 -m 2048 --file-prefix=spdk_pid2202 ]
>> EAL: Detected 8 lcore(s)
>> EAL: Probing VFIO support...
>> EAL: VFIO support initialized
>> EAL: cannot open /proc/self/numa_maps, consider that all memory is in
>> socket_id 0
>> Initializing NVMe Controllers
>> EAL: PCI device :01:00.0 on NUMA socket 0
>> EAL:   probe driver: 8086:953 spdk_nvme
>> EAL:   using IOMMU type 1 (Type 1)
>> [   45.811353] vfio_ecap_init: :01:00.0 hiding ecap 0x19@0x2a0
>> Attaching to NVMe Controller at :01:00.0 [8086:0953]
>> [   46.486601] vfio_bar_restore: :01:00.0 reset recovery - restoring bars
>> nvme_ctrlr.c:1237:nvme_ctrlr_process_init: ***ERROR*** Initialization
>> timed out in state 3
>> nvme_ctrlr.c: 414:nvme_ctrlr_shutdown: ***ERROR*** did not shutdown
>> within 5 seconds
>> EAL: Hotplug doesn't support vfio yet
>> spdk_nvme_probe() failed for transport address ':01:00.0'
>> /usr/share/spdk/examples/nvme/perf: errors occured
>>
>> ok, so this is in the context of
>> [   52.500579] [] dump_backtrace+0x0/0x208
>> [   52.506147] [] show_stack+0x14/0x20
>> [   52.511357] [] dump_stack+0x8c/0xb0
>> [   52.516567] [] pci_flr_wait+0x18/0xc8
>> [   52.521955] [] pcie_flr+0x34/0x68
>> [   52.526986] [] __pci_dev_reset+0x100/0x2c0
>> [   52.532823] [] pci_try_reset_function+0x64/0x80
>> [   52.539108] [] vfio_pci_ioctl+0x398/0xb78
>> [   52.544857] [] vfio_device_fops_unl_ioctl+0x20/0x30
>> [   52.551501] [] do_vfs_ioctl+0xa4/0x738
>> [   52.556980] [] SyS_ioctl+0x8c/0xa0
>> [   52.562100] [] __sys_trace_return+0x0/0x4
>>
>>
>> So, I have two things to do here.
>>
>> 1) remove retry funciton and just do following.
>>data = readl(cfg_data_p);
>>if (data == CFG_RETRY_STATUS)/* 0x0001 */
>>  data = 0x;
>>
>> 2) refactor the code with separate patch.
>>
>> but for that Sinan's patch is required.
>> then with both the patches I think, things will work out correctly for
>> iproc SOC.
>
> It looks like you are using the pci_flr_wait() path as I suspected.
>
> If you do the check as you mentioned in 1), this should work even
> without Sinan's patch.

It looks like this cannot work.
If I add the following in iproc_pcie_config_read():

 if (data == CFG_RETRY_STATUS)	/* the CRS magic value */
         data = 0xffffffff;

then pci_bus_read_dev_vendor_id returns false
/* some broken boards return 0 or ~0 if a 
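(The truncated comment above is from pci_bus_read_dev_vendor_id() in the
core; from memory, its check is roughly the sketch below, which is why a
synthesized all-ones read makes enumeration give up instead of retrying:

	if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
		return false;

	/* some broken boards return 0 or ~0 if a slot is empty: */
	if (*l == 0xffffffff || *l == 0x00000000 ||
	    *l == 0x0000ffff || *l == 0xffff0000)
		return false;

	/* ...followed by the CRS retry loop for reads returning 0xffff0001 */
)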

Re: [PATCH v7 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-23 Thread Oza Oza
On Wed, Aug 23, 2017 at 9:22 PM, Sinan Kaya  wrote:
> Hi Oza,
>
>> In working Enumuration case I get following:
>> [9.125976] pci :00:00.0: bridge configuration invalid ([bus
>> 00-00]), re-configuring
>> [9.134267] where=0x0 val=0x0001
>> [9.146946] where=0x0 val=0x0001
>> [9.158943] where=0x0 val=0x0001
>> [9.170945] where=0x0 val=0x0001
>> [9.186945] where=0x0 val=0x0001
>> [9.210944] where=0x0 val=0x0001
>> [9.250943] where=0x0 val=0x0001
>> [9.322942] where=0x0 val=0x0001
>> [9.458943] where=0x0 val=0x0001
>> [9.726942] where=0x0 val=0x9538086>> actual vendor and device id.
>>
>> so I think I have to retry in RC driver, so the old code still holds good.
>> except that I have to do factoring out
> You need to return 0x0001 for vendor ID register and return 0x for
> other registers like COMMAND register during the CRS period.
>

Hi Sinan,

I don't entirely disagree with your suggestion, but at the same time
I am not sure the RC driver should have a special case for the Vendor ID
and Device ID and another case for the rest of the configuration
accesses.
My thinking is that the RC driver should closely reflect the iproc PCI HW
behaviour (in a generic way, rather than handling the vendor/device ID
case separately).
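
For concreteness, the special-casing being suggested would look roughly
like the sketch below, where `where`, `size` and `data` are the usual
config-read arguments/result in iproc_pcie_config_read; this is an
illustration of the alternative, not what I am proposing:

	if (data == CFG_RETRY_STATUS) {
		if (where == PCI_VENDOR_ID && size == 4)
			data = 0xffff0001;	/* CRS SV: "device still retrying" */
		else
			data = 0xffffffff;	/* synthesize a failed completion */
	}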

Bjorn, please comment on how you see this being handled.

Regards,
Oza.

>>
>> please let me know If I missed anything, or you want me to try anything else.
>
> Sinan


Re: [PATCH v7 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-23 Thread Oza Oza
On Wed, Aug 23, 2017 at 11:30 PM, Bjorn Helgaas  wrote:
> On Wed, Aug 23, 2017 at 09:32:06PM +0530, Oza Oza wrote:
>> On Wed, Aug 23, 2017 at 9:22 PM, Sinan Kaya  wrote:
>> > Hi Oza,
>> >
>> >> In working Enumuration case I get following:
>> >> [9.125976] pci :00:00.0: bridge configuration invalid ([bus
>> >> 00-00]), re-configuring
>> >> [9.134267] where=0x0 val=0x0001
>> >> [9.146946] where=0x0 val=0x0001
>> >> [9.158943] where=0x0 val=0x0001
>> >> [9.170945] where=0x0 val=0x0001
>> >> [9.186945] where=0x0 val=0x0001
>> >> [9.210944] where=0x0 val=0x0001
>> >> [9.250943] where=0x0 val=0x0001
>> >> [9.322942] where=0x0 val=0x0001
>> >> [9.458943] where=0x0 val=0x0001
>> >> [9.726942] where=0x0 val=0x9538086>> actual vendor and device id.
>> >>
>> >> so I think I have to retry in RC driver, so the old code still holds good.
>> >> except that I have to do factoring out
>> > You need to return 0x0001 for vendor ID register and return 0x 
>> > for
>> > other registers like COMMAND register during the CRS period.
>
> The proposal we're trying to implement is to handle this controller as
> an RP that does not support CRS SV.  Such an RP would never return
> 0x0001 for the vendor/device ID.  If it never got a successful
> completion, it would return 0x.
>
> So I think you do have to either retry (as in your v7 patch) or add a
> delay before we start enumeration.
>
> It looks like we waited somewhere between 320ms and 590ms before this
> device became ready.
>
> The PCI specs do require a delay after a reset (PCIe r3.1, sec 6.6.1)
> before software sends a config request.  I don't think there's
> anywhere in the PCI core where we delay before we first enumerate
> devices under a host bridge.  I'm not sure we'd *want* such a delay in
> the PCI core, because on many systems the BIOS has already initialized
> the PCI controller, and an additional delay would be unnecessary and
> would only slow down boot.
>
> But it might make sense to add a delay in the PCI controller driver --
> it knows when the reset occurs and probably knows more about the
> firmware environment.  I haven't tried to analyze all of sec 6.6.1,
> but this section:
>
>   Unless Readiness Notifications mechanisms are used (see Section
>   6.23), the Root Complex and/or system software must allow at least
>   1.0 s after a Conventional Reset of a device, before it may
>   determine that a device which fails to return a Successful
>   Completion status for a valid Configuration Request is a broken
>   device. This period is independent of how quickly Link training
>   completes.
>
>   Note: This delay is analogous to the Trhfa parameter specified for
>   PCI/PCI-X, and is intended to allow an adequate amount of time for
>   devices which require self initialization.
>
> makes it sound like a 1sec delay might be needed.  That sounds like an
> awful lot, but this device did take close to that amount of time.
>
> I don't care which way you go.  You've already implemented the delay
> in the v7 patch, and that's fine with me.
>

Thanks for your inputs, Bjorn.
I will post a v8 which will have:
the factored-out separate patch + the retry implementation from the v7 patch.

> Bjorn


Re: [PATCH v6 0/2] PCI: iproc: SOC specific fixes

2017-08-19 Thread Oza Oza
On Sat, Aug 19, 2017 at 10:40 PM, Bjorn Helgaas  wrote:
> On Fri, Aug 04, 2017 at 09:18:14PM +0530, Oza Pawandeep wrote:
>> PCI: iproc: Retry request when CRS returned from EP
>> Above patch adds support for CRS in PCI RC driver, otherwise if not
>> handled at lower level, the user space PMD (poll mode drivers) can
>> timeout.
>>
>> PCI: iproc: add device shutdown for PCI RC
>> This fixes the issue where certian PCI endpoints are not getting
>> detected on Stingray SOC after reboot.
>>
>> Changes Since v6:
>
> This current series is v6, so I guess you mean "since v5".
>
>> Bjorn's comments addressed.
>> Added reference to PCIe spec and iproc Controller spec in Changelog.
>>
>> Changes Since v5:
>> Ray's comments addressed.
>>
>> Changes Since v4:
>> Bjorn's comments addressed.
>>
>> Changes Since v3:
>> [re-send]
>>
>> Changes Since v2:
>> Fix compilation errors for pcie-iproc-platform.ko which was caught by
>> kbuild.
>>
>> Oza Pawandeep (2):
>>   PCI: iproc: Retry request when CRS returned from EP
>>   PCI: iproc: add device shutdown for PCI RC
>>
>>  drivers/pci/host/pcie-iproc-platform.c |   8 ++
>>  drivers/pci/host/pcie-iproc.c  | 139 
>> +
>>  drivers/pci/host/pcie-iproc.h  |   1 +
>>  3 files changed, 132 insertions(+), 16 deletions(-)
>
> What is this series based on?  It doesn't apply to my master branch,
> my next branch, or my pci/host-iproc branch.  It's always best to base
> patches on my master branch, unless they depend on something that's
> not there.

I am on the master branch.
It looks like the following iproc patches from Lorenzo Pieralisi have gone
in, which breaks the apply:

commit 64bcd00a7ef54d3d55d7eaa8b058e5125d215129
commit 527740765629142993966afbd7a836fc47fb30ee
commit 022adcfc4666a375185d14a43d1de1cdc58d8905

I think both the patch sets were running in parallel.
I will rebase the patches.

Regards,
Oza.


Re: [PATCH v6 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-19 Thread Oza Oza
On Sat, Aug 19, 2017 at 11:56 PM, Bjorn Helgaas  wrote:
> On Fri, Aug 04, 2017 at 09:18:15PM +0530, Oza Pawandeep wrote:
>> PCIe spec r3.1, sec 2.3.2
>> If CRS software visibility is not enabled, the RC must reissue the
>> config request as a new request.
>>
>> - If CRS software visibility is enabled,
>> - for a config read of Vendor ID, the RC must return 0x0001 data
>> - for all other config reads/writes, the RC must reissue the
>>   request
>>
>> iproc PCIe Controller spec:
>> 4.7.3.3. Retry Status On Configuration Cycle
>> Endpoints are allowed to generate retry status on configuration
>> cycles. In this case, the RC needs to re-issue the request. The IP
>> does not handle this because the number of configuration cycles needed
>> will probably be less than the total number of non-posted operations
>> needed.
>>
>> When a retry status is received on the User RX interface for a
>> configuration request that was sent on the User TX interface,
>> it will be indicated with a completion with the CMPL_STATUS field set
>> to 2=CRS, and the user will have to find the address and data values
>> and send a new transaction on the User TX interface.
>> When the internal configuration space returns a retry status during a
>> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> CRS status.
>> When the CRS Software Visibility Enable register in the Root Control
>> register is enabled, the IP will return the data value to 0x0001 for
>> the Vendor ID value and 0x  (all 1’s) for the rest of the data in
>> the request for reads of offset 0 that return with CRS status.  This
>> is true for both the User RX Interface and for the Command/Status
>> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> field of the completion on the User RX Interface will not be 2=CRS and
>> the pcie_cscrs signal will not assert on the Command/Status interface.
>>
>> Note that, neither PCIe host bridge nor PCIe core re-issues the
>> request for any configuration offset.
>>
>> As a result of the fact, PCIe RC driver (sw) should take care of
>> retrying.
>> This patch fixes the problem, and attempts to read config space again
>> in case of PCIe code forwarding the CRS back to CPU.
>
> I think "fixes the problem" is a little bit of an overstatement.  I
> think it covers up some cases of the problem, but if I understand
> correctly, we're still susceptible to data corruptions on both read
> and write:
>
>   Per PCIe r3.1, sec 2.3.2, config requests that receive completions
>   with Configuration Request Retry Status (CRS) should be reissued by
>   the hardware *except* reads of the Vendor ID when CRS Software
>   Visibility is enabled.
>
>   This hardware never reissues configuration requests when it receives
>   CRS completions.
>
>   For config writes, there is no way for this driver to learn about
>   CRS completions.  A write that receives a CRS completion will not be
>   retried and the data will be discarded.  This is a potential data
>   corruption.
>
>   For config reads, this hardware returns CFG_RETRY_STATUS data when
>   it receives a CRS completion for a config read, regardless of the
>   address of the read or the CRS Software Visibility Enable bit.  As a
>   partial workaround for this, we retry in software any read that
>   returns CFG_RETRY_STATUS.
>
>   CFG_RETRY_STATUS could be valid data for other config registers.  It
>   is impossible to read such registers correctly because we can't tell
>   whether the hardware (1) received a CRS completion and synthesized
>   data of CFG_RETRY_STATUS or (2) received a successful completion
>   with data of CFG_RETRY_STATUS.
>

this will be fixed in our SOC in next revision, we will have separate register
in host bridge space to tell us, whether it has to be retried.
so there will not be any data corruption.
it will be minor change (instead of checking for CFG_RETRY_STATUS, we
will directly check that register)
so that will be clean.
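
For illustration, a rough sketch of how the retry helper could then look
(IPROC_PCIE_CFG_RETRY and CFG_RETRY_CRS below are hypothetical placeholder
names for the new register, not actual definitions from the driver):

        /*
         * Hypothetical sketch only: IPROC_PCIE_CFG_RETRY and CFG_RETRY_CRS
         * are placeholders for the dedicated CRS status register expected
         * in the next SoC revision; they do not exist in the current driver.
         */
        static u32 iproc_pcie_cfg_retry(struct iproc_pcie *pcie,
                                        void __iomem *cfg_data_p)
        {
                int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
                u32 data = readl(cfg_data_p);

                /* Ask the host bridge whether the completion was really CRS. */
                while ((iproc_pcie_read_reg(pcie, IPROC_PCIE_CFG_RETRY) &
                        CFG_RETRY_CRS) && timeout--) {
                        udelay(1);
                        data = readl(cfg_data_p);
                }

                return data;
        }

With that, read data equal to CFG_RETRY_STATUS is no longer ambiguous.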

>   The only thing we can really do is return 0xffffffff data, which
>   indicates a config read failure in the first case but is a data
>   corruption in the second case.

Instead of returning PCIBIOS_DEVICE_NOT_FOUND, you want me to return
0xffffffff data, which I suppose is what you have also suggested in the
code at the end.
Is my understanding of this comment right?

And do you want me to add all the text you have written above regarding
config read and write handling to the commit description, instead of
"fixes the problem"?

It will be quite a long commit description :),
but that's okay.

>
> The write case is probably not that important because we have a
> similar situation on other systems.  On other systems, the root
> complex automatically reissues config writes with CRS completions, but
> it may limit the number of retries.  In that case, the failed write is
> probably logged as a Target Abort error, but nobody pays attention to
> that, so

Re: [PATCH v6 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-19 Thread Oza Oza
On Sat, Aug 19, 2017 at 11:56 PM, Bjorn Helgaas  wrote:
> On Fri, Aug 04, 2017 at 09:18:15PM +0530, Oza Pawandeep wrote:
>> PCIe spec r3.1, sec 2.3.2
>> If CRS software visibility is not enabled, the RC must reissue the
>> config request as a new request.
>>
>> - If CRS software visibility is enabled,
>> - for a config read of Vendor ID, the RC must return 0x0001 data
>> - for all other config reads/writes, the RC must reissue the
>>   request
>>
>> iproc PCIe Controller spec:
>> 4.7.3.3. Retry Status On Configuration Cycle
>> Endpoints are allowed to generate retry status on configuration
>> cycles. In this case, the RC needs to re-issue the request. The IP
>> does not handle this because the number of configuration cycles needed
>> will probably be less than the total number of non-posted operations
>> needed.
>>
>> When a retry status is received on the User RX interface for a
>> configuration request that was sent on the User TX interface,
>> it will be indicated with a completion with the CMPL_STATUS field set
>> to 2=CRS, and the user will have to find the address and data values
>> and send a new transaction on the User TX interface.
>> When the internal configuration space returns a retry status during a
>> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> CRS status.
>> When the CRS Software Visibility Enable register in the Root Control
>> register is enabled, the IP will return the data value to 0x0001 for
>> the Vendor ID value and 0xffff (all 1’s) for the rest of the data in
>> the request for reads of offset 0 that return with CRS status.  This
>> is true for both the User RX Interface and for the Command/Status
>> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> field of the completion on the User RX Interface will not be 2=CRS and
>> the pcie_cscrs signal will not assert on the Command/Status interface.
>>
>> Note that, neither PCIe host bridge nor PCIe core re-issues the
>> request for any configuration offset.
>>
>> As a result of the fact, PCIe RC driver (sw) should take care of
>> retrying.
>> This patch fixes the problem, and attempts to read config space again
>> in case of PCIe code forwarding the CRS back to CPU.
>
> I think "fixes the problem" is a little bit of an overstatement.  I
> think it covers up some cases of the problem, but if I understand
> correctly, we're still susceptible to data corruptions on both read
> and write:
>
>   Per PCIe r3.1, sec 2.3.2, config requests that receive completions
>   with Configuration Request Retry Status (CRS) should be reissued by
>   the hardware *except* reads of the Vendor ID when CRS Software
>   Visibility is enabled.
>
>   This hardware never reissues configuration requests when it receives
>   CRS completions.
>
>   For config writes, there is no way for this driver to learn about
>   CRS completions.  A write that receives a CRS completion will not be
>   retried and the data will be discarded.  This is a potential data
>   corruption.
>
>   For config reads, this hardware returns CFG_RETRY_STATUS data when
>   it receives a CRS completion for a config read, regardless of the
>   address of the read or the CRS Software Visibility Enable bit.  As a
>   partial workaround for this, we retry in software any read that
>   returns CFG_RETRY_STATUS.
>
>   CFG_RETRY_STATUS could be valid data for other config registers.  It
>   is impossible to read such registers correctly because we can't tell
>   whether the hardware (1) received a CRS completion and synthesized
>   data of CFG_RETRY_STATUS or (2) received a successful completion
>   with data of CFG_RETRY_STATUS.
>
>   The only thing we can really do is return 0xffffffff data, which
>   indicates a config read failure in the first case but is a data
>   corruption in the second case.
>
> The write case is probably not that important because we have a
> similar situation on other systems.  On other systems, the root
> complex automatically reissues config writes with CRS completions, but
> it may limit the number of retries.  In that case, the failed write is
> probably logged as a Target Abort error, but nobody pays attention to
> that, so this isn't really much worse.
>
>> It implements iproc_pcie_config_read which gets called for Stingray,
>> Otherwise it falls back to PCI generic APIs.
>>
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-by: Ray Jui 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index 0f39bd2..583cee0 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -68,6 +68,9 @@
>>  #define APB_ERR_EN_SHIFT 0
>>  #define APB_ERR_EN   BIT(APB_ERR_EN_SHIFT)
>>
>> +#define CFG_RETRY_STATUS 0x0001
>> +#define CFG_RETRY_STATUS_TIMEOUT_US  500000 /* 500 milli-seconds. */
>> +
>>  /* derive the enum index of the outbound/

Re: [PATCH v6 2/2] PCI: iproc: add device shutdown for PCI RC

2017-08-19 Thread Oza Oza
On Sun, Aug 20, 2017 at 2:34 AM, Bjorn Helgaas  wrote:
> On Fri, Aug 04, 2017 at 09:18:16PM +0530, Oza Pawandeep wrote:
>> PERST must be asserted around ~500ms before the reboot is applied.
>>
>> During soft reset (e.g., "reboot" from Linux) on some iproc based SOCs
>> LCPLL clock and PERST both goes off simultaneously.
>> This will cause certain Endpoints Intel NVMe not get detected, upon
>> next boot sequence.
>>
>> This is specifically happening with Intel P3700 400GB series.
>> Endpoint is expecting the clock for some amount of time after PERST is
>> asserted, which is not happening in Stingray (iproc based SOC).
>> This causes NVMe to behave in undefined way.
>>
>> On the contrary, Intel x86 boards will have ref clock running, so we
>> do not see this behavior there.
>>
>> Besides, PCI spec does not stipulate about such timings.
>> In-fact it does not tell us, whether PCIe device should consider
>> refclk to be available or not to be.
>>
>> It is completely up to vendor to design their EP the way they want
>> with respect to ref clock availability.
>
> I am unconvinced.  Designing an endpoint that relies on ref clock
> timing not guaranteed by the spec does not sound like a way to build
> reliable hardware.
>
> The PCIe Card Electromechanical spec, r2.0, sec 2.2.3 says the clock
> goes inactive after PERST# goes active, but doesn't specify how long.
> In the absence of a spec requirement, I don't see a reason to assume
> other systems preserve the ref clock after PERST#, so if the Intel
> device requires clocks for 500ms after PERST#, we should see problems
> on other systems.

The reason you do not see this (and most likely will not see it) on any
other system is that the ref clock there is supplied by an on-board
oscillator: when PERST# is asserted, the ref clock is still available.

But in our case we do not have an on-board clock generator; we rely on
the clock coming from the SOC, so if an SOC reset is issued, both PERST#
and the ref clock go off simultaneously.

Please also refer to the link below, which shows the clock staying
active after PERST# is asserted.
http://www.eetimes.com/document.asp?doc_id=1279299  [Figure 2:
Power-down waveforms]

However, I am not saying that this article has more to claim than the
PCIe spec. I think the PCIe Card Electromechanical spec leaves enough
margin for card manufacturers to design a card based on the assumption
that the ref clock could still be available after PERST# is asserted.

Most cards do not assume that, but we have seen at least one particular
card which behaves as per the link I have just pasted above.

>
> Sec 2.2 says that on power up, the power rails must be stable at least
> T(PVPERL) (100ms) and reference clocks must be stable at least
> T(PERST-CLK) (100us) before PERST# is deasserted.  I think it's more
> likely the problem is here.
>
> The current iproc_pcie_reset() looks like it sleeps 100ms *after* it
> deasserts PERST#.  Should that be *before* deasserting PERST#?
>

T(PERST-CLK) (100us) before PERST# is deasserted.
which we are already waiting for 250us

T(PVPERL) (100ms), the power rail is stable much before linux comes up.
so I still think we are meeting power up requirements.

>> 500ms is just based on the observation and taken as safe margin.
>> This patch adds platform shutdown where it should be
>> called in device_shutdown while reboot command is issued.
>> So in sequence first Endpoint Shutdown (e.g. nvme_shutdown)
>> followed by RC shutdown, which issues safe PERST assertion.
>>
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-by: Ray Jui 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/pci/host/pcie-iproc-platform.c 
>> b/drivers/pci/host/pcie-iproc-platform.c
>> index 90d2bdd..9512960 100644
>> --- a/drivers/pci/host/pcie-iproc-platform.c
>> +++ b/drivers/pci/host/pcie-iproc-platform.c
>> @@ -131,6 +131,13 @@ static int iproc_pcie_pltfm_remove(struct 
>> platform_device *pdev)
>>   return iproc_pcie_remove(pcie);
>>  }
>>
>> +static void iproc_pcie_pltfm_shutdown(struct platform_device *pdev)
>> +{
>> + struct iproc_pcie *pcie = platform_get_drvdata(pdev);
>> +
>> + iproc_pcie_shutdown(pcie);
>> +}
>> +
>>  static struct platform_driver iproc_pcie_pltfm_driver = {
>>   .driver = {
>>   .name = "iproc-pcie",
>> @@ -138,6 +145,7 @@ static int iproc_pcie_pltfm_remove(struct 
>> platform_device *pdev)
>>   },
>>   .probe = iproc_pcie_pltfm_probe,
>>   .remove = iproc_pcie_pltfm_remove,
>> + .shutdown = iproc_pcie_pltfm_shutdown,
>>  };
>>  module_platform_driver(iproc_pcie_pltfm_driver);
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index 583cee0..ee40651 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -627,7 +627,7 @@ static int iproc_pcie_config_write32(struct pci_bus 
>> *bus, unsigned int devfn,
>>   .write = iproc_pcie_config_write32,
>>  };
>>
>> -static void iproc_pcie_reset(struct ip

Re: [PATCH v6 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-19 Thread Oza Oza
On Sun, Aug 20, 2017 at 1:56 AM, Bjorn Helgaas  wrote:
> On Sun, Aug 20, 2017 at 01:02:09AM +0530, Oza Oza wrote:
>> On Sat, Aug 19, 2017 at 11:56 PM, Bjorn Helgaas  wrote:
>
>> > I think you should do something like this instead so you don't do the
>> > MMIO read any more times than necessary:
>> >
>> >   static u32 iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
>> >   {
>> > u32 data;
>> > int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>> >
>> > data = readl(cfg_data_p);
>> > while (data == CFG_RETRY_STATUS && timeout--) {
>> >   udelay(1);
>> >   data = readl(cfg_data_p);
>> > }
>> >
>> > if (data == CFG_RETRY_STATUS)
>> >   data = 0xffffffff;
>> > return data;
>> >   }
>> >
>> >   static int iproc_pcie_config_read(...)
>> >   {
>> > u32 data;
>> >
>> > ...
>> > data = iproc_pcie_cfg_retry(cfg_data_p);
>> > if (size <= 2)
>> >   *val = (data >> (8 * (where & 3))) & ((1 << (size * 8)) - 1);
>> >
>> > In the event of a timeout, this returns PCIBIOS_SUCCESSFUL and
>> > 0xffffffff data.  That's what most other platforms do, and most
>> > callers of the PCI config accessors check for that data instead of
>> > checking the return code to see whether the access was successful.
>> >
>> I see one problem with this.
>> we have Samsung NVMe which exposes 64-bit IO BAR.
>> 0x0001.
>>
>> and if you see __pci_read_base which does following in sequence.
>>
>> pci_read_config_dword(dev, pos, &l);
>> pci_write_config_dword(dev, pos, l | mask);
>> pci_read_config_dword(dev, pos, &sz);
>> pci_write_config_dword(dev, pos, l);
>>
>> returning 0xffffffff would not be correct in that case.
>> even if callers retry, they will end up in a loop if caller retries.
>>
>> hence I was returning 0x0001, it is upto the upper layer to treat
>> it as data or not.
>
> In your patch, I don't think the upper layer will ever see 0x0001.
> iproc_pcie_config_read() only updates *data if iproc_pcie_cfg_retry()
> returns PCIBIOS_SUCCESSFUL.  And iproc_pcie_cfg_retry() only returns
> PCIBIOS_SUCCESSFUL if it reads something other than 0x0001.
>
> Even if you *did* return 0x0001 to upper layers, I think we'd have
> a problem.  Here's an example scenario.  If we do a Function-Level
> Reset, pcie_flr() starts the reset, then calls pci_flr_wait() to wait
> until the device becomes ready again.  pci_flr_wait() thinks that any
> PCI_COMMAND value other than 0xffffffff means the device is ready.
>
> If the device returns CRS status and you return 0x0001 when
> pci_flr_wait() reads PCI_COMMAND, it thinks that means the device is
> ready, but it's not.
>
>> let me know if this sounds like a problem to you as well.
>>
>> so in my opinion returning 0xffffffff is not an option.
>>
>> > For example, pci_flr_wait() assumes that if a read of PCI_COMMAND
>> > returns ~0, it's because the device isn't ready yet, and we should
>> > wait and retry.

Yes, that will be a problem if I return 0x0001.
OK, let me revise the patch to return 0xffffffff for now.
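
For reference, a minimal sketch of the revised helper along the lines you
suggested, returning all-ones after the timeout:

        static u32 iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
        {
                int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
                u32 data;

                data = readl(cfg_data_p);
                while (data == CFG_RETRY_STATUS && timeout--) {
                        udelay(1);
                        data = readl(cfg_data_p);
                }

                /* Timed out: report all-ones, like any other failed config read. */
                if (data == CFG_RETRY_STATUS)
                        data = 0xffffffff;

                return data;
        }

The caller would then mask/shift for 1- and 2-byte accesses and return
PCIBIOS_SUCCESSFUL, so the upper layers see 0xffffffff data rather than an
error code.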

Regards,
Oza.


Re: [PATCH v6 2/2] PCI: iproc: add device shutdown for PCI RC

2017-08-20 Thread Oza Oza
On Mon, Aug 21, 2017 at 2:55 AM, Bjorn Helgaas  wrote:
> On Sun, Aug 20, 2017 at 09:06:51AM +0530, Oza Oza wrote:
>> On Sun, Aug 20, 2017 at 2:34 AM, Bjorn Helgaas  wrote:
>> > On Fri, Aug 04, 2017 at 09:18:16PM +0530, Oza Pawandeep wrote:
>> >> PERST must be asserted around ~500ms before the reboot is applied.
>> >>
>> >> During soft reset (e.g., "reboot" from Linux) on some iproc based SOCs
>> >> LCPLL clock and PERST both goes off simultaneously.
>> >> This will cause certain Endpoints Intel NVMe not get detected, upon
>> >> next boot sequence.
>> >>
>> >> This is specifically happening with Intel P3700 400GB series.
>> >> Endpoint is expecting the clock for some amount of time after PERST is
>> >> asserted, which is not happening in Stingray (iproc based SOC).
>> >> This causes NVMe to behave in undefined way.
>> >>
>> >> On the contrary, Intel x86 boards will have ref clock running, so we
>> >> do not see this behavior there.
>> >>
>> >> Besides, PCI spec does not stipulate about such timings.
>> >> In-fact it does not tell us, whether PCIe device should consider
>> >> refclk to be available or not to be.
>> >>
>> >> It is completely up to vendor to design their EP the way they want
>> >> with respect to ref clock availability.
>> >
>> > I am unconvinced.  Designing an endpoint that relies on ref clock
>> > timing not guaranteed by the spec does not sound like a way to build
>> > reliable hardware.
>> >
>> > The PCIe Card Electromechanical spec, r2.0, sec 2.2.3 says the clock
>> > goes inactive after PERST# goes active, but doesn't specify how long.
>> > In the absence of a spec requirement, I don't see a reason to assume
>> > other systems preserve the ref clock after PERST#, so if the Intel
>> > device requires clocks for 500ms after PERST#, we should see problems
>> > on other systems.
>>
>> The reason you do not see and most likely you will not see on any
>> other system is
>> because, the ref clock is supplied by on board oscillator.
>> when PERST# is asserted, the ref clock is still available.
>>
>> but in our case, we do not have on-board clock generator, rather we
>> rely on the clock
>> coming from SOC, so if SOC reset is issued, both PERST# and the ref
>> clock goes off simultaneously.
>
> OK.  I'm not a hardware person and will have to take your word for
> this.
>
>> please also refer to the link below which stipulate on clk being
>> active after PERST# being asserted.
>> http://www.eetimes.com/document.asp?doc_id=1279299  [Figure 2:
>> Power-down waveforms]
>
> This is just a copy of Figure 2-13 from the PCIe CEM spec I cited
> above.  It's better to cite the spec itself than an article based on
> the spec.
>
>> however I am not saying that, this article has more to claim than PCIe spec.
>> but, I think the PCIe Card Electromechanical spec leaves the margin
>> for card manufactures to design the card based on the assumption
>> that ref clock could be available after PERST#  is asserted.
>
> The only statement in the spec I'm aware of is that "Clock and JTAG go
> inactive after PERST# goes inactive."  Since there's no required time
> the clock must remain active, a robust card would not assume the clock
> is available at all after PERST.  500ms is a *huge* length of time;
> I'd be very surprised if Intel designed a card that required that.
>
> I don't feel like we really got to the root cause of this, but this
> patch only hurts the iproc reboot time, so I'm OK with it.  I was just
> hoping to avoid slowing down your reboot time.
>

I appreciate your concern and valuable inputs.

While writing this patch I shared the same concern.

After multiple discussions, this is the best we could do: reboots are
infrequent in data centers, and failing to detect the NVMe after a
reboot has a more severe business impact than delaying the reboot by
500ms.

I will rebase both the patches on top of Lorenzo's patches, and submit v7 today.

Thanks and Regards,
Oza.

>> most of the cards do not assume that, but at the least we have seen that,
>> once particular card which behaves as per the link which I have just
>> pasted above.
>>
>> >
>> > Sec 2.2 says that on power up, the power rails must be stable at least
>> > T(PVPERL) (100ms) and reference clocks must be stable at least
>> > T(PERST-CLK) (100us) before PERST# is deasserted.  I think it&

Re: [PATCH v7 2/3] PCI: Add support for PCI inbound windows resources

2017-05-31 Thread Oza Oza
On Wed, May 31, 2017 at 4:12 AM, Bjorn Helgaas  wrote:
> On Mon, May 22, 2017 at 11:39 AM, Oza Pawandeep  wrote:
>> This patch adds support for inbound memory window
>> for PCI RC drivers.
>>
>> It defines new function pci_create_root_bus2 which
>> takes inbound resources as an argument and fills in the
>> memory resource to PCI internal host bridge structure
>> as inbound_windows.
>>
>> Legacy RC driver could continue to use pci_create_root_bus,
>> but any RC driver who wants to reseve IOVAS for their
>> inbound memory holes, should use new API pci_create_root_bus2.
>>
>> Signed-off-by: Oza Pawandeep 
>> ...
>
>> +struct pci_bus *pci_create_root_bus2(struct device *parent, int bus,
>> +   struct pci_ops *ops, void *sysdata, struct list_head 
>> *resources,
>> +   struct list_head *in_res)
>> +{
>> +   return pci_create_root_bus_msi(parent, bus, ops, sysdata,
>> +  resources, in_res, NULL);
>> +}
>> +EXPORT_SYMBOL_GPL(pci_create_root_bus2);
>
> Based on your response to Lorenzo's "[RFC/RFT PATCH 03/18] PCI:
> Introduce pci_scan_root_bus_bridge()", I'm hoping you can avoid adding
> yet another variant of pci_create_root_bus().
>
> So I think I can wait for that to settle out and look for a v8?
>
> Bjorn

Sure Bjorn, please wait for v8.

But there is one more associated patch,
[PATCH v7 1/3] OF/PCI: Export inbound memory interface to PCI RC,
which aims to provide an interface for RC drivers to obtain their
inbound resources.
RC drivers already get their outbound resources from
of_pci_get_host_bridge_resources; this is a similar attempt for the
inbound dma-ranges.
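
Roughly, an RC driver would consume it along these lines (illustrative
sketch only; the function name and the window-programming step are
placeholders, not part of the posted patches):

        /*
         * Sketch of how a host bridge driver could consume the proposed
         * of_pci_get_dma_ranges() interface; rc_parse_inbound_ranges() is
         * a made-up name for illustration.
         */
        static int rc_parse_inbound_ranges(struct device *dev)
        {
                struct resource_entry *entry;
                LIST_HEAD(dma_resources);
                int ret;

                ret = of_pci_get_dma_ranges(dev->of_node, &dma_resources);
                if (ret)
                        return ret;

                resource_list_for_each_entry(entry, &dma_resources) {
                        /* entry->res: CPU address range, entry->offset: CPU - PCI */
                        dev_dbg(dev, "inbound range %pR, offset %llx\n",
                                entry->res, (unsigned long long)entry->offset);
                        /* program one inbound (IB) window per range here */
                }

                pci_free_resource_list(&dma_resources);
                return 0;
        }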

Thank you for looking into this.

Regards,
Oza.


Re: [PATCH v7 2/3] PCI: Add support for PCI inbound windows resources

2017-06-01 Thread Oza Oza
On Thu, Jun 1, 2017 at 10:38 PM, Bjorn Helgaas  wrote:
> On Wed, May 31, 2017 at 11:17 AM, Oza Oza  wrote:
>> On Wed, May 31, 2017 at 4:12 AM, Bjorn Helgaas  wrote:
>>> On Mon, May 22, 2017 at 11:39 AM, Oza Pawandeep  
>>> wrote:
>>>> This patch adds support for inbound memory window
>>>> for PCI RC drivers.
>>>>
>>>> It defines new function pci_create_root_bus2 which
>>>> takes inbound resources as an argument and fills in the
>>>> memory resource to PCI internal host bridge structure
>>>> as inbound_windows.
>>>>
>>>> Legacy RC driver could continue to use pci_create_root_bus,
>>>> but any RC driver who wants to reseve IOVAS for their
>>>> inbound memory holes, should use new API pci_create_root_bus2.
>>>>
>>>> Signed-off-by: Oza Pawandeep 
>>>> ...
>>>
>>>> +struct pci_bus *pci_create_root_bus2(struct device *parent, int bus,
>>>> +   struct pci_ops *ops, void *sysdata, struct list_head 
>>>> *resources,
>>>> +   struct list_head *in_res)
>>>> +{
>>>> +   return pci_create_root_bus_msi(parent, bus, ops, sysdata,
>>>> +  resources, in_res, NULL);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(pci_create_root_bus2);
>>>
>>> Based on your response to Lorenzo's "[RFC/RFT PATCH 03/18] PCI:
>>> Introduce pci_scan_root_bus_bridge()", I'm hoping you can avoid adding
>>> yet another variant of pci_create_root_bus().
>>>
>>> So I think I can wait for that to settle out and look for a v8?
>>>
>>> Bjorn
>>
>> Sure Bjorn, please wait for v8.
>>
>> But there is one more associated patch
>> [PATCH v7 1/3] OF/PCI: Export inbound memory interface to PCI RC
>> which basically aims to provide an interface to RC drivers for their
>> inbound resources.
>> RC driver already get their outbound resources from
>> of_pci_get_host_bridge_resources,
>> similar attempt for inbound dma-ranges.
>
> Not sure I understand.  Patch 1/3 adds of_pci_get_dma_ranges(), but
> none of the patches adds a caller, so I don't see the point of it yet.
>
> In general, if I'm expecting another revision of one patch in a
> series, I expect the next revision to include *all* the patches in the
> series.  I normally don't pick out and apply individual patches from
> the series.
>
> Bjorn

Yes, it does not get called by anybody yet, because it is supposed to be
called by RC drivers that want to reserve IOVAs. Not every PCI host bridge
driver will call it, but the iproc-based PCI driver certainly has to.

I will post PATCH v8, and with that I will include the PCI RC driver
patch that calls it as well.
Thanks for the review.

Regards,
Oza.


Re: [PATCH 2/2] PCI: iproc: add device shutdown for PCI RC

2017-06-01 Thread Oza Oza
On Thu, Jun 1, 2017 at 11:41 PM, Ray Jui  wrote:
> Hi Oza,
>
> On 5/31/17 10:27 PM, Oza Pawandeep wrote:
>> PERST# must be asserted around ~500ms before
>> the reboot is applied.
>>
>> During soft reset (e.g., "reboot" from Linux) on some iProc based SoCs
>> LCPLL clock and PERST both goes off simultaneously.
>> This will cause certain Endpoints Intel NVMe not get detected, upon
>> next boot sequence.
>>
>> This happens because; Endpoint is expecting the clock for some amount of
>> time after PERST is asserted, which is not happening in our case
>> (Compare to Intel X86 boards, will have clocks running).
>> this cause NVMe to behave in undefined way.
>>
>> Essentially clock will remain alive for 500ms with PERST# = 0
>> before reboot.
>>
>> This patch adds platform shutdown where it should be
>> called in device_shutdown while reboot command is issued.
>>
>> So in sequence first Endpoint Shutdown (e.g. nvme_shutdown)
>> followed by RC shutdown is called, which issues safe PERST
>> assertion.
>>
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-by: Ray Jui 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/pci/host/pcie-iproc-platform.c 
>> b/drivers/pci/host/pcie-iproc-platform.c
>> index 90d2bdd..9512960 100644
>> --- a/drivers/pci/host/pcie-iproc-platform.c
>> +++ b/drivers/pci/host/pcie-iproc-platform.c
>> @@ -131,6 +131,13 @@ static int iproc_pcie_pltfm_remove(struct 
>> platform_device *pdev)
>>   return iproc_pcie_remove(pcie);
>>  }
>>
>> +static void iproc_pcie_pltfm_shutdown(struct platform_device *pdev)
>> +{
>> + struct iproc_pcie *pcie = platform_get_drvdata(pdev);
>> +
>> + iproc_pcie_shutdown(pcie);
>> +}
>> +
>>  static struct platform_driver iproc_pcie_pltfm_driver = {
>>   .driver = {
>>   .name = "iproc-pcie",
>> @@ -138,6 +145,7 @@ static int iproc_pcie_pltfm_remove(struct 
>> platform_device *pdev)
>>   },
>>   .probe = iproc_pcie_pltfm_probe,
>>   .remove = iproc_pcie_pltfm_remove,
>> + .shutdown = iproc_pcie_pltfm_shutdown,
>>  };
>>  module_platform_driver(iproc_pcie_pltfm_driver);
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index 05a3647..e9afc63 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -608,31 +608,38 @@ static int iproc_pcie_config_write32(struct pci_bus 
>> *bus, unsigned int devfn,
>>   .write = iproc_pcie_config_write32,
>>  };
>>
>> -static void iproc_pcie_reset(struct iproc_pcie *pcie)
>> +static void iproc_pcie_perst_ctrl(struct iproc_pcie *pcie, bool assert)
>>  {
>>   u32 val;
>>
>>   /*
>> -  * PAXC and the internal emulated endpoint device downstream should not
>> -  * be reset.  If firmware has been loaded on the endpoint device at an
>> -  * earlier boot stage, reset here causes issues.
>> +  * The internal emulated endpoints (such as PAXC) device downstream
>> +  * should not be reset. If firmware has been loaded on the endpoint
>> +  * device at an earlier boot stage, reset here causes issues.
>>*/
>>   if (pcie->ep_is_internal)
>>   return;
>>
>> - /*
>> -  * Select perst_b signal as reset source. Put the device into reset,
>> -  * and then bring it out of reset
>> -  */
>> - val = iproc_pcie_read_reg(pcie, IPROC_PCIE_CLK_CTRL);
>> - val &= ~EP_PERST_SOURCE_SELECT & ~EP_MODE_SURVIVE_PERST &
>> - ~RC_PCIE_RST_OUTPUT;
>> - iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>> - udelay(250);
>> -
>> - val |= RC_PCIE_RST_OUTPUT;
>> - iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>> - msleep(100);
>> + if (assert) {
>> + val = iproc_pcie_read_reg(pcie, IPROC_PCIE_CLK_CTRL);
>> + val &= ~EP_PERST_SOURCE_SELECT & ~EP_MODE_SURVIVE_PERST &
>> + ~RC_PCIE_RST_OUTPUT;
>> + iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>> + udelay(250);
>> + } else {
>> + val = iproc_pcie_read_reg(pcie, IPROC_PCIE_CLK_CTRL);
>> + val |= RC_PCIE_RST_OUTPUT;
>> + iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>> + msleep(100);
>> + }
>> +}
>> +
>> +int iproc_pcie_shutdown(struct iproc_pcie *pcie)
>> +{
>> + iproc_pcie_perst_ctrl(pcie, true);
>> + msleep(500);
>> +
>> + return 0;
>>  }
>
> Please export the symbol here to fix the build issue detected by kbuild
> test when the iProc PCIe platform driver is compiled as a kernel module.
>

Will take care of this, Ray. Thanks for pointing it out.
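
i.e., something along these lines in pcie-iproc.c (a sketch based on the
hunk quoted above), so the platform glue can also be built as a module:

        int iproc_pcie_shutdown(struct iproc_pcie *pcie)
        {
                iproc_pcie_perst_ctrl(pcie, true);
                msleep(500);

                return 0;
        }
        EXPORT_SYMBOL_GPL(iproc_pcie_shutdown);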

Regards,
Oza.

>>
>>  static int iproc_pcie_check_link(struct iproc_pcie *pcie, struct pci_bus 
>> *bus)
>> @@ -1310,7 +1317,8 @@ int iproc_pcie_setup(struct iproc_pcie *pcie, struct 
>> list_head *res)
>>   goto err_exit_phy;
>>   }
>>
>> - iproc_pcie_reset(pcie);
>> + iproc_pcie_perst_ctrl(pcie, true);
>> + iproc_pcie_perst_ctrl(pcie, false);
>>
>>   if (pcie->need_

Re: [PATCH 1/3] of/pci/dma: fix DMA configuration for PCI masters

2017-05-15 Thread Oza Oza
On Sat, May 6, 2017 at 11:00 AM, Oza Oza  wrote:
> On Fri, May 5, 2017 at 8:55 PM, Robin Murphy  wrote:
>> On 04/05/17 19:41, Oza Oza wrote:
>> [...]
>>>>> 5) leaves scope of adding PCI flag handling for inbound memory
>>>>> by the new function.
>>>>
>>>> Which flags would ever actually matter? DMA windows aren't going to be
>>>> to config or I/O space, so the memory type can be assumed, and the
>>>> 32/64-bit distinction is irrelevant as it's not a relocatable BAR;
>>>> DMA-able system memory isn't going to be read-sensitive, so the
>>>> prefetchable flag shouldn't matter; and not being a BAR none of the
>>>> others would be relevant either.
>>>>
>>>
>>> Thanks Robin; for your reply and attention:
>>>
>>> agree with you, at present it would not matter,
>>> but it does not mean that we do not scope it to make it matter in future.
>>>
>>> now where it could matter:
>>> there is Relaxed Ordering for inbound memory for PCI.
>>> According to standard, Relaxed Ordering (RO) bit can be set only for
>>> Memory requests and completions (if present in the original request).
>>> Also, according to transaction ordering rules, I/O and configuration
>>> requests can still be re-ordered ahead of each other.
>>> and we would like to make use of it.
>>> for e.g. lets say we mark memory as Relaxed Ordered with flag.
>>> the special about this memory is incoming PCI transactions can be
>>> reordered and rest memory has to be strongly ordered.
>>
>> Please look at "PCI Bus Binding to: IEEE Std 1275-1994 Standard for Boot
>> (Initialization Configuration) Firmware" (as referenced in DTSpec) and
>> explain how PCIe Relaxed Order has anything to do with the DT binding.
>>
>>> how it our SOC would make use of this is out of scope for the
>>> discussion at this point of time, but I am just bringing in the
>>> idea/point how flags could be useful
>>> for inbound memory, since we would not like throw-away flags completely.
>>
>> The premise for implementing a PCI-specific parser is that you claim we
>> need to do something with the phys.hi cell of a DT PCI address, rather
>> than just taking the numerical part out of the phys.mid and phys.lo
>> cells. Please make that argument in reference to the flags which that
>> upper cell actually encodes, not unrelated things.
>>
>
> I think, the whole discussion around inbound flags is not what I
> intended to bring.
> this patch does nothing about inbound flag and never intends to solve
> anything regarding inbound flags.
> infact I would like to remove point 5 form the commit message. which
> should keep it out of discussion completely.
>
> let met tell what this patch is trying to address/solve 2 BUGs
> 1) fix wrong size return from of_dma_configure for PCI masters. (which
> is right now BUG)
>
> 2) handles multiple dma-ranges cleanly
>
> 3) It takes care of dma-ranges being optional.
>
> 4) following is the comment on function of_dma_get_range (which is also a BUG)
> "It returns -ENODEV if "dma-ranges" property was not found
>  * for this device in DT."
> which I think is wrong for PCI device, because if dma-ranges are
> absent then instead of returning  -ENODEV,
> it should return 0 with largest possible host memory.
>
> it solves all the above 4 problems.
>
>> [...]
>>>>> +int of_pci_get_dma_ranges(struct device_node *np, struct list_head 
>>>>> *resources)
>>>>> +{
>>>>> + struct device_node *node = of_node_get(np);
>>>>> + int rlen;
>>>>> + int ret = 0;
>>>>> + const int na = 3, ns = 2;
>>>>> + struct resource *res;
>>>>> + struct of_pci_range_parser parser;
>>>>> + struct of_pci_range range;
>>>>> +
>>>>> + if (!node)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + parser.node = node;
>>>>> + parser.pna = of_n_addr_cells(node);
>>>>> + parser.np = parser.pna + na + ns;
>>>>> +
>>>>> + parser.range = of_get_property(node, "dma-ranges", &rlen);
>>>>> +
>>>>> + if (!parser.range) {
>>>>> + pr_debug("pcie device has no dma-ranges defined for 
>>>>> node(%s)\n",
>>>>> +   np-&

Re: [PATCH v6 1/3] of/pci/dma: fix DMA configuration for PCI masters

2017-05-22 Thread Oza Oza
On Thu, May 18, 2017 at 12:43 AM, Arnd Bergmann  wrote:
> On Tue, May 16, 2017 at 7:22 AM, Oza Pawandeep  wrote:
>> current device framework and OF framework integration assumes
>> dma-ranges in a way where memory-mapped devices define their
>> dma-ranges. (child-bus-address, parent-bus-address, length).
>>
>> of_dma_configure is specifically written to take care of memory
>> mapped devices. but no implementation exists for pci to take
>> care of pcie based memory ranges.
>
> Hi Oza,
>
> I'm trying to make sense of this, but am still rather puzzled. I have
> no idea what the distinction between memory-mapped devices and
> pcie based devices is in your description, as PCIe is usually memory
> mapped, and Linux doesn't actually support other kinds of PCIe
> devices on most architectures.
>
>> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
>> world dma-ranges.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>>
>> this patch serves following:
>>
>> 1) exposes interface to the pci host driver for their
>> inbound memory ranges
>>
>> 2) provide an interface to callers such as of_dma_get_ranges.
>> so then the returned size get best possible (largest) dma_mask.
>> because PCI RC drivers do not call APIs such as
>> dma_set_coherent_mask() and hence rather it shows its addressing
>> capabilities based on dma-ranges.
>>
>> for e.g.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>> we should get dev->coherent_dma_mask=0x7f.
>
> do you mean the coherent_dma_mask of the PCI host bridge
> or an attached device here?
>
> If you require PCI devices to come up with an initial
> coherent_dma_mask other than 0xff, there are other
> problems involved. In particular, you will need to use
> swiotlb, which is not supported on arm32 at the moment,
> and the dma_set_mask()/dma_set_coherent_mask()
> functions need to be modified.
>
>> +   while (1) {
>> +   dma_ranges = of_get_property(node, "dma-ranges", &rlen);
>> +
>> +   /* Ignore empty ranges, they imply no translation required. 
>> */
>> +   if (dma_ranges && rlen > 0)
>> +   break;
>> +
>> +   /* no dma-ranges, they imply no translation required. */
>> +   if (!dma_ranges)
>> +   break;
>
> A missing parent dma-ranges property here should really indicate that there
> is no valid translation. If we have existing cases where this happens
> in DT files, we may treat it as allowing only 32-bit DMA (as we already
> do for having no dma-ranges at all), but treating it the same way
> as an empty dma-ranges property sounds wrong.
>
>  Arnd


Hi Arnd and Bjorn,

Can you please have a look at PATCH v7?
It addresses problem 2 alone.

Regards,
Oza


Re: [PATCH 2/3] iommu/pci: reserve iova for PCI masters

2017-05-22 Thread Oza Oza
On Fri, May 5, 2017 at 9:21 PM, Robin Murphy  wrote:
> On 04/05/17 19:52, Oza Oza wrote:
>> On Thu, May 4, 2017 at 11:50 PM, Robin Murphy  wrote:
>>> On 03/05/17 05:46, Oza Pawandeep wrote:
>>>> this patch reserves the iova for PCI masters.
>>>> ARM64 based SOCs may have scattered memory banks.
>>>> such as iproc based SOC has
>>>>
>>>> <0x 0x8000 0x0 0x8000>, /* 2G @ 2G */
>>>> <0x0008 0x8000 0x3 0x8000>, /* 14G @ 34G */
>>>> <0x0090 0x 0x4 0x>, /* 16G @ 576G */
>>>> <0x00a0 0x 0x4 0x>; /* 16G @ 640G */
>>>>
>>>> but incoming PCI transcation addressing capability is limited
>>>> by host bridge, for example if max incoming window capability
>>>> is 512 GB, then 0x0090 and 0x00a0 will fall beyond it.
>>>>
>>>> to address this problem, iommu has to avoid allocating iovas which
>>>> are reserved. which inturn does not allocate iova if it falls into hole.
>>>
>>> I don't necessarily disagree with doing this, as we could do with facing
>>> up to the issue of discontiguous DMA ranges in particular (I too have a
>>> platform with this problem), but I'm still not overly keen on pulling DT
>>> specifics into this layer. More than that, though, if we are going to do
>>> it, then we should do it for all devices with a restrictive
>>> "dma-ranges", not just PCI ones.
>>>
>>
>> How do you propose to do it ?
>>
>> my thinking is this:
>> iova_reserve_pci_windows is written specific for PCI, and I am adding there.
>>
>> ideally
>> struct pci_host_bridge should have new member:
>>
>> struct list_head inbound_windows; /* resource_entry */
>>
>> but somehow this resource have to be filled much before
>> iommu_dma_init_domain happens.
>> and use brdge resource directly in iova_reserve_pci_windows as it is
>> already doing it for outbound memory.
>>
>> this will detach the DT specifics from dma-immu layer.
>> let me know how this sounds.
>
> Please check the state of the code currently queued in Joerg's tree and
> in linux-next - iommu_dma_get_resv_regions() has room for
> device-agnostic stuff before the if (!dev_is_pci(dev)) check.
>
> Furthermore, with the probe-deferral changes we end up with a common
> dma_configure() routine to abstract the firmware-specifics of
> of_dma_configure() vs. acpi_dma_configure(), so it would make sense to
> give drivers etc. a similar interface for interrogating ranges. i.e.
> some common function that abstracts the difference between parsing DT
> dma-ranges vs. the ACPI _DMA object, either with a list-based get/put
> model or perhaps an iterator with a user-provided callback (so users
> could process in-place or create their own list as necessary). Unless of
> course we go all the way to making the ranges an inherent part of the
> device layer like some MIPS platforms currently do.
>
> Robin.
>

Hi Robin,

Your above comments have been addressed to the best of my understanding.
Kindly have a look at PATCH v7.
The whole patch set takes care of IOVA reservation for PCI inbound memory,
and there is now no dependency between the IOMMU and OF layers.

Regards,
Oza.


>>>> Bug: SOC-5216
>>>> Change-Id: Icbfc99a045d730be143fef427098c937b9d46353
>>>> Signed-off-by: Oza Pawandeep 
>>>> Reviewed-on: http://gerrit-ccxsw.broadcom.net/40760
>>>> Reviewed-by: vpx_checkpatch status 
>>>> Reviewed-by: CCXSW 
>>>> Tested-by: vpx_autobuild status 
>>>> Tested-by: vpx_smoketest status 
>>>> Tested-by: CCXSW 
>>>> Reviewed-by: Scott Branden 
>>>>
>>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>>> index 48d36ce..08764b0 100644
>>>> --- a/drivers/iommu/dma-iommu.c
>>>> +++ b/drivers/iommu/dma-iommu.c
>>>> @@ -27,6 +27,7 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include 
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> @@ -171,8 +172,12 @@ static void iova_reserve_pci_windows(struct pci_dev 
>>>> *dev,
>>>>   struct iova_domain *iovad)
>>>>  {
>>>>   struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
>>>> + struct device_node *np = bridge->dev.parent->of_node;
>>>>   struct resource_entry *window;
>>>>   unsigned 

Re: [PATCH v7 0/3] PCI/IOMMU: Reserve IOVAs for PCI inbound memory

2017-05-22 Thread Oza Oza
On Tue, May 23, 2017 at 12:48 AM, Alex Williamson
 wrote:
> On Mon, 22 May 2017 22:09:39 +0530
> Oza Pawandeep  wrote:
>
>> iproc based PCI RC and Stingray SOC has limitaiton of addressing only 512GB
>> memory at once.
>>
>> IOVA allocation honors device's coherent_dma_mask/dma_mask.
>> In PCI case, current code honors DMA mask set by EP, there is no
>> concept of PCI host bridge dma-mask,  should be there and hence
>> could truly reflect the limitation of PCI host bridge.
>>
>> However assuming Linux takes care of largest possible dma_mask, still the
>> limitation could exist, because of the way memory banks are implemented.
>>
>> for e.g. memory banks:
>> <0x 0x8000 0x0 0x8000>, /* 2G @ 2G */
>> <0x0008 0x8000 0x3 0x8000>, /* 14G @ 34G */
>> <0x0090 0x 0x4 0x>, /* 16G @ 576G */
>> <0x00a0 0x 0x4 0x>; /* 16G @ 640G */
>>
>> When run User space (SPDK) which internally uses vfio in order to access
>> PCI EndPoint directly.
>>
>> Vfio uses huge-pages which could come from 640G/0x00a0.
>> And the way vfio maps the hugepage is to have phys addr as iova,
>> and ends up calling VFIO_IOMMU_MAP_DMA ends up calling iommu_map,
>> inturn arm_lpae_map mapping iovas out of range.
>>
>> So the way kernel allocates IOVA (where it honours device dma_mask) and
>> the way userspace gets IOVA is different.
>>
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>; will not work.
>>
>> Instead we have to go for scattered dma-ranges leaving holes.
>> Hence, we have to reserve IOVA allocations for inbound memory.
>> The patch-set caters to only addressing IOVA allocation problem.
>
>
> The description here confuses me, with vfio the user owns the iova
> allocation problem.  Mappings are only identity mapped if the user
> chooses to do so.  The dma_mask of the device is set by the driver and
> only relevant to the DMA-API.  vfio is a meta-driver and doesn't know
> the dma_mask of any particular device, that's the user's job.  Is the
> net result of what's happening here for the vfio case simply to expose
> extra reserved regions in sysfs, which the user can then consume to
> craft a compatible iova?  Thanks,
>
> Alex

Hi Alex,

This is not a VFIO problem; I mentioned VFIO only to present the problem
statement as a whole (which includes both kernel space and user space).
The way the SPDK pipeline is set up, yes, the mappings are identity
mapped, and VFIO uses whatever IOVA user space passes down as is, which
is fine and expected.

But the problem is that the user space physical memory (hugepages) can
reside high enough in memory to be beyond the PCI RC's addressing
capability.

Again, this is not VFIO's problem, nor is it user space's. In fact,
neither has anything to do with the dma-mask either; my reference to
dma-mask was about the Linux IOMMU framework (not VFIO).
Regards,
Oza.
>
>>
>> Changes since v7:
>> - Robin's comment addressed
>> where he wanted to remove depedency between IOMMU and OF layer.
>> - Bjorn Helgaas's comments addressed.
>>
>> Changes since v6:
>> - Robin's comments addressed.
>>
>> Changes since v5:
>> Changes since v4:
>> Changes since v3:
>> Changes since v2:
>> - minor changes, redudant checkes removed
>> - removed internal review
>>
>> Changes since v1:
>> - address Rob's comments.
>> - Add a get_dma_ranges() function to of_bus struct..
>> - Convert existing contents of of_dma_get_range function to
>>   of_bus_default_dma_get_ranges and adding that to the
>>   default of_bus struct.
>> - Make of_dma_get_range call of_bus_match() and then bus->get_dma_ranges.
>>
>>
>> Oza Pawandeep (3):
>>   OF/PCI: expose inbound memory interface to PCI RC drivers.
>>   IOMMU/PCI: reserve IOVA for inbound memory for PCI masters
>>   PCI: add support for inbound windows resources
>>
>>  drivers/iommu/dma-iommu.c | 44 --
>>  drivers/of/of_pci.c   | 96 
>> +++
>>  drivers/pci/probe.c   | 30 +--
>>  include/linux/of_pci.h|  7 
>>  include/linux/pci.h   |  1 +
>>  5 files changed, 170 insertions(+), 8 deletions(-)
>>
>


Re: [PATCH v3 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-06-12 Thread Oza Oza
On Tue, Jun 13, 2017 at 5:00 AM, Bjorn Helgaas  wrote:
> Please wrap your changelogs to use 75 columns.  "git log" indents the
> changelog by four spaces, so if your text is 75 wide, it will still
> fit without wrapping.
>
> On Sun, Jun 11, 2017 at 09:35:37AM +0530, Oza Pawandeep wrote:
>> For Configuration Requests only, following reset
>> it is possible for a device to terminate the request
>> but indicate that it is temporarily unable to process
>> the Request, but will be able to process the Request
>> in the future – in this case, the Configuration Request
>> Retry Status 10 (CRS) Completion Status is used
>
> How does this relate to the CRS support we already have in the core,
> e.g., pci_bus_read_dev_vendor_id()?  It looks like your root complex
> already returns 0x0001 (CFG_RETRY_STATUS) in some cases.
>
> Also, per spec (PCIe r3.1, sec 2.3.2), CRS Software Visibility only
> affects config reads of the Vendor ID, but you call
> iproc_pcie_cfg_retry() for all config offsets.

Yes, as per Spec, CRS Software Visibility only affects config read of
the Vendor ID.
For config write or any other config read the Root must automatically
re-issue configuration
request again as a new request, and our PCIe RC fails to do so.

>
>> SPDK user space NVMe driver reinitializes NVMe which
>> causes reset, while doing this some configuration requests
>> get NAKed by Endpoint (NVMe).
>
> What's SPDK?  I don't know what "NAK" means in a PCIe context.  If you
> can use the appropriate PCIe terminology, I think it will make more
> sense to me.

By NAK I meant CRS; I will change the description and take care to use
the appropriate PCIe terminology.

>
>> Current iproc PCI driver is agnostic about it.
>> PAXB will forward the NAKed response in stipulated AXI code.
>
> In general a native host bridge driver should not have to deal with
> the CRS feature because it's supported in the PCI core.  So we need
> some explanation about why iProc is special in this regard.
>

For a config write or any other config read, the Root must automatically
re-issue the configuration request as a new request. The iproc-based PCIe
RC does not adhere to this, and neither does our internal PCI-to-AXI
bridge, which instead returns the code 0x0001 to the CPU.

>> NVMe spec defines this timeout in 500 ms units, and this
>> only happens if controller has been in reset, or with new firmware,
>> or in abrupt shutdown case.
>> Meanwhile config access could result into retry.
>
> I don't understand why NVMe is relevant here.  Is there something
> special about NVMe and CRS?
>

You are right, the NVMe spec is irrelevant here, but since the whole
exercise was carried out with NVMe and our major use cases are NVMe,
I ended up mentioning it. I can remove that from the description.

>> This patch fixes the problem, and attempts to read again in case
>> of PAXB forwarding the NAK.
>>
>> It implements iproc_pcie_config_read which gets called for Stingray.
>> Otherwise it falls back to PCI generic APIs.
>>
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-by: Ray Jui 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index 0f39bd2..05a3647 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -68,6 +68,9 @@
>>  #define APB_ERR_EN_SHIFT 0
>>  #define APB_ERR_EN   BIT(APB_ERR_EN_SHIFT)
>>
>> +#define CFG_RETRY_STATUS 0x0001
>> +#define CFG_RETRY_STATUS_TIMEOUT_US  500000 /* 500 milli-seconds. */
>> +
>>  /* derive the enum index of the outbound/inbound mapping registers */
>>  #define MAP_REG(base_reg, index)  ((base_reg) + (index) * 2)
>>
>> @@ -448,6 +451,47 @@ static inline void iproc_pcie_apb_err_disable(struct 
>> pci_bus *bus,
>>   }
>>  }
>>
>> +static int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
>> +{
>> + int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>> + unsigned int ret;
>> +
>> + do {
>> + ret = readl(cfg_data_p);
>> + if (ret == CFG_RETRY_STATUS)
>> + udelay(1);
>> + else
>> + return PCIBIOS_SUCCESSFUL;
>> + } while (timeout--);
>> +
>> + return PCIBIOS_DEVICE_NOT_FOUND;
>> +}
>> +
>> +static void __iomem *iproc_pcie_map_ep_cfg_reg(struct iproc_pcie *pcie,
>> +unsigned int busno,
>> +unsigned int slot,
>> +unsigned int fn,
>> +int where)
>> +{
>> + u16 offset;
>> + u32 val;
>> +
>> + /* EP device access */
>> + val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
>> + (slot << CFG_ADDR_DEV_NUM_SHIFT) |
>> + (fn << CFG_ADDR_FUNC_NUM_SHIFT) |
>> + (where & CFG_ADDR_REG_NUM_MASK) |
>> + (1 & CFG_ADDR_CFG_TYPE_MASK);
>> +
>> + iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
>> + offset = 

Re: [PATCH 1/3] of/pci/dma: fix DMA configuration for PCI masters

2017-05-04 Thread Oza Oza
On Thu, May 4, 2017 at 11:32 PM, Robin Murphy  wrote:
> [apologies for the silence - I've been on holiday]
>
> On 03/05/17 05:46, Oza Pawandeep wrote:
>> current device framework and of framework integration assumes
>> dma-ranges in a way where memory-mapped devices define their
>> dma-ranges. (child-bus-address, parent-bus-address, length).
>
> Well, yes, that is simply the definition of dma-ranges, and remains true
> regardless of the particular format of either bus address.
>
>> of_dma_configure is specifically written to take care of memory
>> mapped devices. but no implementation exists for pci to take
>> care of pcie based memory ranges.
>
> That still doesn't make sense. To repeat myself again, PCI devices *ARE*
> memory-mapped devices. Yes, there do exist some platforms where I/O
> space is not treated as MMIO, but config space and memory space are very
> much memory-mapped however you look at them, and in the context of DMA,
> only memory space is relevant anyway.
>
> What *is* true about the current code is that of_dma_get_range() expects
> to be passed an OF node representing the device itself, and doesn't work
> properly when passed the node of the device's parent bus directly, which
> happens to be what pci_dma_configure() currently does. That's the only
> reason why it doesn't work for (single-entry) host controller dma-ranges
> today. This does not mean it's a PCI problem, it is simply the case that
> pci_dma_configure() is the only caller currently hitting it. Other
> discoverable, DMA-capable buses like fsl-mc are still going to face the
> exact same problem with or without this patch.
>
>> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
>> world dma-ranges.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>>
>> this patch serves following:
>>
>> 1) exposes interface to the pci host driver for their
>> inbound memory ranges
>>
>> 2) provide an interface to callers such as of_dma_get_ranges.
>> so then the returned size get best possible (largest) dma_mask.
>> because PCI RC drivers do not call APIs such as
>> dma_set_coherent_mask() and hence rather it shows its addressing
>> capabilities based on dma-ranges.
>> for e.g.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>> we should get dev->coherent_dma_mask=0x7f.
>>
>> 3) this patch handles multiple inbound windows and dma-ranges.
>> it is left to the caller, how it wants to use them.
>> the new function returns the resources in a standard and unform way
>>
>> 4) this way the callers of for e.g. of_dma_get_ranges
>> does not need to change.
>>
>> 5) leaves scope of adding PCI flag handling for inbound memory
>> by the new function.
>
> Which flags would ever actually matter? DMA windows aren't going to be
> to config or I/O space, so the memory type can be assumed, and the
> 32/64-bit distinction is irrelevant as it's not a relocatable BAR;
> DMA-able system memory isn't going to be read-sensitive, so the
> prefetchable flag shouldn't matter; and not being a BAR none of the
> others would be relevant either.
>

Thanks Robin, for your reply and attention.

I agree with you that at present it would not matter, but that does not
mean we should not leave scope for it to matter in the future.

Where it could matter: there is Relaxed Ordering for PCI inbound memory.
According to the standard, the Relaxed Ordering (RO) bit can be set only
for Memory requests and completions (if present in the original request).
Also, according to the transaction ordering rules, I/O and configuration
requests can still be re-ordered ahead of each other, and we would like
to make use of that. For example, say we mark a memory range as Relaxed
Ordered via a flag: what is special about this memory is that incoming
PCI transactions to it can be reordered, while the rest of memory has to
be strongly ordered.

How our SOC would make use of this is out of scope for the discussion at
this point, but I am bringing up the idea of how flags could be useful
for inbound memory, since we would not like to throw away the flags
completely.

>>
>> Bug: SOC-5216
>> Change-Id: Ie045386df91e1e0587846bb147ae40d96f6d7d2e
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-on: http://gerrit-ccxsw.broadcom.net/40428
>> Reviewed-by: vpx_checkpatch status 
>> Reviewed-by: CCXSW 
>> Reviewed-by: Ray Jui 
>> Tested-by: vpx_autobuild status 
>> Tested-by: vpx_smoketest status 
>> Tested-by: CCXSW 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c
>> index 0ee42c3..ed6e69a 100644
>> --- a/drivers/of/of_pci.c
>> +++ b/drivers/of/of_pci.c
>> @@ -283,6 +283,83 @@ int of_pci_get_host_bridge_resources(struct device_node 
>> *dev,
>>   return err;
>>  }
>>  EXPORT_SYMBOL_GPL(of_pci_get_host_bridge_resources);
>> +
>> +/**
>> + * of_pci_get_dma_ranges - Parse PCI host bridge inbound resources from DT
>> + * @np: device node of the host bridge having the dma-ranges property
>> + * @resources: list where the range of resources will be added after DT 
>> pars

Re: [PATCH 2/3] iommu/pci: reserve iova for PCI masters

2017-05-04 Thread Oza Oza
On Thu, May 4, 2017 at 11:50 PM, Robin Murphy  wrote:
> On 03/05/17 05:46, Oza Pawandeep wrote:
>> this patch reserves the iova for PCI masters.
>> ARM64 based SOCs may have scattered memory banks.
>> such as iproc based SOC has
>>
>> <0x 0x8000 0x0 0x8000>, /* 2G @ 2G */
>> <0x0008 0x8000 0x3 0x8000>, /* 14G @ 34G */
>> <0x0090 0x 0x4 0x>, /* 16G @ 576G */
>> <0x00a0 0x 0x4 0x>; /* 16G @ 640G */
>>
>> but incoming PCI transcation addressing capability is limited
>> by host bridge, for example if max incoming window capability
>> is 512 GB, then 0x0090 and 0x00a0 will fall beyond it.
>>
>> to address this problem, iommu has to avoid allocating iovas which
>> are reserved. which inturn does not allocate iova if it falls into hole.
>
> I don't necessarily disagree with doing this, as we could do with facing
> up to the issue of discontiguous DMA ranges in particular (I too have a
> platform with this problem), but I'm still not overly keen on pulling DT
> specifics into this layer. More than that, though, if we are going to do
> it, then we should do it for all devices with a restrictive
> "dma-ranges", not just PCI ones.
>

How do you propose to do it?

My thinking is this:
iova_reserve_pci_windows() is written specifically for PCI, and that is
where I am adding the reservation.

Ideally, struct pci_host_bridge should have a new member:

struct list_head inbound_windows; /* resource_entry */

but somehow this resource list has to be filled well before
iommu_dma_init_domain() happens, so that iova_reserve_pci_windows() can
use the bridge resources directly, as it already does for the outbound
memory.

This would detach the DT specifics from the dma-iommu layer.
Let me know how this sounds.
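
In other words, something along the lines of the sketch below (purely
illustrative; inbound_windows does not exist on struct pci_host_bridge
today, and it would have to be populated by the host driver before the
DMA domain is initialised):

        /*
         * Sketch only: assumes a new 'inbound_windows' resource list on
         * struct pci_host_bridge, filled by the host bridge driver.
         * Reserves the IOVA gaps *between* the (sorted) inbound windows.
         */
        static void iova_reserve_pci_inbound(struct pci_host_bridge *bridge,
                                             struct iova_domain *iovad)
        {
                struct resource_entry *window;
                dma_addr_t prev_end = 0;

                resource_list_for_each_entry(window, &bridge->inbound_windows) {
                        dma_addr_t start = window->res->start - window->offset;

                        if (start > prev_end)
                                reserve_iova(iovad, iova_pfn(iovad, prev_end),
                                             iova_pfn(iovad, start - 1));
                        prev_end = window->res->end - window->offset + 1;
                }
        }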


>> Bug: SOC-5216
>> Change-Id: Icbfc99a045d730be143fef427098c937b9d46353
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-on: http://gerrit-ccxsw.broadcom.net/40760
>> Reviewed-by: vpx_checkpatch status 
>> Reviewed-by: CCXSW 
>> Tested-by: vpx_autobuild status 
>> Tested-by: vpx_smoketest status 
>> Tested-by: CCXSW 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index 48d36ce..08764b0 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -27,6 +27,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -171,8 +172,12 @@ static void iova_reserve_pci_windows(struct pci_dev 
>> *dev,
>>   struct iova_domain *iovad)
>>  {
>>   struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
>> + struct device_node *np = bridge->dev.parent->of_node;
>>   struct resource_entry *window;
>>   unsigned long lo, hi;
>> + int ret;
>> + dma_addr_t tmp_dma_addr = 0, dma_addr;
>> + LIST_HEAD(res);
>>
>>   resource_list_for_each_entry(window, &bridge->windows) {
>>   if (resource_type(window->res) != IORESOURCE_MEM &&
>> @@ -183,6 +188,36 @@ static void iova_reserve_pci_windows(struct pci_dev 
>> *dev,
>>   hi = iova_pfn(iovad, window->res->end - window->offset);
>>   reserve_iova(iovad, lo, hi);
>>   }
>> +
>> + /* PCI inbound memory reservation. */
>> + ret = of_pci_get_dma_ranges(np, &res);
>> + if (!ret) {
>> + resource_list_for_each_entry(window, &res) {
>> + struct resource *res_dma = window->res;
>> +
>> + dma_addr = res_dma->start - window->offset;
>> + if (tmp_dma_addr > dma_addr) {
>> + pr_warn("PCI: failed to reserve iovas; ranges 
>> should be sorted\n");
>
> I don't see anything in the DT spec about the entries having to be
> sorted, and it's not exactly impossible to sort a list if you need it so
> (and if I'm being really pedantic, one could still trigger this with a
> list that *is* sorted, only by different criteria).
>

We would then have to sort it the way we need; I can add the sorting
(sketch below). Thanks for the suggestion.
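
A minimal sketch of the sorting, assuming the 2017-era list_sort()
prototype and the resource_entry list built by of_pci_get_dma_ranges():

#include <linux/list_sort.h>

static int dma_range_cmp(void *priv, struct list_head *a, struct list_head *b)
{
        struct resource_entry *ra = list_entry(a, struct resource_entry, node);
        struct resource_entry *rb = list_entry(b, struct resource_entry, node);
        dma_addr_t da = ra->res->start - ra->offset;
        dma_addr_t db = rb->res->start - rb->offset;

        if (da < db)
                return -1;
        return da > db ? 1 : 0;
}

        /* in iova_reserve_pci_windows(), before walking the list: */
        list_sort(NULL, &res, dma_range_cmp);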

> Robin.
>
>> + return;
>> + }
>> + if (tmp_dma_addr != dma_addr) {
>> + lo = iova_pfn(iovad, tmp_dma_addr);
>> + hi = iova_pfn(iovad, dma_addr - 1);
>> + reserve_iova(iovad, lo, hi);
>> + }
>> + tmp_dma_addr = window->res->end - window->offset;
>> + }
>> + /*
>> +  * the last dma-range should honour based on the
>> +  * 32/64-bit dma addresses.
>> +  */
>> + if (tmp_dma_addr < DMA_BIT_MASK(sizeof(dma_addr_t) * 8)) {
>> + lo = iova_pfn(iovad, tmp_dma_addr);
>> + hi = iova_pfn(iovad,
>> +   DMA_BIT_MASK(sizeof(dma_addr_t) * 8) - 
>> 1);
>> + reserve_iova(iovad, lo, hi);
>> + }
>> + }
>>  }
>>
>> 

Re: [PATCH 1/3] of/pci/dma: fix DMA configuration for PCI masters

2017-05-04 Thread Oza Oza
On Thu, May 4, 2017 at 11:32 PM, Robin Murphy  wrote:
> [apologies for the silence - I've been on holiday]
>
> On 03/05/17 05:46, Oza Pawandeep wrote:
>> current device framework and of framework integration assumes
>> dma-ranges in a way where memory-mapped devices define their
>> dma-ranges. (child-bus-address, parent-bus-address, length).
>
> Well, yes, that is simply the definition of dma-ranges, and remains true
> regardless of the particular format of either bus address.
>
>> of_dma_configure is specifically written to take care of memory
>> mapped devices. but no implementation exists for pci to take
>> care of pcie based memory ranges.
>
> That still doesn't make sense. To repeat myself again, PCI devices *ARE*
> memory-mapped devices. Yes, there do exist some platforms where I/O
> space is not treated as MMIO, but config space and memory space are very
> much memory-mapped however you look at them, and in the context of DMA,
> only memory space is relevant anyway.
>
> What *is* true about the current code is that of_dma_get_range() expects
> to be passed an OF node representing the device itself, and doesn't work
> properly when passed the node of the device's parent bus directly, which
> happens to be what pci_dma_configure() currently does. That's the only
> reason why it doesn't work for (single-entry) host controller dma-ranges
> today. This does not mean it's a PCI problem, it is simply the case that
> pci_dma_configure() is the only caller currently hitting it. Other
> discoverable, DMA-capable buses like fsl-mc are still going to face the
> exact same problem with or without this patch.
>

The new v2 hooks callbacks for the default and PCI buses (rough shape
sketched below), so the implementation will not really look cluttered;
for fsl-mc buses we could choose to implement it in the default bus
callbacks. I will post the patch set soon.

Also, with this patch set we really do not need to prepare an emulated
child node.
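
Roughly, the shape I have in mind (of_bus is the private bus table in
drivers/of/address.c; the .get_dma_ranges hook is the proposed
addition, and the existing callbacks are abbreviated):

static struct of_bus of_busses[] = {
        {
                .name = "pci",
                .match = of_bus_pci_match,
                /* proposed: PCI-specific dma-ranges parsing */
                .get_dma_ranges = of_bus_pci_get_dma_ranges,
                /* ... existing count_cells/map/translate callbacks ... */
        },
        {
                .name = "default",
                /* proposed: memory-mapped (default) dma-ranges parsing */
                .get_dma_ranges = of_bus_default_get_dma_ranges,
                /* ... */
        },
};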


>> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
>> world dma-ranges.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>>
>> this patch serves following:
>>
>> 1) exposes interface to the pci host driver for their
>> inbound memory ranges
>>
>> 2) provide an interface to callers such as of_dma_get_ranges.
>> so then the returned size get best possible (largest) dma_mask.
>> because PCI RC drivers do not call APIs such as
>> dma_set_coherent_mask() and hence rather it shows its addressing
>> capabilities based on dma-ranges.
>> for e.g.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>> we should get dev->coherent_dma_mask=0x7f.
>>
>> 3) this patch handles multiple inbound windows and dma-ranges.
>> it is left to the caller, how it wants to use them.
>> the new function returns the resources in a standard and unform way
>>
>> 4) this way the callers of for e.g. of_dma_get_ranges
>> does not need to change.
>>
>> 5) leaves scope of adding PCI flag handling for inbound memory
>> by the new function.
>
> Which flags would ever actually matter? DMA windows aren't going to be
> to config or I/O space, so the memory type can be assumed, and the
> 32/64-bit distinction is irrelevant as it's not a relocatable BAR;
> DMA-able system memory isn't going to be read-sensitive, so the
> prefetchable flag shouldn't matter; and not being a BAR none of the
> others would be relevant either.
>
>>
>> Bug: SOC-5216
>> Change-Id: Ie045386df91e1e0587846bb147ae40d96f6d7d2e
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-on: http://gerrit-ccxsw.broadcom.net/40428
>> Reviewed-by: vpx_checkpatch status 
>> Reviewed-by: CCXSW 
>> Reviewed-by: Ray Jui 
>> Tested-by: vpx_autobuild status 
>> Tested-by: vpx_smoketest status 
>> Tested-by: CCXSW 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c
>> index 0ee42c3..ed6e69a 100644
>> --- a/drivers/of/of_pci.c
>> +++ b/drivers/of/of_pci.c
>> @@ -283,6 +283,83 @@ int of_pci_get_host_bridge_resources(struct device_node 
>> *dev,
>>   return err;
>>  }
>>  EXPORT_SYMBOL_GPL(of_pci_get_host_bridge_resources);
>> +
>> +/**
>> + * of_pci_get_dma_ranges - Parse PCI host bridge inbound resources from DT
>> + * @np: device node of the host bridge having the dma-ranges property
>> + * @resources: list where the range of resources will be added after DT 
>> parsing
>> + *
>> + * It is the caller's job to free the @resources list.
>> + *
>> + * This function will parse the "dma-ranges" property of a
>> + * PCI host bridge device node and setup the resource mapping based
>> + * on its content.
>> + *
>> + * It returns zero if the range parsing has been successful or a standard 
>> error
>> + * value if it failed.
>> + */
>> +
>> +int of_pci_get_dma_ranges(struct device_node *np, struct list_head 
>> *resources)
>> +{
>> + struct device_node *node = of_node_get(np);
>> + int rlen;
>> + int ret = 0;
>> + const int na = 3, ns = 2;
>> + struct resource *res;
>> + struct of_pci_range_parser pa

Re: [PATCH 2/3] iommu/pci: reserve iova for PCI masters

2017-05-05 Thread Oza Oza
On Thu, May 4, 2017 at 11:50 PM, Robin Murphy  wrote:
> On 03/05/17 05:46, Oza Pawandeep wrote:
>> this patch reserves the iova for PCI masters.
>> ARM64 based SOCs may have scattered memory banks.
>> such as iproc based SOC has
>>
>> <0x 0x8000 0x0 0x8000>, /* 2G @ 2G */
>> <0x0008 0x8000 0x3 0x8000>, /* 14G @ 34G */
>> <0x0090 0x 0x4 0x>, /* 16G @ 576G */
>> <0x00a0 0x 0x4 0x>; /* 16G @ 640G */
>>
>> but incoming PCI transcation addressing capability is limited
>> by host bridge, for example if max incoming window capability
>> is 512 GB, then 0x0090 and 0x00a0 will fall beyond it.
>>
>> to address this problem, iommu has to avoid allocating iovas which
>> are reserved. which inturn does not allocate iova if it falls into hole.
>
> I don't necessarily disagree with doing this, as we could do with facing
> up to the issue of discontiguous DMA ranges in particular (I too have a
> platform with this problem), but I'm still not overly keen on pulling DT
> specifics into this layer. More than that, though, if we are going to do
> it, then we should do it for all devices with a restrictive
> "dma-ranges", not just PCI ones.
>

pci_create_root_bus allocates the host bridge, and currently it takes
only outbound resources.

If inbound memory is also added as part of the pci_create_root_bus
parameters, then IOVA allocation can directly make use of an
inbound_windows member of struct pci_host_bridge:

struct pci_host_bridge {
        struct device dev;
        struct pci_bus *bus;              /* root bus */
        struct list_head windows;         /* resource_entry */
        struct list_head inbound_windows; /* resource_entry */
        ...
};

Then iova_reserve_pci_windows can iterate through
resource_list_for_each_entry(window, &bridge->inbound_windows);
this way we can remove the dependency of dma-iommu.c on the OF layer.

The only catch is that pci_create_root_bus is called by a handful of
RC drivers, which would all need to change. Ideally, both inbound and
outbound resources should belong to pci_host_bridge anyway, and
inbound is completely missing.
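
For illustration only, this is what the proposed extension could look
like; today's pci_create_root_bus() takes a single (outbound) resource
list and the extra parameter below does not exist:

struct pci_bus *pci_create_root_bus(struct device *parent, int bus,
                                    struct pci_ops *ops, void *sysdata,
                                    struct list_head *resources,        /* outbound */
                                    struct list_head *inbound_windows); /* proposed */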

let me know your thoughts on this, Robin.

>> Bug: SOC-5216
>> Change-Id: Icbfc99a045d730be143fef427098c937b9d46353
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-on: http://gerrit-ccxsw.broadcom.net/40760
>> Reviewed-by: vpx_checkpatch status 
>> Reviewed-by: CCXSW 
>> Tested-by: vpx_autobuild status 
>> Tested-by: vpx_smoketest status 
>> Tested-by: CCXSW 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index 48d36ce..08764b0 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -27,6 +27,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -171,8 +172,12 @@ static void iova_reserve_pci_windows(struct pci_dev 
>> *dev,
>>   struct iova_domain *iovad)
>>  {
>>   struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
>> + struct device_node *np = bridge->dev.parent->of_node;
>>   struct resource_entry *window;
>>   unsigned long lo, hi;
>> + int ret;
>> + dma_addr_t tmp_dma_addr = 0, dma_addr;
>> + LIST_HEAD(res);
>>
>>   resource_list_for_each_entry(window, &bridge->windows) {
>>   if (resource_type(window->res) != IORESOURCE_MEM &&
>> @@ -183,6 +188,36 @@ static void iova_reserve_pci_windows(struct pci_dev 
>> *dev,
>>   hi = iova_pfn(iovad, window->res->end - window->offset);
>>   reserve_iova(iovad, lo, hi);
>>   }
>> +
>> + /* PCI inbound memory reservation. */
>> + ret = of_pci_get_dma_ranges(np, &res);
>> + if (!ret) {
>> + resource_list_for_each_entry(window, &res) {
>> + struct resource *res_dma = window->res;
>> +
>> + dma_addr = res_dma->start - window->offset;
>> + if (tmp_dma_addr > dma_addr) {
>> + pr_warn("PCI: failed to reserve iovas; ranges 
>> should be sorted\n");
>
> I don't see anything in the DT spec about the entries having to be
> sorted, and it's not exactly impossible to sort a list if you need it so
> (and if I'm being really pedantic, one could still trigger this with a
> list that *is* sorted, only by different criteria).
>
> Robin.
>
>> + return;
>> + }
>> + if (tmp_dma_addr != dma_addr) {
>> + lo = iova_pfn(iovad, tmp_dma_addr);
>> + hi = iova_pfn(iovad, dma_addr - 1);
>> + reserve_iova(iovad, lo, hi);
>> + }
>> + tmp_dma_addr = window->res->end - window->offset;
>> + }
>> + /*
>> +  * the last dma-range should honour based on the
>> +  * 32/64-bit dma addresses.
>> +  */
>> + if (tmp_dma_a

Re: [PATCH 1/3] of/pci/dma: fix DMA configuration for PCI masters

2017-05-05 Thread Oza Oza
On Fri, May 5, 2017 at 8:55 PM, Robin Murphy  wrote:
> On 04/05/17 19:41, Oza Oza wrote:
> [...]
>>>> 5) leaves scope of adding PCI flag handling for inbound memory
>>>> by the new function.
>>>
>>> Which flags would ever actually matter? DMA windows aren't going to be
>>> to config or I/O space, so the memory type can be assumed, and the
>>> 32/64-bit distinction is irrelevant as it's not a relocatable BAR;
>>> DMA-able system memory isn't going to be read-sensitive, so the
>>> prefetchable flag shouldn't matter; and not being a BAR none of the
>>> others would be relevant either.
>>>
>>
>> Thanks Robin; for your reply and attention:
>>
>> agree with you, at present it would not matter,
>> but it does not mean that we do not scope it to make it matter in future.
>>
>> now where it could matter:
>> there is Relaxed Ordering for inbound memory for PCI.
>> According to standard, Relaxed Ordering (RO) bit can be set only for
>> Memory requests and completions (if present in the original request).
>> Also, according to transaction ordering rules, I/O and configuration
>> requests can still be re-ordered ahead of each other.
>> and we would like to make use of it.
>> for e.g. lets say we mark memory as Relaxed Ordered with flag.
>> the special about this memory is incoming PCI transactions can be
>> reordered and rest memory has to be strongly ordered.
>
> Please look at "PCI Bus Binding to: IEEE Std 1275-1994 Standard for Boot
> (Initialization Configuration) Firmware" (as referenced in DTSpec) and
> explain how PCIe Relaxed Order has anything to do with the DT binding.
>
>> how it our SOC would make use of this is out of scope for the
>> discussion at this point of time, but I am just bringing in the
>> idea/point how flags could be useful
>> for inbound memory, since we would not like throw-away flags completely.
>
> The premise for implementing a PCI-specific parser is that you claim we
> need to do something with the phys.hi cell of a DT PCI address, rather
> than just taking the numerical part out of the phys.mid and phys.lo
> cells. Please make that argument in reference to the flags which that
> upper cell actually encodes, not unrelated things.
>

I think the whole discussion around inbound flags is not what I
intended to bring up. This patch does nothing about inbound flags and
never intends to solve anything regarding them. In fact, I would like
to remove point 5 from the commit message, which should keep it out of
the discussion completely.

Let me state what this patch is trying to address/solve:

1) It fixes the wrong size returned from of_dma_configure for PCI
masters (which is a bug right now).

2) It handles multiple dma-ranges cleanly.

3) It takes care of dma-ranges being optional.

4) The following is the comment on the function of_dma_get_range
(which is also a bug):
"It returns -ENODEV if "dma-ranges" property was not found
 * for this device in DT."
which I think is wrong for a PCI device: if dma-ranges is absent,
then instead of returning -ENODEV it should return 0 with the largest
possible host memory (sketched below).

It solves all of the above 4 problems.
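
As a sketch of point 4, assuming the of_pci_get_dma_ranges() helper
from this series and the out-parameters of of_dma_get_range():

        ret = of_pci_get_dma_ranges(np, &res);
        if (ret) {
                /*
                 * No "dma-ranges" on the PCI host bridge: treat it as
                 * "no inbound restriction" instead of failing, so the
                 * caller derives the widest possible DMA mask.
                 */
                *dma_addr = 0;
                *paddr = 0;
                *size = ~(u64)0;
                return 0;
        }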

> [...]
>>>> +int of_pci_get_dma_ranges(struct device_node *np, struct list_head 
>>>> *resources)
>>>> +{
>>>> + struct device_node *node = of_node_get(np);
>>>> + int rlen;
>>>> + int ret = 0;
>>>> + const int na = 3, ns = 2;
>>>> + struct resource *res;
>>>> + struct of_pci_range_parser parser;
>>>> + struct of_pci_range range;
>>>> +
>>>> + if (!node)
>>>> + return -EINVAL;
>>>> +
>>>> + parser.node = node;
>>>> + parser.pna = of_n_addr_cells(node);
>>>> + parser.np = parser.pna + na + ns;
>>>> +
>>>> + parser.range = of_get_property(node, "dma-ranges", &rlen);
>>>> +
>>>> + if (!parser.range) {
>>>> + pr_debug("pcie device has no dma-ranges defined for 
>>>> node(%s)\n",
>>>> +   np->full_name);
>>>> + ret = -EINVAL;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + parser.end = parser.range + rlen / sizeof(__be32);
>>>> +
>>>> + for_each_of_pci_range(&parser, &range) {
>>>
>>> This is plain wrong - of_pci_range_parser_one() will translate upwards
>>> through parent "rang

Re: [PATCH 2/3] iommu/pci: reserve iova for PCI masters

2017-05-05 Thread Oza Oza
On Fri, May 5, 2017 at 9:21 PM, Robin Murphy  wrote:
> On 04/05/17 19:52, Oza Oza wrote:
>> On Thu, May 4, 2017 at 11:50 PM, Robin Murphy  wrote:
>>> On 03/05/17 05:46, Oza Pawandeep wrote:
>>>> this patch reserves the iova for PCI masters.
>>>> ARM64 based SOCs may have scattered memory banks.
>>>> such as iproc based SOC has
>>>>
>>>> <0x 0x8000 0x0 0x8000>, /* 2G @ 2G */
>>>> <0x0008 0x8000 0x3 0x8000>, /* 14G @ 34G */
>>>> <0x0090 0x 0x4 0x>, /* 16G @ 576G */
>>>> <0x00a0 0x 0x4 0x>; /* 16G @ 640G */
>>>>
>>>> but incoming PCI transcation addressing capability is limited
>>>> by host bridge, for example if max incoming window capability
>>>> is 512 GB, then 0x0090 and 0x00a0 will fall beyond it.
>>>>
>>>> to address this problem, iommu has to avoid allocating iovas which
>>>> are reserved. which inturn does not allocate iova if it falls into hole.
>>>
>>> I don't necessarily disagree with doing this, as we could do with facing
>>> up to the issue of discontiguous DMA ranges in particular (I too have a
>>> platform with this problem), but I'm still not overly keen on pulling DT
>>> specifics into this layer. More than that, though, if we are going to do
>>> it, then we should do it for all devices with a restrictive
>>> "dma-ranges", not just PCI ones.
>>>
>>
>> How do you propose to do it ?
>>
>> my thinking is this:
>> iova_reserve_pci_windows is written specific for PCI, and I am adding there.
>>
>> ideally
>> struct pci_host_bridge should have new member:
>>
>> struct list_head inbound_windows; /* resource_entry */
>>
>> but somehow this resource have to be filled much before
>> iommu_dma_init_domain happens.
>> and use brdge resource directly in iova_reserve_pci_windows as it is
>> already doing it for outbound memory.
>>
>> this will detach the DT specifics from dma-immu layer.
>> let me know how this sounds.
>
> Please check the state of the code currently queued in Joerg's tree and
> in linux-next - iommu_dma_get_resv_regions() has room for
> device-agnostic stuff before the if (!dev_is_pci(dev)) check.
>
> Furthermore, with the probe-deferral changes we end up with a common
> dma_configure() routine to abstract the firmware-specifics of
> of_dma_configure() vs. acpi_dma_configure(), so it would make sense to
> give drivers etc. a similar interface for interrogating ranges. i.e.
> some common function that abstracts the difference between parsing DT
> dma-ranges vs. the ACPI _DMA object, either with a list-based get/put
> model or perhaps an iterator with a user-provided callback (so users
> could process in-place or create their own list as necessary). Unless of
> course we go all the way to making the ranges an inherent part of the
> device layer like some MIPS platforms currently do.
>
> Robin.
>

Are you suggesting to wait until iommu_dma_get_resv_regions() gets in?

Oza.


>>>> Bug: SOC-5216
>>>> Change-Id: Icbfc99a045d730be143fef427098c937b9d46353
>>>> Signed-off-by: Oza Pawandeep 
>>>> Reviewed-on: http://gerrit-ccxsw.broadcom.net/40760
>>>> Reviewed-by: vpx_checkpatch status 
>>>> Reviewed-by: CCXSW 
>>>> Tested-by: vpx_autobuild status 
>>>> Tested-by: vpx_smoketest status 
>>>> Tested-by: CCXSW 
>>>> Reviewed-by: Scott Branden 
>>>>
>>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>>> index 48d36ce..08764b0 100644
>>>> --- a/drivers/iommu/dma-iommu.c
>>>> +++ b/drivers/iommu/dma-iommu.c
>>>> @@ -27,6 +27,7 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include 
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> @@ -171,8 +172,12 @@ static void iova_reserve_pci_windows(struct pci_dev 
>>>> *dev,
>>>>   struct iova_domain *iovad)
>>>>  {
>>>>   struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
>>>> + struct device_node *np = bridge->dev.parent->of_node;
>>>>   struct resource_entry *window;
>>>>   unsigned long lo, hi;
>>>> + int ret;
>>>> + dma_addr_t tmp_dma_addr = 0, dma_addr;
>>>> + LIST_HEAD(res);
>>>>
>>>>  

Re: [PATCH v3 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-06-12 Thread Oza Oza
On Tue, Jun 13, 2017 at 9:58 AM, Oza Oza  wrote:
> On Tue, Jun 13, 2017 at 5:00 AM, Bjorn Helgaas  wrote:
>> Please wrap your changelogs to use 75 columns.  "git log" indents the
>> changelog by four spaces, so if your text is 75 wide, it will still
>> fit without wrapping.
>>
>> On Sun, Jun 11, 2017 at 09:35:37AM +0530, Oza Pawandeep wrote:
>>> For Configuration Requests only, following reset
>>> it is possible for a device to terminate the request
>>> but indicate that it is temporarily unable to process
>>> the Request, but will be able to process the Request
>>> in the future – in this case, the Configuration Request
>>> Retry Status 10 (CRS) Completion Status is used
>>
>> How does this relate to the CRS support we already have in the core,
>> e.g., pci_bus_read_dev_vendor_id()?  It looks like your root complex
>> already returns 0x0001 (CFG_RETRY_STATUS) in some cases.
>>
>> Also, per spec (PCIe r3.1, sec 2.3.2), CRS Software Visibility only
>> affects config reads of the Vendor ID, but you call
>> iproc_pcie_cfg_retry() for all config offsets.
>
> Yes, as per Spec, CRS Software Visibility only affects config read of
> the Vendor ID.
> For config write or any other config read the Root must automatically
> re-issue configuration
> request again as a new request, and our PCIe RC fails to do so.
>
>>
>>> SPDK user space NVMe driver reinitializes NVMe which
>>> causes reset, while doing this some configuration requests
>>> get NAKed by Endpoint (NVMe).
>>
>> What's SPDK?  I don't know what "NAK" means in a PCIe context.  If you
>> can use the appropriate PCIe terminology, I think it will make more
>> sense to me.

SPDK provides a user-space poll-mode driver and, along with DPDK,
interfaces with vfio to map PCIe resources directly into user space.
The reason I mentioned SPDK is that it exposes this bug in our PCIe RC.

>
> when I meant NAK, I meant CRS, will change the description, and will take
> care of using appropriate PCIe terminology.
>
>>
>>> Current iproc PCI driver is agnostic about it.
>>> PAXB will forward the NAKed response in stipulated AXI code.
>>
>> In general a native host bridge driver should not have to deal with
>> the CRS feature because it's supported in the PCI core.  So we need
>> some explanation about why iProc is special in this regard.
>>
>
> For config write or any other config read the Root must automatically
> re-issue configuration
> request again as a new request, iproc based PCIe RC does not adhere to
> this, and also
> our PCI-to-AXI bridge (internal), which returns code 0x0001 to CPU.
>
>>> NVMe spec defines this timeout in 500 ms units, and this
>>> only happens if controller has been in reset, or with new firmware,
>>> or in abrupt shutdown case.
>>> Meanwhile config access could result into retry.
>>
>> I don't understand why NVMe is relevant here.  Is there something
>> special about NVMe and CRS?
>>
>
> You are right, NVMe spec is irrelevant here, but since whole
> exercise was carried out with NVMe and our major use cases are NVMe,
> I ended up mentioning that. I can remove that from description.
>
>>> This patch fixes the problem, and attempts to read again in case
>>> of PAXB forwarding the NAK.
>>>
>>> It implements iproc_pcie_config_read which gets called for Stingray.
>>> Otherwise it falls back to PCI generic APIs.
>>>
>>> Signed-off-by: Oza Pawandeep 
>>> Reviewed-by: Ray Jui 
>>> Reviewed-by: Scott Branden 
>>>
>>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>>> index 0f39bd2..05a3647 100644
>>> --- a/drivers/pci/host/pcie-iproc.c
>>> +++ b/drivers/pci/host/pcie-iproc.c
>>> @@ -68,6 +68,9 @@
>>>  #define APB_ERR_EN_SHIFT 0
>>>  #define APB_ERR_EN   BIT(APB_ERR_EN_SHIFT)
>>>
>>> +#define CFG_RETRY_STATUS 0x0001
>>> +#define CFG_RETRY_STATUS_TIMEOUT_US  50 /* 500 milli-seconds. */
>>> +
>>>  /* derive the enum index of the outbound/inbound mapping registers */
>>>  #define MAP_REG(base_reg, index)  ((base_reg) + (index) * 2)
>>>
>>> @@ -448,6 +451,47 @@ static inline void iproc_pcie_apb_err_disable(struct 
>>> pci_bus *bus,
>>>   }
>>>  }
>>>
>>> +static int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
>>> +{
>>> + int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>>&

Re: [PATCH v3 2/2] PCI: iproc: add device shutdown for PCI RC

2017-06-13 Thread Oza Oza
On Tue, Jun 13, 2017 at 5:13 AM, Bjorn Helgaas  wrote:
> On Sun, Jun 11, 2017 at 09:35:38AM +0530, Oza Pawandeep wrote:
>> PERST# must be asserted around ~500ms before
>> the reboot is applied.
>>
>> During soft reset (e.g., "reboot" from Linux) on some iProc based SoCs
>> LCPLL clock and PERST both goes off simultaneously.
>> This will cause certain Endpoints Intel NVMe not get detected, upon
>> next boot sequence.
>>
>> This happens because; Endpoint is expecting the clock for some amount of
>> time after PERST is asserted, which is not happening in our case
>> (Compare to Intel X86 boards, will have clocks running).
>> this cause NVMe to behave in undefined way.
>
> This doesn't smell right.  The description makes this sound like a
> band-aid that happens to "solve" a problem with one particular device.
> It doesn't sound like this is making iProc adhere to something in the
> PCIe spec.
>
> Is there some PCIe spec timing requirement that tells you about this
> 500ms number?  Please include the reference if so.

On chip PLL which sources the reference clock to the device gets reset
when soft reset is applied; the soft reset also asserts PERST#.
This simultaneous assertion of reset and clock being lost is a problem
with some NVME cards.
The delay is added to keep the clock alive while PERST# is asserted.
The delay can be set to a value that allows the NVMe device to go
into its reset state; 500 ms is picked because it is long enough to
get the EP into reset correctly.
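
In other words, the shutdown path asserts PERST# and then simply keeps
the clock running for that long. A simplified sketch, reusing the
iproc_pcie_perst_ctrl() helper introduced by this patch (the driver's
actual implementation may differ in detail):

static void iproc_pcie_shutdown(struct iproc_pcie *pcie)
{
        /* put the endpoint into reset first ... */
        iproc_pcie_perst_ctrl(pcie, true);
        /*
         * ... then keep the reference clock running for ~500 ms so the
         * EP settles into reset before the SoC soft reset takes the
         * clock away as well.
         */
        msleep(500);
}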

>
>> Essentially clock will remain alive for 500ms with PERST# = 0
>> before reboot.
>
>> This patch adds platform shutdown where it should be
>> called in device_shutdown while reboot command is issued.
>>
>> So in sequence first Endpoint Shutdown (e.g. nvme_shutdown)
>> followed by RC shutdown is called, which issues safe PERST
>> assertion.
>>
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-by: Ray Jui 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/pci/host/pcie-iproc-platform.c 
>> b/drivers/pci/host/pcie-iproc-platform.c
>> index 90d2bdd..9512960 100644
>> --- a/drivers/pci/host/pcie-iproc-platform.c
>> +++ b/drivers/pci/host/pcie-iproc-platform.c
>> @@ -131,6 +131,13 @@ static int iproc_pcie_pltfm_remove(struct 
>> platform_device *pdev)
>>   return iproc_pcie_remove(pcie);
>>  }
>>
>> +static void iproc_pcie_pltfm_shutdown(struct platform_device *pdev)
>> +{
>> + struct iproc_pcie *pcie = platform_get_drvdata(pdev);
>> +
>> + iproc_pcie_shutdown(pcie);
>> +}
>> +
>>  static struct platform_driver iproc_pcie_pltfm_driver = {
>>   .driver = {
>>   .name = "iproc-pcie",
>> @@ -138,6 +145,7 @@ static int iproc_pcie_pltfm_remove(struct 
>> platform_device *pdev)
>>   },
>>   .probe = iproc_pcie_pltfm_probe,
>>   .remove = iproc_pcie_pltfm_remove,
>> + .shutdown = iproc_pcie_pltfm_shutdown,
>>  };
>>  module_platform_driver(iproc_pcie_pltfm_driver);
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index 05a3647..bb61376 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -608,32 +608,40 @@ static int iproc_pcie_config_write32(struct pci_bus 
>> *bus, unsigned int devfn,
>>   .write = iproc_pcie_config_write32,
>>  };
>>
>> -static void iproc_pcie_reset(struct iproc_pcie *pcie)
>> +static void iproc_pcie_perst_ctrl(struct iproc_pcie *pcie, bool assert)
>>  {
>>   u32 val;
>>
>>   /*
>> -  * PAXC and the internal emulated endpoint device downstream should not
>> -  * be reset.  If firmware has been loaded on the endpoint device at an
>> -  * earlier boot stage, reset here causes issues.
>> +  * The internal emulated endpoints (such as PAXC) device downstream
>> +  * should not be reset. If firmware has been loaded on the endpoint
>> +  * device at an earlier boot stage, reset here causes issues.
>>*/
>>   if (pcie->ep_is_internal)
>>   return;
>>
>> - /*
>> -  * Select perst_b signal as reset source. Put the device into reset,
>> -  * and then bring it out of reset
>> -  */
>> - val = iproc_pcie_read_reg(pcie, IPROC_PCIE_CLK_CTRL);
>> - val &= ~EP_PERST_SOURCE_SELECT & ~EP_MODE_SURVIVE_PERST &
>> - ~RC_PCIE_RST_OUTPUT;
>> - iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>> - udelay(250);
>> -
>> - val |= RC_PCIE_RST_OUTPUT;
>> - iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>> - msleep(100);
>> + if (assert) {
>> + val = iproc_pcie_read_reg(pcie, IPROC_PCIE_CLK_CTRL);
>> + val &= ~EP_PERST_SOURCE_SELECT & ~EP_MODE_SURVIVE_PERST &
>> + ~RC_PCIE_RST_OUTPUT;
>> + iproc_pcie_write_reg(pcie, IPROC_PCIE_CLK_CTRL, val);
>> + udelay(250);
>> + } else {
>> + val = iproc_pcie_read_reg(pcie, IPROC_PCIE_CLK_CTRL);
>> + val |= RC_PCIE_RST_OUTPUT;
>> + iproc_pcie_

Re: [RFC PATH] of/pci/dma: fix DMA configruation for PCI masters

2017-05-02 Thread Oza Oza
On Mon, Apr 24, 2017 at 7:50 PM, Rob Herring  wrote:
> On Sat, Apr 22, 2017 at 3:08 AM, Oza Pawandeep  wrote:
>> current device frmework and of framework integration assumes dma-ranges
>> in a way where memory-mapped devices define their dma-ranges.
>> dma-ranges: (child-bus-address, parent-bus-address, length).
>>
>> but iproc based SOCs and other SOCs(suc as rcar) have PCI world dma-ranges.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>>
>> of_dma_configure is specifically witten to take care of memory mapped 
>> devices.
>> but no implementation exists for pci to take care of pcie based memory 
>> ranges.
>> in fact pci world doesnt seem to define standard dma-ranges
>>
>> this patch served following purposes
>>
>> 1) exposes intrface to the pci host driver for thir inbound memory ranges
>>
>> 2) provide an interface to callers such as of_dma_get_ranges.
>> so then the returned size get best possible (largest) dma_mask.
>> for e.g.
>> dma-ranges = <0x4300 0x00 0x00 0x00 0x00 0x80 0x00>;
>> we should get dev->coherent_dma_mask=0x7f.
>>
>> 3) this patch handles multiple inbound windows and dma-ranges.
>> it is left to the caller, how it wants to use them.
>> the new function returns the resources in a standard and unform way
>>
>> 4) this way the callers of of_dma_get_ranges does not need to change.
>> and
>>
>> 5) leaves scope of adding PCI flag handling for inbound memory
>> by the new function.
>>
>> Signed-off-by: Oza Pawandeep 
>>
>> diff --git a/drivers/of/address.c b/drivers/of/address.c
>> index 02b2903..ec21191 100644
>> --- a/drivers/of/address.c
>> +++ b/drivers/of/address.c
>> @@ -6,6 +6,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -829,10 +830,30 @@ int of_dma_get_range(struct device_node *np, u64 
>> *dma_addr, u64 *paddr, u64 *siz
>> int len, naddr, nsize, pna;
>> int ret = 0;
>> u64 dmaaddr;
>> +   struct resource_entry *window;
>> +   LIST_HEAD(res);
>>
>> if (!node)
>> return -EINVAL;
>>
>> +   if (strcmp(np->name, "pci")) {
>
> Using the name is not reliable though I did recently add a dtc check
> for this. Of course, 'pcie' is valid too (and probably should be used
> for what you are testing). type is what you want to use here. We
> already have bus matching function and bus specific handlers in
> address.c. Whatever solution you come up with should be integrated
> with the existing bus specific handlers.
>
> Rob

Hi Rob,

I have addressed your comments.

I have now pushed a 3-patch series, which completely solves the problem for our SoC.

[PATCH 1/3] of/pci/dma: fix DMA configuration for PCI masters.

Regards,
Oza.


Re: [PATCH 1/3] of/pci/dma: fix DMA configuration for PCI masters

2017-05-02 Thread Oza Oza
I will send v2 after removing GERRIT details from
commit message. My apologies for the noise.

Regards,
Oza


Re: [PATCH 2/3] iommu/pci: reserve iova for PCI masters

2017-05-02 Thread Oza Oza
I will send v2 after removing GERRIT details from
commit message. My apologies for the noise.

Regards,
Oza


Re: [PATCH 3/3] PCI/of fix of_dma_get_range; get PCI specific dma-ranges

2017-05-02 Thread Oza Oza
I will send v2 after removing GERRIT details from
commit message. My apologies for the noise.

Regards,
Oza


Re: [PATCH v5 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-03 Thread Oza Oza
On Fri, Aug 4, 2017 at 12:27 AM, Bjorn Helgaas  wrote:
> On Thu, Aug 03, 2017 at 01:50:29PM +0530, Oza Oza wrote:
>> On Thu, Aug 3, 2017 at 2:34 AM, Bjorn Helgaas  wrote:
>> > On Thu, Jul 06, 2017 at 08:39:41AM +0530, Oza Pawandeep wrote:
>> >> For Configuration Requests only, following reset it is possible for a
>> >> device to terminate the request but indicate that it is temporarily unable
>> >> to process the Request, but will be able to process the Request in the
>> >> future – in this case, the Configuration Request Retry Status 100 (CRS)
>> >> Completion Status is used.
>> >
>> > Please include the spec reference for this.
>> >
>> > The "100" looks like it's supposed to be the Completion Status Field
>> > value for CRS from PCIe r3.1, sec 2.2.9, Table 2-34, but the value in
>> > the table is actually 0b010, not 0b100.  I don't think this level of
>> > detail is necessary unless your hardware exposes those values to the
>> > driver.
>> >
>>
>> I will remove the above description from the comment.
>>
>> >> As per PCI spec, CRS Software Visibility only affects config read of the
>> >> Vendor ID, for config write or any other config read the Root must
>> >> automatically re-issue configuration request again as a new request.
>> >> Iproc based PCIe RC (hw) does not retry request on its own.
>> >
>> > I think sec 2.3.2 is the relevant part of the spec.  It basically
>> > says that when an RC receives a CRS completion for a config request:
>> >
>> >   - If CRS software visibility is not enabled, the RC must reissue the
>> > config request as a new request.
>> >
>> >   - If CRS software visibility is enabled,
>> > - for a config read of Vendor ID, the RC must return 0x0001 data
>> > - for all other config reads/writes, the RC must reissue the
>> >   request
>> >
>>
>> yes, above is more relevant, and I will include it in the description.
>>
>> > Apparently iProc *never* reissues config requests, regardless of
>> > whether CRS software visibility is enabled?
>> >
>>
>> that is true.
>>
>> > But your CFG_RETRY_STATUS definition suggests that iProc always
>> > fabricates config read data of 0x0001 when it sees CRS status, no
>> > matter whether software visibility is enabled and no matter what
>> > config address we read?
>>
>> yes, that is precisely the case.
>
> Not according to the documentation you quoted below.
>
>> > What about CRS status for a config *write*?  There's nothing here to
>> > reissue those.
>>
>> No, we do not need there, because read will always be issued first
>> before any write.
>> so we do not need to implement write.
>
> How so?  As far as I know, there's nothing in the spec that requires
> the first config access to a device to be a read, and there are
> reasons why we might want to do a write first:
> http://lkml.kernel.org/r/5952d144.8060...@oracle.com
>

I understand your point here. My thinking was that during the
enumeration process a read (such as Vendor/Device ID) will always be
issued first. I will extend this implementation to writes.

>> > Is there a hardware erratum that describes the actual hardware
>> > behavior?
>>
>> this is documented in our PCIe core hw manual.
>> >>
>> 4.7.3.3. Retry Status On Configuration Cycle
>> Endpoints are allowed to generate retry status on configuration
>> cycles.  In this case, the RC needs to re-issue the request.  The IP
>> does not handle this because the number of configuration cycles needed
>> will probably be less
>> than the total number of non-posted operations needed.   When a retry status
>> is received on the User RX interface for a configuration request that
>> was sent on the User TX interface, it will be indicated with a
>> completion with the CMPL_STATUS field set to 2=CRS, and the user will
>> have to find the address and data values and send a new transaction on
>> the User TX interface.
>> When the internal configuration space returns a retry status during a
>> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> CRS status.
>> When the CRS Software Visibility Enable register in the Root Control
>> register is enabled, the IP will return the data value to 0x0001 for
>> the Vendor ID value and 0x  (all 1’s) for the rest of the data in
>

Re: [PATCH v5 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-03 Thread Oza Oza
On Fri, Aug 4, 2017 at 11:29 AM, Oza Oza  wrote:
> On Fri, Aug 4, 2017 at 12:27 AM, Bjorn Helgaas  wrote:
>> On Thu, Aug 03, 2017 at 01:50:29PM +0530, Oza Oza wrote:
>>> On Thu, Aug 3, 2017 at 2:34 AM, Bjorn Helgaas  wrote:
>>> > On Thu, Jul 06, 2017 at 08:39:41AM +0530, Oza Pawandeep wrote:
>>> >> For Configuration Requests only, following reset it is possible for a
>>> >> device to terminate the request but indicate that it is temporarily 
>>> >> unable
>>> >> to process the Request, but will be able to process the Request in the
>>> >> future – in this case, the Configuration Request Retry Status 100 (CRS)
>>> >> Completion Status is used.
>>> >
>>> > Please include the spec reference for this.
>>> >
>>> > The "100" looks like it's supposed to be the Completion Status Field
>>> > value for CRS from PCIe r3.1, sec 2.2.9, Table 2-34, but the value in
>>> > the table is actually 0b010, not 0b100.  I don't think this level of
>>> > detail is necessary unless your hardware exposes those values to the
>>> > driver.
>>> >
>>>
>>> I will remove the above description from the comment.
>>>
>>> >> As per PCI spec, CRS Software Visibility only affects config read of the
>>> >> Vendor ID, for config write or any other config read the Root must
>>> >> automatically re-issue configuration request again as a new request.
>>> >> Iproc based PCIe RC (hw) does not retry request on its own.
>>> >
>>> > I think sec 2.3.2 is the relevant part of the spec.  It basically
>>> > says that when an RC receives a CRS completion for a config request:
>>> >
>>> >   - If CRS software visibility is not enabled, the RC must reissue the
>>> > config request as a new request.
>>> >
>>> >   - If CRS software visibility is enabled,
>>> > - for a config read of Vendor ID, the RC must return 0x0001 data
>>> > - for all other config reads/writes, the RC must reissue the
>>> >   request
>>> >
>>>
>>> yes, above is more relevant, and I will include it in the description.
>>>
>>> > Apparently iProc *never* reissues config requests, regardless of
>>> > whether CRS software visibility is enabled?
>>> >
>>>
>>> that is true.
>>>
>>> > But your CFG_RETRY_STATUS definition suggests that iProc always
>>> > fabricates config read data of 0x0001 when it sees CRS status, no
>>> > matter whether software visibility is enabled and no matter what
>>> > config address we read?
>>>
>>> yes, that is precisely the case.
>>
>> Not according to the documentation you quoted below.
>>
>>> > What about CRS status for a config *write*?  There's nothing here to
>>> > reissue those.
>>>
>>> No, we do not need there, because read will always be issued first
>>> before any write.
>>> so we do not need to implement write.
>>
>> How so?  As far as I know, there's nothing in the spec that requires
>> the first config access to a device to be a read, and there are
>> reasons why we might want to do a write first:
>> http://lkml.kernel.org/r/5952d144.8060...@oracle.com
>>
>
> I understand your point here. my thinking was during enumeration
> process first read will always be issued
> such as vendor/device id.
> I will extend this implementation for write.

I am sorry, but I just realised that it is not possible to implement
the retry for writes. The reason is:

We access configuration space indirectly. For a config write:

A) write to the address register
B) write to the data register

Those two registers are implemented by the host bridge (not in the
PCIe core IP), so there is no way for software to know whether the
write has to be retried, e.g. I cannot read the data register (step B)
to check whether the write was successful. I have double-checked this
with the internal ASIC team here.
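
To illustrate the sequence (the CFG_IND_* register names here are
placeholders, not necessarily the real ones in the driver):

static void iproc_pcie_ind_cfg_write(struct iproc_pcie *pcie,
                                     u32 cfg_addr, u32 val)
{
        iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_IND_ADDR, cfg_addr); /* step A */
        iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_IND_DATA, val);      /* step B */
        /* nothing to read back that would indicate a CRS / retry needed */
}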

So, in conclusion, I have to do the following:

1) include the documentation above in the changelog
2) include sec 2.3.2 from the PCIe spec in the commit description

Please confirm whether you are fine with all the information; I will
then make the above two changes and re-post the patch.

Regards,
Oza.

>
>>> > Is there a hardware erratum that describes the actual hardware
>>> > behavior?
>>>
>>> this is documented in our PCIe core hw manual.
>>> >>
>>> 4.7.3.3. Retry Statu

Re: [PATCH v5 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-04 Thread Oza Oza
On Fri, Aug 4, 2017 at 7:07 PM, Bjorn Helgaas  wrote:
> On Fri, Aug 04, 2017 at 11:29:29AM +0530, Oza Oza wrote:
>> On Fri, Aug 4, 2017 at 12:27 AM, Bjorn Helgaas  wrote:
>> > On Thu, Aug 03, 2017 at 01:50:29PM +0530, Oza Oza wrote:
>> >> On Thu, Aug 3, 2017 at 2:34 AM, Bjorn Helgaas  wrote:
>> >> > On Thu, Jul 06, 2017 at 08:39:41AM +0530, Oza Pawandeep wrote:
>
>> >> this is documented in our PCIe core hw manual.
>> >> >>
>> >> 4.7.3.3. Retry Status On Configuration Cycle
>> >> Endpoints are allowed to generate retry status on configuration
>> >> cycles.  In this case, the RC needs to re-issue the request.  The IP
>> >> does not handle this because the number of configuration cycles needed
>> >> will probably be less
>> >> than the total number of non-posted operations needed.   When a retry 
>> >> status
>> >> is received on the User RX interface for a configuration request that
>> >> was sent on the User TX interface, it will be indicated with a
>> >> completion with the CMPL_STATUS field set to 2=CRS, and the user will
>> >> have to find the address and data values and send a new transaction on
>> >> the User TX interface.
>> >> When the internal configuration space returns a retry status during a
>> >> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> >> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> >> CRS status.
>> >> When the CRS Software Visibility Enable register in the Root Control
>> >> register is enabled, the IP will return the data value to 0x0001 for
>> >> the Vendor ID value and 0x  (all 1’s) for the rest of the data in
>> >> the request for reads of offset 0 that return with CRS status.  This
>> >> is true for both the User RX Interface and for the Command/Status
>> >> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> >> field of the completion on the User RX Interface will not be 2=CRS and
>> >> the pcie_cscrs signal will not assert on the Command/Status interface.
>> >> >>
>> >> Broadcom does not sell PCIe core IP, so above information is not
>> >> publicly available in terms of hardware erratum or any similar note.
>> >>
>> >>
>> >> >
>> >> >> As a result of the fact, PCIe RC driver (sw) should take care of CRS.
>> >> >> This patch fixes the problem, and attempts to read config space again 
>> >> >> in
>> >> >> case of PCIe code forwarding the CRS back to CPU.
>> >> >> It implements iproc_pcie_config_read which gets called for Stingray,
>> >> >> Otherwise it falls back to PCI generic APIs.
> ...
>
>> >> >> +static int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
>> >> >> +{
>> >> >> + int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>> >> >> + unsigned int ret;
>> >> >> +
>> >> >> + /*
>> >> >> +  * As per PCI spec, CRS Software Visibility only
>> >> >> +  * affects config read of the Vendor ID.
>> >> >> +  * For config write or any other config read the Root must
>> >> >> +  * automatically re-issue configuration request again as a
>> >> >> +  * new request. Iproc based PCIe RC (hw) does not retry
>> >> >> +  * request on its own, so handle it here.
>> >> >> +  */
>> >> >> + do {
>> >> >> + ret = readl(cfg_data_p);
>> >> >> + if (ret == CFG_RETRY_STATUS)
>> >> >> + udelay(1);
>> >> >> + else
>> >> >> + return PCIBIOS_SUCCESSFUL;
>> >> >> + } while (timeout--);
>> >> >
>> >> > Shouldn't this check *where* in config space we're reading?
>> >> >
>> >> No, we do not require, because with SPDK it was reading BAR space
>> >> which points to MSI-x table.
>> >> what I mean is, it could be anywhere.
>> >
>> > The documentation above says the IP returns data of 0x0001 for *reads
>> > of offset 0*.  Your current code reissues the read if *any* read
>> > anywhere returns data that happens to match CFG_RETRY_STATUS.  That
>> > may be a perfectly valid register value

Re: [PATCH v5 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-04 Thread Oza Oza
On Fri, Aug 4, 2017 at 6:57 PM, Bjorn Helgaas  wrote:
> On Fri, Aug 04, 2017 at 11:40:46AM +0530, Oza Oza wrote:
>> On Fri, Aug 4, 2017 at 11:29 AM, Oza Oza  wrote:
>> > On Fri, Aug 4, 2017 at 12:27 AM, Bjorn Helgaas  wrote:
>> >> On Thu, Aug 03, 2017 at 01:50:29PM +0530, Oza Oza wrote:
>> >>> On Thu, Aug 3, 2017 at 2:34 AM, Bjorn Helgaas  wrote:
>> >>> > On Thu, Jul 06, 2017 at 08:39:41AM +0530, Oza Pawandeep wrote:
> ...
>
>> >>> > What about CRS status for a config *write*?  There's nothing here to
>> >>> > reissue those.
>> >>>
>> >>> No, we do not need there, because read will always be issued first
>> >>> before any write.
>> >>> so we do not need to implement write.
>> >>
>> >> How so?  As far as I know, there's nothing in the spec that requires
>> >> the first config access to a device to be a read, and there are
>> >> reasons why we might want to do a write first:
>> >> http://lkml.kernel.org/r/5952d144.8060...@oracle.com
>> >>
>> >
>> > I understand your point here. my thinking was during enumeration
>> > process first read will always be issued
>> > such as vendor/device id.
>> > I will extend this implementation for write.
>>
>> I am sorry, but I just released that, it is not possible to implement
>> retry for write.
>> the reason is:
>>
>> we have indirect way of accessing configuration space access.
>> for e.g.
>> for config write:
>>
>> A) write to to addr register.
>> B) write to data register
>>
>> now above those 2 registers are implemented by host bridge (not in
>> PCIe core IP).
>> there is no way of knowing for software, if write has to be retried.
>>
>> e.g. I can not read data register (step B) to check if write was successful.
>> I have double checked this with internal ASIC team here.
>
> The bottom line is that you're saying this hardware cannot correctly
> support CRS.  Maybe the workaround you're proposing will work in many
> cases, but we need to acknowledge in the code and changelog that there
> are issues we might trip over.

Yes, this is precisely right.

1) I will add notes in the code as you are suggesting.
2) I will add documentation notes in the changelog.

But even going forward, we will still have one more separate register
in the host bridge dedicated to CRS, though only to a very limited
extent, because the CRS Software Visibility bit will not have any
effect (the HW is not going to consider it).

Regards,
Oza.


Re: [PATCH v5 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-04 Thread Oza Oza
On Fri, Aug 4, 2017 at 7:07 PM, Bjorn Helgaas  wrote:
> On Fri, Aug 04, 2017 at 11:29:29AM +0530, Oza Oza wrote:
>> On Fri, Aug 4, 2017 at 12:27 AM, Bjorn Helgaas  wrote:
>> > On Thu, Aug 03, 2017 at 01:50:29PM +0530, Oza Oza wrote:
>> >> On Thu, Aug 3, 2017 at 2:34 AM, Bjorn Helgaas  wrote:
>> >> > On Thu, Jul 06, 2017 at 08:39:41AM +0530, Oza Pawandeep wrote:
>
>> >> this is documented in our PCIe core hw manual.
>> >> >>
>> >> 4.7.3.3. Retry Status On Configuration Cycle
>> >> Endpoints are allowed to generate retry status on configuration
>> >> cycles.  In this case, the RC needs to re-issue the request.  The IP
>> >> does not handle this because the number of configuration cycles needed
>> >> will probably be less
>> >> than the total number of non-posted operations needed.   When a retry 
>> >> status
>> >> is received on the User RX interface for a configuration request that
>> >> was sent on the User TX interface, it will be indicated with a
>> >> completion with the CMPL_STATUS field set to 2=CRS, and the user will
>> >> have to find the address and data values and send a new transaction on
>> >> the User TX interface.
>> >> When the internal configuration space returns a retry status during a
>> >> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> >> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> >> CRS status.
>> >> When the CRS Software Visibility Enable register in the Root Control
>> >> register is enabled, the IP will return the data value to 0x0001 for
>> >> the Vendor ID value and 0x  (all 1’s) for the rest of the data in
>> >> the request for reads of offset 0 that return with CRS status.  This
>> >> is true for both the User RX Interface and for the Command/Status
>> >> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> >> field of the completion on the User RX Interface will not be 2=CRS and
>> >> the pcie_cscrs signal will not assert on the Command/Status interface.
>> >> >>
>> >> Broadcom does not sell PCIe core IP, so above information is not
>> >> publicly available in terms of hardware erratum or any similar note.
>> >>
>> >>
>> >> >
>> >> >> As a result of the fact, PCIe RC driver (sw) should take care of CRS.
>> >> >> This patch fixes the problem, and attempts to read config space again 
>> >> >> in
>> >> >> case of PCIe code forwarding the CRS back to CPU.
>> >> >> It implements iproc_pcie_config_read which gets called for Stingray,
>> >> >> Otherwise it falls back to PCI generic APIs.
> ...
>
>> >> >> +static int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
>> >> >> +{
>> >> >> + int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>> >> >> + unsigned int ret;
>> >> >> +
>> >> >> + /*
>> >> >> +  * As per PCI spec, CRS Software Visibility only
>> >> >> +  * affects config read of the Vendor ID.
>> >> >> +  * For config write or any other config read the Root must
>> >> >> +  * automatically re-issue configuration request again as a
>> >> >> +  * new request. Iproc based PCIe RC (hw) does not retry
>> >> >> +  * request on its own, so handle it here.
>> >> >> +  */
>> >> >> + do {
>> >> >> + ret = readl(cfg_data_p);
>> >> >> + if (ret == CFG_RETRY_STATUS)
>> >> >> + udelay(1);
>> >> >> + else
>> >> >> + return PCIBIOS_SUCCESSFUL;
>> >> >> + } while (timeout--);
>> >> >
>> >> > Shouldn't this check *where* in config space we're reading?
>> >> >
>> >> No, we do not require, because with SPDK it was reading BAR space
>> >> which points to MSI-x table.
>> >> what I mean is, it could be anywhere.
>> >
>> > The documentation above says the IP returns data of 0x0001 for *reads
>> > of offset 0*.  Your current code reissues the read if *any* read
>> > anywhere returns data that happens to match CFG_RETRY_STATUS.  That
>> > may be a perfectly valid register val

Re: [PATCH 2/2] PCI: iproc: Add optional brcm,pci-hotplug

2017-08-08 Thread Oza Oza
On Tue, Aug 8, 2017 at 7:50 PM, Rob Herring  wrote:
> Please send bindings to DT list.

Sure, will do that.

>
> On Mon, Aug 7, 2017 at 11:44 PM, Oza Pawandeep  wrote:
>> Add description for optional device tree property
>> 'brcm,pci-hotplug' for PCI hotplug feature.
>>
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-by: Ray Jui 
>>
>> diff --git a/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt 
>> b/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>> index b8e48b4..a3bad24 100644
>> --- a/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>> +++ b/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>> @@ -72,6 +72,29 @@ Optional properties:
>>  - brcm,pcie-msi-inten: Needs to be present for some older iProc platforms 
>> that
>>  require the interrupt enable registers to be set explicitly to enable MSI
>>
>> +Optional properties:
>> +- brcm,pci-hotplug: PCI hotplug feature is supported.
>
> I think we should make this a common property. We already have
> "ibm,slot-pluggable", so I'd propose "slot-pluggable".
>
> There's also "hotpluggable" for memory nodes defined, so we could
> reuse that here.
>

OK, I will rename this to
brcm,slot-pluggable

>> +
>> +If the brcm,pcie-hotplug property is present, the following properties 
>> become
>> +effective:
>> +
>> +- brcm,prsnt-gpio: Array of gpios, needs to be present if Hotplug is 
>> supported.
>
> prsnt-gpios

will take care.

>
>> +
>> +PCI hotplug implementation is SOC/Board specific, and also it depends on
>> +how add-in card is designed (e.g. how many present pins are implemented).
>> +
>> +If x8 card is connected, then it might be possible that all the
>> +3 present pins could go low, or at least one pin goes low.
>> +
>> +If x4 card is connected, then it might be possible that 2 present
>> +pins go low, or at least one pin goes low.
>> +
>> +Example:
>> +brcm,prsnt-gpio: <&pca9505 32 1>, <&pca9505 33 1>;
>> +This is x4 connector: monitoring max 2 present lines.
>> +brcm,prsnt-gpio: <&pca9505 32 1>, <&pca9505 33 1>, <&pca9505 34 1>;
>> +This is x8 connector: monitoring max 3 present lines.
>> +
>>  Example:
>> pcie0: pcie@18012000 {
>> compatible = "brcm,iproc-pcie";
>> --
>> 1.9.1
>>


Re: [PATCH 2/2] PCI: iproc: Add optional brcm,pci-hotplug

2017-08-08 Thread Oza Oza
On Wed, Aug 9, 2017 at 10:57 AM, Ray Jui  wrote:
>
>
> On 8/8/2017 10:22 PM, Oza Oza wrote:
>>
>> On Tue, Aug 8, 2017 at 7:50 PM, Rob Herring  wrote:
>>>
>>> Please send bindings to DT list.
>>
>> Sure, will do that.
>>
>>> On Mon, Aug 7, 2017 at 11:44 PM, Oza Pawandeep 
>>> wrote:
>>>>
>>>> Add description for optional device tree property
>>>> 'brcm,pci-hotplug' for PCI hotplug feature.
>>>>
>>>> Signed-off-by: Oza Pawandeep 
>>>> Reviewed-by: Ray Jui 
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>>>> b/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>>>> index b8e48b4..a3bad24 100644
>>>> --- a/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>>>> +++ b/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>>>> @@ -72,6 +72,29 @@ Optional properties:
>>>>   - brcm,pcie-msi-inten: Needs to be present for some older iProc
>>>> platforms that
>>>>   require the interrupt enable registers to be set explicitly to enable
>>>> MSI
>>>>
>>>> +Optional properties:
>>>> +- brcm,pci-hotplug: PCI hotplug feature is supported.
>>>
>>> I think we should make this a common property. We already have
>>> "ibm,slot-pluggable", so I'd propose "slot-pluggable".
>>>
>>> There's also "hotpluggable" for memory nodes defined, so we could
>>> reuse that here.
>>>
>> ok I will rename this to
>> brcm,slot-pluggable
>
>
> How's brcm,slot-pluggable a common property? It's still brcm specific.
> Didn't Rob propose either "slot-pluggable" or "hotpluggable"?
>
> And note it goes to the generic PCI binding instead iProc PCIe specific
> binding.
>

Initially I thought Rob suggested "slot-pluggable", followed by
"hotpluggable" since the memory node already has such a property.

But I am not sure to which generic PCI binding I should add it.
Should it be part of
Documentation/devicetree/bindings/pci/host-generic-pci.txt?

Can you please clarify, Rob?

Regards,
Oza.


Re: [PATCH 2/2] PCI: iproc: Add optional brcm,pci-hotplug

2017-08-08 Thread Oza Oza
On Wed, Aug 9, 2017 at 11:13 AM, Oza Oza  wrote:
> On Wed, Aug 9, 2017 at 10:57 AM, Ray Jui  wrote:
>>
>>
>> On 8/8/2017 10:22 PM, Oza Oza wrote:
>>>
>>> On Tue, Aug 8, 2017 at 7:50 PM, Rob Herring  wrote:
>>>>
>>>> Please send bindings to DT list.
>>>
>>> Sure, will do that.
>>>
>>>> On Mon, Aug 7, 2017 at 11:44 PM, Oza Pawandeep 
>>>> wrote:
>>>>>
>>>>> Add description for optional device tree property
>>>>> 'brcm,pci-hotplug' for PCI hotplug feature.
>>>>>
>>>>> Signed-off-by: Oza Pawandeep 
>>>>> Reviewed-by: Ray Jui 
>>>>>
>>>>> diff --git a/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>>>>> b/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>>>>> index b8e48b4..a3bad24 100644
>>>>> --- a/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>>>>> +++ b/Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt
>>>>> @@ -72,6 +72,29 @@ Optional properties:
>>>>>   - brcm,pcie-msi-inten: Needs to be present for some older iProc
>>>>> platforms that
>>>>>   require the interrupt enable registers to be set explicitly to enable
>>>>> MSI
>>>>>
>>>>> +Optional properties:
>>>>> +- brcm,pci-hotplug: PCI hotplug feature is supported.
>>>>
>>>> I think we should make this a common property. We already have
>>>> "ibm,slot-pluggable", so I'd propose "slot-pluggable".
>>>>
>>>> There's also "hotpluggable" for memory nodes defined, so we could
>>>> reuse that here.
>>>>
>>> ok I will rename this to
>>> brcm,slot-pluggable
>>
>>
>> How's brcm,slot-pluggable a common property? It's still brcm specific.
>> Didn't Rob propose either "slot-pluggable" or "hotpluggable"?
>>
>> And note it goes to the generic PCI binding instead iProc PCIe specific
>> binding.
>>
>
> Initially I thought Rob suggested "slot-pluggable", and then
> "hotpluggable", since the memory node already has such a property.
>
> But I am not sure which generic PCI binding document I should add it to.
> Should it be part of
> Documentation/devicetree/bindings/pci/host-generic-pci.txt?
>
> Can you please clarify, Rob?
>
> Regards,
> Oza.

To add, every SoC might implement hotplug differently, so I suppose both
binding documents have to be updated:

Documentation/devicetree/bindings/pci/host-generic-pci.txt
can carry the common boolean property named "hotpluggable",

while the SoC-specific part (e.g. brcm,prsnt-gpios) stays in
Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt.
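
For illustration only, a minimal sketch of how a node could then look,
assuming the generic boolean ends up being called "hotpluggable" and the
iProc-specific presence-detect GPIOs keep the proposed brcm,prsnt-gpios
name (the GPIO phandles and flags below are just placeholders):

	pcie0: pcie@18012000 {
		compatible = "brcm,iproc-pcie";
		reg = <0x18012000 0x1000>;

		/* generic PCI host binding: this slot supports hotplug */
		hotpluggable;

		/* iProc-specific: monitor up to 3 PRSNT# lines of an x8 connector */
		brcm,prsnt-gpios = <&gpio 10 GPIO_ACTIVE_LOW>,
				   <&gpio 11 GPIO_ACTIVE_LOW>,
				   <&gpio 12 GPIO_ACTIVE_LOW>;
	};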

Rob and Ray:
please let me know how this sounds.

Regards,
Oza.


Re: [PATCH v5 1/2] PCI: iproc: Retry request when CRS returned from EP

2017-08-03 Thread Oza Oza
On Thu, Aug 3, 2017 at 2:34 AM, Bjorn Helgaas  wrote:
> On Thu, Jul 06, 2017 at 08:39:41AM +0530, Oza Pawandeep wrote:
>> For Configuration Requests only, following reset it is possible for a
>> device to terminate the request but indicate that it is temporarily unable
>> to process the Request, but will be able to process the Request in the
>> future – in this case, the Configuration Request Retry Status 100 (CRS)
>> Completion Status is used.
>
> Please include the spec reference for this.
>
> The "100" looks like it's supposed to be the Completion Status Field
> value for CRS from PCIe r3.1, sec 2.2.9, Table 2-34, but the value in
> the table is actually 0b010, not 0b100.  I don't think this level of
> detail is necessary unless your hardware exposes those values to the
> driver.
>

I will remove the above description from the comment.

>> As per the PCI spec, CRS Software Visibility only affects config reads of
>> the Vendor ID; for a config write or any other config read, the Root must
>> automatically re-issue the configuration request as a new request.
>> The iProc-based PCIe RC (hw) does not retry the request on its own.
>
> I think sec 2.3.2 is the relevant part of the spec.  It basically
> says that when an RC receives a CRS completion for a config request:
>
>   - If CRS software visibility is not enabled, the RC must reissue the
> config request as a new request.
>
>   - If CRS software visibility is enabled,
> - for a config read of Vendor ID, the RC must return 0x0001 data
> - for all other config reads/writes, the RC must reissue the
>   request
>

Yes, the above is more relevant, and I will include it in the description.

> Apparently iProc *never* reissues config requests, regardless of
> whether CRS software visibility is enabled?
>

that is true.

> But your CFG_RETRY_STATUS definition suggests that iProc always
> fabricates config read data of 0xffff0001 when it sees CRS status, no
> matter whether software visibility is enabled and no matter what
> config address we read?
>

yes, that is precisely the case.


> What about CRS status for a config *write*?  There's nothing here to
> reissue those.
>

No, we do not need that here: a config read will always be issued before
any config write, so we do not need to implement the write path.
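
To make the read side concrete, here is a minimal sketch of the retry
loop, assuming the config data register for the target BDF/offset is
already mapped (the names mirror the defines in the patch below, but
treat this as an illustration rather than the exact code):

static u32 iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
{
	int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
	u32 data;

	/* Poll until the RC stops handing back the fabricated CRS value. */
	data = readl(cfg_data_p);
	while (data == CFG_RETRY_STATUS && timeout--) {
		udelay(1);		/* 1 us between retries */
		data = readl(cfg_data_p);
	}

	/* Still CRS after the ~500 ms budget: report "no device". */
	if (data == CFG_RETRY_STATUS)
		data = 0xffffffff;

	return data;
}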

> Is there a hardware erratum that describes the actual hardware
> behavior?

This is documented in our PCIe core hw manual:
>>
4.7.3.3. Retry Status On Configuration Cycle
Endpoints are allowed to generate retry status on configuration
cycles.  In this case, the RC needs to re-issue the request.  The IP
does not handle this because the number of configuration cycles needed
will probably be less
than the total number of non-posted operations needed.   When a retry status
is received on the User RX interface for a configuration request that
was sent on the User TX interface, it will be indicated with a
completion with the CMPL_STATUS field set to 2=CRS, and the user will
have to find the address and data values and send a new transaction on
the User TX interface.
When the internal configuration space returns a retry status during a
configuration cycle (user_cscfg = 1) on the Command/Status interface,
the pcie_cscrs will assert with the pcie_csack signal to indicate the
CRS status.
When the CRS Software Visibility Enable register in the Root Control
register is enabled, the IP will return the data value to 0x0001 for
the Vendor ID value and 0xFFFF (all 1’s) for the rest of the data in
the request for reads of offset 0 that return with CRS status.  This
is true for both the User RX Interface and for the Command/Status
interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
field of the completion on the User RX Interface will not be 2=CRS and
the pcie_cscrs signal will not assert on the Command/Status interface.
>>
Broadcom does not sell the PCIe core IP, so the above information is not
publicly available as a hardware erratum or any similar note.


>
>> As a result, the PCIe RC driver (sw) should take care of CRS.
>> This patch fixes the problem and attempts to read config space again when
>> the PCIe core forwards the CRS back to the CPU.
>> It implements iproc_pcie_config_read, which gets called for Stingray;
>> otherwise it falls back to the generic PCI APIs.
>>
>> Signed-off-by: Oza Pawandeep 
>> Reviewed-by: Ray Jui 
>> Reviewed-by: Scott Branden 
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index 0f39bd2..b0abcd7 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -68,6 +68,9 @@
>>  #define APB_ERR_EN_SHIFT 0
>>  #define APB_ERR_EN   BIT(APB_ERR_EN_SHIFT)
>>
>> +#define CFG_RETRY_STATUS 0xffff0001
>> +#define CFG_RETRY_STATUS_TIMEOUT_US  500000 /* 500 milli-seconds. */
>> +
>>  /* derive the enum index of the outbound/inbound mapping registers */
>>  #define MAP_REG(base_reg, index)  ((base_reg) + (index) * 2)
>>
>> @@ -448,6 +451,