Re: [PATCH] cxgb4vf: fix t4vf_eth_xmit()'s return type
| From: Luc Van Oostenryck
| Sent: Tuesday, April 24, 2018 6:19:02 AM
|
| The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
| which is a typedef for an enum type, but the implementation in this
| driver returns an 'int'.
|
| Fix this by returning 'netdev_tx_t' in this driver too.

Looks good to me.

Casey
Re: [PATCH v4 12/17] net: cxgb4/cxgb4vf: Eliminate duplicate barriers on weakly-ordered archs
[[ Apologies for the DUPLICATE email. I forgot to tell my Mail Agent to use Plain Text. -- Casey ]]

I feel very uncomfortable with these proposed changes. Our team is right in the middle of trying to tease our way through the various platform implementations of writel(), writel_relaxed(), __raw_writel(), etc. in order to support x86, PowerPC, ARM, etc. with a single code base. This is complicated by the somewhat ... "fuzzily defined" semantics and varying platform implementations of all of these APIs. (And note that I'm just picking writel() as an example.)

Additionally, many of the changes aren't even in fast paths and are thus unneeded for performance.

Please don't make these changes. We're trying to get this all sussed out.

Casey

From: Sinan Kaya
Sent: Monday, March 19, 2018 7:42:27 PM
To: netdev@vger.kernel.org; ti...@codeaurora.org; sulr...@codeaurora.org
Cc: linux-arm-...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; Sinan Kaya; Ganesh GR; Casey Leedom; linux-ker...@vger.kernel.org
Subject: [PATCH v4 12/17] net: cxgb4/cxgb4vf: Eliminate duplicate barriers on weakly-ordered archs

Code includes wmb() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up with the CPU observing two barriers back to back before executing the register write.

Create a new wrapper function with a relaxed write operator. Use the new wrapper when a write follows a wmb().
Signed-off-by: Sinan Kaya
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  6 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/sge.c        | 12 ++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |  2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/adapter.h  | 14 ++
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c      | 18 ++
 6 files changed, 44 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 9040e13..6bde0b9 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1202,6 +1202,12 @@ static inline void t4_write_reg(struct adapter *adap, u32 reg_addr, u32 val)
 	writel(val, adap->regs + reg_addr);
 }

+static inline void t4_write_reg_relaxed(struct adapter *adap, u32 reg_addr,
+					u32 val)
+{
+	writel_relaxed(val, adap->regs + reg_addr);
+}
+
 #ifndef readq
 static inline u64 readq(const volatile void __iomem *addr)
 {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 7b452e8..276472d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -1723,8 +1723,8 @@ int cxgb4_sync_txq_pidx(struct net_device *dev, u16 qid, u16 pidx,
 		else
 			val = PIDX_T5_V(delta);
 		wmb();
-		t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-			     QID_V(qid) | val);
+		t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+				     QID_V(qid) | val);
 	}
 out:
 	return ret;
@@ -1902,8 +1902,9 @@ static void enable_txq_db(struct adapter *adap, struct sge_txq *q)
 		 * are committed before we tell HW about them.
 		 */
 		wmb();
-		t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-			     QID_V(q->cntxt_id) | PIDX_V(q->db_pidx_inc));
+		t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+				     QID_V(q->cntxt_id) |
+				     PIDX_V(q->db_pidx_inc));
 		q->db_pidx_inc = 0;
 	}
 	q->db_disabled = 0;
@@ -2003,8 +2004,8 @@ static void sync_txq_pidx(struct adapter *adap, struct sge_txq *q)
 	else
 		val = PIDX_T5_V(delta);
 	wmb();
-	t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-		     QID_V(q->cntxt_id) | val);
+	t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+			     QID_V(q->cntxt_id) | val);
 }
 out:
 	q->db_disabled = 0;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 6e310a0..7388aac 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -530,11 +530,11 @@ static inline void ring_fl_db(struct adapter *adap, struct sge_fl *q)
 	 * mechanism.
 	 */
 	if (unlikely(q->bar2_addr == NULL)) {
-		t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBEL
Re: [PATCH net] cxgb4: Fix stack out-of-bounds read due to wrong size to t4_record_mbox()
| From: Stefano Brivio
| Sent: Friday, August 25, 2017 1:48 PM
|
| Passing commands for logging to t4_record_mbox() with size
| MBOX_LEN, when the actual command size is actually smaller,
| causes out-of-bounds stack accesses in t4_record_mbox() while
| copying command words here:
| ...

Thanks for catching this. Definitely a bug. Odd, because that's not what I checked into our out-of-kernel repository. And the corresponding code in the cxgb4vf driver is correct.

So yes, ACK!

Casey
Re: [PATCH v10 3/5] PCI: Disable Relaxed Ordering Attributes for AMD A1100
| From: Raj, Ashok
| Sent: Monday, August 14, 2017 10:19 AM
|
| On Mon, Aug 14, 2017 at 11:44:57PM +0800, Ding Tianhong wrote:
| > Casey reported that the AMD ARM A1100 SoC has a bug in its PCIe
| > Root Port where Upstream Transaction Layer Packets with the Relaxed
| > Ordering Attribute clear are allowed to bypass earlier TLPs with
| > Relaxed Ordering set, it would cause Data Corruption, so we need
| > to disable Relaxed Ordering Attribute when Upstream TLPs to the
| > Root Port.
| >
| > Signed-off-by: Casey Leedom
| > Signed-off-by: Ding Tianhong
| > Acked-by: Alexander Duyck
| > Acked-by: Ashok Raj
|
| I can't ack this patch :-).. must be someone from AMD. Please remove my
| signature from this.

You can go ahead and leave my name on, since I'm the person who found and diagnosed the problem (with help from others inside Chelsio).

If anyone on the Linux PCI List knows anyone at AMD who'd be willing to respond, that would be great. But as I noted earlier, I think that AMD has effectively abandoned the A1100 ("Seattle") ARM SoC, so I doubt we'll get anyone from AMD to even comment now.

Casey
Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Raj, Ashok
| Sent: Wednesday, August 9, 2017 11:00 AM
|
| On Wed, Aug 09, 2017 at 04:46:07PM +, Casey Leedom wrote:
| > | From: Raj, Ashok
| > | Sent: Wednesday, August 9, 2017 8:58 AM
| > | ...
| > | As Casey pointed out in an earlier thread, we choose the heavy hammer
| > | approach because there are some that can lead to data-corruption as
| > | opposed to perf degradation.
| >
| > Careful. As far as I'm aware, there is no Data Corruption problem
| > whatsoever with Intel Root Ports and processing of Transaction Layer
| > Packets with and without the Relaxed Ordering Attribute set.
|
| That's right.. no data-corruption on Intel parts :-).. It was with other
| vendor. Only performance issue with intel root-ports in the parts identified
| by the optimization guide.

Yes, I didn't want you to get into any trouble over that possible reading of what you wrote.

Any progress on the "Chicken Bit" investigation? Being able to disable the non-optimal Relaxed Ordering "optimization" would be the best PCI Quirk of all ...

Casey
Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Raj, Ashok
| Sent: Wednesday, August 9, 2017 8:58 AM
| ...
| As Casey pointed out in an earlier thread, we choose the heavy hammer
| approach because there are some that can lead to data-corruption as opposed
| to perf degradation.

Careful. As far as I'm aware, there is no Data Corruption problem whatsoever with Intel Root Ports and processing of Transaction Layer Packets with and without the Relaxed Ordering Attribute set.

The only issue which we've discovered with relatively recent Intel Root Port implementations and the use of the Relaxed Ordering Attribute is a performance issue. To the best of our ability to analyze the PCIe traces, it appeared that the Intel Root Complex delayed returning Link Flow Control Credits, resulting in lowered performance (total bandwidth). When we used Relaxed Ordering for Ingress Packet Data delivery on a 100Gb/s Ethernet link with a 1500-byte MTU, we were pegged at ~75Gb/s. Once we disabled Relaxed Ordering, we were able to deliver Ingress Packet Data to Host Memory at the full link rate.

Casey
Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Ding Tianhong
| Sent: Wednesday, August 9, 2017 5:17 AM
|
| On 2017/8/9 11:02, Bjorn Helgaas wrote:
| >
| > On Wed, Aug 09, 2017 at 01:40:01AM +, Casey Leedom wrote:
| > >
| >> | From: Bjorn Helgaas
| >> | Sent: Tuesday, August 8, 2017 4:22 PM
| >> | ...
| >> | It should also include a pointer to the AMD erratum, if available, or
| >> | at least some reference to how we know it doesn't obey the rules.
| >>
| >> Getting an ACK from AMD seems like a forlorn cause at this point. My
| >> contact was Bob Shaw and he stopped responding to my messages almost a
| >> year ago, saying that all of AMD's energies were being redirected
| >> towards upcoming x86 products (likely Ryzen as we now know). As far as
| >> I can tell, AMD has walked away from their A1100 (AKA "Seattle") ARM SoC.
| >>
| >> On the specific issue, I can certainly write up something even more
| >> extensive than I wrote up for the comment in drivers/pci/quirks.c.
| >> Please review the comment I wrote up and tell me if you'd like
| >> something even more detailed -- I'm usually accused of writing comments
| >> which are too long, so this would be a new one on me ... :-)
| >
| > If you have any bug reports with info about how you debugged it and
| > concluded that Seattle is broken, you could include a link (probably
| > in the changelog). But if there isn't anything, there isn't anything.
| ...
| OK, I could reorganize it, but still need Casey to give me the link
| for the Seattle issue; otherwise I could remove the AMD part and wait
| until someone shows it. Thanks

There are no links, and I was never given an internal bug number at AMD. As I said, they stopped responding to my notes about a year ago, saying that they were moving the focus of all their people and no longer had resources to pursue the issue. Hopefully for them, Ryzen doesn't have the same Data Corruption problem ...
As for how we diagnosed it: with our Ingress Packet delivery, we have the Ingress Packet Data delivered (DMA Write) into Free List Buffers, and then a small message (DMA Write) to a "Response Queue" indicating delivery of the Ingress Packet Data into the Free List Buffers. The Transaction Layer Packets which convey the Ingress Packet Data all have the Relaxed Ordering Attribute set, while the following TLP carrying the Ingress Data delivery notification into the Response Queue does not have the Relaxed Ordering Attribute set.

The rules for processing TLPs with and without the Relaxed Ordering Attribute set are covered in Section 2.4.1 of the PCIe 3.0 specification (Revision 3.0, November 10, 2010). Table 2-34 "Ordering Rules Summary" covers the cases where one TLP may "pass" (be processed earlier than) a preceding TLP. In the case we're talking about, we have a sequence of one or more Posted DMA Write TLPs with the Relaxed Ordering Attribute set and a following Posted DMA Write TLP without the Relaxed Ordering Attribute set. Thus we need to look at the Row A, Column 2 cell of Table 2-34 governing when a Posted Request may "pass" a preceding Posted Request. In that cell we have:

    a) No
    b) Y/N

with the explanatory text:

    A2a: A Posted Request must not pass another Posted Request unless A2b
    applies.

    A2b: A Posted Request with RO[23] Set is permitted to pass another Posted
    Request[24]. A Posted Request with IDO Set is permitted to pass another
    Posted Request if the two Requester IDs are different.

    [23] In this section, "RO" is an abbreviation for the Relaxed Ordering
    Attribute field.

    [24] Some usages are enabled by not implementing this passing (see the
    No RO-enabled PR-PR Passing bit in Section 7.8.15).

In our case, we were getting notifications of Ingress Packet Delivery in our Response Queues, but not all of the Ingress Packet Data Posted DMA Write TLPs had been processed yet by the Root Complex.
As a result, we were picking up old stale memory data before those lagging Ingress Packet Data TLPs could be processed. This is a clear violation of the PCIe 3.0 TLP processing rules outlined above.

Does that help?

Casey
Re: [PATCH v9 2/4] PCI: Disable PCIe Relaxed Ordering if unsupported
| From: Bjorn Helgaas
| Sent: Tuesday, August 8, 2017 7:22 PM
| ...
| and the caller should do something like this:
|
|     if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev)))
|         adapter->flags |= ROOT_NO_RELAXED_ORDERING;
|
| That way it's obvious where the issue is, and it's obvious that the
| answer might be different for peer-to-peer transactions than it is for
| transactions to the root port, i.e., to coherent memory.
| ...

Which is back to something very close to what I initially suggested in my first patch submission. Because you're right: this isn't about broken Source Devices, it's about broken Completing Devices.

Unfortunately, as Alexander Duyck noted, in a Virtual Machine we won't be able to see our Root Port in order to make this determination. (And in some Hypervisor implementations I've seen, there's not even a synthetic Root Port available to the VM at all, let alone read-only access to the real one.)

So the current scheme was developed of having the Hypervisor kernel traverse down the PCIe Fabric when it finds a broken Root Port implementation (the issue that we've mostly been focused upon), turning off the PCIe Capability Device Control[Relaxed Ordering Enable]. This was to serve two purposes:

1. Turn that off in order to prevent sending any Transaction Layer Packets with the Relaxed Ordering Attribute to any Completer. Which unfortunately would also prevent Peer-to-Peer use of the Relaxed Ordering Attribute.

2. Act as a message to Device Drivers for those downstream devices that they were dealing with a broken Root Port implementation. And this would work even for a driver in a VM with an attached device, since it would be able to see the PCIe Configuration Space for the attached device.

I haven't been excited about any of this because:

A. While so far all of the examples we've talked about are broken Root Port Completers, it's perfectly possible that other devices could be broken -- say an NVMe Device which is not "Coherent Memory". How would this fit into the framework APIs being described?

B. I have yet to see an example of how the currently proposed API framework would be used in a hybrid environment where TLPs to the Root Port would not use Relaxed Ordering, but TLPs to a Peer would use Relaxed Ordering. So far it's all been about using a "big hammer" to completely disable the use of Relaxed Ordering.

But the VM problem keeps cropping up over and over. A driver in a VM doesn't have access to the Root Port to determine if it's on a "Black List", and our only way of communicating with the driver in the VM is to leave the device in a particular state (i.e. initialize the PCIe Capability Device Control[Relaxed Ordering Enable] to "off").

Oh, and also, on the current patch submission's focus on broken Root Port implementations: one could suggest that even if we're stuck with the "Device attached to a VM Conundrum", what we should really be thinking about is whether ANY device within a PCIe Fabric has broken Relaxed Ordering completion problems, and, if so, "poisoning" the entire containing PCIe Fabric by turning off Relaxed Ordering Enable for every device -- up, down, sideways -- including the Root Port itself.

| ...
| This associates the message with the Requester that may potentially
| use relaxed ordering. But there's nothing wrong or unusual about the
| Requester; the issue is with the *Completer*, so I think the message
| should be in the quirk where we set PCI_DEV_FLAGS_NO_RELAXED_ORDERING.
| Maybe it should be both places; I dunno.
|
| This implementation assumes the device only initiates transactions to
| coherent memory, i.e., it assumes the device never does peer-to-peer
| DMA. I guess we'll have to wait and see if we trip over any
| peer-to-peer issues, then figure out how to handle them.
| ...

Yes, as soon as we want to implement the hybrid use of Relaxed Ordering I mentioned above. And which the Intel document itself mentions.

Casey
Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Bjorn Helgaas
| Sent: Tuesday, August 8, 2017 4:22 PM
|
| This needs to include a link to the Intel spec
| (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
| sec 3.9.1).

In the commit message or as a comment? Regardless, I agree. It's always nice to be able to go back and see what the official documentation says. However, that said, links on the internet are ... fragile as time goes by, so we might want to simply quote section 3.9.1 in the commit message since it's relatively short:

    3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
    and Toward MMIO Regions (P2P)

    In order to maximize performance for PCIe devices in the processors
    listed in Table 3-6 below, the software should determine whether the
    accesses are toward coherent memory (system memory) or toward MMIO
    regions (P2P access to other devices). If the access is toward MMIO
    region, then software can command HW to set the RO bit in the TLP
    header, as this would allow hardware to achieve maximum throughput for
    these types of accesses. For accesses toward coherent memory, software
    can command HW to clear the RO bit in the TLP header (no RO), as this
    would allow hardware to achieve maximum throughput for these types of
    accesses.

    Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
    PCIe Performance

    Processor                                                   CPU RP Device IDs
    Intel Xeon processors based on Broadwell microarchitecture  6F01H-6F0EH
    Intel Xeon processors based on Haswell microarchitecture    2F01H-2F0EH

| It should also include a pointer to the AMD erratum, if available, or
| at least some reference to how we know it doesn't obey the rules.

Getting an ACK from AMD seems like a forlorn cause at this point. My contact was Bob Shaw and he stopped responding to my messages almost a year ago, saying that all of AMD's energies were being redirected towards upcoming x86 products (likely Ryzen as we now know). As far as I can tell, AMD has walked away from their A1100 (AKA "Seattle") ARM SoC.

On the specific issue, I can certainly write up something even more extensive than I wrote up for the comment in drivers/pci/quirks.c. Please review the comment I wrote up and tell me if you'd like something even more detailed -- I'm usually accused of writing comments which are too long, so this would be a new one on me ... :-)

| Ashok, thanks for chiming in. Now that you have, I have a few more
| questions for you:

I can answer a few of these:

| - Is the above doc the one you mentioned as being now public?

Yes. Ashok worked with me to the extent he was allowed prior to the publishing of the public technical note, but he couldn't say much. (Believe it or not, it is possible to say less than the quoted section above.) When the note was published, Patrick Cramer sent me the note about it and pointed me at section 3.9.1.

| - Is this considered a hardware erratum?

I certainly consider it a Hardware Bug. And I'm really hoping that Ashok will be able to find a "Chicken Bit" which allows the broken feature to be turned off.

Remember, the Relaxed Ordering Attribute on a Transaction Layer Packet is simply a HINT. It is perfectly reasonable for a compliant implementation to simply ignore the Relaxed Ordering Attribute on an incoming TLP Request. The sole responsibility of a compliant implementation is to return the exact same Relaxed Ordering and No Snoop Attributes in any TLP Response. (The rules for the ID-Based Ordering Attribute are more complex.) Earlier Intel Root Complexes did exactly this: they ignored the Relaxed Ordering Attribute and there was no performance difference for using/not-using it. It's pretty obvious that an attempt was made to implement optimizations surrounding the use of Relaxed Ordering and they didn't work.

| - If so, is there a pointer to that as well?

Intel is historically tight-lipped about admitting any bugs/errata in their products. I'm guessing that the above quoted Section 3.9.1 is likely to be all we ever get. The language above regarding TLPs targeting Coherent Shared Memory is basically as much of an admission that they got it wrong as we're going to get. But heck, maybe we'll get lucky ... Especially with regard to the hoped-for "Chicken Bit" ...

| - If this is not considered an erratum, can you provide any guidance
|   about how an OS should determine when it should use RO?

Software? We don't need no stinking software! Sorry, I couldn't resist.

| Relying on a list of device IDs in an optimization manual is OK for an
| erratum, but if it's *not* an erratum, it seems like a hole in the specs
| because as far as I know there's no generic way for the OS to discover
| whether to use RO.

Well, here's to hoping that Ashok and/or Patrick are able to offer more detailed information ...

Casey
Re: [PATCH v8 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Ding Tianhong
| Sent: Thursday, August 3, 2017 6:44 AM
|
| diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
| index 6967c6b..1e1cdbe 100644
| --- a/drivers/pci/quirks.c
| +++ b/drivers/pci/quirks.c
| @@ -4016,6 +4016,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev)
|                               quirk_tw686x_class);
|
|  /*
| + * Some devices have problems with Transaction Layer Packets with the Relaxed
| + * Ordering Attribute set. Such devices should mark themselves and other
| + * Device Drivers should check before sending TLPs with RO set.
| + */
| +static void quirk_relaxedordering_disable(struct pci_dev *dev)
| +{
| +	dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
| +}
| +
| +/*
| + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can
| + * cause performance problems with Upstream Transaction Layer Packets with
| + * Relaxed Ordering set.
| + */
| +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f02, PCI_CLASS_NOT_DEFINED, 8,
| +			      quirk_relaxedordering_disable);
| +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f04, PCI_CLASS_NOT_DEFINED, 8,
| +			      quirk_relaxedordering_disable);
| +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f08, PCI_CLASS_NOT_DEFINED, 8,
| +			      quirk_relaxedordering_disable);
| + ...

It looks like this is missing the set of Root Complex IDs that were noted in the document to which Patrick Cramer sent us a reference:

https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

In section 3.9.1 we have:

    3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
    and Toward MMIO Regions (P2P)

    In order to maximize performance for PCIe devices in the processors
    listed in Table 3-6 below, the software should determine whether the
    accesses are toward coherent memory (system memory) or toward MMIO
    regions (P2P access to other devices). If the access is toward MMIO
    region, then software can command HW to set the RO bit in the TLP
    header, as this would allow hardware to achieve maximum throughput for
    these types of accesses. For accesses toward coherent memory, software
    can command HW to clear the RO bit in the TLP header (no RO), as this
    would allow hardware to achieve maximum throughput for these types of
    accesses.

    Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
    PCIe Performance

    Processor                                                   CPU RP Device IDs
    Intel Xeon processors based on Broadwell microarchitecture  6F01H-6F0EH
    Intel Xeon processors based on Haswell microarchitecture    2F01H-2F0EH

The PCI Device IDs you have there are the first ones that I guessed at having the performance problem with Relaxed Ordering. We now apparently have a complete list from Intel.

I don't want to phrase this as a "NAK" because you've gone around the mulberry bush a bunch of times already. So maybe just go with what you've got in version 8 of your patch and then do a follow-on patch to complete the table?

Casey
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
| From: Raj, Ashok
| Sent: Friday, August 4, 2017 1:21 PM
|
| On Fri, Aug 04, 2017 at 08:20:37PM +, Casey Leedom wrote:
| > ...
| > As I've noted a number of times, it would be great if the Intel Hardware
| > Engineers who attempted to implement the Relaxed Ordering semantics in the
| > current generation of Root Complexes had left the ability to turn off the
| > logic which is obviously not working. If there was a way to disable the
| > logic via an undocumented register, then we could have the Linux PCI Quirk
| > do that. Since Relaxed Ordering is just a hint, it's completely legitimate
| > to completely ignore it.
|
| Suppose you are looking for the existence of a chicken bit to instruct the
| port to ignore RO traffic. So all we would do is turn the chicken bit on
| but would permit p2p traffic to be allowed since we won't turn off the
| capability as currently proposed.
|
| Let me look into that and keep you posted.

Huh, I'd never heard it called a "chicken bit" before, but yes, that's what I'm talking about.

Whenever our Hardware Designers implement new functionality in our hardware, they almost always put in:

A. several "knobs" which can control fundamental parameters of the new Hardware Feature, and

B. a mechanism for completely disabling it if necessary.

This stems from the incredibly long Design -> Deployment cycle for Hardware (as opposed to the edit->compile->run cycle for software)!

It's obvious that handling Relaxed Ordering is a new Hardware Feature for Intel's Root Complexes, since previous versions simply ignored it (because that's legal[1]). If I were a Hardware Engineer tasked with implementing Relaxed Ordering semantics for a device, I would certainly have also implemented a switch to turn it off in case there were unintended consequences (performance in this case). And if there is such a mechanism to simply disable processing of Relaxed Ordering semantics in the Root Complex, that would be a far easier "fix" for this problem ... and would leave the code in place to continue sending Relaxed Ordering TLPs for a future Root Complex implementation which got it right ...

Casey

[1] One can't ~quite~ just ignore the Relaxed Ordering Attribute on an incoming Transaction Layer Packet Request: PCIe completion rules (see section 2.2.9 of the PCIe 3.0 specification) require that the Relaxed Ordering and No Snoop Attributes in a Request TLP be reflected back verbatim in any corresponding Response TLP. (The rules for ID-Based Ordering are more complex.)
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
| From: Raj, Ashok
| Sent: Thursday, August 3, 2017 1:31 AM
|
| I don't understand this completely.. So your driver would know not to send
| RO TLP's to root complex. But you want to send RO to the NVMe device? This
| is the peer-2-peer case correct?

Yes, this is the "heavy hammer" issue which you alluded to later. There are applications where a device will want to send TLPs to a Root Complex without Relaxed Ordering set, but will want to use it when sending TLPs to a Peer device (say, an NVMe storage device). The current approach doesn't make that easy ... and in fact, I still don't know how to code a solution for this with the proposed APIs. This means that we may be trading off one performance problem for another, and that Relaxed Ordering may be doomed for use under Linux for the foreseeable future.

As I've noted a number of times, it would be great if the Intel Hardware Engineers who attempted to implement the Relaxed Ordering semantics in the current generation of Root Complexes had left the ability to turn off the logic which is obviously not working. If there was a way to disable the logic via an undocumented register, then we could have the Linux PCI Quirk do that. Since Relaxed Ordering is just a hint, it's completely legitimate to completely ignore it.

Casey
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
Okay, here you go. As you can tell, it's almost a trivial copy of the cxgb4 patch.

By the way, I realized that we have yet another hole which is likely not to be fixable. If we're dealing with a problematic Root Complex, and we instantiate Virtual Functions and attach them to a Virtual Machine along with an NVMe device which can deal with Relaxed Ordering TLPs, the VF driver in the VM will be able to tell that it shouldn't attempt to send RO TLPs to the RC because it will see the state of its own PCIe Capability Device Control[Relaxed Ordering Enable] (a copy of the setting in the VF's corresponding PF), but it won't be able to change that and send non-RO TLPs to the RC, and RO TLPs to the NVMe device. Oh well.

I sure wish that the Intel guys would pop up with a hidden register change for these problematic Intel RCs that perform poorly with RO TLPs. Their silence has been frustrating.

Casey

--

cxgb4vf Ethernet driver now queries PCIe configuration space to determine if it can send TLPs to it with the Relaxed Ordering Attribute set.
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
index 109bc63..08c6ddb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
@@ -408,6 +408,7 @@ enum { /* adapter flags */
 	USING_MSI          = (1UL << 1),
 	USING_MSIX         = (1UL << 2),
 	QUEUES_BOUND       = (1UL << 3),
+	ROOT_NO_RELAXED_ORDERING = (1UL << 4),
 };

 /*
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index ac7a150..59e7639 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -2888,6 +2888,24 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
 	 */
 	adapter->name = pci_name(pdev);
 	adapter->msg_enable = DFLT_MSG_ENABLE;
+
+	/* If possible, we use PCIe Relaxed Ordering Attribute to deliver
+	 * Ingress Packet Data to Free List Buffers in order to allow for
+	 * chipset performance optimizations between the Root Complex and
+	 * Memory Controllers.  (Messages to the associated Ingress Queue
+	 * notifying new Packet Placement in the Free Lists Buffers will be
+	 * sent without the Relaxed Ordering Attribute, thus guaranteeing
+	 * that all preceding PCIe Transaction Layer Packets will be
+	 * processed first.)  But some Root Complexes have various issues
+	 * with Upstream Transaction Layer Packets with the Relaxed Ordering
+	 * Attribute set.  PCIe devices below such Root Complexes will have
+	 * the Relaxed Ordering bit cleared in their configuration space, so
+	 * we check our PCIe configuration space to see if it's flagged with
+	 * advice against using Relaxed Ordering.
+	 */
+	if (!pcie_relaxed_ordering_supported(pdev))
+		adapter->flags |= ROOT_NO_RELAXED_ORDERING;
+
 	err = adap_init0(adapter);
 	if (err)
 		goto err_unmap_bar;
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
index e37dde2..05498e7 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -2205,6 +2205,7 @@ int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq,
 	struct port_info *pi = netdev_priv(dev);
 	struct fw_iq_cmd cmd, rpl;
 	int ret, iqandst, flsz = 0;
+	int relaxed = !(adapter->flags & ROOT_NO_RELAXED_ORDERING);

 	/*
 	 * If we're using MSI interrupts and we're not initializing the
@@ -2300,6 +2301,8 @@ int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq,
 			cpu_to_be32(
 				FW_IQ_CMD_FL0HOSTFCMODE_V(SGE_HOSTFCMODE_NONE) |
 				FW_IQ_CMD_FL0PACKEN_F |
+				FW_IQ_CMD_FL0FETCHRO_V(relaxed) |
+				FW_IQ_CMD_FL0DATARO_V(relaxed) |
 				FW_IQ_CMD_FL0PADEN_F);

 	/* In T6, for egress queue type FL there is internal overhead
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
| From: Ding Tianhong | Sent: Wednesday, July 26, 2017 6:01 PM
|
| On 2017/7/27 3:05, Casey Leedom wrote:
| >
| > Ding, send me a note if you'd like me to work that [cxgb4vf patch] up
| > for you.
|
| Ok, you could send the change log and I could put it in the v8 version
| together. Will you base it on patch 3/3 or build an independent patch?

Whichever you'd prefer. It would basically mirror the exact same code that you've got for cxgb4. I.e. testing the setting of the VF's PCIe Capability Device Control[Relaxed Ordering Enable], setting a new flag in adapter->flags, and testing that flag in cxgb4vf/sge.c:t4vf_sge_alloc_rxq(). But since the VF's PF will already have disabled the PF's Relaxed Ordering Enable, the VF will also have its Relaxed Ordering Enable disabled, and any effort by the internal chip to send TLPs with the Relaxed Ordering Attribute will be gated by the PCIe logic. So it's not critical that this be in the first patch. Your call. Let me know if you'd like me to send that to you.

| From: Ding Tianhong | Sent: Wednesday, July 26, 2017 6:08 PM
|
| On 2017/7/27 2:26, Casey Leedom wrote:
| >
| > 1. Did we ever get any acknowledgement from either Intel or AMD
| > on this patch? I know that we can't ensure that, but it sure would
| > be nice since the PCI Quirks that we're putting in affect their
| > products.
|
| Still no Intel or AMD folks have acked this; this is what I am worried
| about. Should I ping someone again?
By amusing coincidence, Patrik Cramer (now Cc'ed) from Intel sent me a note yesterday with a link to the official Intel performance tuning documentation which covers this issue:

https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

In section 3.9.1 we have:

    3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent
          Memory and Toward MMIO Regions (P2P)

    In order to maximize performance for PCIe devices in the processors
    listed in Table 3-6 below, the software should determine whether the
    accesses are toward coherent memory (system memory) or toward MMIO
    regions (P2P access to other devices). If the access is toward MMIO
    region, then software can command HW to set the RO bit in the TLP
    header, as this would allow hardware to achieve maximum throughput
    for these types of accesses. For accesses toward coherent memory,
    software can command HW to clear the RO bit in the TLP header (no
    RO), as this would allow hardware to achieve maximum throughput for
    these types of accesses.

    Table 3-6. Intel Processor CPU RP Device IDs for Processors
               Optimizing PCIe Performance

    Processor                                      CPU RP Device IDs
    Intel Xeon processors based on                 6F01H-6F0EH
      Broadwell microarchitecture
    Intel Xeon processors based on                 2F01H-2F0EH
      Haswell microarchitecture

Unfortunately that's a pretty thin section. But it does expand the set of Intel Root Complexes which our Linux PCI Quirk will need to cover. So you should add those to the next (and hopefully final) spin of your patch. And it also verifies the need to handle the use of Relaxed Ordering more subtly than simply turning it off, since the NVMe peer-to-peer example I keep bringing up would fall into the "need to use Relaxed Ordering" case ...

It would have been nice to know why this is happening and whether any future processor will fix this. After all, Relaxed Ordering is just supposed to be a hint. At worst, a receiving device could simply ignore the attribute entirely.
Obviously someone made an effort to implement it but ... it didn't go the way they wanted. And it also would have been nice to know if there is any hidden register in these Intel Root Complexes which can completely turn off the effort to pay attention to the Relaxed Ordering Attribute.

We've spent an enormous amount of effort on this issue here on the Linux PCI email list, struggling mightily to come up with a way to determine when it's safe/recommended/not-recommended/unsafe to use Relaxed Ordering when directing TLPs towards the Root Complex. And some architectures require RO for decent performance, so we can't just "turn it off" unilaterally.

Casey
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
| From: Alexander Duyck | Sent: Wednesday, July 26, 2017 11:44 AM
|
| On Jul 26, 2017 11:26 AM, "Casey Leedom" wrote:
| |
| | I think that the patch will need to be extended to modify
| | drivers/pci/iov.c:sriov_enable() to explicitly turn off
| | Relaxed Ordering Enable if the Root Complex is marked
| | for no RO TLPs.
|
| I'm not sure that would be an issue. Wouldn't most VFs inherit the PF's
| settings?

Ah yes, you're right. This is covered in section 3.5.4 of the Single Root I/O Virtualization and Sharing Specification, Revision 1.0 (September 11, 2007), governing the PCIe Capability Device Control register. It states that the VF version of that register shall follow the setting of the corresponding PF. So we should enhance cxgb4vf/sge.c:t4vf_sge_alloc_rxq() in the same way we did for the cxgb4 driver, but that's not critical since the disabled Relaxed Ordering Enable supersedes the internal chip's desire to use the Relaxed Ordering Attribute. Ding, send me a note if you'd like me to work that up for you.

| Also I thought most of the VF configuration space is read only.

Yes, but not all of it. And when a VF is exported to a Virtual Machine, the Hypervisor captures and interprets all accesses to the VF's PCIe Configuration Space from the VM.

Thanks again for reminding me of the subtle aspect of the SR-IOV specification that I forgot.

Casey
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
By the way Ding, two issues:

1. Did we ever get any acknowledgement from either Intel or AMD on this patch? I know that we can't ensure that, but it sure would be nice since the PCI Quirks that we're putting in affect their products.

2. I just realized that there's still a small hole in the patch with respect to PCIe SR-IOV Virtual Functions. When the PCI Quirk notices a problematic PCIe Device and marks it to note that it's not "happy" receiving Transaction Layer Packets with the Relaxed Ordering Attribute set, it's my understanding that your current patch iterates down the PCIe Fabric and turns off the PCIe Capability Device Control[Relaxed Ordering Enable]. But this scan may miss any SR-IOV VFs because they probably won't be instantiated at that time. And, at a later time, when they are instantiated, they could well have their Relaxed Ordering Enable set. I think that the patch will need to be extended to modify drivers/pci/iov.c:sriov_enable() to explicitly turn off Relaxed Ordering Enable if the Root Complex is marked for no RO TLPs.

Casey
Re: [PATCH v7 3/3] net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Reviewed-by: Casey Leedom
Re: [PATCH v7 3/3] net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
[[ Sorry for the Double Send: I forgot to switch to Plain Text. Have I mentioned how much I hate modern Web-based email agents? :-) -- Casey ]] Yeah, I think this works for now. We'll stumble over what to do when we want to mix upstream TLPs without Relaxed Ordering Attributes directed at problematic Root Complexes, and Peer-to-Peer TLPs with Relaxed Ordering Attributes ... or vice versa depending on which target PCIe Device has issues with Relaxed Ordering. Thanks for all the work! Casey
Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
| From: Ding Tianhong | Sent: Wednesday, July 12, 2017 6:18 PM | | If no other more suggestion, I will send a new version and remove the | enable_pcie_relaxed_ordering(), thanks. :) Sounds good to me. (And sorry for forgetting to justify that last message. I hate working with web-based email agents.) Casey
Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Sorry again for the delay. This time it was at least partially caused by a Chelsio-internal Customer Support request to simply disable Relaxed Ordering entirely due to the performance issues with our 100Gb/s product and relatively recent Intel Root Complexes. Our Customer Support people are tired of asking customers to try turning off Relaxed Ordering. (sigh)

So, first off, I've mentioned a couple of times that the current cxgb4 driver hardwires the PCIe Capability Device Control[Relaxed Ordering Enable] on. Here's the code which does it, from drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4657:

    static void enable_pcie_relaxed_ordering(struct pci_dev *dev)
    {
            pcie_capability_set_word(dev, PCI_EXP_DEVCTL,
                                     PCI_EXP_DEVCTL_RELAX_EN);
    }

This is called from the PCI Probe() routine init_one() later in that file. I just wanted to make sure people knew about this. Obviously, given our current very difficult thread, this would either need to be diked out or changed, or a completely different mechanism put in place.

Second, just to make sure everyone's on the same page, the above simply allows the device to send TLPs with the Relaxed Ordering Attribute. It doesn't cause all TLPs to suddenly be sent with RO set. The use of Relaxed Ordering is selective. For instance, in our hardware we can configure the RX Path to use RO on Ingress Packet Data delivery to Free List Buffers, but not use RO for delivery of messages noting newly delivered Ingress Packet Data. Doing this allows the destination PCIe target to [potentially] optimize the DMA Writes to it based on local conditions (memory controller channel availability, etc.), but ensures that the message noting newly delivered Ingress Packet Data isn't processed till all of the preceding TLPs with RO set containing Ingress Packet Data have been processed. (This, by the way, is the essence of the AMD A1100 ARM SoC bug: its Root Complex isn't obeying that PCIe ordering rule.)
Third, as noted above, I'm getting a lot of pressure to get this addressed sooner than later, so I think that we should go with something fairly simple along the lines that you guys are proposing and I'll stop whining about the problem of needing to handle Peer-to-Peer with Relaxed Ordering while not using it for deliveries to the Root Complex. We can just wait for that kettle of fish to explode on us and deal with the mess then. (Hhmmm, the mixed metaphor landed in an entirely different place than I originally intended ... :-)) If we try to stick as closely to Ding's latest patch set as possible, then we can probably just add the diff to remove the enable_pcie_relaxed_ordering() code in cxgb4_main.c. Casey
Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Hey Alexander,

Okay, I understand your point regarding the "most likely scenario" being TLPs directed upstream to the Root Complex. But I'd still like to make sure that we have an agreed-upon API/methodology for doing Peer-to-Peer with Relaxed Ordering and no Relaxed Ordering to the Root Complex. I don't see how the proposed APIs can be used in that fashion.

Right now the proposed change for cxgb4 is for it to test its own PCIe Capability Device Control[Relaxed Ordering Enable] in order to use that information to program the Chelsio Hardware to emit/not emit upstream TLPs with the Relaxed Ordering Attribute set. But if we're going to have the mixed-mode situation I describe, the PCIe Capability Device Control[Relaxed Ordering Enable] will have to be set, which means that we'll be programming the Chelsio Hardware to send upstream TLPs with the Relaxed Ordering Attribute to the Root Complex, which is what we were trying to avoid in the first place ...

[[ And, as I noted on Friday evening, the current cxgb4 Driver hardwires the Relaxed Ordering Enable on early during device probe, so that would minimally need to be addressed even if we decide that we don't ever want to support mixed-mode Relaxed Ordering. ]]

We need some method of telling the Chelsio Driver that it should/shouldn't use Relaxed Ordering with TLPs directed at the Root Complex. And the same is true for a Peer PCIe Device.

It may be that we should approach this from the completely opposite direction and, instead of having quirks which identify problematic devices, have quirks which identify devices which would benefit from the use of Relaxed Ordering (if the sending device supports that). That is, assume that Relaxed Ordering shouldn't be used unless the target device says "I love Relaxed Ordering TLPs" ... In such a world, an NVMe or a Graphics device might declare love of Relaxed Ordering, and the same for a SPARC Root Complex (I think that was your example).
By the way, the sole example of Data Corruption with Relaxed Ordering is the AMD A1100 ARM SoC and AMD appears to have given up on that almost as soon as it was released. So what we're left with currently is a performance problem on modern Intel CPUs ... (And hopefully we'll get a Technical Publication on that issue fairly soon.) Casey
Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Okay, thanks for the note Alexander. I'll have to look more closely at the patch on Monday and try it out on one of the targeted systems to verify the semantics you describe.

However, that said, there is no way to tell a priori where a device will send TLPs. To simply assume that all TLPs will be directed towards the Root Complex is a big assumption. Only the device and the code controlling it know where the TLPs will be directed. That's why there are changes required in the cxgb4 driver. For instance, the code in drivers/net/ethernet/chelsio/cxgb4/sge.c:t4_sge_alloc_rxq() knows that it's allocating Free List Buffers in Host Memory and that the RX Queues that it's allocating in the Hardware will eventually send Ingress Data to those Free List Buffers. (And similarly for the Free List Buffer Pointer Queue with respect to DMA Reads from the host.) In that routine we explicitly configure the Hardware to use/not-use the Relaxed Ordering Attribute via the FW_IQ_CMD_FL0FETCHRO and FW_IQ_CMD_FL0DATARO flags. Basically we're conditionally setting them based on the desirability of sending Relaxed Ordering TLPs to the Root Complex. (And we would perform the same kind of check for an NVMe application ... which brings us to ...)

And what would be the code using these patch APIs to set up a Peer-to-Peer NVMe-style application? In that case we'd need the Chelsio adapter's PCIe Capability Device Control[Relaxed Ordering Enable] set for the NVMe application ... and we would avoid programming the Chelsio Hardware to use Relaxed Ordering for TLPs directed at the Root Complex. Thus we would be in a position where some TLPs being emitted by the device to Peer devices would have Relaxed Ordering set and some directed at the Root Complex would not. And the only way for that to work is if the source device's Device Control[Relaxed Ordering Enable] is set ...
Finally, setting aside my disagreements with the patch, we still have the code in the cxgb4 driver which explicitly turns on its own Device Control[Relaxed Ordering Enable] in cxgb4_main.c: enable_pcie_relaxed_ordering(). So the patch is something of a loop if all we're doing is testing our own Relaxed Ordering Enable state ... Casey
Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
By the way, it ~seems~ like the patch set confuses the idea of the PCIe Capability Device Control[Relaxed Ordering Enable] with the device's ability to handle incoming TLPs with the Relaxed Ordering Attribute set. These are completely different things.

The PCIe Capability Device Control[Relaxed Ordering Enable] solely governs the ability of the device to _send_ TLPs with the Relaxed Ordering Attribute set. It has nothing whatsoever to do with the handling of incoming TLPs with the Relaxed Ordering Attribute set. In fact, there isn't any standard way to disable receipt processing of such TLPs. That's kind of the whole point of the majority of the patch. If there were a separate "Ignore Incoming Relaxed Ordering Attributes" bit, then we'd mostly just have a quirk which would set that for problematic devices.

Worse yet, if I'm reading the patch right, it _is_ turning off the PCIe Capability Device Control[Relaxed Ordering Enable] for the devices that we've identified with problematic receive handling. Not only does this not solve the problem that we've been talking about, it actually almost certainly introduces a huge Graphics Performance Bug. The Relaxed Ordering Attribute was originally developed for Graphics Device support, in order to download textures, etc. to graphics devices as fast as possible and only ensure ordering at points where needed. For instance, as far as I know, the Intel Root Complexes that we've been talking about have no issues whatsoever in generating downstream TLPs with the Relaxed Ordering Attribute set. By turning off the PCIe Capability Device Control[Relaxed Ordering Enable] on these Root Complexes, this patch prevents the generation of downstream TLPs with the Relaxed Ordering Attribute set targeted at devices like graphics adapters.

Casey
Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Hey Ding, Bjorn, Alexander, et al.,

Sorry for the insane delay in getting to this and thanks especially to Ding for picking up the ball on this feature. I got side-tracked into a multi-week rewrite of our Firmware/Host Driver Port Capabilities code, then to the recent Ethernet Plug-Fest at the University of New Hampshire for a week, and finally the 4th of July weekend. Digging out from under email has been non-amusing. In any case, enough excuses.

I'm looking at the "v6" version of the patches. If this isn't the correct version, please yell at me and I'll look at the correct one immediately.

In the change to drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c (patch 3/3) it looks like the code is checking the Chelsio Adapter PCI Device to see if Relaxed Ordering is supported. But what we really want to test is the Root Complex port. I.e. something like:

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 38a5c67..546538d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -4620,6 +4620,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	struct port_info *pi;
 	bool highdma = false;
 	struct adapter *adapter = NULL;
+	struct pci_dev *root;
 	struct net_device *netdev;
 	void __iomem *regs;
 	u32 whoami, pl_rev;
@@ -4726,6 +4727,24 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	adapter->msg_enable = DFLT_MSG_ENABLE;
 	memset(adapter->chan_map, 0xff, sizeof(adapter->chan_map));
 
+	/* If possible, we use PCIe Relaxed Ordering Attribute to deliver
+	 * Ingress Packet Data to Free List Buffers in order to allow for
+	 * chipset performance optimizations between the Root Complex and
+	 * Memory Controllers.  (Messages to the associated Ingress Queue
+	 * notifying new Packet Placement in the Free Lists Buffers will be
+	 * sent without the Relaxed Ordering Attribute, thus guaranteeing
+	 * that all preceding PCIe Transaction Layer Packets will be
+	 * processed first.)  But some Root Complexes have various issues
+	 * with Upstream Transaction Layer Packets with the Relaxed Ordering
+	 * Attribute set.  PCIe devices below such Root Complexes will have
+	 * the Relaxed Ordering bit cleared in their configuration space, so
+	 * we check our PCIe configuration space to see if it's flagged with
+	 * advice against using Relaxed Ordering.
+	 */
+	root = pci_find_pcie_root_port(pdev);
+	if (!pcie_relaxed_ordering_supported(root))
+		adapter->flags |= ROOT_NO_RELAXED_ORDERING;
+
 	spin_lock_init(&adapter->stats_lock);
 	spin_lock_init(&adapter->tid_release_lock);
 	spin_lock_init(&adapter->win0_lock);

The basic idea is that we want to find out if the Root Complex is okay with TLPs directed its way with the Relaxed Ordering Attribute set.

I also have to say that I'm a bit confused by the pcie_relaxed_ordering_supported() in use here. It seems that it's testing whether the PCI Device's own Relaxed Ordering Enable is set, which governs whether it will send TLPs with the Relaxed Ordering Attribute set. It has nothing to do with whether or not the device responds well to incoming TLPs with the Relaxed Ordering Attribute set. I think that we really want to use the pci_dev_should_disable_relaxed_ordering() API on the Root Complex Port, because that tests whether the quirk fired on it. I.e.:

+	root = pci_find_pcie_root_port(pdev);
+	if (pci_dev_should_disable_relaxed_ordering(root))
+		adapter->flags |= ROOT_NO_RELAXED_ORDERING;

Casey
Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes
| From: Andrew Lunn | Sent: Thursday, July 6, 2017 4:15 PM | | > Even this feels too extreme for me. I think users would hate it. They | > did an ifup and swapped cables. They expect the OS/Driver/Firmware | > to continue trying to honor their ifup request. | | Lets take a look around at other subsystems | ... | Do you know of any subsystem that tried to keep its configuration | across a hot unplug/plug event? I suspect they all require user space | to take some action to get the newly plugged hardware into operation. I agree ... -ish ... :-) If you choose to think of a cable unplug/plug event as "hot plug", then the "reset" is the model that feels right. But I'll note that this is also presupposing what the right model is for users. This is akin to trying to decide what to make for dinner and deciding that a hammer is the right tool. If we end up deciding on Cracked Crab (or Tree Nuts for the vegans amongst us), then the hammer is quite possibly a good answer. All I can continue to say is: keep on thinking about users. How they're using networking devices now, how they think about them, how we're going to explain to them whatever changes they'll need to make to their usage of network ports. And yes, how this will fit into management models for other OSes, etc. The customers aren't always right, but their opinion matters a lot. Casey
Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes
| From: Jakub Kicinski | Sent: Thursday, July 6, 2017 3:43 PM
|
| On Thu, 6 Jul 2017 21:53:46 +, Casey Leedom wrote:
| >
| > But, and far more importantly, ideally _*ANY*_ such decision is made at an
| > architectural level to apply to all Link Parameters and Vendor Products.
| > The last thing a user wants to deal with is a hodge-podge of different
| > policies for different adapters from different vendors.
|
| Agreed. Once we decide, we should make the expected behaviour very
| clear to everyone.
|
| Sorry if I'm misunderstanding - are you suggesting we should keep the
| speed settings if hand-selected? My feeling is those should be reset
| if they are incompatible with the module.

Again, just to be perfectly clear, I don't think that there's a perfect solution to this. My personal feeling is that we need to think through all of the usage scenarios and see what happens in the various models, and, more importantly, how easily we can explain the inevitable repercussions to users. Again, the "simplest model wins" philosophy.

But to your specific question: yes, I am saying that if the user selected 25Gb/s and accidentally swapped in a 10Gb/s cable, and then swapped a different {100,25}Gb/s cable back in, the advertised Speed(s) should be 25Gb/s with the newest cable, respecting the original Link Parameter setting with the first cable. And again, I don't believe that we'll come up with a perfect solution. But hopefully we can come up with an abstraction model that's easy to understand and use. (And requires the fewest changes to current accepted practices.)

| Hm. I'm beginning to come around on this. If my understanding of PHY
| sub-layers is correct the FEC layer shouldn't be reset on module
| unplug. OTOH when someone replaces a DAC with an optical module,
| keeping FEC around is not going to do any good to users...

When I first stumbled across this issue several months ago I thought I must be dense. Nothing this simple should be this complicated.
It's been extremely gratifying to find others similarly flummoxed ... :-) Casey
Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes
| From: Andrew Lunn | Sent: Thursday, July 6, 2017 3:33 PM
|
| On Thu, Jul 06, 2017 at 09:53:46PM +0000, Casey Leedom wrote:
| >
| > But, and far more importantly, ideally _*ANY*_ such decision is made at an
| > architectural level to apply to all Link Parameters and Vendor Products.
| > The last thing a user wants to deal with is a hodge-podge of different
| > policies for different adapters from different vendors.
|
| Yes.
|
| SFP needs to become a Linux device, similar to how Copper PHYs are Linux
| devices. With some core code which all drivers can use, implement
| ethtool --dump-module-eeprom, report speeds to the MAC using
| adjust_link, etc..

Okay. This "feels" like an implementation approach rather than a model abstraction commentary, but it sounds like it would help ensure consistency across vendors, etc.

| > how do users conceive of a "Port"?
|
| For a user, it is something they configure via /etc/network/interfaces
| and then use ifup/ifdown on.
|
| ...
|
| And this is where it gets interesting, as you say. We are into a
| hotplug model.
|
| I think you also need to define 'cable' here. I assume you are not
| talking about a piece of CAT 5 or glass fibre. You mean something
| which is active. You are putting a different module into the SFP cage.

Yes, I'm talking about active Transceiver Modules, whether welded onto the ends of copper-based cables or Optical Transceiver Modules.

| The extreme model would be, if you pull the module out, the whole
| netdev is hot-unplugged. Plug a different module in, the netdev is
| hot-plugged. The user has to ifup it again, and would get an error
| message if the configuration is invalid.
|
| But I think this is too extreme.

I agree. This would force a completely new set of behavior on all users. And it wouldn't match the paradigms used by any other OS.
(Yes, I know that Linux tends to ignore such issues, but users do live in mixed shops and it would be cruel to them to force radically different operating paradigms between the systems they need to use.) | I think the sfp device needs to give a hotplug event on unplug/plug. | A hot-unplug would result in an ifdown. And within the kernel, the | netdev is set down. If there is an "allow-hotplug" statement in | /etc/network/interfaces, on hot-plug, udev would try to ifup and get | an error and it will stay down. Without the "allow-hotplug" the | interface remains configured down until the user does an ifup and | would see an error message if the configuration is invalid. Even this feels too extreme for me. I think users would hate it. They did an ifup and swapped cables. They expect the OS/Driver/Firmware to continue trying to honor their ifup request. Casey
Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes
| From: Wyborny, Carolyn | Sent: Thursday, July 6, 2017 3:16 PM | | Agree with your points generally, but other networking hw vendors have | dealt with this since SFP variant and other modules arrived on the scene. The only case of this of which I'm aware is the SFP+ RJ45 1Gb/s Transceiver Module. But yes, that presaged what we're looking at now as a much simpler example. Casey
Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes
| From: Jakub Kicinski | Sent: Thursday, July 6, 2017 12:02 PM
|
| IMHO if something gets replugged all the settings should be reset.
| I feel that it's not entirely unlike replugging a USB adapter. Perhaps
| we should introduce some (devlink) notifications for SFP module events
| so userspace can apply whatever static user config it has?

Absolutely a valid approach. As are all of the ones I outlined. But, and far more importantly, ideally _*ANY*_ such decision is made at an architectural level to apply to all Link Parameters and Vendor Products. The last thing a user wants to deal with is a hodge-podge of different policies for different adapters from different vendors.

As I noted in my previous letter: this is something new that we've never faced before with any prior networking technology. All previous networking technologies had a static set of Physical Port Capabilities, fixed from the moment a Host Driver/Firmware first sees a Port. We're now facing a situation where these can change dynamically from moment to moment based on what Transceiver Module is inserted.

With regard to this "architectural" issue, one way of trying to tease out what model will be the simplest for users to work with is to ask: how do users conceive of a "Port"? I.e. when a user requests that a particular Link Parameter be applied to a Port, are they thinking that it only applies to the current instantaneous combination of Adapter Transceiver Module Cage + Transceiver Module? Or do they conceptualize a "Port" as being a higher level entity?

Or, let's make it Very Concrete with a specific example:

1. User applies some set of Link Parameters.
2. User attempts to bring Link up but it doesn't come up.
3. User decides to try a different cable on the grounds that the first cable may be bad.
4. New cable is accidentally of a completely different type with completely different subsequent Physical Port Capabilities, not capable of supporting the user's selected Link Parameters.
5. User realizes this accidental mis-selection, and swaps in a third cable, similar to the first cable.
6. User attempts to bring the Link up with the third cable.

If we reset all of the user-specified Link Parameters at point #3 and/or #5, then the user's requested Link Parameter settings from point #1 will no longer be in effect at point #6. Is this expected/desirable?

In our discussions (Dustin, I, and others), we felt that the answer to this scenario question was "no". I.e. that users would have a persistent memory of having applied a set of Link Parameter settings to the "Port" and would be annoyed/confused if those settings were "arbitrarily" reset at some indefinite time.

But this is a discussion and a decision that needs to be made. Regardless of what people finally decide is the "Right Answer", it needs to be extremely well documented, it needs to be executed uniformly, and we may want to explicitly notify the user at the points where something unexpected/confusing is occurring. Say, in the "Reset Unsupportable Link Parameters When Transceiver Module Is Changed" model that you advocate, when the user's Link Parameter settings are reset.

Casey
Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes
| From: Jakub Kicinski | Sent: Wednesday, June 28, 2017 6:00 PM
|
| On Wed, 28 Jun 2017 14:47:51 -0700, Dustin Byford wrote:
| >
| > You're not the first, or the second to ask that question. I agree it
| > could use clarification.
| >
| > I always read auto in this context as automatic rather than autoneg.
| > The best I can come up with is to perhaps fully spell out "automatic" in
| > the documentation and the associated uapi enums. It's accurate, and
| > hopefully different enough from "autoneg" to hint people away from the
| > IEEE autoneg concept.
|
| So perhaps just "default"? Even saying something like ieee-selected
| doesn't really help, because apparently there are two autonegs defined
| - IEEE one and a "consortium" one...

First, sorry for the extreme delay in responding. I was at the {25,100}Gb/s Ethernet "Plug Fest" all last week and then the holiday weekend added to the delay.

As Dustin notes, you're not the first to be confused by the use of the term "auto". In this sense, "auto" means "automatic". It essentially means: "Use standard FEC settings as specified by IEEE 802.3 interpretations of Cable Transceiver Module parameters." But this is quite a mouthful. Other keywords which we bandied about included: standard, default, ieee, ieee802.3, ieee802.3-default, transceiver-default, etc. As you can see, the quest to make the option more obvious led to increasing verbosity. I think that in the end, we decided to go with a moderately short keyword with tons of documentation to make the meaning clear.

By the way, this isn't the end of this problem. The new QSFP28 and SFP28 Port Types have introduced a problem which no one has ever seen with any preceding network technology. With all previous networking technologies, a driver could look at an adapter Port Type and know what its Port Capabilities were in terms of Speeds, etc., and those Port Capabilities would never change.
With the new QSFP28 and SFP28 Port Types, the Physical Port Capabilities can change based on what Transceiver Modules you plug in. For instance, with one QSFP28 Transceiver Module you might see {100,50,25}Gb/s and with a different one {40,10}Gb/s. One Transceiver Module might support Reed-Solomon Forward Error Correction and another might also support BaseR/Reed-Solomon. For the proposed FEC controls, we've tried to cope with this by having this "Automatic IEEE802.3 Transceiver Module-dictated FEC" via the "auto" selection (where we hope most users will leave it). This allows the Firmware/Host Driver to automatically track the FEC needs of the currently plugged in Transceiver Module. However, the first question which pops up is: what happens if a user explicitly selects a particular FEC from the set offered by the current Transceiver Module, and then swaps out Transceiver Modules to one which doesn't support that FEC? Does the OS/Driver/Firmware "forget" the user's FEC setting? Does it respect it and potentially not get a link established? Do we "temporarily" put the user's FEC setting aside? Or do we have FEC settings per-Transceiver Module Type? Each choice has downsides in terms of "expected behavior" and the complexity of the abstraction model offered to the user. In our discussions of the above, we decided that we should err in the direction of the absolutely simplest abstraction model, even when that might result in a failure to establish a link. Our feeling was that doing anything else would result in endless user confusion. Basically, if it takes longer than a simple paragraph to explain, you're probably doing the wrong thing. So, we decided that if a user sets up a particular FEC setting, we keep that regardless of conflict with different Transceiver Modules. (But flag such issues in the System Log in order to try to give the user a chance to understand why the new cable they plugged in isn't working.) 
As noted above, the "auto" FEC setting solves the problem of automatically tracking the FEC needs of whatever Transceiver Module happens to be plugged in. And now with QSFP28 and SFP28, we have the same issue with possible Speeds (and other Link Parameters). It may well be that we're going to need to extend this idea into Link Speeds. Casey
Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Ding Tianhong | Sent: Wednesday, May 10, 2017 6:15 PM | | Hi Casey: | | Will you continue to work on this solution or send a new version patches? I won't be able to work on this any time soon given several other urgent issues. If you've got a desire to pick this up, I'd be happy to help code review your efforts. Casey
Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Alexander Duyck | Date: Saturday, May 6, 2017 11:07 AM | | | From: Ding Tianhong | | Date: Fri, May 5, 2017 at 8:08 PM | | | | According the suggestion, I could only think of this code: | | .. | | This is a bit simplistic but it is a start. Yes, something tells me that this is going to be more complicated than any of us like ... | The other bit I was getting at is that we need to update the core PCIe | code so that when we configure devices and the root complex reports no | support for relaxed ordering it should be clearing the relaxed | ordering bits in the PCIe configuration registers on the upstream | facing devices. Of course, this can be written to by the Driver at any time ... and is in the case of the cxgb4 Driver ... After a lot of rummaging around, it does look like Xen prohibits writes to the PCIe Capability Device Control register in drivers/xen/xen-pciback/conf_space_capability.c and conf_space.c simply because writes aren't allowed unless "permissive" is set. So it ~looks~ like a driver running in a Virtual Machine can't turn Enable Relaxed Ordering back on ... | The last bit we need in all this is a way to allow for setups where | peer-to-peer wants to perform relaxed ordering but for writes to the | host we have to not use relaxed ordering. For that we need to enable a | special case and that isn't handled right now in any of the solutions | we have coded up so far. Yes, we do need this. 
| | At that point the driver code itself becomes very simple since you | could just enable the relaxed ordering by default in the igb/ixgbe | driver and if the bit is set or cleared in the PCIe configuration then | we are either sending with relaxed ordering requests or not and don't | have to try and locate the root complex. | | So from the sound of it Casey has a special use case where he doesn't | want to send relaxed ordering frames to the root complex, but instead | would like to send them to another PCIe device. To do that he needs to | have a way to enable the relaxed ordering bit in the PCIe | configuration but then not send any to the root complex. Odds are that | is something he might be able to just implement in the driver, but is | something that may become a more general case in the future. I don't | see our change here impacting it as long as we keep the solution | generic and mostly confined to when we instantiate the devices as the | driver could likely make the decision to change the behavior later. It's not just me. Intel has said that while RO directed at the Root Complex Host Coherent Memory has a performance bug (not Data Corruption), it's a performance win for Peer-to-Peer writes to MMIO Space. (I'll be very interested in hearing what the bug is if we get that much detail. The very same TLPs directed to the Root Complex Port without Relaxed Ordering set get good performance. So this is essentially a bug in the hardware that was ~trying~ to implement a performance win.) Meanwhile, I currently only know of a single PCIe End Point which causes catastrophic results: the AMD A1100 ARM SoC ("SEATTLE"). And it's not even clear that product is alive anymore since I haven't been able to get any responses from them for several months. What I'm saying is: let's try to architect a solution which doesn't throw the baby out with the bath water ... 
I think that if a Device's Root Complex Port has problems with Relaxed Ordering, it ~probably~ makes sense to turn off the PCIe Capability Device Control[Enable Relaxed Ordering] when we assign a device to a Virtual Machine since the Device Driver can no longer query the Relaxed Ordering Support of the Root Complex Port. The only down side of this would be if we assigned two Peers to a VM in an application which wanted to do Peer-to-Peer transfers. But that seems like a hard application to support in any case since the PCI Bus:Slot.Function IDs for assigned Devices within a VM don't match the actual values. For Devices running in the base OS/Hypervisor, their Drivers can query the Relaxed Ordering Support for the Root Complex Port or a Peer Device. So a simple flag within the (struct pci_dev *)->dev_flags would serve for that along with a per-Architecture/Platform mechanism for setting it ... Casey
Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Alexander Duyck | Sent: Wednesday, May 3, 2017 9:02 AM | ... | It sounds like we are more or less in agreement. My only concern is | really what we default this to. On x86 I would say we could probably | default this to disabled for existing platforms since my understanding | is that relaxed ordering doesn't provide much benefit on what is out | there right now when performing DMA through the root complex. As far | as peer-to-peer I would say we should probably look at enabling the | ability to have Relaxed Ordering enabled for some channels but not | others. In those cases the hardware needs to be smart enough to allow | for you to indicate you want it disabled by default for most of your | DMA channels, and then enabled for the select channels that are | handling the peer-to-peer traffic. Yes, I think that we are mostly in agreement. I had just wanted to make sure that whatever scheme was developed would allow for simultaneously supporting non-Relaxed Ordering for some PCIe End Points and Relaxed Ordering for others within the same system. I.e. not simply enabling/disabling/etc. based solely on System Platform Architecture. By the way, I've started our QA folks off looking at what things look like in Linux Virtual Machines under different Hypervisors to see what information they may provide to the VM in the way of what Root Complex Port is being used, etc. So far they've got Windows HyperV done and there's no PCIe Fabric exposed in any way: just the attached device. I'll have to see what pci_find_pcie_root_port() returns in that environment. Maybe NULL? With your reservations (which I also share), I think that it probably makes sense to have a per-architecture definition of the "Can I Use Relaxed Ordering With TLPs Directed At This End Point" predicate, with the default being "No" for any architecture which doesn't implement the predicate. And if the specified (struct pci_dev *) End Node is NULL, it ought to return False for that as well. 
I can't see any reason to pass in the Source End Node but I may be missing something. At this point, this is pretty far outside my level of expertise. I'm happy to give it a go, but I'd be even happier if someone with a lot more experience in the PCIe Infrastructure were to want to carry the ball forward. I'm not super familiar with the Linux Kernel "Rules Of Engagement", so let me know what my next step should be. Thanks. Casey
Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
| From: Alexander Duyck | Date: Tuesday, May 2, 2017 11:10 AM | ... | So for example, in the case of x86 it seems like there are multiple | root complexes that have issues, and the gains for enabling it with | standard DMA to host memory are small. As such we may want to default | it to off via the architecture specific PCIe code and then look at | having "white-list" cases where we enable it for things like | peer-to-peer accesses. In the case of SPARC we could look at | defaulting it to on, and only "black-list" any cases where there might | be issues since SPARC relies on this in a significant way for | performance. In the case of ARM and other architectures it is a bit of | a toss-up. I would say we could just default it to on for now and | "black-list" anything that doesn't work, or we could go the other way | like I suggested for x86. It all depends on what the ARM community | would want to agree on for this. I would say unless it makes a | significant difference like it does in the case of SPARC we are | probably better off just defaulting it to off. Sorry, I forgot to respond to this earlier when someone was rushing me out to a customer dinner. I'm unaware of any other x86 Root Complex Port that has a problem with Relaxed Ordering other than the performance issue with the current Intel implementation. Ashok tells me that Intel is in the final stages of putting together a technical notice on this issue but I don't know when that will come out. Hopefully that will shed much more light on the cause and actual use of Relaxed Ordering when directed to Coherent Memory on current and past Intel platforms. (Note that the performance bug seems to limit us to ~75-85Gb/s DMA Write speed to Coherent Host Memory.) The only other Device that I currently know of which has issues with Relaxed Ordering TLPs directed at it, is also a Root Complex Port on the new AMD A1100 ARM SoC ("SEATTLE"). There we have an actual bug which could lead to Data Corruption. 
But I don't know anything about other x86 Root Complex Ports having issues with this -- we've been using it ever since commit ef306b50b from December 2010. Also, I'm not aware of any performance data which has been gathered on the use of Relaxed Ordering when directed at Host Memory. From your note, it sounds like it's important on SPARC architectures. But it could conceivably be important on any architecture or Root Complex/Memory Controller implementation. We use it to direct Ingress Packet Data to Free List Buffers, followed by a TLP without Relaxed Ordering directed at a Host Memory Message Queue. The idea here is to give the Root Complex options on which DMA Memory Write TLPs to process in order to optimize data placement in memory. And with the next Ethernet speed bump to 200Gb/s this may be even more important. Basically, I'd hate to come up with a solution where we write off the use of Relaxed Ordering for DMA Writes to Host Memory. I don't think you're suggesting that, but there are a number of x86 Root Complex implementations out there -- and some like the new AMD Ryzen have yet to be tested -- as well as other architectures. And, as Ashok and I have both noted, any solution we come up with needs to cope with a heterogeneous situation where, on the same PCIe Fabric, it may be necessary/desirable to support Relaxed Ordering TLPs directed at some End Nodes but not others. Casey
[PATCH 0/2] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Some devices have problems with Transaction Layer Packets with the Relaxed Ordering Attribute set. This patch set adds a new PCIe Device Flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING, a set of PCI Quirks to catch some known devices with Relaxed Ordering issues, and a use of this new flag by the cxgb4 driver to avoid using Relaxed Ordering with problematic Root Complex Ports. It's been years since I've submitted kernel.org patches, so I apologise for the almost certain submission errors. Casey Leedom (2): PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 1 + drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 17 +++ drivers/net/ethernet/chelsio/cxgb4/sge.c| 5 ++-- drivers/pci/quirks.c| 38 + include/linux/pci.h | 2 ++ 5 files changed, 61 insertions(+), 2 deletions(-) -- 1.9.1
[PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed Ordering Attribute should not be used on Transaction Layer Packets destined for the PCIe End Node so flagged. Initially flagged this way are Intel E5-26xx Root Complex Ports which suffer from a Flow Control Credit Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which don't obey PCIe 3.0 ordering rules which can lead to Data Corruption. --- drivers/pci/quirks.c | 38 ++ include/linux/pci.h | 2 ++ 2 files changed, 40 insertions(+) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index f754453..4ae78b3 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev) quirk_tw686x_class); /* + * Some devices have problems with Transaction Layer Packets with the Relaxed + * Ordering Attribute set. Such devices should mark themselves and other + * Device Drivers should check before sending TLPs with RO set. + */ +static void quirk_relaxedordering_disable(struct pci_dev *dev) +{ + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; +} + +/* + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can + * cause performance problems with Upstream Transaction Layer Packets with + * Relaxed Ordering set. + */ +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); + +/* + * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex + * where Upstream Transaction Layer Packets with the Relaxed Ordering + * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering + * set. 
This is a violation of the PCIe 3.0 Transaction Ordering Rules + * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0 + * November 10, 2010). As a result, on this platform we can't use Relaxed + * Ordering for Upstream TLPs. + */ +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a00, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a01, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a02, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); + +/* * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same * values for the Attribute as were supplied in the header of the * corresponding Request, except as explicitly allowed when IDO is used." diff --git a/include/linux/pci.h b/include/linux/pci.h index eb3da1a..6764f66 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -178,6 +178,8 @@ enum pci_dev_flags { PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7), /* Get VPD from function 0 VPD */ PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8), + /* Don't use Relaxed Ordering for TLPs directed at this device */ + PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 9), }; enum pci_irq_reroute_variant { -- 1.9.1
[PATCH 2/2] net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
cxgb4 Ethernet driver now queries Root Complex Port to determine if it can send TLPs to it with the Relaxed Ordering Attribute set. --- drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 1 + drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 17 + drivers/net/ethernet/chelsio/cxgb4/sge.c| 5 +++-- 3 files changed, 21 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h index 163543b..46d61b1 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h @@ -512,6 +512,7 @@ enum { /* adapter flags */ USING_SOFT_PARAMS = (1 << 6), MASTER_PF = (1 << 7), FW_OFLD_CONN = (1 << 9), + ROOT_NO_RELAXED_ORDERING = (1 << 10), }; enum { diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c index afb0967..510c020 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c @@ -4636,6 +4636,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent) #ifdef CONFIG_PCI_IOV u32 v, port_vec; #endif + struct pci_dev *root; printk_once(KERN_INFO "%s - version %s\n", DRV_DESC, DRV_VERSION); @@ -4734,6 +4735,22 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent) adapter->msg_enable = DFLT_MSG_ENABLE; memset(adapter->chan_map, 0xff, sizeof(adapter->chan_map)); + /* If possible, we use PCIe Relaxed Ordering Attribute to deliver +* Ingress Packet Data to Free List Buffers in order to allow for +* chipset performance optimizations between the Root Complex and +* Memory Controllers. (Messages to the associated Ingress Queue +* notifying new Packet Placement in the Free List Buffers will be +* sent without the Relaxed Ordering Attribute, thus guaranteeing that +* all preceding PCIe Transaction Layer Packets will be processed +* first.) 
But some Root Complexes have various issues with Upstream +* Transaction Layer Packets with the Relaxed Ordering Attribute set. +* So we check our Root Complex to see if it's flagged with advice +* against using Relaxed Ordering. +*/ + root = pci_find_pcie_root_port(adapter->pdev); + if (root && (root->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING)) + adapter->flags |= ROOT_NO_RELAXED_ORDERING; + spin_lock_init(&adapter->stats_lock); spin_lock_init(&adapter->tid_release_lock); spin_lock_init(&adapter->win0_lock); diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c index f05f0d4..ac229a3 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c @@ -2571,6 +2571,7 @@ int t4_sge_alloc_rxq(struct adapter *adap, struct sge_rspq *iq, bool fwevtq, struct fw_iq_cmd c; struct sge *s = &adap->sge; struct port_info *pi = netdev_priv(dev); + int relaxed = !(adap->flags & ROOT_NO_RELAXED_ORDERING); /* Size needs to be multiple of 16, including status entry. */ iq->size = roundup(iq->size, 16); @@ -2624,8 +2625,8 @@ int t4_sge_alloc_rxq(struct adapter *adap, struct sge_rspq *iq, bool fwevtq, flsz = fl->size / 8 + s->stat_len / sizeof(struct tx_desc); c.iqns_to_fl0congen |= htonl(FW_IQ_CMD_FL0PACKEN_F | -FW_IQ_CMD_FL0FETCHRO_F | -FW_IQ_CMD_FL0DATARO_F | +FW_IQ_CMD_FL0FETCHRO_V(relaxed) | +FW_IQ_CMD_FL0DATARO_V(relaxed) | FW_IQ_CMD_FL0PADEN_F); if (cong >= 0) c.iqns_to_fl0congen |= -- 1.9.1
Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER
| From: Lucas Stach | Sent: Friday, April 28, 2017 1:51 AM | | Am Donnerstag, den 27.04.2017, 12:19 -0500 schrieb Bjorn Helgaas: | > | > | > I thought Relaxed Ordering was an optimization. Are there cases where | > it is actually required for correct behavior? | | Yes, at least the Tegra 2 TRM claims that RO needs to be enabled on the | device side for correct operation with the following language: | | "Tegra 2 requires relaxed ordering for responses to downstream requests | (responses can pass writes). It is possible in some circumstances for PCIe | transfers from an external bus masters (i.e. upstream transfers) to become | blocked by a downstream read or non-posted write. The responses to these | downstream requests are blocked by upstream posted writes only when PCIe | strict ordering is imposed. It is therefore necessary to never impose strict | ordering that would block a response to a downstream NPW/read request and | always set the relaxed ordering bit to 1. Only devices that are capable of | relaxed ordering may be used with Tegra 2 devices." (woof) Reading through the above paragraph is difficult because the author seems to shift language and terminology mid-sentence and isn't following standard PCI terminology conventions. The Root Complex is "Upstream", a non-Root Complex Node in the PCIe Fabric is "Downstream", Requests that a Downstream Device (End Point) sends to the Root Complex are called "Upstream Requests", and responses that the Root Complex sends to a Device are called "Downstream Responses" (or, even more pedantically, "Responses sent Downstream for an earlier Upstream Request"). Because a Root Complex is Upstream but sends its Requests Downstream, while Downstream Devices send their Requests Upstream, it's very important that we use exceedingly precise language. 
So, it ~sounds like~ the nVidia Tegra 2 document is talking about the need for Downstream Devices to echo the Relaxed Ordering Attribute in their Responses directed Upstream to Requests sent Downstream from the Root Complex. Moreover, there's code in drivers/pci/host/pci-tegra.c: tegra_pcie_relax_enable() which appears to set the PCIe Capability Device Control[Enable Relaxed Ordering] bit on all PCIe Fabric Nodes. If my reading of the intent of the nVidia document is correct -- and that's a Big If because of the extremely imprecise language used -- that means that the tegra_pcie_relax_enable() is completely bogus. The PCIe 3.0 Specification states that Responses MUST reflect the Relaxed Ordering and No Snoop Attributes of the Requests for which they are responding. Section 2.2.9 of PCI Express(r) Base Specification Revision 3.0 November 10, 2010: "Completion headers must supply the same values for the Attribute as were supplied in the header of the corresponding Request, except as explicitly allowed when IDO is used." And, specifically, the PCIe Capability Device Control[Enable Relaxed Ordering] bit _only_ affects the ability of that Device to originate Transaction Layer Packet Requests with the Relaxed Ordering Attribute set. Thus, tegra_pcie_relax_enable() setting those bits on all the Downstream Devices (and intervening Bridges) does not _cause_ those Devices to generate Requests with Relaxed Ordering set. And, if the Devices are PCIe 3.0 compliant, it also doesn't affect the Responses that they send back Upstream to the Root Complex. I apologize for the incredibly detailed nature of these responses, but it's very easy for people new to PCIe to get these things wrong and/or misinterpret the PCIe Specifications. Casey
Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER
| From: Bjorn Helgaas | Sent: Thursday, April 27, 2017 10:19 AM | | Are you hinting that the PCI core or arch code could actually *enable* | Relaxed Ordering without the driver doing anything? Is it safe to do that? | Is there such a thing as a device that is capable of using RO, but where the | driver must be aware of it being enabled, so it programs the device | appropriately? I forgot to reply to this portion of Bjorn's email. The PCI Configuration Space PCIe Capability Device Control[Enable Relaxed Ordering] bit governs enabling the _ability_ for the PCIe Device to send TLPs with the Relaxed Ordering Attribute set. It does not _cause_ RO to be set on TLPs. Doing that would almost certainly cause Data Corruption Bugs since you only want a subset of TLPs to have RO set. For instance, we typically use RO for Ingress Packet Data delivery but non-RO for messages notifying the Host that an Ingress Packet has been delivered. This ensures that the "Ingress Packet Delivered" non-RO TLP is processed _after_ any preceding RO TLPs delivering the actual Ingress Packet Data. In the above scenario, if one were to turn off Enable Relaxed Ordering via the PCIe Capability, then the on-chip PCIe engine would simply never send a TLP with the Relaxed Ordering Attribute set, regardless of any other chip programming. And finally, just to be absolutely clear, using Relaxed Ordering isn't an "Architecture Thing". It's a PCIe Fabric End Point Thing. Many End Points simply ignore the Relaxed Ordering Attribute (except to reflect it back in Response TLPs). In this sense, Relaxed Ordering simply provides potentially useful optimization information to the PCIe End Point. Casey
Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER
Thanks for adding me to the Cc list Bjorn. Hopefully my message will make it out to the netdev and linux-pci lists -- I'm not currently subscribed to them. I've explicitly marked this message to be sent in plain text but modern email agents suck with respect to this. (sigh) I have officially become a curmudgeon. So, officially, Relaxed Ordering should be a Semantic Noop as far as PCIe transfers are concerned, as long as you don't care what order the PCIe Transaction Layer Packets are processed in by the target PCIe Fabric End Point. Basically, if you have some number of back-to-back PCIe TLPs between two Fabric End Points {A} -> {B} which have the Relaxed Ordering Attribute set, the End Point {B} receiving these RO TLPs may process them in any order it likes. When a TLP without Relaxed Ordering is sent {A} -> {B}, all preceding TLPs with Relaxed Ordering set must be processed by {B} prior to processing the TLP without Relaxed Ordering set. In this sense, a TLP without Relaxed Ordering set is something akin to a "memory barrier". All of this is covered in Section 2.4.1 of the PCIe 3.0 Specification (PCI Express(r) Base Specification Revision 3.0 November 10, 2010). The advantage of using Relaxed Ordering (which is heavily used when sending data to Graphics Cards as I understand it), is that the PCIe Endpoint can potentially optimize the processing order of RO TLPs with things like a local multi-channel Memory Controller in order to achieve the highest transfer bandwidth possible. However, we have discovered at least two PCIe 3.0 Root Complex implementations which have problems with TLPs directed at them with the Relaxed Ordering Attribute set and I'm in the process of working up a Linux Kernel PCI "Quirk" to allow those PCIe End Points to be marked as "not being advisable to send RO TLPs to". These problems range from "mere" Performance Problems to outright Data Corruption. I'm working with the vendors of these ... 
"problematic" Root Complex implementations and hope to have this patch submitted to the linux-pci list by tomorrow. By the way, it's important to note that just because, say, a Root Complex has problems with RO TLPs directed at it, that doesn't mean that you want to avoid all use of Relaxed Ordering within the PCIe Fabric. For instance, with the vendor whose Root Complex has a Performance Problem with RO TLPs directed at it, it's perfectly reasonable -- and desired -- to use Relaxed Ordering in Peer-to-Peer traffic. Say for instance, with an NVMe <-> Ethernet application. Casey
Re: [RFC PATCH v2] net: ethtool: add support for forward error correction modes
Vidya and I have been in communication on how to support Forward Error Correction Management on Links. My own thoughts are that we should consider using the new Get/Set Link Ksettings API because FEC is a low-level Link parameter which is either negotiated at the same time as Link Speed if Auto-Negotiation is in use, or must be set identically on both Link Peers if AN is not in use. I'm not convinced that FEC Management should use a separate ethtool/Kernel Driver API a la Pause Frame settings. I'm hoping to meet with Vidya and his collaborators at Cumulus Networks in the very near term to hash these issues out. Casey
Re: [RFC PATCH net-next] net: ethtool: add support for forward error correction modes
I'm attempting to start the work necessary to implement Vidya's proposed new ethtool interface to manage Forward Error Correction settings on a link. I'm confused by the ethtool FEC API and the degree/type of control it offers. At the top of the patch we have: Encoding: Types of encoding Off: Turning off any encoding RS : enforcing RS-FEC encoding on supported speeds BaseR : enforcing Base R encoding on supported speeds Auto : Default FEC settings for drivers, and would represent asking the hardware to essentially go into a best effort mode. but then later on we have: +struct ethtool_fecparam { + __u32 cmd; + __u32 autoneg; + /* bitmask of FEC modes */ + __u32 fec; + __u32 reserved; +}; ... +enum ethtool_fec_config_bits { + ETHTOOL_FEC_NONE_BIT, + ETHTOOL_FEC_AUTO_BIT, + ETHTOOL_FEC_OFF_BIT, + ETHTOOL_FEC_RS_BIT, + ETHTOOL_FEC_BASER_BIT, +}; ... + ETHTOOL_LINK_MODE_FEC_NONE_BIT = 47, + ETHTOOL_LINK_MODE_FEC_RS_BIT= 48, + ETHTOOL_LINK_MODE_FEC_BASER_BIT = 49, The last ethtool Link Mode bits seem to imply a separable FEC on/off with individual control for RS and BASER. How would the "Auto" from the top be encoded within these Link Mode bits? And I don't see any reference to the ethtool_fec_config_bits in the kernel or ethtool patches so I'm not sure what they're supposed to reference. Can you clarify the above? I.e. can you offer a small template example of what a driver implementation might look like interpreting the incoming Link Mode Bits? And do we expect that there will be new FECs in the future? Casey
Re: [RFC PATCH net-next] net: ethtool: add support for forward error correction modes
And by the way, we currently have two ethtool APIs which pump in an Auto-Negotiation indication -- set_link_ksettings() and set_pauseparam(). Now we're talking about adding a third, set_fecparam(). Are all of the calls to these three APIs supposed to agree on the concept of Auto-Negotiations? I.e. what's it mean if set_link_ksettings() gets called with link_ksettings->base.autoneg == AUTONEG_ENABLE but set_pauseparam() gets called with epause->autoneg == AUTONEG_DISABLE? And now adding set_fecparam() into the system with a similar ability to specify the state of Auto-Negotiation is even more confusing. Casey
[RFC PATCH net-next] net: ethtool: add support for forward error correction modes
N.B. Sorry I'm not able to respond to the original message since I wasn't subscribed to netdev when it was sent a couple of weeks ago.

This feature is something that Chelsio's cxgb4 driver needs. As we've tested our adapters against a number of switches, we've discovered a few which use varying defaults for FEC. And when Auto-Negotiation isn't used (or even possible, as with Optical Links), we need to be able to control turning FEC on/off.

For our part, we default to FEC Off for Optical Transceivers. For Copper, we read the cable's EEPROM to determine how to default FEC. For some switches this works, but for at least one switch which enables FEC for Optical Transceivers, it doesn't. For that switch we had to hard-wire FEC on. Obviously that's not a good solution and we need an administrative interface so the system administrator can configure our adapter to use the appropriate FEC setting to match that of the switch.

So this is basically a long-winded ACK of Vidya's patch and we would immediately implement this new ethtool API as soon as it's available.

Casey
Re: [PATCH net-next] cxgb4: Reduce resource allocation in kdump kernel
Looks good to me. Of course I came up with those changes so maybe we should get another reviewer? :-) Also, don't forget to mention "Bug #29998" in the commit message ...

Casey

From: Hariprasad Shenai
Sent: Friday, June 3, 2016 10:35:45 AM
To: da...@davemloft.net
Cc: netdev@vger.kernel.org; Casey Leedom; Nirranjan Kirubaharan; Hariprasad S
Subject: [PATCH net-next] cxgb4: Reduce resource allocation in kdump kernel

When is_kdump_kernel() is true, reduce our memory footprint by only using a single "Queue Set" and Forcing Master so we can reinitialize the Firmware/Chip.

Signed-off-by: Hariprasad Shenai
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 477db477b133..5317187d0073 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -64,6 +64,7 @@
 #include
 #include
 #include
+#include

 #include "cxgb4.h"
 #include "t4_regs.h"
@@ -3735,7 +3736,8 @@ static int adap_init0(struct adapter *adap)
 		return ret;

 	/* Contact FW, advertising Master capability */
-	ret = t4_fw_hello(adap, adap->mbox, adap->mbox, MASTER_MAY, &state);
+	ret = t4_fw_hello(adap, adap->mbox, adap->mbox,
+			  is_kdump_kernel() ? MASTER_MUST : MASTER_MAY, &state);
 	if (ret < 0) {
 		dev_err(adap->pdev_dev, "could not connect to FW, error %d\n",
 			ret);
@@ -4366,6 +4368,14 @@ static void cfg_queues(struct adapter *adap)
 	if (q10g > netif_get_num_default_rss_queues())
 		q10g = netif_get_num_default_rss_queues();

+	/* Reduce memory usage in kdump environment by using only one queue
+	 * and disable all offload.
+	 */
+	if (is_kdump_kernel()) {
+		q10g = 1;
+		adap->params.offload = 0;
+	}
+
 	for_each_port(adap, i) {
 		struct port_info *pi = adap2pinfo(adap, i);
--
2.3.4
Re: [PATCH pci] pci: Add helper function to set VPD size
| From: Steve Wise
| Sent: Friday, April 15, 2016 9:12 AM
|
| On 4/14/2016 1:35 PM, Steve Wise wrote:
| >> The fix is to add a PCI helper function to set the VPD size, so the
| >> driver can explicitly set the exact size of the VPD.
| >>
| >> Fixes commit 104daa71b396 ("PCI: Determine actual VPD size on first
| >> access")
| >>
| >> Signed-off-by: Casey Leedom
| >> Signed-off-by: Hariprasad Shenai
| > Looks good!
|
| Hey Bjorn,
|
| Will this make the next 4.6-rc?

Without a fix of some sort, cxgb4 is completely dead in the water as of the 4.6 series. I'm also worried about the bnx2x driver, which seems to be doing something similar to our cxgb4 driver. (I.e. there isn't just a single VPD Data Structure hosted at the beginning of the VPD Space.)

Casey
Re: [PATCH net-next 1/4] cxgb4: Use mask & shift while returning vlan_prio
> On Dec 16, 2015, at 4:07 PM, David Miller wrote:
>
> From: Hariprasad Shenai
> Date: Wed, 16 Dec 2015 13:16:25 +0530
>
>> @@ -66,7 +66,7 @@ struct l2t_data {
>>
>> static inline unsigned int vlan_prio(const struct l2t_entry *e)
>> {
>> -	return e->vlan >> 13;
>> +	return (e->vlan & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
>
> e->vlan is a u16, the vlan priority is the top 3 bits of the 16-bit
> value, and finally the right shift will be non-signed.
>
> Therefore this change is absolutely not necessary.
>
> Please remove this patch from the series and resend.

I assume that you only meant that the masking portion is unnecessary. Doing the shift with the symbolic constant VLAN_PRIO_SHIFT instead of the literal constant “13” is still a reasonable change. The masking was almost certainly from me because once one uses the symbolic constants, we’re not supposed to “know” about the internal structure of the operation. Modern compilers are of course free to optimize away the mask, etc.

Casey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCHv2 net-next 2/4] cxgb4: For T4, don't read the Firmware Mailbox Control register
Hari, I think you missed the corresponding change that's needed for the const char *owner[] array. You need to add an "" entry so the index of "4" makes sense.

Casey

From: Hariprasad Shenai [haripra...@chelsio.com]
Sent: Wednesday, September 30, 2015 8:03 AM
To: netdev@vger.kernel.org
Cc: da...@davemloft.net; Casey Leedom; Nirranjan Kirubaharan; Hariprasad S
Subject: [PATCHv2 net-next 2/4] cxgb4: For T4, don't read the Firmware Mailbox Control register

T4 doesn't have the Shadow copy of the register which we can read without side effect, so don't read the mailbox control register for T4 adapters.

Signed-off-by: Hariprasad Shenai
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 0a87a32..8001619 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -1134,12 +1134,20 @@ static int mbox_show(struct seq_file *seq, void *v)
 	unsigned int mbox = (uintptr_t)seq->private & 7;
 	struct adapter *adap = seq->private - mbox;
 	void __iomem *addr = adap->regs + PF_REG(mbox, CIM_PF_MAILBOX_DATA_A);
-	unsigned int ctrl_reg = (is_t4(adap->params.chip)
-				 ? CIM_PF_MAILBOX_CTRL_A
-				 : CIM_PF_MAILBOX_CTRL_SHADOW_COPY_A);
-	void __iomem *ctrl = adap->regs + PF_REG(mbox, ctrl_reg);

-	i = MBOWNER_G(readl(ctrl));
+	/* For T4 we don't have a shadow copy of the Mailbox Control register.
+	 * And since reading that real register causes a side effect of
+	 * granting ownership, we're best off simply not reading it at all.
+	 */
+	if (is_t4(adap->params.chip)) {
+		i = 4; /* index of "" */
+	} else {
+		unsigned int ctrl_reg = CIM_PF_MAILBOX_CTRL_SHADOW_COPY_A;
+		void __iomem *ctrl = adap->regs + PF_REG(mbox, ctrl_reg);
+
+		i = MBOWNER_G(readl(ctrl));
+	}
+
 	seq_printf(seq, "mailbox owned by %s\n\n", owner[i]);

 	for (i = 0; i < MBOX_LEN; i += 8)
--
2.3.4
RE: Request for advice on where to put Root Complex "fix up" code for downstream device
Thanks Bjorn and no issues at all about the delay -- I definitely understand how busy we all are. I'll go ahead and submit a PCI Quirk. As part of this, would you like me to also commit a new PCI-E routine to find the Root Complex Port for a given PCI Device? It seems like it might prove useful in the future. Otherwise I'll just incorporate that loop in my PCI Quirk.

Casey

From: Bjorn Helgaas [bhelg...@google.com]
Sent: Friday, May 29, 2015 9:20 AM
To: Casey Leedom
Cc: netdev@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: Request for advice on where to put Root Complex "fix up" code for downstream device

Hi Casey,

Sorry, this one slipped through and I forgot to respond earlier.

On Thu, May 07, 2015 at 11:31:58PM +, Casey Leedom wrote:
> | From: Bjorn Helgaas [bhelg...@google.com]
> | Sent: Thursday, May 07, 2015 4:04 PM
> |
> | There are a lot of fixups in drivers/pci/quirks.c. For things that have to
> | be worked around either before a driver claims the device or if there is no
> | driver at all, the fixup *has* to go in drivers/pci/quirks.c
> |
> | But for things like this, where the problem can only occur after a driver
> | claims the device, I think it makes more sense to put the fixup in the
> | driver itself. The only wrinkle here is that the fixup has to be done on a
> | separate device, not the device claimed by the driver. But I think it
> | probably still makes sense to put this fixup in the driver.
>
> Okay, the example code that I provided (still quoted below) was indeed
> done as a fix within the cxgb4 Network Driver. I've also worked up a
> version as a PCI Quirk but if you and David Miller agree that the fixup
> code should go into cxgb4, I'm comfortable with that. I can also provide
> the example PCI Quirk code I worked up if you like.
>
> One complication to doing this in cxgb4 is that it attaches to Physical
> Function 4 of our T5 chip.
> Meanwhile, a completely separate storage
> driver, csiostor, connects to PF5 and PF6 and there's no
> requirement at all that cxgb4 be loaded. So if we go down the road of
> putting the fixup code in the cxgb4 driver, we'll also need to duplicate
> that code in the csiostor driver.

Sounds simpler to just put the quirk in drivers/pci/quirks.c.

> | > +static void clear_root_complex_tlp_attributes(struct pci_dev *pdev)
> | > +{
> | > +	struct pci_bus *bus = pdev->bus;
> | > +	struct pci_dev *highest_pcie_bridge = NULL;
> | > +
> | > +	while (bus) {
> | > +		struct pci_dev *bridge = bus->self;
> | > +
> | > +		if (!bridge || !bridge->pcie_cap)
> | > +			break;
> | > +		highest_pcie_bridge = bridge;
> | > +		bus = bus->parent;
> | > +	}
> |
> | Can you use pci_upstream_bridge() here? There are a couple places where we
> | want to find the Root Port, so we might factor that out someday. It'll be
> | easier to find all those places if they use pci_upstream_bridge().
>
> It looks like pci_upstream_bridge() just traverses one link upstream toward the
> Root Complex? Or am I misunderstanding that function?

No, you're right. I was just trying to suggest using pci_upstream_bridge() instead of bus->parent->self in your loop. It wouldn't replace the loop completely.

Bjorn
RE: Request for advice on where to put Root Complex "fix up" code for downstream device
| From: Casey Leedom [lee...@chelsio.com]
| Sent: Thursday, May 07, 2015 4:31 PM
|
| | From: Bjorn Helgaas [bhelg...@google.com]
| | Sent: Thursday, May 07, 2015 4:04 PM
| |
| | There are a lot of fixups in drivers/pci/quirks.c. For things that have to
| | be worked around either before a driver claims the device or if there is no
| | driver at all, the fixup *has* to go in drivers/pci/quirks.c
| |
| | But for things like this, where the problem can only occur after a driver
| | claims the device, I think it makes more sense to put the fixup in the
| | driver itself. The only wrinkle here is that the fixup has to be done on a
| | separate device, not the device claimed by the driver. But I think it
| | probably still makes sense to put this fixup in the driver.
| ...
| One complication to doing this in cxgb4 is that it attaches to Physical
| Function 4 of our T5 chip. Meanwhile, a completely separate storage
| driver, csiostor, connects to PF5 and PF6 and there's no
| requirement at all that cxgb4 be loaded. So if we go down the road of
| putting the fixup code in the cxgb4 driver, we'll also need to duplicate
| that code in the csiostor driver.

I never heard back on this issue of needing to put the Root Complex "fixup" code in two different drivers -- cxgb4 and csiostor -- if we don't go down the path of using a PCI Quirk. I'm happy doing either and have verified both solutions locally. I'd just like to get a judgement call on this. It comes down to adding ~30 lines to

    drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
    drivers/scsi/csiostor/csio_init.c

or ~30 lines to

    drivers/pci/quirks.c

| | Can you include a pointer to the relevant part of the spec?
|
| Sure:
|
| 2.2.9. Completion Rules
| ...
| Completion headers must supply the same values for
| the Attribute as were supplied in the header[20] of
| the corresponding Request, except as explicitly
| allowed when IDO is used (see Section 2.2.6.4).
| ...
| 2.3.2. Completion Handling Rules
| ...
| If a received Completion matches the Transaction ID
| of an outstanding Request, but in some other way
| does not match the corresponding Request (e.g., a
| problem with Attributes, Traffic Class, Byte Count,
| Lower Address, etc), it is strongly recommended for
| the Receiver to handle the Completion as a Malformed
| TLP. However, if the Completion is otherwise properly
| formed, it is permitted[22] for the Receiver to
| handle the Completion as an Unexpected Completion.

| | Can you use pci_upstream_bridge() here? There are a couple places where we
| | want to find the Root Port, so we might factor that out someday. It'll be
| | easier to find all those places if they use pci_upstream_bridge().
|
| It looks like pci_upstream_bridge() just traverses one link upstream toward the
| Root Complex? Or am I misunderstanding that function?
Re: [PATCH net-next 1/3] cxgb4: Add support for loopback between VI of same port
Okay, really: I’m not arguing for module parameters. I’m agreeing with you 100%. I’m not trying to be snarky or back you into admitting that there are some times when a module parameter is needed. I’m not being sneaky, etc. I’m really just asking a mechanism question. It is, on the other hand, quite likely that I’m being dumb. I’ll absolutely grant you that. So let me turn this around and ask: What command would you envision that I use in order to tell a driver to use a different TX routine for an interface?

    ethtool --tx-routine eth{n} loopback

Or? Sorry for being so dense. We really are trying to live within the rules but we’re struggling to figure out what patch we should submit.

Casey

> On May 22, 2015, at 11:01 AM, David Miller wrote:
>
> From: Casey Leedom
> Date: Fri, 22 May 2015 16:49:03 +
>
>> Oh I definitely understand that and agree. Unfortunately I've
>> inherited a driver architecture that makes that ... "difficult"
>> for many operations ... And I have an internal bug filed
>> against me to fix those particular issues.
>>
>> However, that doesn't answer at least one of my questions
>> which was how do I pass information into the driver _before_
>> it does the device probe?
>
> I did answer the question, I said that if you fix the real actual
> core problem then you won't have this need to begin with.
>
> I thought I made that perfectly clear.
>
> I really am not going to entertain arguments of the form "it's
> too hard to implement this correctly so I'm going to try
> and slam a module parameter into the driver to fix things".
RE: [PATCH net-next 1/3] cxgb4: Add support for loopback between VI of same port
| From: David Miller [da...@davemloft.net]
| Sent: Thursday, May 21, 2015 12:30 PM
|
| The prevailing assumption is that it's OK to have configuration
| settings that can't be undone.
|
| And that's bogus from the beginning.

Oh I definitely understand that and agree. Unfortunately I've inherited a driver architecture that makes that ... "difficult" for many operations ... And I have an internal bug filed against me to fix those particular issues.

However, that doesn't answer at least one of my questions, which was how do I pass information into the driver _before_ it does the device probe? In this case, telling it to _not_ attempt to attach to the chip firmware in order to debug, load firmware, etc.? And, I still need to know what mechanism we need to use to tell the driver to use one kind of transmit functionality or another. [[And in this case, we _can_ switch back and forth at will.]]

Casey
RE: [PATCH net-next 1/3] cxgb4: Add support for loopback between VI of same port
| From: David Miller [da...@davemloft.net]
| Sent: Sunday, May 17, 2015 8:46 PM
|
| > From: Hariprasad Shenai
| > Date: Sun, 17 May 2015 16:15:21 +0530
| >
| > We now have a new cxgb4 module parameter "tx_vf" that when set
| > to a non-zero value, causes cxgb4 to use the "Virtual Machine" version
| > of the firmware Ethernet TX Packet Work Request (FW_ETH_TX_PKT_VM_WR)
| > instead of the "normal" default non-VM Work Request (FW_ETH_TX_PKT_WR).
|
| Sorry, module parameters are verboten.
|
| Especially for settings like this which are guaranteed to be
| interesting for other NIC drivers, not just yours.
|
| I'm really tired of explaining this to driver authors. Just
| don't even try to push a module parameter past me.

I definitely understand the issue of wanting to avoid randomly different module parameters in various drivers which do similar things. What we're looking for is a list of the acceptable ways of doing things — especially when they don't fit current ethtool/ioctl() mechanisms. A couple of specific examples:

1. We need to load the driver and tell it _not_ to attempt to contact firmware on the adapter. This is typically used to load firmware on a brand new adapter or debug a problem with the adapter without changing its existing state. This need presents an awkward problem because we need to have the driver know from the very start that it shouldn't try to communicate with the firmware, while our normal PCI probe() does in fact contact the firmware as part of the probe ...

2. This patch: We have the ability to use two fundamentally different TX Work Requests — one which causes the adapter firmware to check for local loopback delivery targets and one which doesn't. Unlike the first example, this can be specified long after the adapter probe operation, but it's unclear if there's any current ethtool/ioctl() which can be used for this. Should we suggest a new ethtool operation like "TX Method"?
More generally, is there a document somewhere which already covers the suggested mechanisms for passing parameter information into network drivers for different cases so we don't send in patch requests which waste people's time?

Casey