Re: [Problem] broadcom tg3 network driver disconnects under high load
On Wed, 2015-04-29 at 13:34 -0400, Toan Pham wrote:
> Prashant,
>
> Unfortunately, I ran the same test 3 times with the new patch and all
> of them failed. The attached file is the dmesg log, captured after the
> watchdog had timed out and tried to restart the NIC.
>
> Feel free to let me know if you would like to try anything else.
>
> Thanks
> Toan

Thanks for the result, so this looks to be a different problem. Sanjeev is
setting up a repro environment similar to yours to capture a pcie trace.
Will keep you posted.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Problem] broadcom tg3 network driver disconnects under high load
On Tue, 2015-04-28 at 16:06 -0400, Toan Pham wrote:
> > We were able to reproduce this issue internally only with iommu
> > enabled.
>
> My last test to collect lspci-info took about 5 hours over a gigabit
> network for the bug to show up. My setup was running 3 tx scp
> sessions, each transferring a 1GB file outbound, and 1 rx scp session
> copying another 1GB file inbound. In a production environment with
> the BCM5762 NIC running as a server, I observed that the failure rate
> is about 1.65/week. Please perform a similar test with iommu
> disabled, and leave it running for days if need be.

Sure, will try.

> > Meanwhile can you try the attached patch and see if you are able to
> > reproduce the problem ?
>
> No problem. I will apply the patch to kernel 4.0 and report back the
> result. Let me know if you need me to turn on any debug options like
> pcie trace, dev debug etc.
>
> Thanks

If you can collect a pcie trace that would be great. Thanks
Re: [Problem] broadcom tg3 network driver disconnects under high load
On Tue, 2015-04-28 at 11:11 -0700, Michael Chan wrote:
> On Mon, 2015-04-27 at 22:10 +, Toan Pham wrote:
> > Michael,
> >
> > Please see the attached files. BTW, I have also tested this bug on
> > at least 8 different HP 705 PCs with the 5762 NIC, so it is probably
> > not a manufacturer defect. In addition, I can never replicate the
> > same issue on the older chipset, BCM5761, which can be found on the
> > HP model 6005. I hope this information is helpful.
> >
> > Thanks
>
> Thanks for the data. The memory enable bit is cleared and there are
> some correctable error bits set. My colleague Sanjeev will look into
> this. Do you have PCIE Advanced Error Reporting (CONFIG_PCIEAER)
> enabled in your kernel?

The 5762 NIC has a bug that makes the chip detect a false 4G boundary
crossing, which then stalls the chip. With the data you have provided it
is not clear whether we are hitting this problem or not. Register 0x4c04
bit 5 would be set when this condition occurs, but since the memory
enable bit is clear, the register dump collected before the chip was
reset contains only garbage.

We were able to reproduce this issue internally only with iommu enabled.
In your dmesg logs I do not see iommu enabled, so unless we have a pcie
trace we cannot confirm that this HW bug is indeed the problem you are
seeing. Meanwhile can you try the attached patch and see if you are able
to reproduce the problem ? This patch restricts every DMA address given
to the chip to 31 bits.

Toan, thanks for bringing this to our notice. Also, please cc the
maintainers so that mails are not missed.
From 488fd699985f73d361d04d4788de48833c6442ca Mon Sep 17 00:00:00 2001
From: Prashant Sreedharan <prash...@broadcom.com>
Date: Tue, 28 Apr 2015 11:32:56 -0700
Subject: [PATCH] tg3: Restrict DMA address to 31 bits for 5762 device

---
 drivers/net/ethernet/broadcom/tg3.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 069952f..e980c96 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -17707,6 +17707,8 @@ static int tg3_init_one(struct pci_dev *pdev,
 	 */
 	if (tg3_flag(tp, IS_5788))
 		persist_dma_mask = dma_mask = DMA_BIT_MASK(32);
+	else if (tg3_asic_rev(tp) == ASIC_REV_5762)
+		persist_dma_mask = dma_mask = DMA_BIT_MASK(31);
 	else if (tg3_flag(tp, 40BIT_DMA_BUG)) {
 		persist_dma_mask = dma_mask = DMA_BIT_MASK(40);
 #ifdef CONFIG_HIGHMEM
@@ -17736,6 +17738,17 @@ static int tg3_init_one(struct pci_dev *pdev,
 				"No usable DMA configuration, aborting\n");
 			goto err_out_apeunmap;
 		}
+	} else {
+		err = pci_set_dma_mask(pdev, dma_mask);
+		if (!err) {
+			err = pci_set_consistent_dma_mask(pdev,
+							  persist_dma_mask);
+		}
+		if (err) {
+			dev_err(&pdev->dev,
+				"No usable DMA configuration, aborting\n");
+			goto err_out_apeunmap;
+		}
 	}
 
 	tg3_init_bufmgr_config(tp);
-- 
1.7.1
Re: [PATCH] net/tg3: Release IRQs on permanent error
On Fri, 2015-04-24 at 15:22 +1000, Gavin Shan wrote:
> When having permanent EEH error, the PCI device will be removed
> from the system. For this case, we shouldn't set pcierr_recovery
> to true wrongly, which blocks the driver to release the allocated
> interrupts and their handlers. Eventually, we can't disable MSI
> or MSIx successfully because of the MSI or MSIx interrupts still
> have associated interrupt actions, which is turned into following
> stack dump.
>
> Oops: Exception in kernel mode, sig: 5 [#1]
> :
> [c03b76a8] .free_msi_irqs+0x80/0x1a0 (unreliable)
> [c039f388] .pci_remove_bus_device+0x98/0x110
> [c00790f4] .pcibios_remove_pci_devices+0x9c/0x128
> [c0077b98] .handle_eeh_events+0x2d8/0x4b0
> [c00782d0] .eeh_event_handler+0x130/0x1c0
> [c0022bd4] .kernel_thread+0x54/0x70
>
> Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com>
> ---
>  drivers/net/ethernet/broadcom/tg3.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 1270b18..069952f 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -18129,7 +18129,9 @@ static pci_ers_result_t tg3_io_error_detected(struct pci_dev *pdev,
>  
>  	rtnl_lock();
>  
> -	tp->pcierr_recovery = true;
> +	/* We needn't recover from permanent error */
> +	if (state == pci_channel_io_frozen)
> +		tp->pcierr_recovery = true;
>  
>  	/* We probably don't have netdev yet */
>  	if (!netdev || !netif_running(netdev))

Acked-by: Prashant Sreedharan <prash...@broadcom.com>
Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more messages]
On Fri, 2015-04-17 at 15:12 -0400, David Miller wrote:
> From: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
> Date: Fri, 17 Apr 2015 15:04:48 -0400
>
> > From 9e417af099e3cee2b219ab28ffc1e96b0564b213 Mon Sep 17 00:00:00 2001
> > From: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
> > Date: Fri, 17 Apr 2015 14:55:47 -0400
> > Subject: [PATCH] config: Enable NEED_DMA_MAP_STATE when SWIOTLB is selected
> >
> > A huge number of NIC drivers use the DMA API, however if
> > compiled under 32-bit a very important part of the DMA API can
> > be omitted, leading to the drivers not working at all
> > (especially if used with 'swiotlb=force iommu=soft').
> >
> > As Prashant Sreedharan explains it: "the driver [tg3] uses
> > DEFINE_DMA_UNMAP_ADDR(), dma_unmap_addr_set() to keep a copy of
> > the dma mapping and dma_unmap_addr() to get the mapping value.
> > On most of the platforms this is a no-op, but ... with
> > iommu=soft and swiotlb=force this house keeping is required,
> > ... otherwise we pass 0 while calling pci_unmap_/pci_dma_sync_
> > instead of the DMA address."
> >
> > As such enable this even when using 32-bit kernels.
> >
> > Reported-by: Ian Jackson <ian.jack...@eu.citrix.com>
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
>
> Acked-by: David S. Miller <da...@davemloft.net>

Acked-by: Prashant Sreedharan <prash...@broadcom.com>
Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more messages]
On Thu, 2015-04-16 at 18:15 +0100, Ian Jackson wrote:
> Michael Chan writes ("Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more messages]"):
> > On Thu, 2015-04-16 at 09:24 -0300, casca...@linux.vnet.ibm.com wrote:
> > > Yes, this looks like the driver is not syncing the DMA buffers.
> >
> > Unmap is supposed to synchronize as well. For small rx packets
> > (< 256 bytes), we sync the DMA buffer before we copy the data to
> > another SKB. For larger packets, we unmap the DMA buffer. Do we
> > see the corruption in both cases?
>
> Yes, at least with swiotlb=force iommu=soft.

OK, this is what is causing the problem: the driver uses
DEFINE_DMA_UNMAP_ADDR() and dma_unmap_addr_set() to keep a copy of the
dma mapping, and dma_unmap_addr() to get the mapping value. On most
platforms this is a no-op, but it appears that with iommu=soft and
swiotlb=force this housekeeping is required. When I pass the correct
dma_addr instead of 0 while calling pci_unmap_/pci_dma_sync_ I don't see
the corruption.

I.e., if you set CONFIG_NEED_DMA_MAP_STATE=y in your kernel config you
should not see the problem. Can you confirm ?

Thanks