Re: [Problem] broadcom tg3 network driver disconnects under high load

2015-04-29 Thread Prashant Sreedharan
On Wed, 2015-04-29 at 13:34 -0400, Toan Pham wrote:
 Prashant,
 
 Unfortunately, I ran the same test 3 times with the new patch and all
 of them failed.
 Attached file is the dmesg log, after the Watchdog had timed out, and
 tried to restart the NIC.
 Feel free to let me know if you would like to try anything else.  Thanks
Toan, thanks for the result. This looks to be a different problem. Sanjeev
is setting up a repro environment similar to yours to capture a PCIe trace.
Will keep you posted.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Problem] broadcom tg3 network driver disconnects under high load

2015-04-28 Thread Prashant Sreedharan
On Tue, 2015-04-28 at 16:06 -0400, Toan Pham wrote:
  We were able to reproduce this issue internally only with iommu enabled.
 
 My last test to collect lspci-info took about 5 hours over a gigabit
 network for the bug to show up.  My setup was running 3 tx scp
 sessions, each transferring a 1GB file outbound, and 1 rx scp session
 copying another 1GB file inbound.  In a production environment with
 the BCM5762 NIC running as a server, I observed that the failure rate
 is about 1.65/week.  Please perform a similar test with iommu
 disabled, and leave it running for days if need be.

Sure, will try.
 
 
   Meanwhile can you try the attached patch and see if you are able to 
  reproduce the problem ?
 
 No problem.  I will apply the patch to kernel 4.0 and report back the
 result.  Let me know if you need me to turn on any debug options like
 pcie trace, dev debug etc  Thanks

If you can collect a PCIe trace, that would be great. Thanks




Re: [Problem] broadcom tg3 network driver disconnects under high load

2015-04-28 Thread Prashant Sreedharan
On Tue, 2015-04-28 at 11:11 -0700, Michael Chan wrote:
 On Mon, 2015-04-27 at 22:10 +, Toan Pham wrote: 
  Michael,
  
  
  Please see the attached files.
  
  BTW, I have also tested this bug on at least 8 different HP 705 PCs
  with the 5762 NIC, so it is probably not a manufacturer defect.  In
  addition, I can never replicate the same issue on the older chipset,
  BCM5761, which can be found on the HP model 6005.  I hope this
  information is helpful.  Thanks
 
 Thanks for the data.  The memory enable bit is cleared and there are
 some correctable error bits set.  My colleague Sanjeev will look into
 this.
 
 Do you have PCIE Advanced Error Reporting (CONFIG_PCIEAER) enabled in
 your kernel?
 

The 5762 NIC has a bug that causes the chip to detect a false 4G boundary
crossing and stall. From the data you have provided, it is not clear
whether we are hitting this problem. Register 0x4c04 bit 5 is set when
this condition occurs, but since the memory enable bit is clear, the
register dump collected before the chip was reset contains only garbage.
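
The two checks described above can be sketched in user-space C. This is only an illustration: the helper names are hypothetical and not part of tg3, and how the driver actually reads these values is not shown in this thread. It assumes the standard PCI command register layout, where bit 1 is Memory Space Enable, and takes bit 5 of register 0x4c04 from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 1 of the PCI command register is Memory Space Enable. */
#define SKETCH_PCI_COMMAND_MEMORY 0x2u
/* Bit 5 of register 0x4c04, per the description above. */
#define SKETCH_REG_4C04_BIT5 (1u << 5)

/* A register dump is only meaningful if memory decoding was still enabled. */
static bool sketch_dump_is_trustworthy(uint16_t pci_command)
{
	return (pci_command & SKETCH_PCI_COMMAND_MEMORY) != 0;
}

/* Returns true only when the dump is valid and bit 5 is set. */
static bool sketch_false_4g_stall_seen(uint16_t pci_command, uint32_t reg_4c04)
{
	if (!sketch_dump_is_trustworthy(pci_command))
		return false; /* inconclusive: register contents are garbage */
	return (reg_4c04 & SKETCH_REG_4C04_BIT5) != 0;
}
```

With the memory enable bit cleared, as in the reported dump, the second helper always returns false, which matches the "not clear" conclusion above.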

We were able to reproduce this issue internally only with iommu enabled.
In your dmesg logs I do not see iommu enabled. So unless we have a pcie
trace we cannot confirm if this HW bug is indeed the problem you are
seeing.

Meanwhile, can you try the attached patch and see if you are able to
reproduce the problem? This patch restricts all DMA addresses given
to the chip to 31 bits.
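
As a rough illustration of what a 31-bit restriction implies, here is a small user-space sketch. The mask macro has the same shape as the kernel's DMA_BIT_MASK(n) but is renamed, and both helper functions are hypothetical names introduced only for this example. It shows that a buffer that fits under a 31-bit mask lies entirely below 2 GiB, so it cannot sit next to a 4 GiB boundary; it does not model why the chip misdetects a crossing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK(n); renamed to mark it as a sketch. */
#define SKETCH_DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* True if the whole buffer [addr, addr + len) is addressable under mask. */
static bool sketch_fits_mask(uint64_t addr, uint64_t len, uint64_t mask)
{
	return len != 0 && addr <= mask && len - 1 <= mask - addr;
}

/* True if the buffer's first and last byte lie in different 4 GiB windows. */
static bool sketch_crosses_4g(uint64_t addr, uint64_t len)
{
	return len != 0 && (addr >> 32) != ((addr + len - 1) >> 32);
}
```

For example, an 8 KiB buffer at 0xfffff000 crosses a 4 GiB boundary and is rejected by the 31-bit mask, while a 4 KiB buffer at 0x7ffff000 fits the mask and does not cross.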

Toan, thanks for bringing this to our notice. Also, please cc the
maintainers so that mails are not missed.
From 488fd699985f73d361d04d4788de48833c6442ca Mon Sep 17 00:00:00 2001
From: Prashant Sreedharan prash...@broadcom.com
Date: Tue, 28 Apr 2015 11:32:56 -0700
Subject: [PATCH] tg3: Restrict DMA address to 31 bits for 5762 device

---
 drivers/net/ethernet/broadcom/tg3.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 069952f..e980c96 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -17707,6 +17707,8 @@ static int tg3_init_one(struct pci_dev *pdev,
 	 */
 	if (tg3_flag(tp, IS_5788))
 		persist_dma_mask = dma_mask = DMA_BIT_MASK(32);
+	else if (tg3_asic_rev(tp) == ASIC_REV_5762)
+		persist_dma_mask = dma_mask = DMA_BIT_MASK(31);
 	else if (tg3_flag(tp, 40BIT_DMA_BUG)) {
 		persist_dma_mask = dma_mask = DMA_BIT_MASK(40);
 #ifdef CONFIG_HIGHMEM
@@ -17736,6 +17738,17 @@ static int tg3_init_one(struct pci_dev *pdev,
 				"No usable DMA configuration, aborting\n");
 			goto err_out_apeunmap;
 		}
+	} else {
+		err = pci_set_dma_mask(pdev, dma_mask);
+		if (!err) {
+			err = pci_set_consistent_dma_mask(pdev,
+			  persist_dma_mask);
+		}
+		if (err) {
+			dev_err(&pdev->dev,
+				"No usable DMA configuration, aborting\n");
+			goto err_out_apeunmap;
+		}
 	}
 
 	tg3_init_bufmgr_config(tp);
-- 
1.7.1



Re: [PATCH] net/tg3: Release IRQs on permanent error

2015-04-24 Thread Prashant Sreedharan
On Fri, 2015-04-24 at 15:22 +1000, Gavin Shan wrote:
 When the device hits a permanent EEH error, the PCI device is removed
 from the system. In that case we shouldn't set pcierr_recovery to
 true, because doing so prevents the driver from releasing the allocated
 interrupts and their handlers. As a result, MSI or MSI-X cannot be
 disabled, because the MSI or MSI-X interrupts still have associated
 interrupt actions, which produces the following stack dump.
 
 Oops: Exception in kernel mode, sig: 5 [#1]
 :
 [c03b76a8] .free_msi_irqs+0x80/0x1a0 (unreliable)
 [c039f388] .pci_remove_bus_device+0x98/0x110
 [c00790f4] .pcibios_remove_pci_devices+0x9c/0x128
 [c0077b98] .handle_eeh_events+0x2d8/0x4b0
 [c00782d0] .eeh_event_handler+0x130/0x1c0
 [c0022bd4] .kernel_thread+0x54/0x70
 
 Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
 ---
  drivers/net/ethernet/broadcom/tg3.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
 index 1270b18..069952f 100644
 --- a/drivers/net/ethernet/broadcom/tg3.c
 +++ b/drivers/net/ethernet/broadcom/tg3.c
 @@ -18129,7 +18129,9 @@ static pci_ers_result_t tg3_io_error_detected(struct pci_dev *pdev,
  
   rtnl_lock();
  
 - tp->pcierr_recovery = true;
 + /* We needn't recover from permanent error */
 + if (state == pci_channel_io_frozen)
 +         tp->pcierr_recovery = true;
  
   /* We probably don't have netdev yet */
   if (!netdev || !netif_running(netdev))

Acked-by: Prashant Sreedharan prash...@broadcom.com
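
The condition introduced by the patch can be read as a small decision function. The sketch below is a stand-alone illustration, not driver code: the enum and function names are hypothetical stand-ins for the kernel's pci_channel_state_t values (normal, frozen, permanent failure), and the real handler does considerably more than set a flag.

```c
#include <stdbool.h>

/* Stand-in for the kernel's pci_channel_state_t values. */
enum sketch_channel_state {
	SKETCH_IO_NORMAL = 1,
	SKETCH_IO_FROZEN = 2,
	SKETCH_IO_PERM_FAILURE = 3,
};

/*
 * Mirrors the patched condition: only a frozen channel is marked as
 * being in recovery. On permanent failure the flag stays false, so
 * the driver is not blocked from releasing its interrupts.
 */
static bool sketch_should_mark_recovery(enum sketch_channel_state state)
{
	return state == SKETCH_IO_FROZEN;
}
```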




Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more messages]

2015-04-17 Thread Prashant Sreedharan
On Fri, 2015-04-17 at 15:12 -0400, David Miller wrote:
 From: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 Date: Fri, 17 Apr 2015 15:04:48 -0400
 
  From 9e417af099e3cee2b219ab28ffc1e96b0564b213 Mon Sep 17 00:00:00 2001
  From: Konrad Rzeszutek Wilk konrad.w...@oracle.com
  Date: Fri, 17 Apr 2015 14:55:47 -0400
  Subject: [PATCH] config: Enable NEED_DMA_MAP_STATE when SWIOTLB is selected
  
  A huge number of NIC drivers use the DMA API; however, if compiled
  under 32-bit, a very important part of the DMA API can be omitted, leading
  to the drivers not working at all (especially if used with
  'swiotlb=force iommu=soft').
  
  As Prashant Sreedharan explains it: the driver [tg3] uses
  DEFINE_DMA_UNMAP_ADDR(), dma_unmap_addr_set() to keep a copy of the dma
  mapping and dma_unmap_addr() to get the mapping value. On most of
  the platforms this is a no-op, but ... with iommu=soft and
  swiotlb=force this house keeping is required, ... otherwise
  we pass 0 while calling pci_unmap_/pci_dma_sync_ instead of the
  DMA address.
  
  As such enable this even when using 32-bit kernels.
  
  Reported-by: Ian Jackson ian.jack...@eu.citrix.com
  Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 
 Acked-by: David S. Miller da...@davemloft.net

Acked-by: Prashant Sreedharan prash...@broadcom.com




Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more messages]

2015-04-16 Thread Prashant Sreedharan
On Thu, 2015-04-16 at 18:15 +0100, Ian Jackson wrote:
 Michael Chan writes (Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more 
 messages]):
  On Thu, 2015-04-16 at 09:24 -0300, casca...@linux.vnet.ibm.com wrote: 
   Yes, this looks like the driver is not syncing the DMA buffers. Unmap is
   supposed to synchronize as well.
  
  For small rx packets (< 256 bytes), we sync the DMA buffer before we
  copy the data to another SKB.  For larger packets, we unmap the DMA
  buffer.  Do we see the corruption in both cases?
 
 Yes, at least with swiotlb=force iommu=soft.

OK, this is what is causing the problem: the driver uses
DEFINE_DMA_UNMAP_ADDR(), dma_unmap_addr_set() to keep a copy of the dma
mapping and dma_unmap_addr() to get the mapping value. On most
platforms this is a no-op, but it appears that with iommu=soft and
swiotlb=force this housekeeping is required. When I pass the correct
dma_addr instead of 0 while calling pci_unmap_/pci_dma_sync_, I don't see
the corruption. I.e., if you set CONFIG_NEED_DMA_MAP_STATE=y in your kernel
config, you should not see the problem. Can you confirm? Thanks
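
The effect described above can be reproduced in a small user-space sketch. Everything here is a hypothetical stand-in: the sketch_ macros only imitate the shape of the kernel's DEFINE_DMA_UNMAP_ADDR(), dma_unmap_addr_set() and dma_unmap_addr(), and SKETCH_NEED_DMA_MAP_STATE plays the role of CONFIG_NEED_DMA_MAP_STATE. Unlike the kernel macro, the field-defining macro carries its own semicolon so the struct stays valid when it expands to nothing.

```c
#include <stdint.h>

typedef uint64_t sketch_dma_addr_t;

/* Set to 1 to mimic CONFIG_NEED_DMA_MAP_STATE=y. */
#ifndef SKETCH_NEED_DMA_MAP_STATE
#define SKETCH_NEED_DMA_MAP_STATE 0
#endif

#if SKETCH_NEED_DMA_MAP_STATE
/* State is kept: the mapping is stored and read back. */
#define SKETCH_DEFINE_DMA_UNMAP_ADDR(name)      sketch_dma_addr_t name;
#define sketch_dma_unmap_addr(ptr, name)        ((ptr)->name)
#define sketch_dma_unmap_addr_set(ptr, name, v) (((ptr)->name) = (v))
#else
/* State is compiled away: set is a no-op and get always yields 0. */
#define SKETCH_DEFINE_DMA_UNMAP_ADDR(name)
#define sketch_dma_unmap_addr(ptr, name)        (0)
#define sketch_dma_unmap_addr_set(ptr, name, v) do { } while (0)
#endif

struct sketch_ring_info {
	void *data;
	SKETCH_DEFINE_DMA_UNMAP_ADDR(mapping)
};

/* Records a mapping, then returns the address an unmap call would receive. */
static sketch_dma_addr_t sketch_addr_seen_at_unmap(sketch_dma_addr_t mapped)
{
	struct sketch_ring_info ri = { .data = 0 };

	sketch_dma_unmap_addr_set(&ri, mapping, mapped);
	(void)mapped; /* unused when the macros compile away */
	(void)ri;
	return sketch_dma_unmap_addr(&ri, mapping);
}
```

With the switch left at 0, the function returns 0 regardless of the mapped address, which is the value that ends up being passed to the unmap and sync calls in the failing configuration. Building with -DSKETCH_NEED_DMA_MAP_STATE=1 returns the recorded address instead.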

