RE: [Problem] broadcom tg3 network driver disconnects under high load
Hi Toan,

We could not reproduce the issue. We followed the steps below:

1. Booted an HP EliteDesk 705 with Ubuntu 15.04.
2. Created a 1G file from urandom.
3. From another machine, repeatedly copied the 1G file back and forth with scp.

With this setup and these tests, we were unable to reproduce the issue. As we discussed offline, we also tried your custom OS on the HP EliteDesk 705 and couldn't reproduce the issue. As you suggested, we also tried the 1G file you provided, but still couldn't reproduce it.

Since we can't reproduce this issue, we are unable to proceed further.

Thank you for your continued efforts and help.

Thanks,
Satish

-----Original Message-----
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Prashant Sreedharan
Sent: Thursday, April 30, 2015 12:25 AM
To: Toan Pham
Cc: Michael Chan; Sanjeev Bansal; netdev@vger.kernel.org
Subject: Re: [Problem] broadcom tg3 network driver disconnects under high load

On Wed, 2015-04-29 at 13:34 -0400, Toan Pham wrote:
> Prashant,
>
> Unfortunately, I ran the same test 3 times with the new patch and all
> of them failed.
> Attached file is the dmesg log, after the Watchdog had timed out, and
> tried to restart the NIC.
> Feel free to let me know if you would like to try anything else.
> Thanks

Toan, thanks for the result, so this looks to be a different problem. Sanjeev is setting up a repro environment similar to yours to capture a PCIe trace. Will keep you posted.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Problem] broadcom tg3 network driver disconnects under high load
On Wed, 2015-04-29 at 13:34 -0400, Toan Pham wrote:
> Prashant,
>
> Unfortunately, I ran the same test 3 times with the new patch and all
> of them failed.
> Attached file is the dmesg log, after the Watchdog had timed out, and
> tried to restart the NIC.
> Feel free to let me know if you would like to try anything else.
> Thanks

Toan, thanks for the result, so this looks to be a different problem. Sanjeev is setting up a repro environment similar to yours to capture a PCIe trace. Will keep you posted.
Re: [Problem] broadcom tg3 network driver disconnects under high load
On Tue, 2015-04-28 at 16:06 -0400, Toan Pham wrote:
> > We were able to reproduce this issue internally only with iommu enabled.
>
> My last test to collect lspci-info took about 5 hours over a gigabit
> network for the bug to show up. My setup was running 3 tx scp
> sessions, each transferring a 1GB file outbound, and 1 rx scp session
> copying another 1GB file inbound. In a production environment with
> the BCM5762 NIC running as a server, I observed that the failure rate
> is about 1.65/week. Please perform a similar test with iommu
> disabled, and leave it running for days if need be.

Sure, will try.

> >
> > Meanwhile can you try the attached patch and see if you are able to
> > reproduce the problem ?
>
> No problem. I will apply the patch to kernel 4.0 and report back the
> result. Let me know if you need me to turn on any debug options like
> pcie trace, dev debug etc Thanks

If you can collect a PCIe trace, that would be great.

Thanks
Re: [Problem] broadcom tg3 network driver disconnects under high load
> We were able to reproduce this issue internally only with iommu enabled.

My last test to collect the lspci info took about 5 hours over a gigabit network for the bug to show up. My setup was running 3 tx scp sessions, each transferring a 1GB file outbound, and 1 rx scp session copying another 1GB file inbound. In a production environment with the BCM5762 NIC running as a server, I observed a failure rate of about 1.65 per week. Please perform a similar test with iommu disabled, and leave it running for days if need be.

> Meanwhile can you try the attached patch and see if you are able to
> reproduce the problem ?

No problem. I will apply the patch to kernel 4.0 and report back the result. Let me know if you need me to turn on any debug options such as pcie trace, dev debug, etc.

Thanks
Re: [Problem] broadcom tg3 network driver disconnects under high load
On Tue, 2015-04-28 at 11:11 -0700, Michael Chan wrote:
> On Mon, 2015-04-27 at 22:10 +, Toan Pham wrote:
> > Michael,
> >
> > Please see attach files.
> >
> > BTW, I have also tested this bug on at least 8 different HP 705 PCs
> > with the 5762 NIC, so it is probably not a manufacturer defect. In
> > addition, I can never replicate the same issue on the older chipset,
> > BCM5761, which can be found on the HP model 6005. I hope this
> > information is helpful. Thanks
>
> Thanks for the data. The memory enable bit is cleared and there are
> some correctable error bits set. My colleague Sanjeev will look into
> this.
>
> Do you have PCIE Advanced Error Reporting (CONFIG_PCIEAER) enabled in
> your kernel?

The 5762 NIC has a bug that causes the chip to detect a false 4G boundary crossing, which stalls the chip. From the data you have provided, it is not clear whether we are hitting this problem. Bit 5 of register 0x4c04 is set when this condition occurs, but because the memory enable bit is clear, the register dump collected before the chip was reset contains only garbage.

We were able to reproduce this issue internally only with iommu enabled, and I do not see iommu enabled in your dmesg logs. So unless we have a PCIe trace, we cannot confirm whether this HW bug is indeed the problem you are seeing.

Meanwhile, can you try the attached patch and see if you are still able to reproduce the problem? The patch restricts all DMA addresses given to the chip to 31 bits.

Toan, thanks for bringing this to our notice. Also, please cc the maintainers so that mails are not missed.

From 488fd699985f73d361d04d4788de48833c6442ca Mon Sep 17 00:00:00 2001
From: Prashant Sreedharan
Date: Tue, 28 Apr 2015 11:32:56 -0700
Subject: [PATCH] tg3: Restrict DMA address to 31 bits for 5762 device

---
 drivers/net/ethernet/broadcom/tg3.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 069952f..e980c96 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -17707,6 +17707,8 @@ static int tg3_init_one(struct pci_dev *pdev,
 	 */
 	if (tg3_flag(tp, IS_5788))
 		persist_dma_mask = dma_mask = DMA_BIT_MASK(32);
+	else if (tg3_asic_rev(tp) == ASIC_REV_5762)
+		persist_dma_mask = dma_mask = DMA_BIT_MASK(31);
 	else if (tg3_flag(tp, 40BIT_DMA_BUG)) {
 		persist_dma_mask = dma_mask = DMA_BIT_MASK(40);
 #ifdef CONFIG_HIGHMEM
@@ -17736,6 +17738,17 @@ static int tg3_init_one(struct pci_dev *pdev,
 				"No usable DMA configuration, aborting\n");
 			goto err_out_apeunmap;
 		}
+	} else {
+		err = pci_set_dma_mask(pdev, dma_mask);
+		if (!err) {
+			err = pci_set_consistent_dma_mask(pdev,
+							  persist_dma_mask);
+		}
+		if (err) {
+			dev_err(&pdev->dev,
+"No usable DMA configuration, aborting\n");
+			goto err_out_apeunmap;
+		}
 	}

 	tg3_init_bufmgr_config(tp);
--
1.7.1
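
The arithmetic behind the patch above can be sketched in plain C. This is an illustration only, not driver code: the helper names crosses_4g_boundary and fits_dma_mask are hypothetical and do not appear in tg3.c, and the thread does not say what triggers the chip's false detection. The sketch only shows that a buffer lying entirely inside a 31-bit address range can never span a 4 GiB boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from tg3.c): true if the buffer
 * [addr, addr + len) spans a 4 GiB address boundary. */
static bool crosses_4g_boundary(uint64_t addr, uint32_t len)
{
	uint64_t last;

	if (len == 0)
		return false;
	last = addr + len - 1;	/* address of the final byte */
	return (addr >> 32) != (last >> 32);
}

/* Hypothetical helper: true if every byte of the buffer can be
 * addressed using the given number of address bits. */
static bool fits_dma_mask(uint64_t addr, uint32_t len, unsigned int bits)
{
	uint64_t mask = (bits >= 64) ? UINT64_MAX
				     : ((UINT64_C(1) << bits) - 1);
	uint64_t last = (len == 0) ? addr : addr + len - 1;

	/* last >= addr rejects buffers that wrap past the top of the
	 * 64-bit address space. */
	return last >= addr && last <= mask;
}
```

For example, crosses_4g_boundary(0xFFFFF000, 0x2000) is true, while any buffer accepted by fits_dma_mask(addr, len, 31) ends at or below 0x7FFFFFFF and therefore stays well clear of the 4 GiB boundary.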
Re: [Problem] broadcom tg3 network driver disconnects under high load
> Do you have PCIE Advanced Error Reporting (CONFIG_PCIEAER) enabled in your kernel?

Yes, it is enabled.
Re: [Problem] broadcom tg3 network driver disconnects under high load
On Mon, 2015-04-27 at 22:10 +, Toan Pham wrote:
> Michael,
>
> Please see attach files.
>
> BTW, I have also tested this bug on at least 8 different HP 705 PCs
> with the 5762 NIC, so it is probably not a manufacturer defect. In
> addition, I can never replicate the same issue on the older chipset,
> BCM5761, which can be found on the HP model 6005. I hope this
> information is helpful. Thanks

Thanks for the data. The memory enable bit is cleared and there are some correctable error bits set. My colleague Sanjeev will look into this.

Do you have PCIE Advanced Error Reporting (CONFIG_PCIEAER) enabled in your kernel?
Re: [Problem] broadcom tg3 network driver disconnects under high load
Michael,

Please see the attached files.

BTW, I have also tested this bug on at least 8 different HP 705 PCs with the 5762 NIC, so it is probably not a manufacturing defect. In addition, I have never been able to replicate the issue on the older chipset, BCM5761, which can be found on the HP model 6005. I hope this information is helpful.

Thanks

03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5762 Gigabit Ethernet PCIe (rev 10)
        Subsystem: Hewlett-Packard Company Device 2215
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
        Capabilities: [160 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [1b0 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [230 v1] Transaction Processing Hints
                Interrupt vector mode supported
                Steering table in MSI-X table
        Kernel driver in use: tg3
00: e4 14 87 16 06 05 10 00 10 00 00 02 10 00 00 00
10: 0c 00 02 e0 00 00 00 00 0c 00 01 e0 00 00 00 00
20: 0c 00 00 e0 00 00 00 00 00 00 00 00 3c 10 15 22
30: 00 00 00 00 48 00 00 00 00 00 00 00 05 01 00 00
40: 00 00 00 00 00 00 00 fa 01 50 03 c8 08 20 00 16
50: 03 58 fc 80 00 00 00 78 05 a0 86 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 98 02 00 f1 d1 02 f8 01
70: 10 10 07 00 00 ff 00 ff 00 00 00 00 00 00 00 00
80: e4 14 87 16 40 00 00 40 00 00 00 00 a5 09 00 00
90: 00 00 00 00 d2 01 00 00 00 00 00 00 4d 04 00 00
a0: 11 ac 05 80 04 00 00 00 22 01 00 00 10 00 02 00
b0: 82 8d 00 10 00 54 10 00 12 5c 47 00 43 00 12 10
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 1f 08 08 00 00 00 00 00 00 00 00 00 01 00 01 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 21 76 05 00 00 00 00 ff ff ff ff
100: 01 00 c1 13 00 00 10 00 00 00 00 00 30 20 06 00
110: 40 20 00 00 00 20 00 00 b4 00 00 00 01 10 00 40
120: 0f 00 00 00 48 3c 02 e0 00 00 00 00 00 00 00 00
130: 00 00 00 00 00 00 00 00 00 00 00 00 03 00 01 15
140: 8a 82 47 06 51 64 00 00 00 00 00 00 00 00 00 00
150: 04 00 01 16 00 00 00 00 16 81 07 00 01 00 00 00
160: 02 00 01 1b 00 00 00 00 00 00 00 00 00 00 00 00
170: 00 00 00 00 01 00 00 80 00 00 00 00 00 00 00 00
180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1b0: 18 00 01 23 00 00 00 00 00 00 00 00 00 00 00 00
1c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
230: 17 00 01 00 03 04 05 00 00 00 00 00 00 00 00 00
240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
250: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
260: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
270: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
290: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
360: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
370: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
3a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
3b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
3c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
3d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
3e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
3f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
410: 00 00 00 00 00 00 00 00 00 00 00 00 00 0
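
As a reading aid for the hex dump above, here is a minimal C sketch of how the "memory enable" bit can be read out of raw config space bytes. The bit positions are those of the standard PCI Command register at offset 0x04. The macro and function names are made up for this illustration and are not taken from lspci or from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of the PCI Command register (config space offset 0x04).
 * Names are local to this sketch. */
#define CMD_IO_SPACE	(1u << 0)	/* lspci prints I/O+ or I/O-             */
#define CMD_MEM_SPACE	(1u << 1)	/* lspci prints Mem+ or Mem-             */
#define CMD_BUS_MASTER	(1u << 2)	/* lspci prints BusMaster+ or BusMaster- */

/* cfg points at the raw config space bytes in dump order.
 * The Command register is 16 bits wide and little-endian. */
static uint16_t pci_command(const uint8_t *cfg)
{
	return (uint16_t)(cfg[4] | (cfg[5] << 8));
}

static bool pci_io_enabled(const uint8_t *cfg)
{
	return (pci_command(cfg) & CMD_IO_SPACE) != 0;
}

static bool pci_mem_enabled(const uint8_t *cfg)
{
	return (pci_command(cfg) & CMD_MEM_SPACE) != 0;
}

static bool pci_bus_master(const uint8_t *cfg)
{
	return (pci_command(cfg) & CMD_BUS_MASTER) != 0;
}
```

Fed the first row of the dump in this message (e4 14 87 16 06 05 ...), pci_command returns 0x0506, which has the memory and bus master bits set and the I/O bit clear. That agrees with the "I/O- Mem+ BusMaster+" shown in the decoded Control line of the same dump.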
Re: [Problem] broadcom tg3 network driver disconnects under high load
On Fri, 2015-04-24 at 12:33 -0400, Toan Pham wrote:
> Summary: Broadcom 5762 NIC locks up under heavy load.

Can you provide lspci -vvvxxx -s 3:0.0 after it gets into this state?

Thanks.
[Problem] broadcom tg3 network driver disconnects under high load
Summary:
Broadcom 5762 NIC locks up under heavy load.

Description:
The tg3 Broadcom network driver, when bound to the 5762 chipset, locks up under heavy network load. When this happens, a reboot is necessary to recover networking. Sometimes, bringing the interface down and back up (via ifconfig) recovers networking.

I have also tested with the latest tg3 driver, 3.137h (Dec 2014 version), and networking is still problematic. I have also disabled TSO, GSO, etc. with ethtool, but the bug still surfaces.

This bug may be related to the integrated firmware because, at the time of the crash, the memory dump of the BCM5762 chip reads back entirely as 0xFF.

Here is a procedure to replicate the issue, which is hard to reproduce under moderate network load.

1. Boot a machine with a Broadcom 5762 NIC (e.g. HP EliteDesk 705) using an Ubuntu/Kubuntu Live CD 14.04-15.04, or a build with the latest mainline kernel.

2. From another machine, start 5 sessions, each repeatedly copying (scp with public key authentication) a 70 MB file back and forth to the tg3 machine. (Not sure if this step is necessary.)

3. Create a 1GB file on the tg3 machine, with something like:

   dd if=/dev/urandom of=/my_test_file bs=1024 count=$((1024*1000))

4. From another machine, repeatedly secure copy that 1GB file from the tg3 machine. This can be done with something like:

   while true; do
       scp -i /my/scp/private.key u...@ip.of.tg3:/my_test_file /tmp
   done

Networking will lock up in about 10-30 minutes; in some rare cases it takes up to 4 hours of run time. Running multiple instances of the 1GB file transfer significantly reduces the time to failure.

Keywords: networking, tg3

Kernel version: Linux version 4.0.0-gbf70def. I have also tested with the following kernel versions: 3.17, 3.16, 2.6.39.

Kernel log message (Oops):
(see full ref: https://launchpadlibrarian.net/204185480/dmesg)

WARNING: CPU: 0 PID: 1830 at net/sched/sch_generic.c:303 dev_watchdog+0xfc/0x185()
NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 1830 Comm: cat Not tainted 4.0.0-gbf70def #4
Hardware name: Hewlett-Packard HP EliteDesk 705 G1 MT/2215, BIOS L06 v02.15 10/22/2014
 f581df18 c06e5045 c0a7ec29 f581df30 c01319e9 c0668e77 f4c3 0005da10
 f581df48 c0131a73 0009 f581df40 c0a7ec29 f581df5c f581df78 c0668e77
 c0a7ec62 012f c0a7ec29 f4c3 c0a60eba
Call Trace:
 [] dump_stack+0x41/0x52
 [] warn_slowpath_common+0x83/0x9a
 [] ? dev_watchdog+0xfc/0x185
 [] warn_slowpath_fmt+0x2b/0x2f
 [] dev_watchdog+0xfc/0x185
 [] ? pfifo_fast_dequeue+0xaf/0xaf
 [] call_timer_fn+0x47/0xcd
 [] run_timer_softirq+0x165/0x1c4
 [] ? pfifo_fast_dequeue+0xaf/0xaf
 [] __do_softirq+0xbe/0x1ef
 [] ? _local_bh_enable+0x40/0x40
 [] do_softirq_own_stack+0x22/0x28
 [] irq_exit+0x39/0x47
 [] smp_apic_timer_interrupt+0x38/0x42
 [] apic_timer_interrupt+0x2d/0x34
 [] ? _raw_spin_unlock_irqrestore+0xd/0xf
 [] extract_buf+0x83/0xc7
 [] extract_entropy_user+0xc2/0x11a
 [] urandom_read+0x68/0xbf
 [] ? extract_entropy_user+0x11a/0x11a
 [] __vfs_read+0x1b/0x47
 [] vfs_read+0x6b/0xd3
 [] SyS_read+0x44/0x84
 [] syscall_call+0x7/0x7

System info and detailed description:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664

I can help test proposed patches fairly quickly, so please let me know if you need anything.

Thank you.
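
The report above notes that the chip's memory dump is all 0xFF at the time of the failure, and a reply elsewhere in this thread names bit 5 of register 0x4c04 as the indicator of the 4G boundary stall. The C sketch below shows why such a dump is inconclusive: an all-ones word has bit 5 set, but reads from a device that is no longer responding typically return all ones as well. The function and enum names are hypothetical, and treating a single all-ones word as "unknown" is an assumption of this sketch, not something the thread states.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every 32-bit word in the dump is 0xFFFFFFFF. An empty dump
 * is reported as false. */
static bool dump_is_all_ones(const uint32_t *regs, size_t n)
{
	size_t i;

	if (n == 0)
		return false;
	for (i = 0; i < n; i++)
		if (regs[i] != UINT32_MAX)
			return false;
	return true;
}

enum stall_state { STALL_UNKNOWN, STALL_CLEAR, STALL_SET };

/* Interpret a value read from register 0x4c04. An all-ones value is
 * treated as unreadable rather than as "bit 5 set" (assumption). */
static enum stall_state classify_stall_bit(uint32_t reg_4c04)
{
	if (reg_4c04 == UINT32_MAX)
		return STALL_UNKNOWN;
	return (reg_4c04 & (1u << 5)) ? STALL_SET : STALL_CLEAR;
}
```

With a dump of the kind described in the report, dump_is_all_ones returns true and classify_stall_bit(0xFFFFFFFF) returns STALL_UNKNOWN, so the dump by itself neither confirms nor rules out the stall condition.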