[Sorry, Henrique, for replying directly to you]

> On 26 May 2015, at 15:39, Henrique de Moraes Holschuh wrote:
> 
> On Tue, May 26, 2015, at 09:24, Justin Catterall wrote:
>> At irregular times, and apparently for no reason at all, networking
>> drops and cannot be restarted without reboot on a fresh install of
>> Jessie. The NIC is a Broadcom NetXtreme BCM5720.
>> 
>> ifconfig thinks networking is still up because I can:
>>      ifconfig eth0 down
>> 
>> I find this when I try 'ifconfig eth0 up':
>> tg3_abort_hw timed out TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
> 
> Hmm, it is either a kernel issue, or a hardware issue.
> 
>> Any suggestions on where to look for a solution?
> 
> Yes.
> 
> First, disable all hardware offloading using ethtool.  See if that
> helps.

I was able to disable all of them except:
  rx-vlan-offload: on [fixed]
  tx-vlan-offload: on [fixed]
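
For anyone following along in the archive, here is a minimal sketch of 
disabling the offloads with ethtool. It is not the exact set of commands I 
ran: the interface name eth0 and the feature list are assumptions, and the 
run/DRY_RUN helper is only there so the commands can be previewed first. 
Features marked [fixed] cannot be changed, so ethtool will refuse those.

```shell
#!/bin/sh
# Sketch only: the interface name and feature list are assumptions.

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Turn off the common offload features on the given interface.
disable_offloads() {
    iface="$1"
    for feature in rx tx sg tso gso gro; do
        # One feature per call, so a refusal on one does not stop the rest.
        run ethtool -K "$iface" "$feature" off \
            || echo "could not change $feature on $iface" >&2
    done
}
```

Run "disable_offloads eth0" as root, then check the result with 
"ethtool -k eth0" (lower-case k lists the current settings).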

Now, if I "/etc/init.d/networking restart", the system doesn't report any error 
but networking is still dead. However, I can rmmod tg3, ptp and libphy, then 
"modprobe tg3" and "/etc/init.d/networking start", and all works (I have done 
this a handful of times with no need to reboot to re-enable networking). So 
that's some progress.
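
The recovery sequence above could be wrapped up roughly like this. It is a 
sketch under assumptions: the function name reload_tg3 and the run/DRY_RUN 
helper are made up for illustration, and it assumes nothing else on the 
system is using ptp or libphy.

```shell
#!/bin/sh
# Sketch only: reload the tg3 driver, then bring networking back up.

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_tg3() {
    # tg3 goes first because it depends on the other two modules.
    for mod in tg3 ptp libphy; do
        # Keep going if a module is missing or still in use.
        run rmmod "$mod" || echo "rmmod $mod failed, continuing" >&2
    done
    run modprobe tg3 || return 1
    run /etc/init.d/networking start
}
```

Setting DRY_RUN=1 before calling reload_tg3 only prints the five commands, 
which is a cheap way to check the order before running it for real as root.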


> Also, if this NIC is in the system mainboard, make sure you are using
> the latest firmware ("BIOS update") from your motherboard vendor: it is
> usual to have the motherboard NICs use a data block in the shared system
> FLASH for vital product data and firmware. The motherboard vendor will
> bundle up updates for the NIC firmware with the BIOS updates when both
> are in the same FLASH chip.

I've read the documentation for the latest firmware and there is no mention of 
changes for the NIC, only a "power-on delay option" to allow longer/shorter 
period of time to hit the key to access the BIOS. And a change to boot device 
detection to better detect devices with invalid boot records. No other changes 
mentioned in the firmware. 

Here's a link to the page:
http://h20565.www2.hp.com/hpsc/swd/public/detail?sp4ts.oid=5390291&swItemId=MTX_a21cee44c55643598fb2f52bc2&swEnvOid=4144#tab4

I don't like tinkering with firmware if I can help it. In this case they don't 
say there are changes to the NIC, so do you think I should still upgrade? The 
description says no bugs fixed, only enhancements.


> Make sure you have the latest linux firmware file for the tg3 driver as
> well.  If the initramfs image has the tg3.ko module inside, it must also
> have the firmware file.  A workaround for any initramfs-related tg3
> firmware loading issues is to "rmmod tg3 ; modprobe tg3"  after the
> system booted (and before the NIC hardlocks).

See above, even after rmmod'ing I can still force network restart to fail 
without error, though it is recoverable if noticed.


> If all of the above failed, get yourself familiar with building a custom
> Debian-compatible kernel using pristine upstream kernels from
> www.kernel.org.  Wait until 3.18.15 and 4.0.5 are released in
> www.kernel.org, and build custom kernels based on them.  Alternatively,
> wait until a debian-packaged version of kernel 4.0.5 is available.  DO
> NOT use 4.0 kernels before 4.0.5 on pain of possible data loss.

Data loss? On a "stable" kernel? WTF are they doing these days? I notice that 
stable/dev are no longer distinguished by even/odd minor numbers - it took me 
a bit of Googling to get caught up!


> If either the 3.18.15 or 4.0.5 kernel fixes the issue with your bcm5720,
> please tell us so that we can try to isolate the fix and backport it to
> the Debian kernel.

In the meantime I've made a bash script to rmmod and modprobe as appropriate. 
I'll set a cron job to ping a couple of other servers on the LAN and execute 
the script and restart networking should the pings fail.
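
A minimal sketch of that watchdog idea, not my actual script. The host 
addresses, the recovery script path /usr/local/sbin/reload-tg3.sh and the 
PING_CMD/RECOVER_CMD variables are all hypothetical, and "-W 2" assumes the 
Linux iputils ping (reply timeout in seconds).

```shell
#!/bin/sh
# Sketch only: every name and path below is a placeholder.
PING_CMD="${PING_CMD:-ping}"
RECOVER_CMD="${RECOVER_CMD:-/usr/local/sbin/reload-tg3.sh}"

# Succeed as soon as one of the given hosts answers a single ping.
any_host_alive() {
    for host in "$@"; do
        if "$PING_CMD" -c 1 -W 2 "$host" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

# Run the recovery command only when every host is unreachable.
check_network() {
    if any_host_alive "$@"; then
        return 0
    fi
    echo "no LAN host reachable, reloading tg3" >&2
    "$RECOVER_CMD"
}
```

Saved as a script ending in a line such as "check_network 192.168.1.10 
192.168.1.11" (made-up addresses), it could be run from root's crontab 
every few minutes. Requiring all hosts to fail before reloading means one 
rebooting server on the LAN won't trigger a driver reload.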


> If that fails, you will have to engage the kernel community itself for a
> fix.  Please file a bug on bugzilla.kernel.org, and good luck. There are
> several hardware hang reports open against BCM57xx + tg3.

Damn crap hardware. I remember having issues with tg3 at least six or seven 
years ago. I can't believe it's still being incorporated into motherboards when 
there are obviously problems with the chipset. Depending on the speed of 
progress on the kernel front I may just stick a PCI NIC in there - I think I 
still have some 3c509s around somewhere...


> Alternatively, try to get yourself an Intel NIC that works with the igb
> driver (don't get an Intel NIC that needs the e1000e driver) to replace
> the hardlock-prone bcm5720 + tg3 combination.

Thanks for the pointers. I at least have a situation now where I don't need a 
reboot to get networking functioning after it fails. It's far from perfect, but 
it's much, much better.

-- 
Justin C, by the sea.

