Network dies unexpectedly

2013-12-26 Thread Galante, Nicola
Greetings,

I administer a web server for my institution, and last night we had a
problem.  The server is a 1U Intel Xeon E5620 machine with an on-board
Intel 82574L Gigabit Controller, running Scientific Linux 6.4 with kernel
2.6.32-431.1.2.el6.x86_64.  At some point last night the network interface
stopped working and the kernel logged a backtrace in dev_watchdog.  I
could not restart the network service; it complained that the interface
eth0 was not available.  I also tried, unsuccessfully, to reconfigure it
with NetworkManager.  A full system reboot fixed the problem, but I could
not identify the cause.  I do not know whether it matters, but this
problem had never occurred before the last yum update.  Below is the
portion of /var/log/messages that relates to the problem:

=
Dec 25 20:01:52 veritasm xinetd[1966]: EXIT: nrpe status=0 pid=20943 duration=0(sec)
Dec 25 20:02:21 veritasm xinetd[1966]: START: nrpe pid=20947 from=:::199.104.151.131
Dec 25 20:02:21 veritasm xinetd[1966]: EXIT: nrpe status=0 pid=20947 duration=0(sec)
Dec 26 02:18:37 veritasm kernel: [ cut here ]
Dec 26 02:18:37 veritasm kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26b/0x280() (Not tainted)
Dec 26 02:18:37 veritasm kernel: Hardware name: X8DTL
Dec 26 02:18:37 veritasm kernel: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Dec 26 02:18:37 veritasm kernel: Modules linked in: autofs4 8021q sunrpc garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 microcode iTCO_wdt iTCO_vendor_support sg i2c_i801 i2c_core lpc_ich mfd_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext4 jbd2 mbcache raid1 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Dec 26 02:18:37 veritasm kernel: Pid: 130, comm: kipmi0 Not tainted 2.6.32-431.1.2.el6.x86_64 #1
Dec 26 02:18:37 veritasm kernel: Call Trace:
Dec 26 02:18:37 veritasm kernel: IRQ  [81071e27] ? warn_slowpath_common+0x87/0xc0
Dec 26 02:18:37 veritasm kernel: [81071f16] ? warn_slowpath_fmt+0x46/0x50
Dec 26 02:18:37 veritasm kernel: [8147b75b] ? dev_watchdog+0x26b/0x280
Dec 26 02:18:37 veritasm kernel: [8105dd5c] ? scheduler_tick+0xcc/0x260
Dec 26 02:18:37 veritasm kernel: [8147b4f0] ? dev_watchdog+0x0/0x280
Dec 26 02:18:37 veritasm kernel: [81084b07] ? run_timer_softirq+0x197/0x340
Dec 26 02:18:37 veritasm kernel: [810ac905] ? tick_dev_program_event+0x65/0xc0
Dec 26 02:18:37 veritasm kernel: [8107a8e1] ? __do_softirq+0xc1/0x1e0
Dec 26 02:18:37 veritasm kernel: [810ac9da] ? tick_program_event+0x2a/0x30
Dec 26 02:18:37 veritasm kernel: [8100c30c] ? call_softirq+0x1c/0x30
Dec 26 02:18:37 veritasm kernel: [8100fa75] ? do_softirq+0x65/0xa0
Dec 26 02:18:37 veritasm kernel: [8107a795] ? irq_exit+0x85/0x90
Dec 26 02:18:37 veritasm kernel: [815310ba] ? smp_apic_timer_interrupt+0x4a/0x60
Dec 26 02:18:37 veritasm kernel: [8100bb93] ? apic_timer_interrupt+0x13/0x20
Dec 26 02:18:37 veritasm kernel: EOI  [8152a367] ? _spin_unlock_irqrestore+0x17/0x20
Dec 26 02:18:37 veritasm kernel: [812e7790] ? ipmi_thread+0x70/0x1c0
Dec 26 02:18:37 veritasm kernel: [812e7720] ? ipmi_thread+0x0/0x1c0
Dec 26 02:18:37 veritasm kernel: [8109af06] ? kthread+0x96/0xa0
Dec 26 02:18:37 veritasm kernel: [8100c20a] ? child_rip+0xa/0x20
Dec 26 02:18:37 veritasm kernel: [8109ae70] ? kthread+0x0/0xa0
Dec 26 02:18:37 veritasm kernel: [8100c200] ? child_rip+0x0/0x20
Dec 26 02:18:37 veritasm kernel: ---[ end trace fc057a7fca6eff49 ]---
Dec 26 02:18:37 veritasm kernel: e1000e :06:00.0: eth0: Reset adapter unexpectedly
Dec 26 02:18:37 veritasm NetworkManager[1724]: info (eth0): carrier now OFF (device state 8, deferring action for 4 seconds)
Dec 26 02:18:38 veritasm kernel: e1000e :06:00.0: eth0: Timesync Tx Control register not set as expected
Dec 26 02:18:41 veritasm NetworkManager[1724]: info (eth0): device state change: 8 - 2 (reason 40)
Dec 26 02:18:41 veritasm NetworkManager[1724]: info (eth0): deactivating device (reason: 40).
Dec 26 02:18:43 veritasm ntpd[1974]: Deleting interface #5 eth0, 199.104.151.141#123, interface stats: received=789, sent=886, dropped=0, active_time=232100 secs
Dec 26 08:58:36 veritasm kernel: fuse init (API version 7.13)
Dec 26 08:58:36 veritasm rtkit-daemon[2379]: Sucessfully made thread 23005 of process 23005 (/usr/bin/pulseaudio) owned by '500' high priority at nice level -11.
Dec 26 08:58:37 veritasm rtkit-daemon[2379]: Sucessfully made thread 23039 of process 23039 (/usr/bin/pulseaudio) owned by '500' high priority at nice level -11.
Dec 26 08:58:37 
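
For anyone who wants to check whether this event has happened before, here is a minimal Python sketch that scans syslog text for the "NETDEV WATCHDOG" line shown in the excerpt. The function names (unwrap, find_watchdog_events) and the regular expression are my own illustration, not part of any tool, and the pattern assumes the message format is exactly as it appears in the log above. The sample is copied from that excerpt, including the mail-wrapped continuation lines.

```python
import re

# A syslog entry starts with a "Mon DD HH:MM:SS " timestamp; anything else is
# treated as a continuation of the previous entry (e.g. a mail-wrapped line).
ENTRY_START = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s")

# Assumed format of the kernel's transmit-timeout message, taken from the excerpt.
WATCHDOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"NETDEV WATCHDOG:\s+(?P<iface>\S+)\s+\((?P<driver>[^)]+)\):\s+"
    r"transmit\s+queue\s+(?P<queue>\d+)\s+timed out"
)


def unwrap(lines):
    """Rejoin wrapped continuation lines onto the entry they belong to."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def find_watchdog_events(lines):
    """Return one dict (stamp, host, iface, driver, queue) per watchdog timeout."""
    events = []
    for entry in unwrap(lines):
        match = WATCHDOG_RE.match(entry)
        if match:
            events.append(match.groupdict())
    return events


# Lines copied from the excerpt above, wrapped as they were in the mail.
SAMPLE = """\
Dec 26 02:18:37 veritasm kernel: Hardware name: X8DTL
Dec 26 02:18:37 veritasm kernel: NETDEV WATCHDOG: eth0 (e1000e): transmit
queue 0 timed out
Dec 26 02:18:37 veritasm kernel: e1000e :06:00.0: eth0: Reset adapter
unexpectedly
""".splitlines()

if __name__ == "__main__":
    for event in find_watchdog_events(SAMPLE):
        print(event)
```

To use it on a live system, the lines of /var/log/messages (and its rotated copies) can be passed to find_watchdog_events in place of SAMPLE.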

Re: Network dies unexpectedly

2013-12-26 Thread Paul Robert Marino
This was caused by an internal hardware watchdog built into Intel
network cards. It detected an error and disabled the interface at the
hardware level until you rebooted and the card's memory was cleared. It
looks like the card may have lost clock sync with its neighbor, which
is odd: that basically means it wasn't sending out the 5 volt signal
used for frequency sync. I've worked with Intel cards for probably
over a decade and I've never seen this exact error before.

Try rolling back to the previous kernel version; however, this looks
more like a physical hardware issue.
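
As a rough aid for the rollback, the Python sketch below picks the newest installed kernel that is older than the running one. Everything in it is an assumption for illustration: the helper names are made up, the INSTALLED list is hypothetical apart from the running kernel string taken from the log, and the ordering is a simple numeric comparison of the digits before the ".el" dist tag rather than rpm's own version comparison. On a real system the list would come from something like "rpm -q kernel" and the running version from "uname -r".

```python
import re


def version_key(release):
    """Numeric sort key for strings like '2.6.32-431.1.2.el6.x86_64'.

    Assumption: only the digits before the '.el' dist tag matter. This is a
    simplification, not rpm's own version comparison.
    """
    numeric_part = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))


def previous_kernel(installed, running):
    """Return the newest installed kernel older than the running one, or None."""
    older = [k for k in installed if version_key(k) < version_key(running)]
    return max(older, key=version_key) if older else None


RUNNING = "2.6.32-431.1.2.el6.x86_64"  # taken from the log in the original post

# Hypothetical list of installed kernels, for illustration only.
INSTALLED = [
    "2.6.32-358.18.1.el6.x86_64",
    "2.6.32-358.23.2.el6.x86_64",
    RUNNING,
]

if __name__ == "__main__":
    print(previous_kernel(INSTALLED, RUNNING))
```

The kernel it reports is the one to select from the boot menu at the next restart; if the timeout never recurs on it, that points toward the update rather than the hardware.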
