Hi,

I have a Xen host with some HV guests which becomes unreachable via
the network after apparently random amounts of time.  I have already
switched the network card to see if that would make a difference,
and with the card currently installed, it worked fine for over 20 days
until it became unreachable again.  Before switching the network card,
it would run a week or two before becoming unreachable.  The previous
card was the on-board BCM5764M, which uses the tg3 driver.

There are messages like this in the log file:


Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P           O    4.0.5-gentoo #3
Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
Oct 14 20:58:02 moonflo kernel:  ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
Oct 14 20:58:02 moonflo kernel:  ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
Oct 14 20:58:02 moonflo kernel:  0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
Oct 14 20:58:02 moonflo kernel: Call Trace:
Oct 14 20:58:02 moonflo kernel:  <IRQ>  [<ffffffff814da8d8>] dump_stack+0x45/0x57
Oct 14 20:58:02 moonflo kernel:  [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
Oct 14 20:58:02 moonflo kernel:  [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b819>] dev_watchdog+0x259/0x270
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
Oct 14 20:58:02 moonflo kernel:  [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
Oct 14 20:58:02 moonflo kernel:  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
Oct 14 20:58:02 moonflo kernel:  [<ffffffff81053979>] ? default_idle+0x9/0x10
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up


After that, there are lots of messages about the link being up, one message
every 12 seconds.  When you unplug the network cable, you get a message that
the link is down, and no message when you plug it in again.
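
To check whether the "link up" messages really arrive at a steady interval,
a small script along the following lines could be run over the log.  This is
only a sketch: it assumes the syslog line format shown above, and since syslog
timestamps carry no year, the `year` parameter is an assumed value used only
to compute time differences.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
LINK_UP = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: .*: link up\s*$"
)


def link_up_intervals(lines, year=2015):
    """Return the seconds between consecutive 'link up' kernel messages.

    `year` is an assumption; it only matters for parsing, not for the
    differences, as long as the log does not cross a year boundary.
    """
    stamps = []
    for line in lines:
        match = LINK_UP.match(line.strip())
        if match:
            stamps.append(
                datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
            )
    # Pair each timestamp with its successor and take the difference.
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Feeding it the lines of the log file (for example `link_up_intervals(open(path))`,
where `path` is wherever the kernel log is written on the host) gives a list of
gaps that can be checked for the 12-second pattern.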

I was hoping that switching the network card (to one that uses a different
driver) might solve the problem, but it did not.  Now I can only guess that
the network card goes to sleep and sometimes cannot be woken up again.

I tried reducing the connection speed to 100Mbit and found that accessing the
VMs (via RDP) becomes too slow to use them.  So I disabled the power management
of the network card (through sysfs) and will have to see if the problem persists.
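
For reference, disabling power management through sysfs can be sketched as
below.  This assumes the setting in question is the PCI runtime power
management `power/control` attribute, where "on" keeps the device powered and
"auto" lets the kernel suspend it; the exact file used is not stated above.
The PCI address 0000:37:04.0 is taken from the log, and the `sysfs_root`
parameter is a hypothetical addition so the function can be tried against a
scratch directory.

```python
from pathlib import Path


def set_runtime_pm(pci_addr, mode="on", sysfs_root="/sys"):
    """Write `mode` to the device's power/control file and return the
    value read back.

    "on"   -> runtime power management disabled (device stays powered)
    "auto" -> the kernel may runtime-suspend the device
    """
    if mode not in ("on", "auto"):
        raise ValueError(f"unsupported mode: {mode!r}")
    control = (
        Path(sysfs_root) / "bus" / "pci" / "devices" / pci_addr
        / "power" / "control"
    )
    control.write_text(mode + "\n")
    # Read the value back to confirm what is now in effect.
    return control.read_text().strip()
```

On the real host this would be called as root, e.g.
`set_runtime_pm("0000:37:04.0")`.  The setting does not survive a reboot, so
it would have to be reapplied at boot.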

We'll be getting decent network cards in a couple of days, but since the problem
doesn't seem to be related to a particular card/model/manufacturer, that might
not fix it, either.

This problem seems to occur only on machines that operate as a Xen server.
Other machines, identical Z800s not running Xen, run just fine.

What would you suggest?
