J. Roeleveld wrote:
On Thursday, October 15, 2015 03:30:01 PM hw wrote:
Hi,

I have a xen host with some HV guests which becomes unreachable via
the network after apparently random amounts of time.  I have already
switched the network card to see if that would make a difference,
and with the card currently installed, it worked fine for over 20 days
until it became unreachable again.  Before switching the network card,
it would run a week or two before becoming unreachable.  The previous
card was the on-board BCM5764M which uses the tg3 driver.

There are messages like this in the log file:


Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P           O    4.0.5-gentoo #3
Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
Oct 14 20:58:02 moonflo kernel:  ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
Oct 14 20:58:02 moonflo kernel:  ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
Oct 14 20:58:02 moonflo kernel:  0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
Oct 14 20:58:02 moonflo kernel: Call Trace:
Oct 14 20:58:02 moonflo kernel:  <IRQ>  [<ffffffff814da8d8>] dump_stack+0x45/0x57
Oct 14 20:58:02 moonflo kernel:  [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
Oct 14 20:58:02 moonflo kernel:  [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b819>] dev_watchdog+0x259/0x270
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
Oct 14 20:58:02 moonflo kernel:  [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
Oct 14 20:58:02 moonflo kernel:  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
Oct 14 20:58:02 moonflo kernel:  [<ffffffff81053979>] ? default_idle+0x9/0x10
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up


After that, there are lots of messages about the link being up, one message
every 12 seconds.  When you unplug the network cable, you get a message that
the link is down, and no message when you plug it in again.
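
For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that pulls out the NETDEV WATCHDOG events and measures the spacing between consecutive "link up" messages per interface. The function names and the `year` default are made up for illustration. It assumes classic syslog lines like the ones above (which carry no year) and an English/C locale for the month abbreviation.

```python
import re
from datetime import datetime

# Matches watchdog lines of the form seen in the log above.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)
# Matches "<iface>: link up" / "<iface>: link down" at the end of a line.
LINK_RE = re.compile(r"(?P<iface>\S+): link (?P<state>up|down)\s*$")
# Classic syslog timestamp, e.g. "Oct 14 20:58:02".
TS_RE = re.compile(r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) ")


def parse_ts(line, year=2015):
    """Return the line's timestamp, or None if it has none.

    Syslog omits the year, so the caller has to supply one (assumption).
    """
    m = TS_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")


def summarize(lines, year=2015):
    """Collect watchdog timeouts and seconds between "link up" messages."""
    watchdogs = []        # (timestamp, iface, driver, queue)
    link_up_times = {}    # iface -> list of timestamps
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue
        m = WATCHDOG_RE.search(line)
        if m:
            watchdogs.append((ts, m["iface"], m["driver"], int(m["queue"])))
            continue
        m = LINK_RE.search(line)
        if m and m["state"] == "up":
            link_up_times.setdefault(m["iface"], []).append(ts)
    intervals = {
        iface: [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        for iface, times in link_up_times.items()
    }
    return watchdogs, intervals
```

Feeding it the lines of the log file (for example `summarize(open(path))`, with `path` being wherever your syslog writes kernel messages) returns the list of timeouts and, per interface, the gaps in seconds between "link up" messages, which makes a regular repeat interval easy to spot.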

I was hoping that switching the network card (to one that uses a different
driver) might solve the problem, but it did not.  Now I can only guess that
the network card goes to sleep and sometimes cannot be woken up again.

I tried to reduce the connection speed to 100Mbit and found that accessing
the VMs (via RDP) becomes too slow to use them.  So I disabled the power
management of the network card (through sysfs) and will have to see if the
problem persists.
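
In case it helps someone reproduce this, below is a minimal Python sketch of one way to turn off runtime power management for the PCI device behind an interface. It is an assumption that the runtime-PM `power/control` attribute is the relevant knob; the text above only says "through sysfs". The function name and the `sysfs_root` parameter are invented for the example (the latter only so it can be tried against a scratch directory). Writing "on" keeps the device powered, "auto" lets the kernel suspend it; it needs root on a real system.

```python
from pathlib import Path


def set_runtime_pm(iface, mode="on", sysfs_root="/sys"):
    """Write `mode` to the runtime-PM control file of the device behind `iface`.

    Returns the previous setting so it can be restored later.
    """
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    control = (
        Path(sysfs_root) / "class" / "net" / iface / "device" / "power" / "control"
    )
    previous = control.read_text().strip()
    control.write_text(mode + "\n")
    return previous
```

For the card in the log above that would be `set_runtime_pm("enp55s4")`. The setting does not survive a reboot, so it has to be reapplied from a boot script if it turns out to help.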

We'll be getting decent network cards in a couple days, but since the
problem doesn't seem to be related to a particular card/model/manufacturer,
that might not fix it, either.

This problem seems to only occur on machines that operate as a xen server.
Other machines, identical Z800s, not running xen, run just fine.

What would you suggest?

More info required:

- Which version of Xen

4.5.1

Installed versions:  4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug -efi -flask -xsm)

- Does this only occur with HVM guests?

The host has been running only HVM guests every time it happened.
It was running a PV guest in between (which I had to shut down
because other VMs were migrated, requiring the RAM).

- Which network-driver are you using inside the guest

r8169, compiled as a module

The same happened with the tg3 driver when the on-board cards were used.
The tg3 driver is now completely disabled in the kernel config, i.e.
not even compiled as a module.
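
To double-check which driver is actually bound to an interface, a small Python sketch like the following reads the `device/driver` symlink under `/sys/class/net`. The function name and the `sysfs_root` parameter are only illustrative. Purely virtual interfaces such as bridges have no such link, in which case it returns None.

```python
import os


def bound_driver(iface, sysfs_root="/sys"):
    """Return the name of the kernel driver bound to `iface`, or None."""
    link = os.path.join(sysfs_root, "class", "net", iface, "device", "driver")
    if not os.path.islink(link):
        return None
    # The symlink points at the driver's directory; its basename is the name.
    return os.path.basename(os.readlink(link))
```

On the host described here, `bound_driver("enp55s4")` would be expected to report r8169 if the log above is anything to go by.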

- Can you connect to the "local" console of the guest?

Yes, the host seems to be running fine except for having no network
connectivity.  There's a keyboard and monitor physically connected to
it with which you can log in and do stuff.

You get no answer when you ping the host while it is unreachable.

- If yes, does it still have no connectivity?

It has been restarted this morning when it was found to be unreachable.

I saw the same on my lab machine, which was related to:
- Not using correct drivers inside HVM guests

There are Windoze 7 guests running that have PV drivers installed.
One of those has formerly been running on a VMware host and was
migrated on Tuesday.  I deinstalled the VMware tools from it.

Since Monday, a HVM Linux system (a modified 32-bit Debian) has also
been migrated from the VMware host to this one.  I don't know if it
has VMware tools installed (I guess it does because it could be shut
down via VMware) and how those might react now.  It's working, and I
don't want to touch it.

However, the problem already occurred before this migration, when the
on-board cards were still used.

- Switch hardware not keeping the MAC/IP/Port lists long enough

What might be the reason for the lists becoming too short?  Too many
devices connected to the network?

The host has been connected to two different switches and showed the
problem.  Previously, that was an 8-port 1Gb switch, now it's a 24-port
1Gb switch.  However, the 8-port switch is also connected to the 24-port
switch the host is now connected to.  (The 24-port switch connects it
"directly" to the rest of the network.)

