J. Roeleveld wrote:
On Thursday, October 15, 2015 05:46:07 PM hw wrote:
J. Roeleveld wrote:
On Thursday, October 15, 2015 03:30:01 PM hw wrote:
Hi,

I have a Xen host with some HVM guests which becomes unreachable via
the network after an apparently random amount of time.  I have already
switched the network card to see if that would make a difference,
and with the card currently installed, it worked fine for over 20 days
until it became unreachable again.  Before switching the network card,
it would run a week or two before becoming unreachable.  The previous
card was the on-board BCM5764M, which uses the tg3 driver.

There are messages like this in the log file:


Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P           O    4.0.5-gentoo #3
Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
Oct 14 20:58:02 moonflo kernel:  ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
Oct 14 20:58:02 moonflo kernel:  ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
Oct 14 20:58:02 moonflo kernel:  0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
Oct 14 20:58:02 moonflo kernel: Call Trace:
Oct 14 20:58:02 moonflo kernel:  <IRQ>  [<ffffffff814da8d8>] dump_stack+0x45/0x57
Oct 14 20:58:02 moonflo kernel:  [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
Oct 14 20:58:02 moonflo kernel:  [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b819>] dev_watchdog+0x259/0x270
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
Oct 14 20:58:02 moonflo kernel:  [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
Oct 14 20:58:02 moonflo kernel:  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
Oct 14 20:58:02 moonflo kernel:  [<ffffffff81053979>] ? default_idle+0x9/0x10
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
Oct 14 20:58:02 moonflo kernel:  [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
Oct 14 20:58:02 moonflo kernel:  [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up


After that, there are lots of messages about the link being up, one
message every 12 seconds.  When you unplug the network cable, you get
a message that the link is down, and no message when you plug it in
again.

I was hoping that switching the network card (to one that uses a
different driver) might solve the problem, but it did not.  Now I can
only guess that the network card goes to sleep and sometimes cannot be
woken up again.

I tried to reduce the connection speed to 100Mbit and found that
accessing the VMs (via RDP) becomes too slow to use them.  So I
disabled the power management of the network card (through sysfs) and
will have to see if the problem persists.
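For reference, this is roughly how the runtime power management can be switched off through sysfs (a sketch; the interface name is the one from the log above, and whether the r8169 honours runtime PM at all is an assumption):

```shell
#!/bin/sh
# Sketch: keep the NIC's PCI device fully powered by disabling runtime PM.
# Writing "on" to power/control tells the kernel never to runtime-suspend it.
disable_runtime_pm() {
    ctl=$1    # path to the device's power/control attribute
    [ -w "$ctl" ] || { echo "no writable runtime-PM control at $ctl" >&2; return 1; }
    echo on > "$ctl"
    cat "$ctl"    # should now read "on"
}

# enp55s4 is the interface from the logs; adjust for your card.
disable_runtime_pm /sys/class/net/enp55s4/device/power/control || true
```

The setting does not survive a reboot, so it would have to go into a local.d script or a udev rule to be permanent.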

We'll be getting decent network cards in a couple of days, but since
the problem doesn't seem to be related to a particular
card/model/manufacturer, that might not fix it, either.

This problem seems to occur only on machines that operate as a Xen
server.  Other machines, identical Z800s, not running Xen, run just fine.

What would you suggest?

More info required:

- Which version of Xen

4.5.1

Installed versions:  4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug
-efi -flask -xsm)

Ok, recent one.

- Does this only occur with HVM guests?

The host has been running only HVM guests every time it happened.
It was running a PV guest in between (which I had to shut down
because other VMs were migrated, requiring the RAM).

The PV didn't have any issues?

The whole server has the issue, not a particular VM.  While the PV guest
was running, the server didn't freeze.

- Which network-driver are you using inside the guest

r8169, compiled as a module

The same happened with the tg3 driver when the on-board cards were used.
The tg3 driver is completely disabled in the kernel config, i.e.
not even compiled as a module.
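To double-check which driver is actually bound to the interface, sysfs can be queried directly (a sketch; `enp55s4` is the interface name from the logs):

```shell
#!/bin/sh
# Sketch: report the driver bound to a network interface via sysfs.
driver_of() {
    # $1: the interface's device directory, e.g. /sys/class/net/enp55s4/device
    [ -e "$1/driver" ] || { echo "no driver bound under $1" >&2; return 1; }
    basename "$(readlink -f "$1/driver")"
}

driver_of /sys/class/net/enp55s4/device || true   # should print r8169 on this host
```

`ethtool -i enp55s4` would report the same, plus the driver version and firmware.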

You have network cards assigned to the guests?

No, they are all connected via a bridge.

I enabled STP on the bridge and the server was ok for a week, then had
to be restarted.  I'm seeing lots of messages in the log:


Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating


and sometimes:

Oct 28 10:47:04 moonflo kernel: brloc: port 1(enp55s4) neighbor 8000.00:00:10:11:12:00 lost


Any idea what this means?

(Google has gone on strike, and another search engine didn't give any useful
findings ...)
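For what it's worth: "received tcn bpdu" means another STP-speaking device on that port announced a topology change, and "topology change detected, propagating" means brloc reacts to it (shortening its MAC aging time while the change is active).  The bridge's view of STP can be inspected like this (a sketch; `brctl` is from bridge-utils, which the vif scripts here already use, and the sysfs path assumes the bridge is named brloc):

```shell
#!/bin/sh
# Sketch: dump the STP state of the bridge named in the messages above.
BR=brloc
if command -v brctl >/dev/null 2>&1; then
    brctl showstp "$BR" || true     # per-port states, root bridge, topology-change flag
else
    # Same flag via sysfs: reads 1 while a topology change is being propagated
    cat "/sys/class/net/$BR/bridge/topology_change" 2>/dev/null || true
fi
```

If the TCN BPDUs keep arriving every few seconds, the device behind port 1 (the physical NIC's switch) is the one flapping, not the bridge itself.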


- Can you connect to the "local" console of the guest?

Yes, the host seems to be running fine except for having no network
connectivity.  There's a keyboard and monitor physically connected to
it with which you can log in and do stuff.

The HOST loses network connectivity?

Yes.

Apparently when it became unresponsive yesterday, it was not possible
to log in at the console, either.  I wasn't there yesterday, though I've
seen that happen before.  We tried to shut it down via acpid by pressing the
power button.  It didn't turn off, so it was switched off by holding the power
button.  What I can see in the log is:


Oct 28 14:12:33 moonflo logger[20322]: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/768
Oct 28 14:12:33 moonflo logger[20323]: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/2/0
Oct 28 14:12:33 moonflo logger[20347]: /etc/xen/scripts/vif-bridge: brctl delif brloc vif2.0 failed
Oct 28 14:12:33 moonflo logger[20353]: /etc/xen/scripts/vif-bridge: ifconfig vif2.0 down failed
Oct 28 14:12:33 moonflo logger[20361]: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif2.0, bridge brloc.
Oct 28 14:12:33 moonflo logger[20372]: /etc/xen/scripts/vif-bridge: remove type_if=tap XENBUS_PATH=backend/vif/2/0
Oct 28 14:12:33 moonflo logger[20391]: /etc/xen/scripts/vif-bridge: Successful vif-bridge remove for vif2.0-emu, bridge brloc.
Oct 28 14:15:33 moonflo shutdown[20476]: shutting down for system halt
[run of NUL bytes from the hard power-off]
Oct 28 14:17:34 moonflo syslog-ng[4611]: syslog-ng starting up; version='3.6.2'


And:


Oct 24 11:47:42 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
Oct 24 11:47:42 moonflo kernel: Modules linked in: xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic video spl(O) backlight zlib_deflate drm_kms_helper snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer r8169 snd ttm soundcore mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
Oct 24 11:47:42 moonflo kernel: CPU: 12 PID: 0 Comm: swapper/12 Tainted: P           O    4.0.5-gentoo #3
Oct 24 11:47:42 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
Oct 24 11:47:42 moonflo kernel:  ffffffff8175a77d ffff880124d83d98 ffffffff814da8d8 0000000000000001
Oct 24 11:47:42 moonflo kernel:  ffff880124d83de8 ffff880124d83dd8 ffffffff81088850 ffff880124d83e68
Oct 24 11:47:42 moonflo kernel:  0000000000000000 ffff88011efd8000 0000000000000001 ffff8800d4eb5e80
Oct 24 11:47:42 moonflo kernel: Call Trace:
Oct 24 11:47:42 moonflo kernel:  <IRQ>  [<ffffffff814da8d8>] dump_stack+0x45/0x57
Oct 24 11:47:42 moonflo kernel:  [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
Oct 24 11:47:42 moonflo kernel:  [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
Oct 24 11:47:42 moonflo kernel:  [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
Oct 24 11:47:42 moonflo kernel:  [<ffffffff8145b819>] dev_watchdog+0x259/0x270
Oct 24 11:47:42 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 24 11:47:42 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 24 11:47:42 moonflo kernel:  [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
Oct 24 11:47:42 moonflo kernel:  [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
Oct 24 11:47:42 moonflo kernel:  [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
Oct 24 11:47:42 moonflo kernel:  [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
Oct 24 11:47:42 moonflo kernel:  [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
Oct 24 11:47:42 moonflo kernel:  [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
Oct 24 11:47:42 moonflo kernel:  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 24 11:47:42 moonflo kernel:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 24 11:47:42 moonflo kernel:  [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
Oct 24 11:47:42 moonflo kernel:  [<ffffffff81053979>] ? default_idle+0x9/0x10
Oct 24 11:47:42 moonflo kernel:  [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
Oct 24 11:47:42 moonflo kernel:  [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
Oct 24 11:47:42 moonflo kernel:  [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
Oct 24 11:47:42 moonflo kernel: ---[ end trace 320b6f98f8fc070f ]---
Oct 24 11:47:42 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up


That was two days before it went down.  After that, messages about
topology changes started to appear.

I'm not sure if I should call this "progress" ;)


You get no answer when you ping the host while it is unreachable.

- If yes, does it still have no connectivity?

It has been restarted this morning when it was found to be unreachable.

I saw the same on my lab machine, which was related to:
- Not using correct drivers inside HVM guests

There are Windoze 7 guests running that have PV drivers installed.
One of those had formerly been running on a VMware host and was
migrated on Tuesday.  I uninstalled the VMware tools from it.

Which PV drivers?

Xen GPL PV Driver Developers
17.09.2014
0.11.0.373
Univention GmbH

And did you ensure all VMWare related drivers were removed?
I am not convinced uninstalling the VMWare tools is sufficient.

What would I need to look at to make sure they are removed?

The problem was already there before the VM that had VMware drivers
installed was migrated to this server.  So I don't think they are
causing this problem.


Since Monday, a HVM Linux system (a modified 32-bit Debian) has also
been migrated from the VMware host to this one.  I don't know if it
has VMware tools installed (I guess it does because it could be shut
down via VMware) and how those might react now.  It's working, and I
don't want to touch it.

However, the problem already occurred before this migration, when the
on-board cards were still used.

- Switch hardware not keeping the MAC/IP/Port lists long enough

What might be the reason for the lists becoming too short?  Too many
devices connected to the network?

No network activity for a while. (clean installs, nothing running)
Switch forgetting the MAC-address assigned to the VM.

Connecting to the VM-console, I could ping www.google.com and then the
connectivity re-appeared.

Half of the switches were replaced last week in order to track down
what appears to be a weird network problem: the RDP clients are being
randomly stalled.  If it was only that, I'd suspect this server some more,
but the internet connection goes through the same switches and is
apparently also slowed down when the RDP clients are stalled.  They also
got randomly stalled when the RDP clients were accessing a totally
different server (the VMware server), so this might be entirely unrelated.

Replacing the switches didn't fix the problem, so I'll probably put them
back into service and replace the other half.

The host has been connected to two different switches and showed the
problem.  Previously, that was an 8-port 1Gb switch, now it's a 24-port
1Gb switch.  However, the 8-port switch is also connected to the 24-port
switch the host is now connected to.  (The 24-port switch connects it
"directly" to the rest of the network.)

Assuming it's a managed switch, you could test this.
Alternatively, check if you can access the VMs from the host.

Good idea, I'll try that when it happens when I'm here.
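A possible checklist for the next outage, run from the host console, to narrow down where traffic stops (a sketch; 192.0.2.10 is a documentation-range placeholder, substitute a real guest address):

```shell
#!/bin/sh
# Sketch: run at the host console while the network is unreachable.
IFACE=enp55s4          # physical NIC from the logs
BR=brloc               # bridge from the logs
VM_IP=${1:-192.0.2.10} # placeholder; pass a guest's real address

ip -s link show "$IFACE" || true       # TX/RX/error counters: is anything leaving?
if command -v bridge >/dev/null 2>&1; then
    bridge fdb show br "$BR" || true   # MACs the bridge has learned
else
    brctl showmacs "$BR" || true
fi
ping -c 3 -W 1 "$VM_IP" || true        # host -> guest, over the bridge only
```

If the host can still ping the guests while the outside world cannot reach either, the fault is on the physical side (NIC or switch), not in the bridge or the VMs.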

The network cards have arrived: Intel PRO/1000 dual port, made for IBM.
I hope I get to swap the card today.  Those *really* should work.

Hm, I could plug in two of them and give each VM and the host its own
physical card.  Do you think that might help?
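On the dedicated-card idea: with xen_pciback already loaded (it shows up in the module lists above), a NIC can in principle be handed to a guest like this.  A sketch, not tested on this box; the BDF shown is the current r8169 card's address from the log, and you'd use the new Intel cards' addresses instead.  Note that dom0 then loses the use of the card and the guest needs its own driver for it:

```shell
#!/bin/sh
# Sketch: detach a NIC from dom0 and make it assignable to a guest.
BDF=0000:37:04.0   # PCI address from "r8169 0000:37:04.0 enp55s4" above
if command -v xl >/dev/null 2>&1; then
    xl pci-assignable-add "$BDF"
    xl pci-assignable-list
else
    echo "xl not found; run this on the Xen host" >&2
fi
# In the guest's config file, the device is then referenced as:
#   pci = [ '0000:37:04.0' ]
```

Whether it helps depends on where the fault is: it takes the bridge out of the picture for that guest, but if the watchdog timeouts are a dom0/Xen interrupt problem, passthrough may just move the symptom into the guest.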

