On 29 October 2015 11:29:18 CET, hw <h...@gc-24.de> wrote:
>J. Roeleveld wrote:
>> On Thursday, October 15, 2015 05:46:07 PM hw wrote:
>>> J. Roeleveld wrote:
>>>> On Thursday, October 15, 2015 03:30:01 PM hw wrote:
>>>>> Hi,
>>>>>
>>>>> I have a xen host with some HV guests which becomes unreachable via
>>>>> the network after apparently random amounts of time. I have already
>>>>> switched the network card to see if that would make a difference,
>>>>> and with the card currently installed, it worked fine for over 20 days
>>>>> until it became unreachable again. Before switching the network card,
>>>>> it would run a week or two before becoming unreachable. The previous
>>>>> card was the on-board BCM5764M which uses the tg3 driver.
>>>>>
>>>>> There are messages like this in the log file:
>>>>>
>>>>>
>>>>> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
>>>>> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
>>>>> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>>>>> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>>>>> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3
>>>>> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>>>>> Oct 14 20:58:02 moonflo kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
>>>>> Oct 14 20:58:02 moonflo kernel: ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
>>>>> Oct 14 20:58:02 moonflo kernel: 0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
>>>>> Oct 14 20:58:02 moonflo kernel: Call Trace:
>>>>> Oct 14 20:58:02 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>>>>> Oct 14 20:58:02 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>>>>> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
>>>>> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
>>>>>
>>>>>
>>>>> After that, there are lots of messages about the link being up, one
>>>>> message every 12 seconds. When you unplug the network cable, you get
>>>>> a message that the link is down, and no message when you plug it in
>>>>> again.
>>>>>
>>>>> I was hoping that switching the network card (to one that uses a
>>>>> different driver) might solve the problem, and it did not. Now I can
>>>>> only guess that the network card goes to sleep and sometimes cannot
>>>>> be woken up again.
>>>>>
>>>>> I tried to reduce the connection speed to 100Mbit and found that
>>>>> accessing the VMs (via RDP) becomes too slow to use them. So I
>>>>> disabled the power management of the network card (through sysfs)
>>>>> and will have to see if the problem persists.
>>>>>
>>>>> We'll be getting decent network cards in a couple days, but since
>>>>> the problem doesn't seem to be related to a particular
>>>>> card/model/manufacturer, that might not fix it, either.
>>>>>
>>>>> This problem seems to only occur on machines that operate as a xen
>>>>> server. Other machines, identical Z800s, not running xen, run just
>>>>> fine.
>>>>>
>>>>> What would you suggest?
>>>>
>>>> More info required:
>>>>
>>>> - Which version of Xen
>>>
>>> 4.5.1
>>>
>>> Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug -efi -flask -xsm)
>>
>> Ok, recent one.
>>
>>>> - Does this only occur with HVM guests?
>>>
>>> The host has been running only HVM guests every time it happened.
>>> It was running a PV guest in between (which I had to shut down
>>> because other VMs were migrated, requiring the RAM).
>>
>> The PV didn't have any issues?
>
>The whole server has the issue, not a particular VM. While the PV guest
>was running, the server didn't freeze.
>
>>>> - Which network-driver are you using inside the guest
>>>
>>> r8169, compiled as a module
>>>
>>> Same happened with the tg3 driver when the on-board cards were used.
>>> The tg3 driver is completely disabled in the kernel config, i.e.
>>> not even compiled as a module.
>>
>> You have network cards assigned to the guests?
>
>No, they are all connected via a bridge.
>
>I enabled STP on the bridge and the server was ok for a week, then had
>to be restarted. I'm seeing lots of messages in the log:
>
>
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>
>
>and sometimes:
>
>Oct 28 10:47:04 moonflo kernel: brloc: port 1(enp55s4) neighbor 8000.00:00:10:11:12:00 lost
>
>
>Any idea what this means?
>
>(Google has gone on strike, and another search engine didn't give any
>useful findings ...)
>
>
>>>> - Can you connect to the "local" console of the guest?
>>>
>>> Yes, the host seems to be running fine except for having no network
>>> connectivity. There's a keyboard and monitor physically connected to
>>> it with which you can log in and do stuff.
>>
>> The HOST loses network connectivity?
>
>Yes.
>
>Apparently when it became unresponsive yesterday, it was not possible
>to log in at the console, either. I wasn't there yesterday, though I've
>seen that happen before. We tried to shut it down via acpid by pressing
>the power button. It didn't turn off, so it was switched off by holding
>the power button. What I can see in the log is:
>
>
>Oct 28 14:12:33 moonflo logger[20322]: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/768
>Oct 28 14:12:33 moonflo logger[20323]: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/2/0
>Oct 28 14:12:33 moonflo logger[20347]: /etc/xen/scripts/vif-bridge: brctl delif brloc vif2.0 failed
>Oct 28 14:12:33 moonflo logger[20353]: /etc/xen/scripts/vif-bridge: ifconfig vif2.0 down failed
>Oct 28 14:12:33 moonflo logger[20361]: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif2.0, bridge brloc.
>Oct 28 14:12:33 moonflo logger[20372]: /etc/xen/scripts/vif-bridge: remove type_if=tap XENBUS_PATH=backend/vif/2/0
>Oct 28 14:12:33 moonflo logger[20391]: /etc/xen/scripts/vif-bridge: Successful vif-bridge remove for vif2.0-emu, bridge brloc.
>Oct 28 14:15:33 moonflo shutdown[20476]: shutting down for system halt
>^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Oct 28 14:17:34 moonflo syslog-ng[4611]: syslog-ng starting up; version='3.6.2'
>
>
>And:
>
>
>Oct 24 11:47:42 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>Oct 24 11:47:42 moonflo kernel: Modules linked in: xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic video spl(O) backlight zlib_deflate drm_kms_helper snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer r8169 snd ttm soundcore mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>Oct 24 11:47:42 moonflo kernel: CPU: 12 PID: 0 Comm: swapper/12 Tainted: P O 4.0.5-gentoo #3
>Oct 24 11:47:42 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>Oct 24 11:47:42 moonflo kernel: ffffffff8175a77d ffff880124d83d98 ffffffff814da8d8 0000000000000001
>Oct 24 11:47:42 moonflo kernel: ffff880124d83de8 ffff880124d83dd8 ffffffff81088850 ffff880124d83e68
>Oct 24 11:47:42 moonflo kernel: 0000000000000000 ffff88011efd8000 0000000000000001 ffff8800d4eb5e80
>Oct 24 11:47:42 moonflo kernel: Call Trace:
>Oct 24 11:47:42 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
>Oct 24 11:47:42 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>Oct 24 11:47:42 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>Oct 24 11:47:42 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>Oct 24 11:47:42 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>Oct 24 11:47:42 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>Oct 24 11:47:42 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>Oct 24 11:47:42 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>Oct 24 11:47:42 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>Oct 24 11:47:42 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>Oct 24 11:47:42 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>Oct 24 11:47:42 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>Oct 24 11:47:42 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>Oct 24 11:47:42 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
>Oct 24 11:47:42 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>Oct 24 11:47:42 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>Oct 24 11:47:42 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>Oct 24 11:47:42 moonflo kernel: ---[ end trace 320b6f98f8fc070f ]---
>Oct 24 11:47:42 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
>
>
>That was two days before it went down. After that, messages about
>topology changes are starting to appear.
>
>I'm not sure if I should call this "progress" ;)
>
>>
>>> You get no answer when you ping the host while it is unreachable.
>>>
>>>> - If yes, does it still have no connectivity?
>>>
>>> It has been restarted this morning when it was found to be unreachable.
>>>
>>>> I saw the same on my lab machine, which was related to:
>>>> - Not using correct drivers inside HVM guests
>>>
>>> There are Windoze 7 guests running that have PV drivers installed.
>>> One of those has formerly been running on a VMware host and was
>>> migrated on Tuesday. I uninstalled the VMware tools from it.
>>
>> Which PV drivers?
>
>Xen GPL PV Driver Developers
>17.09.2014
>0.11.0.373
>Univention GmbH
>
>> And did you ensure all VMWare related drivers were removed?
>> I am not convinced uninstalling the VMWare tools is sufficient.
>
>What would I need to look at to make sure they are removed?
>
>The problem has been there before the VM that had VMWare drivers
>installed was migrated to this server. So I don't think they are
>causing this problem.
>
>
>>> Since Monday, a HVM Linux system (a modified 32-bit Debian) has also
>>> been migrated from the VMware host to this one. I don't know if it
>>> has VMware tools installed (I guess it does because it could be shut
>>> down via VMware) and how those might react now. It's working, and I
>>> don't want to touch it.
>>>
>>> However, the problem already occurred before this migration, when the
>>> on-board cards were still used.
>>>
>>>> - Switch hardware not keeping the MAC/IP/Port lists long enough
>>>
>>> What might be the reason for the lists becoming too short? Too many
>>> devices connected to the network?
>>
>> No network activity for a while. (clean installs, nothing running)
>> Switch forgetting the MAC-address assigned to the VM.
>>
>> Connecting to the VM-console, I could ping www.google.com and then the
>> connectivity re-appeared.
>
>Half of the switches have been replaced last week in order to track down
>what appears to be a weird network problem. The problem is that the RDP
>clients are being randomly stalled. If it was only that, I'd suspect this
>server some more, but the internet connection goes through the same
>switches and is apparently also slowed down when the RDP clients are
>stalled. They also got randomly stalled when the RDP clients were
>accessing a totally different server (the VMWare server), so this might
>be entirely unrelated.
>
>Replacing the switches didn't fix the problem, so I'll probably put them
>back into service and replace the other half.
>
>>> The host has been connected to two different switches and showed the
>>> problem. Previously, that was an 8-port 1Gb switch, now it's a 24-port
>>> 1Gb switch. However, the 8-port switch is also connected to the 24-port
>>> switch the host is now connected to. (The 24-port switch connects it
>>> "directly" to the rest of the network.)
>>
>> Assuming it's a managed switch, you could test this.
>> Alternatively, check if you can access the VMs from the host.
>
>Good idea, I'll try that when it happens when I'm here.
>
>The network cards have arrived, Intel PRO 1000 dual port, made for IBM.
>I hope I get to swap the card today. Those *really* should work.
>
>Hm, I could plug in two of them and give each VM and the host its own
>physical card. Do you think that might help?

Quick reply from mobile. Will give a more detailed one later.

Noticed you are using ZFS.
Where is your swap partition located? On ZFS or elsewhere?

--
Joost

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
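
For reference, a minimal sketch of one way to answer the swap question on the host: read /proc/swaps and flag any swap device that looks like a ZFS volume. The function names are hypothetical, and the prefixes /dev/zd and /dev/zvol/ are an assumption about how ZFS on Linux names its volumes; adjust them if your setup differs.

```python
from pathlib import Path

# Assumption: ZFS volumes (zvols) show up as /dev/zdN or under /dev/zvol/.
ZVOL_PREFIXES = ("/dev/zd", "/dev/zvol/")


def parse_swaps(text):
    """Return (path, type) pairs from text in /proc/swaps format."""
    entries = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def swap_on_zvol(entries):
    """Return the swap paths that look like ZFS volumes."""
    return [path for path, _kind in entries if path.startswith(ZVOL_PREFIXES)]


if __name__ == "__main__":
    proc_swaps = Path("/proc/swaps")
    # Fall back to an empty listing on systems without /proc/swaps.
    text = proc_swaps.read_text() if proc_swaps.exists() else ""
    entries = parse_swaps(text)
    for path, kind in entries:
        print(f"{path}\t{kind}")
    zvols = swap_on_zvol(entries)
    print("swap on ZFS volume:", ", ".join(zvols) if zvols else "none found")
```

This only inspects device names; it says nothing about whether swap on a ZFS volume is actually behind the hangs.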