On 29 October 2015 11:29:18 CET, hw <h...@gc-24.de> wrote:
>J. Roeleveld wrote:
>> On Thursday, October 15, 2015 05:46:07 PM hw wrote:
>>> J. Roeleveld wrote:
>>>> On Thursday, October 15, 2015 03:30:01 PM hw wrote:
>>>>> Hi,
>>>>>
>>>>> I have a xen host with some HVM guests which becomes unreachable via
>>>>> the network after an apparently random amount of time.  I have already
>>>>> switched the network card to see if that would make a difference,
>>>>> and with the card currently installed, it worked fine for over 20 days
>>>>> until it became unreachable again.  Before switching the network card,
>>>>> it would run a week or two before becoming unreachable.  The previous
>>>>> card was the on-board BCM5764M, which uses the tg3 driver.
>>>>>
>>>>> There are messages like this in the log file:
>>>>>
>>>>>
>>>>> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
>>>>> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
>>>>> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>>>>> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>>>>> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P           O    4.0.5-gentoo #3
>>>>> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>>>>> Oct 14 20:58:02 moonflo kernel:  ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
>>>>> Oct 14 20:58:02 moonflo kernel:  ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
>>>>> Oct 14 20:58:02 moonflo kernel:  0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
>>>>> Oct 14 20:58:02 moonflo kernel: Call Trace:
>>>>> Oct 14 20:58:02 moonflo kernel:  <IRQ>  [<ffffffff814da8d8>] dump_stack+0x45/0x57
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>>>>> Oct 14 20:58:02 moonflo kernel:  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff81053979>] ? default_idle+0x9/0x10
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>>>>> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
>>>>> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
>>>>>
>>>>>
>>>>> After that, there are lots of messages about the link being up, one
>>>>> message every 12 seconds.  When you unplug the network cable, you get
>>>>> a message that the link is down, and no message when you plug it in
>>>>> again.
>>>>>
>>>>> I was hoping that switching the network card (to one that uses a
>>>>> different driver) might solve the problem, but it did not.  Now I can
>>>>> only guess that the network card goes to sleep and sometimes cannot
>>>>> be woken up again.
>>>>>
>>>>> I tried to reduce the connection speed to 100Mbit and found that
>>>>> accessing the VMs (via RDP) becomes too slow to use them.  So I
>>>>> disabled the power management of the network card (through sysfs)
>>>>> and will have to see if the problem persists.
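For reference, disabling runtime power management through sysfs usually amounts to the following; this is a sketch, with the PCI address taken from the "r8169 0000:37:04.0 enp55s4" log line above, so adjust it for the device in question:

```shell
# Sketch: disable runtime power management for the NIC via sysfs.
# The PCI address matches the r8169 device from the log above
# ("r8169 0000:37:04.0 enp55s4"); adjust for your system.
echo on > /sys/bus/pci/devices/0000:37:04.0/power/control
# "on" means the device is kept powered (runtime PM disabled);
# "auto" would re-enable runtime power management.
cat /sys/bus/pci/devices/0000:37:04.0/power/control
```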
>>>>>
>>>>> We'll be getting decent network cards in a couple of days, but since
>>>>> the problem doesn't seem to be related to a particular
>>>>> card/model/manufacturer, that might not fix it, either.
>>>>>
>>>>> This problem seems to only occur on machines that operate as a xen
>>>>> server.  Other machines, identical Z800s, not running xen, run just
>>>>> fine.
>>>>>
>>>>> What would you suggest?
>>>>
>>>> More info required:
>>>>
>>>> - Which version of Xen
>>>
>>> 4.5.1
>>>
>>> Installed versions:  4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug -efi -flask -xsm)
>>
>> Ok, recent one.
>>
>>>> - Does this only occur with HVM guests?
>>>
>>> The host has been running only HVM guests every time it happened.
>>> It was running a PV guest in between (which I had to shut down
>>> because other VMs were migrated, requiring the RAM).
>>
>> The PV didn't have any issues?
>
>The whole server has the issue, not a particular VM.  While the PV guest
>was running, the server didn't freeze.
>
>>>> - Which network-driver are you using inside the guest
>>>
>>> r8169, compiled as a module
>>>
>>> Same happened with the tg3 driver when the on-board cards were used.
>>> The tg3 driver is completely disabled in the kernel config, i. e.
>>> not even compiled as a module.
>>
>> You have network cards assigned to the guests?
>
>No, they are all connected via a bridge.
>
>I enabled STP on the bridge and the server was OK for a week, then had
>to be restarted.  I'm seeing lots of messages in the log:
>
>
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>
>
>and sometimes:
>
>Oct 28 10:47:04 moonflo kernel: brloc: port 1(enp55s4) neighbor 8000.00:00:10:11:12:00 lost
>
>
>Any idea what this means?
>
>(Google has gone on strike, and another search engine didn't give any
>useful findings ...)
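For what it's worth, a TCN (topology change notification) BPDU is sent by a downstream bridge when one of its links changes state, so a constant stream of them on port 1 suggests that a bridge behind that port keeps flapping. The bridge's STP state can be inspected with read-only commands like the following (bridge name taken from the log; `brctl` is from bridge-utils):

```shell
# Show the STP state of the bridge and its ports, including the
# designated root and any topology-change flags.
brctl showstp brloc
# The same flags are exposed in sysfs:
cat /sys/class/net/brloc/bridge/topology_change
cat /sys/class/net/brloc/bridge/topology_change_detected
```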
>
>
>>>> - Can you connect to the "local" console of the guest?
>>>
>>> Yes, the host seems to be running fine except for having no network
>>> connectivity.  There's a keyboard and monitor physically connected to
>>> it with which you can log in and do stuff.
>>
>> The HOST loses network connectivity?
>
>Yes.
>
>Apparently when it became unresponsive yesterday, it was not possible
>to log in at the console, either.  I wasn't there yesterday, though
>I've seen that happen before.  We tried to shut it down via acpid by
>pressing the power button.  It didn't turn off, so it was switched off
>by holding the power button.  What I can see in the log is:
>
>
>Oct 28 14:12:33 moonflo logger[20322]: /etc/xen/scripts/block: remove
>XENBUS_PATH=backend/vbd/2/768
>Oct 28 14:12:33 moonflo logger[20323]: /etc/xen/scripts/vif-bridge:
>offline type_if=vif XENBUS_PATH=backend/vif/2/0
>Oct 28 14:12:33 moonflo logger[20347]: /etc/xen/scripts/vif-bridge:
>brctl delif brloc vif2.0 failed
>Oct 28 14:12:33 moonflo logger[20353]: /etc/xen/scripts/vif-bridge:
>ifconfig vif2.0 down failed
>Oct 28 14:12:33 moonflo logger[20361]: /etc/xen/scripts/vif-bridge:
>Successful vif-bridge offline for vif2.0, bridge brloc.
>Oct 28 14:12:33 moonflo logger[20372]: /etc/xen/scripts/vif-bridge:
>remove type_if=tap XENBUS_PATH=backend/vif/2/0
>Oct 28 14:12:33 moonflo logger[20391]: /etc/xen/scripts/vif-bridge:
>Successful vif-bridge remove for vif2.0-emu, bridge brloc.
>Oct 28 14:15:33 moonflo shutdown[20476]: shutting down for system halt
>^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
>Oct 28 14:17:34 moonflo syslog-ng[4611]: syslog-ng starting up; version='3.6.2'
>
>
>And:
>
>
>Oct 24 11:47:42 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>Oct 24 11:47:42 moonflo kernel: Modules linked in: xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic video spl(O) backlight zlib_deflate drm_kms_helper snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer r8169 snd ttm soundcore mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>Oct 24 11:47:42 moonflo kernel: CPU: 12 PID: 0 Comm: swapper/12 Tainted: P           O    4.0.5-gentoo #3
>Oct 24 11:47:42 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>Oct 24 11:47:42 moonflo kernel:  ffffffff8175a77d ffff880124d83d98 ffffffff814da8d8 0000000000000001
>Oct 24 11:47:42 moonflo kernel:  ffff880124d83de8 ffff880124d83dd8 ffffffff81088850 ffff880124d83e68
>Oct 24 11:47:42 moonflo kernel:  0000000000000000 ffff88011efd8000 0000000000000001 ffff8800d4eb5e80
>Oct 24 11:47:42 moonflo kernel: Call Trace:
>Oct 24 11:47:42 moonflo kernel:  <IRQ>  [<ffffffff814da8d8>] dump_stack+0x45/0x57
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>Oct 24 11:47:42 moonflo kernel:  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff81053979>] ? default_idle+0x9/0x10
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>Oct 24 11:47:42 moonflo kernel: ---[ end trace 320b6f98f8fc070f ]---
>Oct 24 11:47:42 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
>
>
>That was two days before it went down.  After that, messages about
>topology changes started to appear.
>
>I'm not sure if I should call this "progress" ;)
>
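One quick way to see whether the watchdog warnings and the STP storms line up in time is to count matching log lines per hour; a sketch with standard grep/awk (the here-document is just a stand-in sample for /var/log/messages):

```shell
# Count NETDEV WATCHDOG and tcn-bpdu events per hour from a syslog file.
# The here-document is a stand-in sample for /var/log/messages.
cat > /tmp/moonflo-sample.log <<'EOF'
Oct 24 11:47:42 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:59:59 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
EOF
# Group matching lines by "Mon DD HH" so bursts stand out.
grep -E 'NETDEV WATCHDOG|received tcn bpdu' /tmp/moonflo-sample.log \
  | awk '{print $1, $2, substr($3, 1, 2)}' | sort | uniq -c
```

Run against the real log, a burst of tcn-bpdu entries in the same hour as a watchdog timeout would suggest the two are connected.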
>>
>>> You get no answer when you ping the host while it is unreachable.
>>>
>>>> - If yes, does it still have no connectivity?
>>>
>>> It has been restarted this morning when it was found to be unreachable.
>>>
>>>> I saw the same on my lab machine, which was related to:
>>>> - Not using correct drivers inside HVM guests
>>>
>>> There are Windoze 7 guests running that have PV drivers installed.
>>> One of those has formerly been running on a VMware host and was
>>> migrated on Tuesday.  I deinstalled the VMware tools from it.
>>
>> Which PV drivers?
>
>Xen GPL PV Driver Developers
>17.09.2014
>0.11.0.373
>Univention GmbH
>
>> And did you ensure all VMWare related drivers were removed?
>> I am not convinced uninstalling the VMWare tools is sufficient.
>
>What would I need to look at to make sure they are removed?
>
>The problem was already there before the VM that had the VMware drivers
>installed was migrated to this server.  So I don't think they are
>causing this problem.
>
>
>>> Since Monday, a HVM Linux system (a modified 32-bit Debian) has also
>>> been migrated from the VMware host to this one.  I don't know if it
>>> has VMware tools installed (I guess it does because it could be shut
>>> down via VMware) and how those might react now.  It's working, and I
>>> don't want to touch it.
>>>
>>> However, the problem already occurred before this migration, when the
>>> on-board cards were still used.
>>>
>>>> - Switch hardware not keeping the MAC/IP/Port lists long enough
>>>
>>> What might be the reason for the lists becoming too short?  Too many
>>> devices connected to the network?
>>
>> No network activity for a while. (clean installs, nothing running)
>> Switch forgetting the MAC-address assigned to the VM.
>>
>> Connecting to the VM-console, I could ping www.google.com and then the
>> connectivity re-appeared.
>
>Half of the switches were replaced last week in order to track down
>what appears to be a weird network problem.  The problem is that the
>RDP clients are being randomly stalled.  If it were only that, I'd
>suspect this server some more, but the internet connection goes through
>the same switches and is apparently also slowed down when the RDP
>clients are stalled.  They also got randomly stalled when the RDP
>clients were accessing a totally different server (the VMware server),
>so this might be entirely unrelated.
>
>Replacing the switches didn't fix the problem, so I'll probably put
>them back into service and replace the other half.
>
>>> The host has been connected to two different switches and showed the
>>> problem.  Previously, that was an 8-port 1Gb switch; now it's a
>>> 24-port 1Gb switch.  However, the 8-port switch is also connected to
>>> the 24-port switch the host is now connected to.  (The 24-port switch
>>> connects it "directly" to the rest of the network.)
>>
>> Assuming it's a managed switch, you could test this.
>> Alternatively, check if you can access the VMs from the host.
>
>Good idea, I'll try that when it happens while I'm here.
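A host-side reachability check can also be scripted so the result is captured the moment the problem recurs; a minimal sketch, with placeholder guest and gateway addresses:

```shell
# Sketch: from dom0, ping each guest and the gateway and log the result.
# The addresses below are placeholders; substitute the real IPs.
for host in 192.168.0.1 192.168.0.21 192.168.0.22; do
    if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
        echo "$(date '+%b %d %H:%M:%S') $host reachable"
    else
        echo "$(date '+%b %d %H:%M:%S') $host NOT reachable"
    fi
done
```

Run from cron and appended to a file, this would show whether dom0 can still reach the guests over the bridge while the outside network cannot.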
>
>The network cards have arrived: Intel PRO/1000 dual port, made for IBM.
>I hope I get to swap the card today.  Those *really* should work.
>
>Hm, I could plug in two of them and give each VM and the host its own
>physical card.  Do you think that might help?
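Giving each VM its own physical card means PCI passthrough rather than bridging. With xen-pciback already loaded (it appears in the module lists above), the rough shape under xl is as follows; the PCI addresses are examples, not taken from this machine:

```shell
# Sketch: hand a physical NIC to one HVM guest via xen-pciback.
# PCI addresses are examples; list the real ones with lspci.
xl pci-assignable-add 0000:37:00.0   # detach the NIC from dom0
xl pci-assignable-list               # verify it is now assignable
# Then, in the guest's config file, instead of (or alongside) the
# vif line:
#   pci = [ '0000:37:00.0' ]
```

Note the guest then needs its own driver for that NIC, and dom0 loses the device, so the host still needs a separate card for its own connectivity.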

Quick reply from mobile.
Will give a more detailed one later.

Noticed you are using ZFS. Where is your swap partition located?

On ZFS or?

--
Joost 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
