21.12.2018 14:24, Mike Lykov wrote:

I have a hyperconverged 4.2.7 setup with two deployed Engine VM images, and I had a 20-30 second network outage. After some ping-ponging of the engine start between host 1, then 2, then 1 again, the Engine image got stuck at
"Probing EDD (edd=off to disable)... _"
as here: https://bugzilla.redhat.com/show_bug.cgi?id=1569827

Now I am looking at the logs.
Full /var/log archives are here:
https://yadi.sk/d/XZ5jJfQLN6QMlA (HE engine logs) - 36 MB
https://yadi.sk/d/bZ0TYGxFoHGgIQ (ovirtnode6 logs) - 144 MB

I CC'd some personal addresses on this email; if it is not relevant to you, please ignore.

Host nodes (CentOS 7.5) are named ovirtnode1, 5, and 6. Timeouts (in the HA agent) are the defaults. Sanlock is configured (as far as I can tell).
The HE is running on ovirtnode6, and a spare HE host is deployed on ovirtnode1.
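
For reference, the HA and lock state can be checked on each host with the standard tools (a minimal sketch; the exact output varies by version):

------------------------------
# HA view of all hosts: scores, agent states, engine VM status
hosted-engine --vm-status

# sanlock lockspaces and resources currently held on this host
sanlock client status

# both HA daemons should be active
systemctl status ovirt-ha-agent ovirt-ha-broker
------------------------------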

There are two network links: ovirtmgmt over "ovirtmgmt: port 1(enp59s0f0)", and the glusterfs storage network over the ib0 interface (a different subnet).

The messages log on ovirtnode6 shows the outage:

---------------------------
Dec 21 12:32:56 ovirtnode6 kernel: bnx2x 0000:3b:00.0 enp59s0f0: NIC Link is Down
Dec 21 12:32:56 ovirtnode6 kernel: ovirtmgmt: port 1(enp59s0f0) entered disabled state
Dec 21 12:33:13 ovirtnode6 kernel: bnx2x 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Dec 21 12:33:13 ovirtnode6 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0: link becomes ready
Dec 21 12:33:13 ovirtnode6 kernel: ovirtmgmt: port 1(enp59s0f0) entered forwarding state
Dec 21 12:33:13 ovirtnode6 NetworkManager[1715]: <info> [1545381193.2204] device (enp59s0f0): carrier: link connected
-----------------------

That is a 17-second outage; at 12:33:13 the link was back up. BUT all the events leading to the crash follow later:

HA agent log:
------------------------------
MainThread::INFO::2018-12-21 12:32:59,540::states::444::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm running on localhost
MainThread::INFO::2018-12-21 12:32:59,662::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2018-12-21 12:33:09,797::states::136::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1280 due to gateway status
MainThread::INFO::2018-12-21 12:33:09,798::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2120)
MainThread::ERROR::2018-12-21 12:33:19,815::states::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ovirtnode1.miac (id 1) score is significantly better than local score, shutting down VM on this host
----------------------------------------------
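
The arithmetic behind those score lines is visible in the log itself. A minimal sketch (I am taking 3400 as the full base score, which matches the "score: 3400" lines; I have not verified in the agent source how exactly the 1280 gateway penalty is derived):

------------------------------
# ovirtnode1 kept the full score, ovirtnode6 was penalized for the gateway:
remote=3400
local=$((3400 - 1280))
echo "local=$local remote=$remote diff=$((remote - local))"
# -> local=2120 remote=3400 diff=1280; the agent treats this gap as
#    "significantly better" and shuts down the local VM.
------------------------------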


syslog messages:

Dec 21 12:33:19 ovirtnode6 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Host ovirtnode1.miac (id 1) score is significantly better than local score, shutting down VM on this host
Dec 21 12:33:29 ovirtnode6 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM stopped on localhost
Dec 21 12:33:37 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered disabled state
Dec 21 12:33:37 ovirtnode6 kernel: device vnet1 left promiscuous mode
Dec 21 12:33:37 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered disabled state
Dec 21 12:33:37 ovirtnode6 NetworkManager[1715]: <info> [1545381217.1796] device (vnet1): state change: disconnected -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Dec 21 12:33:37 ovirtnode6 NetworkManager[1715]: <info> [1545381217.1798] device (vnet1): released from master device ovirtmgmt
Dec 21 12:33:37 ovirtnode6 libvirtd: 2018-12-21 08:33:37.192+0000: 2783: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor   <--- WHAT IS THIS?
Dec 21 12:33:37 ovirtnode6 kvm: 2 guests now active
Dec 21 12:33:37 ovirtnode6 systemd-machined: Machine qemu-2-HostedEngine terminated.
Dec 21 12:33:37 ovirtnode6 firewalld[1693]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -w -D libvirt-out -m physdev --physdev-is-bridged --physdev-out vnet1 -g FP-vnet1' failed: iptables v1.4.21: goto 'FP-vnet1' is not a chain#012#012Try `iptables -h' or 'iptables --help' for more information.

Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered blocking state
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered disabled state
Dec 21 12:33:55 ovirtnode6 kernel: device vnet1 entered promiscuous mode
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered blocking state
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered forwarding state
Dec 21 12:33:55 ovirtnode6 lldpad: recvfrom(Event interface): No buffer space available
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info> [1545381235.8086] manager: (vnet1): new Tun device (/org/freedesktop/NetworkManager/Devices/37)
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info> [1545381235.8121] device (vnet1): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info> [1545381235.8127] device (vnet1): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')
---------------------------

*** WHAT IS THIS ***? The link had been back up for some time already, so why do these bridge status transitions and iptables errors happen?
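
My guess: vnet1 is the host-side tap device of the HostedEngine VM, so these transitions would be the old tap being torn down and a new one plugged into the bridge when the VM restarts. That can be cross-checked while the VM is up (a sketch; the VM name is as it appears in the logs above):

------------------------------
# map the VM's NICs to their host-side tap devices
virsh domiflist HostedEngine

# list ports currently attached to the ovirtmgmt bridge
bridge link show | grep ovirtmgmt
------------------------------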

and this machine tries to start again:
-----------------
Dec 21 12:33:56 ovirtnode6 systemd-machined: New machine qemu-15-HostedEngine.
Dec 21 12:33:56 ovirtnode6 systemd: Started Virtual Machine qemu-15-HostedEngine.
-------------------

HA agent log on this host (ovirtnode6):

-----------------------------------------
MainThread::INFO::2018-12-21 12:33:49,880::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2018-12-21 12:33:57,884::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-12-21 12:34:04,898::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
.....
MainThread::INFO::2018-12-21 12:36:24,800::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-12-21 12:36:24,921::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
---------------------------

HA agent log on ovirtnode1 (the spare HE host where the VM is trying to start):

----------------------
MainThread::INFO::2018-12-21 12:33:52,984::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM

MainThread::INFO::2018-12-21 12:33:56,787::hosted_engine::947::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2018-12-21 12:33:59,923::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
MainThread::INFO::2018-12-21 12:33:59,936::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-12-21 12:34:06,950::states::783::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Another host already took over..
-----------------

WHAT IS THIS? What does "took over" mean, and what does the process do in this case?
------------------------------------
MainThread::INFO::2018-12-21 12:34:10,240::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineForceStop (score: 3400)
MainThread::INFO::2018-12-21 12:34:10,246::hosted_engine::1006::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff`
MainThread::INFO::2018-12-21 12:34:10,797::hosted_engine::1011::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout:
MainThread::INFO::2018-12-21 12:34:10,797::hosted_engine::1012::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr:
MainThread::ERROR::2018-12-21 12:34:10,797::hosted_engine::1020::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost
-------------------------------------


I mounted the HE VM partitions holding its logs.
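
One way to get at them (I won't claim this is the only way; a minimal sketch using guestmount from libguestfs-tools, with a placeholder image path):

------------------------------
# attach the HE disk image read-only and let libguestfs
# detect and mount the partitions inside it
guestmount -a /path/to/hosted-engine-disk.img -i --ro /mnt/he
less /mnt/he/var/log/messages
------------------------------

Here are the syslog messages from inside the HE VM: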

WHAT IS THIS guest-shutdown? Why? There were no network problems at this time:

Dec 21 12:28:24 ovirtengine ovsdb-server: ovs|36386|reconnect|WARN|ssl:[::ffff:127.0.0.1]:58834: connection dropped (Protocol error)
Dec 21 12:28:24 ovirtengine python: ::ffff:172.16.10.101 - - [21/Dec/2018 12:28:24] "GET /v2.0/networks HTTP/1.1" 200 -
Dec 21 12:30:44 ovirtengine ovsdb-server: ovs|00111|reconnect|ERR|ssl:[::ffff:172.16.10.5]:42032: no response to inactivity probe after 5 seconds, disconnecting
Dec 21 12:30:44 ovirtengine ovsdb-server: ovs|00112|reconnect|ERR|ssl:[::ffff:172.16.10.1]:45624: no response to inactivity probe after 5 seconds, disconnecting
Dec 21 12:31:07 ovirtengine qemu-ga: info: guest-shutdown called, mode: powerdown

Dec 21 12:33:58 ovirtengine kernel: Linux version 3.10.0-862.14.4.el7.x86_64 (mockbu...@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Wed Sep 26 15:12:11 UTC 2018
Dec 21 12:33:58 ovirtengine kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.el7.x86_64 root=UUID=091e7022-295b-4b3f-96ad-4a7d90a2a9b0 ro crashkernel=auto rd.lvm.lv=ovirt/swap console=ttyS0 LANG=en_US.UTF-8
.....
Dec 21 12:33:59 ovirtengine kernel: XFS (vda3): Ending clean mount
Dec 21 12:34:01 ovirtengine systemd: Mounted /sysroot.
....
Dec 21 12:34:06 ovirtengine lvm: 6 logical volume(s) in volume group "ovirt" now active
Dec 21 12:34:06 ovirtengine systemd: Started LVM2 PV scan on device 252:2.
Dec 21 12:34:06 ovirtengine systemd: Found device /dev/mapper/ovirt-audit.
Dec 21 12:34:06 ovirtengine kernel: XFS (dm-5): Ending clean mount
Dec 21 12:34:06 ovirtengine systemd: Mounted /home.
Dec 21 12:34:07 ovirtengine kernel: XFS (dm-3): Ending clean mount
Dec 21 12:34:07 ovirtengine systemd: Mounted /var.
Dec 21 12:34:07 ovirtengine systemd: Starting Load/Save Random Seed...
Dec 21 12:34:07 ovirtengine systemd: Mounting /var/log...
Dec 21 12:34:07 ovirtengine kernel: XFS (dm-2): Mounting V5 Filesystem
Dec 21 12:34:07 ovirtengine systemd: Started Load/Save Random Seed.
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-2): Ending clean mount
Dec 21 12:34:08 ovirtengine systemd: Mounted /var/log.
Dec 21 12:34:08 ovirtengine systemd: Starting Flush Journal to Persistent Storage...
Dec 21 12:34:08 ovirtengine systemd: Mounting /var/log/audit...
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-1): Mounting V5 Filesystem
Dec 21 12:34:08 ovirtengine systemd: Started Flush Journal to Persistent Storage.
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-1): Ending clean mount
Dec 21 12:34:08 ovirtengine systemd: Mounted /var/log/audit.
Dec 21 12:34:13 ovirtengine kernel: XFS (dm-4): Ending clean mount
Dec 21 12:34:13 ovirtengine systemd: Mounted /tmp.
Dec 21 12:34:13 ovirtengine systemd: Reached target Local File Systems.
.....
Dec 21 12:34:24 ovirtengine sshd[1324]: Server listening on 0.0.0.0 port 2222.
Dec 21 12:34:24 ovirtengine sshd[1324]: Server listening on :: port 2222.
Dec 21 12:34:25 ovirtengine rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1334" x-info="http://www.rsyslog.com"] start
Dec 21 12:34:25 ovirtengine systemd: Started System Logging Service.
......
Dec 21 12:34:25 ovirtengine aliasesdb: BDB0196 Encrypted checksum: no encryption key specified
Dec 21 12:34:25 ovirtengine aliasesdb: BDB0196 Encrypted checksum: no encryption key specified
Dec 21 12:34:25 ovirtengine kernel: XFS (vda3): Internal error XFS_WANT_CORRUPTED_GOTO at line 1664 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_extent+0xaa/0x140 [xfs]
Dec 21 12:34:25 ovirtengine kernel: CPU: 1 PID: 1379 Comm: postalias Not tainted 3.10.0-862.14.4.el7.x86_64 #1
Dec 21 12:34:25 ovirtengine kernel: Hardware name: oVirt oVirt Node, BIOS 1.11.0-2.el7 04/01/2014
Dec 21 12:34:25 ovirtengine kernel: XFS (vda3): xfs_do_force_shutdown(0x8) called from line 236 of file fs/xfs/libxfs/xfs_defer.c. Return address = 0xffffffffc02709cb

Dec 21 12:34:27 ovirtengine kernel: XFS (vda3): Corruption of in-memory data detected. Shutting down filesystem
Dec 21 12:34:27 ovirtengine kernel: XFS (vda3): Please umount the filesystem and rectify the problem(s)
......

Dec 21 12:34:27 ovirtengine ovirt-websocket-proxy.py: ImportError: No module named websocketproxy
Dec 21 12:34:27 ovirtengine journal: 2018-12-21 12:34:27,189+0400 ovirt-engine-dwhd: ERROR run:554 Error: list index out of range
Dec 21 12:34:27 ovirtengine systemd: ovirt-websocket-proxy.service: main process exited, code=killed, status=7/BUS
Dec 21 12:34:27 ovirtengine systemd: Failed to start oVirt Engine websockets proxy.
Dec 21 12:34:27 ovirtengine systemd: Unit ovirt-websocket-proxy.service entered failed state.
Dec 21 12:34:27 ovirtengine systemd: ovirt-websocket-proxy.service failed.
Dec 21 12:34:27 ovirtengine systemd: ovirt-engine-dwhd.service: main process exited, code=exited, status=1/FAILURE
......
----------------------------
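
The XFS shutdown above suggests the HE disk image itself was corrupted by the crash. What I would try next (my assumption, not a verified HE recovery procedure) is an offline check of that filesystem, e.g. from a rescue boot of the HE VM:

------------------------------
# with vda3 unmounted (rescue boot, or the image attached elsewhere):
xfs_repair -n /dev/vda3   # dry run: only report problems
xfs_repair /dev/vda3      # actual repair, after backing up the image
------------------------------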
