On 24.12.2018 11:30, Mike Lykov wrote:

Host nodes (CentOS 7.5) are named ovirtnode1, 5 and 6. Timeouts in the ha-agent are at their defaults. Sanlock is configured (as far as I can tell).
The HE VM was running on ovirtnode6, and a spare HE host is deployed on ovirtnode1.

It was fixed (it seems) with the guestfish/xfs_repair method. That requires zeroing the XFS metadata log, which heavily relies on luck.
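For reference, roughly the sequence I mean, as a sketch only: the disk image path and the partition name inside the guest are placeholders for my setup, and zeroing the log throws away whatever metadata updates were still in it, hence the luck.

-----------------------------
# Make sure the HE VM is really down before touching its disk image.
hosted-engine --vm-status

# Open the HE disk image read-write with guestfish and repair the root FS.
# <path-to-he-disk-image> and /dev/sda3 are placeholders.
guestfish --rw -a <path-to-he-disk-image>
><fs> run
><fs> list-filesystems
# forcelogzero:true zeroes the XFS metadata log before the repair
><fs> xfs-repair /dev/sda3 forcelogzero:true
><fs> exit
-----------------------------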

1. Why, when it cannot boot due to the corruption, does it show nothing at all on the console? I can get to the grub menu (if I move fast enough), but if I continue the boot I see only a blinking cursor for many minutes and nothing more. The grub options do not contain any splash/quiet parameters. (The exception is the EDD message, which is meaningless; if I use edd=off I get only a black console.)

Where are the kernel boot logs / console output? Does it even try to load the initrd?
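One possible reason, judging from the kernel command line in the syslog excerpt further below: the HE VM boots with console=ttyS0, so boot messages go to the serial console rather than the graphical one. A minimal sketch of how one could watch them, assuming the hosted-engine tool on the host provides --console and that a read-write libvirt connection is available:

-----------------------------
# Attach to the HE VM serial console from the host currently running it:
hosted-engine --console

# Or go through libvirt directly (needs read-write access to libvirtd):
virsh list
virsh console HostedEngine

# To also get boot messages on the graphical console, temporarily append
# console=tty0 (in addition to console=ttyS0) to the kernel line in the
# grub menu for that single boot.
-----------------------------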

2. How can I set timeouts so that the ha-agent does NOT try to restart the HE after 1-2 unsuccessful pings and a 10-second outage? For me, HE VM stability (not crashing or breaking its filesystem) is more important than availability (I can live with it being unreachable for 10-15 seconds, but not with a broken VM).
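For what it is worth, a sketch of where I would look first; the key and type names below are assumptions about newer 4.2.x releases, so verify what your ovirt-hosted-engine-ha version actually accepts before changing anything. As far as I can tell the score penalties themselves (e.g. the 1280 gateway penalty in the log below) are hard-coded in the agent and not tunable from a config file.

-----------------------------
# Inspect the current hosted-engine configuration (key/type are examples):
hosted-engine --get-shared-config gateway --type=he_local

# Newer releases reportedly allow replacing the single-gateway ping test
# with a DNS or TCP liveness check (assumed key name "network_test"):
hosted-engine --set-shared-config network_test dns --type=he_local
-----------------------------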




---------------------------
Dec 21 12:32:56 ovirtnode6 kernel: bnx2x 0000:3b:00.0 enp59s0f0: NIC Link is Down
Dec 21 12:32:56 ovirtnode6 kernel: ovirtmgmt: port 1(enp59s0f0) entered disabled state
Dec 21 12:33:13 ovirtnode6 kernel: bnx2x 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Dec 21 12:33:13 ovirtnode6 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0: link becomes ready
Dec 21 12:33:13 ovirtnode6 kernel: ovirtmgmt: port 1(enp59s0f0) entered forwarding state
Dec 21 12:33:13 ovirtnode6 NetworkManager[1715]: <info> [1545381193.2204] device (enp59s0f0): carrier: link connected
-----------------------

That is a 17-second outage; at 12:33:13 the link is back. BUT all the events that lead to the crash follow later:

HA agent log:
------------------------------
MainThread::INFO::2018-12-21 12:32:59,540::states::444::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm running on localhost
MainThread::INFO::2018-12-21 12:32:59,662::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2018-12-21 12:33:09,797::states::136::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1280 due to gateway status
MainThread::INFO::2018-12-21 12:33:09,798::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2120)
MainThread::ERROR::2018-12-21 12:33:19,815::states::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ovirtnode1.miac (id 1) score is significantly better than local score, shutting down VM on this host
----------------------------------------------


syslog messages:

Dec 21 12:33:19 ovirtnode6 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Host ovirtnode1.miac (id 1) score is significantly better than local score, shutting down VM on this host
Dec 21 12:33:29 ovirtnode6 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM stopped on localhost
Dec 21 12:33:37 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered disabled state
Dec 21 12:33:37 ovirtnode6 kernel: device vnet1 left promiscuous mode
Dec 21 12:33:37 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered disabled state
Dec 21 12:33:37 ovirtnode6 NetworkManager[1715]: <info> [1545381217.1796] device (vnet1): state change: disconnected -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Dec 21 12:33:37 ovirtnode6 NetworkManager[1715]: <info> [1545381217.1798] device (vnet1): released from master device ovirtmgmt
Dec 21 12:33:37 ovirtnode6 libvirtd: 2018-12-21 08:33:37.192+0000: 2783: ***error : qemuMonitorIO:719 : internal error: End of file from qemu monitor***  - WHAT IS THIS?
Dec 21 12:33:37 ovirtnode6 kvm: 2 guests now active
Dec 21 12:33:37 ovirtnode6 systemd-machined: Machine qemu-2-HostedEngine terminated.
Dec 21 12:33:37 ovirtnode6 firewalld[1693]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -w -D libvirt-out -m physdev --physdev-is-bridged --physdev-out vnet1 -g FP-vnet1' failed: iptables v1.4.21: goto 'FP-vnet1' is not a chain#012#012Try `iptables -h' or 'iptables --help' for more information.

Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered blocking state
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered disabled state
Dec 21 12:33:55 ovirtnode6 kernel: device vnet1 entered promiscuous mode
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered blocking state
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered forwarding state
Dec 21 12:33:55 ovirtnode6 lldpad: recvfrom(Event interface): No buffer space available
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info> [1545381235.8086] manager: (vnet1): new Tun device (/org/freedesktop/NetworkManager/Devices/37)
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info> [1545381235.8121] device (vnet1): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info> [1545381235.8127] device (vnet1): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')
---------------------------

*** WHAT IS THIS? *** The link came back some time before; why do these bridge status transitions and iptables errors happen?

and the machine tries to start again:
-----------------
Dec 21 12:33:56 ovirtnode6 systemd-machined: New machine qemu-15-HostedEngine.
Dec 21 12:33:56 ovirtnode6 systemd: Started Virtual Machine qemu-15-HostedEngine.
-------------------

HA agent log on this host (ovirtnode6):

-----------------------------------------
MainThread::INFO::2018-12-21 12:33:49,880::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2018-12-21 12:33:57,884::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-12-21 12:34:04,898::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
.....
MainThread::INFO::2018-12-21 12:36:24,800::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-12-21 12:36:24,921::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
---------------------------

HA agent log on ovirtnode1 (the spare HE host where the VM is trying to start):

----------------------
MainThread::INFO::2018-12-21 12:33:52,984::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM

MainThread::INFO::2018-12-21 12:33:56,787::hosted_engine::947::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2018-12-21 12:33:59,923::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
MainThread::INFO::2018-12-21 12:33:59,936::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-12-21 12:34:06,950::states::783::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Another host already took over..
-----------------

WHAT IS THIS? What does "took over" mean, and what does the agent do in this case?
------------------------------------
MainThread::INFO::2018-12-21 12:34:10,240::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineForceStop (score: 3400)
MainThread::INFO::2018-12-21 12:34:10,246::hosted_engine::1006::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff`
MainThread::INFO::2018-12-21 12:34:10,797::hosted_engine::1011::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout:
MainThread::INFO::2018-12-21 12:34:10,797::hosted_engine::1012::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr:
MainThread::ERROR::2018-12-21 12:34:10,797::hosted_engine::1020::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost
-------------------------------------


I mounted the HE VM partition that holds its logs; here are the syslog messages:

WHAT IS THIS guest-shutdown? Why? There were no network problems at this time:

Dec 21 12:28:24 ovirtengine ovsdb-server: ovs|36386|reconnect|WARN|ssl:[::ffff:127.0.0.1]:58834: connection dropped (Protocol error)
Dec 21 12:28:24 ovirtengine python: ::ffff:172.16.10.101 - - [21/Dec/2018 12:28:24] "GET /v2.0/networks HTTP/1.1" 200 -
Dec 21 12:30:44 ovirtengine ovsdb-server: ovs|00111|reconnect|ERR|ssl:[::ffff:172.16.10.5]:42032: no response to inactivity probe after 5 seconds, disconnecting
Dec 21 12:30:44 ovirtengine ovsdb-server: ovs|00112|reconnect|ERR|ssl:[::ffff:172.16.10.1]:45624: no response to inactivity probe after 5 seconds, disconnecting
Dec 21 12:31:07 ovirtengine qemu-ga: info: guest-shutdown called, mode: powerdown

Dec 21 12:33:58 ovirtengine kernel: Linux version 3.10.0-862.14.4.el7.x86_64 (mockbu...@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Wed Sep 26 15:12:11 UTC 2018
Dec 21 12:33:58 ovirtengine kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.el7.x86_64 root=UUID=091e7022-295b-4b3f-96ad-4a7d90a2a9b0 ro crashkernel=auto rd.lvm.lv=ovirt/swap console=ttyS0 LANG=en_US.UTF-8
.....
Dec 21 12:33:59 ovirtengine kernel: XFS (vda3): Ending clean mount
Dec 21 12:34:01 ovirtengine systemd: Mounted /sysroot.
....
Dec 21 12:34:06 ovirtengine lvm: 6 logical volume(s) in volume group "ovirt" now active
Dec 21 12:34:06 ovirtengine systemd: Started LVM2 PV scan on device 252:2.
Dec 21 12:34:06 ovirtengine systemd: Found device /dev/mapper/ovirt-audit.
Dec 21 12:34:06 ovirtengine kernel: XFS (dm-5): Ending clean mount
Dec 21 12:34:06 ovirtengine systemd: Mounted /home.
Dec 21 12:34:07 ovirtengine kernel: XFS (dm-3): Ending clean mount
Dec 21 12:34:07 ovirtengine systemd: Mounted /var.
Dec 21 12:34:07 ovirtengine systemd: Starting Load/Save Random Seed...
Dec 21 12:34:07 ovirtengine systemd: Mounting /var/log...
Dec 21 12:34:07 ovirtengine kernel: XFS (dm-2): Mounting V5 Filesystem
Dec 21 12:34:07 ovirtengine systemd: Started Load/Save Random Seed.
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-2): Ending clean mount
Dec 21 12:34:08 ovirtengine systemd: Mounted /var/log.
Dec 21 12:34:08 ovirtengine systemd: Starting Flush Journal to Persistent Storage...
Dec 21 12:34:08 ovirtengine systemd: Mounting /var/log/audit...
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-1): Mounting V5 Filesystem
Dec 21 12:34:08 ovirtengine systemd: Started Flush Journal to Persistent Storage.
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-1): Ending clean mount
Dec 21 12:34:08 ovirtengine systemd: Mounted /var/log/audit.
Dec 21 12:34:13 ovirtengine kernel: XFS (dm-4): Ending clean mount
Dec 21 12:34:13 ovirtengine systemd: Mounted /tmp.
Dec 21 12:34:13 ovirtengine systemd: Reached target Local File Systems.
.....
Dec 21 12:34:24 ovirtengine sshd[1324]: Server listening on 0.0.0.0 port 2222.
Dec 21 12:34:24 ovirtengine sshd[1324]: Server listening on :: port 2222.
Dec 21 12:34:25 ovirtengine rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1334" x-info="http://www.rsyslog.com"] start
Dec 21 12:34:25 ovirtengine systemd: Started System Logging Service.
......
Dec 21 12:34:25 ovirtengine aliasesdb: BDB0196 Encrypted checksum: no encryption key specified
Dec 21 12:34:25 ovirtengine aliasesdb: BDB0196 Encrypted checksum: no encryption key specified
Dec 21 12:34:25 ovirtengine kernel: XFS (vda3): Internal error XFS_WANT_CORRUPTED_GOTO at line 1664 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_extent+0xaa/0x140 [xfs]
Dec 21 12:34:25 ovirtengine kernel: CPU: 1 PID: 1379 Comm: postalias Not tainted 3.10.0-862.14.4.el7.x86_64 #1
Dec 21 12:34:25 ovirtengine kernel: Hardware name: oVirt oVirt Node, BIOS 1.11.0-2.el7 04/01/2014
Dec 21 12:34:25 ovirtengine kernel: XFS (vda3): xfs_do_force_shutdown(0x8) called from line 236 of file fs/xfs/libxfs/xfs_defer.c.  Return address = 0xffffffffc02709cb

Dec 21 12:34:27 ovirtengine kernel: XFS (vda3): Corruption of in-memory data detected.  Shutting down filesystem
Dec 21 12:34:27 ovirtengine kernel: XFS (vda3): Please umount the filesystem and rectify the problem(s)
......

Dec 21 12:34:27 ovirtengine ovirt-websocket-proxy.py: ImportError: No module named websocketproxy
Dec 21 12:34:27 ovirtengine journal: 2018-12-21 12:34:27,189+0400 ovirt-engine-dwhd: ERROR run:554 Error: list index out of range
Dec 21 12:34:27 ovirtengine systemd: ovirt-websocket-proxy.service: main process exited, code=killed, status=7/BUS
Dec 21 12:34:27 ovirtengine systemd: Failed to start oVirt Engine websockets proxy.
Dec 21 12:34:27 ovirtengine systemd: Unit ovirt-websocket-proxy.service entered failed state.
Dec 21 12:34:27 ovirtengine systemd: ovirt-websocket-proxy.service failed.
Dec 21 12:34:27 ovirtengine systemd: ovirt-engine-dwhd.service: main process exited, code=exited, status=1/FAILURE
......
----------------------------







_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/BBHBZ6ZWU5KLCWQ7PMOENAX2ZCR7M3FP/