On Mi, 2014-04-09 at 14:42 +0200, Jiri Moskovcak wrote: > On 04/09/2014 02:32 PM, Daniel Helgenberger wrote: > > On Mi, 2014-04-09 at 09:18 +0200, Jiri Moskovcak wrote: > >> On 04/08/2014 06:09 PM, Daniel Helgenberger wrote: > >>> Hello, > >>> > >>> I have an oVirt 3.4 hosted engine lab setup witch I am evaluating for > >>> production use. > >>> > >>> I "simulated" an ungraceful shutdown of all HA nodes (powercut) while > >>> the engine was running. After powering up, the system did not recover > >>> itself (it seemed). > >>> I had to restart the ovirt-hosted-ha service (witch was in a locked > >>> state) and then manually run 'hosted-engine --vm-start'. > >>> > >>> What is the supposed procedure after a shutdown (graceful / ungraceful) > >>> of Hosted-Engine HA nodes? Should the engine recover by itself? Should > >>> the running VM's be restarted automatically? > >> > >> When this happens the agent should start the engine VM and the engine > >> should take care of restarting the VMs which were running on that > >> restarted host and are marked as HA. Can you please provide contents ov > >> /var/log/ovirt* from the host after the powercut when the engine VM > >> doesn't come up? > >> > > Hello Jirka, > > > > I accidentally already send the message without pointing out the > > interesting part; this is: > > > > <<< start logging ha-agent after reboot: > > /var/log/ovirt-hosted-engine-ha/agent.log:MainTMainThread::INFO::2014-04-08 > > 15:53:33,862::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run) > > ovirt-hosted-engine-ha agent 1.1.2-1 started > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:33,936::hosted_engine::223::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) > > Found certificate common name: 192.168.50.201 > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:33,937::hosted_engine::363::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) > > Initializing ha-broker connection > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:33,937::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Starting monitor ping, options {'addr': '192.168.50.1'} > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:33,939::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Success, id 139700911299600 > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:33,939::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': > > 'ovirtmgmt', 'address': '0'} > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:34,013::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Success, id 139700911300304 > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:34,013::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'} > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:34,015::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Success, id 139700911300112 > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:34,015::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', > > 'vm_uuid': 'e68a11c8-1251-4c13-9e3b-3847bbb4fa3d', 'address': '0'} > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:34,018::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Success, id 139700911300240 > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:34,018::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': > > 'e68a11c8-1251-4c13-9e3b-3847bbb4fa3d', 'address': '0'} > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:34,024::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) > > Success, id 139700723857104 > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:34,024::hosted_engine::386::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) > > Broker initialized, all submonitors started > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:53:34,312::hosted_engine::430::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_cond_start_service) > > Starting vdsmd > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::CRITICAL::2014-04-08 > > 15:53:34,442::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) > > Could not start ha-agent > > (10 min nothing) > > <<< here I did a 'service ovirt-hosted-ha start' > > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 > > 15:59:16,698::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run) > > ovirt-hosted-engine-ha agent 1.1.2-1 started > > .... > > > > after this things went quite smoothly. > > > > Hi Daniel, > I noticed that in the log and I was just about to ask if that's when you > manually fixed it. Is there something else around that time in > /var/log/message which might be related to it?
No problem, here we go at the relevant time. I added some newlines for the interesting parts - otherwise it is complete: Apr 8 15:53:29 nodehv01 kernel: mboxdomain: port 1(eth1.1) entering forwarding state Apr 8 15:53:29 nodehv01 rpc.statd[1924]: Version 1.2.3 starting Apr 8 15:53:30 nodehv01 sm-notify[1925]: Version 1.2.3 starting Apr 8 15:53:34 nodehv01 wdmd[2315]: wdmd started S0 H1 G179 Apr 8 15:53:34 nodehv01 wdmd[2315]: /dev/watchdog armed with fire_timeout 60 --- Apr 8 15:53:34 nodehv01 ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent CRITICAL Could not start ha-agent#012Traceback (most recent call last):#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run#012 self._run_agent()#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent#012 hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 299, in start_monitoring#012 self._initialize_vdsm()#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 390, in _initialize_vdsm#012 self._cond_start_service('vdsmd')#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 439, in _cond_start_service#012 .format(service_name, res[1]))#012Exception: Could not start vdsmd: --- Apr 8 15:53:34 nodehv01 kernel: device-mapper: table: 253:2: multipath: error getting device Apr 8 15:53:34 nodehv01 kernel: device-mapper: ioctl: error adding target to table Apr 8 15:53:34 nodehv01 multipathd: 1ATA_ST3500418AS_6VMNMNV0: ignoring map Apr 8 15:53:34 nodehv01 multipathd: path checkers start up Apr 8 15:53:35 nodehv01 ntpd[2463]: ntpd 4.2.6p5@1.2349-o Sat Nov 23 18:21:48 UTC 2013 (1) Apr 8 15:53:35 nodehv01 ntpd[2464]: proto: precision = 0.062 usec Apr 8 15:53:35 nodehv01 ntpd[2464]: 0.0.0.0 c01d 0d kern kernel time sync enabled Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen and drop on 1 v6wildcard :: UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 2 lo 127.0.0.1 UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 3 ovirtmgmt 192.168.50.201 UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 4 lo ::1 UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 5 eth1.1 fe80::215:17ff:fe21:111f UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 6 eth1.100 fe80::215:17ff:fe21:111f UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 7 eth1.1000 fe80::215:17ff:fe21:111f UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 8 eth1 fe80::215:17ff:fe21:111f UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 9 mboxsan fe80::215:17ff:fe21:111f UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 10 mboxdomain fe80::215:17ff:fe21:111f UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 11 mboxdmz fe80::215:17ff:fe21:111f UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 12 eth0 fe80::215:17ff:fe21:111e UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 13 ovirtmgmt fe80::215:17ff:fe21:111e UDP 123 Apr 8 15:53:35 nodehv01 ntpd[2464]: peers refreshed Apr 8 15:53:35 nodehv01 ntpd[2464]: Listening on routing socket on fd #30 for interface updates Apr 8 15:53:35 nodehv01 kernel: Loading iSCSI transport class v2.0-870. Apr 8 15:53:35 nodehv01 ntpd[2464]: 0.0.0.0 c016 06 restart Apr 8 15:53:35 nodehv01 ntpd[2464]: 0.0.0.0 c012 02 freq_set kernel 3.731 PPM Apr 8 15:53:35 nodehv01 kernel: iscsi: registered transport (tcp) Apr 8 15:53:35 nodehv01 kernel: iscsi: registered transport (iser) Apr 8 15:53:35 nodehv01 kernel: libcxgbi:libcxgbi_init_module: tag itt 0x1fff, 13 bits, age 0xf, 4 bits. Apr 8 15:53:35 nodehv01 kernel: libcxgbi:ddp_setup_host_page_size: system PAGE 4096, ddp idx 0. Apr 8 15:53:35 nodehv01 kernel: Chelsio T3 iSCSI Driver cxgb3i v2.0.0 (Jun. 2010) Apr 8 15:53:35 nodehv01 kernel: iscsi: registered transport (cxgb3i) Apr 8 15:53:35 nodehv01 kernel: Chelsio T4 iSCSI Driver cxgb4i v0.9.1 (Aug. 2010) Apr 8 15:53:35 nodehv01 kernel: iscsi: registered transport (cxgb4i) Apr 8 15:53:35 nodehv01 kernel: cnic: Broadcom NetXtreme II CNIC Driver cnic v2.5.16 (Dec 05, 2012) Apr 8 15:53:36 nodehv01 kernel: Broadcom NetXtreme II iSCSI Driver bnx2i v2.7.6.2 (Jun 06, 2013) Apr 8 15:53:36 nodehv01 kernel: iscsi: registered transport (bnx2i) Apr 8 15:53:36 nodehv01 kernel: iscsi: registered transport (be2iscsi) Apr 8 15:53:36 nodehv01 kernel: In beiscsi_module_init, tt=ffffffffa0699e80 Apr 8 15:53:36 nodehv01 iscsid: iSCSI logger with pid=2619 started! Apr 8 15:53:36 nodehv01 kernel: Ebtables v2.0 registered --- Apr 8 15:53:37 nodehv01 ovirt-ha-broker mem_free.MemFree ERROR Failed to getVdsStats: VDSM initialization timeout Apr 8 15:53:37 nodehv01 ovirt-ha-broker engine_health.CpuLoadNoEngine ERROR Failed to getVmStats: VDSM initialization timeout Apr 8 15:53:37 nodehv01 ovirt-ha-broker cpu_load_no_engine.EngineHealth ERROR Failed to getVmStats: VDSM initialization timeout Apr 8 15:53:37 nodehv01 ovirt-ha-broker mgmt_bridge.MgmtBridge ERROR Failed to getVdsCapabilities: VDSM initialization timeout --- Apr 8 15:53:37 nodehv01 iscsid: iSCSI daemon with pid=2620 started! Apr 8 15:53:37 nodehv01 kernel: lo: Disabled Privacy Extensions Apr 8 15:53:38 nodehv01 kernel: Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) Apr 8 15:53:38 nodehv01 kernel: bonding: bond4 is being created... Apr 8 15:53:38 nodehv01 kernel: bonding: bond1 is being created... Apr 8 15:53:38 nodehv01 kernel: bonding: bond2 is being created... Apr 8 15:53:38 nodehv01 kernel: bonding: bond3 is being created... Apr 8 15:53:40 nodehv01 kernel: device-mapper: table: 253:2: multipath: error getting device Apr 8 15:53:40 nodehv01 kernel: device-mapper: ioctl: error adding target to table Apr 8 15:53:40 nodehv01 vdsm vds WARNING Unable to load the json rpc server module. Please make sure it is installed. Apr 8 15:53:42 nodehv01 ntpd[2464]: 0.0.0.0 c61c 0c clock_step +0.171138 s Apr 8 15:53:42 nodehv01 ntpd[2464]: 0.0.0.0 c614 04 freq_mode Apr 8 15:53:43 nodehv01 ntpd[2464]: 0.0.0.0 c618 08 no_sys_peer --- Apr 8 15:59:16 nodehv01 vdsm vds ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012 File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo#012 stats = instance.get_all_stats()#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 85, in get_all_stats#012 path.get_metadata_path(self._config),#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 47, in get_metadata_path#012 return os.path.join(get_domain_path(config_),#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 40, in get_domain_path#012 .format(sd_uuid, parent))#012Exception: path to storage domain bcfa7ec4-5278-44d8-9f31-682f2d9de91d not found in /rhev/data-center/mnt --- Apr 8 15:59:17 nodehv01 kernel: RPC: Registered named UNIX socket transport module. Apr 8 15:59:17 nodehv01 kernel: RPC: Registered udp transport module. Apr 8 15:59:17 nodehv01 kernel: RPC: Registered tcp transport module. Apr 8 15:59:17 nodehv01 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module. Apr 8 15:59:17 nodehv01 kernel: FS-Cache: Loaded Apr 8 15:59:17 nodehv01 kernel: NFS: Registering the id_resolver key type Apr 8 15:59:17 nodehv01 kernel: FS-Cache: Netfs 'nfs' registered for caching > > Thanks, > Jirka > > >> Thanks, > >> Jirka > > > > > >> > >>> > >>> Thanks, > >>> Daniel > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> Users mailing list > >>> Users@ovirt.org > >>> http://lists.ovirt.org/mailman/listinfo/users > >>> > >> > > > -- Daniel Helgenberger m box bewegtbild GmbH P: +49/30/2408781-22 F: +49/30/2408781-10 ACKERSTR. 19 D-10115 BERLIN www.m-box.de www.monkeymen.tv Geschäftsführer: Martin Retschitzegger / Michaela Göllner Handeslregister: Amtsgericht Charlottenburg / HRB 112767
smime.p7s
Description: S/MIME cryptographic signature
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users