Hi Jirka,

Thanks for the update. It sounds like the same bug but with a few extra
issues thrown in. Comment 9, for example, seems to me to be a completely
separate bug, although it may affect the issue I reported.
I can't see any mention of how the problem is being resolved, which I am
interested in, but I will keep an eye on it. I'll try the patched version
when I get the time and enthusiasm to give it another crack.

regards,
John

On 14/08/14 22:57, Jiri Moskovcak wrote:
> Hi John,
> after a deeper look I realized that you're probably facing [1]. The
> patch is ready and I will also backport it to the 3.4 branch.
>
> --Jirka
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1093638
>
> On 07/29/2014 11:41 PM, John Gardeniers wrote:
>> Hi Jiri,
>>
>> Sorry, I can't supply the log because the hosts have been recycled, but
>> I'm sure it would have contained exactly the same information that you
>> already have from host2. It's a classic deadlock situation that should
>> never be allowed to happen. A simple and time-proven solution was in my
>> original post.
>>
>> The reason for recycling the hosts is that I discovered yesterday that
>> although the engine was still running it could not be accessed in any
>> way. Upon further finding that there was no way to get it restarted, I
>> decided to abandon the whole idea of self-hosting until such time as I
>> see an indication that it's production ready.
>>
>> regards,
>> John
>>
>> On 29/07/14 22:52, Jiri Moskovcak wrote:
>>> Hi John,
>>> thanks for the logs. It seems the engine is running on host2, which
>>> decides that it doesn't have the best score and shuts the engine down,
>>> and then neither of them wants to start the VM until you restart
>>> host2. Unfortunately the logs don't contain the part from host1 from
>>> 2014-07-24 09:XX, which I'd like to investigate because it might
>>> contain the information about why host1 refused to start the VM when
>>> host2 killed it.
>>>
>>> Regards,
>>> Jirka
>>>
>>> On 07/28/2014 02:57 AM, John Gardeniers wrote:
>>>> Hi Jirka,
>>>>
>>>> Version: ovirt-hosted-engine-ha-1.1.5-1.el6.noarch
>>>>
>>>> Attached are the logs. Thanks for looking.
>>>>
>>>> Regards,
>>>> John
>>>>
>>>> On 25/07/14 17:47, Jiri Moskovcak wrote:
>>>>> On 07/24/2014 11:37 PM, John Gardeniers wrote:
>>>>>> Hi Jiri,
>>>>>>
>>>>>> Perhaps you can tell me how to determine the exact version of
>>>>>> ovirt-hosted-engine-ha.
>>>>>
>>>>> Centos/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
>>>>>
>>>>>> As for the logs, I am not going to attach 60MB of logs to an email,
>>>>>
>>>>> - there are other ways to share the logs
>>>>>
>>>>>> nor can I see any imaginable reason for you wanting to see them all,
>>>>>> as the bulk is historical. I have already included the *relevant*
>>>>>> sections. However, if you think there may be some other section that
>>>>>> may help you, feel free to be more explicit about what you are
>>>>>> looking for. Right now I fail to understand what you might hope to
>>>>>> see in logs from several weeks ago that you can't get from the last
>>>>>> day or so.
>>>>>
>>>>> It's standard practice; people tend to think they know which part of
>>>>> a log is relevant, but in many cases they fail. Asking for the whole
>>>>> logs has proven to be faster than trying to find the relevant part
>>>>> through the user. And you're right, I don't need the logs from last
>>>>> week, just the logs since the last start of the services when you
>>>>> observed the problem.
>>>>>
>>>>> Regards,
>>>>> Jirka
>>>>>
>>>>>> regards,
>>>>>> John
>>>>>>
>>>>>> On 24/07/14 19:10, Jiri Moskovcak wrote:
>>>>>>> Hi, please provide the exact versions of ovirt-hosted-engine-ha
>>>>>>> and all logs from /var/log/ovirt-hosted-engine-ha/
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Jirka
>>>>>>>
>>>>>>> On 07/24/2014 01:29 AM, John Gardeniers wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I have created a lab with 2 hypervisors and a self-hosted engine.
>>>>>>>> Today I followed the upgrade instructions as described in
>>>>>>>> http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine.
>>>>>>>> I didn't really do an upgrade but simply wanted to test what would
>>>>>>>> happen when the engine was rebooted.
>>>>>>>>
>>>>>>>> When the engine didn't restart I re-ran hosted-engine
>>>>>>>> --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and
>>>>>>>> ovirt-ha-broker services on both nodes. 15 minutes later it still
>>>>>>>> hadn't restarted, so I then tried rebooting both hypervisors.
>>>>>>>> After an hour there was still no sign of the engine starting. The
>>>>>>>> agent logs don't help me much. The following bits are repeated
>>>>>>>> over and over.
>>>>>>>>
>>>>>>>> ovirt1 (192.168.19.20):
>>>>>>>>
>>>>>>>> MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net'
>>>>>>>> MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
>>>>>>>> MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
>>>>>>>> MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.21 (id: 2, score: 2400)
>>>>>>>>
>>>>>>>> ovirt2 (192.168.19.21):
>>>>>>>>
>>>>>>>> MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net'
>>>>>>>> MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
>>>>>>>> MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
>>>>>>>> MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.20 (id: 1, score: 2400)
>>>>>>>>
>>>>>>>> From the above information I decided to simply shut down one
>>>>>>>> hypervisor and see what happens. The engine did start back up
>>>>>>>> again a few minutes later.
>>>>>>>>
>>>>>>>> The interesting part is that each hypervisor seems to think the
>>>>>>>> other is a better host. The two machines are identical, so there's
>>>>>>>> no reason I can see for this odd behaviour. In a lab environment
>>>>>>>> this is little more than an annoying inconvenience. In a
>>>>>>>> production environment it would be completely unacceptable.
>>>>>>>>
>>>>>>>> May I suggest that this issue be looked into and some means found
>>>>>>>> to eliminate this kind of mutual exclusion? e.g. After a few
>>>>>>>> minutes of such an issue one hypervisor could be randomly given a
>>>>>>>> slightly higher weighting, which should result in it being chosen
>>>>>>>> to start the engine.
>>>>>>>>
>>>>>>>> regards,
>>>>>>>> John
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users@ovirt.org
>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
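[Editor's note] John's tie-breaking suggestion could be sketched roughly as
follows. This is a hypothetical illustration only, not the actual
ovirt-hosted-engine-ha scoring code; the function name, the `base_score` and
`stalemate_since` parameters, and the timeout and jitter values are all
invented for the sketch. The idea: once two identically scored hosts have
been deadlocked for a few minutes, one host adds a small random bonus so
that exactly one of them ends up with the best score and starts the engine.

```python
import random
import time

# Invented values for illustration only.
DEADLOCK_TIMEOUT = 300  # seconds of EngineDown-EngineDown stalemate before tie-breaking
MAX_JITTER = 50         # tiny compared to real score penalties, so it only breaks exact ties

def effective_score(base_score, stalemate_since, now=None):
    """Return the score this host should advertise.

    If the host has been in a scoring stalemate longer than
    DEADLOCK_TIMEOUT, add a small random bonus so two otherwise
    identical hosts stop reporting identical scores.
    """
    now = now if now is not None else time.time()
    if stalemate_since is not None and now - stalemate_since > DEADLOCK_TIMEOUT:
        # The jitter can only break an exact tie; it can never override
        # a genuine score difference caused by a real penalty.
        return base_score + random.randint(1, MAX_JITTER)
    return base_score
```

A deterministic variant (e.g. always favouring the lowest host id after the
timeout) would avoid even the small chance of both hosts drawing the same
jitter in the same monitoring cycle.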