After a few sleepless nights, I went through my entire system again and found an interface on one of my hosts that had already been removed from the UI. Even after multiple "service network restart" runs, it would still show up when I ran "ip addr". I ended up forcibly removing it with rm -rf /etc/sysconfig/network-scripts/ifcfg-<interface>. After that I rebooted the node and the SPM came out of contention. I can't make sense of it, but it worked.
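For anyone hitting the same thing, here is a rough sketch of how you could spot leftover ifcfg files for interfaces the kernel no longer knows about before resorting to deleting them by hand. This assumes an EL6-style /etc/sysconfig/network-scripts layout, as on my hosts; the helper name stale_ifcfgs is made up for illustration:

```shell
#!/bin/sh
# List ifcfg-* config files whose interface no longer exists on the host.
# $1 = directory holding the ifcfg files
# $2 = newline-separated list of interfaces the kernel currently knows about
stale_ifcfgs() {
    dir=$1; live=$2
    for f in "$dir"/ifcfg-*; do
        [ -e "$f" ] || continue
        name=${f##*/ifcfg-}
        case $name in lo) continue ;; esac   # never touch loopback
        if ! printf '%s\n' "$live" | grep -qx "$name"; then
            printf '%s\n' "$f"
        fi
    done
}

# Typical usage on the host (removal is destructive -- back the file up first):
#   live=$(ip -o link show | awk -F': ' '{print $2}')
#   stale_ifcfgs /etc/sysconfig/network-scripts "$live"
#   # then move the stale file aside, restart the network service, reboot if needed
```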
----- Original Message -----
From: "Liron Aravot" <lara...@redhat.com>
To: "Maurice \"Moe\" James" <mja...@media-node.com>
Cc: users@ovirt.org
Sent: Wednesday, April 16, 2014 8:49:05 AM
Subject: Re: [ovirt-users] SPM error

Hi Maurice, any updates on the above?
thanks, Liron

----- Original Message -----
> From: "Liron Aravot" <lara...@redhat.com>
> To: "Maurice \"Moe\" James" <mja...@media-node.com>
> Cc: users@ovirt.org
> Sent: Tuesday, April 15, 2014 11:53:40 AM
> Subject: Re: [ovirt-users] SPM error
>
> ----- Original Message -----
> > From: "Maurice \"Moe\" James" <mja...@media-node.com>
> > To: "Liron Aravot" <lara...@redhat.com>
> > Cc: "Itamar Heim" <ih...@redhat.com>, users@ovirt.org
> > Sent: Tuesday, April 15, 2014 3:14:16 AM
> > Subject: Re: [ovirt-users] SPM error
> >
> > Sorry, forgot to paste
> > https://bugzilla.redhat.com/show_bug.cgi?id=1086951
>
> Hi Maurice,
> the issue is that the host doesn't have access to all the storage domains,
> which causes the SPM start process to fail.
> There's a bug open for that issue -
> https://bugzilla.redhat.com/show_bug.cgi?id=1072900 - so it seems we'll be
> able to close the one you opened as a duplicate, but let's hold off on that
> until your issue is solved.
> From looking at the logs, it seems that the host has a problem accessing two
> storage domains -
> 3406665e-4adc-4fd4-aa1e-037547b29adb
> f3b51811-4a7f-43af-8633-322b3db23c48
>
> Can you verify that the host can access those domains? From the log, the NFS
> paths for those appear to be:
> shtistg01.suprtekstic.com:/storage/infrastructure
> shtistg01.suprtekstic.com:/storage/exports
>
> log snippet:
> 1.
> Thread-14::DEBUG::2014-04-11
> 22:54:44,331::mount::226::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n
> /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
> ashtistg01.suprtekstic.com:/storage/exports
> /rhev/data-center/mnt/ashtistg01.suprtekstic.com:_storage_exports' (cwd None)
> Thread-14::ERROR::2014-04-11
> 22:55:36,659::storageServer::209::StorageServer.MountConnection::(connect)
> Mount failed: (32, ';mount.nfs: Failed to resolve server
> ashtistg01.suprtekstic.com: Name or service not known\n')
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
>     self._mount.mount(self.options, self._vfsType)
>   File "/usr/share/vdsm/storage/mount.py", line 222, in mount
>     return self._runcmd(cmd, timeout)
>   File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
>     raise MountError(rc, ";".join((out, err)))
> MountError: (32, ';mount.nfs: Failed to resolve server
> ashtistg01.suprtekstic.com: Name or service not known\n')
> Thread-14::ERROR::2014-04-11
> 22:55:36,705::hsm::2379::Storage.HSM::(connectStorageServer) Could not
> connect to storageServer
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
>     conObj.connect()
>   File "/usr/share/vdsm/storage/storageServer.py", line 320, in connect
>     return self._mountCon.connect()
>   File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
>     raise e
> MountError: (32, ';mount.nfs: Failed to resolve server
> ashtistg01.suprtekstic.com: Name or service not known\n')
>
> 2.
> Thread-14::ERROR::2014-04-11
> 22:56:29,307::storageServer::209::StorageServer.MountConnection::(connect)
> Mount failed: (32, ';mount.nfs: Failed to resolve server
> ashtistg01.suprtekstic.com: Name or service not known\n')
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
>     self._mount.mount(self.options, self._vfsType)
>   File "/usr/share/vdsm/storage/mount.py", line 222, in mount
>     return self._runcmd(cmd, timeout)
>   File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
>     raise MountError(rc, ";".join((out, err)))
> MountError: (32, ';mount.nfs: Failed to resolve server
> ashtistg01.suprtekstic.com: Name or service not known\n')
> Thread-14::ERROR::2014-04-11
> 22:56:29,309::hsm::2379::Storage.HSM::(connectStorageServer) Could not
> connect to storageServer
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
>     conObj.connect()
>   File "/usr/share/vdsm/storage/storageServer.py", line 320, in connect
>     return self._mountCon.connect()
>   File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
>     raise e
> MountError: (32, ';mount.nfs: Failed to resolve server
> ashtistg01.suprtekstic.com: Name or service not known\n')
>
> Regardless of that, there are sanlock errors throughout the log when trying
> to acquire the host-id.
> I'd start by checking the connectivity issues to the storage domains above;
> later on you can check that sanlock is running and operational and/or
> attach the sanlock logs.
>
> > On Mon, 2014-04-14 at 17:11 -0400, Liron Aravot wrote:
> > >
> > > ----- Original Message -----
> > > > From: "Maurice \"Moe\" James" <mja...@media-node.com>
> > > > To: "Itamar Heim" <ih...@redhat.com>
> > > > Cc: users@ovirt.org
> > > > Sent: Sunday, April 13, 2014 2:28:45 AM
> > > > Subject: Re: [ovirt-users] SPM error
> > > >
> > > > Were you able to find out anything?
> > > > Is there anything that I can check in the meanwhile?
> > >
> > > Hi Maurice,
> > > can you please attach the ovirt engine/vdsm logs?
> > > thanks,
> > > Liron
> > >
> > > > On Sat, 2014-04-12 at 19:23 +0300, Itamar Heim wrote:
> > > > > On 04/12/2014 03:40 PM, Maurice James wrote:
> > > > > > What did you do to try to fix the sanlock? Anything is better than
> > > > > > nothing at this point
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Ted Miller" <tmil...@hcjb.org>
> > > > > > To: "Maurice James" <mja...@media-node.com>
> > > > > > Sent: Friday, April 11, 2014 7:27:24 PM
> > > > > > Subject: Re: [ovirt-users] SPM error
> > > > > >
> > > > > > I did receive some help on one stage of rebuilding my sanlock, but
> > > > > > there were too many other things wrong to get it started again.
> > > > > > The only advice I have is: look at your sanlock logs, and see if
> > > > > > you can find anything there that is helpful.
> > > > > >
> > > > > > On 4/11/2014 7:23 PM, Maurice James wrote:
> > > > > >> Nooooooo.
> > > > > >>
> > > > > >> Sent from my Galaxy S®III
> > > > > >>
> > > > > >> -------- Original message --------
> > > > > >> From: Ted Miller <tmil...@hcjb.org>
> > > > > >> Date: 04/11/2014 7:08 PM (GMT-05:00)
> > > > > >> To: Maurice James <mja...@media-node.com>
> > > > > >> Subject: Re: [ovirt-users] SPM error
> > > > > >>
> > > > > >> I didn't, really. I did something wrong along the way, and ended
> > > > > >> up having to rebuild the engine and hosts. (My problems were due
> > > > > >> to a glusterfs split-brain.)
> > > > > >> Ted Miller
> > > > > >>
> > > > > >> On 4/11/2014 6:03 PM, Maurice James wrote:
> > > > > >>> How did you fix it?
> > > > > >>>
> > > > > >>> Sent from my Galaxy S®III
> > > > > >>>
> > > > > >>> -------- Original message --------
> > > > > >>> From: Ted Miller <tmil...@hcjb.org>
> > > > > >>> Date: 04/11/2014 6:00 PM (GMT-05:00)
> > > > > >>> To: users@ovirt.org
> > > > > >>> Subject: Re: [ovirt-users] SPM error
> > > > > >>>
> > > > > >>> On 4/11/2014 2:05 PM, Maurice James wrote:
> > > > > >>>> I have an error trying to bring the master DC back online. After
> > > > > >>>> several reboots, no luck. I took the other cluster members
> > > > > >>>> offline to try to troubleshoot. The remaining host is constantly
> > > > > >>>> in contention with itself for SPM.
> > > > > >>>>
> > > > > >>>> ERROR
> > > > > >>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> > > > > >>>> (DefaultQuartzScheduler_Worker-40) [38d400ea]
> > > > > >>>> IrsBroker::Failed::GetStoragePoolInfoVDS due to:
> > > > > >>>> IrsSpmStartFailedException: IRSGenericException:
> > > > > >>>> IRSErrorException:
> > > > > >>>> SpmStart failed
> > > > > >>>>
> > > > > >>> I'm no expert, but the last time I beat my head on that rock,
> > > > > >>> something was wrong with my sanlock storage. YMMV
> > > > > >>> Ted Miller
> > > > > >>> Elkhart, IN, USA
> > > > >
> > > > > Maurice - which type of storage is this?

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
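For completeness: the mount failures quoted in the log snippets above are name-resolution errors ("Name or service not known"), so the first thing to verify is that the server half of each NFS spec resolves from the host. A minimal POSIX shell sketch; the helper names nfs_server/nfs_export are made up for illustration, not VDSM functions:

```shell
#!/bin/sh
# Split an NFS mount spec of the form "server:/export" into its two halves.
nfs_server() { printf '%s\n' "${1%%:*}"; }   # part before the first colon
nfs_export() { printf '%s\n' "${1#*:}"; }    # part after the first colon

# The two specs quoted in the thread:
for spec in \
    shtistg01.suprtekstic.com:/storage/infrastructure \
    shtistg01.suprtekstic.com:/storage/exports
do
    srv=$(nfs_server "$spec")
    # The failing step in the log was name resolution, so check that first;
    # on a healthy host this loop prints nothing.
    getent hosts "$srv" >/dev/null 2>&1 ||
        echo "cannot resolve $srv - same failure as mount.nfs in the log"
done

# If resolution works, the next check is whether the export is visible
# (requires nfs-utils):
#   showmount -e "$srv" | grep "$(nfs_export "$spec")"
```

Note that the log snippets show the host mounting from "ashtistg01..." while the domain list names "shtistg01..."; if that difference is real on the host, it alone would explain the resolution failure.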