Hello Tommy,

I had a similar experience: after trying to recover my storage domain, I realized that my VMs were gone. You should verify that your VM disks are still inside your storage domain. In my case, I had to add a new storage domain as the master domain to be able to remove the old VMs from the DB and reattach the old storage domain. I hope this is not your case. If you haven't lost your VMs, it should be possible to recover them.
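To check that, you can mount the export by hand and look for the image directories. A rough sketch (server:/export is a placeholder for your own NFS path, and it assumes the usual oVirt NFS domain layout of <domain-uuid>/images/<disk-uuid>/):

    # Mount the export read-only, somewhere out of vdsm's way
    # (server:/export is a placeholder -- substitute your own)
    mkdir -p /mnt/check
    mount -t nfs -o ro,nfsvers=3 server:/export /mnt/check

    # Each storage domain is a directory named by its UUID,
    # and every virtual disk is a UUID directory under images/
    ls /mnt/check/
    ls /mnt/check/*/images/

    umount /mnt/check

If the image directories are all there, you have a metadata problem, not a data problem, which is much better news.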
Good luck,
Juanjo.

On Wed, Apr 24, 2013 at 6:43 AM, Tommy McNeely <tommythe...@gmail.com> wrote:

> We had a hard crash (network, then power) on our 2-node oVirt cluster. We
> have an NFS datastore on CentOS 6 (3.2.0-1.39.el6). We can no longer get the
> hosts to activate: they are unable to activate the "master" domain. The
> master storage domain shows "Locked" while the other storage domains show
> Unknown (disks) and Inactive (ISO). All the domains are on the same NFS
> server, we are able to mount it, and the permissions are good. We believe we
> might be getting bit by https://bugzilla.redhat.com/show_bug.cgi?id=920694 or
> http://gerrit.ovirt.org/#/c/13709/ which says to cease working on it:
>
>     Michael Kublin  Apr 10
>
>     Patch Set 5: Do not submit
>
>     Liron, please abandon this work. This interacts with host life cycle,
>     which will be changed; during that change the following problem will be
>     solved as well.
>
> So, we were wondering what we can do to get our oVirt back online, or
> rather what the correct way is to solve this. We have a few VMs that are
> down which we are looking to recover as quickly as possible.
>
> Thanks in advance,
> Tommy
>
> Here are the ovirt-engine logs:
>
> 2013-04-23 21:30:04,041 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
> 2013-04-23 21:30:04,043 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 50524b34
> 2013-04-23 21:30:04,049 WARN [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [7c5867d6] CanDoAction of action ReconstructMasterDomain failed. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status Locked
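That ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2 with $status Locked is the same wall I hit: the engine will not reconstruct the master while its own database still marks the domain as Locked. Before touching anything I would only *look* at what the DB says. Something like this, read-only (the table and column names are from my notes on the 3.2 schema, so verify them against your own database first, and take a DB backup before you ever run an UPDATE):

    # Read-only look at what the engine DB thinks; run on the engine host.
    # "engine" is the default database name -- adjust if yours differs,
    # and double-check these table names against your actual schema.
    sudo -u postgres psql engine -c \
        "SELECT storage_pool_id, storage_id, status FROM storage_pool_iso_map;"
    sudo -u postgres psql engine -c \
        "SELECT id, name, status, master_domain_version FROM storage_pool;"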
>
> Here are the logs from vdsm:
>
> Thread-29::DEBUG::2013-04-23 21:36:05,906::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 10.101.0.148:/c/vpt1-vmdisks1 /rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1' (cwd None)
> Thread-29::DEBUG::2013-04-23 21:36:06,008::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 10.101.0.148:/c/vpool-iso /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso' (cwd None)
> Thread-29::INFO::2013-04-23 21:36:06,065::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
> Thread-29::DEBUG::2013-04-23 21:36:06,071::task::1151::TaskManager.Task::(prepare) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::finished: {'statuslist': [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
> Thread-29::DEBUG::2013-04-23 21:36:06,071::task::568::TaskManager.Task::(_updateState) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::moving from state preparing -> state finished
> Thread-29::DEBUG::2013-04-23 21:36:06,071::resourceManager::830::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
> Thread-29::DEBUG::2013-04-23 21:36:06,072::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
> Thread-29::DEBUG::2013-04-23 21:36:06,072::task::957::TaskManager.Task::(_decref) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::ref 0 aborting False
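Both connectStorageServer calls return status 0, so the NFS mounts themselves come up fine. The first thing I would check is whether the master domain UUID actually exists as a directory under either mount; if it does not, everything that follows in the log is just a consequence of that:

    # vdsm expects the master domain UUID to be a directory under one of
    # these mounts (paths taken from your own log)
    ls -l /rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1/
    ls -l /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso/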
> Thread-30::DEBUG::2013-04-23 21:36:06,112::BindingXMLRPC::161::vds::(wrapper) [10.101.0.197]
> Thread-30::DEBUG::2013-04-23 21:36:06,112::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state init -> state preparing
> Thread-30::INFO::2013-04-23 21:36:06,113::logUtils::41::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='0f63de0e-7d98-48ce-99ec-add109f83c4f', hostID=1, scsiKey='0f63de0e-7d98-48ce-99ec-add109f83c4f', msdUUID='774e3604-f449-4b3e-8c06-7cd16f98720c', masterVersion=73, options=None)
> Thread-30::DEBUG::2013-04-23 21:36:06,113::resourceManager::190::ResourceManager.Request::(__init__) ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '189' at '__init__'
> Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::504::ResourceManager::(registerResource) Trying to register resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' for lock type 'exclusive'
> Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::547::ResourceManager::(registerResource) Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free. Now locking as 'exclusive' (1 active user)
> Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::227::ResourceManager.Request::(grant) ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Granted request
> Thread-30::INFO::2013-04-23 21:36:06,115::sp::625::Storage.StoragePool::(connect) Connect host #1 to the storage pool 0f63de0e-7d98-48ce-99ec-add109f83c4f with master domain: 774e3604-f449-4b3e-8c06-7cd16f98720c (ver = 73)
> Thread-30::DEBUG::2013-04-23 21:36:06,116::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:06,116::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:06,118::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:06,118::misc::1054::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
> Thread-30::DEBUG::2013-04-23 21:36:06,118::misc::1056::SamplingMethod::(__call__) Got in to sampling method
> Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::1054::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
> Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::1056::SamplingMethod::(__call__) Got in to sampling method
> Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
> Thread-30::DEBUG::2013-04-23 21:36:06,136::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = 'iscsiadm: No session found.\n'; <rc> = 21
> Thread-30::DEBUG::2013-04-23 21:36:06,136::misc::1064::SamplingMethod::(__call__) Returning last result
> MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,139::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host0/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,142::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host1/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,146::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host2/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,149::iscsi::402::Storage.ISCSI::(forceIScsiScan) Performing SCSI scan, this will take up to 30 seconds
> Thread-30::DEBUG::2013-04-23 21:36:08,152::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/multipath' (cwd None)
> Thread-30::DEBUG::2013-04-23 21:36:08,254::misc::84::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
> Thread-30::DEBUG::2013-04-23 21:36:08,256::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:08,256::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:08,257::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:08,257::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:08,258::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:08,258::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:08,258::misc::1064::SamplingMethod::(__call__) Returning last result
> Thread-30::DEBUG::2013-04-23 21:36:08,259::lvm::368::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:08,261::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 774e3604-f449-4b3e-8c06-7cd16f98720c' (cwd None)
> Thread-30::DEBUG::2013-04-23 21:36:08,514::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = '  Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found\n'; <rc> = 5
> Thread-30::WARNING::2013-04-23 21:36:08,516::lvm::373::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found']
> Thread-30::DEBUG::2013-04-23 21:36:08,518::lvm::397::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23 21:36:08,524::resourceManager::557::ResourceManager::(releaseResource) Trying to release resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
> Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::573::ResourceManager::(releaseResource) Released resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' (0 active users)
> Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::578::ResourceManager::(releaseResource) Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free, finding out if anyone is waiting for it.
> Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::585::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f', Clearing records.
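Note what vdsm is doing here: having failed to find the master domain on the NFS mounts, it falls through to rescanning iSCSI and LVM for a volume group named after the msdUUID, which naturally fails on an NFS-only setup. So the 'Volume group not found' is a symptom, not the cause. If the domain directory is still on the export, its own metadata file tells you what role and master version it claims (path assumes the standard NFS domain layout and that the directory exists under the vmdisks mount):

    # Inspect the domain's own metadata as stored on the export
    cat '/rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1/774e3604-f449-4b3e-8c06-7cd16f98720c/dom_md/metadata'
    # Key lines to look for: ROLE=Master, MASTER_VERSION=..., and
    # POOL_UUID=0f63de0e-7d98-48ce-99ec-add109f83c4f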
> Thread-30::ERROR::2013-04-23 21:36:08,526::task::833::TaskManager.Task::(_setError) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 840, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/storage/hsm.py", line 926, in connectStoragePool
>     masterVersion, options)
>   File "/usr/share/vdsm/storage/hsm.py", line 973, in _connectStoragePool
>     res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
>   File "/usr/share/vdsm/storage/sp.py", line 642, in connect
>     self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
>   File "/usr/share/vdsm/storage/sp.py", line 1166, in __rebuild
>     self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
>   File "/usr/share/vdsm/storage/sp.py", line 1505, in getMasterDomain
>     raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
> StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
> Thread-30::DEBUG::2013-04-23 21:36:08,527::task::852::TaskManager.Task::(_run) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._run: f551fa3f-9d8c-4de3-895a-964c821060d4 ('0f63de0e-7d98-48ce-99ec-add109f83c4f', 1, '0f63de0e-7d98-48ce-99ec-add109f83c4f', '774e3604-f449-4b3e-8c06-7cd16f98720c', 73) {} failed - stopping task
> Thread-30::DEBUG::2013-04-23 21:36:08,528::task::1177::TaskManager.Task::(stop) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::stopping in state preparing (force False)
> Thread-30::DEBUG::2013-04-23 21:36:08,528::task::957::TaskManager.Task::(_decref) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 1 aborting True
> Thread-30::INFO::2013-04-23 21:36:08,528::task::1134::TaskManager.Task::(prepare) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::aborting: Task is aborted: 'Cannot find master domain' - code 304
> Thread-30::DEBUG::2013-04-23 21:36:08,529::task::1139::TaskManager.Task::(prepare) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Prepare: aborted: Cannot find master domain
> Thread-30::DEBUG::2013-04-23 21:36:08,529::task::957::TaskManager.Task::(_decref) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 0 aborting True
> Thread-30::DEBUG::2013-04-23 21:36:08,529::task::892::TaskManager.Task::(_doAbort) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._doAbort: force False
> Thread-30::DEBUG::2013-04-23 21:36:08,530::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
> Thread-30::DEBUG::2013-04-23 21:36:08,530::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state preparing -> state aborting
> Thread-30::DEBUG::2013-04-23 21:36:08,530::task::523::TaskManager.Task::(__state_aborting) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::_aborting: recover policy none
> Thread-30::DEBUG::2013-04-23 21:36:08,531::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state aborting -> state failed
> Thread-30::DEBUG::2013-04-23 21:36:08,531::resourceManager::830::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
> Thread-30::DEBUG::2013-04-23 21:36:08,531::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
> Thread-30::ERROR::2013-04-23 21:36:08,532::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'", 'code': 304}}
> [root@vmserver3 vdsm]#
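One more check that helped me was comparing what vdsm itself reports against what the engine expects (vdsClient ships with vdsm; use "vdsClient 0" instead of "-s 0" if you are not running with SSL):

    # Ask vdsm directly which domains it can actually see
    vdsClient -s 0 getStorageDomainsList
    vdsClient -s 0 getStorageDomainInfo 774e3604-f449-4b3e-8c06-7cd16f98720c

If the master domain really is gone from the export, then the new-master-domain-plus-reattach procedure I described at the top was the only way out for me.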
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users