Re: [ovirt-users] Storage domains not found by vdsm after a reboot

2016-12-05 Thread Yoann Laissus
Thanks for your answer.

Based on your advice, I improved my shutdown script by manually performing
all the actions the engine does when putting a host into maintenance
(via vdsClient: stopping the SPM, disconnecting the SP, and disconnecting
and unlocking all SDs).
So there's no need for my previous hacks anymore :) I can post this script
if anyone is interested.
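
For anyone who wants to try the same approach, a rough sketch of those
vdsClient steps could look like the following. The verb names and argument
order are assumptions to check against `vdsClient -s 0 -h` on your vdsm
version, and every UUID below is a placeholder:

```shell
#!/bin/sh
# Sketch of the manual "maintenance" steps described above.
# All UUIDs are placeholders; set VDS=echo for a dry run.

VDS=${VDS:-vdsClient -s 0}
SP_UUID=${SP_UUID:-REPLACE-POOL-UUID}
MSD_UUID=${MSD_UUID:-REPLACE-MASTER-SD-UUID}
HOST_ID=${HOST_ID:-1}

stop_spm() {
    # Step 1: give up the SPM role so no pool metadata operation is in flight
    $VDS spmStop "$SP_UUID"
}

deactivate_domain() {
    # Step 2: deactivate one storage domain (repeat for each SD in the pool)
    sd_uuid=$1 master_version=$2
    $VDS deactivateStorageDomain "$sd_uuid" "$SP_UUID" "$MSD_UUID" "$master_version"
}

disconnect_pool() {
    # Step 3: disconnect this host from the storage pool
    $VDS disconnectStoragePool "$SP_UUID" "$HOST_ID" "$SP_UUID"
}
```

The functions only wrap the vdsClient calls, so the order of invocation
(SPM stop, then domains, then pool) is left to the caller.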

About the delay issue after a reboot, I investigated that a bit more.

It seems to be related not to vdsm and the lockspace, but to the engine
itself. When the engine starts, it probably doesn't know the host was down,
so it doesn't send domain information and keeps trying to reconstruct
the SPM on the main host until a timeout occurs (found in the engine
logs). It also explains why it works immediately after restarting
vdsm: the engine then sees the host as down for a while.

My usage is clearly not common, but maybe the ideal solution would be
a special "maintenance" state to inform the engine that the host was
rebooted while the engine wasn't alive.
But I can live with a small script that restarts vdsm once the engine boots.
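
A minimal sketch of such a restart-once-the-engine-is-up script, assuming
the engine answers on its health servlet; the URL, curl flags, and timeout
values are placeholders to adapt to your deployment:

```shell
#!/bin/sh
# Wait for the engine to answer, then bounce vdsmd once so it
# reconnects cleanly. All knobs overridable for testing.

ENGINE_URL=${ENGINE_URL:-https://engine.example.com/ovirt-engine/services/health}
CURL=${CURL:-curl -ksf}
RESTART=${RESTART:-systemctl restart vdsmd}
MAX_TRIES=${MAX_TRIES:-120}    # ~10 minutes with SLEEP=5
SLEEP=${SLEEP:-5}

wait_for_engine() {
    # Poll the health page until it responds, or give up after MAX_TRIES
    tries=0
    while ! $CURL "$ENGINE_URL" >/dev/null 2>&1; do
        tries=$((tries + 1))
        [ "$tries" -ge "$MAX_TRIES" ] && return 1
        sleep "$SLEEP"
    done
}

restart_vdsm_when_engine_up() {
    wait_for_engine && $RESTART
}
```

Run from a one-shot boot unit after the network is up.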

2016-12-03 20:58 GMT+01:00 Nir Soffer :
> On Sat, Dec 3, 2016 at 6:14 PM, Yoann Laissus  wrote:
>> Hello,
>>
>> I'm running into some weird issues with vdsm and my storage domains
>> after a reboot or a shutdown. I can't manage to figure out what's
>> going on...
>>
>> Currently, my cluster (4.0.5 with hosted engine) is composed of one
>> main node. (and another inactive one but unrelated to this issue).
>> It has local storage exposed to oVirt via 3 NFS exports (one specific
>> for the hosted engine vm) reachable from my local network.
>>
>> When I want to shut down or reboot my main host (and so the whole
>> cluster), I use a custom script:
>> 1. Shutdown all VM
>> 2. Shutdown engine VM
>> 3. Stop HA agent and broker
>> 4. Stop vdsmd
>
> This leaves vdsm connected to all storage domains, and sanlock is
> still maintaining the lockspace on all storage domains.
>
>> 5. Release the sanlock on the hosted engine SD
>
> You should not do that; instead, use local/global maintenance mode in the
> hosted engine agent.
>
>> 6. Shutdown / Reboot
>>
>> It works just fine, but at the next boot, VDSM takes at least 10-15
>> minutes to find storage domains, except the hosted engine one. The
>> engine loops trying to reconstruct the SPM.
>> During this time, vdsClient getConnectedStoragePoolsList returns nothing.
>> getStorageDomainsList returns only the hosted engine domain.
>> NFS exports are mountable from another server.
>
> The correct way to shut down a host is to move it to maintenance.
> This deactivates all storage domains on the host, releases sanlock leases,
> and disconnects from the storage server (e.g. logs out from iSCSI
> connections, unmounts NFS mounts).
>
> If you don't do this, sanlock will need more time to join the lockspace
> the next time.
>
> I'm not sure what the correct procedure is when using hosted engine, since
> hosted engine will not let you put a host into maintenance if the hosted
> engine VM is running on that host. You can stop the hosted engine VM,
> but then you cannot move the host into maintenance since you no longer
> have an engine :-)
>
> There must be a documented way to perform this operation, I hope that
> Simone will point us to the documentation.
>
> Nir
>
>>
>> But when I restart vdsm manually after the boot, it seems to detect
>> immediately the storage domains.
>>
>> Is there some kind of stale storage data used by vdsm, with a timeout
>> to invalidate it?
>> Am I missing something on the vdsm side in my shutdown procedure ?
>>
>> Thanks !
>>
>> Engine and vdsm logs are attached.
>>
>>
>> --
>> Yoann Laissus
>>
>> ___
>> Users mailing list
>> Users@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>



-- 
Yoann Laissus


Re: [ovirt-users] Storage domains not found by vdsm after a reboot

2016-12-04 Thread Joop
On 3-12-2016 19:24, Charles Kozler wrote:
> I am facing the same issue here as well. The engine comes up and the web
> UI is reachable. The initial login takes about 6 minutes to finally let me
> in, and then once I am in, under the events tab there are events for
> "storage domain  does not exist" yet they are all there. After
It will probably start faster if you install haveged in the hosted
engine VM.
I run a similar setup and it goes a lot faster that way.
> this comes 'reconstructing master domain' and it tries to cycle
> through my 2 storage domains, not including the ISO_UPLOAD and
> hosted_engine domains. Eventually it will either 1) settle on one and
> actually manage to bring it up as the master domain, or 2) they all stay
> down and I have to manually activate one.
>
> Its not really an issue since on three tests now I have recovered fine
> but it required some manual intervention on at least one occasion but
> otherwise it just flaps about until it can settle on one and actually
> bring it up
>
> Clocking it today its usually like this:
>
> 7 minutes for HE to come up on node 1 and access to web UI
> +6 minutes while hanging on logging in to web UI
> +9 minutes for one of the two storage domains to get activated as master
>
My setup is usually up in about 10 min if I do nothing, and by 'helping' it
I can get it up in around 6-8 min. I usually don't bother and have another
cup of coffee :-)

About shutting down: I have a small script that shuts down the
ha-agent/ha-broker, then the engine, and then vdsm/sanlock/nfs-server.
Search the ML for it; I can post it again if needed.
Putting the host into global maintenance and then shutting down will
work too, but then you need to script hosted-engine --set-maintenance
--mode=none during startup.
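
A sketch of that global-maintenance variant; the --set-maintenance flag
matches the hosted-engine CLI of that era, but verify it against
hosted-engine --help on your version:

```shell
#!/bin/sh
# Enter global maintenance before shutdown, leave it again at startup.
# HE is overridable so the functions can be dry-run tested.

HE=${HE:-hosted-engine}

enter_global_maintenance() {
    # Before shutting down: stop the HA agents from reacting to the outage
    $HE --set-maintenance --mode=global
}

leave_global_maintenance() {
    # At startup (e.g. from a one-shot systemd unit): resume HA monitoring
    $HE --set-maintenance --mode=none
}
```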

Regards,

Joop




Re: [ovirt-users] Storage domains not found by vdsm after a reboot

2016-12-04 Thread Yaniv Kaul
On Dec 3, 2016 8:24 PM, "Charles Kozler"  wrote:

I am facing the same issue here as well. The engine comes up and web UI is
reachable. The initial login takes about 6 minutes to finally let me in,
and then once I am in, under the events tab there are events for "storage
domain  does not exist" yet they are all there. After this comes
'reconstructing master domain' and it tries to cycle through my 2 storage
domains, not including the ISO_UPLOAD and hosted_engine domains. Eventually
it will either 1) settle on one and actually manage to bring it up as the
master domain, or 2) they all stay down and I have to manually activate one.

It's not really an issue: on three tests now I have recovered fine. It
required some manual intervention on at least one occasion, but otherwise
it just flaps about until it settles on one and actually brings it up.

Clocking it today, it's usually like this:

7 minutes for HE to come up on node 1 and access to web UI
+6 minutes while hanging on logging in to web UI
+9 minutes for one of the two storage domains to get activated as master

Total around 20 minutes before entire cluster is usable.


Can you please verify that DNS, both forward and reverse resolution, is
properly configured?
Which oVirt release are you using?
Y.
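
One quick way to check both directions of the resolution Yaniv asks about,
assuming getent is available on the host (pass the host's FQDN as the
engine knows it):

```shell
#!/bin/sh
# Check that a name resolves to an address and that the address
# resolves back to the same name.

check_dns() {
    host=$1
    # Forward lookup: first address for the name
    ip=$(getent hosts "$host" | awk '{print $1; exit}')
    [ -n "$ip" ] || { echo "no forward record for $host"; return 1; }
    # Reverse lookup: canonical name for that address
    back=$(getent hosts "$ip" | awk '{print $2; exit}')
    [ "$back" = "$host" ] || { echo "reverse of $ip is '$back', not $host"; return 1; }
    echo "ok: $host <-> $ip"
}
```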


On Sat, Dec 3, 2016 at 11:14 AM, Yoann Laissus 
wrote:

> Hello,
>
> I'm running into some weird issues with vdsm and my storage domains
> after a reboot or a shutdown. I can't manage to figure out what's
> going on...
>
> Currently, my cluster (4.0.5 with hosted engine) is composed of one
> main node. (and another inactive one but unrelated to this issue).
> It has local storage exposed to oVirt via 3 NFS exports (one specific
> for the hosted engine vm) reachable from my local network.
>
> When I want to shut down or reboot my main host (and so the whole
> cluster), I use a custom script:
> 1. Shutdown all VM
> 2. Shutdown engine VM
> 3. Stop HA agent and broker
> 4. Stop vdsmd
> 5. Release the sanlock on the hosted engine SD
> 6. Shutdown / Reboot
>
> It works just fine, but at the next boot, VDSM takes at least 10-15
> minutes to find storage domains, except the hosted engine one. The
> engine loops trying to reconstruct the SPM.
> During this time, vdsClient getConnectedStoragePoolsList returns nothing.
> getStorageDomainsList returns only the hosted engine domain.
> NFS exports are mountable from another server.
>
> But when I restart vdsm manually after the boot, it seems to detect
> immediately the storage domains.
>
> Is there some kind of stale storage data used by vdsm, with a timeout
> to invalidate it?
> Am I missing something on the vdsm side in my shutdown procedure ?
>
> Thanks !
>
> Engine and vdsm logs are attached.
>
>
> --
> Yoann Laissus
>


Re: [ovirt-users] Storage domains not found by vdsm after a reboot

2016-12-03 Thread Nir Soffer
On Sat, Dec 3, 2016 at 6:14 PM, Yoann Laissus  wrote:
> Hello,
>
> I'm running into some weird issues with vdsm and my storage domains
> after a reboot or a shutdown. I can't manage to figure out what's
> going on...
>
> Currently, my cluster (4.0.5 with hosted engine) is composed of one
> main node. (and another inactive one but unrelated to this issue).
> It has local storage exposed to oVirt via 3 NFS exports (one specific
> for the hosted engine vm) reachable from my local network.
>
> When I want to shut down or reboot my main host (and so the whole
> cluster), I use a custom script:
> 1. Shutdown all VM
> 2. Shutdown engine VM
> 3. Stop HA agent and broker
> 4. Stop vdsmd

This leaves vdsm connected to all storage domains, and sanlock is
still maintaining the lockspace on all storage domains.

> 5. Release the sanlock on the hosted engine SD

You should not do that; instead, use local/global maintenance mode in the
hosted engine agent.

> 6. Shutdown / Reboot
>
> It works just fine, but at the next boot, VDSM takes at least 10-15
> minutes to find storage domains, except the hosted engine one. The
> engine loops trying to reconstruct the SPM.
> During this time, vdsClient getConnectedStoragePoolsList returns nothing.
> getStorageDomainsList returns only the hosted engine domain.
> NFS exports are mountable from another server.

The correct way to shut down a host is to move it to maintenance.
This deactivates all storage domains on the host, releases sanlock leases,
and disconnects from the storage server (e.g. logs out from iSCSI
connections, unmounts NFS mounts).

If you don't do this, sanlock will need more time to join the lockspace
the next time.
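
For reference, moving a host to maintenance is a single action on the
engine's REST API. A sketch under the assumption of the v4 API layout; the
URL, credentials, and host id are placeholders:

```shell
#!/bin/sh
# POST an <action/> to the host's /deactivate endpoint.
# CURL is overridable so the call can be dry-run tested.

API=${API:-https://engine.example.com/ovirt-engine/api}
AUTH=${AUTH:-admin@internal:password}
CURL=${CURL:-curl -ks}

deactivate_host() {
    host_id=$1
    $CURL -u "$AUTH" -X POST \
        -H "Content-Type: application/xml" \
        -d "<action/>" \
        "$API/hosts/$host_id/deactivate"
}
```

This is what the "Maintenance" button in the web UI does under the hood,
so it only helps while the engine itself is reachable.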

I'm not sure what the correct procedure is when using hosted engine, since
hosted engine will not let you put a host into maintenance if the hosted
engine VM is running on that host. You can stop the hosted engine VM,
but then you cannot move the host into maintenance since you no longer
have an engine :-)

There must be a documented way to perform this operation, I hope that
Simone will point us to the documentation.

Nir

>
> But when I restart vdsm manually after the boot, it seems to detect
> immediately the storage domains.
>
> Is there some kind of stale storage data used by vdsm, with a timeout
> to invalidate it?
> Am I missing something on the vdsm side in my shutdown procedure ?
>
> Thanks !
>
> Engine and vdsm logs are attached.
>
>
> --
> Yoann Laissus
>


Re: [ovirt-users] Storage domains not found by vdsm after a reboot

2016-12-03 Thread Charles Kozler
I am facing the same issue here as well. The engine comes up and web UI is
reachable. The initial login takes about 6 minutes to finally let me in,
and then once I am in, under the events tab there are events for "storage
domain  does not exist" yet they are all there. After this comes
'reconstructing master domain' and it tries to cycle through my 2 storage
domains, not including the ISO_UPLOAD and hosted_engine domains. Eventually
it will either 1) settle on one and actually manage to bring it up as the
master domain, or 2) they all stay down and I have to manually activate one.

It's not really an issue: on three tests now I have recovered fine. It
required some manual intervention on at least one occasion, but otherwise
it just flaps about until it settles on one and actually brings it up.

Clocking it today, it's usually like this:

7 minutes for HE to come up on node 1 and access to web UI
+6 minutes while hanging on logging in to web UI
+9 minutes for one of the two storage domains to get activated as master

Total around 20 minutes before entire cluster is usable.

On Sat, Dec 3, 2016 at 11:14 AM, Yoann Laissus 
wrote:

> Hello,
>
> I'm running into some weird issues with vdsm and my storage domains
> after a reboot or a shutdown. I can't manage to figure out what's
> going on...
>
> Currently, my cluster (4.0.5 with hosted engine) is composed of one
> main node. (and another inactive one but unrelated to this issue).
> It has local storage exposed to oVirt via 3 NFS exports (one specific
> for the hosted engine vm) reachable from my local network.
>
> When I want to shut down or reboot my main host (and so the whole
> cluster), I use a custom script:
> 1. Shutdown all VM
> 2. Shutdown engine VM
> 3. Stop HA agent and broker
> 4. Stop vdsmd
> 5. Release the sanlock on the hosted engine SD
> 6. Shutdown / Reboot
>
> It works just fine, but at the next boot, VDSM takes at least 10-15
> minutes to find storage domains, except the hosted engine one. The
> engine loops trying to reconstruct the SPM.
> During this time, vdsClient getConnectedStoragePoolsList returns nothing.
> getStorageDomainsList returns only the hosted engine domain.
> NFS exports are mountable from another server.
>
> But when I restart vdsm manually after the boot, it seems to detect
> immediately the storage domains.
>
> Is there some kind of stale storage data used by vdsm, with a timeout
> to invalidate it?
> Am I missing something on the vdsm side in my shutdown procedure ?
>
> Thanks !
>
> Engine and vdsm logs are attached.
>
>
> --
> Yoann Laissus
>


[ovirt-users] Storage domains not found by vdsm after a reboot

2016-12-03 Thread Yoann Laissus
Hello,

I'm running into some weird issues with vdsm and my storage domains
after a reboot or a shutdown. I can't manage to figure out what's
going on...

Currently, my cluster (4.0.5 with hosted engine) is composed of one
main node (and another, inactive one that is unrelated to this issue).
It has local storage exposed to oVirt via 3 NFS exports (one dedicated
to the hosted engine VM), reachable from my local network.

When I want to shut down or reboot my main host (and so the whole
cluster), I use a custom script:
1. Shut down all VMs
2. Shut down the engine VM
3. Stop the HA agent and broker
4. Stop vdsmd
5. Release the sanlock lease on the hosted engine SD
6. Shut down / reboot
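
For illustration, steps 2-4 of the list above could be sketched as follows.
The service and command names (hosted-engine, ovirt-ha-agent,
ovirt-ha-broker, vdsmd) are 4.0-era assumptions; step 1 is site-specific,
and the replies in this thread recommend maintenance mode over the manual
sanlock release of step 5, so both are left as comments:

```shell
#!/bin/sh
# Sketch of the shutdown sequence. HE and STOP are overridable
# so the function can be dry-run tested.

STOP=${STOP:-systemctl stop}
HE=${HE:-hosted-engine}

shutdown_sequence() {
    # 1. shut down all regular VMs (site-specific: via the engine or virsh)
    # 2. ask the hosted engine VM to power off
    $HE --vm-shutdown
    # 3. stop the HA agent and broker
    $STOP ovirt-ha-agent ovirt-ha-broker
    # 4. stop vdsm
    $STOP vdsmd
    # 5. (sanlock release omitted; prefer maintenance mode)
    # 6. then shut down or reboot the host
}
```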

It works just fine, but at the next boot, VDSM takes at least 10-15
minutes to find storage domains, except the hosted engine one. The
engine loops trying to reconstruct the SPM.
During this time, vdsClient getConnectedStoragePoolsList returns nothing.
getStorageDomainsList returns only the hosted engine domain.
NFS exports are mountable from another server.

But when I restart vdsm manually after the boot, it seems to detect
immediately the storage domains.

Is there some kind of stale storage data used by vdsm, with a timeout
to invalidate it?
Am I missing something on the vdsm side in my shutdown procedure?

Thanks !

Engine and vdsm logs are attached.


-- 
Yoann Laissus

-
engine :

2016-12-03 15:58:53,616 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (org.ovirt.thread.pool-6-thread-5) [82a36a5] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand' return value '
TaskStatusListReturnForXmlRpc:{status='StatusForXmlRpc [code=654, message=Not SPM: ()]'}
'
2016-12-03 15:58:53,616 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (org.ovirt.thread.pool-6-thread-5) [82a36a5] HostName = ovirt-host1
2016-12-03 15:58:53,616 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (org.ovirt.thread.pool-6-thread-5) [82a36a5] Command 'HSMGetAllTasksStatusesVDSCommand(HostName = ovirt-host1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='c7d02baa-7c47-484e-8146-ba01b36b49d6'})' execution failed: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM: ()
2016-12-03 15:58:53,617 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-6-thread-5) [82a36a5] FINISH, SpmStopVDSCommand, log id: 5f207339
2016-12-03 15:58:53,622 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (org.ovirt.thread.pool-6-thread-5) [82a36a5] FINISH, ResetIrsVDSCommand, log id: dbff776
2016-12-03 15:58:53,623 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (org.ovirt.thread.pool-6-thread-5) [82a36a5] START, DisconnectStoragePoolVDSCommand(HostName = ovirt-host1, DisconnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='c7d02baa-7c47-484e-8146-ba01b36b49d6', storagePoolId='58317bfd-02fd-030e-01a3-0181', vds_spm_id='1'}), log id: 26e25174
2016-12-03 15:58:54,630 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (org.ovirt.thread.pool-6-thread-5) [82a36a5] FINISH, DisconnectStoragePoolVDSCommand, log id: 26e25174
2016-12-03 15:58:54,633 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-6-thread-5) [82a36a5] START, ReconstructMasterVDSCommand(HostName = ovirt-host1, ReconstructMasterVDSCommandParameters:{runAsync='true', hostId='c7d02baa-7c47-484e-8146-ba01b36b49d6', vdsSpmId='1', storagePoolId='58317bfd-02fd-030e-01a3-0181', storagePoolName='Default', masterDomainId='a403cebb-243d-41f0-83c1-25cb9f911f30', masterVersion='698', domainsList='[StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='58317bfd-02fd-030e-01a3-0181', storageId='02d9ee81-6db1-494b-b32b-d54236717aa0'}', status='Unknown'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='58317bfd-02fd-030e-01a3-0181', storageId='e1140160-8578-436f-a69b-6fee6cc7c894'}', status='Unknown'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='58317bfd-02fd-030e-01a3-0181', storageId='a403cebb-243d-41f0-83c1-25cb9f911f30'}', status='Unknown'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='58317bfd-02fd-030e-01a3-0181', storageId='47bec5f5-e44f-4cd3-9249-38266a5f2ab9'}', status='Inactive'}]'}), log id: 606cb02b
2016-12-03 15:58:55,666 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-6-thread-5) [82a36a5] Failed in 'ReconstructMasterVDS' method
2016-12-03 15:58:55,670 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-5) [82a36a5] Correlation ID: null, Call Stack: null, Custom Event ID: -1,