[ovirt-users] Re: DR on hyperconverged deployment

wodel youchi Thu, 02 Apr 2020 16:34:23 -0700

Hi,
Thank you for your reply. I had already done all of that, but I didn't
understand everything and I ran into several problems while doing the
fail-over and then the fail-back. I am writing this mail in the hope of
clarifying those points.
I will try to express myself correctly and give as much detail as I can.

*The LAB :*
My lab contains two single-host oVirt-HCI platforms: one acts as the *primary
site (the source)*, the other as the *disaster-recovery site (the target)*.
Each HCI site contains one data domain, which is a gluster volume backed by
one brick. The source and target volumes have the same size, and both were
created during the HCI deployment.
*At the end of the deployment, I detached and then deleted the gluster data
domain on the target site, but I didn't delete the target volume.*

My goal is to test the disaster recovery process (active-passive DR, to be
precise) on an HCI implementation, covering both the fail-over and the
fail-back from start to finish.

*Documentation*

I followed the RHHI 1.7 guide
Maintaining_Red_Hat_Hyperconverged_Infrastructure_for_Virtualization-en-US
and then started my implementation.

I prepared all the ansible playbooks.

*The Test procedure:*

*Fail-over*

1 - Create a Windows10 VM on the source volume.

2 - Replicate to the DR site.

3 - Execute the fail-over procedure and test whether the VM is usable on the
target platform.

4 - Detach and Delete the data domain in the target platform without
touching the target volume

5 - Make changes to the Win10 VM on the source volume (creating files and
installing software)

6 - Replicate again to the DR site, then execute another fail-over and see
whether the modifications were synced.

*Fail-back*

1 - Make changes to the Win10 VM on the target volume (deleting files) *and
especially creating a snapshot*

2 - Detach and Delete the data domain in the source platform without
touching the source volume.

3 - Replicate to the source site.

4 - Execute the clean up playbook

5 - Execute the fail-over and check that the VM is usable on the source
platform and that the modifications were synced, especially the snapshot

*Things I need to confirm :*

1 - When creating the geo-replication from the primary site to the target
site, we reach the step "*Scheduling regular backups using
geo-replication*". From my understanding, it works like a cron job that
starts the geo-replication at a specific time of day. From my testing, the
geo-replication starts syncing at that precise time, and when its "*CRAWL
STATUS*" reaches "Changelog Crawl" it stops the synchronization. In other
words, it stops once the geo-replication has caught up to the check-point
(the specified time).

The smallest interval the configuration window offers is 24 hours, which
means that in the event of a disaster you can at best recover the data as it
was the day before. *Is this correct?*
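
To make the arithmetic behind this question concrete, here is a minimal
sketch. It assumes the schedule really does fire once every 24 hours and that
a sync takes a hypothetical two hours to finish; the function name and both
figures are illustrative and do not come from the documentation.

```python
from datetime import timedelta


def worst_case_rpo(schedule_interval: timedelta, sync_duration: timedelta) -> timedelta:
    """Upper bound on lost data under a checkpoint-style schedule.

    The target only holds data up to the last checkpoint. If the disaster
    hits just before the *next* scheduled sync finishes, everything written
    since the previous checkpoint is lost: one interval plus one sync run.
    """
    return schedule_interval + sync_duration


# Assumed values: daily schedule, two-hour sync (both hypothetical).
rpo = worst_case_rpo(timedelta(hours=24), timedelta(hours=2))
print(rpo)  # 1 day, 2:00:00
```

Under these assumptions the worst case is slightly more than one day, because
a sync that is still running at the time of the disaster cannot be relied on.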

*Problems encountered during the test:*

*Fail-over*

1 - When executing the fail-over the first time (ansible-playbook
dr-rhv-failover.yml --tags "fail_over"), the import of the target data
domain failed with the error : *An exception occurred during task
execution. To see the full traceback, use -vvv. The error was:
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is
"[Error in creating a Storage Domain. The selected storage path is not
empty (probably contains another Storage Domain). Either remove the
existing Storage Domain from this path, or change the Storage path).]".
HTTP response code is 400.* I tried to import the domain manually from
oVirt's admin console and got the same error, so I did the following:

- I deleted the target volume and the brick and the sub-directory of the
brick.

- I recreated the volume from scratch.

- I redid the geo-replication synchronization from the source.

- I executed the fail-over and this time the target data domain was
imported correctly and the Win10 VM was started correctly.

2 - I detached then deleted the target data domain without touching the
target volume, then I made changes to the Win10 VM on the source site, then
I created a new geo-replication schedule, and after the replication I
executed another fail-over.

- The Win10 VM started successfully and the changes made were synced.
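
Regarding the "storage path is not empty" error in point 1, one way to see
what is actually sitting on a volume before an import is to list the
directories that look like storage domains. The sketch below is only a
diagnostic idea, not what the engine itself does. It assumes the usual layout
of a file-based domain (a top-level directory containing dom_md/metadata);
the function name and the demo directory names are hypothetical, and the demo
runs against a temporary directory standing in for the mounted gluster
volume.

```python
import tempfile
from pathlib import Path
from typing import List


def find_storage_domains(mount_point: str) -> List[str]:
    """Return names of top-level directories that look like storage domains.

    Assumption: a file-based domain directory contains dom_md/metadata.
    """
    root = Path(mount_point)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and (entry / "dom_md" / "metadata").is_file()
    )


# Demo on a throwaway directory standing in for the mounted volume.
with tempfile.TemporaryDirectory() as mount:
    for name in ("domain-a", "domain-b"):  # hypothetical domain directories
        md_dir = Path(mount) / name / "dom_md"
        md_dir.mkdir(parents=True)
        (md_dir / "metadata").write_text("placeholder")
    (Path(mount) / "unrelated").mkdir()  # ignored: no dom_md/metadata inside
    found = find_storage_domains(mount)

print(found)  # ['domain-a', 'domain-b']
```

If more than one directory shows up on the real volume, that may be worth
mentioning when reporting the import error.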

*Fail-back*
1 - The documentation doesn't explain the fail-back procedure thoroughly.
In particular, it doesn't explain what the dr-cleanup.yml playbook does.

2 - When launching the fail-back playbook, at some point I get this message:

*TASK [oVirt.disaster-recovery : Failback Replication Sync pause] ********
[oVirt.disaster-recovery : Failback Replication Sync pause]
[Failback Replication Sync] Please press ENTER once the destination storage
domains are ready to be used for the destination setup:*

What does this mean?

3 - I made some changes to the Win10 VM and I created a snapshot of that VM.

4.a - To replicate the data from the target site back to the primary site, I
created a new geo-replication session from the target volume to the source
volume. I got a warning that the source volume was not empty, so I forced the
geo-replication creation, then:
- I detached and deleted the source data domain without touching the source
volume.
- I started the geo-replication manually (without a schedule) and when it
reached the state of "Changelog Crawl" I stopped it.
- I executed the clean-up playbook, then I executed the fail-back playbook.
- I got an error: the import of the source data domain failed with the
error : *An exception occurred during task execution. To see the full
traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is
"Operation Failed". Fault detail is "[Error in creating a Storage Domain.
The selected storage path is not empty (probably contains another Storage
Domain). Either remove the existing Storage Domain from this path, or
change the Storage path).]". HTTP response code is 400.*

4.b - So I redid the test, but this time:
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication manually (without a schedule) and when it
reached the state of "Changelog Crawl" I stopped it.
- I executed the clean-up playbook, then I executed the fail-back playbook.
- I got the same error: the import of the source data domain failed with the
error : *An exception occurred during task execution. To see the full
traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is
"Operation Failed". Fault detail is "[Error in creating a Storage Domain.
The selected storage path is not empty (probably contains another Storage
Domain). Either remove the existing Storage Domain from this path, or
change the Storage path).]". HTTP response code is 400.*

4.c - I redid the test once more, but:
- I deleted the source volume and its brick, then I created them again.
- I started the geo-replication using a schedule this time.
- I executed the clean-up playbook, then I executed the fail-back playbook.
- *This time the source data domain was imported correctly and the Win10 VM
was started and the modifications were synced.*
- The snapshot was imported, but there was another snapshot with it called
"Win10-TMPDR".

Regards.

On Thu, 2 Apr 2020 at 08:42, Eyal Shenitzky <eshen...@redhat.com> wrote:

> If your intention is to use an active-passive disaster recovery solution,
> you can have a look at the following guide:
>
> https://ovirt.org/documentation/disaster-recovery-guide/active_passive_overview.html
>
> On Wed, 1 Apr 2020 at 16:42, wodel youchi <wodel.you...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to configure and test disaster recovery on ovirt HCI
>>
>> And to understand how it works
>> What is the minimum RPO and its relationship with checkpoint
>> And what are the steps to fail back
>>
>> Regards
>>
>> On Wed, 1 Apr 2020 at 14:16, Eyal Shenitzky <eshen...@redhat.com> wrote:
>>
>>> Hi Wodel,
>>>
>>> Can you please explain what you are trying to do?
>>> I am not sure I understand it from your question.
>>>
>>> On Wed, 1 Apr 2020 at 12:55, wodel youchi <wodel.you...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I re-did the test and it seems that the minimum RPO is one day and if
>>>> someone could confirm that would be great
>>>>
>>>> As for the snapshot this time it was synced
>>>>
>>>> Then I tried to test the fail-back and I found that the documentation
>>>> is not clear:
>>>> - it is not clear what the purpose of the dr-clear playbook is
>>>> - it is not clear what this means: put the target volume in read-write
>>>> mode and the source volume in read-only mode
>>>> - Do we have to sync back using a new geo-replication link from the DR
>>>> volume to the source volume?
>>>> I tried to do so. In my first trial I forced the creation of the reverse
>>>> geo-replication without deleting the content of the source volume, then I
>>>> started the replication manually (I didn't use the checkpoint) and I
>>>> stopped the replication once it reached the changelog state, but I couldn't
>>>> import the source volume; I got the error: volume is not empty
>>>>
>>>> In my second trial I deleted and recreated the source volume from
>>>> scratch, then I started the reverse replication manually; at the end I got
>>>> the same error
>>>>
>>>> In my third trial I deleted the source volume and recreated it from
>>>> scratch, but I replicated back using the checkpoint method, and this time
>>>> the fail-back worked.
>>>>
>>>> Could someone shed some light on this?
>>>>
>>>> Thank you
>>>> Regards.
>>>>
>>>> On Sun, 29 Mar 2020 at 19:19, wodel youchi <wodel.you...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I need to understand some things about DR on oVirt-HCI
>>>>>
>>>>>
>>>>>    - What does "Scheduling regular backups using geo-replication"
>>>>>    mean (point 3.3.4 of the RHHI 1.7 doc, Maintaining RHHI)?
>>>>>       - Does this mean creating a check-point?
>>>>>       - If yes, does this mean that the geo-replication process will
>>>>>       sync data up to that check-point and then stop the
>>>>>       synchronization, then repeat the same cycle the day after? Does
>>>>>       this mean that the minimum RPO is one day?
>>>>>    - I created a snapshot of a VM on the source Manager, I synced the
>>>>>    volume, then I executed a DR. The VM was started on the Target
>>>>>    Manager but it didn't have its snapshot. Any idea?
>>>>>
>>>>>
>>>>> Regards, be safe.
>>>>>
>>>> _______________________________________________
>>>> Users mailing list -- users@ovirt.org
>>>> To unsubscribe send an email to users-le...@ovirt.org
>>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>>> oVirt Code of Conduct:
>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives:
>>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/N2MSZUYT2GE33IVUKGVYHLAO33ZFMJ7N/
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Eyal Shenitzky
>>>
>>
>
> --
> Regards,
> Eyal Shenitzky
>

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/4BDPPF2KWU5PDQTHNDTC6JBWD57UMFAE/
