Hi Gianluca,

On Fri, Sep 10, 2021 at 10:04 AM Gianluca Cecchi
<gianluca.cec...@gmail.com> wrote:
>
>
> On Wed, Sep 1, 2021 at 4:26 PM Gianluca Cecchi <gianluca.cec...@gmail.com> 
> wrote:
>>
>> On Wed, Sep 1, 2021 at 4:00 PM Yedidyah Bar David <d...@redhat.com> wrote:
>>>
>>>
>>> >
>>> > So I think there was something wrong with my system, or perhaps a 
>>> > regression on this in 4.4.8.
>>> >
>>> > I see these lines in ansible steps of deploy of RHV 4.3 -> 4.4
>>> >
>>> > [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Remove host used to 
>>> > redeploy]
>>> > [ INFO  ] changed: [localhost -> 192.168.222.170]
>>> >
>>> > possibly this step should remove the host that I'm reinstalling...?
>>>
>>> It should. From the DB, before adding it again. Matches on the uuid
>>> (search the code for unique_id_out if you want the details). Why?
>>>
>>> (I didn't follow all this thread, ignoring the rest for now...)
>>>
>>> Best regards,
>>>
>>>
>>
>> This is the step where I suspect a regression in 4.4.8 (compared 
>> with 4.4.7) when updating the first hosted-engine host during the upgrade 
>> flow and retaining its hostname details.

What's the regression?

>> I'm going to test with the latest 4.4.8 async 2 and see if it solves the 
>> problem. Otherwise I'm going to open a bugzilla and attach the logs.

Can you clarify what the bug is?

>>
>> Gianluca
>
>
> So I tried with 4.4.8 async 2, but I hit the same problem
>
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Check actual cluster 
> location]
> [ INFO  ] skipping: [localhost]
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Enable GlusterFS at cluster 
> level]
> [ INFO  ] skipping: [localhost]
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Set VLAN ID at datacenter 
> level]
> [ INFO  ] skipping: [localhost]
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Get active list of active 
> firewalld zones]
> [ INFO  ] changed: [localhost]
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Configure libvirt firewalld 
> zone]
> [ INFO  ] changed: [localhost]
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Add host]
> [ INFO  ] changed: [localhost]
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Include after_add_host 
> tasks files]
> [ INFO  ] You can now connect to 
> https://novirt2.localdomain.local:6900/ovirt-engine/ and check the status of 
> this host and eventually remediate it, please continue only when the host is 
> listed as 'up'
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : include_tasks]
> [ INFO  ] ok: [localhost]
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Create temporary lock file]
> [ INFO  ] changed: [localhost -> localhost]
> [ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Pause execution until 
> /tmp/ansible.wy3ichvk_he_setup_lock is removed, delete it once ready to 
> proceed]
>
> the host remains NonResponsive in the local engine, and engine.log keeps 
> showing the same error
>
> 2021-09-10 08:44:51,481+02 ERROR 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] 
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [] 
> Command 'GetCapabilitiesAsyncVDSCommand(HostName = novirt2.localdomain.local, 
> VdsIdAndVdsVDSCommandParametersBase:{hostId='ca9ff6f7-5a7c-4168-9632-998c52f76cfa',
>  
> vds='Host[novirt2.localdomain.local,ca9ff6f7-5a7c-4168-9632-998c52f76cfa]'})' 
> execution failed: java.net.ConnectException: Connection refused
>
> so the initial install/config of novirt2 doesn't start
>
> So the scenario is
>
> initial 4.3.10 with 2 hosts (novirt1 and novirt2) and 1 self-hosted engine (novmgr)
> iSCSI based storage: hosted_engine storage domain and one data storage domain
>
> This is a nested env, so that through snapshots I can retry and repeat steps.
> novirt1 and novirt2 are two VMs under one oVirt 4.4 env composed of a 
> single host and an external engine
>
> the steps:
> at the beginning, 1 VM running under novirt1 and the hosted engine running 
> under novirt2
> . global maintenance
> . stop engine
> . backup
> . shutdown engine VM and scratch novirt2
> (actually I simulate the scenario where I deploy novirt2 on new hw, that 
> is, on a clone of the novirt2 VM)
> I already tested (in a previous version of 4.4.8) that going through a 
> different hostname works

Correct

> As novirt2 and novirt1 (in 4.3) are VMs running on the same hypervisor, I see 
> that in their hw details I have the same serial number and the usual random 
> uuid

Same serial number? Doesn't sound right. Any idea why it's the same?
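
To double-check, you can query both values on each host the same way
the deploy code does (both are standard dmidecode string keywords):

    # Print this machine's SMBIOS UUID and serial number; the deploy
    # falls back to the UUID when /etc/vdsm/vdsm.id is absent.
    dmidecode -s system-uuid
    dmidecode -s system-serial-number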

>
> novirt1
> uuid B1EF9AFF-D4BD-41A1-B26E-7DD0CC440963
> serial number 00fa984c-d5a1-e811-906e-00163566263e
>
> novirt2
> uuid D584E962-5461-4FA5-AFFA-DB413E17590C
> serial number 00fa984c-d5a1-e811-906e-00163566263e
>
> and the new novirt2, which being a clone has a different uuid, reports (from 
> dmidecode):
> uuid: 10b9031d-a475-4b41-a134-bad2ede3cf11
> serial number: 00fa984c-d5a1-e811-906e-00163566263e
>
> Unfortunately I cannot try at the moment the scenario where I deploy the new 
> novirt2 on the same virtual hw, because in the first 4.3 install I configured 
> the OS disk as 50 GB, and with this size 4.4.8 complains about insufficient 
> space. And with the snapshot active in preview, I cannot resize the disk.
> If needed, I can reinstall 4.3 on an 80 GB disk and try the same, maintaining 
> the same hw... but this would imply that in general I cannot upgrade using 
> different hw while reusing the same hostnames... correct?

Yes. Either reuse a host and keep its name (what we recommend in the
upgrade guide) or use a new host and a new name (backup/restore
guide).

The condition to remove the host prior to adding it is based on
unique_id_out, which is set by this task (see also bz 1642440, 1654697):

      - name: Get host unique id
        shell: |
          if [ -e /etc/vdsm/vdsm.id ];
          then cat /etc/vdsm/vdsm.id;
          elif [ -e /proc/device-tree/system-id ];
          then cat /proc/device-tree/system-id; #ppc64le
          else dmidecode -s system-uuid;
          fi;
        environment: "{{ he_cmd_lang }}"
        changed_when: true
        register: unique_id_out

So if you want to "make this work", you can set the uuid (either in
your (virtual) BIOS, to affect the value read at deploy time, or in
/etc/vdsm/vdsm.id) to match the one of the old host (the one whose
name you want to reuse). I didn't test this myself, though.
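
If you do try the vdsm.id route, a minimal untested sketch, reusing
the old novirt2 uuid you quoted above:

    # UNTESTED: make the new host report the old host's unique id, so
    # that the "Remove host used to redeploy" task matches the old DB
    # row. Run on the new host before deploy; the engine-side match is
    # case-insensitive (upper() on both sides).
    echo 'D584E962-5461-4FA5-AFFA-DB413E17590C' > /etc/vdsm/vdsm.id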

Some time ago we had a similar case, and as a result I filed this doc
bug (fixed since, so the docs are updated - both the admin/upgrade and
backup/restore guides - also for oVirt, despite the bug being filed on
RHV and despite RHV and oVirt docs not being synced automatically):

https://bugzilla.redhat.com/show_bug.cgi?id=1921048

>
> anyway if you want to check generated logs at local engine side and novirt2 
> side here they are:
>
> Contents under /var/log of novmgr (tar.gz format)
> https://drive.google.com/file/d/1e4WwN4D8GDBpsGqwpwM40MGcLISeOzGO/view?usp=sharing
>
> Contents under /var/log/of novirt2 (tar.gz format)
> https://drive.google.com/file/d/1uQxlsbPVclW4xcAbCP8dXyIF2HlLqaR-/view?usp=sharing

2021-09-10 00:52:00,950+0200 INFO ansible task start {'status': 'OK',
'ansible_type': 'task', 'ansible_playbook':
'/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml',
'ansible_task': 'ovirt.ovirt.hosted_engine_setup : Remove host used to
redeploy'}
2021-09-10 00:52:00,951+0200 DEBUG ansible on_any args TASK:
ovirt.ovirt.hosted_engine_setup : Remove host used to redeploy  kwargs
is_conditional:False
2021-09-10 00:52:00,951+0200 DEBUG ansible on_any args localhost TASK:
ovirt.ovirt.hosted_engine_setup : Remove host used to redeploy  kwargs
2021-09-10 00:52:01,688+0200 INFO ansible ok {'status': 'OK',
'ansible_type': 'task', 'ansible_playbook':
'/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml',
'ansible_host': 'localhost', 'ansible_task': 'Remove host used to
redeploy', 'task_duration': 0}
2021-09-10 00:52:01,688+0200 DEBUG ansible on_any args
<ansible.executor.task_result.TaskResult object at 0x7f26931196a0>
kwargs
2021-09-10 00:52:01,868+0200 DEBUG var changed: host "localhost" var
"db_remove_he_host" type "<class 'dict'>" value: "{
    "changed": true,
    "cmd": [
        "psql",
        "-d",
        "engine",
        "-c",
        "SELECT deletevds(vds_id) FROM (SELECT vds_id FROM vds WHERE
upper(vds_unique_id)=upper('10b9031d-a475-4b41-a134-bad2ede3cf11')) t"
    ],
    "delta": "0:00:00.021140",
    "end": "2021-09-10 00:52:01.467764",
    "failed": false,
    "rc": 0,
    "start": "2021-09-10 00:52:01.446624",
    "stderr": "",
    "stderr_lines": [],
    "stdout": " deletevds \n-----------\n(0 rows)",
    "stdout_lines": [
        " deletevds ",
        "-----------",
        "(0 rows)"
    ]
}"

Meaning, it wasn't removed.
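
If you want to verify on your side, you can run the same kind of
query the deploy ran (read-only this time) on the engine machine,
invoked the same way (same user/auth) as the deploy task, and compare
the stored id against the new host's dmidecode output:

    # Show the unique id the engine stored for the host; vds_unique_id
    # is the column the removal query above filters on, and vds_name
    # should hold the host's name in the same view.
    psql -d engine -c \
      "SELECT vds_name, vds_unique_id FROM vds WHERE vds_name LIKE 'novirt2%'"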

Perhaps, if you do want to open a bug, it should say something like:
"HE deploy should remove the old host based on its name, and not its
UUID". However, it's not completely clear to me that this won't
introduce new regressions.
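
For illustration only, such a name-based removal would be the same
deletevds call with a different WHERE clause (hypothetical - this is
not what the role does today):

    # HYPOTHETICAL: remove the old host row by name instead of by
    # unique id.
    psql -d engine -c \
      "SELECT deletevds(vds_id) FROM (SELECT vds_id FROM vds WHERE
       vds_name='novirt2.localdomain.local') t"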

I admit I didn't completely understand your flow, and especially your
considerations there. If you think the current behavior prevents an
important flow, please clarify.

Best regards,
-- 
Didi