On Fri, Jul 23, 2021 at 6:17 PM Christoph Timm <ov...@timmi.org> wrote:
>
>
> On 21.07.21 at 12:33, Christoph Timm wrote:
> >
> > On 21.07.21 at 12:17, Yedidyah Bar David wrote:
> >> On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David <d...@redhat.com> wrote:
> >>> On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm <ov...@timmi.org> wrote:
> >>>>
> >>>> On 19.07.21 at 10:52, Yedidyah Bar David wrote:
> >>>>> On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm <ov...@timmi.org> wrote:
> >>>>>> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
> >>>>>>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm <ov...@timmi.org> wrote:
> >>>>>>>> On 19.07.21 at 09:27, Yedidyah Bar David wrote:
> >>>>>>>>> On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm <ov...@timmi.org> wrote:
> >>>>>>>>>> Hi Didi,
> >>>>>>>>>>
> >>>>>>>>>> thank you for the quick response.
> >>>>>>>>>>
> >>>>>>>>>> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> >>>>>>>>>>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm <ov...@timmi.org> wrote:
> >>>>>>>>>>>> Hi List,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm trying to understand why my hosted engine is moved from one node
> >>>>>>>>>>>> to another from time to time.
> >>>>>>>>>>>> It happens sometimes multiple times a day, but there are also days
> >>>>>>>>>>>> without it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I can see the following in ovirt-hosted-engine-ha/agent.log:
> >>>>>>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
> >>>>>>>>>>>> Penalizing score by 1600 due to network status
> >>>>>>>>>>>>
> >>>>>>>>>>>> After that the engine is shut down and started on another host.
> >>>>>>>>>>>> The oVirt Admin Portal shows the following around the same time:
> >>>>>>>>>>>> Invalid status on Data Center Default. Setting status to Non Responsive.
> >>>>>>>>>>>>
> >>>>>>>>>>>> But the whole cluster is working normally during that time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I believe that I somehow have a network issue on my side, but I have
> >>>>>>>>>>>> no clue what kind of check is causing the network status to be
> >>>>>>>>>>>> penalized.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Does anyone have an idea how to investigate this further?
> >>>>>>>>>>> Please check also broker.log. Do you see 'dig' failures?
> >>>>>>>>>> Yes, I found them as well.
> >>>>>>>>>>
> >>>>>>>>>> Thread-1::WARNING::2021-07-19 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
> >>>>>>>>>> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> >>>>>>>>>> ;; global options: +cmd
> >>>>>>>>>> ;; connection timed out; no servers could be reached
> >>>>>>>>>>
> >>>>>>>>>>> This happened several times already on our CI infrastructure, but
> >>>>>>>>>>> yours is the first report from an actual real user. See also:
> >>>>>>>>>>>
> >>>>>>>>>>> https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
> >>>>>>>>>>>
> >>>>>>>>>> So I understand that the following command is triggered to test the
> >>>>>>>>>> network: "dig +tries=1 +time=5"
> >>>>>>>>> Indeed.
> >>>>>>>>>
> >>>>>>>>>>> I didn't open a bug for this (yet?), also because I never reproduced
> >>>>>>>>>>> it on my own machines and am not sure about the exact failing flow.
> >>>>>>>>>>> If this is reliably reproducible for you, you might want to test
> >>>>>>>>>>> the patch I pushed:
> >>>>>>>>>>>
> >>>>>>>>>>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
> >> I have now filed this bug and linked to it in the above patch. Thanks for
> >> your report!
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1984356
> > Perfect, I added myself to CC as well.
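The probe discussed above can be reproduced by hand. The sketch below is a hedged approximation reconstructed from the log lines and the command quoted in the thread; it is not the actual code in network.py:

```shell
# Hedged sketch of the HA broker's 'dns' network test, reconstructed from
# the broker.log output above (an approximation, not the real network.py).
dns_check() {
    # With no query name, dig merely probes the resolvers listed in
    # /etc/resolv.conf; a non-zero exit is the "connection timed out;
    # no servers could be reached" case seen in broker.log.
    dig +tries=1 +time=5 >/dev/null 2>&1
}

if command -v dig >/dev/null && dns_check; then
    status="ok"
else
    status="penalized"    # agent.log then shows "Penalizing score by 1600"
fi
echo "network status: $status"
```

Per the thread, the linked gerrit patch is meant to make this probe more robust (the later discussion of a "comma after TCP" suggests it adds a TCP retry, i.e. something like `dig +tcp`); check the patch itself for the exact change.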
> >
> > I have implemented the change on one of my nodes, restarted
> > ovirt-ha-broker and moved the engine to that node.
> > Since then the issue has not occurred. I guess I will leave it running
> > until the end of the week and will then move the engine back to an
> > unchanged node to see whether the issue comes back.
> So I had no issue with the changed host until now. I moved the engine
> to a different host this morning and now the issue is back. So I will
> implement the fix on all my hosts now.
> I hope this fix will be permanently included in the next release.
Yes, the bug is targeted to 4.4.8 and the patch is merged.

Best regards,

> >>
> >> Best regards,
> >>
> >>>>>>>>>> I'm happy to give it a try.
> >>>>>>>>>> Please confirm that I need to replace this file (network.py) on all
> >>>>>>>>>> my nodes (CentOS 8.4 based) which can host my engine.
> >>>>>>>>> It definitely makes sense to do so, but in principle there is no
> >>>>>>>>> problem with applying it only on some of them. That's especially
> >>>>>>>>> useful if you try this first on a test env and try to force a
> >>>>>>>>> reproduction somehow (overload the network, disconnect stuff, etc.).
> >>>>>>>> OK, I will give it a try and report back.
> >>>>>>> Thanks and good luck.
> >>>> Do I need to restart anything after that change?
> >>> Yes, the broker. This might restart some other services there, so it's
> >>> best to put the host into maintenance during this.
> >>>
> >>>> Also, please confirm that the comma after TCP is correct, as there
> >>>> wasn't one before after the timeout in row 110.
> >>> It is correct, but not mandatory. We (my team, at least) often add it in
> >>> such cases so that a theoretical future patch adding another parameter
> >>> does not need to add it again (thus making that patch smaller and
> >>> hopefully cleaner).
> >>>
> >>>>>>>>>>> Other ideas/opinions about how to enhance this part of the
> >>>>>>>>>>> monitoring are most welcome.
> >>>>>>>>>>>
> >>>>>>>>>>> If this phenomenon is new for you, and you can reliably say it's
> >>>>>>>>>>> not due to a recent "natural" higher network load, I wonder if
> >>>>>>>>>>> it's due to some weird bug/change somewhere.
> >>>>>>>>>> I'm quite sure that I have seen this since we moved to 4.4.(4).
> >>>>>>>>>> Just for housekeeping: I'm running 4.4.7 now.
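The replace-and-restart procedure discussed above (host into maintenance, swap network.py, restart the broker) could be sketched roughly as below. The installed path of network.py is an assumption for a CentOS 8 host and should be verified, e.g. with `rpm -ql ovirt-hosted-engine-ha | grep network.py`:

```shell
# Hedged sketch of applying a patched network.py on one host, following the
# steps in the thread. The target path is an assumption; run as root.
apply_network_patch() {
    patched="$1"   # path to the patched network.py from the gerrit change
    target="/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/network.py"

    hosted-engine --set-maintenance --mode=local   # drain this host first
    cp -a "$target" "$target.bak"                  # keep a rollback copy
    cp "$patched" "$target"
    systemctl restart ovirt-ha-broker              # may restart other HA services
    hosted-engine --set-maintenance --mode=none
}
```

As the thread notes, the broker restart can drag other services with it, which is why the host goes into local maintenance first.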
> >>>>>>>>> We have used 'dig' as the network monitor since 4.3.5, around one
> >>>>>>>>> year before 4.4 was released: https://bugzilla.redhat.com/1659052
> >>>>>>>>>
> >>>>>>>>> Which version did you use before 4.4?
> >>>>>>>> The last 4.3 versions were 4.3.7, 4.3.9 and 4.3.10, before migrating
> >>>>>>>> to 4.4.4.
> >>>>>>> I now realize that in the above-linked bug we only changed the
> >>>>>>> default, for new setups. So if you deployed HE before 4.3.5, an
> >>>>>>> upgrade to a later 4.3 would not change the default (as opposed to an
> >>>>>>> upgrade to 4.4, which was actually a new deployment with engine
> >>>>>>> backup/restore). Do you know which version your cluster was
> >>>>>>> originally deployed with?
> >>>>>> Hm, I'm sorry, but I don't recall. I'm quite sure that we started
> >>>>> OK, thanks for trying.
> >>>>>
> >>>>>> with 4.0-something. But we moved to an HE setup around September 2019.
> >>>>>> I don't recall the version. But we also installed the backup from the
> >>>>>> old installation into the HE environment, if I'm not wrong.
> >>>>> If indeed this change was the trigger for you, you can rather easily
> >>>>> try to change this to 'ping' and see if this helps - I think it's
> >>>>> enough to change 'network_test' to 'ping' in
> >>>>> /etc/ovirt-hosted-engine/hosted-engine.conf and restart the broker -
> >>>>> I didn't try it, though. But generally speaking, I do not think we want
> >>>>> to change the default back to 'ping', but rather make 'dns' work
> >>>>> better/well. We had valid reasons to move away from ping...
> >>>> OK, I will try this if the tcp change does not help me.
> >>> Ok.
> >>>
> >>> In parallel, especially if this is reproducible, you might want to do
> >>> some general monitoring of your network - packet losses, etc. - and
> >>> correlate this with the failures you see.
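The 'ping' fallback described above amounts to a one-line change in the file named in the thread (explicitly untested by its author, so treat this fragment as a sketch):

```
# /etc/ovirt-hosted-engine/hosted-engine.conf (fragment)
network_test=ping
```

Followed by `systemctl restart ovirt-ha-broker` on that host. As the thread stresses, this is a workaround for diagnosis, not a recommended permanent default.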
> >>>
> >>> Best regards,
> >>> --
> >>> Didi
> >>
> >>
> > _______________________________________________
> > Users mailing list -- users@ovirt.org
> > To unsubscribe send an email to users-le...@ovirt.org
> > Privacy Statement: https://www.ovirt.org/privacy-policy.html
> > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> > List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/IJ3RIVKFFB63RIPQZXOS6HCAUTWSSPYP/
> --
> Didi

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/AVN3MCPDZXQWGLQFB5FI2D77YQBB4EFC/