On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David <d...@redhat.com> wrote:
>
> On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm <ov...@timmi.org> wrote:
> >
> > On 19.07.21 10:52, Yedidyah Bar David wrote:
> > > On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm <ov...@timmi.org> wrote:
> > >>
> > >> On 19.07.21 10:25, Yedidyah Bar David wrote:
> > >>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm <ov...@timmi.org> wrote:
> > >>>> On 19.07.21 09:27, Yedidyah Bar David wrote:
> > >>>>> On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm <ov...@timmi.org>
> > >>>>> wrote:
> > >>>>>> Hi Didi,
> > >>>>>>
> > >>>>>> Thank you for the quick response.
> > >>>>>>
> > >>>>>> On 19.07.21 07:59, Yedidyah Bar David wrote:
> > >>>>>>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm <ov...@timmi.org>
> > >>>>>>> wrote:
> > >>>>>>>> Hi List,
> > >>>>>>>>
> > >>>>>>>> I'm trying to understand why my hosted engine is moved from one
> > >>>>>>>> node to another from time to time.
> > >>>>>>>> It is happening sometimes multiple times a day, but there are
> > >>>>>>>> also days without it.
> > >>>>>>>>
> > >>>>>>>> I can see the following in the ovirt-hosted-engine-ha/agent.log:
> > >>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
> > >>>>>>>> Penalizing score by 1600 due to network status
> > >>>>>>>>
> > >>>>>>>> After that the engine will be shut down and started on another
> > >>>>>>>> host. The oVirt Admin portal is showing the following around the
> > >>>>>>>> same time:
> > >>>>>>>> Invalid status on Data Center Default. Setting status to Non
> > >>>>>>>> Responsive.
> > >>>>>>>>
> > >>>>>>>> But the whole cluster is working normally during that time.
> > >>>>>>>>
> > >>>>>>>> I believe that I somehow have a network issue on my side, but I
> > >>>>>>>> have no clue what kind of check is causing the network status to
> > >>>>>>>> be penalized.
> > >>>>>>>>
> > >>>>>>>> Does anyone have an idea how to investigate this further?
> > >>>>>>> Please also check broker.log. Do you see 'dig' failures?
> > >>>>>> Yes, I found them as well.
> > >>>>>>
> > >>>>>> Thread-1::WARNING::2021-07-19
> > >>>>>> 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
> > >>>>>> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> > >>>>>> ;; global options: +cmd
> > >>>>>> ;; connection timed out; no servers could be reached
> > >>>>>>
> > >>>>>>> This has already happened several times on our CI infrastructure,
> > >>>>>>> but yours is the first report from an actual user. See also:
> > >>>>>>>
> > >>>>>>> https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
> > >>>>>> So I understand that the following command is triggered to test
> > >>>>>> the network: "dig +tries=1 +time=5"
> > >>>>> Indeed.
> > >>>>>
> > >>>>>>> I didn't open a bug for this (yet?), also because I never
> > >>>>>>> reproduced it on my own machines and am not sure about the exact
> > >>>>>>> failing flow. If this is reliably reproducible for you, you might
> > >>>>>>> want to test the patch I pushed:
> > >>>>>>>
> > >>>>>>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
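For reference: the 'dns' network test failing here boils down to the dig
command shown in the broker.log snippet above - a rough approximation, not
the exact code from network.py:

    # Rough approximation of the broker's 'dns' network test: with no
    # query name, dig asks the resolvers from /etc/resolv.conf for the
    # root NS records - one try, 5-second timeout. A failure such as
    # "no servers could be reached" is what ends up penalizing the score.
    dig +tries=1 +time=5 || echo "dns network test failed"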
Now filed this bug and linked to it in the above patch. Thanks for your
report!

https://bugzilla.redhat.com/show_bug.cgi?id=1984356

Best regards,

> > >>>>>> I'm happy to give it a try.
> > >>>>>> Please confirm that I need to replace this file (network.py) on
> > >>>>>> all my nodes (CentOS 8.4 based) which can host my engine.
> > >>>>> It definitely makes sense to do so, but in principle there is no
> > >>>>> problem with applying it only on some of them. That's especially
> > >>>>> useful if you try this first on a test env and try to force a
> > >>>>> reproduction somehow (overload the network, disconnect stuff,
> > >>>>> etc.).
> > >>>> OK, will give it a try and report back.
> > >>> Thanks and good luck.
> >
> > Do I need to restart anything after that change?
>
> Yes, the broker. This might restart some other services there, so it's
> best to put the host into maintenance during this.
>
> > Also, please confirm that the comma after TCP is correct, as there
> > wasn't one after the timeout in row 110 before.
>
> It is correct, but not mandatory. We (my team, at least) often add it
> in such cases so that a hypothetical future patch that adds another
> parameter does not need to add it again (thus making that patch smaller
> and hopefully cleaner).
>
> > >>>>>>> Other ideas/opinions about how to enhance this part of the
> > >>>>>>> monitoring are most welcome.
> > >>>>>>>
> > >>>>>>> If this phenomenon is new for you, and you can reliably say it's
> > >>>>>>> not due to a recent "natural" higher network load, I wonder if
> > >>>>>>> it's due to some weird bug/change somewhere.
> > >>>>>> I'm quite sure that I have been seeing this since we moved to
> > >>>>>> 4.4.(4). Just for housekeeping: I'm running 4.4.7 now.
> > >>>>> We have been using 'dig' as the network monitor since 4.3.5,
> > >>>>> around one year before 4.4 was released:
> > >>>>> https://bugzilla.redhat.com/1659052
> > >>>>>
> > >>>>> Which version did you use before 4.4?
> > >>>> The last 4.3 versions were 4.3.7, 4.3.9 and 4.3.10, before we
> > >>>> migrated to 4.4.4.
> > >>> I now realize that in the above-linked bug we only changed the
> > >>> default for new setups. So if you deployed HE before 4.3.5, an
> > >>> upgrade to a later 4.3 would not change the default (as opposed to
> > >>> the upgrade to 4.4, which was actually a new deployment with engine
> > >>> backup/restore). Do you know which version your cluster was
> > >>> originally deployed with?
> > >> Hm, I'm sorry, but I don't recall this. I'm quite sure that we
> > >> started
> > >
> > > OK, thanks for trying.
> > >
> > >> with 4.0 something. We moved to an HE setup around September 2019,
> > >> but I don't recall the version. We also restored the backup from the
> > >> old installation into the HE environment, if I'm not wrong.
> > > If indeed this change was the trigger for you, you can rather easily
> > > try to change this to 'ping' and see if it helps - I think it's
> > > enough to change 'network_test' to 'ping' in
> > > /etc/ovirt-hosted-engine/hosted-engine.conf and restart the broker -
> > > I didn't try it, though (a concrete sketch follows at the end of this
> > > mail). But generally speaking, I do not think we want to change the
> > > default back to 'ping', but rather make 'dns' work better/well. We
> > > had valid reasons to move away from ping...
> >
> > OK, I will try this if the TCP change does not help me.
>
> Ok.
>
> In parallel, especially if this is reproducible, you might want to do
> some general monitoring of your network - packet losses, etc. - and
> correlate this with the failures you see.
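To spell out the 'ping' fallback from above as concrete steps - an
untested sketch; it assumes a 'network_test=...' line already exists in
the conf file (otherwise append one):

    # Put the host into local maintenance first, since restarting the
    # broker might restart other HA services on the host.
    hosted-engine --set-maintenance --mode=local
    # Switch the broker's network test from 'dns' to 'ping'.
    sed -i 's/^network_test=.*/network_test=ping/' \
        /etc/ovirt-hosted-engine/hosted-engine.conf
    systemctl restart ovirt-ha-broker
    hosted-engine --set-maintenance --mode=none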
>
> Best regards,
> --
> Didi

--
Didi