On Fri, Jul 23, 2021 at 6:17 PM Christoph Timm <ov...@timmi.org> wrote:
>
>
> On 21.07.21 at 12:33, Christoph Timm wrote:
> >
> > On 21.07.21 at 12:17, Yedidyah Bar David wrote:
> >> On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David <d...@redhat.com> wrote:
> >>> On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm <ov...@timmi.org> wrote:
> >>>>
> >>>> On 19.07.21 at 10:52, Yedidyah Bar David wrote:
> >>>>> On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm <ov...@timmi.org> wrote:
> >>>>>> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
> >>>>>>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm <ov...@timmi.org> wrote:
> >>>>>>>> On 19.07.21 at 09:27, Yedidyah Bar David wrote:
> >>>>>>>>> On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm <ov...@timmi.org> wrote:
> >>>>>>>>>> Hi Didi,
> >>>>>>>>>>
> >>>>>>>>>> thank you for the quick response.
> >>>>>>>>>>
> >>>>>>>>>> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> >>>>>>>>>>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm <ov...@timmi.org> wrote:
> >>>>>>>>>>>> Hi List,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm trying to understand why my hosted engine is moved from one node
> >>>>>>>>>>>> to another from time to time.
> >>>>>>>>>>>> It happens sometimes multiple times a day, but there are also days
> >>>>>>>>>>>> without it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I can see the following in ovirt-hosted-engine-ha/agent.log:
> >>>>>>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
> >>>>>>>>>>>> Penalizing score by 1600 due to network status
> >>>>>>>>>>>>
> >>>>>>>>>>>> After that the engine is shut down and started on another host.
> >>>>>>>>>>>> The oVirt Admin Portal shows the following around the same time:
> >>>>>>>>>>>> Invalid status on Data Center Default. Setting status to Non Responsive.
> >>>>>>>>>>>>
> >>>>>>>>>>>> But the whole cluster is working normally during that time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I believe that I somehow have a network issue on my side, but I have
> >>>>>>>>>>>> no clue what kind of check is causing the network status to be
> >>>>>>>>>>>> penalized.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Does anyone have an idea how to investigate this further?
> >>>>>>>>>>> Please check also broker.log. Do you see 'dig' failures?
> >>>>>>>>>> Yes, I found them as well.
> >>>>>>>>>>
> >>>>>>>>>> Thread-1::WARNING::2021-07-19 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
> >>>>>>>>>> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> >>>>>>>>>> ;; global options: +cmd
> >>>>>>>>>> ;; connection timed out; no servers could be reached
> >>>>>>>>>>
> >>>>>>>>>>> This happened several times already on our CI infrastructure, but
> >>>>>>>>>>> yours is the first report from an actual real user. See also:
> >>>>>>>>>>>
> >>>>>>>>>>> https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
> >>>>>>>>>>>
> >>>>>>>>>> So I understand that the following command is triggered to test the
> >>>>>>>>>> network: "dig +tries=1 +time=5"
> >>>>>>>>> Indeed.
> >>>>>>>>>
> >>>>>>>>>>> I didn't open a bug for this (yet?), also because I never reproduced
> >>>>>>>>>>> it on my own machines and am not sure about the exact failing flow.
> >>>>>>>>>>> If this is reliably reproducible for you, you might want to test
> >>>>>>>>>>> the patch I pushed:
> >>>>>>>>>>>
> >>>>>>>>>>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
> >> I have now filed this bug and linked to it in the above patch. Thanks for
> >> your report!
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1984356
> > Perfect, I added myself to CC as well.
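The probe discussed above can be reproduced by hand. The sketch below is a hedged approximation reconstructed from the log lines and the command quoted in the thread; it is not the actual code in network.py:

```shell
# Hedged sketch of the HA broker's 'dns' network test, reconstructed from
# the broker.log output above (an approximation, not the real network.py).
dns_check() {
    # With no query name, dig merely probes the resolvers listed in
    # /etc/resolv.conf; a non-zero exit is the "connection timed out;
    # no servers could be reached" case seen in broker.log.
    dig +tries=1 +time=5 >/dev/null 2>&1
}

if command -v dig >/dev/null && dns_check; then
    status="ok"
else
    status="penalized"    # agent.log then shows "Penalizing score by 1600"
fi
echo "network status: $status"
```

Per the thread, the linked gerrit patch is meant to make this probe more robust (the later discussion of a "comma after TCP" suggests it adds a TCP retry, i.e. something like `dig +tcp`); check the patch itself for the exact change.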
> >
> > I have implemented the change on one of my nodes, restarted
> > ovirt-ha-broker and moved the engine to that node.
> > Since then the issue has not occurred. I guess I will leave it running
> > until the end of the week and will then move the engine back to an
> > unchanged node to see whether the issue comes back.
> So I had no issue with the changed host until now. I moved the engine
> to a different host this morning and now the issue is back. So I will
> implement the fix on all my hosts now.
> I hope this fix will be permanently included in the next release.
Yes, the bug is targeted to 4.4.8 and the patch is merged.

Best regards,

> >>
> >> Best regards,
> >>
> >>>>>>>>>> I'm happy to give it a try.
> >>>>>>>>>> Please confirm that I need to replace this file (network.py) on all
> >>>>>>>>>> my nodes (CentOS 8.4 based) which can host my engine.
> >>>>>>>>> It definitely makes sense to do so, but in principle there is no
> >>>>>>>>> problem with applying it only on some of them. That's especially
> >>>>>>>>> useful if you try this first on a test env and try to force a
> >>>>>>>>> reproduction somehow (overload the network, disconnect stuff, etc.).
> >>>>>>>> OK, I will give it a try and report back.
> >>>>>>> Thanks and good luck.
> >>>> Do I need to restart anything after that change?
> >>> Yes, the broker. This might restart some other services there, so it's
> >>> best to put the host into maintenance during this.
> >>>
> >>>> Also, please confirm that the comma after TCP is correct, as there
> >>>> wasn't one before after the timeout in row 110.
> >>> It is correct, but not mandatory. We (my team, at least) often add it in
> >>> such cases so that a theoretical future patch adding another parameter
> >>> does not need to add it again (thus making that patch smaller and
> >>> hopefully cleaner).
> >>>
> >>>>>>>>>>> Other ideas/opinions about how to enhance this part of the
> >>>>>>>>>>> monitoring are most welcome.
> >>>>>>>>>>>
> >>>>>>>>>>> If this phenomenon is new for you, and you can reliably say it's
> >>>>>>>>>>> not due to a recent "natural" higher network load, I wonder if
> >>>>>>>>>>> it's due to some weird bug/change somewhere.
> >>>>>>>>>> I'm quite sure that I have seen this since we moved to 4.4.(4).
> >>>>>>>>>> Just for housekeeping: I'm running 4.4.7 now.
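The replace-and-restart procedure discussed above (host into maintenance, swap network.py, restart the broker) could be sketched roughly as below. The installed path of network.py is an assumption for a CentOS 8 host and should be verified, e.g. with `rpm -ql ovirt-hosted-engine-ha | grep network.py`:

```shell
# Hedged sketch of applying a patched network.py on one host, following the
# steps in the thread. The target path is an assumption; run as root.
apply_network_patch() {
    patched="$1"   # path to the patched network.py from the gerrit change
    target="/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/network.py"

    hosted-engine --set-maintenance --mode=local   # drain this host first
    cp -a "$target" "$target.bak"                  # keep a rollback copy
    cp "$patched" "$target"
    systemctl restart ovirt-ha-broker              # may restart other HA services
    hosted-engine --set-maintenance --mode=none
}
```

As the thread notes, the broker restart can drag other services with it, which is why the host goes into local maintenance first.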
> >>>>>>>>> We have used 'dig' as the network monitor since 4.3.5, around one
> >>>>>>>>> year before 4.4 was released: https://bugzilla.redhat.com/1659052
> >>>>>>>>>
> >>>>>>>>> Which version did you use before 4.4?
> >>>>>>>> The last 4.3 versions were 4.3.7, 4.3.9 and 4.3.10, before migrating
> >>>>>>>> to 4.4.4.
> >>>>>>> I now realize that in the above-linked bug we only changed the
> >>>>>>> default, for new setups. So if you deployed HE before 4.3.5, an
> >>>>>>> upgrade to a later 4.3 would not change the default (as opposed to an
> >>>>>>> upgrade to 4.4, which was actually a new deployment with engine
> >>>>>>> backup/restore). Do you know which version your cluster was
> >>>>>>> originally deployed with?
> >>>>>> Hm, I'm sorry, but I don't recall. I'm quite sure that we started
> >>>>> OK, thanks for trying.
> >>>>>
> >>>>>> with 4.0-something. But we moved to an HE setup around September 2019.
> >>>>>> I don't recall the version. But we also installed the backup from the
> >>>>>> old installation into the HE environment, if I'm not wrong.
> >>>>> If indeed this change was the trigger for you, you can rather easily
> >>>>> try to change this to 'ping' and see if this helps - I think it's
> >>>>> enough to change 'network_test' to 'ping' in
> >>>>> /etc/ovirt-hosted-engine/hosted-engine.conf and restart the broker -
> >>>>> I didn't try it, though. But generally speaking, I do not think we want
> >>>>> to change the default back to 'ping', but rather make 'dns' work
> >>>>> better/well. We had valid reasons to move away from ping...
> >>>> OK, I will try this if the tcp change does not help me.
> >>> Ok.
> >>>
> >>> In parallel, especially if this is reproducible, you might want to do
> >>> some general monitoring of your network - packet losses, etc. - and
> >>> correlate this with the failures you see.
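The 'ping' fallback described above amounts to a one-line change in the file named in the thread (explicitly untested by its author, so treat this fragment as a sketch):

```
# /etc/ovirt-hosted-engine/hosted-engine.conf (fragment)
network_test=ping
```

Followed by `systemctl restart ovirt-ha-broker` on that host. As the thread stresses, this is a workaround for diagnosis, not a recommended permanent default.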
> >>>
> >>> Best regards,
> >>> --
> >>> Didi
> >>
> >>
> > _______________________________________________
> > Users mailing list -- users@ovirt.org
> > To unsubscribe send an email to users-le...@ovirt.org
> > Privacy Statement: https://www.ovirt.org/privacy-policy.html
> > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> > List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/IJ3RIVKFFB63RIPQZXOS6HCAUTWSSPYP/
> --
> Didi

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/AVN3MCPDZXQWGLQFB5FI2D77YQBB4EFC/