On 24 July 2018 at 19:42, Greg Sheremeta <gsher...@redhat.com> wrote:

>
>
> On Tue, Jul 24, 2018 at 10:53 AM Andrej Krejcir <akrej...@redhat.com>
> wrote:
>
>>
>>
>> On Mon, 23 Jul 2018 at 15:03, Martin Perina <mper...@redhat.com> wrote:
>>
>>>
>>>
>>> On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <d...@redhat.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> The issue seems to be that host-1 stopped responding, and I can see
>>>> some fluentd errors which we should look at.
>>>>
>>>> Jira opened to track this issue:
>>>> https://ovirt-jira.atlassian.net/browse/OVIRT-2363
>>>>
>>>> Martin, I also added you to the Jira - can you please have a look?
>>>>
>>>> error from node-1 messages log:
>>>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:14 -0400 [warn]: detached forwarding server 
>>>> 'lago-basic-suite-master-engine'
>>>> host="lago-basic-suite-master-engine" port=24224 phi=16.275347714068506
>>>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd:
>>>> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>>>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>>>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>>>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:14 -0400 fluent.warn: {"host":"lago-basic-suite-
>>>> master-engine","port":24224,"phi":16.275347714068506,"message":"detached
>>>> forwarding server 'lago-basic-suite-master-engine'
>>>> host=\"lago-basic-suite-master-engine\" port=24224
>>>> phi=16.275347714068506"}
>>>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:15 -0400 [warn]: detached forwarding server 
>>>> 'lago-basic-suite-master-engine'
>>>> host="lago-basic-suite-master-engine" port=24224 phi=16.70444149784817
>>>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd:
>>>> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>>>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>>>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>>>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:15 -0400 fluent.warn: {"host":"lago-basic-suite-
>>>> master-engine","port":24224,"phi":16.70444149784817,"message":"detached
>>>> forwarding server 'lago-basic-suite-master-engine'
>>>> host=\"lago-basic-suite-master-engine\" port=24224
>>>> phi=16.70444149784817"}
>>>> Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command
>>>> Invoked with warn=False executable=None _uses_shell=False
>>>> _raw_params=systemctl is-active 'collectd' removes=None argv=None
>>>> creates=None chdir=None stdin=None
>>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New
>>>> session 29 of user root.
>>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session
>>>> 29 of user root.
>>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting
>>>> Session 29 of user root.
>>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed
>>>> session 29.
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 [warn]: failed to flush the buffer.
>>>> error_class="RuntimeError" error="no nodes are available"
>>>> plugin_id="object:151a620"
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 [warn]: retry count exceededs limit.
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-
>>>> 0.12.42/lib/fluent/plugin/out_forward.rb:222:in `write_objects'
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-
>>>> 0.12.42/lib/fluent/output.rb:490:in `write'
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-
>>>> 0.12.42/lib/fluent/buffer.rb:354:in `write_chunk'
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-
>>>> 0.12.42/lib/fluent/buffer.rb:333:in `pop'
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-
>>>> 0.12.42/lib/fluent/output.rb:342:in `try_flush'
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-
>>>> 0.12.42/lib/fluent/output.rb:149:in `run'
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 [error]: throwing away old logs.
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no
>>>> nodes are available","plugin_id":"object:151a620","message":"failed to
>>>> flush the buffer. error_class=\"RuntimeError\" error=\"no nodes are
>>>> available\" plugin_id=\"object:151a620\""}
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
>>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>>> 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}
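[The phi values in the warnings above come from fluentd's phi accrual failure detector: out_forward computes a suspicion score from heartbeat arrival times and detaches a forwarding server once the score crosses phi_threshold (16 by default in fluentd 0.12), which matches the phi=16.27/16.70 readings in these lines. A sketch of the relevant out_forward settings follows; the actual OST fluentd configuration may differ.]

```
<match **>
  @type forward
  heartbeat_type udp
  # phi is a suspicion score from heartbeat timing; once it exceeds
  # phi_threshold the server is detached, producing the
  # "detached forwarding server" warnings seen above.
  phi_threshold 16
  <server>
    host lago-basic-suite-master-engine
    port 24224
  </server>
</match>
```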
>>>>
>>>>
>>>>
>>>> Thanks.
>>>> Dafna
>>>>
>>>
>>> Hi,
>>>
>>> I can see in vdsm.log that it received a kill signal:
>>>
>>> 2018-07-23 05:24:26,735-0400 INFO  (MainThread) [vds] Received signal
>>> 15, shutting down (vdsmd:68)
>>>
>>> And in /var/log/messages I found that mom was killed:
>>>
>>> Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM
>>> instance configured for VDSM purposes...
>>>
>>> ...
>>>
>>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
>>> stop-sigterm timed out. Killing.
>>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd:
>>> mom-vdsm.service: main process exited, code=killed, status=9/KILL
>>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM
>>> instance configured for VDSM purposes.
>>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit
>>> mom-vdsm.service entered failed state.
>>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
>>> failed.
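[The sequence in this log is systemd's normal stop escalation: "Stopping" delivers SIGTERM (signal 15), and when the service does not exit before the stop timeout, systemd escalates to SIGKILL (status=9/KILL), which is the "stop-sigterm timed out. Killing." line. A minimal Python sketch of the SIGTERM side, for illustration only; this is not vdsm's actual handler code.]

```python
import os
import signal

received = []

def handle_term(signum, frame):
    # vdsm logs "Received signal 15, shutting down" at this point and
    # begins an orderly shutdown; here we only record the signal.
    received.append(signum)

signal.signal(signal.SIGTERM, handle_term)

# systemd's "Stopping ..." step delivers SIGTERM; simulate it locally.
os.kill(os.getpid(), signal.SIGTERM)

# A process that never exits after SIGTERM is what triggers systemd's
# "stop-sigterm timed out. Killing." escalation to SIGKILL.
assert received == [signal.SIGTERM]  # SIGTERM is 15 on Linux
```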
>>>
>>> So Didi/Shirly/Martin, can the fluentd errors be related to the mom
>>> shutdown? And could this be the cause of the VDSM shutdown?
>>>
>> Hi,
>>
>> Mom is not related to fluentd, and a mom shutdown should not cause a
>> vdsm shutdown.
>>
>> The service dependency between vdsmd and mom-vdsm is weak (using
>> Wants=mom-vdsm.service).
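[In unit-file terms, the weak dependency described above looks like this. This is a sketch; the exact contents of the shipped vdsmd.service may differ.]

```ini
# vdsmd.service (excerpt, illustrative)
[Unit]
Description=Virtual Desktop Server Manager
# Wants= only pulls mom-vdsm.service in when vdsmd starts; if mom-vdsm
# fails or is killed, vdsmd keeps running. Requires= would instead
# propagate the stop to vdsmd.
Wants=mom-vdsm.service
```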
>>
>> Looking at /var/log/messages, both the mom-vdsm and vdsmd services were
>> restarted:
>>
>> Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM
>> instance configured for VDSM purposes...
>> ...
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM
>> instance configured for VDSM purposes.
>>
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit
>> mom-vdsm.service entered failed state.
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
>> failed.
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopping Virtual
>> Desktop Server Manager...
>> ...
>> Jul 23 05:24:29 lago-basic-suite-master-host-1 systemd: Stopped Virtual
>> Desktop Server Manager.
>> ...
>> Jul 23 05:25:26 lago-basic-suite-master-host-1 systemd: Starting Virtual
>> Desktop Server Manager...
>> ...
>> Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started Virtual
>> Desktop Server Manager.
>> Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started MOM
>> instance configured for VDSM purposes.
>> Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Starting MOM
>> instance configured for VDSM purposes...
>> ...
>> Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Started MOM
>> instance configured for VDSM purposes.
>> Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Starting MOM
>> instance configured for VDSM purposes...
>>
>>
>>
>>
>> The error in 008_basic_ui_sanity.py.junit.xml probably means that the
>> docker executable was not found on the machine running the test. Could
>> that be the cause of the failure?
>>
>> <error type="exceptions.OSError"
>>        message="[Errno 2] No such file or directory
>>        -------------------- >> begin captured stdout <<
>> ---------------------
>>        executing shell: docker ps
>>        --------------------- >> end captured stdout << ---------------
>>
>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>     testMethod()
>> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>     self.test(*self.arg)
>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test
>>     test()
>> File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 169, in start_grid
>>     _docker_cleanup()
>> File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 136, in _docker_cleanup
>>     _shell(["docker", "ps"])
>> File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 119, in _shell
>>     stderr=subprocess.PIPE)
>> File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
>>     errread, errwrite)
>> File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
>>     raise child_exception
>> [Errno 2] No such file or directory
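[That reading matches the traceback: Popen raises OSError with errno 2 (ENOENT) when the executable itself is missing, before the command ever runs. A small Python sketch of both the failure mode and a guard the test helper could use; `shell` here is a hypothetical stand-in, not the actual `_shell` from 008_basic_ui_sanity.py, and the sketch uses Python 3 even though the traceback is from Python 2.7.]

```python
import errno
import shutil
import subprocess

def shell(args):
    # Hypothetical guard: fail with a clear message instead of a bare
    # "[Errno 2] No such file or directory" when the binary is absent.
    if shutil.which(args[0]) is None:
        raise RuntimeError("%s: executable not found on this machine" % args[0])
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err

# Reproducing the failure mode from the traceback: launching a missing
# binary raises OSError with errno set to ENOENT (2).
try:
    subprocess.Popen(["no-such-binary-for-demo"])
except OSError as exc:
    assert exc.errno == errno.ENOENT
```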
>>
>>
> Yep, it looks like docker isn't installed, and yes, that would fail it.
> Any recent changes? I know Gal is working on some containerization of
> this [1], but I don't know what's been merged.
>

It seems there was a short period of time last week when docker was not
available in CentOS. Our mirrors server should protect against this type
of issue, but we were experiencing problems with it as well (it basically
ran out of disk space), so the jobs failed over to the upstream CentOS
repos, and the Docker installation in mock failed.



> [1] Change I5af15dce: Adjust UI test to run inside STDCI container |
> https://gerrit.ovirt.org/#/c/93074/
>
>
>>
>> Andrej
>>
>>
>>>>
>>>>
>>>> On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenk...@ovirt.org>
>>>> wrote:
>>>>
>>>>> Change 92882,9 (ovirt-engine) is probably the reason behind recent
>>>>> system test
>>>>> failures in the "ovirt-master" change queue and needs to be fixed.
>>>>>
>>>>> This change had been removed from the testing queue. Artifacts build
>>>>> from this
>>>>> change will not be released until it is fixed.
>>>>>
>>>>> For further details about the change see:
>>>>> https://gerrit.ovirt.org/#/c/92882/9
>>>>>
>>>>> For failed test results see:
>>>>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
>>>>> _______________________________________________
>>>>> Infra mailing list -- infra@ovirt.org
>>>>> To unsubscribe send an email to infra-le...@ovirt.org
>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>> oVirt Code of Conduct:
>>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>>> List Archives:
>>>>> https://lists.ovirt.org/archives/list/infra@ovirt.org/message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Martin Perina
>>> Associate Manager, Software Engineering
>>> Red Hat Czech s.r.o.
>>>
>>
>
>
> --
>
> GREG SHEREMETA
>
> SENIOR SOFTWARE ENGINEER - TEAM LEAD - RHV UX
>
> Red Hat NA
>
> <https://www.redhat.com/>
>
> gsher...@redhat.com    IRC: gshereme
> <https://red.ht/sig>
>
> _______________________________________________
> Devel mailing list -- de...@ovirt.org
> To unsubscribe send an email to devel-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/de...@ovirt.org/message/W6BR572DZKYDD6F7E2OBX2725FLLEMXW/
>
>


-- 
Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted