On Tue, Jul 24, 2018 at 10:53 AM Andrej Krejcir <akrej...@redhat.com> wrote:

>
>
> On Mon, 23 Jul 2018 at 15:03, Martin Perina <mper...@redhat.com> wrote:
>
>>
>>
>> On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <d...@redhat.com> wrote:
>>
>>> Hi,
>>>
>>> The issue seems to be that host-1 stopped responding, and I can see some
>>> fluentd errors which we should look at.
>>>
>>> Jira opened to track this issue:
>>> https://ovirt-jira.atlassian.net/browse/OVIRT-2363
>>>
>>> Martin, I also added you to the Jira - can you please have a look?
>>>
>>> error from node-1 messages log:
>>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:14 -0400 [warn]: detached forwarding server
>>> 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine"
>>> port=24224 phi=16.275347714068506
>>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd:
>>> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:14 -0400 fluent.warn:
>>> {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached
>>> forwarding server 'lago-basic-suite-master-engine'
>>> host=\"lago-basic-suite-master-engine\" port=24224 phi=16.275347714068506"}
>>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:15 -0400 [warn]: detached forwarding server
>>> 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine"
>>> port=24224 phi=16.70444149784817
>>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd:
>>> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:15 -0400 fluent.warn:
>>> {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached
>>> forwarding server 'lago-basic-suite-master-engine'
>>> host=\"lago-basic-suite-master-engine\" port=24224 phi=16.70444149784817"}
>>> Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command
>>> Invoked with warn=False executable=None _uses_shell=False
>>> _raw_params=systemctl is-active 'collectd' removes=None argv=None
>>> creates=None chdir=None stdin=None
>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New
>>> session 29 of user root.
>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session
>>> 29 of user root.
>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session
>>> 29 of user root.
>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed
>>> session 29.
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 [warn]: failed to flush the buffer.
>>> error_class="RuntimeError" error="no nodes are available"
>>> plugin_id="object:151a620"
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 [warn]: retry count exceededs limit.
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 [warn]:
>>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in
>>> `write_objects'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 [warn]:
>>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 [warn]:
>>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in
>>> `write_chunk'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 [warn]:
>>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 [warn]:
>>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 [warn]:
>>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 [error]: throwing away old logs.
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes
>>> are available","plugin_id":"object:151a620","message":"failed to flush the
>>> buffer. error_class=\"RuntimeError\" error=\"no nodes are available\"
>>> plugin_id=\"object:151a620\""}
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>>> 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}
>>>
>>>
>>>
>>> Thanks.
>>> Dafna
>>>
>>
>> Hi,
>>
>> I can see in vdsm.log that it received a kill signal:
>>
>> 2018-07-23 05:24:26,735-0400 INFO  (MainThread) [vds] Received signal 15,
>> shutting down (vdsmd:68)
>>
>> And in /var/log/messages I found that mom was killed:
>>
>> Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM
>> instance configured for VDSM purposes...
>>
>> ...
>>
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
>> stop-sigterm timed out. Killing.
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service:
>> main process exited, code=killed, status=9/KILL
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM
>> instance configured for VDSM purposes.
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit
>> mom-vdsm.service entered failed state.
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
>> failed.
>>
>> So Didi/Shirly/Martin, can the fluentd error be related to the mom
>> shutdown? And could this be the cause of the VDSM shutdown?
>>
> Hi,
>
> Mom is not related to fluentd and mom shutdown should not cause vdsm
> shutdown.
>
> The service dependency between vdsmd and mom-vdsm is weak (using
> Wants=mom-vdsm.service).
>
> Looking at /var/log/messages both mom-vdsm and vdsmd services were
> restarted:
>
> Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM
> instance configured for VDSM purposes...
> ...
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM
> instance configured for VDSM purposes.
>
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit
> mom-vdsm.service entered failed state.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
> failed.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopping Virtual
> Desktop Server Manager...
> ...
> Jul 23 05:24:29 lago-basic-suite-master-host-1 systemd: Stopped Virtual
> Desktop Server Manager.
> ...
> Jul 23 05:25:26 lago-basic-suite-master-host-1 systemd: Starting Virtual
> Desktop Server Manager...
> ...
> Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started Virtual
> Desktop Server Manager.
> Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started MOM
> instance configured for VDSM purposes.
> Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Starting MOM
> instance configured for VDSM purposes...
> ...
> Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Started MOM
> instance configured for VDSM purposes.
> Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Starting MOM
> instance configured for VDSM purposes...
>
> The error in 008_basic_ui_sanity.py.junit.xml probably means that the
> docker executable was not found on the machine running the test. Could
> that be the cause of the failure?
>
> <error type="exceptions.OSError"
>        message="[Errno 2] No such file or directory
>        -------------------- >> begin captured stdout <<
> ---------------------
>        executing shell: docker ps
>        --------------------- >> end captured stdout << ---------------
>
> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>     testMethod()
> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>     self.test(*self.arg)
> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test
>     test()
> File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 169, in start_grid
>     _docker_cleanup()
> File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 136, in _docker_cleanup
>     _shell(["docker", "ps"])
> File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 119, in _shell
>     stderr=subprocess.PIPE)
> File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
>     errread, errwrite)
> File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
>     raise child_exception
> [Errno 2] No such file or directory
>
>
Yep, it looks like docker isn't installed, and yes, that would fail the
test. Were there any recent changes? I know Gal is working on containerizing
this [1], but I don't know what has been merged.

[1] Change I5af15dce: Adjust UI test to run inside STDCI container |
https://gerrit.ovirt.org/#/c/93074/
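
For what it's worth, a bare "[Errno 2] No such file or directory" from
subprocess is easy to misread. Below is a minimal sketch of how a helper
could check for the executable first and fail with a clearer message. It is
not the actual _shell helper from 008_basic_ui_sanity.py: the names
run_command and MissingExecutable are hypothetical, and it assumes Python 3
(shutil.which), whereas the traceback above comes from Python 2.7.

```python
import shutil
import subprocess


class MissingExecutable(RuntimeError):
    """Raised when a required command cannot be found (hypothetical name)."""


def run_command(argv):
    """Run argv and return (returncode, stdout, stderr) as text.

    Hypothetical stand-in for the test's _shell helper. It looks the
    executable up before spawning it, so a missing binary is reported
    by name instead of as a bare OSError.
    """
    if shutil.which(argv[0]) is None:
        raise MissingExecutable(
            "required executable %r not found on PATH" % argv[0]
        )
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    out, err = proc.communicate()
    return proc.returncode, out.decode(), err.decode()
```

With something like this, run_command(["docker", "ps"]) on a machine without
docker would raise MissingExecutable naming "docker", which would have
pointed at the environment rather than at the change under test.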


>
> Andrej
>
>
>>>
>>>
>>> On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenk...@ovirt.org>
>>> wrote:
>>>
>>>> Change 92882,9 (ovirt-engine) is probably the reason behind recent
>>>> system test
>>>> failures in the "ovirt-master" change queue and needs to be fixed.
>>>>
>>>> This change had been removed from the testing queue. Artifacts build
>>>> from this
>>>> change will not be released until it is fixed.
>>>>
>>>> For further details about the change see:
>>>> https://gerrit.ovirt.org/#/c/92882/9
>>>>
>>>> For failed test results see:
>>>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
>>>> _______________________________________________
>>>> Infra mailing list -- infra@ovirt.org
>>>> To unsubscribe send an email to infra-le...@ovirt.org
>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>> oVirt Code of Conduct:
>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives:
>>>> https://lists.ovirt.org/archives/list/infra@ovirt.org/message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/
>>>>
>>>
>>>
>>
>>
>> --
>> Martin Perina
>> Associate Manager, Software Engineering
>> Red Hat Czech s.r.o.
>>
>


-- 

GREG SHEREMETA

SENIOR SOFTWARE ENGINEER - TEAM LEAD - RHV UX

Red Hat NA

<https://www.redhat.com/>

gsher...@redhat.com    IRC: gshereme
<https://red.ht/sig>
_______________________________________________
Infra mailing list -- infra@ovirt.org
To unsubscribe send an email to infra-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/infra@ovirt.org/message/W6BR572DZKYDD6F7E2OBX2725FLLEMXW/
