[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU
     [ https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antonenko Alexander updated AMBARI-15389:
-----------------------------------------
    Attachment: AMBARI-15389.patch

> Intermittent YARN service check failures during and post EU
> ------------------------------------------------------------
>
>                 Key: AMBARI-15389
>                 URL: https://issues.apache.org/jira/browse/AMBARI-15389
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.2.2
>            Reporter: Dmitry Lysnichenko
>            Assignee: Antonenko Alexander
>             Fix For: 2.2.2
>
>         Attachments: AMBARI-15389.patch, AMBARI-15389.patch, AMBARI-15389_2.2.patch
>
>
> Build # - Ambari 2.2.1.1 - #63
> Observed this issue in a couple of recent EU runs where the YARN service check reports failure:
> a. In one test, the EU ran from HDP 2.3.4.0 to 2.4.0.0 and the YARN service check reported failure during the EU itself; a retry of the operation led to the service check succeeding.
> b. In another test, the YARN service check reported failure when run post-EU; when I ran it again afterwards, it succeeded.
> Looks like there is some corner condition which causes this issue to be hit.
> {code}
> stderr: /var/lib/ambari-agent/data/errors-822.txt
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/service_check.py", line 142, in <module>
>     ServiceCheck().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
>     method(env)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/service_check.py", line 104, in service_check
>     user=params.smokeuser,
>   File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
>     result = function(command, **kwargs)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
>     tries=tries, try_sleep=try_sleep)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
>     result = _call(command, **kwargs_copy)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
>     raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of '/usr/bin/kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari...@example.com; yarn org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command ls -num_containers 1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar' returned 2.
> Hortonworks #
> This is MOTD message, added for testing in qe infra
> 16/03/03 02:33:51 INFO impl.TimelineClientImpl: Timeline service address: http://host:8188/ws/v1/timeline/
> 16/03/03 02:33:51 INFO distributedshell.Client: Initializing Client
> 16/03/03 02:33:51 INFO distributedshell.Client: Running Client
> 16/03/03 02:33:51 INFO client.RMProxy: Connecting to ResourceManager at host-9-5.test/127.0.0.254:8050
> 16/03/03 02:33:53 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=3
> 16/03/03 02:33:53 INFO distributedshell.Client: Got Cluster node info from ASM
> 16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host:25454, nodeAddresshost:8042, nodeRackName/default-rack, nodeNumContainers1
> 16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host-9-5.test:25454, nodeAddresshost-9-5.test:8042, nodeRackName/default-rack, nodeNumContainers0
> 16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host-9-1.test:25454, nodeAddresshost-9-1.test:8042, nodeRackName/default-rack, nodeNumContainers0
> 16/03/03 02:33:53 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.08336, queueMaxCapacity=1.0, queueApplicationCount=0, queueChildQueueCount=0
> 16/03/03 02:33:53 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
> 16/03/03 02:33:53 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
> 16/03/03 02:33:53 INFO distributedshell.Client: Max mem capabililty of resources in this cluster 10240
> 16/03/03 02:33:53 INFO distributedshell.Client: Max virtual cores capabililty of resources in this cluster 1
> 16/03/03 02:33:53 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
> 16/03/03 02:33:53 INFO distributedshell.Client: Set the environment for the application master
> 16/03/03 02:33:53 INFO distributedshell.Client: Setting up app master command
> 16/03/03 02:33:53 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx10m
> {code}
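The attached patches themselves are not reproduced in this digest. As context only: the traceback shows the check going through resource_management's checked_call, which already exposes tries/try_sleep retry parameters, so one way to absorb a transient failure is to retry the distributed-shell invocation. The sketch below is a hypothetical illustration of that idea, not the attached patch; the retry counts and any params attribute other than smokeuser are assumptions.

{code}
# Hypothetical sketch only -- not the attached AMBARI-15389 patch.
# Uses the Execute resource from resource_management, whose tries/try_sleep
# retry arguments are the same ones visible in checked_call in the traceback.
from resource_management.core.resources.system import Execute

import params  # Ambari service-check scripts import their service params module


def run_distributed_shell_check():
    # Same command shape as the failing check: kinit as the smoke user, then
    # submit a one-container distributed-shell application. Paths are taken
    # from the log above; params attribute names other than smokeuser are
    # illustrative.
    dshell_jar = ("/usr/hdp/current/hadoop-yarn-client/"
                  "hadoop-yarn-applications-distributedshell.jar")
    keytab = "/etc/security/keytabs/smokeuser.headless.keytab"
    cmd = ("/usr/bin/kinit -kt %s %s; "
           "yarn org.apache.hadoop.yarn.applications.distributedshell.Client "
           "-shell_command ls -num_containers 1 -jar %s"
           % (keytab, params.smokeuser_principal, dshell_jar))

    # A few tries with a short sleep would mask the intermittent failure seen
    # during/after EU (counts here are illustrative, not from the patch).
    Execute(cmd,
            user=params.smokeuser,
            tries=3,
            try_sleep=5)
{code}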
[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU
     [ https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antonenko Alexander updated AMBARI-15389:
-----------------------------------------
    Attachment: AMBARI-15389_2.2.patch
[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU
     [ https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko updated AMBARI-15389:
----------------------------------------
    Fix Version/s: 2.2.2
[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU
     [ https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko updated AMBARI-15389:
----------------------------------------
    Attachment: AMBARI-15389.patch
[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU
     [ https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko updated AMBARI-15389:
----------------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU
     [ https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko updated AMBARI-15389:
----------------------------------------
    Component/s: ambari-server
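Both failures described above cleared on a manual re-run of the service check. For completeness, a service check can also be re-triggered programmatically through Ambari's REST API; the helper below is a hypothetical sketch (the host, cluster name, and credentials are placeholders, and it relies on the third-party requests library).

{code}
# Hypothetical helper for re-running the YARN service check via Ambari's REST
# API -- the same operation the reporter retried from the UI.
import json

import requests  # third-party HTTP client

AMBARI_URL = "http://ambari-host:8080"  # placeholder
CLUSTER = "cl1"                         # placeholder
AUTH = ("admin", "admin")               # placeholder credentials


def run_yarn_service_check():
    body = {
        "RequestInfo": {
            "context": "YARN Service Check (retry)",
            "command": "YARN_SERVICE_CHECK",
        },
        "Requests/resource_filters": [{"service_name": "YARN"}],
    }
    resp = requests.post(
        "%s/api/v1/clusters/%s/requests" % (AMBARI_URL, CLUSTER),
        auth=AUTH,
        headers={"X-Requested-By": "ambari"},  # required by Ambari's CSRF check
        data=json.dumps(body),
    )
    resp.raise_for_status()
    return resp.json()  # contains the request id to poll for completion


if __name__ == "__main__":
    print(run_yarn_service_check())
{code}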