[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU

2016-03-31 Thread Antonenko Alexander (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antonenko Alexander updated AMBARI-15389:
-
Attachment: AMBARI-15389.patch

> Intermittent YARN service check failures during and post EU
> ---
>
> Key: AMBARI-15389
> URL: https://issues.apache.org/jira/browse/AMBARI-15389
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.2.2
>Reporter: Dmitry Lysnichenko
>Assignee: Antonenko Alexander
> Fix For: 2.2.2
>
> Attachments: AMBARI-15389.patch, AMBARI-15389.patch, 
> AMBARI-15389_2.2.patch
>
>
> Build # - Ambari 2.2.1.1 - #63
> Observed this issue in a couple of recent EU (Express Upgrade) runs where the YARN service check reports a failure:
> a. In one test, the EU ran from HDP 2.3.4.0 to 2.4.0.0 and the YARN service check reported a failure during the EU itself; retrying the operation made the service check succeed.
> b. In another test, the YARN service check reported a failure when run after the EU; running it again afterwards succeeded.
> Looks like there is some corner case that causes this issue to be hit. The failing invocation is sketched below, after the log excerpt.
> {code}
> stderr:   /var/lib/ambari-agent/data/errors-822.txt
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/service_check.py", line 142, in <module>
>     ServiceCheck().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
>     method(env)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/service_check.py", line 104, in service_check
>     user=params.smokeuser,
>   File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
>     result = function(command, **kwargs)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
>     tries=tries, try_sleep=try_sleep)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
>     result = _call(command, **kwargs_copy)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
>     raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of '/usr/bin/kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari...@example.com; yarn org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command ls -num_containers 1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar' returned 2.  Hortonworks #
> This is MOTD message, added for testing in qe infra
> 16/03/03 02:33:51 INFO impl.TimelineClientImpl: Timeline service address: http://host:8188/ws/v1/timeline/
> 16/03/03 02:33:51 INFO distributedshell.Client: Initializing Client
> 16/03/03 02:33:51 INFO distributedshell.Client: Running Client
> 16/03/03 02:33:51 INFO client.RMProxy: Connecting to ResourceManager at host-9-5.test/127.0.0.254:8050
> 16/03/03 02:33:53 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=3
> 16/03/03 02:33:53 INFO distributedshell.Client: Got Cluster node info from ASM
> 16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host:25454, nodeAddresshost:8042, nodeRackName/default-rack, nodeNumContainers1
> 16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host-9-5.test:25454, nodeAddresshost-9-5.test:8042, nodeRackName/default-rack, nodeNumContainers0
> 16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host-9-1.test:25454, nodeAddresshost-9-1.test:8042, nodeRackName/default-rack, nodeNumContainers0
> 16/03/03 02:33:53 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.08336, queueMaxCapacity=1.0, queueApplicationCount=0, queueChildQueueCount=0
> 16/03/03 02:33:53 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
> 16/03/03 02:33:53 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
> 16/03/03 02:33:53 INFO distributedshell.Client: Max mem capabililty of resources in this cluster 10240
> 16/03/03 02:33:53 INFO distributedshell.Client: Max virtual cores capabililty of resources in this cluster 1
> 16/03/03 02:33:53 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
> 16/03/03 02:33:53 INFO distributedshell.Client: Set the environment for the application master
> 16/03/03 02:33:53 INFO distributedshell.Client: Setting up app master command
> 16/03/03 02:33:53 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx10m 
> {code}
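
For context, the command that fails above is the YARN smoke test that service_check.py drives through the resource_management helpers visible in the stack trace: kinit as the smoke user, then the distributed-shell client. The snippet below is a minimal illustrative sketch, not the attached patch; the principal, keytab, jar path, and retry values are placeholders, and adding tries/try_sleep to the call is only one plausible way to tolerate a transient failure like the one reported here.

{code}
# Illustrative sketch only (not the AMBARI-15389 patch). It mirrors the failing
# call from the traceback: kinit as the smoke user, then run the YARN
# distributed-shell client via Ambari's shell helper, which raises Fail on a
# non-zero exit code.
from resource_management.core.shell import checked_call

# Placeholder values; in the real service_check.py these come from the params module.
smokeuser = "ambari-qa"
smokeuser_principal = "ambari-qa@EXAMPLE.COM"
smokeuser_keytab = "/etc/security/keytabs/smokeuser.headless.keytab"
dshell_jar = ("/usr/hdp/current/hadoop-yarn-client/"
              "hadoop-yarn-applications-distributedshell.jar")

cmd = ("/usr/bin/kinit -kt {keytab} {principal}; "
       "yarn org.apache.hadoop.yarn.applications.distributedshell.Client "
       "-shell_command ls -num_containers 1 -jar {jar}").format(
           keytab=smokeuser_keytab,
           principal=smokeuser_principal,
           jar=dshell_jar)

# tries/try_sleep make checked_call retry a transient non-zero exit instead of
# failing the whole service check on the first attempt; whether retrying (or
# something else entirely) is the actual fix is up to the attached patch.
checked_call(cmd, user=smokeuser, tries=3, try_sleep=5)
{code}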

[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU

2016-03-31 Thread Antonenko Alexander (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antonenko Alexander updated AMBARI-15389:
-
Attachment: AMBARI-15389_2.2.patch

> Intermittent YARN service check failures during and post EU
> ---
>
> Key: AMBARI-15389
> URL: https://issues.apache.org/jira/browse/AMBARI-15389
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.2.2
>Reporter: Dmitry Lysnichenko
>Assignee: Antonenko Alexander
> Fix For: 2.2.2
>
> Attachments: AMBARI-15389.patch, AMBARI-15389_2.2.patch
>
>

[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU

2016-03-11 Thread Dmitry Lysnichenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lysnichenko updated AMBARI-15389:

Fix Version/s: 2.2.2

> Intermittent YARN service check failures during and post EU
> ---
>
> Key: AMBARI-15389
> URL: https://issues.apache.org/jira/browse/AMBARI-15389
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.2.2
>Reporter: Dmitry Lysnichenko
>Assignee: Dmitry Lysnichenko
> Fix For: 2.2.2
>
> Attachments: AMBARI-15389.patch
>
>

[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU

2016-03-11 Thread Dmitry Lysnichenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lysnichenko updated AMBARI-15389:

Attachment: AMBARI-15389.patch

> Intermittent YARN service check failures during and post EU
> ---
>
> Key: AMBARI-15389
> URL: https://issues.apache.org/jira/browse/AMBARI-15389
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Reporter: Dmitry Lysnichenko
>Assignee: Dmitry Lysnichenko
> Attachments: AMBARI-15389.patch
>
>

[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU

2016-03-11 Thread Dmitry Lysnichenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lysnichenko updated AMBARI-15389:

Status: Patch Available  (was: Open)

> Intermittent YARN service check failures during and post EU
> ---
>
> Key: AMBARI-15389
> URL: https://issues.apache.org/jira/browse/AMBARI-15389
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Reporter: Dmitry Lysnichenko
>Assignee: Dmitry Lysnichenko
> Attachments: AMBARI-15389.patch
>
>

[jira] [Updated] (AMBARI-15389) Intermittent YARN service check failures during and post EU

2016-03-11 Thread Dmitry Lysnichenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lysnichenko updated AMBARI-15389:

Component/s: ambari-server

> Intermittent YARN service check failures during and post EU
> ---
>
> Key: AMBARI-15389
> URL: https://issues.apache.org/jira/browse/AMBARI-15389
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Reporter: Dmitry Lysnichenko
>Assignee: Dmitry Lysnichenko
> Attachments: AMBARI-15389.patch
>
>