[ 
https://issues.apache.org/jira/browse/AMBARI-9717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Fernandez updated AMBARI-9717:
----------------------------------------
    Attachment:     (was: AMBARI-9717.patch)

> Kafka & Spark service checks fail intermittently on kerberized cluster
> ----------------------------------------------------------------------
>
>                 Key: AMBARI-9717
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9717
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.0.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 2.0.0
>
>
> Impact: Prevents RU from completing successfully
> Frequency: reproduces often
> I ran into this during an RU, following these steps:
> * Installed a 3-node cluster with ambari build #427
> * Installed HDP 2.2.2.0-2398 on centos 6
> * Added HDFS and ZK
> * Added Namenode HA
> * Added all services (including Spark and Ranger)
> * Kerberized the cluster (failed to start due to AMS service check)
> * Registered repo HDP 2.2.2.0-2399
> * Performed an RU
> stdout:
> {code}
> Running kafka create topic command
> 2015-02-18 03:29:51,851 - u'Execute[\'source /etc/kafka/conf/kafka-env.sh ; 
> /usr/hdp/current/kafka-broker//bin/kafka-topics.sh --zookeeper 
> c6403.ambari.apache.org:2181,c6401.ambari.apache.org:2181,c6402.ambari.apache.org:2181
>  --create --topic ambari_kafka_service_check --partitions 1 
> --replication-factor 1 | grep \'Created topic 
> "ambari_kafka_service_check".\\|Topic "ambari_kafka_service_check" already 
> exists.\'\']' {'logoutput': True}
> 2015-02-18 03:29:54,183 - Error while executing command 'service_check':
> Traceback (most recent call last):
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 208, in execute
>     method(env)
>   File 
> "/var/lib/ambari-agent/cache/common-services/KAFKA/0.8.1.2.2/package/scripts/service_check.py",
>  line 37, in service_check
>     logoutput=True,
>   File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 148, in __init__
>     self.env.run()
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 152, in run
>     self.run_action(resource, action)
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 118, in run_action
>     provider_action()
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py",
>  line 276, in action_run
>     raise ex
> Fail: Execution of 'source /etc/kafka/conf/kafka-env.sh ; 
> /usr/hdp/current/kafka-broker//bin/kafka-topics.sh --zookeeper 
> c6403.ambari.apache.org:2181,c6401.ambari.apache.org:2181,c6402.ambari.apache.org:2181
>  --create --topic ambari_kafka_service_check --partitions 1 
> --replication-factor 1 | grep 'Created topic 
> "ambari_kafka_service_check".\|Topic "ambari_kafka_service_check" already 
> exists.'' returned 1.
> {code}
> It turns out that the Kafka topic command can legitimately return a nonzero 
> exit code, so the service check should validate the command's output against 
> a regular expression instead of relying on the exit code.
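
A minimal sketch of the output validation described above, assuming the two messages in the grep expression from the log are the only success indicators. The helper name `topic_check_passed` is hypothetical and is not taken from the Ambari patch; capturing the command's output is left out.

```python
import re


def topic_check_passed(output, topic="ambari_kafka_service_check"):
    """Return True if the kafka-topics.sh output reports success.

    Success means the topic was created or already exists; the exit
    code of the command is deliberately not consulted.
    """
    name = re.escape(topic)
    # Both messages come from the grep expression shown in the log above.
    pattern = r'Created topic "%s"\.|Topic "%s" already exists\.' % (name, name)
    return re.search(pattern, output) is not None
```

With this approach the check inspects the captured text once, so a nonzero exit status from the pipeline does not fail the service check by itself.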
> For Spark, the service check fails with
> {code}
> 2015-02-20 01:25:28,782 - call['hdp-select status hadoop-client'] {'timeout': 
> 20}
> 2015-02-20 01:26:19,441 - Spark Job History Server not running.
> 2015-02-20 01:26:19,442 - Error while executing command 'service_check':
> Traceback (most recent call last):
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 208, in execute
>     method(env)
>   File 
> "/var/lib/ambari-agent/cache/common-services/SPARK/1.2.0.2.2/package/scripts/service_check.py",
>  line 61, in service_check
>     raise ComponentIsNotRunning()
> ComponentIsNotRunning
> {code}
> The failure occurs while running this command several times, because no 
> kinit has been performed first,
> {code}
> curl -s -o /dev/null -w'%{http_code}' --negotiate -u: -k 
> http://c6407.ambari.apache.org:18080
> {code}
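
A rough sketch of how a check could obtain a ticket first and then retry the probe, assuming Python 3 and that `kinit` and `curl` are on the PATH. All function names are hypothetical, the keytab and principal are supplied by the caller, and this is not the Ambari implementation. The curl flags are the ones from the command quoted above.

```python
import subprocess
import time


def kinit(keytab, principal):
    """Obtain a Kerberos ticket from a keytab; True if kinit exits with 0."""
    return subprocess.call(["kinit", "-kt", keytab, principal]) == 0


def curl_status(url):
    """Run the curl command from the report; return the HTTP status as an int."""
    result = subprocess.run(
        ["curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
         "--negotiate", "-u:", "-k", url],
        stdout=subprocess.PIPE,
    )
    text = result.stdout.decode().strip()
    # Treat empty or non-numeric output as "no response".
    return int(text) if text.isdigit() else 0


def wait_for_http_ok(probe, attempts=5, delay_seconds=5.0, sleep=time.sleep):
    """Call probe() until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # pause only between attempts
    return False
```

A caller would run `kinit(...)` once with its own keytab and principal, then pass `lambda: curl_status("http://c6407.ambari.apache.org:18080")` as the probe. Keeping the probe injectable makes the retry loop testable without a running History Server.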



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
