Alex Savitsky created LIVY-447:
----------------------------------

             Summary: Batch session appears "dead" when launched against YARN 
cluster, but the job completes fine.
                 Key: LIVY-447
                 URL: https://issues.apache.org/jira/browse/LIVY-447
             Project: Livy
          Issue Type: Bug
          Components: Batch
    Affects Versions: 0.5.0
            Reporter: Alex Savitsky


The job was launched using the following request:

POST /batches

{code:json}
{
  "file": "/user/alsavits/oats/oats-spark-controls-exec.jar",
  "className": "com.rbc.rbccm.regops.controls.CompletenessControl",
  "args": ["ET", "20171229"]
}
{code}
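For anyone trying to reproduce this, the submission is a plain REST call; below is a minimal sketch using Python and the requests library (the Livy endpoint http://livy-host:8998 is an assumption, the payload is the one from the request above):

{code:python}
# Minimal reproduction sketch: submit the batch over Livy's REST API.
# The Livy endpoint below is an assumption; adjust host/port to your deployment.
import requests

LIVY_URL = "http://livy-host:8998"  # assumed Livy server address

payload = {
    "file": "/user/alsavits/oats/oats-spark-controls-exec.jar",
    "className": "com.rbc.rbccm.regops.controls.CompletenessControl",
    "args": ["ET", "20171229"],
}

resp = requests.post(
    LIVY_URL + "/batches",
    json=payload,  # serialises to JSON and sets Content-Type: application/json
)
resp.raise_for_status()
print(resp.json())  # returns the new batch id ("id": 2 in this case) and its initial state
{code}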

Relevant Livy config:

{{livy.spark.master = yarn}}

{{livy.spark.deploy-mode = cluster}}

Livy log:

{code}
18/03/05 12:03:39 INFO LineBufferedStream: stdout: 18/03/05 12:03:39 INFO Client: Requesting a new application from cluster with 8 NodeManagers
18/03/05 12:03:39 INFO LineBufferedStream: stdout: 18/03/05 12:03:39 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (174080 MB per container)
18/03/05 12:03:39 INFO LineBufferedStream: stdout: 18/03/05 12:03:39 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
18/03/05 12:03:39 INFO LineBufferedStream: stdout: 18/03/05 12:03:39 INFO Client: Setting up container launch context for our AM
18/03/05 12:03:39 INFO LineBufferedStream: stdout: 18/03/05 12:03:39 INFO Client: Setting up the launch environment for our AM container
18/03/05 12:03:39 INFO LineBufferedStream: stdout: 18/03/05 12:03:39 INFO Client: Preparing resources for our AM container
18/03/05 12:03:39 INFO LineBufferedStream: stdout: 18/03/05 12:03:39 INFO YarnSparkHadoopUtil: getting token for namenode: hdfs://guedlpahdp001.devfg.rbc.com:8020/user/DRLB0SRVCTRLRW/.sparkStaging/application_1520006903702_0110
18/03/05 12:03:39 INFO LineBufferedStream: stdout: 18/03/05 12:03:39 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 73717 for DRLB0SRVCTRLRW on 10.61.34.124:8020
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO metastore: Trying to connect to metastore with URI thrift://guedlpahdp002.devfg.rbc.com:9083
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO metastore: Connected to metastore.
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO YarnSparkHadoopUtil: HBase class not found java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO Client: Use hdfs cache file as spark.yarn.archive for HDP, hdfsCacheFile:hdfs:///hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO Client: Source and destination file systems are the same. Not copying hdfs:/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO Client: Source and destination file systems are the same. Not copying hdfs://guedlpahdp001.devfg.rbc.com:8020/user/alsavits/oats/oats-spark-controls-exec.jar
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO Client: Uploading resource file:/tmp/spark-91b652aa-704d-4986-9435-cbd369c62f71/__spark_conf__2946776456430567485.zip -> hdfs://guedlpahdp001.devfg.rbc.com:8020/user/DRLB0SRVCTRLRW/.sparkStaging/application_1520006903702_0110/__spark_conf__.zip
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO SecurityManager: Changing view acls to: drlb0ots,DRLB0SRVCTRLRW
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO SecurityManager: Changing modify acls to: drlb0ots,DRLB0SRVCTRLRW
18/03/05 12:03:41 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO SecurityManager: Changing view acls groups to:
18/03/05 12:03:42 INFO LineBufferedStream: stdout: 18/03/05 12:03:41 INFO SecurityManager: Changing modify acls groups to:
18/03/05 12:03:42 INFO LineBufferedStream: stdout: 18/03/05 12:03:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(drlb0ots, DRLB0SRVCTRLRW); groups with view permissions: Set(); users with modify permissions: Set(drlb0ots, DRLB0SRVCTRLRW); groups with modify permissions: Set()
18/03/05 12:03:42 INFO LineBufferedStream: stdout: 18/03/05 12:03:42 INFO Client: Submitting application application_1520006903702_0110 to ResourceManager
18/03/05 12:03:42 INFO LineBufferedStream: stdout: 18/03/05 12:03:42 INFO YarnClientImpl: Submitted application application_1520006903702_0110
18/03/05 12:03:42 INFO LineBufferedStream: stdout: 18/03/05 12:03:42 INFO Client: Application report for application_1520006903702_0110 (state: ACCEPTED)
18/03/05 12:03:42 INFO LineBufferedStream: stdout: 18/03/05 12:03:42 INFO Client:
18/03/05 12:03:42 INFO LineBufferedStream: stdout: client token: Token { kind: YARN_CLIENT_TOKEN, service: }
18/03/05 12:03:42 INFO LineBufferedStream: stdout: diagnostics: AM container is launched, waiting for AM container to Register with RM
18/03/05 12:03:42 INFO LineBufferedStream: stdout: ApplicationMaster host: N/A
18/03/05 12:03:42 INFO LineBufferedStream: stdout: ApplicationMaster RPC port: -1
18/03/05 12:03:42 INFO LineBufferedStream: stdout: queue: default
18/03/05 12:03:42 INFO LineBufferedStream: stdout: start time: 1520269422040
18/03/05 12:03:42 INFO LineBufferedStream: stdout: final status: UNDEFINED
18/03/05 12:03:42 INFO LineBufferedStream: stdout: tracking URL: http://guedlpahdp001.devfg.rbc.com:8088/proxy/application_1520006903702_0110/
18/03/05 12:03:42 INFO LineBufferedStream: stdout: user: DRLB0SRVCTRLRW
18/03/05 12:03:42 INFO LineBufferedStream: stdout: 18/03/05 12:03:42 INFO ShutdownHookManager: Shutdown hook called
18/03/05 12:03:42 INFO LineBufferedStream: stdout: 18/03/05 12:03:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-91b652aa-704d-4986-9435-cbd369c62f71
{code}

Later, when the created batch is queried:

GET /batches/2

{code}
{
  "id": 2,
  "state": "dead",
  "appId": null,
  "appInfo": {
    "driverLogUrl": null,
    "sparkUiUrl": null
  },
  "log": [
    "stdout: ",
    "ls: cannot access /usr/hdp/2.6.3.0-235/hadoop/lib: No such file or directory",
    "\nstderr: ",
    "\nYARN Diagnostics: ",
    "java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider not found",
    "org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2227)
     org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:161)
     org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:94)
     org.apache.hadoop.yarn.client.ClientRMProxy.createRMProxy(ClientRMProxy.java:72)
     org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:187)
     org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     org.apache.livy.utils.SparkYarnApp$.yarnClient$lzycompute(SparkYarnApp.scala:51)
     org.apache.livy.utils.SparkYarnApp$.yarnClient(SparkYarnApp.scala:48)
     org.apache.livy.utils.SparkYarnApp$.$lessinit$greater$default$6(SparkYarnApp.scala:119)
     org.apache.livy.utils.SparkApp$$anonfun$create$1.apply(SparkApp.scala:91)
     org.apache.livy.utils.SparkApp$$anonfun$create$1.apply(SparkApp.scala:91)
     org.apache.livy.utils.SparkYarnApp.org$apache$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:175)
     org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:239)
     org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:236)
     scala.Option.getOrElse(Option.scala:121)
     org.apache.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:236)
     org.apache.livy.Utils$$anon$1.run(Utils.scala:94)"
  ]
}
{code}

(The stack-trace string in the "log" array is wrapped onto separate lines here for readability; it is returned as a single string.)
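The "dead" state is reproducible simply by polling the batch; a minimal sketch (same assumed Livy endpoint as above, batch id 2 as in the response):

{code:python}
# Poll GET /batches/{id} until the batch leaves a transient state.
# The Livy endpoint is an assumption; batch id 2 matches the response above.
import time
import requests

LIVY_URL = "http://livy-host:8998"  # assumed Livy server address

def wait_for_batch(batch_id, interval_s=10):
    while True:
        session = requests.get("{}/batches/{}".format(LIVY_URL, batch_id)).json()
        print(session["state"], session.get("appId"))
        if session["state"] not in ("starting", "running"):
            return session
        time.sleep(interval_s)

final = wait_for_batch(2)
# Observed here: the state goes straight to "dead" with "appId": null, even
# though the corresponding YARN application is still running.
{code}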

However, the tracking URL (shown in the Livy log above) reports the application status as RUNNING, and the application goes on to complete successfully within 2-3 minutes.
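The discrepancy is easy to confirm against the ResourceManager directly; a sketch using the standard YARN RM REST API (the RM address is taken from the tracking URL in the Livy log above; the /ws/v1/cluster/apps endpoint is YARN's own API, not anything Livy-specific):

{code:python}
# Cross-check the application state on YARN itself via the RM REST API.
import requests

RM_URL = "http://guedlpahdp001.devfg.rbc.com:8088"  # from the tracking URL above
APP_ID = "application_1520006903702_0110"           # application id from the Livy log

app = requests.get("{}/ws/v1/cluster/apps/{}".format(RM_URL, APP_ID)).json()["app"]
print(app["state"], app["finalStatus"])
# Observed here: RUNNING at first, then FINISHED/SUCCEEDED a few minutes later,
# while Livy keeps reporting the batch as "dead" with a null appId.
{code}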


