Hi Abhishek,

From your spark-submit command it seems you're passing the file path as a parameter to the driver program, so what happens next depends on what exactly your code does with that parameter. With the --files option the file is shipped to every worker node, but if your code references it by the original path while running in distributed mode, it won't find the file on the worker nodes.
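For example, here is an untested sketch of how you could resolve the shipped copy on each node. It assumes you submit with --files and pass only the base file name ("abc.drl") as the program argument; SparkFiles.get (a standard Spark API) returns the local path of a file distributed via --files, and the resulting File can then be handed to Drools. Note that in your stack trace the FileNotFoundException comes from java.io.FileInputStream being given the hdfs:/ URL as if it were a local path, which the local file API cannot open.

import java.io.File
import org.apache.spark.SparkFiles

// DrlLocator is just a hypothetical helper name for illustration.
object DrlLocator {
  // Resolve the local copy of a file shipped with --files.
  // Pass only the base name (e.g. "abc.drl"), not a full path.
  def localDrlFile(fileName: String): File =
    new File(SparkFiles.get(fileName))
}

The submit command would then look something like:

spark-submit --master yarn --deploy-mode cluster --files /home/abhietc/abc/abc.drl --class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl

and inside main you would call DrlLocator.localDrlFile(args(0)) wherever you currently build the FileResource for Drools.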

If you can share a snippet of the code, it will be easier to debug.

On Friday 23 September 2016 01:03 PM, ABHISHEK wrote:
Hello there,

I have a Spark application that refers to an external file 'abc.drl' containing unstructured data. The application is able to find this reference file if I run the app in local mode, but in YARN cluster mode it is not able to find the file at the specified path. I tried both local and HDFS paths with the --files option, but it didn't work.


What is working?
1. The current Spark application runs fine if I run it in local mode, as shown below.
In the command below, the file path is a local path, not HDFS.
spark-submit --master local[*] --class "com.abc.StartMain" abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl

2. I want to run this Spark application on YARN in cluster mode.
For that, I used the commands below, but the application is not able to find the reference file abc.drl at the specified path. I tried both local and HDFS paths, but neither worked.

spark-submit --master yarn --deploy-mode cluster --files /home/abhietc/abc/abc.drl --class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl

spark-submit --master yarn --deploy-mode cluster --files hdfs://abhietc.com:8020/user/abhietc/abc.drl --class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar hdfs://abhietc.com:8020/user/abhietc/abc.drl

spark-submit --master yarn --deploy-mode cluster --files hdfs://abc.com:8020/tmp/abc.drl --class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar hdfs://abc.com:8020/tmp/abc.drl


Error Messages:
Surprisingly, we are not performing any write operation on the reference file, yet the log shows the application trying to write to the file instead of reading it.
The log also shows a FileNotFoundException for both the HDFS and local paths.
-------------
16/09/20 14:49:50 ERROR scheduler.JobScheduler: Error running job streaming job 1474363176000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, abc.com): java.lang.RuntimeException: Unable to write Resource: FileResource[file=hdfs:/abc.com:8020/user/abhietc/abc.drl]
        at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:71)
        at com.hmrc.taxcalculator.KieSessionFactory$.getNewSession(KieSessionFactory.scala:49)
        at com.hmrc.taxcalculator.KieSessionFactory$.getKieSession(KieSessionFactory.scala:21)
        at com.hmrc.taxcalculator.KieSessionFactory$.execute(KieSessionFactory.scala:27)
        at com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
        at com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/abhietc/abc.drl (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at org.drools.core.io.impl.FileSystemResource.getInputStream(FileSystemResource.java:123)
        at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:58)
        ... 19 more
--------------
Cheers,
Abhishek



