[ https://issues.apache.org/jira/browse/OOZIE-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Satish Subhashrao Saley reopened OOZIE-2787:
--------------------------------------------

Reopening as there is a regression.

{code}
pi.py is under oozie.wf.application.path and the workflow configuration is:
<name>pyspark example</name>
<jar>pi.py</jar>
<spark-opts>${testConf}</spark-opts>
<file>pi.py#pi-renamed.py</file>
{code}

With the change we added in this jira, Oozie runs the Spark job with the following params:

{code}
--master yarn-cluster
--name pyspark example
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
--conf spark.ui.view.acls=*
--queue default
--conf spark.executor.extraClassPath=$PWD/*
--conf spark.driver.extraClassPath=$PWD/*
--conf spark.yarn.security.tokens.hive.enabled=false
--conf spark.yarn.security.tokens.hbase.enabled=false
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
--properties-file spark-defaults.conf
--files <<list of files excluding pi.py>>
--conf spark.yarn.jar=hdfs://localhost/share/spark/lib/spark-assembly.jar
--verbose
hdfs://localhost/user/saley/examples/apps/spark-yarn-cluster/pi.py#pi-renamed.py
10
{code}

The job fails saying:

{code}
2017-02-07 21:59:24,847 [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - User application exited with status 2
2017-02-07 21:59:24,849 [Driver] INFO org.apache.spark.deploy.yarn.ApplicationMaster - Final app status: FAILED, exitCode: 2, (reason: User application exited with status 2)
python: can't open file 'pi.py#pi-renamed.py': [Errno 2] No such file or directory
{code}

Spark does not understand the {{#}} sign, so we need to pass the direct path for the file. At the same time, we also need to make sure that the application jar does not get distributed twice.

Solution: pass the direct path for the application jar/py file if there is a {{#}} sign (fragment) in the path. We can do so because the file is already available in the launcher's local directory, i.e. the current directory.
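As a rough illustration only (the actual fix lives in Oozie's Java launcher code; the function names below are hypothetical, and we assume the launcher localizes a {{file#alias}} entry under its alias name in the current directory), the fragment handling and de-duplication described above could look like:

{code}
def base_name(path):
    # Strip any '#fragment' alias, then take the file name:
    # 'hdfs://host/a/pi.py#pi-renamed.py' -> 'pi.py'
    return path.split('#', 1)[0].rsplit('/', 1)[-1]

def resolve_app_path(app_path):
    # Spark cannot open 'pi.py#pi-renamed.py', so when the application
    # path carries a fragment, hand spark-submit the local alias name
    # instead; the launcher has already localized the file under it.
    if '#' in app_path:
        return app_path.split('#', 1)[1]
    return app_path

def filter_spark_files(files, app_path):
    # Remove the application jar/py file from the --files list so it
    # is not distributed twice (Spark 2.1.0 rejects duplicate uploads).
    # A real implementation needs extra care not to drop PySpark
    # dependencies that merely share a name pattern.
    app = base_name(app_path)
    return [f for f in files if base_name(f) != app]
{code}

For the workflow above, {{resolve_app_path}} would turn the HDFS path ending in {{pi.py#pi-renamed.py}} into the local name {{pi-renamed.py}}, which python can open.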
Also, at the same time, remove the application jar from the *--files* option. While doing so, we need extra checks for PySpark dependencies, otherwise those will get distributed multiple times. The amended patch will also distribute the files mentioned in <file> that contain {{#}}.

> Oozie distributes application jar twice making the spark job fail
> -----------------------------------------------------------------
>
>                 Key: OOZIE-2787
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2787
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>         Attachments: OOZIE-2787-1.patch, OOZIE-2787-2.patch, OOZIE-2787-3.patch, OOZIE-2787-4.patch, OOZIE-2787-5.patch
>
> Oozie adds the application jar to the list of files to be uploaded to the distributed cache. Since the jar gets added twice, the job fails. This is observed from Spark 2.1.0, which introduces a check for duplicate files and fails the job.
> {code}
> --master
> yarn
> --deploy-mode
> cluster
> --name
> oozieSparkStarter
> --class
> ScalaWordCount
> --queue
> default
> --conf
> spark.executor.extraClassPath=$PWD/*
> --conf
> spark.driver.extraClassPath=$PWD/*
> --conf
> spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
> --conf
> spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
> --conf
> spark.yarn.security.tokens.hive.enabled=false
> --conf
> spark.yarn.security.tokens.hbase.enabled=false
> --files
> hdfs://mycluster.com/user/saley/oozie/apps/sparkapp/lib/spark-example.jar
> --properties-file
> spark-defaults.conf
> --verbose
> spark-example.jar
> samplefile.txt
> output
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)