[ https://issues.apache.org/jira/browse/OOZIE-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Satish Subhashrao Saley reopened OOZIE-2787:
--------------------------------------------

Reopening as there is a regression.

pi.py is under oozie.wf.application.path and the workflow configuration is -

{code}
<name>pyspark example</name>
<jar>pi.py</jar>
<spark-opts>${testConf}</spark-opts>
<file>pi.py#pi-renamed.py</file>
{code}

With the change we added in this jira, it runs the Spark job with the 
following params:

{code}
--master
yarn-cluster
--name
pyspark example
--conf
spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
--conf
spark.ui.view.acls=*
--queue
default
--conf
spark.executor.extraClassPath=$PWD/*
--conf
spark.driver.extraClassPath=$PWD/*
--conf
spark.yarn.security.tokens.hive.enabled=false
--conf
spark.yarn.security.tokens.hbase.enabled=false
--conf
spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
--properties-file
spark-defaults.conf
--files
<<list of files excluding pi.py>>
--conf
spark.yarn.jar=hdfs://localhost/share/spark/lib/spark-assembly.jar
--verbose
hdfs://localhost/user/saley/examples/apps/spark-yarn-cluster/pi.py#pi-renamed.py
10
{code} 

The job fails saying - 
{code}
2017-02-07 21:59:24,847 [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster  - User application exited with status 2
2017-02-07 21:59:24,849 [Driver] INFO  org.apache.spark.deploy.yarn.ApplicationMaster  - Final app status: FAILED, exitCode: 2, (reason: User application exited with status 2)
python: can't open file 'pi.py#pi-renamed.py': [Errno 2] No such file or directory
{code}
Spark does not understand the {{#}} sign, so we need to pass the direct path 
for the file. At the same time, we must make sure the application jar does not 
get distributed twice.
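The {{#}} fragment is distributed-cache symlink syntax: Hadoop localizes the 
part before the {{#}} and names the symlink after the fragment, but the python 
launcher receives the whole string literally. A minimal sketch of splitting 
the two with plain {{java.net.URI}} (nothing Oozie-specific, just for 
illustration):

{code}
import java.net.URI;

public class FragmentDemo {
    public static void main(String[] args) throws Exception {
        String appPy = "hdfs://localhost/user/saley/examples/apps/spark-yarn-cluster/pi.py#pi-renamed.py";
        URI uri = new URI(appPy);
        // the real file that gets downloaded to the container
        System.out.println(appPy.split("#")[0]);
        // the symlink name created in the container's working directory
        System.out.println(uri.getFragment());   // pi-renamed.py
    }
}
{code}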

Solution - Mention the direct path for the application jar/py file if there is 
a {{#}} sign (fragment) in the path. We can do so because the file is already 
available in the launcher's local directory, i.e. the current directory. At 
the same time, remove the application jar from the *--files* option. While 
doing so, we need extra checks for the PySpark dependencies, otherwise those 
will get distributed multiple times. The amended patch will also distribute 
the files mentioned in <file> that contain a {{#}} sign.
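A rough sketch of that logic with hypothetical helper names (the real change 
is in Oozie's {{SparkMain}}; this only illustrates the idea):

{code}
import java.util.ArrayList;
import java.util.List;

public class AppJarDedupSketch {

    // If the app jar/py path carries a #fragment, the file has already been
    // localized under that name in the launcher's current directory, so we
    // can point Spark at the local symlink name directly.
    static String resolveAppResource(String appPath) {
        int hash = appPath.indexOf('#');
        return hash >= 0 ? appPath.substring(hash + 1) : appPath;
    }

    // Remove the application jar/py from --files so Spark does not distribute
    // it a second time. Real code needs extra checks so that the PySpark
    // dependencies (pyspark.zip, py4j-*.zip) still get distributed, exactly once.
    static String filterFiles(String commaSeparatedFiles, String appPath) {
        String appUri = appPath.split("#")[0];
        List<String> kept = new ArrayList<>();
        for (String f : commaSeparatedFiles.split(",")) {
            if (!f.split("#")[0].equals(appUri)) {
                kept.add(f);
            }
        }
        return String.join(",", kept);
    }

    public static void main(String[] args) {
        String app = "hdfs://localhost/user/saley/examples/apps/spark-yarn-cluster/pi.py#pi-renamed.py";
        String files = app + ",hdfs://localhost/share/spark/lib/py4j-0.9-src.zip";
        System.out.println(resolveAppResource(app)); // pi-renamed.py
        System.out.println(filterFiles(files, app)); // only the py4j zip remains
    }
}
{code}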

> Oozie distributes application jar twice making the spark job fail
> -----------------------------------------------------------------
>
>                 Key: OOZIE-2787
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2787
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>         Attachments: OOZIE-2787-1.patch, OOZIE-2787-2.patch, 
> OOZIE-2787-3.patch, OOZIE-2787-4.patch, OOZIE-2787-5.patch
>
>
> Oozie adds the application jar to the list of files to be uploaded to the 
> distributed cache, so the jar gets added twice and the job fails. This is 
> observed starting with Spark 2.1.0, which introduces a check for duplicate 
> files and fails the job.
> {code}
> --master
> yarn
> --deploy-mode
> cluster
> --name
> oozieSparkStarter
> --class
> ScalaWordCount
> --queue 
> default
> --conf
> spark.executor.extraClassPath=$PWD/*
> --conf
> spark.driver.extraClassPath=$PWD/*
> --conf
> spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
> --conf
> spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
> --conf
> spark.yarn.security.tokens.hive.enabled=false
> --conf
> spark.yarn.security.tokens.hbase.enabled=false
> --files
> hdfs://mycluster.com/user/saley/oozie/apps/sparkapp/lib/spark-example.jar
> --properties-file
> spark-defaults.conf
> --verbose
> spark-example.jar
> samplefile.txt
> output
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
