[jira] [Comment Edited] (OOZIE-2787) Oozie distributes application jar twice making the spark job fail

2017-02-03 Thread Andras Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851984#comment-15851984
 ] 

Andras Piros edited comment on OOZIE-2787 at 2/3/17 7:21 PM:
-

[~satishsaley] thanks for the patch!

Some observations:
* please add a test case to {{TestSparkMain}} or elsewhere
* please rename the new Maven profile to {{spark-2.1-kafka-1.6.2}} to give a better idea of what's in there
* I'd extract {{filterJars()}} into a nested class for better testability and SRP, e.g. {{JarURIFilter}} (see the sketch below). In that case you could pass all the necessary parameters via the constructor, and have a {{toString()}} method that calls {{StringUtils.join()}}
* it's OK with me if all the JAR files in the current directory are filtered, assuming they are all application JARs. What about other packages like {{.py}} and {{.zip}} files? It might be worth having unit tests for those as well
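
A rough sketch of the kind of nested class I have in mind, shown as a top-level class so it compiles standalone; the filtering criterion, field names, and the comma delimiter are my own assumptions (assuming commons-lang's {{StringUtils}}), not the current patch:
{code}
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;

/**
 * Illustrative sketch of a JarURIFilter: keeps the --files URIs that do
 * not point at the application JAR, so the JAR is not distributed twice.
 */
final class JarURIFilter {
    private final List<URI> uris;
    private final String applicationJarName;

    JarURIFilter(final List<URI> uris, final String applicationJarName) {
        this.uris = uris;
        this.applicationJarName = applicationJarName;
    }

    /** @return the URIs that do not point at the application JAR */
    List<URI> filter() {
        final List<URI> filtered = new ArrayList<URI>();
        for (final URI uri : uris) {
            if (uri.getPath() == null || !uri.getPath().endsWith(applicationJarName)) {
                filtered.add(uri);
            }
        }
        return filtered;
    }

    /** Joins the kept URIs into a single comma-separated value for --files. */
    @Override
    public String toString() {
        return StringUtils.join(filter(), ",");
    }
}
{code}
A nested class like this would also be easy to unit test in isolation, e.g. asserting that the application JAR is dropped while other {{--files}} entries, including {{.py}} and {{.zip}} files, are kept.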


> Oozie distributes application jar twice making the spark job fail
> -
>
> Key: OOZIE-2787
> URL: https://issues.apache.org/jira/browse/OOZIE-2787
> Project: Oozie
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: OOZIE-2787-1.patch
>
>
> Oozie adds the application jar to the list of files to be uploaded to the 
> distributed cache. Since the jar gets added twice, the job fails. This is 
> observed with Spark 2.1.0, which introduces a check for duplicate files and 
> fails the job.
> {code}
> --master
> yarn
> --deploy-mode
> cluster
> --name
> oozieSparkStarter
> --class
> ScalaWordCount
> --queue 
> default
> --conf
> spark.executor.extraClassPath=$PWD/*
> --conf
> spark.driver.extraClassPath=$PWD/*
> --conf
> spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
> --conf
> spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
> --conf
> spark.yarn.security.tokens.hive.enabled=false
> --conf
> spark.yarn.security.tokens.hbase.enabled=false
> --files
> hdfs://mycluster.com/user/saley/oozie/apps/sparkapp/lib/spark-example.jar
> --properties-file
> spark-defaults.conf
> --verbose
> spark-example.jar
> samplefile.txt
> output
> {code}





[jira] [Comment Edited] (OOZIE-2787) Oozie distributes application jar twice making the spark job fail

2017-02-08 Thread Satish Subhashrao Saley (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858502#comment-15858502
 ] 

Satish Subhashrao Saley edited comment on OOZIE-2787 at 2/8/17 10:40 PM:
-

Reopening as there is a regression.

{{pi.py}} is under {{oozie.wf.application.path}} and the workflow configuration is:
{code}
pyspark example
pi.py
${testConf}
pi.py#pi-renamed.py
{code}

With the change we added in this jira, the Spark job is run with the following parameters:

{code}
--master
yarn-cluster
--name
pyspark example
--conf
spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
--conf
spark.ui.view.acls=*
--queue
default
--conf
spark.executor.extraClassPath=$PWD/*
--conf
spark.driver.extraClassPath=$PWD/*
--conf
spark.yarn.security.tokens.hive.enabled=false
--conf
spark.yarn.security.tokens.hbase.enabled=false
--conf
spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
--properties-file
spark-defaults.conf
--files
<>
--conf
spark.yarn.jar=hdfs://localhost/share/spark/lib/spark-assembly.jar
--verbose
hdfs://localhost/user/saley/examples/apps/spark-yarn-cluster/pi.py#pi-renamed.py
10
{code} 

The job fails with:
{code}
2017-02-07 21:59:24,847 [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster  - User application exited with status 2
2017-02-07 21:59:24,849 [Driver] INFO  org.apache.spark.deploy.yarn.ApplicationMaster  - Final app status: FAILED, exitCode: 2, (reason: User application exited with status 2)
python: can't open file 'pi.py#pi-renamed.py': [Errno 2] No such file or directory
{code}
Spark does not understand the {{#}} sign. Therefore, we need to pass the direct path of the file. At the same time, we also need to make sure that the application jar does not get distributed twice.

Solution - pass the direct path of the application jar/py file if there is a {{#}} sign (fragment) in the path. We can do so because the file is already available in the launcher's local directory, i.e. the current directory. At the same time, remove the application jar from the *--files* option.
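
A minimal illustration of the idea, assuming standard {{java.net.URI}} fragment handling; the class and method names here are a hypothetical sketch, not the actual patch:
{code}
import java.net.URI;

/**
 * Hypothetical sketch: if the application path carries a fragment
 * (e.g. pi.py#pi-renamed.py), the launcher has already localized the file
 * under the fragment name in its current working directory, so that local
 * name can be handed to Spark directly instead of the hdfs:// URI with '#'.
 */
public final class AppPathResolver {

    static String resolveAppPath(final String appPath) {
        final String fragment = URI.create(appPath).getFragment();
        if (fragment != null) {
            // The file already sits in the launcher's current directory under
            // this name; using it avoids shipping the jar/py file a second time.
            return fragment;
        }
        return appPath;
    }

    public static void main(final String[] args) {
        // Prints "pi-renamed.py"
        System.out.println(resolveAppPath(
                "hdfs://localhost/user/saley/examples/apps/spark-yarn-cluster/pi.py#pi-renamed.py"));
    }
}
{code}
The same fragment check could also be used to drop the matching entry from the {{--files}} list, so the application file is not distributed twice.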


was (Author: satishsaley):
Reopening as there is a regression.

{{pi.py}} is under {{oozie.wf.application.path}} and the workflow configuration is:
{code}
pyspark example
pi.py
${testConf}
pi.py#pi-renamed.py
{code}

With the change we added in this jira, the Spark job is run with the following parameters:

{code}
--master
yarn-cluster
--name
pyspark example
--conf
spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
--conf
spark.ui.view.acls=*
--queue
default
--conf
spark.executor.extraClassPath=$PWD/*
--conf
spark.driver.extraClassPath=$PWD/*
--conf
spark.yarn.security.tokens.hive.enabled=false
--conf
spark.yarn.security.tokens.hbase.enabled=false
--conf
spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
--properties-file
spark-defaults.conf
--files
<>
--conf
spark.yarn.jar=hdfs://localhost/share/spark/lib/spark-assembly.jar
--verbose
hdfs://localhost/user/saley/examples/apps/spark-yarn-cluster/pi.py#pi-renamed.py
10
{code} 

The job fails with:
{code}
2017-02-07 21:59:24,847 [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster  - User application exited with status 2
2017-02-07 21:59:24,849 [Driver] INFO  org.apache.spark.deploy.yarn.ApplicationMaster  - Final app status: FAILED, exitCode: 2, (reason: User application exited with status 2)
python: can't open file 'pi.py#pi-renamed.py': [Errno 2] No such file or directory
{code}
Spark does not understand the {{#}} sign. Therefore, we need to pass the direct path of the file. At the same time, we also need to make sure that the application jar does not get distributed twice.

Solution - pass the direct path of the application jar/py file if there is a {{#}} sign (fragment) in the path. We can do so because the file is already available in the launcher's local directory, i.e. the current directory. At the same time, remove the application jar from the *--files* option. While doing so, we need extra checks for the PySpark dependencies, otherwise those will get distributed multiple times. The amended patch will also distribute the files mentioned in  and having {{#}}.

> Oozie distributes application jar twice making the spark job fail
> -
>
> Key: OOZIE-2787
> URL: https://issues.apache.org/jira/browse/OOZIE-2787
> Project: Oozie
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: OOZIE-2787-1.patch, OOZIE-2787-2.patch, 
> OOZIE-2787-3.patch, OOZIE-2787-4.patch, OOZIE-2787-5.patch
>
>
> Oozie adds the application jar to the list of files to be uploaded to the 
> distributed cache. Since the jar gets added twice, the job fails. This is 
> observed with Spark 2.1.0, which introduces a check for duplicate files and 
> fails the job.