[ https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736460#comment-16736460 ]
Ken Melms commented on AIRFLOW-3647:
------------------------------------

I have the code for this ready to go - I just needed an issue to tie the PR to.

> Contributed SparkSubmitOperator doesn't honor --archives configuration
> ----------------------------------------------------------------------
>
>                 Key: AIRFLOW-3647
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3647
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: contrib
>    Affects Versions: 1.10.1
>         Environment: Linux (RHEL 7)
>                      Python 3.5 (using a virtual environment)
>                      spark-2.1.3-bin-hadoop2.6
>                      Airflow 1.10.1
>                      CDH 5.14 Hadoop [YARN] cluster (no end user / dev modifications allowed)
>            Reporter: Ken Melms
>            Priority: Minor
>              Labels: easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The contributed SparkSubmitOperator has no way to honor the spark-submit
> configuration field "--archives", which is treated subtly differently from
> "--files" or "--py-files": it unzips the archive into the application's
> working directory and can optionally alias the unzipped folder so that you
> can refer to it elsewhere in your submission.
> E.g.:
> spark-submit --archives=hdfs:///user/someone/python35_venv.zip#PYTHON \
>   --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" \
>   run_me.py
> In our case, this behavior allows multiple Python virtual environments to be
> sourced from HDFS without incurring the penalty of pushing the whole virtual
> env to the cluster on each submission. This solves (for us) running
> Python-based Spark jobs on a cluster where the end user has no ability to
> define the Python modules in use.
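
For reference, a minimal sketch of how a DAG could use the proposed support, assuming the patch adds an `archives` parameter to SparkSubmitOperator that mirrors the existing `files` / `py_files` parameters. The paths, connection ID, and DAG name below are placeholders, not part of the patch:

    # Sketch only: assumes an `archives` kwarg is added by this issue's PR.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

    with DAG(
        dag_id="spark_archives_example",
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
    ) as dag:
        submit = SparkSubmitOperator(
            task_id="run_me",
            application="run_me.py",     # PySpark entry point
            conn_id="spark_default",     # spark-submit connection (placeholder)
            # Unpacked into each container's working dir; '#PYTHON' aliases
            # the extracted folder, matching spark-submit's --archives syntax
            archives="hdfs:///user/someone/python35_venv.zip#PYTHON",
            conf={
                # Point PySpark at the interpreter inside the unpacked archive
                "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./PYTHON/python35/bin/python3",
            },
        )

This mirrors the spark-submit invocation quoted in the issue description: the operator would pass the archives value through to --archives, so the YARN-distributed virtualenv workflow works without a custom wrapper.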