[ https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736460#comment-16736460 ]
Ken Melms commented on AIRFLOW-3647:
------------------------------------

I have the code for this ready to go - I just needed an issue to tie the PR to.

> Contributed SparkSubmitOperator doesn't honor --archives configuration
> ----------------------------------------------------------------------
>
>                 Key: AIRFLOW-3647
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3647
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: contrib
>    Affects Versions: 1.10.1
>         Environment: Linux (RHEL 7)
>                      Python 3.5 (using a virtual environment)
>                      spark-2.1.3-bin-hadoop2.6
>                      Airflow 1.10.1
>                      CDH 5.14 Hadoop [YARN] cluster (no end user / dev modifications allowed)
>            Reporter: Ken Melms
>            Priority: Minor
>              Labels: easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The contributed SparkSubmitOperator has no way to honor the spark-submit
> configuration field "--archives", which is treated subtly differently from
> "--files" or "--py-files": it unzips the archive into the application's
> working directory and can optionally alias the unzipped folder so that you
> can refer to it elsewhere in your submission.
> E.g.:
> spark-submit --archives=hdfs:///user/someone/python35_venv.zip#PYTHON \
>   --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" \
>   run_me.py
> In our case, this behavior allows multiple Python virtual environments to be
> sourced from HDFS without incurring the penalty of pushing the whole virtual
> env to the cluster on each submission. This solves (for us) running
> Python-based Spark jobs on a cluster where the end user has no ability to
> define the Python modules in use.
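
For reference, a minimal sketch of how a DAG could use the proposed support, assuming the patch adds an `archives` parameter to SparkSubmitOperator that mirrors the existing `files` / `py_files` parameters. The paths, connection ID, and DAG name below are placeholders, not part of the patch:

    # Sketch only: assumes an `archives` kwarg is added by this issue's PR.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

    with DAG(
        dag_id="spark_archives_example",
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
    ) as dag:
        submit = SparkSubmitOperator(
            task_id="run_me",
            application="run_me.py",     # PySpark entry point
            conn_id="spark_default",     # spark-submit connection (placeholder)
            # Unpacked into each container's working dir; '#PYTHON' aliases
            # the extracted folder, matching spark-submit's --archives syntax
            archives="hdfs:///user/someone/python35_venv.zip#PYTHON",
            conf={
                # Point PySpark at the interpreter inside the unpacked archive
                "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./PYTHON/python35/bin/python3",
            },
        )

This mirrors the spark-submit invocation quoted in the issue description: the operator would pass the archives value through to --archives, so the YARN-distributed virtualenv workflow works without a custom wrapper.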