[jira] [Commented] (AIRFLOW-3647) Contributed SparkSubmitOperator doesn't honor --archives configuration
[ https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784932#comment-16784932 ]

ASF subversion and git services commented on AIRFLOW-3647:
----------------------------------------------------------

Commit 9c11ac92fdda0e8ba623cb0be543ff5058cd9ce2 in airflow's branch refs/heads/v1-10-stable from Penumbra69
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=9c11ac9 ]

[AIRFLOW-3647] Add archives config option to SparkSubmitOperator (#4467)

To enable the Spark behavior of transporting and extracting an archive on job launch - making the _contents_ of the archive available to the driver as well as the workers, not just the jar or the archive as a zip file - this configuration attribute is necessary. It is required when you have no ability to modify the Python environment on the worker/driver nodes but wish to use versions, modules, or features that are not installed. We transport a full Python 3.5 environment to our CDH cluster using this option, with the alias "#PYTHON" paired with an additional Spark configuration setting that points at it:

--archives "hdfs:///user/myuser/my_python_env.zip#PYTHON"
--conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3"

> Contributed SparkSubmitOperator doesn't honor --archives configuration
> ----------------------------------------------------------------------
>
>                 Key: AIRFLOW-3647
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3647
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: contrib
>    Affects Versions: 1.10.1
>         Environment: Linux (RHEL 7)
>                      Python 3.5 (using a virtual environment)
>                      spark-2.1.3-bin-hadoop26
>                      Airflow 1.10.1
>                      CDH 5.14 Hadoop [Yarn] cluster (no end user / dev modifications allowed)
>            Reporter: Ken Melms
>            Priority: Minor
>              Labels: easyfix, newbie
>             Fix For: 1.10.3
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The contributed SparkSubmitOperator has no way to honor the spark-submit configuration field "--archives", which is treated subtly differently from "--files" or "--py-files": Spark unzips the archive into the application's working directory and can optionally attach an alias to the unzipped folder so that you can refer to it elsewhere in your submission.
> E.g.:
> spark-submit --archives=hdfs:user/someone/python35_venv.zip#PYTHON --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" run_me.py
> In our case this behavior allows multiple Python virtual environments to be sourced from HDFS without incurring the penalty of pushing the whole virtual env to the cluster on each submission. For us, it solves running Python-based Spark jobs on a cluster where the end user has no ability to define the Python modules in use.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
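The way an `--archives` value (with its optional `#ALIAS` suffix) slots into the spark-submit command line can be sketched as follows. This is an illustrative helper, not the operator's actual hook code; the function name is hypothetical, and the path and alias are taken from the example above:

```python
def build_spark_submit_cmd(application, archives=None, conf=None):
    """Assemble a spark-submit command line, showing how an --archives
    value (optionally carrying a '#alias' suffix) is passed through."""
    cmd = ["spark-submit"]
    if archives:
        # Comma-separated archives; each may carry a '#alias' suffix naming
        # the directory YARN extracts it into in each container's working dir.
        cmd += ["--archives", archives]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(application)  # the application itself comes last
    return cmd

cmd = build_spark_submit_cmd(
    "run_me.py",
    archives="hdfs:///user/myuser/my_python_env.zip#PYTHON",
    conf={"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./PYTHON/python35/bin/python3"},
)
```

Note that the `./PYTHON/...` interpreter path in the conf entry only resolves because the `#PYTHON` alias makes the extracted archive appear under that relative directory at runtime.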
[jira] [Commented] (AIRFLOW-3647) Contributed SparkSubmitOperator doesn't honor --archives configuration
[ https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736460#comment-16736460 ]

Ken Melms commented on AIRFLOW-3647:
------------------------------------

I have the code for this ready to go - I just needed an issue to tie the PR to.
[jira] [Commented] (AIRFLOW-3647) Contributed SparkSubmitOperator doesn't honor --archives configuration
[ https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762587#comment-16762587 ]

ASF GitHub Bot commented on AIRFLOW-3647:
-----------------------------------------

ashb commented on pull request #4467: [AIRFLOW-3647] Add archives config option to SparkSubmitOperator
URL: https://github.com/apache/airflow/pull/4467
[jira] [Commented] (AIRFLOW-3647) Contributed SparkSubmitOperator doesn't honor --archives configuration
[ https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762588#comment-16762588 ]

ASF subversion and git services commented on AIRFLOW-3647:
----------------------------------------------------------

Commit 13c63ffad05817bf4ed6ef948dc9672c26f8ffb6 in airflow's branch refs/heads/master from Penumbra69
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=13c63ff ]

[AIRFLOW-3647] Add archives config option to SparkSubmitOperator (#4467)
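The "#PYTHON" alias behavior the thread relies on - YARN extracting the archive into a directory named after the alias inside each container's working directory, so relative paths like ./PYTHON/... resolve - can be imitated locally. A minimal sketch, using a throwaway zip with a placeholder file rather than a real virtualenv on HDFS:

```python
import os
import tempfile
import zipfile

# Stand-in for a YARN container's working directory.
workdir = tempfile.mkdtemp()

# Build a tiny archive in place of hdfs:///user/myuser/my_python_env.zip;
# the single empty entry is a placeholder for the packaged interpreter.
archive_path = os.path.join(workdir, "my_python_env.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("python35/bin/python3", "")

# 'my_python_env.zip#PYTHON' means: extract the archive's contents into a
# directory named after the alias, next to the running application.
alias = "PYTHON"
with zipfile.ZipFile(archive_path) as zf:
    zf.extractall(os.path.join(workdir, alias))

# This is the path PYSPARK_PYTHON points at in the example above.
interpreter = os.path.join(workdir, alias, "python35", "bin", "python3")
```

On a real cluster the extraction is done by YARN's resource localization, not by user code; the sketch only shows why the alias makes the archive's contents (not the zip file itself) addressable by relative path.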