[ https://issues.apache.org/jira/browse/SPARK-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744731#comment-14744731 ]
Justin Uang commented on SPARK-9313:
------------------------------------

This would be hugely helpful. I'm working on a platform that allows users to execute python code and also specify a requirements.txt. The problem is that we might have multiple users' code running with different requirements.txt files, so we were thinking of having our service create a docker image and then send it out to all the worker nodes before running their code.

In addition, we will need something like this because --py-files is not sufficient. It doesn't work with any python library that has native extensions, like numpy, scipy, and almost all scientific libraries, because the OS cannot load native libraries directly from the zip. Instead, we need something that installs the libraries into a virtualenv, or something similar.

Stepping back, this would be the logical next step for all spark applications, not just pyspark ones, as it would let us avoid all of those manual spark configuration parameters, like

{code}
--files FILES
--py-files PY_FILES
--archives ARCHIVES
{code}
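For context, a minimal sketch of the virtualenv-style workaround mentioned above, assuming a YARN cluster and an environment already packed into a relocatable venv.tar.gz built for the workers' OS/arch. The archive name, the "environment" alias, and the exact behavior across Spark versions are assumptions on my part, not something this ticket prescribes:

{code}
# Sketch only: ship a pre-packed virtualenv to the executors and point each
# PySpark worker's Python at it, so native libraries load from a real
# filesystem rather than from a zip. Assumes YARN and that "venv.tar.gz"
# matches the workers' OS/arch; all names here are illustrative.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("per-job-virtualenv-sketch")
    # Distribute the packed env; YARN unpacks it under the alias "environment".
    .set("spark.yarn.dist.archives", "venv.tar.gz#environment")
    # Have executors launch PySpark workers with the unpacked interpreter.
    .set("spark.executorEnv.PYSPARK_PYTHON", "./environment/bin/python")
)

sc = SparkContext(conf=conf)
# Imports of native-extension packages now resolve inside the virtualenv.
print(sc.parallelize(range(4)).map(lambda x: x * x).collect())
sc.stop()
{code}

Whether the executors actually pick up PYSPARK_PYTHON this way depends on the cluster manager and Spark version, which is part of why a first-class mechanism (docker or otherwise) would be nicer.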
> Enable a "docker run" invocation in place of PYSPARK_PYTHON
> -----------------------------------------------------------
>
> Key: SPARK-9313
> URL: https://issues.apache.org/jira/browse/SPARK-9313
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Environment: Linux
> Reporter: thom neale
> Priority: Minor
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> There's a potentially high-yield improvement that might be possible by enabling people to set PYSPARK_PYTHON (or possibly a new env var) to a docker run of a specific docker image. I'm interested in taking a shot at this, but could use some pointers on overall pyspark architecture in order to avoid hurting myself or trying something stupid that won't work.
>
> History of this idea: I handle most of the spark infrastructure for MassMutual's data science team, and we currently push code updates out to spark workers with a combination of git post-receive hooks and ansible playbooks, all glued together with jenkins. It works well, but every time someone wants a specific PYSPARK_PYTHON environment with precise branch checkouts, for example, it has to be exquisitely configured in advance. What would be amazing is if we could run a docker image in place of PYSPARK_PYTHON, so people could build an image with whatever they want on it and push it to a docker registry; then, as long as the spark worker nodes had a docker daemon running, they wouldn't need the images in advance--they would just pull the built images from the registry on the fly once someone submitted their job and specified the appropriate docker fu in place of PYSPARK_PYTHON. This would basically make the distribution of code to the workers self-service, as long as users were savvy with docker.
>
> A lesser benefit is that the layered filesystem feature of docker would keep the (admittedly minor) profusion of python virtualenvs, each loaded with a huge ML stack plus other deps, from gobbling up gigs of space on the smaller code partitions on our workers. Each new combination of branch checkouts for our application code could use the same huge ML base image, and things would just be faster and simpler.
>
> What I Speculate This Would Require
> ---------------------------------------------------
> Based on a reading of pyspark/daemon.py, I think this would require:
> - Somehow making the os.setpgid call inside manager() optional. The pyspark.daemon process isn't allowed to call setpgid, I think because it has pid 1 in the container. In my hacked branch I'm doing this by checking whether a new environment variable is set.
> - Instead of binding to a random port, if the worker is dockerized, bind to a predetermined port.
> - When the dockerized worker is invoked, query docker for the exposed port on the host, and print that instead.
> - Possibly do the same with ports opened by forked workers?
> - Forward stdin/out to/from the container where appropriate.
>
> My initial tinkering has done the first three points on 1.3.1, and I get an InvalidArgumentException with an out-of-range port number, probably indicating that something is hitting an error and printing something else instead of the actual port.
>
> Any pointers people can supply would be most welcome; I'm really interested in at least succeeding in a demonstration of this hack, if not getting it merged any time soon.
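To make the first two bullets in the quoted issue more concrete, here is a rough sketch of an env-var-gated manager() in the spirit of the hacked branch described above. This is not the actual pyspark/daemon.py source; the PYSPARK_DOCKERIZED and PYSPARK_WORKER_PORT variable names and the fallback port are made up for illustration:

{code}
# Sketch only, not the real pyspark/daemon.py: gate setpgid and the port
# choice on a hypothetical PYSPARK_DOCKERIZED flag, per the first two bullets.
import os
import socket
import struct

def manager():
    dockerized = os.environ.get("PYSPARK_DOCKERIZED") == "1"  # hypothetical flag

    if not dockerized:
        # In a container the daemon is typically pid 1 and setpgid fails,
        # so only start a new process group on an ordinary worker host.
        os.setpgid(0, 0)

    listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if dockerized:
        # Bind a predetermined, EXPOSE-able port so the executor can reach the
        # daemon through docker's port mapping instead of a random port.
        listen_sock.bind(("0.0.0.0", int(os.environ.get("PYSPARK_WORKER_PORT", "8200"))))
    else:
        listen_sock.bind(("127.0.0.1", 0))
    listen_sock.listen(128)
    listen_port = listen_sock.getsockname()[1]

    # The JVM reads the port back as a single big-endian int from the daemon's
    # stdout; any stray output before this corrupts it, which would explain the
    # out-of-range port in the InvalidArgumentException mentioned above.
    os.write(1, struct.pack("!i", listen_port))
    return listen_sock

if __name__ == "__main__":
    manager()
{code}

The third bullet would then swap the value written to stdout for whatever host-side port docker actually mapped, rather than the container-internal one.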