[ https://issues.apache.org/jira/browse/SPARK-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744731#comment-14744731 ]
Justin Uang commented on SPARK-9313:
------------------------------------

This would be hugely helpful. I'm working on a platform that allows users to execute python code and also specify a requirements.txt. The problem is that we might have multiple users' code running with different requirements.txt files, so we were thinking of having our service create a docker image and then send it out to all the worker nodes before running their code.

In addition, we will need something like this because --py-files is not sufficient. It doesn't work with any python library that has native extensions, like numpy, scipy, and almost all scientific libraries, because the OS cannot load native libraries directly from the zip. Instead, we need something that installs the libraries into a virtualenv, or something similar.

Stepping back, this would be the logical next step for all spark applications, not just pyspark ones, as it would let us avoid all of those manual spark configuration parameters, like

{code}
--files FILES
--py-files PY_FILES
--archives ARCHIVES
{code}
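For context, a minimal sketch of the virtualenv-style workaround mentioned above, assuming a YARN cluster and an environment already packed into a relocatable venv.tar.gz built for the workers' OS/arch. The archive name, the "environment" alias, and the exact behavior across Spark versions are assumptions on my part, not something this ticket prescribes:

{code}
# Sketch only: ship a pre-packed virtualenv to the executors and point each
# PySpark worker's Python at it, so native libraries load from a real
# filesystem rather than from a zip. Assumes YARN and that "venv.tar.gz"
# matches the workers' OS/arch; all names here are illustrative.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("per-job-virtualenv-sketch")
    # Distribute the packed env; YARN unpacks it under the alias "environment".
    .set("spark.yarn.dist.archives", "venv.tar.gz#environment")
    # Have executors launch PySpark workers with the unpacked interpreter.
    .set("spark.executorEnv.PYSPARK_PYTHON", "./environment/bin/python")
)

sc = SparkContext(conf=conf)
# Imports of native-extension packages now resolve inside the virtualenv.
print(sc.parallelize(range(4)).map(lambda x: x * x).collect())
sc.stop()
{code}

Whether the executors actually pick up PYSPARK_PYTHON this way depends on the cluster manager and Spark version, which is part of why a first-class mechanism (docker or otherwise) would be nicer.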
> Enable a "docker run" invocation in place of PYSPARK_PYTHON
> -----------------------------------------------------------
>
> Key: SPARK-9313
> URL: https://issues.apache.org/jira/browse/SPARK-9313
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Environment: Linux
> Reporter: thom neale
> Priority: Minor
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> There's a potentially high-yield improvement that might be possible by enabling people to set PYSPARK_PYTHON (or possibly a new env var) to a docker run of a specific docker image. I'm interested in taking a shot at this, but could use some pointers on overall pyspark architecture in order to avoid hurting myself or trying something stupid that won't work.
>
> History of this idea: I handle most of the spark infrastructure for MassMutual's data science team, and we currently push code updates out to spark workers with a combination of git post-receive hooks and ansible playbooks, all glued together with jenkins. It works well, but every time someone wants a specific PYSPARK_PYTHON environment with precise branch checkouts, for example, it has to be exquisitely configured in advance. What would be amazing is if we could run a docker image in place of PYSPARK_PYTHON, so people could build an image with whatever they want on it and push it to a docker registry; then, as long as the spark worker nodes had a docker daemon running, they wouldn't need the images in advance--they would just pull the built images from the registry on the fly once someone submitted their job and specified the appropriate docker fu in place of PYSPARK_PYTHON. This would basically make the distribution of code to the workers self-service, as long as users were savvy with docker.
>
> A lesser benefit is that the layered filesystem feature of docker would keep the (admittedly minor) profusion of python virtualenvs, each loaded with a huge ML stack plus other deps, from gobbling up gigs of space on the smaller code partitions on our workers. Each new combination of branch checkouts for our application code could use the same huge ML base image, and things would just be faster and simpler.
>
> What I Speculate This Would Require
> ---------------------------------------------------
> Based on a reading of pyspark/daemon.py, I think this would require:
> - Somehow making the os.setpgid call inside manager() optional. The pyspark.daemon process isn't allowed to call setpgid, I think because it has pid 1 in the container. In my hacked branch I'm doing this by checking whether a new environment variable is set.
> - Instead of binding to a random port, if the worker is dockerized, bind to a predetermined port.
> - When the dockerized worker is invoked, query docker for the exposed port on the host, and print that instead.
> - Possibly do the same with ports opened by forked workers?
> - Forward stdin/out to/from the container where appropriate.
>
> My initial tinkering has done the first three points on 1.3.1, and I get an InvalidArgumentException with an out-of-range port number, probably indicating that something is hitting an error and printing something else instead of the actual port.
>
> Any pointers people can supply would be most welcome; I'm really interested in at least succeeding in a demonstration of this hack, if not getting it merged any time soon.
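To make the first two bullets in the quoted issue more concrete, here is a rough sketch of an env-var-gated manager() in the spirit of the hacked branch described above. This is not the actual pyspark/daemon.py source; the PYSPARK_DOCKERIZED and PYSPARK_WORKER_PORT variable names and the fallback port are made up for illustration:

{code}
# Sketch only, not the real pyspark/daemon.py: gate setpgid and the port
# choice on a hypothetical PYSPARK_DOCKERIZED flag, per the first two bullets.
import os
import socket
import struct

def manager():
    dockerized = os.environ.get("PYSPARK_DOCKERIZED") == "1"  # hypothetical flag

    if not dockerized:
        # In a container the daemon is typically pid 1 and setpgid fails,
        # so only start a new process group on an ordinary worker host.
        os.setpgid(0, 0)

    listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if dockerized:
        # Bind a predetermined, EXPOSE-able port so the executor can reach the
        # daemon through docker's port mapping instead of a random port.
        listen_sock.bind(("0.0.0.0", int(os.environ.get("PYSPARK_WORKER_PORT", "8200"))))
    else:
        listen_sock.bind(("127.0.0.1", 0))
    listen_sock.listen(128)
    listen_port = listen_sock.getsockname()[1]

    # The JVM reads the port back as a single big-endian int from the daemon's
    # stdout; any stray output before this corrupts it, which would explain the
    # out-of-range port in the InvalidArgumentException mentioned above.
    os.write(1, struct.pack("!i", listen_port))
    return listen_sock

if __name__ == "__main__":
    manager()
{code}

The third bullet would then swap the value written to stdout for whatever host-side port docker actually mapped, rather than the container-internal one.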