Hey all, great discussion. Just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary Python library.

You might want to check out findspark (https://github.com/minrk/findspark), started by Jupyter project devs, which offers one way to facilitate this; I've also cc'ed them here to join the conversation. Also, @Jey, I can confirm that at least in some scenarios (I've done it on an EC2 cluster in standalone mode) it's possible to run PySpark jobs using just `from pyspark import SparkContext; sc = SparkContext(master="X")`, as long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* the workers and the driver. That said, there's definitely additional configuration/functionality that would require going through the proper submit scripts.
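To make that concrete, here's roughly what the flow looks like with findspark. The paths and master URL below are placeholders for whatever your installation actually uses, and as noted, the same values have to be visible on the workers too:

    import os

    # Placeholder paths: point these at your actual installation. The same
    # values must also be set in the workers' environments.
    os.environ["SPARK_HOME"] = "/opt/spark"
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python"

    import findspark
    findspark.init()  # puts $SPARK_HOME/python and the py4j zip on sys.path

    from pyspark import SparkContext

    sc = SparkContext(master="spark://master-host:7077", appName="smoke-test")
    print(sc.parallelize(range(100)).sum())
    sc.stop()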
> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.bis...@gmail.com> wrote:
>
> I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas.
>
> Punya
>
> On Wed, Jul 22, 2015 at 12:49 PM, Justin Uang <justin.u...@gmail.com> wrote:
>
> // + Davies for his comments
> // + Punya for SA
>
> For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all their tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script. With a proper pyspark setup.py that declares its dependencies, and a published distribution, depending on pyspark would just be a matter of adding pyspark to my setup.py dependencies.
>
> Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full Spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development the PYTHONPATH munging is very annoying.
>
> I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the versions match exactly, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. Even if the user's pip-installed pyspark somehow got loaded before the one provided by the Spark distribution, the user would be alerted immediately.
>
> Davies, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464?
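To make the setup.py idea concrete, here is a rough sketch of what I'd imagine the packaging metadata looking like. The version pins below are purely illustrative (not what Spark actually ships), and the exact dependency list would need discussion:

    # setup.py (illustrative sketch only)
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.4.1",           # must match the Spark distribution exactly
        packages=find_packages(),  # pyspark, pyspark.sql, pyspark.mllib, ...
        install_requires=[
            "py4j==0.8.2.1",  # the gateway version this Spark build bundles
            "pandas",         # so df.toPandas() works out of the box
        ],
    )

The exact-version enforcement Justin mentions could then be a check at SparkContext startup that compares the Python package's version string against the JVM-side version (e.g. what `sc.version` reports) and fails fast on any mismatch.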
> On Sat, Jun 6, 2015 at 12:48 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
> Ok, I get it. Now what can we do to improve the current situation? Right now, if I want to set up a CI env for PySpark, I have to:
> 1- download a pre-built Spark distribution and unzip it somewhere on every agent
> 2- define the SPARK_HOME env variable
> 3- symlink this distribution's pyspark dir into the Python install's site-packages/ directory
> and if I rely on additional packages (like Databricks' spark-csv project), I have to (unless I'm mistaken):
> 4- compile/assemble spark-csv and deploy the jar in a specific directory on every agent
> 5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults file
>
> Then finally we can launch our unit/integration tests.
> Some issues are related to spark-packages, some to the lack of Python-based dependency management, and some to the way SparkContexts are launched when using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions (I didn't check), and considering spark-shell downloads such dependencies automatically, I think it will be handled if it isn't already (I guess?).
>
> For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not exactly advocating distributing a full 300 MB Spark distribution on PyPI; maybe there's a better compromise?
>
> Regards,
>
> Olivier.
>
> On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:
>
> Couldn't we have a pip-installable "pyspark" package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook.
>
> I did a test, and it seems like PySpark's basic unit tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:
>
>     PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
>
> -Jey
>
> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>
> This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267
>
> There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components, or vice versa.
>
> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
> Hi everyone,
>
> Considering that the Python API is just a front end needing SPARK_HOME defined anyway, I think it would be interesting to deploy the Python part of Spark on PyPI in order to handle the dependencies of a Python project needing PySpark via pip.
>
> For now I just symlink python/pyspark into my Python install's site-packages/ directory so that PyCharm and other lint tools work properly. I can do the setup.py work, or anything else that's needed.
>
> What do you think?
>
> Regards,
>
> Olivier.
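PS: for what it's worth, the shim Jey describes would only need a few lines. A rough sketch, assuming the standard distribution layout (the py4j zip name varies between Spark releases):

    # pip-installable shim sketch: defer to an existing Spark installation.
    import glob
    import os
    import sys

    def init_pyspark():
        spark_home = os.environ.get("SPARK_HOME")
        if not spark_home:
            raise RuntimeError("SPARK_HOME must point at a Spark distribution")
        # Make the bundled pyspark package and its py4j zip importable.
        sys.path.insert(0, os.path.join(spark_home, "python"))
        for py4j_zip in glob.glob(
                os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
            sys.path.insert(0, py4j_zip)

    init_pyspark()
    from pyspark import SparkContext  # now importable without manual munging

    sc = SparkContext("local[4]", "shim-demo")

This is essentially what findspark does today, which is why packaging it (or something like it) seems like a low-risk first step.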