Hey all, great discussion. Just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary Python library.

You might want to check out findspark (https://github.com/minrk/findspark), started by Jupyter project devs, which offers one way to facilitate this; I've also cc'ed them here to join the conversation. Also, @Jey, I can confirm that at least in some scenarios (I've done it on an EC2 cluster in standalone mode) it's possible to run PySpark jobs using just `from pyspark import SparkContext; sc = SparkContext(master="X")`, as long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* the workers and the driver. That said, there's definitely additional configuration/functionality that would require going through the proper submit scripts.
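To make that concrete, here's roughly what the flow looks like with findspark. The paths and master URL below are placeholders for whatever your installation actually uses, and as noted, the same values have to be visible on the workers too:

    import os

    # Placeholder paths: point these at your actual installation. The same
    # values must also be set in the workers' environments.
    os.environ["SPARK_HOME"] = "/opt/spark"
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python"

    import findspark
    findspark.init()  # puts $SPARK_HOME/python and the py4j zip on sys.path

    from pyspark import SparkContext

    sc = SparkContext(master="spark://master-host:7077", appName="smoke-test")
    print(sc.parallelize(range(100)).sum())
    sc.stop()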
> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.bis...@gmail.com> wrote:
>
> I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas.
>
> Punya
>
> On Wed, Jul 22, 2015 at 12:49 PM, Justin Uang <justin.u...@gmail.com> wrote:
>
> // + Davies for his comments
> // + Punya for SA
>
> For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all their tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script. With a proper pyspark setup.py that declares its dependencies, and a published distribution, depending on pyspark would just be a matter of adding pyspark to my setup.py dependencies.
>
> Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full Spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development the PYTHONPATH munging is very annoying.
>
> I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the versions match exactly, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. Even if the user's pip-installed pyspark somehow got loaded before the one provided by the Spark distribution, the user would be alerted immediately.
>
> Davies, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464?
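To make the setup.py idea concrete, here is a rough sketch of what I'd imagine the packaging metadata looking like. The version pins below are purely illustrative (not what Spark actually ships), and the exact dependency list would need discussion:

    # setup.py (illustrative sketch only)
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.4.1",           # must match the Spark distribution exactly
        packages=find_packages(),  # pyspark, pyspark.sql, pyspark.mllib, ...
        install_requires=[
            "py4j==0.8.2.1",  # the gateway version this Spark build bundles
            "pandas",         # so df.toPandas() works out of the box
        ],
    )

The exact-version enforcement Justin mentions could then be a check at SparkContext startup that compares the Python package's version string against the JVM-side version (e.g. what `sc.version` reports) and fails fast on any mismatch.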
> On Sat, Jun 6, 2015 at 12:48 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
> Ok, I get it. Now what can we do to improve the current situation? Right now, if I want to set up a CI env for PySpark, I have to:
> 1- download a pre-built Spark distribution and unzip it somewhere on every agent
> 2- define the SPARK_HOME env variable
> 3- symlink this distribution's pyspark dir into the Python install's site-packages/ directory
> and if I rely on additional packages (like Databricks' spark-csv project), I have to (unless I'm mistaken):
> 4- compile/assemble spark-csv and deploy the jar in a specific directory on every agent
> 5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults file
>
> Then finally we can launch our unit/integration tests.
> Some issues are related to spark-packages, some to the lack of Python-based dependency management, and some to the way SparkContexts are launched when using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions (I didn't check), and considering spark-shell downloads such dependencies automatically, I think it will be handled if it isn't already (I guess?).
>
> For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not exactly advocating distributing a full 300 MB Spark distribution on PyPI; maybe there's a better compromise?
>
> Regards,
>
> Olivier.
>
> On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:
>
> Couldn't we have a pip-installable "pyspark" package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook.
>
> I did a test, and it seems like PySpark's basic unit tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:
>
>     PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
>
> -Jey
>
> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>
> This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267
>
> There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components, or vice versa.
>
> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
> Hi everyone,
>
> Considering that the Python API is just a front end needing SPARK_HOME defined anyway, I think it would be interesting to deploy the Python part of Spark on PyPI in order to handle the dependencies of a Python project needing PySpark via pip.
>
> For now I just symlink python/pyspark into my Python install's site-packages/ directory so that PyCharm and other lint tools work properly. I can do the setup.py work, or anything else that's needed.
>
> What do you think?
>
> Regards,
>
> Olivier.
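PS: for what it's worth, the shim Jey describes would only need a few lines. A rough sketch, assuming the standard distribution layout (the py4j zip name varies between Spark releases):

    # pip-installable shim sketch: defer to an existing Spark installation.
    import glob
    import os
    import sys

    def init_pyspark():
        spark_home = os.environ.get("SPARK_HOME")
        if not spark_home:
            raise RuntimeError("SPARK_HOME must point at a Spark distribution")
        # Make the bundled pyspark package and its py4j zip importable.
        sys.path.insert(0, os.path.join(spark_home, "python"))
        for py4j_zip in glob.glob(
                os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
            sys.path.insert(0, py4j_zip)

    init_pyspark()
    from pyspark import SparkContext  # now importable without manual munging

    sc = SparkContext("local[4]", "shim-demo")

This is essentially what findspark does today, which is why packaging it (or something like it) seems like a low-risk first step.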