// + *Davies* for his comments
// + Punya for SA

For development and CI, as Olivier mentioned, I think it would be hugely
beneficial to publish pyspark (only the code in the python/ dir) on PyPI.
Anyone who wants to develop against the PySpark APIs currently has to
download the distribution and do a lot of PYTHONPATH munging for all the
tools (pylint, pytest, IDE code completion). Right now that involves adding
python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark ever adds more
dependencies, we would have to manually mirror all of that PYTHONPATH
munging in the ./pyspark script. With a proper pyspark setup.py that
declares its dependencies, and a published distribution, depending on
pyspark would just mean adding pyspark to my project's setup.py dependencies.
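
Concretely, I'm picturing something along these lines for python/setup.py
(just a sketch to show the shape of it; the version pin and the py4j version
are illustrative, not a worked-out proposal):

# python/setup.py -- rough sketch, not an actual packaging proposal
from setuptools import setup, find_packages

setup(
    name="pyspark",
    version="1.4.0",           # would have to track the Spark release exactly
    packages=find_packages(),  # picks up pyspark and its subpackages
    install_requires=[
        "py4j==0.8.2.1",       # declared instead of bundling the py4j src zip
    ],
)

With that published, a downstream project just lists pyspark in its own
install_requires, and pylint/pytest/IDE completion pick it up from
site-packages like any other package.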

Of course, if we actually want to run the parts of pyspark that are backed
by Py4J calls, then we need the full Spark distribution with either
./pyspark or ./spark-submit, but for things like linting and development,
the PYTHONPATH munging is very annoying.

I don't think the version-mismatch issues are a compelling reason not to go
ahead with PyPI publishing. At runtime, we should definitely enforce an
exact version match, which avoids the backcompat nightmare Davies raised in
https://issues.apache.org/jira/browse/SPARK-1267. That way, even if a user's
pip-installed pyspark somehow got loaded ahead of the pyspark shipped with
the Spark distribution, they would be alerted immediately.
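
To make that concrete, the check could just compare the version reported by
the JVM side against the version the pip package was built for, when the
SparkContext comes up. A sketch (pyspark.__version__ is an assumption here;
the pip-installed package would need to expose the release it was built for):

# hypothetical version guard, run when the SparkContext is created
import pyspark

def _check_spark_version(sc):
    jvm_version = sc.version  # version reported by the Java half of Spark
    if jvm_version != pyspark.__version__:
        raise RuntimeError(
            "pip-installed pyspark %s cannot be used against Spark %s; "
            "the versions must match exactly"
            % (pyspark.__version__, jvm_version))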

*Davies*, if you buy this, should I or someone on my team pick up
https://issues.apache.org/jira/browse/SPARK-1267 and
https://github.com/apache/spark/pull/464?

On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Ok, I get it. Now, what can we do to improve the current situation? Because
> right now, if I want to set up a CI env for PySpark, I have to:
> 1- download a pre-built version of pyspark and unzip it somewhere on every
> agent
> 2- define the SPARK_HOME env variable
> 3- symlink this distribution's pyspark dir into the Python install's
> site-packages/ directory
> and if I rely on additional packages (like Databricks' spark-csv project),
> I have to (unless I'm mistaken)
> 4- compile/assemble spark-csv and deploy the jar in a specific directory on
> every agent
> 5- add this jar-filled directory to the Spark distribution's additional
> classpath using the conf/spark-defaults.conf file
>
> Then finally we can launch our unit/integration tests.
> Some issues are related to spark-packages, some to the lack of a
> Python-based dependency declaration, and some to the way SparkContexts are
> launched when using pyspark.
> I think steps 1 and 2 are fair enough.
> Steps 4 and 5 may already have solutions; I didn't check, and considering
> that spark-shell downloads such dependencies automatically, I think it will
> happen for pyspark too if nothing's been done yet (I guess?).
>
> For step 3, maybe just adding a setup.py to the distribution would be
> enough. I'm not exactly advocating distributing a full 300MB Spark
> distribution on PyPI; maybe there's a better compromise?
>
> Regards,
>
> Olivier.
>
> On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:
>
>> Couldn't we have a pip installable "pyspark" package that just serves as
>> a shim to an existing Spark installation? Or it could even download the
>> latest Spark binary if SPARK_HOME isn't set during installation. Right now,
>> Spark doesn't play very well with the usual Python ecosystem. For example,
>> why do I need to use a strange incantation when booting up IPython if I
>> want to use PySpark in a notebook with MASTER="local[4]"? It would be much
>> nicer to just type `from pyspark import SparkContext; sc =
>> SparkContext("local[4]")` in my notebook.
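>>
>> Roughly, the shim's job would just be to put an existing installation on
>> sys.path before pyspark gets imported. A quick sketch of the idea (the
>> helper name is made up, and the py4j zip layout under SPARK_HOME is an
>> assumption based on the current 1.4 distribution):
>>
>> # hypothetical shim helper, e.g. pyspark_shim.py -- not an existing package
>> import glob
>> import os
>> import sys
>>
>> def add_pyspark_to_path():
>>     spark_home = os.environ.get("SPARK_HOME")
>>     if not spark_home:
>>         raise ImportError("SPARK_HOME must point to a Spark installation")
>>     # the real pyspark package lives under $SPARK_HOME/python
>>     sys.path.insert(0, os.path.join(spark_home, "python"))
>>     # the bundled Py4J ships as a source zip under python/lib
>>     for zip_path in glob.glob(
>>             os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
>>         sys.path.insert(0, zip_path)
>>
>> add_pyspark_to_path()
>> from pyspark import SparkContext  # now resolves against $SPARK_HOME
>> sc = SparkContext("local[4]")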
>>
>> I did a test and it seems like PySpark's basic unit-tests do pass when
>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>
>>
>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
>> python $SPARK_HOME/python/pyspark/rdd.py
>>
>> -Jey
>>
>>
>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>
>>> This has been proposed before:
>>> https://issues.apache.org/jira/browse/SPARK-1267
>>>
>>> There's currently tighter coupling between the Python and Java halves of
>>> PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
>>> we'd run into tons of issues when users try to run a newer version of the
>>> Python half of PySpark against an older set of Java components or
>>> vice-versa.
>>>
>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Hi everyone,
>>>> Considering that the Python API is just a front-end that needs SPARK_HOME
>>>> defined anyway, I think it would be interesting to publish the Python part
>>>> of Spark on PyPI, in order to handle the dependency via pip in any Python
>>>> project needing PySpark.
>>>>
>>>> For now I just symlink python/pyspark into my Python install's
>>>> site-packages/ directory so that PyCharm and other lint tools work
>>>> properly. I can do the setup.py work, or anything else needed.
>>>>
>>>> What do you think?
>>>>
>>>> Regards,
>>>>
>>>> Olivier.
>>>>
>>>
>>>
>>
