// ping: do we have any sign-off from the PySpark devs to submit a PR to publish to PyPI?
On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <freeman.jer...@gmail.com> wrote:

> Hey all, great discussion, just wanted to +1 that I see a lot of value in
> steps that make it easier to use PySpark as an ordinary Python library.
>
> You might want to check out findspark (https://github.com/minrk/findspark),
> started by Jupyter project devs, which offers one way to facilitate this
> stuff. I've also cc'ed them here to join the conversation.
>
> Also, @Jey, I can confirm that at least in some scenarios (I've done it in
> an EC2 cluster in standalone mode) it's possible to run PySpark jobs just
> using `from pyspark import SparkContext; sc = SparkContext(master="X")`,
> as long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are
> set correctly on *both* workers and driver. That said, there's definitely
> additional configuration / functionality that would require going through
> the proper submit scripts.
>
> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.bis...@gmail.com> wrote:
>
>> I agree with everything Justin just said. An additional advantage of
>> publishing PySpark's Python code in a standards-compliant way is that
>> we'll be able to declare transitive dependencies (Pandas, Py4J) in a way
>> that pip can use. Contrast this with the current situation, where
>> df.toPandas() exists in the Spark API but doesn't actually work until
>> you install Pandas.
>>
>> Punya
>>
>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.u...@gmail.com> wrote:
>>
>>> // + *Davies* for his comments
>>> // + Punya for SA
>>>
>>> For development and CI, like Olivier mentioned, I think it would be
>>> hugely beneficial to publish pyspark (only the code in the python/ dir)
>>> on PyPI. If anyone wants to develop against PySpark APIs, they need to
>>> download the distribution and do a lot of PYTHONPATH munging for all
>>> their tools (pylint, pytest, IDE code completion). Right now that
>>> involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark
>>> ever adds more dependencies, we would have to manually mirror all the
>>> PYTHONPATH munging in the ./pyspark script. With a proper pyspark
>>> setup.py that declares its dependencies, and a published distribution,
>>> depending on pyspark would just mean adding pyspark to my setup.py
>>> dependencies.
>>>
>>> Of course, if we actually want to run the parts of pyspark that are
>>> backed by Py4J calls, then we need the full Spark distribution with
>>> either ./pyspark or ./spark-submit, but for things like linting and
>>> development, the PYTHONPATH munging is very annoying.
>>>
>>> I don't think the version-mismatch issues are a compelling reason not
>>> to go ahead with PyPI publishing. At runtime, we should definitely
>>> enforce that the versions match exactly, which means there is no
>>> backcompat nightmare as suggested by Davies in
>>> https://issues.apache.org/jira/browse/SPARK-1267. This would mean that
>>> even if the user's pip-installed pyspark somehow got loaded before the
>>> pyspark provided by the Spark distribution, the user would be alerted
>>> immediately.
>>>
>>> *Davies*, if you buy this, should I or someone on my team pick up
>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>>> https://github.com/apache/spark/pull/464?
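// For concreteness: a rough sketch of the kind of setup.py being discussed,
// living under python/. The version pin and the py4j/pandas entries below
// are my assumptions for illustration, not settled choices.

    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.4.1",  # assumed; would have to match the Spark distribution exactly
        packages=find_packages(exclude=["*.tests", "*.tests.*"]),
        install_requires=[
            "py4j==0.8.2.1",  # the Py4J version currently bundled in python/lib/
        ],
        extras_require={
            # pandas is optional today, but df.toPandas() fails without it
            "pandas": ["pandas"],
        },
    )

// With something like this published, `pip install pyspark` would pull in
// Py4J automatically, and `pip install pyspark[pandas]` would cover the
// df.toPandas() case Punya mentions.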
>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Ok, I get it. Now what can we do to improve the current situation?
>>>> Right now, if I want to set up a CI env for PySpark, I have to:
>>>>
>>>> 1- download a pre-built version of pyspark and unzip it somewhere on
>>>> every agent
>>>> 2- define the SPARK_HOME env variable
>>>> 3- symlink this distribution's pyspark dir into the Python install
>>>> dir's site-packages/ directory
>>>>
>>>> And if I rely on additional packages (like databricks' spark-csv
>>>> project), I have to (unless I'm mistaken):
>>>>
>>>> 4- compile/assemble spark-csv and deploy the jar in a specific
>>>> directory on every agent
>>>> 5- add this jar-filled directory to the Spark distribution's
>>>> additional classpath using the conf/spark-defaults file
>>>>
>>>> Then finally we can launch our unit/integration tests.
>>>>
>>>> Some issues are related to spark-packages, some to the lack of
>>>> Python-based dependency handling, and some to the way SparkContexts
>>>> are launched when using pyspark.
>>>> I think steps 1 and 2 are fair enough.
>>>> Steps 4 and 5 may already have solutions; I didn't check, and
>>>> considering spark-shell downloads such dependencies automatically, I
>>>> think that if nothing is done yet, it will be (I guess?).
>>>>
>>>> For step 3, maybe just adding a setup.py to the distribution would be
>>>> enough. I'm not exactly advocating distributing a full 300MB Spark
>>>> distribution on PyPI; maybe there's a better compromise?
>>>>
>>>> Regards,
>>>>
>>>> Olivier.
>>>>
>>>> On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:
>>>>
>>>>> Couldn't we have a pip-installable "pyspark" package that just serves
>>>>> as a shim to an existing Spark installation? Or it could even
>>>>> download the latest Spark binary if SPARK_HOME isn't set during
>>>>> installation. Right now, Spark doesn't play very well with the usual
>>>>> Python ecosystem. For example, why do I need to use a strange
>>>>> incantation when booting up IPython if I want to use PySpark in a
>>>>> notebook with MASTER="local[4]"? It would be much nicer to just type
>>>>> `from pyspark import SparkContext; sc = SparkContext("local[4]")` in
>>>>> my notebook.
>>>>>
>>>>> I did a test, and it seems like PySpark's basic unit tests do pass
>>>>> when SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>>>
>>>>>     PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH \
>>>>>         python $SPARK_HOME/python/pyspark/rdd.py
>>>>>
>>>>> -Jey
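// Jey's shim essentially exists already as the findspark package Jeremy
// links above. A minimal sketch of the notebook workflow it enables,
// assuming findspark is pip-installed and SPARK_HOME points at an existing
// Spark install:

    import findspark
    findspark.init()  # prepends $SPARK_HOME/python and the bundled py4j zip to sys.path

    from pyspark import SparkContext

    sc = SparkContext("local[4]")
    print(sc.parallelize(range(100)).sum())  # 4950
    sc.stop()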
>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>>>>
>>>>>> This has been proposed before:
>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>>>>>
>>>>>> There's currently tighter coupling between the Python and Java
>>>>>> halves of PySpark than just requiring SPARK_HOME to be set; if we
>>>>>> did this, I bet we'd run into tons of issues when users try to run a
>>>>>> newer version of the Python half of PySpark against an older set of
>>>>>> Java components, or vice versa.
>>>>>>
>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>> Considering that the Python API is just a front end that needs
>>>>>>> SPARK_HOME defined anyway, I think it would be interesting to
>>>>>>> deploy the Python part of Spark on PyPI, in order to handle the
>>>>>>> dependencies of a Python project needing PySpark via pip.
>>>>>>>
>>>>>>> For now I just symlink python/pyspark into my Python install dir's
>>>>>>> site-packages/ directory so that PyCharm and other lint tools work
>>>>>>> properly. I can do the setup.py work, or anything else.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Olivier.
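// Re: Josh's version-coupling concern: the exact-match enforcement Justin
// proposes upthread could be a guard along these lines. How the pip-installed
// half would record its own version is exactly the open design question, so
// PYSPARK_DIST_VERSION below is hypothetical; sc.version (the version
// reported by the JVM half) does exist on the Python SparkContext.

    from pyspark import SparkContext

    PYSPARK_DIST_VERSION = "1.4.1"  # hypothetical: baked in at packaging time

    def require_matching_spark(sc):
        # Compare the pip-installed half against the JVM half and fail fast.
        if sc.version != PYSPARK_DIST_VERSION:
            raise RuntimeError(
                "pip-installed pyspark %s cannot run against Spark %s; "
                "versions must match exactly"
                % (PYSPARK_DIST_VERSION, sc.version))

    sc = SparkContext("local[2]")
    require_matching_spark(sc)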