[ https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083355#comment-14083355 ]

Alex Gaudio commented on SPARK-1267:
------------------------------------

I'm all for a pip-installable pyspark, but I'm confused about the ideal way to 
install the pyspark code.  I'd also prefer to avoid introducing an extra 
variable, SPARK_VERSION.  It seems to me that if we had a typical setup.py 
file that downloaded code from PyPI, users would have to deal with 
differences between the Python code published on PyPI and the code pointed to 
by SPARK_HOME.  Additionally, users would still need to download the Spark 
jars or set SPARK_HOME, which means two (possibly different) versions of the 
Python code are flying around.  Requiring users to manage the version, 
download Spark into SPARK_HOME, and also pip install pyspark doesn't seem 
quite right.

What do you think about this:  we create a setup.py file that requires 
SPARK_HOME to be set in the environment (i.e., that the user has already 
downloaded Spark) BEFORE the pyspark code gets installed.
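
To make that concrete, here's a minimal sketch of what such a setup.py could 
look like (illustrative only, not an actual Spark file; it assumes the script 
lives in $SPARK_HOME/python next to the pyspark package):

    import os
    import sys
    from setuptools import setup

    # Refuse to install unless SPARK_HOME points at a downloaded Spark distribution.
    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home or not os.path.isdir(spark_home):
        sys.exit("SPARK_HOME must be set to a Spark download before installing pyspark.")

    setup(
        name="pyspark",
        version="1.0.0",  # hypothetical; would track the Spark release in SPARK_HOME
        packages=["pyspark"],
    )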

An additional idea we could consider:  when pip or a user runs "python 
setup.py install", we redirect to "python setup.py develop".  This installs 
pyspark in "development mode", which means the pyspark code pointed to by 
$SPARK_HOME/python remains the source of truth (more about development mode 
here: https://pythonhosted.org/setuptools/setuptools.html#development-mode).  
My thinking is that since users need to specify SPARK_HOME anyway, we might 
as well keep the Python library with the Spark code (as it currently is) to 
avoid potential compatibility conflicts.  As maintainers, we also wouldn't 
need to keep PyPI updated with the latest version of pyspark.  That said, 
using development mode as the default may be a bad idea, and I don't know how 
to automatically prefer "setup.py develop" over "setup.py install" (one 
possible mechanism is sketched below).

Last, and perhaps most obvious: if we create a setup.py file, we could 
probably also stop bundling the Py4J egg in the Spark downloads, since we'd 
rely on setuptools to provide the external libraries.
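
For example, the dependency could be declared instead of bundled (sketch only; 
the pin would match whatever Py4J version Spark currently ships):

    from setuptools import setup

    setup(
        name="pyspark",
        version="1.0.0",  # hypothetical
        packages=["pyspark"],
        # Let setuptools/pip fetch Py4J instead of shipping the egg in the download.
        install_requires=["py4j"],
    )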



> Add a pip installer for PySpark
> -------------------------------
>
>                 Key: SPARK-1267
>                 URL: https://issues.apache.org/jira/browse/SPARK-1267
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Prabin Banka
>            Priority: Minor
>              Labels: pyspark
>
> Please refer to this mail archive,
> http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E


