Most admins I talk to about python and spark are already actively managing
(or on their way to managing) their cluster python installations. Even if
people begin by using the system python with pyspark, there is eventually a
user who needs a complex dependency (like pandas or sklearn) on the cluster.
No admin wants to muck around installing libs into the system python, so you
end up with other python installations.

Installing a non-system python is something users intending to use pyspark
on a real cluster should be thinking about eventually anyway. That works
fine in situations where people are running pyspark locally or actively
managing python installations on a cluster. There is an awkward middle
point where someone has installed spark but has not configured their cluster
in any other way (by installing a non-default python). Most clusters I see
are RHEL/CentOS and already use something other than the system python for
spark.

What libraries stopped supporting python 2.6, and where does spark use them?
The "ease of transitioning to pyspark on a cluster" problem may be an
easier pill to swallow if it only affected something like mllib or spark
sql and not parts of the core api. You end up hoping numpy or pandas are
installed alongside the runtime components of spark anyway. At that point
people really should just go install a non-system python. There are
tradeoffs to using pyspark, and I feel pretty fine explaining to people that
managing their cluster's python installations is something that comes with
using pyspark.
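
For what it's worth, a quick way to see what the executors actually have is
a little probe job. This is only a sketch: it assumes an already-running
SparkContext named `sc`, and the partition count is arbitrary.

    def probe_numpy(_):
        # return the numpy version on the worker, or None if it won't import
        try:
            import numpy
            return numpy.__version__
        except ImportError:
            return None

    # run the probe across a bunch of tasks and collect the distinct answers
    print(sc.parallelize(range(100), 20).map(probe_numpy).distinct().collect())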

RHEL/CentOS is so common that this would probably be a little work for a
lot of people.

--Juliet

On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers <ko...@tresata.com> wrote:

> hey evil admin:)
> i think the bit about java was from me?
> if so, i meant to indicate that the reality for us is that java is 1.7 on
> most (all?) clusters. i do not believe spark prefers java 1.8. my point was
> that even though java 1.7 is getting old as well, it would be a major issue
> for me if spark dropped java 1.7 support.
>
> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken <carli...@janelia.hhmi.org>
> wrote:
>
>> As one of the evil administrators that runs a RHEL 6 cluster, we already
>> provide quite a few different versions of python on our cluster pretty darn
>> easily. All you need is a separate install directory and to set the
>> PYTHON_HOME environment variable to point to the correct python, then have
>> the users make sure the correct python is in their PATH. I understand that
>> other administrators may not be so compliant.
>>
>> Saw a small bit about the java version in there; does Spark currently
>> prefer Java 1.8.x?
>>
>> —Ken
>>
>> On Jan 5, 2016, at 6:08 PM, Josh Rosen <joshro...@databricks.com> wrote:
>>
>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>>> while continuing to use a vanilla `python` executable on the executors
>>
>>
>> Whoops, just to be clear, this should actually read "while continuing to
>> use a vanilla `python` 2.7 executable".
>>
>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen <joshro...@databricks.com>
>> wrote:
>>
>>> Yep, the driver and executors need to have compatible Python versions. I
>>> think that there are some bytecode-level incompatibilities between 2.6 and
>>> 2.7 which would impact the deserialization of Python closures, so I think
>>> you need to be running the same 2.x version for all communicating Spark
>>> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
>>> driver while continuing to use a vanilla `python` executable on the
>>> executors (we have environment variables which allow you to control these
>>> separately).
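>>>
>>> (For reference, the relevant environment variables are PYSPARK_PYTHON for
>>> the executors and PYSPARK_DRIVER_PYTHON for the driver.) A minimal sanity
>>> check, assuming an already-running SparkContext named `sc`, is something
>>> like:
>>>
>>>     import sys
>>>
>>>     # interpreter version on the driver
>>>     print(tuple(sys.version_info[:3]))
>>>
>>>     # distinct interpreter versions seen by the executors
>>>     versions = (sc.parallelize(range(100), 20)
>>>                   .map(lambda _: tuple(sys.version_info[:3]))
>>>                   .distinct()
>>>                   .collect())
>>>     print(versions)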
>>>
>>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> I think all the slaves need the same (or a compatible) version of
>>>> Python installed since they run Python code in PySpark jobs natively.
>>>>
>>>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> interesting i didnt know that!
>>>>>
>>>>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> even if python 2.7 was needed only on this one machine that launches
>>>>>> the app we can not ship it with our software because its gpl licensed
>>>>>>
>>>>>> Not to nitpick, but maybe this is important. The Python license is 
>>>>>> GPL-compatible
>>>>>> but not GPL <https://docs.python.org/3/license.html>:
>>>>>>
>>>>>> Note GPL-compatible doesn’t mean that we’re distributing Python under
>>>>>> the GPL. All Python licenses, unlike the GPL, let you distribute a 
>>>>>> modified
>>>>>> version without making your changes open source. The GPL-compatible
>>>>>> licenses make it possible to combine Python with other software that is
>>>>>> released under the GPL; the others don’t.
>>>>>>
>>>>>> Nick
>>>>>> ​
>>>>>>
>>>>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> i do not think so.
>>>>>>>
>>>>>>> does the python 2.7 need to be installed on all slaves? if so, we do
>>>>>>> not have direct access to those.
>>>>>>>
>>>>>>> also, spark is easy for us to ship with our software since its
>>>>>>> apache 2 licensed, and it only needs to be present on the machine that
>>>>>>> launches the app (thanks to yarn).
>>>>>>> even if python 2.7 was needed only on this one machine that launches
>>>>>>> the app we can not ship it with our software because its gpl licensed, 
>>>>>>> so
>>>>>>> the client would have to download it and install it themselves, and this
>>>>>>> would mean its an independent install which has to be audited and 
>>>>>>> approved
>>>>>>> and now you are in for a lot of fun. basically it will never happen.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshro...@databricks.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> If users are able to install Spark 2.0 on their RHEL clusters, then
>>>>>>>> I imagine that they're also capable of installing a standalone Python
>>>>>>>> alongside that Spark version (without changing Python systemwide). For
>>>>>>>> instance, Anaconda/Miniconda make it really easy to install Python
>>>>>>>> 2.7.x/3.x without impacting / changing the system Python, and they don't
>>>>>>>> require any special permissions to install (you don't need root / sudo
>>>>>>>> access). Does this address the Python versioning concerns for RHEL 
>>>>>>>> users?
>>>>>>>>
>>>>>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> yeah, the practical concern is that we have no control over java
>>>>>>>>> or python version on large company clusters. our current reality for 
>>>>>>>>> the
>>>>>>>>> vast majority of them is java 7 and python 2.6, no matter how 
>>>>>>>>> outdated that
>>>>>>>>> is.
>>>>>>>>>
>>>>>>>>> i dont like it either, but i cannot change it.
>>>>>>>>>
>>>>>>>>> we currently don't use pyspark so i have no stake in this, but if
>>>>>>>>> we did i can assure you we would not upgrade to spark 2.x if python 
>>>>>>>>> 2.6 was
>>>>>>>>> dropped. no point in developing something that doesnt run for 
>>>>>>>>> majority of
>>>>>>>>> customers.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>>>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> As I pointed out in my earlier email, RHEL will support Python
>>>>>>>>>> 2.6 until 2020. So I'm assuming these large companies will have the 
>>>>>>>>>> option
>>>>>>>>>> of riding out Python 2.6 until then.
>>>>>>>>>>
>>>>>>>>>> Are we seriously saying that Spark should likewise support Python
>>>>>>>>>> 2.6 for the next several years? Even though the core Python devs 
>>>>>>>>>> stopped
>>>>>>>>>> supporting it in 2013?
>>>>>>>>>>
>>>>>>>>>> If that's not what we're suggesting, then when, roughly, can we
>>>>>>>>>> drop support? What are the criteria?
>>>>>>>>>>
>>>>>>>>>> I understand the practical concern here. If companies are stuck
>>>>>>>>>> using 2.6, it doesn't matter to them that it is deprecated. But 
>>>>>>>>>> balancing
>>>>>>>>>> that concern against the maintenance burden on this project, I would 
>>>>>>>>>> say
>>>>>>>>>> that "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable
>>>>>>>>>> position to take. There are many tiny annoyances one has to put up 
>>>>>>>>>> with to
>>>>>>>>>> support 2.6.
>>>>>>>>>>
>>>>>>>>>> I suppose if our main PySpark contributors are fine putting up
>>>>>>>>>> with those annoyances, then maybe we don't need to drop support just 
>>>>>>>>>> yet...
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <
>>>>>>>>>> ju...@esbet.es> wrote:
>>>>>>>>>>
>>>>>>>>>>> Unfortunately, Koert is right.
>>>>>>>>>>>
>>>>>>>>>>> I've been in a couple of projects using Spark (banking industry)
>>>>>>>>>>> where CentOS + Python 2.6 is the toolbox available.
>>>>>>>>>>>
>>>>>>>>>>> That said, I believe it should not be a concern for Spark.
>>>>>>>>>>> Python 2.6 is old and busted, which is totally opposite to the Spark
>>>>>>>>>>> philosophy IMO.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jan 5, 2016, at 8:07 PM, Koert Kuipers <ko...@tresata.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>>>>>>>>>
>>>>>>>>>>> if so, i still know plenty of large companies where python 2.6
>>>>>>>>>>> is the only option. asking them for python 2.7 is not going to work
>>>>>>>>>>>
>>>>>>>>>>> so i think its a bad idea
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>>>>>>>>>>> juliet.hougl...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don't see a reason Spark 2.0 would need to support Python
>>>>>>>>>>>> 2.6. At this point, Python 3 should be the default that is 
>>>>>>>>>>>> encouraged.
>>>>>>>>>>>> Most organizations acknowledge that 2.7 is common, but lagging
>>>>>>>>>>>> behind the version they should theoretically use. Dropping python 
>>>>>>>>>>>> 2.6
>>>>>>>>>>>> support sounds very reasonable to me.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>>>>>>>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Red Hat supports Python 2.6 on RHEL 5 until 2020
>>>>>>>>>>>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>,
>>>>>>>>>>>>> but otherwise yes, Python 2.6 is ancient history and the core 
>>>>>>>>>>>>> Python
>>>>>>>>>>>>> developers stopped supporting it in 2013. RHEL 5 is not a good 
>>>>>>>>>>>>> enough
>>>>>>>>>>>>> reason to continue support for Python 2.6 IMO.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We should aim to support Python 2.7 and Python 3.3+ (which I
>>>>>>>>>>>>> believe we currently do).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <
>>>>>>>>>>>>> allenzhang...@126.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> plus 1,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> we are currently using python 2.7.2 in our production environment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2016-01-05 18:11:45, "Meethu Mathew" <
>>>>>>>>>>>>>> meethu.mat...@flytxt.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>> We use Python 2.7
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Meethu Mathew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <
>>>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does anybody here care about us dropping support for Python
>>>>>>>>>>>>>>> 2.6 in Spark 2.0?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Python 2.6 is ancient, and is pretty slow in many aspects
>>>>>>>>>>>>>>> (e.g. json parsing) when compared with Python 2.7. Some 
>>>>>>>>>>>>>>> libraries that
>>>>>>>>>>>>>>> Spark depends on stopped supporting 2.6. We can still convince the library 
>>>>>>>>>>>>>>> the library
>>>>>>>>>>>>>>> maintainers to support 2.6, but it will be extra work. I'm 
>>>>>>>>>>>>>>> curious if
>>>>>>>>>>>>>>> anybody still uses Python 2.6 to run Spark.
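>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (If you want to see the json difference for yourself, a rough
>>>>>>>>>>>>>>> micro-benchmark you could run under both interpreters is
>>>>>>>>>>>>>>> something like the following; the sample document is made up.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     import json, timeit
>>>>>>>>>>>>>>>     doc = json.dumps({"id": 1, "tags": ["a", "b"], "score": 0.5})
>>>>>>>>>>>>>>>     # time repeated parses of the same small document
>>>>>>>>>>>>>>>     print(timeit.timeit(lambda: json.loads(doc), number=100000))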
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>>
>>
>
