+1

On Wed, Jan 6, 2016 at 9:18 AM, Juliet Hougland <juliet.hougl...@gmail.com> wrote:
> Most admins I talk to about python and spark are already actively (or on their way to) managing their cluster python installations. Even if people begin using the system python with pyspark, there is eventually a user who needs a complex dependency (like pandas or sklearn) on the cluster. No admin would muck around installing libs into system python, so you end up with other python installations.
>
> Installing a non-system python is something users intending to use pyspark on a real cluster should be thinking about, eventually, anyway. It would work in situations where people are running pyspark locally or actively managing python installations on a cluster. There is an awkward middle point where someone has installed spark but not configured their cluster (by installing a non-default python) in any other way. Most clusters I see are RHEL/CentOS and have something other than system python used by spark.
>
> What libraries stopped supporting python 2.6, and where does spark use them? The "ease of transitioning to pyspark onto a cluster" problem may be an easier pill to swallow if it only affected something like mllib or spark sql and not parts of the core api. You end up hoping numpy or pandas are installed in the runtime components of spark anyway. At that point people really should just go install a non-system python. There are tradeoffs to using pyspark, and I feel pretty fine explaining to people that managing their cluster's python installations is something that comes with using pyspark.
>
> RHEL/CentOS is so common that this would probably be a little work for a lot of people.
>
> --Juliet
>
> On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> hey evil admin :)
>> i think the bit about java was from me?
>> if so, i meant to indicate that the reality for us is java is 1.7 on most (all?) clusters. i do not believe spark prefers java 1.8. my point was that even though java 1.7 is getting old as well, it would be a major issue for me if spark dropped java 1.7 support.
>>
>> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:
>>
>>> As one of the evil administrators who run a RHEL 6 cluster, we already provide quite a few different versions of python on our cluster pretty darn easily. All you need is a separate install directory and to set the PYTHON_HOME environment variable to point to the correct python, then have the users make sure the correct python is in their PATH. I understand that other administrators may not be so compliant.
>>>
>>> Saw a small bit about the java version in there; does Spark currently prefer Java 1.8.x?
>>>
>>> —Ken
>>>
>>> On Jan 5, 2016, at 6:08 PM, Josh Rosen <joshro...@databricks.com> wrote:
>>>
>>>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver while continuing to use a vanilla `python` executable on the executors
>>>
>>> Whoops, just to be clear, this should actually read "while continuing to use a vanilla `python` 2.7 executable".
>>>
>>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen <joshro...@databricks.com> wrote:
>>>
>>>> Yep, the driver and executors need to have compatible Python versions. I think that there are some bytecode-level incompatibilities between 2.6 and 2.7 which would impact the deserialization of Python closures, so I think you need to be running the same 2.x version for all communicating Spark processes. Note that you _can_ use a Python 2.7 `ipython` executable on the driver while continuing to use a vanilla `python` executable on the executors (we have environment variables which allow you to control these separately).
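To make the mechanics above concrete: a minimal sketch of launching PySpark against a non-system Python, using the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables Spark reads at startup. The /opt/python-2.7 install prefix is a hypothetical stand-in for wherever the admin put the extra interpreter.

```bash
# Executors run PySpark closures with this interpreter, so it must be
# version-compatible with the driver's Python (same 2.x minor version).
export PYSPARK_PYTHON=/opt/python-2.7/bin/python

# The driver alone may run a fancier shell such as IPython, as long as
# it is built on the same Python minor version as the executors use.
export PYSPARK_DRIVER_PYTHON=/opt/python-2.7/bin/ipython

# Keep the chosen install first on PATH, per Ken's advice above.
export PATH=/opt/python-2.7/bin:$PATH

./bin/pyspark
```

The same pair of variables is what lets an admin pin a cluster-wide default in conf/spark-env.sh while individual users override the driver side per session.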
>>>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> I think all the slaves need the same (or a compatible) version of Python installed, since they run Python code in PySpark jobs natively.
>>>>>
>>>>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> interesting, i didn't know that!
>>>>>>
>>>>>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>
>>>>>>> even if python 2.7 was needed only on this one machine that launches the app, we cannot ship it with our software because it's gpl licensed
>>>>>>>
>>>>>>> Not to nitpick, but maybe this is important. The Python license is GPL-compatible but not GPL <https://docs.python.org/3/license.html>:
>>>>>>>
>>>>>>> Note GPL-compatible doesn't mean that we're distributing Python under the GPL. All Python licenses, unlike the GPL, let you distribute a modified version without making your changes open source. The GPL-compatible licenses make it possible to combine Python with other software that is released under the GPL; the others don't.
>>>>>>>
>>>>>>> Nick
>>>>>>>
>>>>>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>
>>>>>>>> i do not think so.
>>>>>>>>
>>>>>>>> does the python 2.7 need to be installed on all slaves? if so, we do not have direct access to those.
>>>>>>>>
>>>>>>>> also, spark is easy for us to ship with our software since it's apache 2 licensed, and it only needs to be present on the machine that launches the app (thanks to yarn). even if python 2.7 was needed only on this one machine that launches the app, we cannot ship it with our software because it's gpl licensed, so the client would have to download it and install it themselves, and this would mean it's an independent install which has to be audited and approved, and now you are in for a lot of fun. basically it will never happen.
>>>>>>>>
>>>>>>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshro...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> If users are able to install Spark 2.0 on their RHEL clusters, then I imagine that they're also capable of installing a standalone Python alongside that Spark version (without changing Python system-wide). For instance, Anaconda/Miniconda makes it really easy to install Python 2.7.x/3.x without impacting or changing the system Python, and it doesn't require any special permissions to install (you don't need root/sudo access). Does this address the Python versioning concerns for RHEL users?
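A concrete sketch of this suggestion, assuming the Miniconda installer's -b (batch) and -p (prefix) flags and the download URL conventions of the time; treat the paths and package choices as illustrative assumptions.

```bash
# Fetch and run the Python 2.7 Miniconda installer entirely as an
# unprivileged user: -b = batch (no prompts), -p = install prefix.
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh -b -p "$HOME/miniconda2"

# Add the heavy numeric dependencies users tend to need on the cluster.
"$HOME/miniconda2/bin/conda" install -y numpy pandas scikit-learn

# Point PySpark at the new interpreter.
export PYSPARK_PYTHON="$HOME/miniconda2/bin/python"
```

Nothing here requires root, which is the point: the system Python in /usr/bin that RHEL's own tooling depends on stays exactly as shipped.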
>>>>>>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>
>>>>>>>>>> yeah, the practical concern is that we have no control over the java or python version on large company clusters. our current reality for the vast majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>>>>>>>>
>>>>>>>>>> i don't like it either, but i cannot change it.
>>>>>>>>>>
>>>>>>>>>> we currently don't use pyspark so i have no stake in this, but if we did i can assure you we would not upgrade to spark 2.x if python 2.6 was dropped. no point in developing something that doesn't run for the majority of customers.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> As I pointed out in my earlier email, RHEL will support Python 2.6 until 2020. So I'm assuming these large companies will have the option of riding out Python 2.6 until then.
>>>>>>>>>>>
>>>>>>>>>>> Are we seriously saying that Spark should likewise support Python 2.6 for the next several years? Even though the core Python devs stopped supporting it in 2013?
>>>>>>>>>>>
>>>>>>>>>>> If that's not what we're suggesting, then when, roughly, can we drop support? What are the criteria?
>>>>>>>>>>>
>>>>>>>>>>> I understand the practical concern here. If companies are stuck using 2.6, it doesn't matter to them that it is deprecated. But balancing that concern against the maintenance burden on this project, I would say that "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to take. There are many tiny annoyances one has to put up with to support 2.6.
>>>>>>>>>>>
>>>>>>>>>>> I suppose if our main PySpark contributors are fine putting up with those annoyances, then maybe we don't need to drop support just yet...
>>>>>>>>>>>
>>>>>>>>>>> Nick
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <ju...@esbet.es> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, Koert is right.
>>>>>>>>>>>>
>>>>>>>>>>>> I've been in a couple of projects using Spark (banking industry) where CentOS + Python 2.6 is the toolbox available.
>>>>>>>>>>>>
>>>>>>>>>>>> That said, I believe it should not be a concern for Spark. Python 2.6 is old and busted, which is totally opposite to the Spark philosophy IMO.
>>>>>>>>>>>>
>>>>>>>>>>>> On Jan 5, 2016, at 8:07 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> rhel/centos 6 ships with python 2.6, doesn't it?
>>>>>>>>>>>>
>>>>>>>>>>>> if so, i still know plenty of large companies where python 2.6 is the only option. asking them for python 2.7 is not going to work.
>>>>>>>>>>>>
>>>>>>>>>>>> so i think it's a bad idea.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <juliet.hougl...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this point, Python 3 should be the default that is encouraged. Most organizations acknowledge that 2.7 is common, but lagging behind the version they should theoretically use. Dropping python 2.6 support sounds very reasonable to me.
>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Red Hat supports Python 2.6 on RHEL 5 until 2020 <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>, but otherwise yes, Python 2.6 is ancient history and the core Python developers stopped supporting it in 2013. RHEL 5 is not a good enough reason to continue support for Python 2.6 IMO.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we currently do).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <allenzhang...@126.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> plus 1,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> we are currently using python 2.7.2 in our production environment.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2016-01-05 18:11:45, "Meethu Mathew" <meethu.mat...@flytxt.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>> We use Python 2.7
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Meethu Mathew
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does anybody here care about us dropping support for Python 2.6 in Spark 2.0?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json parsing) when compared with Python 2.7. Some libraries that Spark depends on stopped supporting 2.6. We can still convince the library maintainers to support 2.6, but it will be extra work. I'm curious if anybody still uses Python 2.6 to run Spark.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.

--
Best Regards

Jeff Zhang