I don't think that we're planning to drop Java 7 support for Spark 2.0. Personally, I would recommend using Java 8 if you're running Spark 1.5.0+ and are using SQL/DataFrames, so that you can benefit from improvements to code cache flushing in the Java 8 JVMs. Spark SQL's generated classes can fill up the JVM's code cache, which causes the JIT to stop compiling new bytecode. Empirically, the Java 8 JVMs seem to do a better job of flushing this code cache, thereby avoiding the problem.
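If you do run into code cache exhaustion on an older JVM, one mitigation is to raise the reserved code cache size and make sure code cache flushing is enabled. A rough sketch (these are standard HotSpot flags, but the 512m figure is just a guess; tune it for your workload):

    # in conf/spark-defaults.conf
    spark.driver.extraJavaOptions    -XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing
    spark.executor.extraJavaOptions  -XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing

A "CodeCache is full. Compiler has been disabled." warning in the JVM output is the usual sign that you're hitting this.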
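Also, since Python versions come up repeatedly below: the environment variables I mention in my quoted reply are PYSPARK_PYTHON (the Python executable used on the executors) and PYSPARK_DRIVER_PYTHON (the one used on the driver). A rough sketch of a user-local setup with Miniconda, no root required (the installer URL and paths are illustrative; check Continuum's site for the current installer):

    # install a standalone Python 2.7 into the user's home directory
    wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
    bash Miniconda2-latest-Linux-x86_64.sh -b -p $HOME/miniconda2

    # point PySpark at it; the driver can run ipython while the executors
    # run the plain python from a compatible 2.7 install
    export PYSPARK_PYTHON=$HOME/miniconda2/bin/python
    export PYSPARK_DRIVER_PYTHON=$HOME/miniconda2/bin/ipython
    ./bin/pyspark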
TL;DR: I'd prefer to run Java 8 with Spark if given the choice.

On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers <ko...@tresata.com> wrote:

> hey evil admin:)
> i think the bit about java was from me?
> if so, i meant to indicate that the reality for us is java 1.7 on most
> (all?) clusters. i do not believe spark prefers java 1.8. my point was
> that even though java 1.7 is getting old as well, it would be a major
> issue for me if spark dropped java 1.7 support.
>
> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken <carli...@janelia.hhmi.org>
> wrote:
>
>> As one of the evil administrators that runs a RHEL 6 cluster, we
>> already provide quite a few different versions of python on our cluster
>> pretty darn easily. All you need is a separate install directory and to
>> set the PYTHON_HOME environment variable to point to the correct python,
>> then have the users make sure the correct python is in their PATH. I
>> understand that other administrators may not be so compliant.
>>
>> Saw a small bit about the java version in there; does Spark currently
>> prefer Java 1.8.x?
>>
>> —Ken
>>
>> On Jan 5, 2016, at 6:08 PM, Josh Rosen <joshro...@databricks.com> wrote:
>>
>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>>> while continuing to use a vanilla `python` executable on the executors
>>
>> Whoops, just to be clear, this should actually read "while continuing
>> to use a vanilla `python` 2.7 executable".
>>
>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen <joshro...@databricks.com>
>> wrote:
>>
>>> Yep, the driver and executors need to have compatible Python versions.
>>> I think that there are some bytecode-level incompatibilities between 2.6
>>> and 2.7 which would impact the deserialization of Python closures, so I
>>> think you need to be running the same 2.x version for all communicating
>>> Spark processes. Note that you _can_ use a Python 2.7 `ipython`
>>> executable on the driver while continuing to use a vanilla `python`
>>> executable on the executors (we have environment variables which allow
>>> you to control these separately).
>>>
>>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> I think all the slaves need the same (or a compatible) version of
>>>> Python installed, since they run Python code in PySpark jobs natively.
>>>>
>>>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> interesting, i didnt know that!
>>>>>
>>>>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> even if python 2.7 was needed only on this one machine that launches
>>>>>> the app we can not ship it with our software because its gpl licensed
>>>>>>
>>>>>> Not to nitpick, but maybe this is important. The Python license is
>>>>>> GPL-compatible but not GPL <https://docs.python.org/3/license.html>:
>>>>>>
>>>>>> Note GPL-compatible doesn’t mean that we’re distributing Python
>>>>>> under the GPL. All Python licenses, unlike the GPL, let you distribute
>>>>>> a modified version without making your changes open source. The
>>>>>> GPL-compatible licenses make it possible to combine Python with other
>>>>>> software that is released under the GPL; the others don’t.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> i do not think so.
>>>>>>>
>>>>>>> does the python 2.7 need to be installed on all slaves? if so, we
>>>>>>> do not have direct access to those.
>>>>>>>
>>>>>>> also, spark is easy for us to ship with our software since its
>>>>>>> apache 2 licensed, and it only needs to be present on the machine
>>>>>>> that launches the app (thanks to yarn).
>>>>>>> even if python 2.7 was needed only on this one machine that
>>>>>>> launches the app, we can not ship it with our software because its
>>>>>>> gpl licensed, so the client would have to download and install it
>>>>>>> themselves, and this would mean its an independent install which has
>>>>>>> to be audited and approved, and now you are in for a lot of fun.
>>>>>>> basically it will never happen.
>>>>>>>
>>>>>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshro...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If users are able to install Spark 2.0 on their RHEL clusters,
>>>>>>>> then I imagine that they're also capable of installing a standalone
>>>>>>>> Python alongside that Spark version (without changing Python
>>>>>>>> systemwide). For instance, Anaconda/Miniconda make it really easy to
>>>>>>>> install Python 2.7.x/3.x without impacting / changing the system
>>>>>>>> Python, and they don't require any special permissions to install
>>>>>>>> (you don't need root / sudo access). Does this address the Python
>>>>>>>> versioning concerns for RHEL users?
>>>>>>>>
>>>>>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> yeah, the practical concern is that we have no control over the
>>>>>>>>> java or python version on large company clusters. our current
>>>>>>>>> reality for the vast majority of them is java 7 and python 2.6, no
>>>>>>>>> matter how outdated that is.
>>>>>>>>>
>>>>>>>>> i dont like it either, but i cannot change it.
>>>>>>>>>
>>>>>>>>> we currently don't use pyspark so i have no stake in this, but if
>>>>>>>>> we did i can assure you we would not upgrade to spark 2.x if python
>>>>>>>>> 2.6 was dropped. no point in developing something that doesnt run
>>>>>>>>> for the majority of customers.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>>>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> As I pointed out in my earlier email, RHEL will support Python
>>>>>>>>>> 2.6 until 2020. So I'm assuming these large companies will have
>>>>>>>>>> the option of riding out Python 2.6 until then.
>>>>>>>>>>
>>>>>>>>>> Are we seriously saying that Spark should likewise support
>>>>>>>>>> Python 2.6 for the next several years? Even though the core Python
>>>>>>>>>> devs stopped supporting it in 2013?
>>>>>>>>>>
>>>>>>>>>> If that's not what we're suggesting, then when, roughly, can we
>>>>>>>>>> drop support? What are the criteria?
>>>>>>>>>>
>>>>>>>>>> I understand the practical concern here. If companies are stuck
>>>>>>>>>> using 2.6, it doesn't matter to them that it is deprecated. But
>>>>>>>>>> balancing that concern against the maintenance burden on this
>>>>>>>>>> project, I would say that "upgrade to Python 2.7 or stay on Spark
>>>>>>>>>> 1.6.x" is a reasonable position to take. There are many tiny
>>>>>>>>>> annoyances one has to put up with to support 2.6.
>>>>>>>>>>
>>>>>>>>>> I suppose if our main PySpark contributors are fine putting up
>>>>>>>>>> with those annoyances, then maybe we don't need to drop support
>>>>>>>>>> just yet...
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <
>>>>>>>>>> ju...@esbet.es> wrote:
>>>>>>>>>>
>>>>>>>>>>> Unfortunately, Koert is right.
>>>>>>>>>>>
>>>>>>>>>>> I've been in a couple of projects using Spark (banking
>>>>>>>>>>> industry) where CentOS + Python 2.6 is the toolbox available.
>>>>>>>>>>>
>>>>>>>>>>> That said, I believe it should not be a concern for Spark.
>>>>>>>>>>> Python 2.6 is old and busted, which is totally opposite to the
>>>>>>>>>>> Spark philosophy IMO.
>>>>>>>>>>>
>>>>>>>>>>> On Jan 5, 2016, at 20:07, Koert Kuipers <ko...@tresata.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>>>>>>>>>
>>>>>>>>>>> if so, i still know plenty of large companies where python 2.6
>>>>>>>>>>> is the only option. asking them for python 2.7 is not going to
>>>>>>>>>>> work.
>>>>>>>>>>>
>>>>>>>>>>> so i think its a bad idea.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>>>>>>>>>>> juliet.hougl...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don't see a reason Spark 2.0 would need to support Python
>>>>>>>>>>>> 2.6. At this point, Python 3 should be the default that is
>>>>>>>>>>>> encouraged. Most organizations acknowledge that 2.7 is common,
>>>>>>>>>>>> but lagging behind the version they should theoretically use.
>>>>>>>>>>>> Dropping python 2.6 support sounds very reasonable to me.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>>>>>>>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Red Hat supports Python 2.6 on RHEL 5 until 2020
>>>>>>>>>>>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>,
>>>>>>>>>>>>> but otherwise yes, Python 2.6 is ancient history and the core
>>>>>>>>>>>>> Python developers stopped supporting it in 2013. RHEL 5 is not
>>>>>>>>>>>>> a good enough reason to continue support for Python 2.6 IMO.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We should aim to support Python 2.7 and Python 3.3+ (which I
>>>>>>>>>>>>> believe we currently do).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <
>>>>>>>>>>>>> allenzhang...@126.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> plus 1,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> we are currently using python 2.7.2 in our production
>>>>>>>>>>>>>> environment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2016-01-05 18:11:45, "Meethu Mathew" <
>>>>>>>>>>>>>> meethu.mat...@flytxt.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>> We use Python 2.7.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Meethu Mathew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <
>>>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does anybody here care about us dropping support for
>>>>>>>>>>>>>>> Python 2.6 in Spark 2.0?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Python 2.6 is ancient, and is pretty slow in many aspects
>>>>>>>>>>>>>>> (e.g. json parsing) when compared with Python 2.7. Some
>>>>>>>>>>>>>>> libraries that Spark depends on have stopped supporting 2.6.
>>>>>>>>>>>>>>> We could still convince the library maintainers to support
>>>>>>>>>>>>>>> 2.6, but it would be extra work. I'm curious if anybody still
>>>>>>>>>>>>>>> uses Python 2.6 to run Spark.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.