If the subtext is vendors, then I'd have a look at what recent distros
actually ship. I'll write about CDH as a representative example, but I
think other distros are broadly similar.

CDH has been on Java 8, Hadoop 2.6, and Python 2.7 for almost two years
(CDH 5.3 / Dec 2014). Granted, this depends on installing on an OS with
that Java / Python version, but Java 8 / Python 2.7 is available for all of
the supported OSes. The population that isn't on CDH 4 (support for which
was dropped in Spark a long time ago), is on a version released 2-2.5 years
ago, and won't update, amounts to a couple percent of the installed base.
In general, they don't want anything to change at all.

I assure everyone that vendors, too, want to cater to the crowd that wants
the most recent version of everything. For example, CDH offers both Spark
2.0.1 and 1.6 at the same time.

I wouldn't dismiss upstream support for these components as a relevant
proxy for whether they are worth supporting in Spark. Java 7 is long since
EOL (no, I don't count paying Oracle for support). No vendor is supporting
Hadoop < 2.6. Scala 2.10 was EOL at the end of 2014. Is there a criterion
here that reaches a different conclusion about these things just for Spark?
This was roughly the same conversation that happened 6 months ago.

I imagine we're going to find that in about 6 months it'll make more sense
all around to remove these. If we can just give a heads up with deprecation
and then kick the can down the road a bit more, that sounds like enough for
now.
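
Concretely, the heads-up could be as simple as a startup warning. Here's a
minimal sketch of the kind of thing I mean, in plain Scala; the object name
and messages are hypothetical, not Spark's actual code:

    // Hypothetical sketch: warn at startup when running on a platform
    // slated for removal, without failing.
    object DeprecatedPlatformCheck {
      def warnIfDeprecated(): Unit = {
        // "java.version" is a standard JVM system property, e.g. "1.7.0_80"
        val javaVersion = sys.props.getOrElse("java.version", "unknown")
        if (javaVersion.startsWith("1.7")) {
          Console.err.println(
            "WARNING: Support for Java 7 is deprecated and may be removed " +
            "in a future release. Please upgrade to Java 8.")
        }
        // versionNumberString gives e.g. "2.10.6" or "2.11.8"
        val scalaVersion = scala.util.Properties.versionNumberString
        if (scalaVersion.startsWith("2.10")) {
          Console.err.println(
            "WARNING: Support for Scala 2.10 is deprecated and may be " +
            "removed in a future release.")
        }
      }

      def main(args: Array[String]): Unit = warnIfDeprecated()
    }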

On Fri, Oct 28, 2016 at 8:58 AM Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Deprecating them is fine (and I know they're already deprecated), the
> question is just whether to remove them. For example, what exactly is the
> downside of having Python 2.6 or Java 7 right now? If it's high, then we
> can remove them, but I just haven't seen a ton of details. It also sounded
> like fairly recent versions of CDH, HDP, RHEL, etc still have old versions
> of these.
>
> Just talking with users, I've seen many people who say "we have a
> Hadoop cluster from $VENDOR, but we just download Spark from Apache and run
> newer versions of that". That's great for Spark IMO, and we need to stay
> compatible even with somewhat older Hadoop installs because they are
> time-consuming to update. Having the whole community on a small set of
> versions leads to a better experience for everyone and also to more of a
> "network effect": more people can battle-test new versions, answer
> questions about them online, write libraries that easily reach the majority
> of Spark users, etc.
>
