If the subtext is vendors, then I'd look at what recent distros actually ship. I'll write about CDH as a representative example, but I expect other distros are broadly similar.
CDH has been on Java 8, Hadoop 2.6, and Python 2.7 for almost two years (since CDH 5.3, December 2014). Granted, this depends on installing on an OS with those Java / Python versions, but Java 8 and Python 2.7 are available for all of the supported OSes. Setting aside CDH 4, whose support was dropped in Spark a long time ago, the population that is on a version released 2-2.5 years ago and won't update is a couple percent of the installed base, and in general they do not want anything to change at all. I assure everyone that vendors, too, want to cater to the crowd that wants the most recent version of everything; for example, CDH offers both Spark 2.0.1 and Spark 1.6 at the same time.

I wouldn't dismiss vendor support for these underlying components as a relevant proxy for whether they are worth supporting in Spark. Java 7 is long since EOL (no, I don't count paying Oracle for extended support). No vendor is supporting Hadoop < 2.6. Scala 2.10 reached EOL at the end of 2014. Is there a criterion here that reaches a different conclusion about these things just for Spark?

This is roughly the same conversation that happened 6 months ago, and I imagine that in about 6 months it will make more sense all around to remove these. If we can give a heads-up with deprecation now and kick the can down the road a bit more, that sounds like enough for now.

On Fri, Oct 28, 2016 at 8:58 AM Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Deprecating them is fine (and I know they're already deprecated); the
> question is just whether to remove them. For example, what exactly is the
> downside of keeping Python 2.6 or Java 7 right now? If it's high, then we
> can remove them, but I just haven't seen a ton of details. It also sounded
> like fairly recent versions of CDH, HDP, RHEL, etc. still have old versions
> of these.
> Just talking with users, I've seen many people who say "we have a
> Hadoop cluster from $VENDOR, but we just download Spark from Apache and run
> newer versions of that". That's great for Spark IMO, and we need to stay
> compatible even with somewhat older Hadoop installs because they are
> time-consuming to update. Having the whole community on a small set of
> versions leads to a better experience for everyone and also to more of a
> "network effect": more people can battle-test new versions, answer
> questions about them online, write libraries that easily reach the majority
> of Spark users, etc.
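For anyone following along, the deprecate-then-remove pattern being debated above usually amounts to a runtime version check with two thresholds. Here is a minimal, hypothetical sketch in Python (the names `MIN_SUPPORTED`, `DEPRECATED`, and `check_python_version` are illustrative, not Spark's actual code):

```python
import sys
import warnings

# Hypothetical thresholds for a deprecate-then-remove policy.
MIN_SUPPORTED = (2, 7)  # versions below this are rejected outright (removed)
DEPRECATED = (3, 0)     # versions below this still work but emit a warning

def check_python_version(version_info=sys.version_info):
    """Warn on deprecated Python versions; fail on removed ones."""
    current = (version_info[0], version_info[1])
    if current < MIN_SUPPORTED:
        # Removal phase: fail fast with a clear message.
        raise RuntimeError(
            "Python %d.%d is no longer supported; please upgrade." % current)
    if current < DEPRECATED:
        # Deprecation phase: keep working, but give users a heads-up.
        warnings.warn(
            "Support for Python %d.%d is deprecated and will be removed "
            "in a future release." % current, DeprecationWarning)
```

The warning phase is exactly the "heads up with deprecation" described above: users get at least one release cycle of notice before removal actually lands.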