Will pyspark users care much about the Hadoop version? They won't if running locally. They will if connecting to a Hadoop cluster, but in that context they're probably using a distro that harmonizes versions anyway. Hadoop 3's installed base can't be that large yet; it's been around far less time.
The bigger question is indeed dropping Hadoop 2.x / Hive 1.x etc. eventually, not now. But if the question now is just the build default, is it a big deal either way?

On Tue, Jun 23, 2020 at 11:03 PM Xiao Li <lix...@databricks.com> wrote:

> I think we just need to provide two options and let end users choose the
> ones they need: Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
> Hadoop 3.2+ Variant available in PyPI) is a high-priority task for the
> Spark 3.1 release to me.
>
> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based
> on this link
> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
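For context on the SPARK-32017 idea of two PyPI variants: one plausible shape for it is an environment variable read at pip-install time that picks which Hadoop jars get bundled. This is only a sketch from the perspective of this thread; the variable name `PYSPARK_HADOOP_VERSION` and the exact supported values are assumptions, and whatever ships with the actual Spark release may differ.

```shell
# Sketch only: a hypothetical variant selector for a PySpark source install.
# The env var name PYSPARK_HADOOP_VERSION is an assumption for illustration;
# SPARK-32017's released mechanism may use a different name or values.

# Default install (whatever Hadoop build the package bundles by default):
pip install pyspark

# Hypothetically pin the Hadoop 2.7 variant:
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark -v

# Hypothetically pin the Hadoop 3.2 variant:
PYSPARK_HADOOP_VERSION=3.2 pip install pyspark -v
```

The appeal of this approach over publishing two separate PyPI package names is that `pip install pyspark` stays the single entry point, with the cluster-specific choice made per environment rather than per package.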