It seems to me that with SPARK-20202 we are no longer planning to support
Hadoop 2 + Hive 1.2. Is that correct?

So basically Spark 3.1 will no longer run on, say, CDH 5.x or HDP 2.x with
Hive?

My use case is building Spark 3.1 and launching it on existing clusters
that are not managed by me, i.e. I do not use the Spark version provided by
Cloudera. However, there is a workaround for me (use an older Spark version
to extract the data out of Hive, then switch to the newer Spark version),
so I am not too worried about this. Just making sure I understand.
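
For reference, that workaround is roughly the following (a sketch only; the
table name and staging path are hypothetical):

    // Step 1: run with Spark 2.4, which still bundles the Hive 1.2 client,
    // to read the Hive table and stage it as Parquet.
    import org.apache.spark.sql.SparkSession

    val spark24 = SparkSession.builder()
      .appName("extract-from-hive")
      .enableHiveSupport()
      .getOrCreate()

    spark24.table("warehouse.events")        // hypothetical Hive table
      .write.mode("overwrite")
      .parquet("hdfs:///staging/events")     // hypothetical staging path

    // Step 2: in a separate job running Spark 3.1 (no Hive 1.2 client
    // needed), read the staged Parquet directly.
    val spark31 = SparkSession.builder()
      .appName("read-staged")
      .getOrCreate()

    val df = spark31.read.parquet("hdfs:///staging/events")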

Thanks

On Sat, Oct 3, 2020 at 8:17 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Hi, All.
>
> As of today, the master branch (Apache Spark 3.1.0) has
> resolved 852+ JIRA issues, and 606+ of those are 3.1.0-only
> patches. According to the 3.1.0 release window, branch-3.1
> will be created on November 1st and will enter the QA period.
>
> Here are some notable updates I've been monitoring.
>
> *Language*
> 01. SPARK-25075 Support Scala 2.13
>       - Since SPARK-32926, the Scala 2.13 build test has
>         been part of the GitHub Actions jobs.
>       - After SPARK-33044, the Scala 2.13 tests will also
>         become part of the Jenkins jobs.
> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
> 03. SPARK-32082 Project Zen: Improving Python usability
>       - 7 of 16 issues are resolved.
> 04. SPARK-32073 Drop R < 3.5 support
>       - This is done for Spark 3.0.1 and 3.1.0.
>
> *Dependency*
> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>       - This changes the default distribution for better
>         cloud support.
> 06. SPARK-32981 Remove hive-1.2 distribution
> 07. SPARK-20202 Remove references to org.spark-project.hive
>       - This will remove Hive 1.2.1 from the source code.
> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>
> *Core*
> 09. SPARK-27495 Support stage-level resource configuration
>       and scheduling
>       - 11 of 15 issues are resolved (see the sketch after
>         this list).
> 10. SPARK-25299 Use remote storage for persisting shuffle data
>       - 8 of 14 issues are resolved
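>
> Regarding (9), a rough sketch of the stage-level scheduling
> API from the user side (the resource amounts and the GPU
> discovery script path are hypothetical, and as far as I know
> this currently requires dynamic allocation):
>
>     import org.apache.spark.resource.{ExecutorResourceRequests,
>       ResourceProfileBuilder, TaskResourceRequests}
>     import org.apache.spark.sql.SparkSession
>
>     val spark = SparkSession.builder()
>       .appName("stage-level-demo")
>       .getOrCreate()
>     val rdd = spark.sparkContext.parallelize(1 to 100, 4)
>
>     // Executor-side requirements for stages using this profile.
>     val execReqs = new ExecutorResourceRequests()
>       .cores(4)
>       .memory("8g")
>       .resource("gpu", 2, "/opt/spark/getGpus.sh")  // hypothetical script
>
>     // Per-task requirements.
>     val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
>
>     val profile = new ResourceProfileBuilder()
>       .require(execReqs)
>       .require(taskReqs)
>       .build()
>
>     // Only the stages computing this RDD get the requested resources.
>     rdd.withResources(profile).map(_ * 2).collect()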
>
> *Resource Manager*
> 11. SPARK-33005 Kubernetes GA preparation
>       - It is on the way and we are waiting for more feedback.
>
> *SQL*
> 12. SPARK-30648/SPARK-32346 Support filter pushdown
>       for JSON/Avro
> 13. SPARK-32948/SPARK-32958 Add a JSON expression optimizer
> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>       - 11 of 17 issues are resolved (see the sketch after
>         this list).
> 15. SPARK-27589 DSv2 was mostly completed in 3.0 and
>       gained more features in 3.1, but we are still missing:
>       - All built-in DataSource v2 write paths are disabled,
>         and the v1 write path is used instead.
>       - Support for partition pruning with subqueries.
>       - Support for bucketing.
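>
> Regarding (14), a minimal sketch of what the keytab support
> looks like from the user side (the JDBC URL, table, keytab
> path, and principal below are hypothetical; the keytab file
> must be readable where the connection is opened):
>
>     import org.apache.spark.sql.SparkSession
>
>     val spark = SparkSession.builder()
>       .appName("jdbc-kerberos-demo")
>       .getOrCreate()
>
>     // Spark obtains a Kerberos ticket with the given keytab and
>     // principal before opening the JDBC connection.
>     val df = spark.read.format("jdbc")
>       .option("url", "jdbc:postgresql://db.example.com:5432/metrics")
>       .option("dbtable", "public.events")
>       .option("keytab", "/etc/security/keytabs/etl.keytab")
>       .option("principal", "etl@EXAMPLE.COM")
>       .load()
>
>     df.show()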
>
> We still have one month before the feature freeze and the
> start of QA. If you are working on something for 3.1, please
> consider the timeline and share your schedule with the Apache
> Spark community. Everything else can go into the 3.2 release,
> scheduled for June 2021.
>
> Last but not least, I want to emphasize (7) once again.
> We need to remove the forked, unofficial Hive eventually.
> Please let us know your reasons if you need to build
> Apache Spark 3.1 from source with Hive 1.2 support.
>
> https://github.com/apache/spark/pull/29936
>
> As I wrote in the above PR description, for older releases,
> Apache Spark 2.4 (LTS) and 3.0 (supported through ~2021.12)
> will continue to provide Hive 1.2-based distributions.
>
> Bests,
> Dongjoon.
>
