Regarding Xiao's comment, I want to point out that Apache Spark 3.1.0 is
different from 2.3 or 2.4.

Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.

- Apache Spark 2.0.0 was released on July 26, 2016.
- Apache Spark 2.1.0 was released on December 28, 2016.

Bests,
Dongjoon.


On Sun, Oct 4, 2020 at 10:53 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Thank you all.
>
> BTW, Xiao and Mridul, I'm wondering what date you have in mind
> specifically.
>
> Usually, `Christmas and New Year season` doesn't give us much additional
> time.
>
> If you think so, could you open a PR against the Apache Spark website
> reflecting your expectation?
>
> https://spark.apache.org/versioning-policy.html
>
> Bests,
> Dongjoon.
>
>
> On Sun, Oct 4, 2020 at 7:18 AM Mridul Muralidharan <mri...@gmail.com>
> wrote:
>
>>
>> +1 on pushing back the branch cut for increased dev time to match
>> previous releases.
>>
>> Regards,
>> Mridul
>>
>> On Sat, Oct 3, 2020 at 10:22 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> Thank you for your updates.
>>>
>>> Spark 3.0 was released on Jun 18, 2020. If Nov 1st is the target date of
>>> the 3.1 branch cut, the feature development window is less than 5
>>> months. This is shorter than what we had for the Spark 2.3 and 2.4
>>> releases.
>>>
>>> Below are three highly desirable features I am watching. Hopefully,
>>> we can finish them before the branch cut.
>>>
>>>    - Support push-based shuffle to improve shuffle efficiency:
>>>    https://issues.apache.org/jira/browse/SPARK-30602
>>>    - Unify create table syntax:
>>>    https://issues.apache.org/jira/browse/SPARK-31257
>>>    - Bloom filter join:
>>>    https://issues.apache.org/jira/browse/SPARK-32268
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>>
Hyukjin Kwon <gurwls...@gmail.com> wrote on Sat, Oct 3, 2020, at 5:41 PM:
>>>
>>>> Nice summary. Thanks, Dongjoon. One minor correction: I believe we
>>>> dropped R 3.5 and below at branch 2.4 as well.
>>>>
>>>> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> As of today, the master branch (Apache Spark 3.1.0) has resolved
>>>>> 852+ JIRA issues, and 606+ of them are 3.1.0-only patches.
>>>>> According to the 3.1.0 release window, branch-3.1 will be
>>>>> created on November 1st and enter the QA period.
>>>>>
>>>>> Here are some notable updates I've been monitoring.
>>>>>
>>>>> *Language*
>>>>> 01. SPARK-25075 Support Scala 2.13
>>>>>       - Since SPARK-32926, the Scala 2.13 build test has
>>>>>         been part of the GitHub Actions jobs.
>>>>>       - After SPARK-33044, Scala 2.13 tests will be
>>>>>         part of the Jenkins jobs.
>>>>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>>>>> 03. SPARK-32082 Project Zen: Improving Python usability
>>>>>       - 7 of 16 issues are resolved.
>>>>> 04. SPARK-32073 Drop R < 3.5 support
>>>>>       - This is done for Spark 3.0.1 and 3.1.0.
>>>>>
>>>>> *Dependency*
>>>>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>>>>       - This changes the default distribution for better cloud support
>>>>> 06. SPARK-32981 Remove hive-1.2 distribution
>>>>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>>>>       - This will remove Hive 1.2.1 from the source code
>>>>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>>>>
>>>>> *Core*
>>>>> 09. SPARK-27495 Support Stage level resource conf and scheduling
>>>>>       - 11 of 15 issues are resolved
>>>>> 10. SPARK-25299 Use remote storage for persisting shuffle data
>>>>>       - 8 of 14 issues are resolved
>>>>>
>>>>> *Resource Manager*
>>>>> 11. SPARK-33005 Kubernetes GA preparation
>>>>>       - It is on the way and we are waiting for more feedback.
>>>>>
>>>>> *SQL*
>>>>> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>>>>>       to JSON/Avro
>>>>> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
>>>>> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>>>>>       - 11 of 17 issues are resolved
>>>>> 15. SPARK-27589 DSv2 was mostly completed in 3.0
>>>>>       and gained more features in 3.1, but we are still missing:
>>>>>       - All built-in DataSource v2 write paths are disabled
>>>>>         and the v1 write path is used instead.
>>>>>       - Support partition pruning with subqueries
>>>>>       - Support bucketing
>>>>>
>>>>> We still have one month before the feature freeze
>>>>> and the start of QA. If you are working on 3.1,
>>>>> please consider the timeline and share your schedule
>>>>> with the Apache Spark community. Anything else can
>>>>> go into the 3.2 release, scheduled for June 2021.
>>>>>
>>>>> Last but not least, I want to emphasize (7) once again.
>>>>> We need to remove the forked unofficial Hive eventually.
>>>>> Please let us know your reasons if you need to build
>>>>> Apache Spark 3.1 from source for Hive 1.2.
>>>>>
>>>>> https://github.com/apache/spark/pull/29936
>>>>>
>>>>> As I wrote in the PR description above, for old releases,
>>>>> Apache Spark 2.4 (LTS) and 3.0 (~2021.12) will provide
>>>>> Hive 1.2-based distributions.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>
