Re: Apache Spark 3.2 Expectation

Chang Chen Wed, 03 Mar 2021 06:33:28 -0800

+1 for Data Source V2 Aggregate push down

huaxin gao <huaxin.ga...@gmail.com> wrote on Sat, Feb 27, 2021 at 4:20 AM:


> Thanks Dongjoon and Xiao for the discussion. I would like to add Data
> Source V2 Aggregate push down to the list. I am currently working on
> JDBC Data Source V2 Aggregate push down, but the common code can be used
> for the file based V2 Data Source as well. For example, MAX and MIN can be
> pushed down to Parquet and Orc, since they can use statistics information
> to perform these operations efficiently. Quite a few users are
> interested in this Aggregate push down feature and the preliminary
> performance test for JDBC Aggregate push down is positive. So I think it is
> a valuable feature to add for Spark 3.2.
>
> Thanks,
> Huaxin
>
> On Fri, Feb 26, 2021 at 11:13 AM Xiao Li <gatorsm...@gmail.com> wrote:
>
>> Thank you, Dongjoon, for initiating this discussion. Let us keep it open.
>> It might take 1-2 weeks to collect from the community all the features
>> we plan to build and ship in 3.2 since we just finished the 3.1 voting.
>>
>>
>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
>>> in April because we took 3 month for Spark 3.1 release.
>>
>>
>> TBH, cutting the branch this April does not look good to me. That means,
>> we only have one month left for feature development of Spark 3.2. Do we
>> have enough features in the current master branch? If not, are we able to
>> finish major features we collected here? Do they have a timeline or project
>> plan?
>>
>> Xiao
>>
>> Dongjoon Hyun <dongjoon.h...@gmail.com> wrote on Fri, Feb 26, 2021 at 10:07 AM:
>>
>>> Thank you, Mridul and Sean.
>>>
>>> 1. Yes, `2017` was a typo. Java 17 is scheduled for September 2021. And, of
>>> course, it's a nice-to-have status. :)
>>>
>>> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks
>>> for sharing.
>>>
>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
>>> in April because we took 3 month for Spark 3.1 release.
>>>     Let's update our release roadmap of the Apache Spark website.
>>>
>>> > I'd roughly expect 3.2 in, say, July of this year, given the usual
>>> cadence. No reason it couldn't be a little sooner or later. There is
>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>> months.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> I'd roughly expect 3.2 in, say, July of this year, given the usual
>>>> cadence. No reason it couldn't be a little sooner or later. There is
>>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>>> months.
>>>>
>>>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>>>>> December 2020, March seems to be a good time to share our thoughts and
>>>>> aspirations on Apache Spark 3.2.
>>>>>
>>>>> According to the progress on Apache Spark 3.1 release, Apache Spark
>>>>> 3.2 seems to be the last minor release of this year. Given the timeframe,
>>>>> we might consider the following. (This is a small set. Please add your
>>>>> thoughts to this limited list.)
>>>>>
>>>>> # Languages
>>>>>
>>>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>>>> and investigating the publishing issue. Thank you for your contributions
>>>>> and feedback on this.
>>>>>
>>>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
>>>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>>>
>>>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>>>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>>>> prepare for it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>>>>
>>>>> - SparkR CRAN publishing: As we know, it is currently discontinued.
>>>>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
>>>>> If that succeeds, we can keep publishing. Otherwise, I believe
>>>>> we had better drop it from the releasing work item list officially.
>>>>>
>>>>> # Dependencies
>>>>>
>>>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
>>>>> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>>>>> shaded clients via SPARK-33212. So far, there is one on-going report in the
>>>>> YARN environment. We hope it will be fixed soon in the Spark 3.2 timeframe and
>>>>> we can move toward Hadoop 3.3.2.
>>>>>
>>>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>>>>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>>>>> official dependency via SPARK-32981. We are steadily improving this area
>>>>> and will consume Hive 2.3.9 if available.
>>>>>
>>>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>>>> support K8s model 1.19.
>>>>>
>>>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using
>>>>> Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with
>>>>> Scala 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>>>>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>>>>> with Kafka Client 2.8 hopefully.
>>>>>
>>>>> # Some Features
>>>>>
>>>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>>>> Spark 3.2 and become an additional foundation.
>>>>>
>>>>> - Columnar Encryption: As of today, Apache Spark master branch
>>>>> supports columnar encryption via Apache ORC 1.6 and it's documented via
>>>>> SPARK-34036. Also, upcoming Apache Parquet 1.12 has a similar capability.
>>>>> Hopefully, Apache Spark 3.2 is going to be the first release to have this
>>>>> feature officially. Any feedback is welcome.
>>>>>
>>>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>>>>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>>>>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>>>>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>>>>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>>>>> too. I'm expecting more benefits.
>>>>>
>>>>> - Structured Streaming with RocksDB backend: According to the latest
>>>>> update, it looks active enough for merging to master branch in Spark 3.2.
>>>>>
>>>>> Please share your thoughts and let's build better Apache Spark 3.2
>>>>> together.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>
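
Huaxin's note in this thread says MAX and MIN can be pushed down to Parquet and Orc because those formats carry statistics. A minimal sketch of that idea follows, in plain Python rather than Spark code. The RowGroup class, its field names, and min_max_from_stats are hypothetical and chosen only for illustration; they are not the Spark Data Source V2 API or the Parquet/ORC reader API. The sketch assumes each row group may record a per-column min and max, and it falls back to scanning values only where those statistics are missing.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group or ORC stripe."""
    values: List[int]                # the column data itself
    stat_min: Optional[int] = None   # recorded statistic, if the writer kept one
    stat_max: Optional[int] = None


def min_max_from_stats(groups: List[RowGroup]) -> Tuple[int, int, int]:
    """Return (min, max, rows_scanned) for one column.

    Recorded statistics are used where present; a row group's values are
    scanned only when its statistics are missing.
    """
    lo: Optional[int] = None
    hi: Optional[int] = None
    rows_scanned = 0
    for g in groups:
        if g.stat_min is not None and g.stat_max is not None:
            g_lo, g_hi = g.stat_min, g.stat_max
        else:
            if not g.values:
                continue  # empty group without statistics contributes nothing
            g_lo, g_hi = min(g.values), max(g.values)
            rows_scanned += len(g.values)
        lo = g_lo if lo is None else min(lo, g_lo)
        hi = g_hi if hi is None else max(hi, g_hi)
    if lo is None or hi is None:
        raise ValueError("column has no rows")
    return lo, hi, rows_scanned


groups = [
    RowGroup([3, 9, 4], stat_min=3, stat_max=9),
    RowGroup([12, 1]),                      # statistics missing, must be scanned
    RowGroup([7, 8], stat_min=7, stat_max=8),
]
result = min_max_from_stats(groups)
print(result)  # (1, 12, 2)
```

Only two of the seven rows are read here, which is the kind of saving the statistics-based push down is aiming for.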
