Re: Apache Spark 3.2 Expectation

Xiao Li Wed, 10 Mar 2021 12:41:38 -0800

Below are some nice-to-have features we can work on in Spark 3.2: Lateral
Join support <https://issues.apache.org/jira/browse/SPARK-28379>, interval
data type, timestamp without time zone, un-nesting arbitrary queries, the
returned metrics of DSV2, and error message standardization. Spark 3.2 will
be another exciting release I believe!


Go Spark!

Xiao




On Wed, Mar 10, 2021 at 12:25 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, Xiao.
>
> This thread started 13 days ago. Since you asked the community about major
> features or timelines at that time, could you share your roadmap or
> expectations if you have something in mind?
>
> > Thank you, Dongjoon, for initiating this discussion. Let us keep it
> > open. It might take 1-2 weeks to collect from the community all the
> > features we plan to build and ship in 3.2 since we just finished the 3.1
> > voting.
> >
> > TBH, cutting the branch this April does not look good to me. That means,
> > we only have one month left for feature development of Spark 3.2. Do we
> > have enough features in the current master branch? If not, are we able to
> > finish major features we collected here? Do they have a timeline or project
> > plan?
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, John.
>>
>> This thread aims to share your expectations and goals (and maybe work
>> progress) for Apache Spark 3.2 because we are making this together. :)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge <jzh...@apache.org> wrote:
>>
>>> Hi Dongjoon,
>>>
>>> Is it possible to get ViewCatalog in? The community already had fairly
>>> detailed discussions.
>>>
>>> Thanks,
>>> John
>>>
>>> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> We have been preparing Apache Spark 3.2.0 in the master branch since
>>>> December 2020, so March seems to be a good time to share our thoughts and
>>>> aspirations on Apache Spark 3.2.
>>>>
>>>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>>>> seems to be the last minor release of this year. Given the timeframe, we
>>>> might consider the following. (This is a small set. Please add your
>>>> thoughts to this limited list.)
>>>>
>>>> # Languages
>>>>
>>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>>> and investigating the publishing issue. Thank you for your contributions
>>>> and feedback on this.
>>>>
>>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. As with
>>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>>
>>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>>> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>>>>
>>>> - SparkR CRAN publishing: As we know, it has been discontinued for now.
>>>> Resuming it depends on whether the Apache SparkR 3.1.1 CRAN publishing
>>>> succeeds. If that revives it, we can keep publishing. Otherwise, I believe
>>>> we had better officially drop it from the release work item list.
>>>>
>>>> # Dependencies
>>>>
>>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 became the default Hadoop profile
>>>> in Apache Spark 3.1. Currently, the Spark master branch uses Hadoop 3.2.2's
>>>> shaded clients via SPARK-33212. So far, there is one ongoing issue report
>>>> in the YARN environment. We hope it will be fixed soon within the Spark 3.2
>>>> timeframe so that we can move toward Hadoop 3.3.2.
>>>>
>>>> - Apache Hive 2.3.9: Spark 3.0 started to use Hive 2.3.7 by default
>>>> instead of the old Hive 1.2 fork. Spark 3.1 removed the hive-1.2 profile
>>>> completely via SPARK-32981 and replaced the generated hive-service-rpc code
>>>> with the official dependency via SPARK-32981. We are steadily improving
>>>> this area and will consume Hive 2.3.9 if available.
>>>>
>>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>>> support K8s model 1.19.
>>>>
>>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>>>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>>>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>>>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>>>> with Kafka Client 2.8 hopefully.
>>>>
>>>> # Some Features
>>>>
>>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>>> Spark 3.2 and become an additional foundation.
>>>>
>>>> - Columnar Encryption: As of today, Apache Spark master branch supports
>>>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>>>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>>>> Apache Spark 3.2 is going to be the first release to have this feature
>>>> officially. Any feedback is welcome.
>>>>
>>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>>>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>>>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>>>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>>>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>>>> too. I'm expecting more benefits.
>>>>
>>>> - Structured Streaming with RocksDB backend: According to the latest
>>>> update, it looks active enough to be merged to the master branch in Spark 3.2.
>>>>
>>>> Please share your thoughts and let's build better Apache Spark 3.2
>>>> together.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
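
The Python 3.6 item in the quoted proposal could be prepared for with a runtime warning. The following is only a minimal sketch, not PySpark's actual code: the helper name `warn_if_python36` and the message text are hypothetical, and only the end-of-support date comes from the thread.

```python
import sys
import warnings

# Date taken from the thread: Python 3.6 community support ends on 2021-12-23.
PYTHON_36_EOL = "2021-12-23"


def warn_if_python36(version_info):
    """Emit a FutureWarning when running on Python 3.6 (hypothetical helper).

    `version_info` is a (major, minor, ...) tuple such as `sys.version_info`.
    Returns True if a warning was emitted, False otherwise.
    """
    if tuple(version_info[:2]) == (3, 6):
        warnings.warn(
            "Python 3.6 support is deprecated; community support ends on "
            + PYTHON_36_EOL + ".",
            FutureWarning,
        )
        return True
    return False


if __name__ == "__main__":
    # Check the interpreter that is running this script.
    warn_if_python36(sys.version_info)
```

A warning rather than a hard failure matches the point made in the thread: deprecation is not required yet, but users get notice before a later release drops support.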

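For the ZStandard item in the quoted proposal, here is a rough sketch of how a user might opt in to ZSTD explicitly. The helper `render_spark_defaults` is hypothetical. The property names are, to the best of my knowledge, the ones listed in the Spark configuration documentation, but please verify them and their accepted values against the docs for your Spark version.

```python
# Assumed settings: keys as recalled from the Spark configuration docs;
# verify them for your Spark version before relying on this.
ZSTD_SETTINGS = {
    "spark.eventLog.enabled": "true",
    "spark.eventLog.compress": "true",
    "spark.eventLog.compression.codec": "zstd",
    "spark.io.compression.codec": "zstd",
    "spark.io.compression.zstd.bufferPool.enabled": "true",
    "spark.sql.orc.compression.codec": "zstd",
}


def render_spark_defaults(settings):
    """Return settings as spark-defaults.conf text, one 'key value' per line.

    Hypothetical helper for illustration; keys are emitted in sorted order.
    """
    return "\n".join(
        "{} {}".format(key, value) for key, value in sorted(settings.items())
    )


if __name__ == "__main__":
    # Print lines that could be appended to conf/spark-defaults.conf.
    print(render_spark_defaults(ZSTD_SETTINGS))
```

Each line of the output could equally be passed as a `--conf key=value` option to `spark-submit`.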