Apache Spark 3.2 Expectation

Dongjoon Hyun Thu, 25 Feb 2021 08:56:53 -0800

Hi, All.

We have been preparing Apache Spark 3.2.0 in the master branch since
December 2020, so March seems to be a good time to share our thoughts and
aspirations for Apache Spark 3.2.


Given the progress of the Apache Spark 3.1 release, Apache Spark 3.2
seems likely to be the last minor release of this year. Given the
timeframe, we might consider the following. (This is a small set. Please
add your thoughts to this limited list.)

# Languages

- Scala 2.13 Support: This was expected in 3.1 via SPARK-25075 but slipped.
Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and
investigating the publishing issue. Thank you for your contributions and
feedback on this.

- Java 17 LTS Support: Java 17 LTS will arrive in September 2021. As with
Java 11, we need lots of support from our dependencies. Let's see.

- Python 3.6 Deprecation(?): Python 3.6 community support ends on
2021-12-23. So, the deprecation is not required yet, but we had better
prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.

- SparkR CRAN publishing: As we know, it has been discontinued for now.
Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
If that succeeds, we can keep publishing. Otherwise, I believe we had
better officially drop it from the release work item list.

# Dependencies

- Apache Hadoop 3.3.2: Hadoop 3.2.0 became the default Hadoop profile in
Apache Spark 3.1. Currently, the Spark master branch lives on Hadoop
3.2.2's shaded clients via SPARK-33212. So far, there is one ongoing issue
report in the YARN environment. We hope it will be fixed soon within the
Spark 3.2 timeframe so that we can move toward Hadoop 3.3.2.

- Apache Hive 2.3.9: Spark 3.0 started to use Hive 2.3.7 by default instead
of the old Hive 1.2 fork. Spark 3.1 removed the hive-1.2 profile completely
via SPARK-32981 and replaced the generated hive-service-rpc code with the
official dependency via SPARK-32981. We are steadily improving this area
and will consume Hive 2.3.9 if it becomes available.

- K8s Client 4.13.2: During the K8s GA activity, Spark 3.1 upgraded the K8s
client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
support K8s model 1.19.

- Kafka Client 2.8: To bring in the client fixes, Spark 3.1 is using Kafka
Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
2.12.13, but it was reverted later due to a Scala 2.12.13 issue. Since
KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will
hopefully go with Kafka Client 2.8.

# Some Features

- Data Source v2: Spark 3.2 will deliver a much richer DSv2 with Apache
Iceberg integration. In particular, we hope the ongoing function catalog
SPIP and the upcoming storage-partitioned join SPIP can be delivered as
part of Spark 3.2 and become an additional foundation.

- Columnar Encryption: As of today, the Apache Spark master branch supports
columnar encryption via Apache ORC 1.6, and it's documented via
SPARK-34036. Also, the upcoming Apache Parquet 1.12 has a similar
capability. Hopefully, Apache Spark 3.2 is going to be the first release
to have this feature officially. Any feedback is welcome.

- Improved Zstandard Support: Spark 3.2 will bring more benefits for
Zstandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
for all IO operations, 2) SPARK-33978 makes the ORC data source support
ZSTD compression, 3) SPARK-34503 sets ZSTD as the default codec for event
log compression, 4) SPARK-34479 aims to support ZSTD in the Avro data
source. Also, the upcoming Parquet 1.12 supports ZSTD (and supports the JNI
buffer pool), too. I'm expecting more benefits.

- Structured Streaming with a RocksDB backend: According to the latest
update, it looks active enough for merging to the master branch in Spark
3.2.

Please share your thoughts and let's build a better Apache Spark 3.2
together.

Bests,
Dongjoon.
