Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-25 Thread Hyukjin Kwon
Thanks, Xiao. I will close this vote within a couple of hours.


Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-25 Thread Xiao Li
I confirmed that Q17 and Q39a/b have matching results between Spark 3.0 and
3.1 after enabling spark.sql.legacy.statisticalAggregate; the result changes
are expected. For more details, you can read the PR
(https://github.com/apache/spark/pull/29983/). Also, the result of Q18 is
affected by the overflow checking in Spark. These issues exist in all the
releases. We will continue to improve our ANSI mode and fix them in the
upcoming releases.
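
For reference, a minimal sketch of how one might re-check a query under the
legacy behavior (a sketch, assuming a SparkSession named "spark"; the query
file path is hypothetical):

  // Re-enable the pre-3.1 statistical aggregate behavior, then re-run a
  // TPC-DS query to compare against the Spark 3.0 result.
  spark.conf.set("spark.sql.legacy.statisticalAggregate", "true")
  val q17 = scala.io.Source.fromFile("tpcds/queries/q17.sql").mkString
  spark.sql(q17).show()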

Thus, I change my vote from -1 to +1.

As Ismaël suggested, we can add some GitHub Actions to validate the TPC-DS
and TPC-H results for small-scale datasets.

Cheers,

Xiao



Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-25 Thread Ismaël Mejía
Since the TPC-DS performance tests are one of the main validation sources
for regressions on Spark releases, maybe it is time to automate the query
output validation to find correctness issues eagerly (it would also be nice
to validate performance regressions, but correctness >>> performance).

This has been a long-standing open issue [1] that is probably worth
addressing, and it seems that automating this via GitHub Actions could be
relatively straightforward.

[1] https://github.com/databricks/spark-sql-perf/issues/184
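
A minimal sketch of the kind of golden-file check such a workflow could run
at a small scale factor (the paths and the helper name are hypothetical):

  import java.nio.file.{Files, Paths}
  import org.apache.spark.sql.SparkSession

  // Run one TPC-DS query and diff its sorted output against a checked-in
  // golden file; any mismatch is a potential correctness regression.
  def checkQuery(spark: SparkSession, name: String): Unit = {
    val sql = new String(Files.readAllBytes(Paths.get(s"tpcds/queries/$name.sql")))
    val actual = spark.sql(sql).collect().map(_.toString).sorted.mkString("\n")
    val golden = new String(Files.readAllBytes(Paths.get(s"tpcds/golden/$name.out")))
    assert(actual == golden.trim, s"$name: output differs from the golden file")
  }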


On Wed, Feb 24, 2021 at 8:15 PM Reynold Xin  wrote:

> +1 Correctness issues are serious!
>
>
> On Wed, Feb 24, 2021 at 11:08 AM, Mridul Muralidharan 
> wrote:
>
>> That is indeed cause for concern.
>> +1 on extending the voting deadline until we finish investigating this.
>>
>> Regards,
>> Mridul
>>
>>
>> On Wed, Feb 24, 2021 at 12:55 PM Xiao Li  wrote:
>>
>>> -1 Could we extend the voting deadline?
>>>
>>> A few TPC-DS queries (q17, q18, q39a, q39b) are returning different
>>> results between Spark 3.0 and Spark 3.1. We need a few more days to
>>> understand whether these changes are expected.
>>>
>>> Xiao
>>>
>>>
>>> Mridul Muralidharan wrote on Wed, Feb 24, 2021 at 10:41 AM:
>>>

 Sounds good, thanks for clarifying, Hyukjin!
 +1 on release.

 Regards,
 Mridul


 On Wed, Feb 24, 2021 at 2:46 AM Hyukjin Kwon 
 wrote:

> I remember HiveExternalCatalogVersionsSuite was flaky for a while; that was
> fixed in https://github.com/apache/spark/commit/0d5d248bdc4cdc71627162a3d20c42ad19f24ef4.
> KafkaDelegationTokenSuite is also flaky
> (https://issues.apache.org/jira/browse/SPARK-31250).
>
> On Wed, Feb 24, 2021 at 5:19 PM, Mridul Muralidharan wrote:
>
>>
>> Signatures, digests, etc. check out fine.
>> Checked out the tag and built/tested with -Pyarn -Phadoop-2.7 -Phive
>> -Phive-thriftserver -Pmesos -Pkubernetes.
>>
>> I keep getting test failures with
>> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
>> * org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
>> (Note: I remove the $HOME/.m2 and $HOME/.ivy2 paths before building.)
>>
>> Removing these suites gets the build through, though - does anyone
>> have suggestions on how to fix it? I did not face this with RC1.
>>
>> Regards,
>> Mridul
>>
>>
>> On Mon, Feb 22, 2021 at 12:57 AM Hyukjin Kwon 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 3.1.1.
>>>
>>> The vote is open until February 24th 11PM PST and passes if a
>>> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.1.1-rc3 (commit
>>> 1d550c4e90275ab418b9161925049239227f3dc9):
>>> https://github.com/apache/spark/tree/v3.1.1-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1367
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>>>
>>> The list of bug fixes going into 3.1.1 can be found at the following
>>> URL:
>>> https://s.apache.org/41kf2
>>>
>>> This release is using the release script of the tag v3.1.1-rc3.
>>>
>>> FAQ
>>>
>>> ===
>>> What happened to 3.1.0?
>>> ===
>>>
>>> There was a technical issue during Apache Spark 3.1.0 preparation,
>>> and it was discussed and decided to skip 3.1.0.
>>> Please see
>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>>> more details.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running it on this release candidate,
>>> then reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env and install
>>> the current RC via "pip install
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz"
>>> and see if anything important breaks.
>>> In the Java/Scala, you can add the 
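
For Java/Scala projects, a sketch of resolving the RC from the staging
repository listed above (assuming an sbt build; the module chosen is
illustrative):

  // build.sbt: pull the 3.1.1 RC artifacts from the staging repository.
  resolvers += "Apache Spark 3.1.1 RC3 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1367"
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1"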

Re: Apache Spark 3.2 Expectation

2021-02-25 Thread Sean Owen
I'd roughly expect 3.2 in, say, July of this year, given the usual cadence.
No reason it couldn't be a little sooner or later. There is already some
good stuff in 3.2, and it will be a good minor release in 5-6 months.


Re: Apache Spark 3.2 Expectation

2021-02-25 Thread Mridul Muralidharan
Nit: Java 17 -> should be available by Sept 2021 :-)
Adoption would also depend on some of our nontrivial dependencies
supporting it - it might be a stretch to get it in for Apache Spark 3.2?

Features:
Push-based shuffle and disaggregated shuffle should also be in 3.2.


Regards,
Mridul


Apache Spark 3.2 Expectation

2021-02-25 Thread Dongjoon Hyun
Hi, All.

We have been preparing Apache Spark 3.2.0 in the master branch since
December 2020, so March seems to be a good time to share our thoughts and
aspirations for Apache Spark 3.2.

According to the progress on the Apache Spark 3.1 release, Apache Spark 3.2
seems to be the last minor release of this year. Given the timeframe, we
might consider the following. (This is a small set; please add your
thoughts to this limited list.)

# Languages

- Scala 2.13 Support: This was expected in 3.1 via SPARK-25075 but slipped
out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and
investigating the publishing issue. Thank you for your contributions and
feedback on this.

- Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like Java
11, we need lots of support from our dependencies. Let's see.

- Python 3.6 Deprecation(?): Python 3.6 community support ends on
2021-12-23. So the deprecation is not required yet, but we had better
prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.

- SparkR CRAN publishing: As we know, it has been discontinued so far.
Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
If that succeeds in reviving it, we can keep publishing. Otherwise, I
believe we had better officially drop it from the release work item list.

# Dependencies

- Apache Hadoop 3.3.2: Hadoop 3.2.0 became the default Hadoop profile in
Apache Spark 3.1. Currently, the Spark master branch lives on Hadoop
3.2.2's shaded clients via SPARK-33212. So far, there is one ongoing
report in the YARN environment. We hope it will be fixed soon, within the
Spark 3.2 timeframe, so that we can move toward Hadoop 3.3.2.

- Apache Hive 2.3.9: Spark 3.0 started to use Hive 2.3.7 by default
instead of the old Hive 1.2 fork. Spark 3.1 removed the hive-1.2 profile
completely via SPARK-32981 and replaced the generated hive-service-rpc
code with the official dependency via SPARK-32981. We are steadily
improving this area and will consume Hive 2.3.9 when available.

- K8s Client 4.13.2: During the K8s GA activity, Spark 3.1 upgraded the
K8s client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order
to support K8s model 1.19.

- Kafka Client 2.8: To bring in the client fixes, Spark 3.1 is using
Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with
Scala 2.12.13, but it was reverted later due to a Scala 2.12.13 issue.
Since KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will
hopefully go with Kafka Client 2.8.

# Some Features

- Data Source v2: Spark 3.2 will deliver a much richer DSv2 with Apache
Iceberg integration. In particular, we hope the ongoing function catalog
SPIP and upcoming storage partitioned join SPIP can be delivered as part
of Spark 3.2 and become an additional foundation.

- Columnar Encryption: As of today, the Apache Spark master branch
supports columnar encryption via Apache ORC 1.6, and this is documented
via SPARK-34036. Also, the upcoming Apache Parquet 1.12 has a similar
capability. Hopefully, Apache Spark 3.2 is going to be the first release
to have this feature officially. Any feedback is welcome.
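
A hedged sketch of what writing an encrypted ORC file can look like (the
option keys follow Apache ORC's configuration names; the key provider
setup, column name, and output path are assumptions for illustration):

  // Write a DataFrame "df" with its "ssn" column encrypted; readers
  // without the key see the masked (nullified) value instead.
  df.write.format("orc")
    .option("hadoop.security.key.provider.path", "kms://http@localhost:9600/kms")
    .option("orc.key.provider", "hadoop")
    .option("orc.encrypt", "pii:ssn")
    .option("orc.mask", "nullify:ssn")
    .save("/tmp/orc_encrypted")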

- Improved ZStandard Support: Spark 3.2 will bring more benefits to
ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
for all IO operations, 2) SPARK-33978 makes the ORC data source support
ZSTD compression, 3) SPARK-34503 sets ZSTD as the default codec for event
log compression, and 4) SPARK-34479 aims to support ZSTD in the Avro data
source. Also, the upcoming Parquet 1.12 supports ZSTD (and supports the
JNI buffer pool), too. I'm expecting more benefits.
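
For example, items 2) and 3) can be exercised as follows (a sketch; the
output path is hypothetical, and "spark" is an existing SparkSession):

  // 2) ORC data source with ZSTD compression (SPARK-33978):
  spark.range(1000).write.option("compression", "zstd").orc("/tmp/orc_zstd")

  // 3) ZSTD event log compression, set in spark-defaults.conf:
  //    spark.eventLog.compress            true
  //    spark.eventLog.compression.codec   zstd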

- Structured Streaming with RocksDB backend: According to the latest
update, it looks active enough to merge into the master branch in Spark 3.2.

Please share your thoughts, and let's build a better Apache Spark 3.2
together.

Bests,
Dongjoon.

