Sure, thank you, Hyukjin.

Bests,
Dongjoon.

On Fri, Feb 26, 2021 at 4:01 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> I have an idea that I'll send an email to discuss next week or the week
> after. I did not have enough bandwidth to drive both together at the same
> time. I would appreciate it if we had some more time for 3.2.
>
> In addition, it would also be great if we follow the schedule and catch
> potential blockers quickly during QA instead of when we cut RCs. That will
> considerably speed up the process and make it on time.
>
> Thanks.
>
>
> On Sat, 27 Feb 2021, 06:00 Dongjoon Hyun, <dongjoon.h...@gmail.com> wrote:
>
>> Thank you for sharing your plan, Huaxin!
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Feb 26, 2021 at 12:20 PM huaxin gao <huaxin.ga...@gmail.com>
>> wrote:
>>
>>> Thanks Dongjoon and Xiao for the discussion. I would like to add Data
>>> Source V2 Aggregate push down to the list. I am currently working on
>>> JDBC Data Source V2 Aggregate push down, but the common code can be used
>>> for the file based V2 Data Source as well. For example, MAX and MIN can
>>> be pushed down to Parquet and ORC, since they can use statistics
>>> information to perform these operations efficiently. Quite a few users
>>> are interested in this Aggregate push down feature and the preliminary
>>> performance test for JDBC Aggregate push down is positive. So I think it
>>> is a valuable feature to add for Spark 3.2.
>>>
>>> Thanks,
>>> Huaxin
>>>
>>> On Fri, Feb 26, 2021 at 11:13 AM Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> Thank you, Dongjoon, for initiating this discussion. Let us keep it
>>>> open. It might take 1-2 weeks to collect from the community all the
>>>> features we plan to build and ship in 3.2 since we just finished the 3.1
>>>> voting.
>>>>
>>>>
>>>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
>>>>> `branch-cut` in April because we took 3 months for Spark 3.1 release.
>>>>
>>>>
>>>> TBH, cutting the branch this April does not look good to me.
>>>> That means we only have one month left for feature development of
>>>> Spark 3.2. Do we have enough features in the current master branch? If
>>>> not, are we able to finish the major features we collected here? Do they
>>>> have a timeline or project plan?
>>>>
>>>> Xiao
>>>>
>>>> On Fri, Feb 26, 2021 at 10:07 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> Thank you, Mridul and Sean.
>>>>>
>>>>> 1. Yes, `2017` was a typo. Java 17 is scheduled for September 2021.
>>>>> And, of course, it's a nice-to-have status. :)
>>>>>
>>>>> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks
>>>>> for sharing.
>>>>>
>>>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
>>>>> `branch-cut` in April because we took 3 months for Spark 3.1 release.
>>>>> Let's update our release roadmap of the Apache Spark website.
>>>>>
>>>>> > I'd roughly expect 3.2 in, say, July of this year, given the usual
>>>>> > cadence. No reason it couldn't be a little sooner or later. There is
>>>>> > already some good stuff in 3.2 and will be a good minor release in 5-6
>>>>> > months.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> I'd roughly expect 3.2 in, say, July of this year, given the usual
>>>>>> cadence. No reason it couldn't be a little sooner or later. There is
>>>>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>>>>> months.
>>>>>>
>>>>>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> Since we have been preparing Apache Spark 3.2.0 in master branch
>>>>>>> since December 2020, March seems to be a good time to share our
>>>>>>> thoughts and aspirations on Apache Spark 3.2.
>>>>>>>
>>>>>>> According to the progress on Apache Spark 3.1 release, Apache Spark
>>>>>>> 3.2 seems to be the last minor release of this year.
>>>>>>> Given the timeframe, we might consider the following. (This is a
>>>>>>> small set. Please add your thoughts to this limited list.)
>>>>>>>
>>>>>>> # Languages
>>>>>>>
>>>>>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>>>>>> slipped out. Currently, we are trying to use Scala 2.13.5 via
>>>>>>> SPARK-34505 and investigating the publishing issue. Thank you for
>>>>>>> your contributions and feedback on this.
>>>>>>>
>>>>>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017.
>>>>>>> Like Java 11, we need lots of support from our dependencies. Let's
>>>>>>> see.
>>>>>>>
>>>>>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>>>>>> 2021-12-23. So, the deprecation is not required yet, but we had
>>>>>>> better prepare it because we don't have an ETA of Apache Spark 3.3
>>>>>>> in 2022.
>>>>>>>
>>>>>>> - SparkR CRAN publishing: As we know, it's discontinued so far.
>>>>>>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN
>>>>>>> publishing. If it succeeds to revive it, we can keep publishing.
>>>>>>> Otherwise, I believe we had better drop it from the releasing work
>>>>>>> item list officially.
>>>>>>>
>>>>>>> # Dependencies
>>>>>>>
>>>>>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop
>>>>>>> profile in Apache Spark 3.1. Currently, Spark master branch lives on
>>>>>>> Hadoop 3.2.2's shaded clients via SPARK-33212. So far, there is one
>>>>>>> on-going report at YARN environment. We hope it will be fixed soon at
>>>>>>> Spark 3.2 timeframe and we can move toward Hadoop 3.3.2.
>>>>>>>
>>>>>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>>>>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile
>>>>>>> completely via SPARK-32981 and replaced the generated
>>>>>>> hive-service-rpc code with the official dependency via SPARK-32981.
>>>>>>> We are steadily improving this area and will consume Hive 2.3.9 if
>>>>>>> available.
>>>>>>>
>>>>>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>>>>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order
>>>>>>> to support K8s model 1.19.
>>>>>>>
>>>>>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using
>>>>>>> Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7
>>>>>>> with Scala 2.12.13, but it was reverted later due to Scala 2.12.13
>>>>>>> issue. Since KAFKA-12357 fixed the Scala requirement two days ago,
>>>>>>> Spark 3.2 will go with Kafka Client 2.8 hopefully.
>>>>>>>
>>>>>>> # Some Features
>>>>>>>
>>>>>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with
>>>>>>> Apache Iceberg integration. Especially, we hope the on-going function
>>>>>>> catalog SPIP and up-coming storage partitioned join SPIP can be
>>>>>>> delivered as a part of Spark 3.2 and become an additional foundation.
>>>>>>>
>>>>>>> - Columnar Encryption: As of today, Apache Spark master branch
>>>>>>> supports columnar encryption via Apache ORC 1.6 and it's documented
>>>>>>> via SPARK-34036. Also, upcoming Apache Parquet 1.12 has a similar
>>>>>>> capability. Hopefully, Apache Spark 3.2 is going to be the first
>>>>>>> release to have this feature officially. Any feedback is welcome.
>>>>>>>
>>>>>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>>>>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool
>>>>>>> support for all IO operations, 2) SPARK-33978 makes ORC datasource
>>>>>>> support ZSTD compression, 3) SPARK-34503 sets ZSTD as the default
>>>>>>> codec for event log compression, 4) SPARK-34479 aims to support ZSTD
>>>>>>> at Avro data source. Also, the upcoming Parquet 1.12 supports ZSTD
>>>>>>> (and supports JNI buffer pool), too. I'm expecting more benefits.
>>>>>>>
>>>>>>> - Structured Streaming with RocksDB backend: According to the latest
>>>>>>> update, it looks active enough for merging to master branch in Spark
>>>>>>> 3.2.
>>>>>>>
>>>>>>> Please share your thoughts and let's build better Apache Spark 3.2
>>>>>>> together.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>