Hello All,

Current status of features from the allow list for branch-3.3 is:

IN PROGRESS:

   1. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
   2. SPARK-28516: Data Type Formatting Functions: `to_char`
   3. SPARK-34079: Improvement CTE table scan

IN PROGRESS but won't/couldn't be merged to branch-3.3:

   1. SPARK-37650: Tell spark-env.sh the python interpreter
   2. SPARK-36664: Log time spent waiting for cluster resources
   3. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
   4. SPARK-37395: Inline type hint files for files in python/pyspark/ml
   5. SPARK-37093: Inline type hints python/pyspark/streaming

RESOLVED:

   1. SPARK-32268: Bloom Filter Join
   2. SPARK-38548: New SQL function: try_sum
   3. SPARK-38063: Support SQL split_part function
   4. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
   5. SPARK-34863: Support nested column in Spark Parquet vectorized readers
   6. SPARK-38194: Make Yarn memory overhead factor configurable
   7. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
   8. SPARK-37831: Add task partition id in metrics
   9. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
   10. SPARK-38590: New SQL function: try_to_binary
   11. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
   12. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
   13. SPARK-34659: Web UI does not correctly get appId
   14. SPARK-38589: New SQL function: try_avg

Max Gekk
Software Engineer
Databricks, Inc.

On Mon, Apr 4, 2022 at 9:27 PM Maxim Gekk <[email protected]> wrote:

> Hello All,
>
> Below is current status of features from the allow list:
>
> IN PROGRESS:
>
>    1. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
>    2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
>    3. SPARK-37093: Inline type hints python/pyspark/streaming
>    4. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
>    5. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
>    6. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
>    7. SPARK-28516: Data Type Formatting Functions: `to_char`
>    8. SPARK-36664: Log time spent waiting for cluster resources
>    9. SPARK-34659: Web UI does not correctly get appId
>    10. SPARK-37650: Tell spark-env.sh the python interpreter
>    11. SPARK-38589: New SQL function: try_avg
>    12. SPARK-38590: New SQL function: try_to_binary
>    13. SPARK-34079: Improvement CTE table scan
>
> RESOLVED:
>
>    1. SPARK-32268: Bloom Filter Join
>    2. SPARK-38548: New SQL function: try_sum
>    3. SPARK-38063: Support SQL split_part function
>    4. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
>    5. SPARK-34863: Support nested column in Spark Parquet vectorized readers
>    6. SPARK-38194: Make Yarn memory overhead factor configurable
>    7. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
>    8. SPARK-37831: Add task partition id in metrics
>    9. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
>
> We need to decide whether we are going to wait a little bit more or close the doors.
>
> Maxim Gekk
> Software Engineer
> Databricks, Inc.
>
> On Fri, Mar 18, 2022 at 9:22 AM Maxim Gekk <[email protected]> wrote:
>
>> Hi All,
>>
>> Here is the allow list which I built based on your requests in this thread:
>>
>>    1. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
>>    2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
>>    3. SPARK-37093: Inline type hints python/pyspark/streaming
>>    4. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
>>    5. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
>>    6. SPARK-32268: Bloom Filter Join
>>    7. SPARK-38548: New SQL function: try_sum
>>    8. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
>>    9. SPARK-38063: Support SQL split_part function
>>    10. SPARK-28516: Data Type Formatting Functions: `to_char`
>>    11. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
>>    12. SPARK-34863: Support nested column in Spark Parquet vectorized readers
>>    13. SPARK-38194: Make Yarn memory overhead factor configurable
>>    14. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
>>    15. SPARK-37831: Add task partition id in metrics
>>    16. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
>>    17. SPARK-36664: Log time spent waiting for cluster resources
>>    18. SPARK-34659: Web UI does not correctly get appId
>>    19. SPARK-37650: Tell spark-env.sh the python interpreter
>>    20. SPARK-38589: New SQL function: try_avg
>>    21. SPARK-38590: New SQL function: try_to_binary
>>    22. SPARK-34079: Improvement CTE table scan
>>
>> Best regards,
>> Max Gekk
>>
>> On Thu, Mar 17, 2022 at 4:59 PM Tom Graves <[email protected]> wrote:
>>
>>> Is the feature freeze target date March 22nd then? I saw a few dates thrown around, want to confirm what we landed on.
>>>
>>> I am trying to get the following improvements finished review and in, if concerns with either, let me know:
>>> - [SPARK-34079][SQL] Merge non-correlated scalar subqueries <https://github.com/apache/spark/pull/32298#>
>>> - [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service for released executors <https://github.com/apache/spark/pull/35085#>
>>>
>>> Tom
>>>
>>> On Thursday, March 17, 2022, 07:24:41 AM CDT, Gengliang Wang <[email protected]> wrote:
>>>
>>> I'd like to add the following new SQL functions in the 3.3 release. These functions are useful when overflow or encoding errors occur:
>>>
>>> - [SPARK-38548][SQL] New SQL function: try_sum <https://github.com/apache/spark/pull/35848>
>>> - [SPARK-38589][SQL] New SQL function: try_avg <https://github.com/apache/spark/pull/35896>
>>> - [SPARK-38590][SQL] New SQL function: try_to_binary <https://github.com/apache/spark/pull/35897>
>>>
>>> Gengliang
>>>
>>> On Thu, Mar 17, 2022 at 7:59 AM Andrew Melo <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> I've been trying for a bit to get the following two PRs merged and
>>> into a release, and I'm having some difficulty moving them forward:
>>>
>>> https://github.com/apache/spark/pull/34903 - This passes the current
>>> python interpreter to spark-env.sh to allow some currently-unavailable
>>> customization to happen
>>> https://github.com/apache/spark/pull/31774 - This fixes a bug in the
>>> SparkUI reverse proxy-handling code where it does a greedy match for
>>> "proxy" in the URL, and will mistakenly replace the App-ID in the
>>> wrong place.
>>>
>>> I'm not exactly sure of how to get attention of PRs that have been
>>> sitting around for a while, but these are really important to our
>>> use-cases, and it would be nice to have them merged in.
>>>
>>> Cheers
>>> Andrew
>>>
>>> On Wed, Mar 16, 2022 at 6:21 PM Holden Karau <[email protected]> wrote:
>>> >
>>> > I'd like to add/backport the logging in https://github.com/apache/spark/pull/35881 PR so that when users submit issues with dynamic allocation we can better debug what's going on.
>>> >
>>> > On Wed, Mar 16, 2022 at 3:45 PM Chao Sun <[email protected]> wrote:
>>> >>
>>> >> There is one item on our side that we want to backport to 3.3:
>>> >> - vectorized DELTA_BYTE_ARRAY/DELTA_LENGTH_BYTE_ARRAY encodings for
>>> >> Parquet V2 support (https://github.com/apache/spark/pull/35262)
>>> >>
>>> >> It's already reviewed and approved.
>>> >>
>>> >> On Wed, Mar 16, 2022 at 9:13 AM Tom Graves <[email protected]> wrote:
>>> >> >
>>> >> > It looks like the version hasn't been updated on master and still shows 3.3.0-SNAPSHOT, can you please update that.
>>> >> >
>>> >> > Tom
>>> >> >
>>> >> > On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk <[email protected]> wrote:
>>> >> >
>>> >> > Hi All,
>>> >> >
>>> >> > I have created the branch for Spark 3.3:
>>> >> > https://github.com/apache/spark/commits/branch-3.3
>>> >> >
>>> >> > Please, backport important fixes to it, and if you have some doubts, ping me in the PR. Regarding new features, we are still building the allow list for branch-3.3.
>>> >> >
>>> >> > Best regards,
>>> >> > Max Gekk
>>> >> >
>>> >> > On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun <[email protected]> wrote:
>>> >> >
>>> >> > Yes, I agree with you for your whitelist approach for backporting. :)
>>> >> > Thank you for summarizing.
>>> >> >
>>> >> > Thanks,
>>> >> > Dongjoon.
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 4:20 PM Xiao Li <[email protected]> wrote:
>>> >> >
>>> >> > I think I finally got your point. What you want to keep unchanged is the branch cut date of Spark 3.3. Today? or this Friday? This is not a big deal.
>>> >> >
>>> >> > My major concern is whether we should keep merging the feature work or the dependency upgrade after the branch cut. To make our release time more predictable, I am suggesting we should finalize the exception PR list first, instead of merging them in an ad hoc way. In the past, we spent a lot of time on the revert of the PRs that were merged after the branch cut. I hope we can minimize unnecessary arguments in this release. Do you agree, Dongjoon?
>>> >> >
>>> >> > Dongjoon Hyun <[email protected]> 于2022年3月15日周二 15:55写道:
>>> >> >
>>> >> > That is not totally fine, Xiao. It sounds like you are asking a change of plan without a proper reason.
>>> >> >
>>> >> > Although we cut the branch Today according our plan, you still can collect the list and make a list of exceptions. I'm not blocking what you want to do.
>>> >> >
>>> >> > Please let the community start to ramp down as we agreed before.
>>> >> >
>>> >> > Dongjoon
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 3:07 PM Xiao Li <[email protected]> wrote:
>>> >> >
>>> >> > Please do not get me wrong. If we don't cut a branch, we are allowing all patches to land Apache Spark 3.3. That is totally fine. After we cut the branch, we should avoid merging the feature work. In the next three days, let us collect the actively developed PRs that we want to make an exception (i.e., merged to 3.3 after the upcoming branch cut). Does that make sense?
>>> >> >
>>> >> > Dongjoon Hyun <[email protected]> 于2022年3月15日周二 14:54写道:
>>> >> >
>>> >> > Xiao. You are working against what you are saying.
>>> >> > If you don't cut a branch, it means you are allowing all patches to land Apache Spark 3.3. No?
>>> >> >
>>> >> > > we need to avoid backporting the feature work that are not being well discussed.
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 12:12 PM Xiao Li <[email protected]> wrote:
>>> >> >
>>> >> > Cutting the branch is simple, but we need to avoid backporting the feature work that are not being well discussed. Not all the members are actively following the dev list. I think we should wait 3 more days for collecting the PR list before cutting the branch.
>>> >> >
>>> >> > BTW, there are very few 3.4-only feature work that will be affected.
>>> >> >
>>> >> > Xiao
>>> >> >
>>> >> > Dongjoon Hyun <[email protected]> 于2022年3月15日周二 11:49写道:
>>> >> >
>>> >> > Hi, Max, Chao, Xiao, Holden and all.
>>> >> >
>>> >> > I have a different idea.
>>> >> >
>>> >> > Given the situation and small patch list, I don't think we need to postpone the branch cut for those patches. It's easier to cut a branch-3.3 and allow backporting.
>>> >> >
>>> >> > As of today, we already have an obvious Apache Spark 3.4 patch in the branch together. This situation only becomes worse and worse because there is no way to block the other patches from landing unintentionally if we don't cut a branch.
>>> >> >
>>> >> > [SPARK-38335][SQL] Implement parser support for DEFAULT column values
>>> >> >
>>> >> > Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
>>> >> >
>>> >> > Best,
>>> >> > Dongjoon.
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 10:17 AM Chao Sun <[email protected]> wrote:
>>> >> >
>>> >> > Cool, thanks for clarifying!
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 10:11 AM Xiao Li <[email protected]> wrote:
>>> >> > >>
>>> >> > >> For the following list:
>>> >> > >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>> >> > >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>>> >> > >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>> >> > >> Do you mean we should include them, or exclude them from 3.3?
>>> >> > >
>>> >> > > If possible, I hope these features can be shipped with Spark 3.3.
>>> >> > >
>>> >> > > Chao Sun <[email protected]> 于2022年3月15日周二 10:06写道:
>>> >> > >>
>>> >> > >> Hi Xiao,
>>> >> > >>
>>> >> > >> For the following list:
>>> >> > >>
>>> >> > >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>> >> > >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>>> >> > >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>> >> > >>
>>> >> > >> Do you mean we should include them, or exclude them from 3.3?
>>> >> > >>
>>> >> > >> Thanks,
>>> >> > >> Chao
>>> >> > >>
>>> >> > >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun <[email protected]> wrote:
>>> >> > >> >
>>> >> > >> > The following was tested and merged a few minutes ago. So, we can remove it from the list.
>>> >> > >> >
>>> >> > >> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>> >> > >> >
>>> >> > >> > Thanks,
>>> >> > >> > Dongjoon.
>>> >> > >> >
>>> >> > >> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li <[email protected]> wrote:
>>> >> > >> >>
>>> >> > >> >> Let me clarify my above suggestion. Maybe we can wait 3 more days to collect the list of actively developed PRs that we want to merge to 3.3 after the branch cut?
>>> >> > >> >>
>>> >> > >> >> Please do not rush to merge the PRs that are not fully reviewed. We can cut the branch this Friday and continue merging the PRs that have been discussed in this thread. Does that make sense?
>>> >> > >> >>
>>> >> > >> >> Xiao
>>> >> > >> >>
>>> >> > >> >> Holden Karau <[email protected]> 于2022年3月15日周二 09:10写道:
>>> >> > >> >>>
>>> >> > >> >>> May I suggest we push out one week (22nd) just to give everyone a bit of breathing space? Rushed software development more often results in bugs.
>>> >> > >> >>>
>>> >> > >> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang <[email protected]> wrote:
>>> >> > >> >>>>
>>> >> > >> >>>> > To make our release time more predictable, let us collect the PRs and wait three more days before the branch cut?
>>> >> > >> >>>>
>>> >> > >> >>>> For SPIP: Support Customized Kubernetes Schedulers:
>>> >> > >> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>> >> > >> >>>>
>>> >> > >> >>>> Three more days are OK for this from my view.
>>> >> > >> >>>>
>>> >> > >> >>>> Regards,
>>> >> > >> >>>> Yikun
>>> >> > >> >>>
>>> >> > >> >>> --
>>> >> > >> >>> Twitter: https://twitter.com/holdenkarau
>>> >> > >> >>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> >> > >> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> >
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: [email protected]
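[Editor's note on the try_* functions discussed in this thread: Gengliang's proposed try_sum, try_avg, and try_to_binary return NULL instead of raising an error on overflow or encoding failures. The sketch below is a rough, illustrative plain-Python rendering of that intended behavior for try_sum over BIGINT values; it is not Spark's implementation, and the function name and NULL-handling details here are assumptions for illustration only.]

```python
# Illustrative sketch of the "NULL instead of error" semantics described for
# try_sum (SPARK-38548). Plain Python, NOT Spark code.

LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1  # SQL BIGINT range

def try_sum(values):
    """Sum BIGINT values, skipping NULLs; return None (NULL) on overflow."""
    total, seen = 0, False
    for v in values:
        if v is None:            # SQL aggregates ignore NULL inputs
            continue
        seen = True
        total += v
        if not (LONG_MIN <= total <= LONG_MAX):
            return None          # overflow -> NULL rather than an exception
    return total if seen else None  # all-NULL input yields NULL, like SUM

print(try_sum([1, 2, 3]))      # 6
print(try_sum([LONG_MAX, 1]))  # None: the sum would overflow BIGINT
```

Under ANSI mode the plain SUM aggregate raises on overflow; the try_* variants trade that error for a NULL result, which is the behavior sketched above.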
