Hello All,

Current status of features from the allow list for branch-3.3 is:

IN PROGRESS:

   1. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
   2. SPARK-28516: Data Type Formatting Functions: `to_char`
   3. SPARK-34079: Improvement CTE table scan

IN PROGRESS but won't/couldn't be merged to branch-3.3:

   1. SPARK-37650: Tell spark-env.sh the python interpreter
   2. SPARK-36664: Log time spent waiting for cluster resources
   3. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
   4. SPARK-37395: Inline type hint files for files in python/pyspark/ml
   5. SPARK-37093: Inline type hints python/pyspark/streaming

RESOLVED:

   1. SPARK-32268: Bloom Filter Join
   2. SPARK-38548: New SQL function: try_sum
   3. SPARK-38063: Support SQL split_part function
   4. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
   5. SPARK-34863: Support nested column in Spark Parquet vectorized readers
   6. SPARK-38194: Make Yarn memory overhead factor configurable
   7. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
   8. SPARK-37831: Add task partition id in metrics
   9. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
   10. SPARK-38590: New SQL function: try_to_binary
   11. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
   12. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
   13. SPARK-34659: Web UI does not correctly get appId
   14. SPARK-38589: New SQL function: try_avg

Max Gekk
Software Engineer
Databricks, Inc.

On Mon, Apr 4, 2022 at 9:27 PM Maxim Gekk <[email protected]> wrote:

> Hello All,
>
> Below is current status of features from the allow list:
>
> IN PROGRESS:
>
>    1. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
>    2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
>    3. SPARK-37093: Inline type hints python/pyspark/streaming
>    4. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
>    5. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
>    6. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
>    7. SPARK-28516: Data Type Formatting Functions: `to_char`
>    8. SPARK-36664: Log time spent waiting for cluster resources
>    9. SPARK-34659: Web UI does not correctly get appId
>    10. SPARK-37650: Tell spark-env.sh the python interpreter
>    11. SPARK-38589: New SQL function: try_avg
>    12. SPARK-38590: New SQL function: try_to_binary
>    13. SPARK-34079: Improvement CTE table scan
>
> RESOLVED:
>
>    1. SPARK-32268: Bloom Filter Join
>    2. SPARK-38548: New SQL function: try_sum
>    3. SPARK-38063: Support SQL split_part function
>    4. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
>    5. SPARK-34863: Support nested column in Spark Parquet vectorized readers
>    6. SPARK-38194: Make Yarn memory overhead factor configurable
>    7. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
>    8. SPARK-37831: Add task partition id in metrics
>    9. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
>
> We need to decide whether we are going to wait a little bit more or close the doors.
>
> Maxim Gekk
> Software Engineer
> Databricks, Inc.
>
> On Fri, Mar 18, 2022 at 9:22 AM Maxim Gekk <[email protected]> wrote:
>
>> Hi All,
>>
>> Here is the allow list which I built based on your requests in this thread:
>>
>>    1. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
>>    2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
>>    3. SPARK-37093: Inline type hints python/pyspark/streaming
>>    4. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
>>    5. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
>>    6. SPARK-32268: Bloom Filter Join
>>    7. SPARK-38548: New SQL function: try_sum
>>    8. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
>>    9. SPARK-38063: Support SQL split_part function
>>    10. SPARK-28516: Data Type Formatting Functions: `to_char`
>>    11. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
>>    12. SPARK-34863: Support nested column in Spark Parquet vectorized readers
>>    13. SPARK-38194: Make Yarn memory overhead factor configurable
>>    14. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
>>    15. SPARK-37831: Add task partition id in metrics
>>    16. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
>>    17. SPARK-36664: Log time spent waiting for cluster resources
>>    18. SPARK-34659: Web UI does not correctly get appId
>>    19. SPARK-37650: Tell spark-env.sh the python interpreter
>>    20. SPARK-38589: New SQL function: try_avg
>>    21. SPARK-38590: New SQL function: try_to_binary
>>    22. SPARK-34079: Improvement CTE table scan
>>
>> Best regards,
>> Max Gekk
>>
>> On Thu, Mar 17, 2022 at 4:59 PM Tom Graves <[email protected]> wrote:
>>
>>> Is the feature freeze target date March 22nd then? I saw a few dates thrown around, want to confirm what we landed on.
>>>
>>> I am trying to get the following improvements finished review and in, if concerns with either, let me know:
>>> - [SPARK-34079][SQL] Merge non-correlated scalar subqueries <https://github.com/apache/spark/pull/32298#>
>>> - [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service for released executors <https://github.com/apache/spark/pull/35085#>
>>>
>>> Tom
>>>
>>> On Thursday, March 17, 2022, 07:24:41 AM CDT, Gengliang Wang <[email protected]> wrote:
>>>
>>> I'd like to add the following new SQL functions in the 3.3 release. These functions are useful when overflow or encoding errors occur:
>>>
>>> - [SPARK-38548][SQL] New SQL function: try_sum <https://github.com/apache/spark/pull/35848>
>>> - [SPARK-38589][SQL] New SQL function: try_avg <https://github.com/apache/spark/pull/35896>
>>> - [SPARK-38590][SQL] New SQL function: try_to_binary <https://github.com/apache/spark/pull/35897>
>>>
>>> Gengliang
>>>
>>> On Thu, Mar 17, 2022 at 7:59 AM Andrew Melo <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> I've been trying for a bit to get the following two PRs merged and
>>> into a release, and I'm having some difficulty moving them forward:
>>>
>>> https://github.com/apache/spark/pull/34903 - This passes the current
>>> python interpreter to spark-env.sh to allow some currently-unavailable
>>> customization to happen
>>> https://github.com/apache/spark/pull/31774 - This fixes a bug in the
>>> SparkUI reverse proxy-handling code where it does a greedy match for
>>> "proxy" in the URL, and will mistakenly replace the App-ID in the
>>> wrong place.
>>>
>>> I'm not exactly sure of how to get attention of PRs that have been
>>> sitting around for a while, but these are really important to our
>>> use-cases, and it would be nice to have them merged in.
>>>
>>> Cheers
>>> Andrew
>>>
>>> On Wed, Mar 16, 2022 at 6:21 PM Holden Karau <[email protected]> wrote:
>>> >
>>> > I'd like to add/backport the logging in https://github.com/apache/spark/pull/35881 PR so that when users submit issues with dynamic allocation we can better debug what's going on.
>>> >
>>> > On Wed, Mar 16, 2022 at 3:45 PM Chao Sun <[email protected]> wrote:
>>> >>
>>> >> There is one item on our side that we want to backport to 3.3:
>>> >> - vectorized DELTA_BYTE_ARRAY/DELTA_LENGTH_BYTE_ARRAY encodings for
>>> >> Parquet V2 support (https://github.com/apache/spark/pull/35262)
>>> >>
>>> >> It's already reviewed and approved.
>>> >>
>>> >> On Wed, Mar 16, 2022 at 9:13 AM Tom Graves <[email protected]> wrote:
>>> >> >
>>> >> > It looks like the version hasn't been updated on master and still shows 3.3.0-SNAPSHOT, can you please update that.
>>> >> >
>>> >> > Tom
>>> >> >
>>> >> > On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk <[email protected]> wrote:
>>> >> >
>>> >> > Hi All,
>>> >> >
>>> >> > I have created the branch for Spark 3.3:
>>> >> > https://github.com/apache/spark/commits/branch-3.3
>>> >> >
>>> >> > Please, backport important fixes to it, and if you have some doubts, ping me in the PR. Regarding new features, we are still building the allow list for branch-3.3.
>>> >> >
>>> >> > Best regards,
>>> >> > Max Gekk
>>> >> >
>>> >> > On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun <[email protected]> wrote:
>>> >> >
>>> >> > Yes, I agree with you for your whitelist approach for backporting. :)
>>> >> > Thank you for summarizing.
>>> >> >
>>> >> > Thanks,
>>> >> > Dongjoon.
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 4:20 PM Xiao Li <[email protected]> wrote:
>>> >> >
>>> >> > I think I finally got your point. What you want to keep unchanged is the branch cut date of Spark 3.3. Today? or this Friday? This is not a big deal.
>>> >> >
>>> >> > My major concern is whether we should keep merging the feature work or the dependency upgrade after the branch cut. To make our release time more predictable, I am suggesting we should finalize the exception PR list first, instead of merging them in an ad hoc way. In the past, we spent a lot of time on the revert of the PRs that were merged after the branch cut. I hope we can minimize unnecessary arguments in this release. Do you agree, Dongjoon?
>>> >> >
>>> >> > Dongjoon Hyun <[email protected]> 于2022年3月15日周二 15:55写道:
>>> >> >
>>> >> > That is not totally fine, Xiao. It sounds like you are asking a change of plan without a proper reason.
>>> >> >
>>> >> > Although we cut the branch Today according our plan, you still can collect the list and make a list of exceptions. I'm not blocking what you want to do.
>>> >> >
>>> >> > Please let the community start to ramp down as we agreed before.
>>> >> >
>>> >> > Dongjoon
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 3:07 PM Xiao Li <[email protected]> wrote:
>>> >> >
>>> >> > Please do not get me wrong. If we don't cut a branch, we are allowing all patches to land Apache Spark 3.3. That is totally fine. After we cut the branch, we should avoid merging the feature work. In the next three days, let us collect the actively developed PRs that we want to make an exception (i.e., merged to 3.3 after the upcoming branch cut). Does that make sense?
>>> >> >
>>> >> > Dongjoon Hyun <[email protected]> 于2022年3月15日周二 14:54写道:
>>> >> >
>>> >> > Xiao. You are working against what you are saying.
>>> >> > If you don't cut a branch, it means you are allowing all patches to land Apache Spark 3.3. No?
>>> >> >
>>> >> > > we need to avoid backporting the feature work that are not being well discussed.
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 12:12 PM Xiao Li <[email protected]> wrote:
>>> >> >
>>> >> > Cutting the branch is simple, but we need to avoid backporting the feature work that are not being well discussed. Not all the members are actively following the dev list. I think we should wait 3 more days for collecting the PR list before cutting the branch.
>>> >> >
>>> >> > BTW, there are very few 3.4-only feature work that will be affected.
>>> >> >
>>> >> > Xiao
>>> >> >
>>> >> > Dongjoon Hyun <[email protected]> 于2022年3月15日周二 11:49写道:
>>> >> >
>>> >> > Hi, Max, Chao, Xiao, Holden and all.
>>> >> >
>>> >> > I have a different idea.
>>> >> >
>>> >> > Given the situation and small patch list, I don't think we need to postpone the branch cut for those patches. It's easier to cut a branch-3.3 and allow backporting.
>>> >> >
>>> >> > As of today, we already have an obvious Apache Spark 3.4 patch in the branch together. This situation only becomes worse and worse because there is no way to block the other patches from landing unintentionally if we don't cut a branch.
>>> >> >
>>> >> > [SPARK-38335][SQL] Implement parser support for DEFAULT column values
>>> >> >
>>> >> > Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
>>> >> >
>>> >> > Best,
>>> >> > Dongjoon.
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 10:17 AM Chao Sun <[email protected]> wrote:
>>> >> >
>>> >> > Cool, thanks for clarifying!
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 10:11 AM Xiao Li <[email protected]> wrote:
>>> >> > >>
>>> >> > >> For the following list:
>>> >> > >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>> >> > >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>>> >> > >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>> >> > >> Do you mean we should include them, or exclude them from 3.3?
>>> >> > >
>>> >> > > If possible, I hope these features can be shipped with Spark 3.3.
>>> >> > >
>>> >> > > Chao Sun <[email protected]> 于2022年3月15日周二 10:06写道:
>>> >> > >>
>>> >> > >> Hi Xiao,
>>> >> > >>
>>> >> > >> For the following list:
>>> >> > >>
>>> >> > >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>> >> > >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>>> >> > >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>> >> > >>
>>> >> > >> Do you mean we should include them, or exclude them from 3.3?
>>> >> > >>
>>> >> > >> Thanks,
>>> >> > >> Chao
>>> >> > >>
>>> >> > >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun <[email protected]> wrote:
>>> >> > >> >
>>> >> > >> > The following was tested and merged a few minutes ago. So, we can remove it from the list.
>>> >> > >> >
>>> >> > >> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>> >> > >> >
>>> >> > >> > Thanks,
>>> >> > >> > Dongjoon.
>>> >> > >> >
>>> >> > >> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li <[email protected]> wrote:
>>> >> > >> >>
>>> >> > >> >> Let me clarify my above suggestion. Maybe we can wait 3 more days to collect the list of actively developed PRs that we want to merge to 3.3 after the branch cut?
>>> >> > >> >>
>>> >> > >> >> Please do not rush to merge the PRs that are not fully reviewed. We can cut the branch this Friday and continue merging the PRs that have been discussed in this thread. Does that make sense?
>>> >> > >> >>
>>> >> > >> >> Xiao
>>> >> > >> >>
>>> >> > >> >> Holden Karau <[email protected]> 于2022年3月15日周二 09:10写道:
>>> >> > >> >>>
>>> >> > >> >>> May I suggest we push out one week (22nd) just to give everyone a bit of breathing space? Rushed software development more often results in bugs.
>>> >> > >> >>>
>>> >> > >> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang <[email protected]> wrote:
>>> >> > >> >>>>
>>> >> > >> >>>> > To make our release time more predictable, let us collect the PRs and wait three more days before the branch cut?
>>> >> > >> >>>>
>>> >> > >> >>>> For SPIP: Support Customized Kubernetes Schedulers:
>>> >> > >> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>> >> > >> >>>>
>>> >> > >> >>>> Three more days are OK for this from my view.
>>> >> > >> >>>>
>>> >> > >> >>>> Regards,
>>> >> > >> >>>> Yikun
>>> >> > >> >>>
>>> >> > >> >>> --
>>> >> > >> >>> Twitter: https://twitter.com/holdenkarau
>>> >> > >> >>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> >> > >> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> >
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: [email protected]
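[Editor's note on the try_* functions discussed in this thread: Gengliang's proposed try_sum, try_avg, and try_to_binary return NULL instead of raising an error on overflow or encoding failures. The sketch below is a rough, illustrative plain-Python rendering of that intended behavior for try_sum over BIGINT values; it is not Spark's implementation, and the function name and NULL-handling details here are assumptions for illustration only.]

```python
# Illustrative sketch of the "NULL instead of error" semantics described for
# try_sum (SPARK-38548). Plain Python, NOT Spark code.

LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1  # SQL BIGINT range

def try_sum(values):
    """Sum BIGINT values, skipping NULLs; return None (NULL) on overflow."""
    total, seen = 0, False
    for v in values:
        if v is None:            # SQL aggregates ignore NULL inputs
            continue
        seen = True
        total += v
        if not (LONG_MIN <= total <= LONG_MAX):
            return None          # overflow -> NULL rather than an exception
    return total if seen else None  # all-NULL input yields NULL, like SUM

print(try_sum([1, 2, 3]))      # 6
print(try_sum([LONG_MAX, 1]))  # None: the sum would overflow BIGINT
```

Under ANSI mode the plain SUM aggregate raises on overflow; the try_* variants trade that error for a NULL result, which is the behavior sketched above.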
