Shall we revisit this list after a week? Ideally, the listed PRs should be
either merged or rejected for 3.3 by then, so that we can cut rc1. If there
are exceptions, we can still discuss them case by case at that time.

On Sat, Mar 19, 2022 at 5:27 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Thank you for the summary.
>
> I believe we need a discussion to evaluate each PR's readiness.
>
> BTW, `branch-3.3` is still open for bug fixes including minor dependency
> changes like the following.
>
> (Backported)
> [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4
> Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"
> [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5
>
> (Upcoming)
> [SPARK-38544][BUILD] Upgrade log4j2 to 2.17.2 from 2.17.1
> [SPARK-38602][BUILD] Upgrade Kafka to 3.1.1 from 3.1.0
>
> Dongjoon.
>
>
>
> On Thu, Mar 17, 2022 at 11:22 PM Maxim Gekk <maxim.g...@databricks.com>
> wrote:
>
>> Hi All,
>>
>> Here is the allow list which I built based on your requests in this
>> thread:
>>
>>    1. SPARK-37396: Inline type hint files for files in
>>    python/pyspark/mllib
>>    2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
>>    3. SPARK-37093: Inline type hints python/pyspark/streaming
>>    4. SPARK-37377: Refactor V2 Partitioning interface and remove
>>    deprecated usage of Distribution
>>    5. SPARK-38085: DataSource V2: Handle DELETE commands for group-based
>>    sources
>>    6. SPARK-32268: Bloom Filter Join
>>    7. SPARK-38548: New SQL function: try_sum
>>    8. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
>>    9. SPARK-38063: Support SQL split_part function
>>    10. SPARK-28516: Data Type Formatting Functions: `to_char`
>>    11. SPARK-38432: Refactor framework so as JDBC dialect could compile
>>    filter by self way
>>    12. SPARK-34863: Support nested column in Spark Parquet vectorized
>>    readers
>>    13. SPARK-38194: Make Yarn memory overhead factor configurable
>>    14. SPARK-37618: Support cleaning up shuffle blocks from external
>>    shuffle service
>>    15. SPARK-37831: Add task partition id in metrics
>>    16. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and
>>    DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
>>    17. SPARK-36664: Log time spent waiting for cluster resources
>>    18. SPARK-34659: Web UI does not correctly get appId
>>    19. SPARK-37650: Tell spark-env.sh the python interpreter
>>    20. SPARK-38589: New SQL function: try_avg
>>    21. SPARK-38590: New SQL function: try_to_binary
>>    22. SPARK-34079: Improvement CTE table scan
>>
>> Best regards,
>> Max Gekk
>>
>>
>> On Thu, Mar 17, 2022 at 4:59 PM Tom Graves <tgraves...@yahoo.com> wrote:
>>
>>> Is the feature freeze target date March 22nd, then? I saw a few dates
>>> thrown around and want to confirm what we landed on.
>>>
>>> I am trying to get the following improvements through review and merged.
>>> If there are concerns with either, let me know:
>>> - [SPARK-34079][SQL] Merge non-correlated scalar subqueries
>>> <https://github.com/apache/spark/pull/32298#>
>>> - [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service
>>> for released executors <https://github.com/apache/spark/pull/35085#>
>>>
>>> Tom
>>>
>>>
>>> On Thursday, March 17, 2022, 07:24:41 AM CDT, Gengliang Wang <
>>> ltn...@gmail.com> wrote:
>>>
>>>
>>> I'd like to add the following new SQL functions in the 3.3 release.
>>> These functions are useful when overflow or encoding errors occur:
>>>
>>>    - [SPARK-38548][SQL] New SQL function: try_sum
>>>    <https://github.com/apache/spark/pull/35848>
>>>    - [SPARK-38589][SQL] New SQL function: try_avg
>>>    <https://github.com/apache/spark/pull/35896>
>>>    - [SPARK-38590][SQL] New SQL function: try_to_binary
>>>    <https://github.com/apache/spark/pull/35897>
>>>
>>> Gengliang
>>>
>>> On Thu, Mar 17, 2022 at 7:59 AM Andrew Melo <andrew.m...@gmail.com>
>>> wrote:
>>>
>>> Hello,
>>>
>>> I've been trying for a bit to get the following two PRs merged and
>>> into a release, and I'm having some difficulty moving them forward:
>>>
>>> https://github.com/apache/spark/pull/34903 - This passes the current
>>> Python interpreter to spark-env.sh, enabling customization that is
>>> currently unavailable.
>>> https://github.com/apache/spark/pull/31774 - This fixes a bug in the
>>> SparkUI reverse-proxy handling code, which does a greedy match for
>>> "proxy" in the URL and can mistakenly replace the app ID in the
>>> wrong place.
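(To illustrate the kind of pitfall described above: a hypothetical sketch, not
the actual SparkUI proxy code. When "proxy" can occur more than once in a
request path, for example behind a gateway that also uses a proxy segment,
anchoring on the wrong occurrence extracts the wrong segment as the app ID.
The URL and helper functions below are made up for illustration.)

```python
# Hypothetical sketch of the proxy-path matching pitfall; not Spark's code.

def app_id_naive(path: str) -> str:
    # Takes the segment after the FIRST "proxy/" in the path.
    i = path.index("proxy/") + len("proxy/")
    return path[i:].split("/")[0]

def app_id_anchored(path: str) -> str:
    # Anchors on the LAST "proxy/" segment instead.
    i = path.rindex("proxy/") + len("proxy/")
    return path[i:].split("/")[0]

# A gateway in front of Spark that itself uses a "proxy" path segment:
path = "/gateway/proxy/spark-master/proxy/app-20220315-0001/jobs/"
```

With this path, app_id_naive picks up "spark-master" while app_id_anchored
recovers "app-20220315-0001"; either anchor can be wrong depending on where
the extra "proxy" segment appears, which is why the match needs care.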
>>>
>>> I'm not exactly sure how to get attention on PRs that have been
>>> sitting around for a while, but these are really important to our
>>> use cases, and it would be nice to have them merged in.
>>>
>>> Cheers
>>> Andrew
>>>
>>> On Wed, Mar 16, 2022 at 6:21 PM Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>> >
>>> > I'd like to add/backport the logging in
>>> https://github.com/apache/spark/pull/35881 PR so that when users submit
>>> issues with dynamic allocation we can better debug what's going on.
>>> >
>>> > On Wed, Mar 16, 2022 at 3:45 PM Chao Sun <sunc...@apache.org> wrote:
>>> >>
>>> >> There is one item on our side that we want to backport to 3.3:
>>> >> - vectorized DELTA_BYTE_ARRAY/DELTA_LENGTH_BYTE_ARRAY encodings for
>>> >> Parquet V2 support (https://github.com/apache/spark/pull/35262)
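(As background on the encoding mentioned above, a simplified sketch under an
assumed layout: in Parquet's DELTA_LENGTH_BYTE_ARRAY, all value lengths are
stored up front, as delta-encoded, bit-packed integers in the real format,
followed by the concatenated value bytes. Having the lengths together is what
makes decoding a whole page in one vectorized pass practical. The function
below is illustrative, not the Spark/Parquet implementation, and takes the
lengths as a plain list.)

```python
# Simplified sketch of decoding DELTA_LENGTH_BYTE_ARRAY (assumed layout:
# lengths first, then concatenated bytes; the real Parquet format
# delta-encodes and bit-packs the lengths).
def decode_delta_length_byte_array(lengths, data: bytes):
    values, offset = [], 0
    for n in lengths:
        values.append(data[offset:offset + n])
        offset += n
    return values
```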
>>> >>
>>> >> It's already reviewed and approved.
>>> >>
>>> >> On Wed, Mar 16, 2022 at 9:13 AM Tom Graves
>>> <tgraves...@yahoo.com.invalid> wrote:
>>> >> >
>>> >> > It looks like the version hasn't been updated on master and still
>>> shows 3.3.0-SNAPSHOT; can you please update it?
>>> >> >
>>> >> > Tom
>>> >> >
>>> >> > On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk <
>>> maxim.g...@databricks.com.invalid> wrote:
>>> >> >
>>> >> >
>>> >> > Hi All,
>>> >> >
>>> >> > I have created the branch for Spark 3.3:
>>> >> > https://github.com/apache/spark/commits/branch-3.3
>>> >> >
>>> >> > Please backport important fixes to it, and if you have any
>>> doubts, ping me in the PR. Regarding new features, we are still building
>>> the allow list for branch-3.3.
>>> >> >
>>> >> > Best regards,
>>> >> > Max Gekk
>>> >> >
>>> >> >
>>> >> > On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>> >> >
>>> >> > Yes, I agree with your whitelist approach for backporting. :)
>>> >> > Thank you for summarizing.
>>> >> >
>>> >> > Thanks,
>>> >> > Dongjoon.
>>> >> >
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 4:20 PM Xiao Li <gatorsm...@gmail.com>
>>> wrote:
>>> >> >
>>> >> > I think I finally got your point. What you want to keep unchanged
>>> is the branch cut date of Spark 3.3. Today, or this Friday? That is not a
>>> big deal.
>>> >> >
>>> >> > My major concern is whether we should keep merging feature work or
>>> dependency upgrades after the branch cut. To make our release timing more
>>> predictable, I suggest we finalize the exception PR list first, instead of
>>> merging them in an ad hoc way. In the past, we spent a lot of time
>>> reverting PRs that were merged after the branch cut. I hope we can
>>> minimize unnecessary arguments in this release. Do you agree, Dongjoon?
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 3:55 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>> >> >
>>> >> > That is not totally fine, Xiao. It sounds like you are asking for a
>>> change of plan without a proper reason.
>>> >> >
>>> >> > Although we cut the branch today according to our plan, you can
>>> still collect the list and make a list of exceptions. I'm not blocking
>>> what you want to do.
>>> >> >
>>> >> > Please let the community start to ramp down as we agreed before.
>>> >> >
>>> >> > Dongjoon
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 3:07 PM Xiao Li <gatorsm...@gmail.com>
>>> wrote:
>>> >> >
>>> >> > Please do not get me wrong. If we don't cut a branch, we are
>>> allowing all patches to land in Apache Spark 3.3. That is totally fine.
>>> After we cut the branch, we should avoid merging feature work. In the next
>>> three days, let us collect the actively developed PRs that we want to make
>>> exceptions (i.e., merged to 3.3 after the upcoming branch cut). Does that
>>> make sense?
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 2:54 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>> >> >
>>> >> > Xiao, you are working against what you are saying.
>>> >> > If you don't cut a branch, it means you are allowing all patches to
>>> land in Apache Spark 3.3. No?
>>> >> >
>>> >> > > we need to avoid backporting feature work that has not been well
>>> discussed.
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 12:12 PM Xiao Li <gatorsm...@gmail.com>
>>> wrote:
>>> >> >
>>> >> > Cutting the branch is simple, but we need to avoid backporting
>>> feature work that has not been well discussed. Not all members are
>>> actively following the dev list. I think we should wait 3 more days to
>>> collect the PR list before cutting the branch.
>>> >> >
>>> >> > BTW, there is very little 3.4-only feature work that will be affected.
>>> >> >
>>> >> > Xiao
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 11:49 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>> >> >
>>> >> > Hi, Max, Chao, Xiao, Holden and all.
>>> >> >
>>> >> > I have a different idea.
>>> >> >
>>> >> > Given the situation and the small patch list, I don't think we need
>>> to postpone the branch cut for those patches. It's easier to cut
>>> branch-3.3 and allow backporting.
>>> >> >
>>> >> > As of today, we already have an obvious Apache Spark 3.4 patch in
>>> the branch. This situation will only get worse, because there is no way to
>>> keep other patches from landing unintentionally if we don't cut a branch.
>>> >> >
>>> >> >     [SPARK-38335][SQL] Implement parser support for DEFAULT column
>>> values
>>> >> >
>>> >> > Let's cut `branch-3.3` today to prepare for Apache Spark 3.3.0.
>>> >> >
>>> >> > Best,
>>> >> > Dongjoon.
>>> >> >
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 10:17 AM Chao Sun <sunc...@apache.org>
>>> wrote:
>>> >> >
>>> >> > Cool, thanks for clarifying!
>>> >> >
>>> >> > On Tue, Mar 15, 2022 at 10:11 AM Xiao Li <gatorsm...@gmail.com>
>>> wrote:
>>> >> > >>
>>> >> > >> For the following list:
>>> >> > >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>> >> > >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
>>> vectorized reader
>>> >> > >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>> >> > >> Do you mean we should include them, or exclude them from 3.3?
>>> >> > >
>>> >> > >
>>> >> > > If possible, I hope these features can be shipped with Spark 3.3.
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > >> On Tue, Mar 15, 2022 at 10:06 AM Chao Sun <sunc...@apache.org> wrote:
>>> >> > >>
>>> >> > >> Hi Xiao,
>>> >> > >>
>>> >> > >> For the following list:
>>> >> > >>
>>> >> > >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>> >> > >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
>>> vectorized reader
>>> >> > >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>> >> > >>
>>> >> > >> Do you mean we should include them, or exclude them from 3.3?
>>> >> > >>
>>> >> > >> Thanks,
>>> >> > >> Chao
>>> >> > >>
>>> >> > >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>> >> > >> >
>>> >> > >> > The following was tested and merged a few minutes ago. So, we
>>> can remove it from the list.
>>> >> > >> >
>>> >> > >> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>> >> > >> >
>>> >> > >> > Thanks,
>>> >> > >> > Dongjoon.
>>> >> > >> >
>>> >> > >> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li <gatorsm...@gmail.com>
>>> wrote:
>>> >> > >> >>
>>> >> > >> >> Let me clarify my suggestion above. Maybe we can wait 3 more
>>> days to collect the list of actively developed PRs that we want to merge
>>> to 3.3 after the branch cut?
>>> >> > >> >>
>>> >> > >> >> Please do not rush to merge the PRs that are not fully
>>> reviewed. We can cut the branch this Friday and continue merging the PRs
>>> that have been discussed in this thread. Does that make sense?
>>> >> > >> >>
>>> >> > >> >> Xiao
>>> >> > >> >>
>>> >> > >> >>
>>> >> > >> >>
>>> >> > >> >> On Tue, Mar 15, 2022 at 9:10 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>>> >> > >> >>>
>>> >> > >> >>> May I suggest we push out by one week (to the 22nd), just to
>>> give everyone a bit of breathing space? Rushed software development more
>>> often results in bugs.
>>> >> > >> >>>
>>> >> > >> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang <
>>> yikunk...@gmail.com> wrote:
>>> >> > >> >>>>
>>> >> > >> >>>> > To make our release time more predictable, let us collect
>>> the PRs and wait three more days before the branch cut?
>>> >> > >> >>>>
>>> >> > >> >>>> For SPIP: Support Customized Kubernetes Schedulers:
>>> >> > >> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to
>>> v1.5.1
>>> >> > >> >>>>
>>> >> > >> >>>> Three more days are OK for this, in my view.
>>> >> > >> >>>>
>>> >> > >> >>>> Regards,
>>> >> > >> >>>> Yikun
>>> >> > >> >>>
>>> >> > >> >>> --
>>> >> > >> >>> Twitter: https://twitter.com/holdenkarau
>>> >> > >> >>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> >> > >> >>> YouTube Live Streams:
>>> https://www.youtube.com/user/holdenkarau
>>> >
>>> >
>>> >
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>