[VOTE][RESULT] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
The vote passes with 13 +1s (8 binding) and one +0.

(* = binding)
+1:
Chao Sun (*)
Liang-Chi Hsieh (*)
Huaxin Gao (*)
Bo Yang
Dongjoon Hyun (*)
Kent Yao
Wenchen Fan (*)
Ryan Blue
Anton Okolnychyi
Zhou Jiang
Gengliang Wang (*)
Xiao Li (*)
Hyukjin Kwon (*)

+0:
Mich Talebzadeh


-1: None

Thanks all.




Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
Hi all,

Thanks, everyone, for participating and for your support! The vote has passed.
I'll send out the result in a separate thread.

On Wed, May 15, 2024 at 4:44 PM Hyukjin Kwon  wrote:
>
> +1
>
> On Tue, 14 May 2024 at 16:39, Wenchen Fan  wrote:
>>
>> +1
>>
>> On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:
>>>
>>> +1 (non-binding)
>>>
>>> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>>>
>>>> Please also refer to:
>>>>
>>>>- Discussion thread:
>>>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>>>- SPIP doc: 
>>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>>>
>>>>
>>>> Please vote on the SPIP for the next 72 hours:
>>>>
>>>> [ ] +1: Accept the proposal as an official SPIP
>>>> [ ] +0
>>>> [ ] -1: I don’t think this is a good idea because …
>>>>
>>>>
>>>> Thank you!
>>>>
>>>> Liang-Chi Hsieh
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>
>>>
>>> --
>>> Zhou JIANG
>>>




Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
+1

On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>
> +1
>
> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>- SPIP doc: 
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>




[VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
Hi all,

I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.

Please also refer to:

   - Discussion thread:
https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
   - JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
   - SPIP doc: 
https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/


Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …


Thank you!

Liang-Chi Hsieh




Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread L. C. Hsieh
Thanks, Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison, and
anyone else I may have missed who participated in the discussion.

I believe we have reached, or are close to, a consensus on the design.

If you have more comments, please let us know.

If not, I will start a vote in a few days.

Thank you.

On Thu, May 9, 2024 at 6:12 PM Anton Okolnychyi  wrote:
>
> Thanks to everyone who commented on the design doc. I updated the proposal 
> and it is ready for another look. I hope we can converge and move forward 
> with this effort!
>
> - Anton
>
> On Fri, Apr 19, 2024 at 15:54, Anton Okolnychyi wrote:
>>
>> Hi folks,
>>
>> I'd like to start a discussion on SPARK-44167 that aims to enable catalogs 
>> to expose custom routines as stored procedures. I believe this functionality 
>> will enhance Spark’s ability to interact with external connectors and allow 
>> users to perform more operations in plain SQL.
>>
>> SPIP [1] contains proposed API changes and parser extensions. Any feedback 
>> is more than welcome!
>>
>> Unlike the initial proposal for stored procedures with Python [2], this one 
>> focuses on exposing pre-defined stored procedures via the catalog API. This 
>> approach is inspired by a similar functionality in Trino and avoids the 
>> challenges of supporting user-defined routines discussed earlier [3].
>>
>> Liang-Chi was kind enough to shepherd this effort. Thanks!
>>
>> - Anton
>>
>> [1] - 
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> [2] - 
>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
>> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>>
>>
>>
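
For context when skimming the thread: below is a minimal sketch of the idea, in
which a catalog plugin exposes a pre-defined routine that users can then invoke
from SQL. The Scala interfaces and the procedure are illustrative only, not the
final API; see the SPIP doc for the actual proposal.

    // Hypothetical shapes, loosely following the SPIP; not the final API.
    trait Procedure {
      def name: String
      def description: String
      def execute(args: Map[String, Any]): Unit
    }

    trait ProcedureCatalog {
      // A connector resolves a pre-defined procedure by namespace and name.
      def loadProcedure(namespace: Array[String], name: String): Procedure
    }

    // Example of a connector-provided maintenance routine.
    class ExpireSnapshots extends Procedure {
      override def name: String = "expire_snapshots"
      override def description: String = "Remove table snapshots older than a timestamp"
      override def execute(args: Map[String, Any]): Unit = {
        // connector-specific logic runs here
      }
    }

The intended user experience, per the proposal, is plain SQL along the lines of
CALL my_catalog.system.expire_snapshots(...), similar to Trino's stored
procedures.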




Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread L. C. Hsieh
+1

On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun  wrote:
>
> I'll start with my +1.
>
> Dongjoon.
>
> On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
> > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault
> > to `false` by default. The technical scope is defined in the following PR.
> >
> > - DISCUSSION:
> > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
> > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
> > - PR: https://github.com/apache/spark/pull/46207
> >
> > The vote is open until April 30th 1AM (PST) and passes
> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default
> > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ...
> >
> > Thank you in advance.
> >
> > Dongjoon
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
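
For context on the behavior change under vote: with the flag set to false, a
CREATE TABLE statement without a USING or STORED AS clause creates a native
data source (Parquet) table instead of a Hive SerDe table. A minimal sketch,
assuming an existing SparkSession (the table names are illustrative):

    // With spark.sql.legacy.createHiveTableByDefault=false (the proposed
    // default), this creates a Parquet data source table:
    spark.sql("CREATE TABLE t (id INT, name STRING)")

    // Restore the old behavior for the session if needed:
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")

    // Or stay independent of the flag by being explicit per table:
    spark.sql("CREATE TABLE t_hive (id INT) STORED AS PARQUET")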




Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread L. C. Hsieh
+1

On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang  wrote:

> +1
>
> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek  wrote:
>
>> Of course; I can't think of a scenario with thousands of tables on a single
>> in-memory Spark cluster with an in-memory catalog.
>> Thanks for the help!
>>
>> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>>
>>>
>>> Agreed. In scenarios where most of the interactions with the catalog are
>>> related to query planning, saving and metadata management, the choice of
>>> catalog implementation may have less impact on query runtime performance.
>>> This is because the time spent on metadata operations is generally
>>> minimal compared to the time spent on actual data fetching, processing, and
>>> computation.
>>> However, if we consider scalability and reliability concerns, especially
>>> as the size and complexity of data and query workload grow. While an
>>> in-memory catalog may offer excellent performance for smaller workloads,
>>> it will face limitations in handling larger-scale deployments with
>>> thousands of tables, partitions, and users. Additionally, durability and
>>> persistence are crucial considerations, particularly in production
>>> environments where data integrity
>>> and availability are crucial. In-memory catalog implementations may lack
>>> durability, meaning that metadata changes could be lost in the event of a
>>> system failure or restart. Therefore, while in-memory catalog
>>> implementations can provide speed and efficiency for certain use cases, we
>>> ought to consider the requirements for scalability, reliability, and data
>>> durability when choosing a catalog solution for production deployments. In
>>> many cases, a combination of in-memory and disk-based catalog solutions may
>>> offer the best balance of performance and resilience for demanding large
>>> scale workloads.
>>>
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek  wrote:
>>>
 Of course, but it's in memory and not persisted which is much faster,
 and as I said- I believe that most of the interaction with it is during the
 planning and save and not actual query run operations, and they are short
 and minimal compared to data fetching and manipulation so I don't believe
 it will have big impact on query run...

 On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
 mich.talebza...@gmail.com>:

> Well, I will be surprised because Derby database is single threaded
> and won't be much of a use here.
>
> Most Hive metastore in the commercial world utilise postgres or Oracle
> for metastore that are battle proven, replicated and backed up.
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner
> Von Braun
> )".
>
>
> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek 
> wrote:
>
>> Yes, in memory hive catalog backed by local Derby DB.
>> And again, I presume that most metadata related parts are during
>> planning and not actual run, so I don't see why it should strongly affect
>> query performance.
>>
>> Thanks,
>>
>>
>> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> With regard to your point below
>>>
>>> "The thing I'm missing is this: let's say that the output format I
>>> choose is delta lake or iceberg or whatever format that uses parquet. 
>>> Where
>>> does the catalog implementation (which holds metadata afaik, same 
>>> metadata
>>> that iceberg and delta lake save for their tables about their columns)
>>> comes into play and why should it 
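
The configuration being compared in this exchange is Spark's catalog
implementation. A minimal sketch of the two options, assuming a plain local
session (this is a static conf, so it must be set before the SparkSession
starts):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      // "in-memory": fast but ephemeral; metadata is lost on restart.
      // "hive": persistent metastore, backed by embedded Derby by default
      // or by an external database (e.g. Postgres) in production setups.
      .config("spark.sql.catalogImplementation", "in-memory")
      .getOrCreate()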

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread L. C. Hsieh
+1

On Thu, Apr 25, 2024 at 11:19 AM Maciej  wrote:
>
> +1
>
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
> On 4/25/24 6:21 PM, Reynold Xin wrote:
>
> +1
>
> On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale 
>  wrote:
>>
>> +1
>>
>> On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun  wrote:
>>>
>>> FYI, there is a proposal to drop Python 3.8 because its EOL is October 2024.
>>>
>>> https://github.com/apache/spark/pull/46228
>>> [SPARK-47993][PYTHON] Drop Python 3.8
>>>
>>> Since it's still alive and there will be an overlap between the lifecycle 
>>> of Python 3.8 and Apache Spark 4.0.0, please give us your feedback on the 
>>> PR, if you have any concerns.
>>>
>>> From my side, I agree with this decision.
>>>
>>> Thanks,
>>> Dongjoon.




Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread L. C. Hsieh
+1

On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
>
> +1
>
> On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
>>
>> I'll start with my +1.
>>
>> - Checked checksum and signature
>> - Checked Scala/Java/R/Python/SQL Document's Spark version
>> - Checked published Maven artifacts
>> - All CIs passed.
>>
>> Thanks,
>> Dongjoon.
>>
>> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 3.4.3.
>> >
>> > The vote is open until April 18th 1AM (PDT) and passes if a majority +1 PMC
>> > votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.4.3
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see https://spark.apache.org/
>> >
>> > The tag to be voted on is v3.4.3-rc2 (commit
>> > 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
>> > https://github.com/apache/spark/tree/v3.4.3-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1453/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
>> >
>> > The list of bug fixes going into 3.4.3 can be found at the following URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12353987
>> >
>> > This release is using the release script of the tag v3.4.3-rc2.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.4.3?
>> > ===
>> >
>> > The current list of open tickets targeted at 3.4.3 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> > Version/s" = 3.4.3
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
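
For the Java/Scala testing step described in the FAQ above, a minimal sbt
sketch (the staging URL is the one given for this RC; it assumes the staged
artifacts resolve under the plain 3.4.3 version string):

    // build.sbt
    resolvers += "Apache Spark 3.4.3 RC2 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1453/"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.3"

As the email notes, clean the local artifact cache before and after testing so
later builds don't silently pick up the RC.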




[VOTE][RESULT] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-15 Thread L. C. Hsieh
Hi all,

The vote passes with 7 +1s (5 binding).

(* = binding)
+1:
Dongjoon Hyun (*)
Liang-Chi Hsieh (*)
Huaxin Gao (*)
Bo Yang
Xiao Li (*)
Chao Sun (*)
Hussein Awala

+0: None

-1: None

Thanks.




Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread L. C. Hsieh
+1

On Sat, Apr 13, 2024 at 4:12 PM Hyukjin Kwon  wrote:
>
> +1
>
> On Sun, Apr 14, 2024 at 7:46 AM Chao Sun  wrote:
>>
>> +1.
>>
>> This feature is very helpful for guarding against correctness issues, such 
>> as null results due to invalid input or math overflows. It’s been there for 
>> a while now and it’s a good time to enable it by default as Spark enters the 
>> next major release.
>>
>> On Sat, Apr 13, 2024 at 3:27 PM Dongjoon Hyun  wrote:
>>>
>>> I'll start from my +1.
>>>
>>> Dongjoon.
>>>
>>> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
>>> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
>>> > The technical scope is defined in the following PR which is
>>> > one line of code change and one line of migration guide.
>>> >
>>> > - DISCUSSION:
>>> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
>>> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
>>> > - PR: https://github.com/apache/spark/pull/46013
>>> >
>>> > The vote is open until April 17th 1AM (PST) and passes
>>> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Use ANSI SQL mode by default
>>> > [ ] -1 Do not use ANSI SQL mode by default because ...
>>> >
>>> > Thank you in advance.
>>> >
>>> > Dongjoon
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>




Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread L. C. Hsieh
+1

Thank you, Dongjoon. Yeah, we may need to customize the merge script
for a particular repository.


On Fri, Apr 12, 2024 at 9:07 AM Dongjoon Hyun  wrote:
>
> +1
>
> Thank you!
>
> I hope we can customize `dev/merge_spark_pr.py` script per repository after 
> this PR.
>
> Dongjoon.
>
> On 2024/04/12 03:28:36 "L. C. Hsieh" wrote:
> > Hi all,
> >
> > Thanks for all discussions in the thread of "Versioning of Spark
> > Operator": https://lists.apache.org/thread/zhc7nb2sxm8jjxdppq8qjcmlf4rcsthh
> >
> > I would like to create this vote to get the consensus for versioning
> > of the Spark Kubernetes Operator.
> >
> > The proposal is to use an independent versioning for the Spark
> > Kubernetes Operator.
> >
> > Please vote on adding new `Versions` in Apache Spark JIRA which can be
> > used for places like "Fix Version/s" in the JIRA tickets of the
> > operator.
> >
> > The new `Versions` will be `kubernetes-operator-` prefix, for example
> > `kubernetes-operator-0.1.0`.
> >
> > The vote is open until April 15th 1AM (PST) and passes if a majority
> > +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Adding the new `Versions` for Spark Kubernetes Operator in
> > Apache Spark JIRA
> > [ ] -1 Do not add the new `Versions` because ...
> >
> > Thank you.
> >
> >
> > Note that this is not a SPIP vote and also not a release vote. I don't
> > find similar votes in previous threads. This is made similarly like a
> > SPIP or a release vote. So I think it should be okay. Please correct
> > me if this vote format is not good for you.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread L. C. Hsieh
+1

I believe ANSI mode is well developed after many releases; no doubt it
can be used.
Since it is very easy to disable it and restore the current behavior, I
guess the impact will be limited.
Do we know the possible impacts, i.e., what the major behavior changes
are (e.g., what kinds of queries/expressions will fail)? We can
describe them in the release notes.

On Thu, Apr 11, 2024 at 10:29 PM Gengliang Wang  wrote:
>
>
> +1, enabling Spark's ANSI SQL mode in version 4.0 will significantly enhance 
> data quality and integrity. I fully support this initiative.
>
> > In other words, the current Spark ANSI SQL implementation becomes the first 
> > implementation for Spark SQL users to face at first while providing
> > `spark.sql.ansi.enabled=false` in the same way without losing any
> > capability.
>
> BTW, the try_* functions and SQL Error Attribution Framework will also be 
> beneficial in migrating to ANSI SQL mode.
>
>
> Gengliang
>
>
> On Thu, Apr 11, 2024 at 7:56 PM Dongjoon Hyun  wrote:
>>
>> Hi, All.
>>
>> Thanks to you, we've been achieving many things and have on-going SPIPs.
>> I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more narrowly
>> by asking your opinions about Apache Spark's ANSI SQL mode.
>>
>> https://issues.apache.org/jira/browse/SPARK-44111
>> Prepare Apache Spark 4.0.0
>>
>> SPARK-44444 was proposed last year (on 15/Jul/23) as one of the desirable
>> items for 4.0.0 because it's a big behavior change.
>>
>> https://issues.apache.org/jira/browse/SPARK-44444
>> Use ANSI SQL mode by default
>>
>> Historically, spark.sql.ansi.enabled was added at Apache Spark 3.0.0 and has
>> been aiming to provide a better Spark SQL compatibility in a standard way.
>> We also have a daily CI to protect the behavior too.
>>
>> https://github.com/apache/spark/actions/workflows/build_ansi.yml
>>
>> However, it's still behind the configuration with several known issues, e.g.,
>>
>> SPARK-41794 Reenable ANSI mode in test_connect_column
>> SPARK-41547 Reenable ANSI mode in test_connect_functions
>> SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard
>>
>> To be clear, we know that many DBMSes have their own implementations of
>> SQL standard and not the same. Like them, SPARK-44444 aims to enable
>> only the existing Spark's configuration, `spark.sql.ansi.enabled=true`.
>> There is nothing more than that.
>>
>> In other words, the current Spark ANSI SQL implementation becomes the first
>> implementation for Spark SQL users to face at first while providing
>> `spark.sql.ansi.enabled=false` in the same way without losing any capability.
>>
>> If we don't want this change for some reasons, we can simply exclude
>> SPARK-44444 from SPARK-44111 as a part of Apache Spark 4.0.0 preparation.
>> It's time just to make a go/no-go decision for this item for the global 
>> optimization
>> for Apache Spark 4.0.0 release. After 4.0.0, it's unlikely for us to aim
>> for this again for the next four years until 2028.
>>
>> WDYT?
>>
>> Bests,
>> Dongjoon
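
To make the behavior difference under discussion concrete, a small sketch
using the existing flag and one of the try_* functions Gengliang mentions
(the input values are illustrative):

    // Pre-4.0 default: invalid input silently degrades to NULL.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('abc' AS INT)").show()      // NULL

    // ANSI mode: the same cast fails at runtime instead.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    // spark.sql("SELECT CAST('abc' AS INT)").show()   // throws a runtime error

    // try_cast keeps the permissive NULL-on-error behavior under ANSI mode.
    spark.sql("SELECT TRY_CAST('abc' AS INT)").show()  // NULL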




[VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-11 Thread L. C. Hsieh
Hi all,

Thanks for all discussions in the thread of "Versioning of Spark
Operator": https://lists.apache.org/thread/zhc7nb2sxm8jjxdppq8qjcmlf4rcsthh

I would like to create this vote to get the consensus for versioning
of the Spark Kubernetes Operator.

The proposal is to use an independent versioning for the Spark
Kubernetes Operator.

Please vote on adding new `Versions` in Apache Spark JIRA which can be
used for places like "Fix Version/s" in the JIRA tickets of the
operator.

The new `Versions` will be `kubernetes-operator-` prefix, for example
`kubernetes-operator-0.1.0`.

The vote is open until April 15th 1AM (PST) and passes if a majority
+1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Adding the new `Versions` for Spark Kubernetes Operator in
Apache Spark JIRA
[ ] -1 Do not add the new `Versions` because ...

Thank you.


Note that this is not a SPIP vote and also not a release vote. I don't
find similar votes in previous threads. This is made similarly like a
SPIP or a release vote. So I think it should be okay. Please correct
me if this vote format is not good for you.




Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread L. C. Hsieh
+1 for Wenchen's point.

I don't see a strong reason to pull these transformations into Spark
instead of keeping them in third party packages/projects.

On Wed, Apr 10, 2024 at 5:32 AM Wenchen Fan  wrote:
>
> It's good to reduce duplication between different native accelerators of 
> Spark, and AFAIK there is already a project trying to solve it: 
> https://substrait.io/
>
> I'm not sure why we need to do this inside Spark, instead of doing the 
> unification for a wider scope (for all engines, not only Spark).
>
>
> On Wed, Apr 10, 2024 at 10:11 AM Holden Karau  wrote:
>>
>> I like the idea of improving flexibility of Sparks physical plans and really 
>> anything that might reduce code duplication among the ~4 or so different 
>> accelerators.
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun  wrote:
>>>
>>> Thank you for sharing, Jia.
>>>
>>> I have the same questions like the previous Weiting's thread.
>>>
>>> Do you think you can share the future milestone of Apache Gluten?
>>> I'm wondering when the first stable release will come and how we can 
>>> coordinate across the ASF communities.
>>>
>>> > This project is still under active development now, and doesn't have a 
>>> > stable release.
>>> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>>>
>>> In the Apache Spark community, Apache Spark 3.2 and 3.3 is the end of 
>>> support.
>>> And, 3.4 will have 3.4.3 next week and 3.4.4 (another EOL release) is 
>>> scheduled in October.
>>>
>>> For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only if there 
>>> is something we need to do from Spark side.
>>
>> +1 I think any changes need to target 4.0
>>>
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:

 Apache Spark currently lacks an official mechanism to support 
 cross-platform execution of physical plans. The Gluten project offers a 
 mechanism that utilizes the Substrait standard to convert and optimize 
 Spark's physical plans. By introducing Gluten's plan conversion, 
 validation, and fallback mechanisms into Spark, we can significantly 
 enhance the portability and interoperability of Spark's physical plans, 
 enabling them to operate across a broader spectrum of execution 
 environments without requiring users to migrate, while also improving 
 Spark's execution efficiency through the utilization of Gluten's advanced 
 optimization techniques. And the integration of Gluten into Spark has 
 already shown significant performance improvements with ClickHouse and 
 Velox backends and has been successfully deployed in production by several 
 customers.

 References:
 JIRA Ticket
 SPIP Doc

 Your feedback and comments are welcome and appreciated.  Thanks.

 Thanks,
 Jia Ke




Re: Versioning of Spark Operator

2024-04-10 Thread L. C. Hsieh
This approach makes sense to me.

If the Spark K8s operator were aligned with Spark versions, it would,
for example, use 4.0.0 now.
Because these JIRA tickets do not actually target Spark 4.0.0, that
would cause confusion and raise more questions, such as whether we
should include Spark operator JIRAs in the release notes when we cut a
Spark release.

So I think an independent version number for Spark K8s operator would
be a better option.

If there are no more options or comments, I will create a vote later
to create new "Versions" in Apache Spark JIRA.

Thank you all.

On Wed, Apr 10, 2024 at 12:20 AM Dongjoon Hyun  wrote:
>
> Ya, that would work.
>
> Inevitably, I looked at Apache Flink K8s Operator's JIRA and GitHub repo.
>
> It looks reasonable to me.
>
> Although they share the same JIRA, they choose different patterns per place.
>
> 1. In POM file and Maven Artifact, independent version number.
> 1.8.0
>
> 2. Tag is also based on the independent version number
> https://github.com/apache/flink-kubernetes-operator/tags
> - release-1.8.0
> - release-1.7.0
>
> 3. JIRA Fixed Version is `kubernetes-operator-` prefix.
> https://issues.apache.org/jira/browse/FLINK-34957
> > Fix Version/s: kubernetes-operator-1.9.0
>
> Maybe, we can borrow this pattern.
>
> I guess we need a vote for any further decision because we need to create new 
> `Versions` in Apache Spark JIRA.
>
> Dongjoon.
>




Re: Versioning of Spark Operator

2024-04-10 Thread L. C. Hsieh
Yeah, I guess the first release of the Spark K8s Operator would be
something like 0.1.0 instead of 4.0.0, for example.

That makes it hard to align with Spark versions, doesn't it?


On Tue, Apr 9, 2024 at 10:15 AM Dongjoon Hyun  wrote:
>
> Ya, that's simple and possible.
>
> However, it may cause many confusions because it implies that new `Spark K8s 
> Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic 
> Versioning` policy like Apache Spark 4.0.0.
>
> In addition, `Versioning` is directly related to the Release Cadence. It's 
> unlikely for us to have `Spark K8s Operator` and `Spark Connect Go` releases 
> at every Apache Spark maintenance release. For example, there is no commit in 
> Spark Connect Go repository.
>
> I believe the versioning and release cadence is related to those subprojects' 
> maturity more.
>
> Dongjoon.
>
> On 2024/04/09 16:59:40 DB Tsai wrote:
> >  Aligning with Spark releases is sensible, as it allows us to guarantee 
> > that the Spark operator functions correctly with the new version while also 
> > maintaining support for previous versions.
> >
> > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> >
> > > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan  wrote:
> > >
> > >
> > >   I am trying to understand if we can simply align with Spark's version
> > > for this?
> > > It makes the release and JIRA management much simpler for developers
> > > and more intuitive for users.
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > > <mailto:dongj...@apache.org>> wrote:
> > >> Hi, Liang-Chi.
> > >>
> > >> Thank you for leading Apache Spark K8s operator as a shepherd.
> > >>
> > >> I took a look at `Apache Spark Connect Go` repo mentioned in the thread. 
> > >> Sadly, there is no release at all and no activity since last 6 months. 
> > >> It seems to be the first time for Apache Spark community to consider 
> > >> these sister repositories (Go and K8s Operator).
> > >>
> > >> https://github.com/apache/spark-connect-go/commits/master/
> > >>
> > >> Dongjoon.
> > >>
> > >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > >> > Hi all,
> > >> >
> > >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > >> > and the first PR is created.
> > >> > Thank you for the review from the community so far.
> > >> >
> > >> > About the versioning of Spark Operator, there are questions.
> > >> >
> > >> > As we are using Spark JIRA, when we are going to merge PRs, we need to
> > >> > choose a Spark version. However, the Spark Operator is versioning
> > >> > differently than Spark. I'm wondering how we deal with this?
> > >> >
> > >> > Not sure if Connect also has its versioning different to Spark? If so,
> > >> > maybe we can follow how Connect does.
> > >> >
> > >> > Can someone who is familiar with Connect versioning give some 
> > >> > suggestions?
> > >> >
> > >> > Thank you.
> > >> >
> > >> > Liang-Chi
> > >> >
> > >> > -
> > >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> > >> > <mailto:dev-unsubscr...@spark.apache.org>
> > >> >
> > >> >
> > >>
> > >> -
> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> > >> <mailto:dev-unsubscr...@spark.apache.org>
> > >>
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: Versioning of Spark Operator

2024-04-09 Thread L. C. Hsieh
For the Spark Operator, I think the answer is yes. My impression is
that the Spark Operator should be Spark version-agnostic. Zhou,
please correct me if I'm wrong.
I am not sure about the Spark Connect Go client, but if it is going
to talk to a Spark cluster, I guess it is still tied to the Spark
version (there are compatibility issues).


> On 2024/04/09 21:35:45 bo yang wrote:
> > Thanks Liang-Chi for the Spark Operator work, and also the discussion here!
> >
> > For Spark Operator and Connector Go Client, I am guessing they need to
> > support multiple versions of Spark? e.g. same Spark Operator may support
> > running multiple versions of Spark, and Connector Go Client might support
> > multiple versions of Spark driver as well.
> >
> > How do people think of using the minimum supported Spark version as the
> > version name for Spark Operator and Connector Go Client? For example,
> > Spark Operator 3.5.x supports Spark 3.5 and above.
> >
> > Best,
> > Bo
> >
> >
> > On Tue, Apr 9, 2024 at 10:14 AM Dongjoon Hyun  wrote:
> >
> > > Ya, that's simple and possible.
> > >
> > > However, it may cause many confusions because it implies that new `Spark
> > > K8s Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic
> > > Versioning` policy like Apache Spark 4.0.0.
> > >
> > > In addition, `Versioning` is directly related to the Release Cadence. It's
> > > unlikely for us to have `Spark K8s Operator` and `Spark Connect Go`
> > > releases at every Apache Spark maintenance release. For example, there is
> > > no commit in Spark Connect Go repository.
> > >
> > > I believe the versioning and release cadence is related to those
> > > subprojects' maturity more.
> > >
> > > Dongjoon.
> > >
> > > On 2024/04/09 16:59:40 DB Tsai wrote:
> > > >  Aligning with Spark releases is sensible, as it allows us to guarantee
> > > that the Spark operator functions correctly with the new version while 
> > > also
> > > maintaining support for previous versions.
> > > >
> > > > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> > > >
> > > > > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan 
> > > wrote:
> > > > >
> > > > >
> > > > >   I am trying to understand if we can simply align with Spark's
> > > > > version for this?
> > > > > It makes the release and JIRA management much simpler for developers
> > > > > and more intuitive for users.
> > > > >
> > > > > Regards,
> > > > > Mridul
> > > > >
> > > > >
> > > > > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > > <mailto:dongj...@apache.org>> wrote:
> > > > >> Hi, Liang-Chi.
> > > > >>
> > > > >> Thank you for leading Apache Spark K8s operator as a shepherd.
> > > > >>
> > > > >> I took a look at `Apache Spark Connect Go` repo mentioned in the
> > > thread. Sadly, there is no release at all and no activity since last 6
> > > months. It seems to be the first time for Apache Spark community to
> > > consider these sister repositories (Go and K8s Operator).
> > > > >>
> > > > >> https://github.com/apache/spark-connect-go/commits/master/
> > > > >>
> > > > >> Dongjoon.
> > > > >>
> > > > >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > > > >> > Hi all,
> > > > >> >
> > > > >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > > > >> > and the first PR is created.
> > > > >> > Thank you for the review from the community so far.
> > > > >> >
> > > > >> > About the versioning of Spark Operator, there are questions.
> > > > >> >
> > > > >> > As we are using Spark JIRA, when we are going to merge PRs, we need
> > > to
> > > > >> > choose a Spark version. However, the Spark Operator is versioning
> > > > >> > differently than Spark. I'm wondering how we deal with this?
> > > > >> >
> > > > >> > Not sure if Connect also has its versioning different to Spark? If
> > > so,
> > > > >> > maybe we can follow how Connect does.
> > > > >> >
> > > > >> > Can someone who is familiar with Connect versioning give some
> > > suggestions?
> > > > >> >
> > > > >> > Thank you.
> > > > >> >
> > > > >> > Liang-Chi
> > > > >> >
> > > > >> >
> > > -
> > > > >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  > > dev-unsubscr...@spark.apache.org>
> > > > >> >
> > > > >> >
> > > > >>
> > > > >> -
> > > > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  > > dev-unsubscr...@spark.apache.org>
> > > > >>
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Versioning of Spark Operator

2024-04-08 Thread L. C. Hsieh
Hi all,

We've opened the dedicated repository of Spark Kubernetes Operator,
and the first PR is created.
Thank you for the review from the community so far.

About the versioning of Spark Operator, there are questions.

As we are using Spark JIRA, when we are going to merge PRs, we need to
choose a Spark version. However, the Spark Operator is versioning
differently than Spark. I'm wondering how we deal with this?

Not sure if Connect also has its versioning different to Spark? If so,
maybe we can follow how Connect does.

Can someone who is familiar with Connect versioning give some suggestions?

Thank you.

Liang-Chi




Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread L. C. Hsieh
+1

Thanks Dongjoon!

On Sun, Apr 7, 2024 at 1:56 AM Kent Yao  wrote:
>
> +1, thank you, Dongjoon
>
>
> Kent
>
> On Sun, Apr 7, 2024 at 14:54, Holden Karau wrote:
> >
> > Sounds good to me :)
> >
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.): 
> > https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
> >
> > On Sat, Apr 6, 2024 at 2:51 PM Dongjoon Hyun  
> > wrote:
> >>
> >> Hi, All.
> >>
> >> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85 
> >> commits including important security and correctness patches like 
> >> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
> >>
> >> https://github.com/apache/spark/releases/tag/v3.4.2
> >>
> >> $ git log --oneline v3.4.2..HEAD | wc -l
> >>   85
> >>
> >> SPARK-45580 Subquery changes the output schema of the outer query
> >> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect 
> >> results
> >> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp 
> >> ntz
> >> SPARK-46794 Incorrect results due to inferred predicate from checkpoint 
> >> with subquery
> >> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> >> SPARK-45445 Upgrade snappy to 1.1.10.5
> >> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> >> SPARK-46239 Hide `Jetty` info
> >>
> >>
> >> Currently, I'm checking more applicable patches for branch-3.4. I'd like 
> >> to propose to release Apache Spark 3.4.3 and volunteer as the release 
> >> manager for Apache Spark 3.4.3. If there are no additional blockers, the 
> >> first tentative RC1 vote date is April 15th (Monday).
> >>
> >> WDYT?
> >>
> >>
> >> Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread L. C. Hsieh
+1

Thanks Hyukjin.

On Sun, Mar 31, 2024 at 10:52 PM Dongjoon Hyun  wrote:
>
> +1
>
> Thank you, Hyukjin.
>
> Dongjoon
>
> On Sun, Mar 31, 2024 at 19:07 Haejoon Lee 
>  wrote:
>>
>> +1
>>
>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark 
>>> Connect)
>>>
>>> JIRA
>>> Prototype
>>> SPIP doc
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks.




Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2024-03-28 Thread L. C. Hsieh
 in Salesforce. My team has 
>>>> been running the Spark on k8s operator (OSS from Google) in my company to 
>>>> serve Spark users on production for 4+ years, and we've been actively 
>>>> contributing to the Spark on k8s operator OSS and also, occasionally, the 
>>>> Spark OSS. According to our experience, Google's Spark Operator has its 
>>>> own problems, like its close coupling with the spark version, as well as 
>>>> the JVM overhead during job submission. However on the other side, it's 
>>>> been a great component in our team's service in the company, especially 
>>>> being written in golang, it's really easy to have it interact with k8s, 
>>>> and also its CRD covers a lot of different use cases, as it has been built 
>>>> up through time thanks to many users' contribution during these years. 
>>>> There were also a handful of Spark Summit sessions on Google's Spark
>>>> Operator that made it widely adopted.
>>>>
>>>> For this SPIP, I really love the idea of this proposal for the official 
>>>> k8s operator of Spark project, as well as the separate layer of the 
>>>> submission worker and being spark version agnostic. I think we can get the 
>>>> best of the two:
>>>> 1. I would advocate the new project to still use golang for the 
>>>> implementation, as golang is the go-to cloud native language that works 
>>>> the best with k8s.
>>>> 2. We make sure the functionality of the current Google's spark operator 
>>>> CRD is preserved in the new official Spark Operator; if we can make it 
>>>> compatible or even merge the two projects to make it the new official 
>>>> operator in spark project, it would be the best.
>>>> 3. The new Spark Operator should continue being spark agnostic and 
>>>> continue having this lightweight/separate layer of submission worker. 
>>>> We've seen scalability issues caused by the heavy JVM during spark-submit 
>>>> in Google's Spark Operator and we implemented an internal version of fix 
>>>> for it within our company.
>>>>
>>>> We can continue the discussion in more detail, but generally I love this 
>>>> move of the official spark operator, and I really appreciate the effort! 
>>>> In the SPIP doc. I see my comment has gained several upvotes from someone 
>>>> I don't know, so I believe there are other spark/spark operator users who 
>>>> agree with some of my points. Let me know what you all think and let's 
>>>> continue the discussion, so that we can make this operator a great new 
>>>> component of the Open Source Spark Project!
>>>>
>>>> Thanks!
>>>>
>>>> Shiqi
>>>>
>>>> On Mon, Nov 13, 2023 at 11:50 PM L. C. Hsieh  wrote:
>>>>>
>>>>> Thanks for all the support from the community for the SPIP proposal.
>>>>>
>>>>> Since all questions/discussion are settled down (if I didn't miss any
>>>>> major ones), if no more questions or concerns, I'll be the shepherd
>>>>> for this SPIP proposal and call for a vote tomorrow.
>>>>>
>>>>> Thank you all!
>>>>>
>>>>> On Mon, Nov 13, 2023 at 6:43 PM Zhou Jiang  wrote:
>>>>> >
>>>>> > Hi Holden,
>>>>> >
>>>>> > Thanks a lot for your feedback!
>>>>> > Yes, this proposal attempts to integrate existing solutions, especially 
>>>>> > from CRD perspective. The proposed schema retains similarity with 
>>>>> > current designs, while reducing duplicates and maintaining a single 
>>>>> > source of truth from conf properties. It also tends to be close to 
>>>>> > native integration with k8s to minimize schema changes for new features.
>>>>> > For dependencies, packing everything is the easiest way to get started. 
>>>>> > It would be straightforward to add --packages and --repositories 
>>>>> > support for Maven dependencies. It's technically possible to pull 
>>>>> > dependencies in cloud storage from init containers (if defined by 
>>>>> > user). It could be tricky to design a general solution that supports 
>>>>> > different cloud providers from the operator layer. An enhancement that 
>>>>> > I can think of is to add support for profile scripts that can enable 
>>>>> > addit

The dedicated repository for Kubernetes Operator for Apache Spark

2024-03-27 Thread L. C. Hsieh
Hi all,

Following the passed SPIP: An Official Kubernetes Operator for Apache Spark,
the developers have spent the last few months cleaning up and refactoring
the code for open-sourcing. They are now ready to contribute the code to
Spark.

As we discussed, I will create a dedicated repository for the
Kubernetes Operator for Apache Spark. The repository name will
be "spark-kubernetes-operator". I will try to create the repository
tomorrow.

After that, they will contribute the code as an initial PR for review
from the Spark community.

Thank you.

Liang-Chi

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-12 Thread L. C. Hsieh
+1


On Tue, Mar 12, 2024 at 8:20 AM Chao Sun  wrote:

> +1
>
> On Tue, Mar 12, 2024 at 8:03 AM Xiao Li 
> wrote:
>
>> +1
>>
>> On Tue, Mar 12, 2024 at 6:09 AM Holden Karau 
>> wrote:
>>
>>> +1
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>>
>>> On Mon, Mar 11, 2024 at 7:44 PM Reynold Xin 
>>> wrote:
>>>
 +1


 On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> +1 (non-binding), thanks Gengliang!
>
> On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang 
> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Structured Logging Framework for
>> Apache Spark
>>
>> References:
>>
>>- JIRA ticket 
>>- SPIP doc
>>
>> 
>>- Discussion thread
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>> Gengliang Wang
>>
>
>>
>> --
>>
>>


Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-10 Thread L. C. Hsieh
+1

On Wed, Jan 10, 2024 at 9:06 AM Bhuwan Sahni
 wrote:

> +1. This is a good addition.
>
> 
> *Bhuwan Sahni*
> Staff Software Engineer
>
> bhuwan.sa...@databricks.com
> 500 108th Ave. NE
> Bellevue, WA 98004
> USA
>
>
> On Wed, Jan 10, 2024 at 9:00 AM Burak Yavuz  wrote:
>
>> +1. Excited to see more stateful workloads with Structured Streaming!
>>
>>
>> Best,
>> Burak
>>
>> On Wed, Jan 10, 2024 at 8:21 AM Praveen Gattu
>>  wrote:
>>
>>> +1. This brings Structured Streaming a good solution for
>>> customers wanting to build stateful stream processing applications.
>>>
>>> On Wed, Jan 10, 2024 at 7:30 AM Bartosz Konieczny <
>>> bartkoniec...@gmail.com> wrote:
>>>
 +1 :)

 On Wed, Jan 10, 2024 at 9:57 AM Shixiong Zhu  wrote:

> +1 (binding)
>
> Best Regards,
> Shixiong Zhu
>
>
> On Tue, Jan 9, 2024 at 6:47 PM 刘唯  wrote:
>
>> This is a good addition! +1
>>
>> On Tue, Jan 9, 2024 at 13:17, Raghu Angadi wrote:
>>
>>> +1. This is a major improvement to the state API.
>>>
>>> Raghu.
>>>
>>> On Tue, Jan 9, 2024 at 1:42 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 +1 for me as well


 Mich Talebzadeh,
 Dad | Technologist | Solutions Architect | Engineer
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility
 for any loss, damage or destruction of data or any other property 
 which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary 
 damages
 arising from such loss, damage or destruction.




 On Tue, 9 Jan 2024 at 03:24, Anish Shrigondekar
  wrote:

> Thanks Jungtaek for creating the Vote thread.
>
> +1 (non-binding) from my side too.
>
> Thanks,
> Anish
>
> On Tue, Jan 9, 2024 at 6:09 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Starting with my +1 (non-binding). Thanks!
>>
>> On Tue, Jan 9, 2024 at 9:37 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Structured Streaming -
>>> Arbitrary State API v2.
>>>
>>> References:
>>>
>>>- JIRA ticket
>>>
>>>- SPIP doc
>>>
>>> 
>>>- Discussion thread
>>>
>>> 
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>

 --
 Bartosz Konieczny
 freelance data engineer
 https://www.waitingforcode.com
 https://github.com/bartosz25/
 https://twitter.com/waitingforcode




Re: [DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-08 Thread L. C. Hsieh
+1

I left some comments in the SPIP doc and got replies quickly. The new
API looks good and more comprehensive. I think it will help Spark
Structured Streaming to be more useful in more complicated streaming
use cases.

On Fri, Jan 5, 2024 at 8:15 PM Burak Yavuz  wrote:
>
> I'm also a +1 on the newer APIs. We had a lot of learnings from using 
> flatMapGroupsWithState and I believe that we can make the APIs a lot easier 
> to use.
>
> On Wed, Nov 29, 2023 at 6:43 PM Anish Shrigondekar 
>  wrote:
>>
>> Hi dev,
>>
>> Addressed the comments that Jungtaek had on the doc. Bumping the thread once 
>> again to see if other folks have any feedback on the proposal.
>>
>> Thanks,
>> Anish
>>
>> On Mon, Nov 27, 2023 at 8:15 PM Jungtaek Lim  
>> wrote:
>>>
>>> Kindly bump for better reach after the long holiday. Please kindly review 
>>> the proposal which opens the chance to address complex use cases of 
>>> streaming. Thanks!
>>>
>>> On Thu, Nov 23, 2023 at 8:19 AM Jungtaek Lim  
>>> wrote:

 Thanks Anish for proposing SPIP and initiating this thread! I believe this 
 SPIP will help a bunch of complex use cases on streaming.

 dev@: We are coincidentally initiating this discussion in thanksgiving 
 holidays. We understand people in the US may not have time to review the 
 SPIP, and we plan to bump this thread in early next week. We are open for 
 any feedback from non-US during the holiday. We can either address 
 feedback altogether after the holiday (Anish is in the US) or I can answer 
 if the feedback is more about the question. Thanks!

 On Thu, Nov 23, 2023 at 5:27 AM Anish Shrigondekar 
  wrote:
>
> Hi dev,
>
> I would like to start a discussion on "Structured Streaming - Arbitrary 
> State API v2". This proposal aims to address a bunch of limitations we 
> see today using mapGroupsWithState/flatMapGroupsWithState operator. The 
> detailed set of limitations is described in the SPIP doc.
>
> We propose to support various features such as multiple state variables
> (flexible data modeling), composite types, enhanced timer functionality,
> support for chaining operators after the new operator, handling initial state
> along with the state data source, schema evolution, etc. This will allow users
> to write more powerful streaming state management logic, primarily used in
> operational use cases. Other built-in stateful operators could also
> benefit from such changes in the future.
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-45939
> SPIP: 
> https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing
> Design Doc: 
> https://docs.google.com/document/d/1QjZmNZ-fHBeeCYKninySDIoOEWfX6EmqXs2lK097u9o/edit?usp=sharing
>
> cc - @Jungtaek Lim  who has graciously agreed to be the shepherd for this 
> project
>
> Looking forward to your feedback !
>
> Thanks,
> Anish
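
For readers new to the operator being generalized: in the current API, all user
state for a key hangs off a single GroupState[S] value, which is the root of
several limitations the SPIP lists. A minimal sketch of today's API (the event
type and state class are illustrative):

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class RunningCount(count: Long)

    // One state object of a single type per key: no multiple named state
    // variables, limited timers, no chaining of operators afterwards.
    def countEvents(
        key: String,
        events: Iterator[String],
        state: GroupState[RunningCount]): Iterator[(String, Long)] = {
      val updated = state.getOption.map(_.count).getOrElse(0L) + events.size
      state.update(RunningCount(updated))
      Iterator.single((key, updated))
    }

    // usage: events.groupByKey(identity)
    //   .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(countEvents)

The v2 proposal replaces this single-value model with multiple named state
variables, composite types, and enhanced timers, as described in the SPIP doc.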




Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-10 Thread L. C. Hsieh
+1

On Sun, Dec 10, 2023 at 6:15 PM Kent Yao  wrote:
>
> +1(non-binding
>
> Kent Yao
>
> On Mon, Dec 11, 2023 at 09:33, Yuming Wang wrote:
> >
> > +1
> >
> > On Mon, Dec 11, 2023 at 5:55 AM Dongjoon Hyun  wrote:
> >>
> >> +1
> >>
> >> Dongjoon
> >>
> >> On 2023/12/08 21:41:00 Dongjoon Hyun wrote:
> >> > Please vote on releasing the following candidate as Apache Spark version
> >> > 3.3.4.
> >> >
> >> > The vote is open until December 15th 1AM (PST) and passes if a majority 
> >> > +1
> >> > PMC votes are cast, with a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 3.3.4
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see https://spark.apache.org/
> >> >
> >> > The tag to be voted on is v3.3.4-rc1 (commit
> >> > 18db204995b32e87a650f2f09f9bcf047ddafa90)
> >> > https://github.com/apache/spark/tree/v3.3.4-rc1
> >> >
> >> > The release files, including signatures, digests, etc. can be found at:
> >> >
> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/
> >> >
> >> >
> >> > Signatures used for Spark RCs can be found in this file:
> >> >
> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >
> >> >
> >> > The staging repository for this release can be found at:
> >> >
> >> > https://repository.apache.org/content/repositories/orgapachespark-1451/
> >> >
> >> >
> >> > The documentation corresponding to this release can be found at:
> >> >
> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-docs/
> >> >
> >> >
> >> > The list of bug fixes going into 3.3.4 can be found at the following URL:
> >> >
> >> > https://issues.apache.org/jira/projects/SPARK/versions/12353505
> >> >
> >> >
> >> > This release is using the release script of the tag v3.3.4-rc1.
> >> >
> >> >
> >> > FAQ
> >> >
> >> >
> >> > =
> >> >
> >> > How can I help test this release?
> >> >
> >> > =
> >> >
> >> >
> >> >
> >> > If you are a Spark user, you can help us test this release by taking
> >> >
> >> > an existing Spark workload and running on this release candidate, then
> >> >
> >> > reporting any regressions.
> >> >
> >> >
> >> >
> >> > If you're working in PySpark you can set up a virtual env and install
> >> >
> >> > the current RC and see if anything important breaks, in the Java/Scala
> >> >
> >> > you can add the staging repository to your projects resolvers and test
> >> >
> >> > with the RC (make sure to clean up the artifact cache before/after so
> >> >
> >> > you don't end up building with an out-of-date RC going forward).
> >> >
> >> >
> >> >
> >> > ===
> >> >
> >> > What should happen to JIRA tickets still targeting 3.3.4?
> >> >
> >> > ===
> >> >
> >> >
> >> >
> >> > The current list of open tickets targeted at 3.3.4 can be found at:
> >> >
> >> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> >> > Version/s" = 3.3.4
> >> >
> >> >
> >> > Committers should look at those and triage. Extremely important bug
> >> >
> >> > fixes, documentation, and API tweaks that impact compatibility should
> >> >
> >> > be worked on immediately. Everything else please retarget to an
> >> >
> >> > appropriate release.
> >> >
> >> >
> >> >
> >> > ==
> >> >
> >> > But my bug isn't fixed?
> >> >
> >> > ==
> >> >
> >> >
> >> >
> >> > In order to make timely releases, we will typically not hold the
> >> >
> >> > release unless the bug in question is a regression from the previous
> >> >
> >> > release. That being said, if there is something which is a regression
> >> >
> >> > that has not been correctly targeted please ping me or a committer to
> >> >
> >> > help target the issue.
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: Apache Spark 3.3.4 EOL Release?

2023-12-04 Thread L. C. Hsieh
+1

Thanks Dongjoon!

On Mon, Dec 4, 2023 at 9:26 AM Yang Jie  wrote:
>
> +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
>
> Jie Yang
>
> On 2023/12/04 15:08:25 Tom Graves wrote:
> >  +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
> > Tom
> > On Friday, December 1, 2023 at 02:48:22 PM CST, Dongjoon Hyun 
> >  wrote:
> >
> >  Hi, All.
> >
> > Since the Apache Spark 3.3.0 RC6 vote passed on Jun 14, 2022, branch-3.3 
> > has been maintained and served well until now.
> >
> > - https://github.com/apache/spark/releases/tag/v3.3.0 (tagged on Jun 9th, 
> > 2022)
> > - https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm (vote 
> > result on June 14th, 2022)
> >
> > As of today, branch-3.3 has 56 additional patches after v3.3.3 (tagged on 
> > Aug 3rd about 4 month ago) and reaches the end-of-life this month according 
> > to the Apache Spark release cadence, 
> > https://spark.apache.org/versioning-policy.html .
> >
> > $ git log --oneline v3.3.3..HEAD | wc -l
> > 56
> >
> > Along with the recent Apache Spark 3.4.2 release, I hope the users can get 
> > a chance to have these last bits of Apache Spark 3.3.x, and I'd like to 
> > propose to have Apache Spark 3.3.4 EOL Release vote on December 11th and 
> > volunteer as the release manager.
> >
> > WDYT?
> >
> > Please let us know if you need more patches on branch-3.3.
> >
> > Thanks,
> > Dongjoon.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: [VOTE] Release Spark 3.4.2 (RC1)

2023-11-29 Thread L. C. Hsieh
+1

Thanks Dongjoon!

On Wed, Nov 29, 2023 at 7:53 PM Mridul Muralidharan  wrote:
>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>
> Regards,
> Mridul
>
> On Wed, Nov 29, 2023 at 5:08 AM Yang Jie  wrote:
>>
>> +1(non-binding)
>>
>> Jie Yang
>>
>> On 2023/11/29 02:08:04 Kent Yao wrote:
>> > +1(non-binding)
>> >
>> > Kent Yao
>> >
>> > On 2023/11/27 01:12:53 Dongjoon Hyun wrote:
>> > > Hi, Marc.
>> > >
>> > > Given that it exists in 3.4.0 and 3.4.1, I don't think it's a release
>> > > blocker for Apache Spark 3.4.2.
>> > >
>> > > When the patch is ready, we can consider it for 3.4.3.
>> > >
>> > > In addition, note that we categorized release-blocker-level issues by
>> > > marking 'Blocker' priority with `Target Version` before the vote.
>> > >
>> > > Best,
>> > > Dongjoon.
>> > >
>> > >
>> > > On Sat, Nov 25, 2023 at 12:01 PM Marc Le Bihan  
>> > > wrote:
>> > >
>> > > > -1, if you can wait until the last remaining problem with generics (?) is
>> > > > entirely solved; it causes this exception to be thrown:
>> > > >
>> > > > java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast 
>> > > > to class [Ljava.lang.reflect.TypeVariable; ([Ljava.lang.Object; and 
>> > > > [Ljava.lang.reflect.TypeVariable; are in module java.base of loader 
>> > > > 'bootstrap')
>> > > > at 
>> > > > org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:116)
>> > > > at 
>> > > > org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
>> > > > at scala.collection.ArrayOps$.map$extension(ArrayOps.scala:929)
>> > > > at 
>> > > > org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
>> > > > at 
>> > > > org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60)
>> > > > at 
>> > > > org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53)
>> > > > at 
>> > > > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62)
>> > > > at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179)
>> > > > at org.apache.spark.sql.Encoders.bean(Encoders.scala)
>> > > >
>> > > >
>> > > > https://issues.apache.org/jira/browse/SPARK-45311
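
A minimal Java sketch of the pattern at issue: inferring a bean encoder for a
bean with a generic property. The bean below is hypothetical; SPARK-45311
carries the actual reproduction, and whether this exact shape throws depends
on the Spark version.

    import java.io.Serializable;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;

    // Hypothetical generic bean; JavaTypeInference.encoderFor must resolve
    // the type variable T while walking the bean's properties.
    public class Holder<T> implements Serializable {
        private T value;
        public T getValue() { return value; }
        public void setValue(T value) { this.value = value; }

        public static void main(String[] args) {
            // The call site that reaches JavaTypeInference.encoderFor(...)
            Encoder<Holder> encoder = Encoders.bean(Holder.class);
            System.out.println(encoder.schema());
        }
    }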
>> > > >
>> > > > Thanks !
>> > > >
>> > > > Marc Le Bihan
>> > > >
>> > > >
>> > > > On 25/11/2023 11:48, Dongjoon Hyun wrote:
>> > > >
>> > > > Please vote on releasing the following candidate as Apache Spark 
>> > > > version
>> > > > 3.4.2.
>> > > >
>> > > > The vote is open until November 30th 1AM (PST) and passes if a 
>> > > > majority +1
>> > > > PMC votes are cast, with a minimum of 3 +1 votes.
>> > > >
>> > > > [ ] +1 Release this package as Apache Spark 3.4.2
>> > > > [ ] -1 Do not release this package because ...
>> > > >
>> > > > To learn more about Apache Spark, please see https://spark.apache.org/
>> > > >
>> > > > The tag to be voted on is v3.4.2-rc1 (commit
>> > > > 0c0e7d4087c64efca259b4fb656b8be643be5686)
>> > > > https://github.com/apache/spark/tree/v3.4.2-rc1
>> > > >
>> > > > The release files, including signatures, digests, etc. can be found at:
>> > > > https://dist.apache.org/repos/dist/dev/spark/v3.4.2-rc1-bin/
>> > > >
>> > > > Signatures used for Spark RCs can be found in this file:
>> > > > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > > >
>> > > > The staging repository for this release can be found at:
>> > > > https://repository.apache.org/content/repositories/orgapachespark-1450/
>> > > >
>> > > > The documentation corresponding to this release can be found at:
>> > > > https://dist.apache.org/repos/dist/dev/spark/v3.4.2-rc1-docs/
>> > > >
>> > > > The list of bug fixes going into 3.4.2 can be found at the following 
>> > > > URL:
>> > > > https://issues.apache.org/jira/projects/SPARK/versions/12353368
>> > > >
>> > > > This release is using the release script of the tag v3.4.2-rc1.
>> > > >
>> > > > FAQ
>> > > >
>> > > > =
>> > > > How can I help test this release?
>> > > > =
>> > > >
>> > > > If you are a Spark user, you can help us test this release by taking
>> > > > an existing Spark workload and running on this release candidate, then
>> > > > reporting any regressions.
>> > > >
>> > > > If you're working in PySpark you can set up a virtual env and install
>> > > > the current RC and see if anything important breaks, in the Java/Scala
>> > > > you can add the staging repository to your projects resolvers and test
>> > > > with the RC (make sure to clean up the artifact cache before/after so
>> > > > you don't end up building with an out of date RC going forward).
>> > > >
>> > > > ===
>> > > > What should happen to JIRA tickets still targeting 3.4.2?
>> > > > ===
>> > > >
>> > > > The current list of open tickets targeted at 3.4.2 can be found at:
>> > > > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> > > > Version/s" = 3.4.2

[VOTE][RESULT] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-17 Thread L. C. Hsieh
Hi all,

The vote passes with 19 +1s (11 binding +1s).
Thanks to all who reviewed the SPIP doc and voted!

(* = binding)
+1:
- Ye Zhou
- L. C. Hsieh (*)
- Chao Sun (*)
- Vakaris Baškirov
- DB Tsai (*)
- Holden Karau (*)
- Lucian Neghina
- Mridul Muralidharan (*)
- Huaxin Gao (*)
- Cheng Pan
- Yuming Wang (*)
- Bo Yang
- Yikun Jiang (*)
- Xiao Li (*)
- Dongjoon Hyun (*)
- Ilan Filonenko
- Ruifeng Zheng (*)
- Jungtaek Lim
- Gabor Somogyi

+0: None

-1: None

Thanks!

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread L. C. Hsieh
+1

On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou  wrote:
>
> +1(Non-binding)
>
> On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh  wrote:
>>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: An Official Kubernetes Operator for
>> Apache Spark.
>>
>> The proposal is to develop an official Java-based Kubernetes operator
>> for Apache Spark to automate the deployment and simplify the lifecycle
>> management and orchestration of Spark applications and Spark clusters
>> on k8s at prod scale.
>>
>> This aims to reduce the learning curve and operation overhead for
>> Spark users so they can concentrate on core Spark logic.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
>>- SPIP doc: 
>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
> --
>
> Zhou, Ye  周晔

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread L. C. Hsieh
Hi all,

I’d like to start a vote for SPIP: An Official Kubernetes Operator for
Apache Spark.

The proposal is to develop an official Java-based Kubernetes operator
for Apache Spark to automate the deployment and simplify the lifecycle
management and orchestration of Spark applications and Spark clusters
on k8s at prod scale.

This aims to reduce the learning curve and operation overhead for
Spark users so they can concentrate on core Spark logic.

Please also refer to:

   - Discussion thread:
https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
   - JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
   - SPIP doc: 
https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE


Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …


Thank you!

Liang-Chi Hsieh

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-13 Thread L. C. Hsieh
Thanks for all the support from the community for the SPIP proposal.

Since all questions/discussions have settled (if I didn't miss any
major ones), and if there are no more questions or concerns, I'll be the
shepherd for this SPIP proposal and call for a vote tomorrow.

Thank you all!

On Mon, Nov 13, 2023 at 6:43 PM Zhou Jiang  wrote:
>
> Hi Holden,
>
> Thanks a lot for your feedback!
> Yes, this proposal attempts to integrate existing solutions, especially from 
> the CRD perspective. The proposed schema retains similarity with current 
> designs, while reducing duplication and maintaining a single source of truth 
> from conf properties. It also tends to stay close to native k8s integration 
> to minimize schema changes for new features.
> For dependencies, packing everything is the easiest way to get started. It 
> would be straightforward to add --packages and --repositories support for 
> Maven dependencies. It's technically possible to pull dependencies from cloud 
> storage via init containers (if defined by the user). It could be tricky to 
> design a general solution that supports different cloud providers from the 
> operator layer. An enhancement that I can think of is to add support for 
> profile scripts that can enable additional user-defined actions in 
> application containers.
> The operator does not have to build everything for k8s version compatibility. 
> Similar to Spark, the operator can be built on the Fabric8 client 
> (https://github.com/fabric8io/kubernetes-client) for support across versions, 
> given that it makes resource-management API calls similar to Spark's. For 
> tests, in addition to the fabric8 mock server, we may also borrow the idea 
> from the Flink operator of starting a minikube cluster for integration tests.
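
As an illustration of that point, a minimal sketch (assuming the Fabric8
client 6.x API; the "spark-apps" namespace is hypothetical, while the
spark-role=driver label is the one Spark itself sets on driver pods) of an
operator process watching Spark driver pods through the client:

    import io.fabric8.kubernetes.api.model.Pod;
    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClientBuilder;
    import io.fabric8.kubernetes.client.Watcher;
    import io.fabric8.kubernetes.client.WatcherException;

    public class DriverPodWatch {
        public static void main(String[] args) throws InterruptedException {
            try (KubernetesClient client = new KubernetesClientBuilder().build()) {
                client.pods()
                      .inNamespace("spark-apps")            // hypothetical namespace
                      .withLabel("spark-role", "driver")    // label Spark sets on drivers
                      .watch(new Watcher<Pod>() {
                          @Override
                          public void eventReceived(Action action, Pod pod) {
                              // A real operator would reconcile desired vs. observed
                              // state here; this sketch just logs the transition.
                              System.out.printf("%s %s phase=%s%n", action,
                                  pod.getMetadata().getName(),
                                  pod.getStatus() == null ? "?" : pod.getStatus().getPhase());
                          }
                          @Override
                          public void onClose(WatcherException cause) {
                              System.out.println("watch closed: " + cause);
                          }
                      });
                Thread.sleep(60_000);  // keep the demo process alive to receive events
            }
        }
    }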
> This operator is not starting from scratch as it is derived from an internal 
> project which has been working in prod scale for a few years. It aims to 
> include a few new features / enhancements, and a few re-architecture mostly 
> to incorporate lessons learnt for designing CRD / API perspective.
> Benchmarking operator performance alone can be nuanced, as it is often tied 
> to the underlying cluster. There's a testing strategy that Aaruna & I 
> discussed at a previous Data+AI Summit that involves scheduling wide (massive 
> numbers of light-weight applications) and deep (a single application 
> requesting a lot of executors with heavy IO) cases, revealing typical 
> bottlenecks in k8s API server and scheduler performance. Similar tests can be 
> performed for this as well.
>
> On Sun, Nov 12, 2023 at 4:32 PM Holden Karau  wrote:
>>
>> To be clear: I am generally supportive of the idea (+1) but have some 
>> follow-up questions:
>>
>> Have we taken the time to learn from the other operators? Do we have a 
>> compatible CRD/API or not (and if so why?)
>> The API seems to assume that everything is packaged in the container in 
>> advance, but I imagine that might not be the case for many folks who have 
>> Java or Python packages published to cloud storage and they want to use?
>> What's our plan for the testing on the potential version explosion (not 
>> tying ourselves to operator version -> spark version makes a lot of sense, 
>> but how do we reasonably assure ourselves that the cross product of Operator 
>> Version, Kube Version, and Spark Version all function)? Do we have CI 
>> resources for this?
>> Is there a current (non-open source operator) that folks from Apple are 
>> using and planning to open source, or is this a fresh "from the ground up" 
>> operator proposal?
>> One of the key reasons for this is listed as "An out-of-the-box automation 
>> solution that scales effectively" but I don't see any discussion of the 
>> target scale or plans to achieve it?
>>
>>
>>
>> On Thu, Nov 9, 2023 at 9:02 PM Zhou Jiang  wrote:
>>>
>>> Hi Spark community,
>>>
>>> I'm reaching out to initiate a conversation about the possibility of 
>>> developing a Java-based Kubernetes operator for Apache Spark. Following the 
>>> operator pattern 
>>> (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark 
>>> users may manage applications and related components seamlessly using 
>>> native tools like kubectl. The primary goal is to simplify the Spark user 
>>> experience on Kubernetes, minimizing the learning curve and operational 
>>> complexities and therefore enable users to focus on the Spark application 
>>> development.
>>>
>>> Although there are several open-source Spark on Kubernetes operators 
>>> available, none of them are officially integrated into the Apache Spark 
>>> project. As a result, these operators may lack active support and 
>>> development for new features. Within this proposal, our aim is to introduce 
>>> a Java-based Spark operator as an integral component of the Apache Spark 
>>> project. This solution has been employed internally at Apple for multiple 
>>> years, operating millions of executors in real production environments. The 
> >>> use of Java in this solution is intended to accommodate a wider user and 
> >>> contributor audience, especially those who are familiar with Scala.

Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread L. C. Hsieh
+1

On Thu, Nov 9, 2023 at 7:57 PM Chao Sun  wrote:
>
> +1
>
>
> On Thu, Nov 9, 2023 at 6:36 PM Xiao Li  wrote:
> >
> > +1
> >
> > huaxin gao wrote on Thu, Nov 9, 2023 at 16:53:
> >>
> >> +1
> >>
> >> On Thu, Nov 9, 2023 at 3:14 PM DB Tsai  wrote:
> >>>
> >>> +1
> >>>
> >>> To be completely transparent, I am employed in the same department as 
> >>> Zhou at Apple.
> >>>
> >>> I support this proposal, provided that we witness community adoption 
> >>> following the release of the Flink Kubernetes operator, streamlining 
> >>> Flink deployment on Kubernetes.
> >>>
> >>> A well-maintained official Spark Kubernetes operator is essential for our 
> >>> Spark community as well.
> >>>
> >>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> >>>
> >>> On Nov 9, 2023, at 12:05 PM, Zhou Jiang  wrote:
> >>>
> >>> Hi Spark community,
> >>>
> >>> I'm reaching out to initiate a conversation about the possibility of 
> >>> developing a Java-based Kubernetes operator for Apache Spark. Following 
> >>> the operator pattern 
> >>> (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark 
> >>> users may manage applications and related components seamlessly using 
> >>> native tools like kubectl. The primary goal is to simplify the Spark user 
> >>> experience on Kubernetes, minimizing the learning curve and operational 
> >>> complexities and therefore enable users to focus on the Spark application 
> >>> development.
> >>> Although there are several open-source Spark on Kubernetes operators 
> >>> available, none of them are officially integrated into the Apache Spark 
> >>> project. As a result, these operators may lack active support and 
> >>> development for new features. Within this proposal, our aim is to 
> >>> introduce a Java-based Spark operator as an integral component of the 
> >>> Apache Spark project. This solution has been employed internally at Apple 
> >>> for multiple years, operating millions of executors in real production 
> >>> environments. The use of Java in this solution is intended to accommodate 
> >>> a wider user and contributor audience, especially those who are familiar 
> >>> with Scala.
> >>> Ideally, this operator should have its dedicated repository, similar to 
> >>> Spark Connect Golang or Spark Docker, allowing it to maintain a loose 
> >>> connection with the Spark release cycle. This model is also followed by 
> >>> the Apache Flink Kubernetes operator.
> >>> We believe that this project holds the potential to evolve into a 
> >>> thriving community project over the long run. A comparison can be drawn 
> >>> with the Flink Kubernetes Operator: Apple has open-sourced its internal Flink 
> >>> Kubernetes operator, making it a part of the Apache Flink project 
> >>> (https://github.com/apache/flink-kubernetes-operator). This move has 
> >>> gained wide industry adoption and contributions from the community. In a 
> >>> mere year, the Flink operator has garnered more than 600 stars and has 
> >>> attracted contributions from over 80 contributors. This showcases the 
> >>> level of community interest and collaborative momentum that can be 
> >>> achieved in similar scenarios.
> >>> More details can be found at SPIP doc : Spark Kubernetes Operator 
> >>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> >>>
> >>> Thanks,
> >>>
> >>> --
> >>> Zhou JIANG
> >>>
> >>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.4.2 (?)

2023-11-07 Thread L. C. Hsieh
+1

On Tue, Nov 7, 2023 at 4:56 PM Dongjoon Hyun  wrote:
>
> Thank you all!
>
> Dongjoon
>
> On Mon, Nov 6, 2023 at 6:03 PM Holden Karau  wrote:
>>
>> +1
>>
>> On Mon, Nov 6, 2023 at 4:30 PM yangjie01  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> From: Yuming Wang 
>>> Date: Tuesday, November 7, 2023 07:00
>>> To: Santosh Pingale 
>>> Cc: Dongjoon Hyun , dev 
>>> Subject: Re: Apache Spark 3.4.2 (?)
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> On Tue, Nov 7, 2023 at 3:55 AM Santosh Pingale 
>>>  wrote:
>>>
>>> Makes sense given the nature of those commits.
>>>
>>>
>>>
>>> On Mon, Nov 6, 2023, 7:52 PM Dongjoon Hyun  wrote:
>>>
>>> Hi, All.
>>>
>>> Apache Spark 3.4.1 tag was created on Jun 19th and `branch-3.4` has 103 
>>> commits including important security and correctness patches like 
>>> SPARK-44251, SPARK-44805, and SPARK-44940.
>>>
>>> https://github.com/apache/spark/releases/tag/v3.4.1
>>>
>>> $ git log --oneline v3.4.1..HEAD | wc -l
>>> 103
>>>
>>> SPARK-44251 Potential for incorrect results or NPE when full outer 
>>> USING join has null key value
>>> SPARK-44805 Data lost after union using 
>>> spark.sql.parquet.enableNestedColumnVectorizedReader=true
>>> SPARK-44940 Improve performance of JSON parsing when 
>>> "spark.sql.json.enablePartialResults" is enabled
>>>
>>> Currently, I'm checking the following open correctness issues. I'd like to 
>>> propose to release Apache Spark 3.4.2 after resolving them and volunteer as 
>>> the release manager for Apache Spark 3.4.2. If there are no additional 
>>> blockers, the first tentative RC1 vote date is November 13rd (Monday). If 
>>> it takes some time to resolve the open correctness issues, we can start the 
>>> vote after Thanksgiving holiday.
>>>
>>> SPARK-44512 dataset.sort.select.write.partitionBy sorts wrong column
>>> SPARK-45282 Join loses records for cached datasets
>>>
>>> WDYT?
>>>
>>> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: State Data Source - Reader

2023-10-23 Thread L. C. Hsieh
+1

On Mon, Oct 23, 2023 at 6:31 PM Anish Shrigondekar
 wrote:
>
> +1 (non-binding)
>
> Thanks,
> Anish
>
> On Mon, Oct 23, 2023 at 5:01 PM Wenchen Fan  wrote:
>>
>> +1
>>
>> On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim  
>> wrote:
>>>
>>> Starting with my +1 (non-binding). Thanks!
>>>
>>> On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim  
>>> wrote:

 Hi all,

 I'd like to start the vote for SPIP: State Data Source - Reader.

 The high-level summary of the SPIP is that we propose a new data source 
 which enables the ability to read the state store in the checkpoint via 
 batch query. This would enable two major use cases: 1) constructing tests 
 that verify the state store, and 2) inspecting values in the state store 
 when investigating an incident.
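
As a usage sketch (hypothetical: the short format name and option key below
follow the SPIP's direction and may differ from whatever API is finally
adopted):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class StateReadSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("state-read-sketch")
                .master("local[*]")
                .getOrCreate();

            // Read the state store of a streaming query's checkpoint as a batch
            // DataFrame; "statestore" and "path" are assumptions, not final API.
            Dataset<Row> state = spark.read()
                .format("statestore")
                .option("path", "/checkpoints/my-query")
                .load();

            state.show();   // e.g. key/value columns of the stateful operator
            spark.stop();
        }
    }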

 References:

 JIRA ticket
 SPIP doc
 Discussion thread

 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thanks!
 Jungtaek Lim (HeartSaVioR)

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 3.3.3 (RC1)

2023-08-10 Thread L. C. Hsieh
+1

Thanks Yuming.

On Thu, Aug 10, 2023 at 3:24 PM Dongjoon Hyun  wrote:
>
> +1
>
> Dongjoon
>
> On 2023/08/10 07:14:07 yangjie01 wrote:
> > +1
> > Thanks, Jie Yang
> >
> >
> > From: Yuming Wang 
> > Date: Thursday, August 10, 2023 13:33
> > To: Dongjoon Hyun 
> > Cc: dev 
> > Subject: Re: [VOTE] Release Apache Spark 3.3.3 (RC1)
> >
> > +1 myself.
> >
> > On Tue, Aug 8, 2023 at 12:41 AM Dongjoon Hyun 
> > <dongjoon.h...@gmail.com> wrote:
> > Thank you, Yuming.
> >
> > Dongjoon.
> >
> > On Mon, Aug 7, 2023 at 9:30 AM yangjie01 
> > <yangji...@baidu.com> wrote:
> > HI,Dongjoon and Yuming
> >
> > I submitted a PR a few days ago to try to fix this issue: 
> > https://github.com/apache/spark/pull/42167.
> >  The reason for the failure is that the branch daily test and the master 
> > use the same yml file.
> >
> > Jie Yang
> >
> > From: Dongjoon Hyun <dongjoon.h...@gmail.com>
> > Date: Tuesday, August 8, 2023 00:18
> > To: Yuming Wang <yumw...@apache.org>
> > Cc: dev <dev@spark.apache.org>
> > Subject: Re: [VOTE] Release Apache Spark 3.3.3 (RC1)
> >
> > Hi, Yuming.
> >
> > One of the community GitHub Action test pipelines is consistently unhealthy 
> > due to the Python mypy linter.
> >
> > https://github.com/apache/spark/actions/workflows/build_branch33.yml
> >
> > It seems due to a pipeline difference, since the same Python mypy linter 
> > already passes in the commit build.
> >
> > Dongjoon.
> >
> >
> > On Fri, Aug 4, 2023 at 8:09 PM Yuming Wang 
> > <yumw...@apache.org> wrote:
> > Please vote on releasing the following candidate as Apache Spark version 
> > 3.3.3.
> >
> > The vote is open until 11:59pm Pacific time August 10th and passes if a 
> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.3.3
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see 
> > https://spark.apache.org
> >
> > The tag to be voted on is v3.3.3-rc1 (commit 
> > 8c2b3319c6734250ff9d72f3d7e5cab56b142195):
> > https://github.com/apache/spark/tree/v3.3.3-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-bin
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1445
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-docs
> >
> > The list of bug fixes going into 3.3.3 can be found at the following URL:
> > https://s.apache.org/rjci4
> >
> > This release is using the release script of the tag v3.3.3-rc1.
> >
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.3.3?
> > ===
> > The current list of open tickets targeted at 3.3.3 can be found at:
> > https://issues.apache.org/jira/projects/SPARK
> >  and search for "Target 

Re: Welcome two new Apache Spark committers

2023-08-07 Thread L. C. Hsieh
Congratulations!

On Mon, Aug 7, 2023 at 9:44 AM huaxin gao  wrote:
>
> Congratulations! Peter and Xiduo!
>
> On Mon, Aug 7, 2023 at 9:40 AM Dongjoon Hyun  wrote:
>>
>> Congratulations, Peter and Xiduo. :)
>>
>> Dongjoon.
>>
>> On Sun, Aug 6, 2023 at 10:08 PM XiDuo You  wrote:
>>>
>>> Thank you all !
>>>
>>> > Jia Fan wrote on Mon, Aug 7, 2023 at 11:31:
>>> >
>>> > Congratulations!
>>> > 
>>> >
>>> > Jia Fan
>>> >
>>> >
>>> > On Aug 7, 2023, at 11:28, Ye Xianjin wrote:
>>> >
>>> > Congratulations!
>>> >
>>> > Sent from my iPhone
>>> >
>>> > On Aug 7, 2023, at 11:16 AM, Yuming Wang  wrote:
>>> >
>>> > 
>>> >
>>> > Congratulations!
>>> >
>>> > On Mon, Aug 7, 2023 at 11:11 AM Kent Yao  wrote:
>>> >>
>>> >> Congrats! Peter and Xiduo!
>>> >>
>>> >> Cheng Pan wrote on Mon, Aug 7, 2023 at 11:01:
>>> >> >
>>> >> > Congratulations! Peter and Xiduo!
>>> >> >
>>> >> > Thanks,
>>> >> > Cheng Pan
>>> >> >
>>> >> >
>>> >> > > On Aug 7, 2023, at 10:58, Gengliang Wang  wrote:
>>> >> > >
>>> >> > > Congratulations! Peter and Xiduo!
>>> >> >
>>> >> >
>>> >> >
>>> >> > -
>>> >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >> >
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Time for Spark v3.5.0 release

2023-07-04 Thread L. C. Hsieh
+1

Thanks Yuanjian.

On Tue, Jul 4, 2023 at 7:45 AM yangjie01  wrote:
>
> +1
>
>
>
> From: Maxim Gekk 
> Date: Tuesday, July 4, 2023 17:24
> To: Kent Yao 
> Cc: "dev@spark.apache.org" 
> Subject: Re: Time for Spark v3.5.0 release
>
>
>
> +1
>
> On Tue, Jul 4, 2023 at 11:55 AM Kent Yao  wrote:
>
> +1, thank you
>
> Kent
>
> On 2023/07/04 05:32:52 Dongjoon Hyun wrote:
> > +1
> >
> > Thank you, Yuanjian
> >
> > Dongjoon
> >
> > On Tue, Jul 4, 2023 at 1:03 AM Hyukjin Kwon  wrote:
> >
> > > Yeah one day postponed shouldn't be a big deal.
> > >
> > > On Tue, Jul 4, 2023 at 7:10 AM Yuanjian Li  wrote:
> > >
> > >> Hi All,
> > >>
> > >> According to the Spark versioning policy at
> > >> https://spark.apache.org/versioning-policy.html, should we cut
> > >> *branch-3.5* on *July 17th, 2023*? (We initially proposed July 16th,
> > >> but since it's a Sunday, I suggest we postpone it by one day).
> > >>
> > >> I would like to volunteer as the release manager for Apache Spark 3.5.0.
> > >>
> > >> Best,
> > >> Yuanjian
> > >>
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread L. C. Hsieh
Thanks Dongjoon!

On Fri, Jun 23, 2023 at 7:10 PM Hyukjin Kwon  wrote:
>
> Thanks!
>
> On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan  wrote:
>>
>>
>> Thanks Dongjoon !
>>
>> Regards,
>> Mridul
>>
>> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun  wrote:
>>>
>>> We are happy to announce the availability of Apache Spark 3.4.1!
>>>
>>> Spark 3.4.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.4 maintenance branch of Spark. We strongly
>>> recommend all 3.4 users to upgrade to this stable release.
>>>
>>> To download Spark 3.4.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-4-1.html
>>>
>>> We would like to acknowledge all community members for contributing to this
>>> release. This release would not have been possible without you.
>>>
>>>
>>> Dongjoon Hyun

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPIP] PySpark Test Framework

2023-06-22 Thread L. C. Hsieh
+1

On Thu, Jun 22, 2023 at 3:10 PM Xinrong Meng  wrote:
>
> +1
>
> Thanks for driving that!
>
> On Wed, Jun 21, 2023 at 10:25 PM Ruifeng Zheng  wrote:
>>
>> +1
>>
>> On Thu, Jun 22, 2023 at 1:11 PM Dongjoon Hyun  
>> wrote:
>>>
>>> +1
>>>
>>> Dongjoon
>>>
>>> On Wed, Jun 21, 2023 at 8:56 PM Hyukjin Kwon  wrote:

 +1

 On Thu, 22 Jun 2023 at 02:20, Jacek Laskowski  wrote:
>
> +0
>
> Regards,
> Jacek Laskowski
> 
> "The Internals Of" Online Books
> Follow me on https://twitter.com/jaceklaskowski
>
>
>
> On Wed, Jun 21, 2023 at 5:11 PM Amanda Liu  
> wrote:
>>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: PySpark Test Framework.
>>
>> The high-level summary for the SPIP is that it proposes an official test 
>> framework for PySpark. Currently, there are only disparate open-source 
>> repos and blog posts for PySpark testing resources. We can streamline 
>> and simplify the testing process by incorporating test features, such as 
>> a PySpark Test Base class (which allows tests to share Spark sessions) 
>> and test util functions (for example, asserting dataframe and schema 
>> equality).
>>
>> SPIP doc: 
>> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
>>
>> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44042
>>
>> Discussion thread: 
>> https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n
>>
>> Please vote on the SPIP for the next 72 hours:
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because __.
>>
>> Thank you!
>>
>> Best,
>> Amanda Liu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.4.1 (RC1)

2023-06-20 Thread L. C. Hsieh
+1

On Tue, Jun 20, 2023 at 8:48 PM Dongjoon Hyun  wrote:
>
> +1
>
> Dongjoon
>
> On 2023/06/20 02:51:32 Jia Fan wrote:
> > +1
> >
> > Dongjoon Hyun wrote on Tue, Jun 20, 2023 at 10:41:
> >
> > > Please vote on releasing the following candidate as Apache Spark version
> > > 3.4.1.
> > >
> > > The vote is open until June 23rd 1AM (PST) and passes if a majority +1 PMC
> > > votes are cast, with a minimum of 3 +1 votes.
> > >
> > > [ ] +1 Release this package as Apache Spark 3.4.1
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see https://spark.apache.org/
> > >
> > > The tag to be voted on is v3.4.1-rc1 (commit
> > > 6b1ff22dde1ead51cbf370be6e48a802daae58b6)
> > > https://github.com/apache/spark/tree/v3.4.1-rc1
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/
> > >
> > > Signatures used for Spark RCs can be found in this file:
> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >
> > > The staging repository for this release can be found at:
> > > https://repository.apache.org/content/repositories/orgapachespark-1443/
> > >
> > > The documentation corresponding to this release can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/
> > >
> > > The list of bug fixes going into 3.4.1 can be found at the following URL:
> > > https://issues.apache.org/jira/projects/SPARK/versions/12352874
> > >
> > > This release is using the release script of the tag v3.4.1-rc1.
> > >
> > > FAQ
> > >
> > > =
> > > How can I help test this release?
> > > =
> > >
> > > If you are a Spark user, you can help us test this release by taking
> > > an existing Spark workload and running on this release candidate, then
> > > reporting any regressions.
> > >
> > > If you're working in PySpark you can set up a virtual env and install
> > > the current RC and see if anything important breaks, in the Java/Scala
> > > you can add the staging repository to your projects resolvers and test
> > > with the RC (make sure to clean up the artifact cache before/after so
> > > you don't end up building with an out of date RC going forward).
> > >
> > > ===
> > > What should happen to JIRA tickets still targeting 3.4.1?
> > > ===
> > >
> > > The current list of open tickets targeted at 3.4.1 can be found at:
> > > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > > Version/s" = 3.4.1
> > >
> > > Committers should look at those and triage. Extremely important bug
> > > fixes, documentation, and API tweaks that impact compatibility should
> > > be worked on immediately. Everything else please retarget to an
> > > appropriate release.
> > >
> > > ==
> > > But my bug isn't fixed?
> > > ==
> > >
> > > In order to make timely releases, we will typically not hold the
> > > release unless the bug in question is a regression from the previous
> > > release. That being said, if there is something which is a regression
> > > that has not been correctly targeted please ping me or a committer to
> > > help target the issue.
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread L. C. Hsieh
+1

On Mon, Jun 12, 2023 at 11:06 AM huaxin gao  wrote:
>
> +1
>
> On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun  wrote:
>>
>> +1
>>
>> Dongjoon
>>
>> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>> > Please vote on the release plan for Apache Spark 4.0.0.
>> >
>> > The vote is open until June 16th 1AM (PST) and passes if a majority +1 PMC
>> > votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>> >
>> > ===
>> > Apache Spark 4.0.0 Release Plan
>> > ===
>> >
>> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.
>> >
>> > 2. Creating `branch-4.0` on April 1st, 2024.
>> >
>> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>> >
>> > 4. Apache Spark 4.0.0 Release in June, 2024.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.4.1 Release?

2023-06-08 Thread L. C. Hsieh
+1

Thanks Dongjoon for driving this.

On Thu, Jun 8, 2023 at 2:25 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> `branch-3.4` already has 77 commits since v3.4.0 tag.
>
> https://github.com/apache/spark/releases/v3.4.0 (Tagged on April 6th)
>
> $ git log --oneline v3.4.0..HEAD | wc -l
> 77
>
> I'd like to propose to have Apache Spark 3.4.1 before DATA+AI Summit (June 
> 26~29) because that provides more stable new features of Spark 3.4. I also 
> volunteer as the release manager of Apache Spark 3.4.1, and the candidate 
> vote date in my mind is June 20th, Tuesday.
>
> WDYT?
>
> Thanks,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread L. C. Hsieh
+1

Thanks Dongjoon

On Sun, Apr 9, 2023 at 5:20 PM Dongjoon Hyun  wrote:
>
> I'll start with my +1.
>
> I verified the checksum, signatures of the artifacts, and documentation.
> Also, ran the tests with YARN and K8s modules.
>
> Dongjoon.
>
> On 2023/04/09 23:46:10 Dongjoon Hyun wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 3.2.4.
> >
> > The vote is open until April 13th 1AM (PST) and passes if a majority +1 PMC
> > votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.2.4
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> >
> > The tag to be voted on is v3.2.4-rc1 (commit
> > 0ae10ac18298d1792828f1d59b652ef17462d76e)
> > https://github.com/apache/spark/tree/v3.2.4-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1442/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-docs/
> >
> > The list of bug fixes going into 3.2.4 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12352607
> >
> > This release is using the release script of the tag v3.2.4-rc1.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.2.4?
> > ===
> >
> > The current list of open tickets targeted at 3.2.4 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > Version/s" = 3.2.4
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-08 Thread L. C. Hsieh
+1

Thanks Xinrong.

On Sat, Apr 8, 2023 at 8:23 AM yangjie01  wrote:
>
> +1
>
>
>
> From: Sean Owen 
> Date: Saturday, April 8, 2023 20:27
> To: Xinrong Meng 
> Cc: dev 
> Subject: Re: [VOTE] Release Apache Spark 3.4.0 (RC7)
>
>
>
> +1 from me, same result as last time.
>
>
>
> On Fri, Apr 7, 2023 at 6:30 PM Xinrong Meng  wrote:
>
> Please vote on releasing the following candidate(RC7) as Apache Spark version 
> 3.4.0.
>
> The vote is open until 11:59pm Pacific time April 12th and passes if a 
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.4.0-rc7 (commit 
> 87a5442f7ed96b11051d8a9333476d080054e5a0):
> https://github.com/apache/spark/tree/v3.4.0-rc7
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1441
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc7.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.2.4 EOL Release?

2023-04-04 Thread L. C. Hsieh
+1

Sounds good and thanks Dongjoon for driving this.

On 2023/04/04 17:24:54 Dongjoon Hyun wrote:
> Hi, All.
> 
> Since Apache Spark 3.2.0 passed RC7 vote on October 12, 2021, branch-3.2
> has been maintained and served well until now.
> 
> - https://github.com/apache/spark/releases/tag/v3.2.0 (tagged on Oct 6,
> 2021)
> - https://lists.apache.org/thread/jslhkh9sb5czvdsn7nz4t40xoyvznlc7
> 
> As of today, branch-3.2 has 62 additional patches after v3.2.3 and reaches
> the end-of-life this month according to the Apache Spark release cadence. (
> https://spark.apache.org/versioning-policy.html)
> 
> $ git log --oneline v3.2.3..HEAD | wc -l
> 62
> 
> With the upcoming Apache Spark 3.4, I hope the users can get a chance to
> have these last bits of Apache Spark 3.2.x, and I'd like to propose to have
> Apache Spark 3.2.4 EOL Release next week and volunteer as the release
> manager. WDYT? Please let me know if you need more patches on branch-3.2.
> 
> Thanks,
> Dongjoon.
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-04-03 Thread L. C. Hsieh
+1

Thanks Xinrong.

On Mon, Apr 3, 2023 at 12:35 PM Dongjoon Hyun  wrote:
>
> +1
>
> I also verified that RC5 has SBOM artifacts.
>
> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.12/3.4.0/spark-core_2.12-3.4.0-cyclonedx.json
> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.13/3.4.0/spark-core_2.13-3.4.0-cyclonedx.json
>
> Thanks,
> Dongjoon.
>
>
>
> On Mon, Apr 3, 2023 at 1:57 AM yangjie01  wrote:
>>
>> +1, checked Java 17 + Scala 2.13 + Python 3.10.10.
>>
>>
>>
>> From: Herman van Hovell 
>> Date: Friday, March 31, 2023 12:12
>> To: Sean Owen 
>> Cc: Xinrong Meng , dev 
>> Subject: Re: [VOTE] Release Apache Spark 3.4.0 (RC5)
>>
>>
>>
>> +1
>>
>>
>>
>> On Thu, Mar 30, 2023 at 11:05 PM Sean Owen  wrote:
>>
>> +1 same result from me as last time.
>>
>>
>>
>> On Thu, Mar 30, 2023 at 3:21 AM Xinrong Meng  
>> wrote:
>>
>> Please vote on releasing the following candidate(RC5) as Apache Spark 
>> version 3.4.0.
>>
>> The vote is open until 11:59pm Pacific time April 4th and passes if a 
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.4.0-rc5 (commit 
>> f39ad617d32a671e120464e4a75986241d72c487):
>> https://github.com/apache/spark/tree/v3.4.0-rc5
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1439
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
>>
>> The list of bug fixes going into 3.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>>
>> This release is using the release script of the tag v3.4.0-rc5.
>>
>>
>>
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.4.0?
>> ===
>> The current list of open tickets targeted at 3.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 3.4.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>
>> Thanks,
>>
>> Xinrong Meng
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE][RESULT][SPIP] Lazy Materialization for Parquet Read Performance Improvement

2023-02-17 Thread L. C. Hsieh
The vote passes with 9 +1s (4 binding +1s).
Thanks to all who reviewed the SPIP doc and voted!

(* = binding)
+1:
- Dongjoon Hyun (*)
- Huaxin Gao (*)
- Mich Talebzadeh
- L. C. Hsieh (*)
- Prem Sahoo
- Yuming Wang
- Guo Weijie
- DB Tsai (*)
- Kazuyuki Tanimura

+0: None

-1: None

Thanks!

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[ANNOUNCE] Apache Spark 3.3.2 released

2023-02-17 Thread L. C. Hsieh
We are happy to announce the availability of Apache Spark 3.3.2!

Spark 3.3.2 is a maintenance release containing stability fixes. This
release is based on the branch-3.3 maintenance branch of Spark. We strongly
recommend all 3.3 users to upgrade to this stable release.

To download Spark 3.3.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-3-2.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPIP] Lazy Materialization for Parquet Read Performance Improvement

2023-02-16 Thread L. C. Hsieh
Based on SPIP doc
(https://spark.apache.org/improvement-proposals.html), the vote passes
if at least 3 +1 votes from PMC members and no -1 votes from PMC
members.

Also, the vote should be open for at least 72 hours.

On Thu, Feb 16, 2023 at 10:34 AM Mich Talebzadeh
 wrote:
>
> How many votes are needed for the approval state?
>
> Thanks
>
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>
>
>
>
>
> On Thu, 16 Feb 2023 at 18:19, kazuyuki tanimura  wrote:
>>
>> +1 for myself
>>
>> On Feb 14, 2023, at 10:42 AM, DB Tsai  wrote:
>>
>> +1
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Feb 14, 2023, at 8:29 AM, Guo Weijie  wrote:
>>
>> +1
>>
>> Yuming Wang wrote on Tue, Feb 14, 2023 at 15:58:
>>>
>>> +1
>>>
>>> On Tue, Feb 14, 2023 at 11:27 AM Prem Sahoo  wrote:
>>>>
>>>> +1
>>>>
>>>> On Mon, Feb 13, 2023 at 8:13 PM L. C. Hsieh  wrote:
>>>>>
>>>>> +1
>>>>>
>>>>> On Mon, Feb 13, 2023 at 3:49 PM Mich Talebzadeh 
>>>>>  wrote:
>>>>>>
>>>>>> +1 for me
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>>>>> loss, damage or destruction of data or any other property which may 
>>>>>> arise from relying on this email's technical content is explicitly 
>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>> damages arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 13 Feb 2023 at 23:18, huaxin gao  wrote:
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Mon, Feb 13, 2023 at 3:09 PM Dongjoon Hyun  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Dongjoon
>>>>>>>>
>>>>>>>> On 2023/02/13 22:52:59 "L. C. Hsieh" wrote:
>>>>>>>> > Hi all,
>>>>>>>> >
>>>>>>>> > I'd like to start the vote for SPIP: Lazy Materialization for Parquet
>>>>>>>> > Read Performance Improvement.
>>>>>>>> >
>>>>>>>> > The high-level summary of the SPIP is that it proposes an improvement
>>>>>>>> > to the Parquet reader with lazy materialization, which only
>>>>>>>> > materializes (i.e. decompresses, decodes, etc.) necessary values. For
>>>>>>>> > Spark SQL filter operations, evaluating the filters first and lazily
>>>>>>>> > materializing only the used values can avoid wasted computation and
>>>>>>>> > improve read performance.
>>>>>>>> >
>>>>>>>> > References:
>>>>>>>> >
>>>>>>>> > JIRA ticket https://issues.apache.org/jira/browse/SPARK-42256
>>>>>>>> > SPIP doc 
>>>>>>>> > https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>>>>>> > Discussion thread
>>>>>>>> > https://lists.apache.org/thread/5yf2ylqhcv94y03m7gp3mgf3q0fp6gw6
>>>>>>>> >
>>>>>>>> > Please vote on the SPIP for the next 72 hours:
>>>>>>>> >
>>>>>>>> > [ ] +1: Accept the proposal as an official SPIP
>>>>>>>> > [ ] +0
>>>>>>>> > [ ] -1: I don’t think this is a good idea because …
>>>>>>>> >
>>>>>>>> > Thank you!
>>>>>>>> >
>>>>>>>> > Liang-Chi Hsieh
>>>>>>>> >
>>>>>>>> > -
>>>>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>> >
>>>>>>>> >
>>>>>>>>
>>>>>>>> -
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE][RESULT] Release Spark 3.3.2 (RC1)

2023-02-15 Thread L. C. Hsieh
The vote passes with 12 +1s (4 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
- Mridul Muralidharan (*)
- Dongjoon Hyun (*)
- Sean Owen (*)
- Enrico Minack
- Bjørn Jørgensen
- Yikun Jiang
- Yang Jie
- Yuming Wang
- John Zhuge
- William Hyun
- Chao Sun
- L. C. Hsieh (*)

+0: None

-1: None

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-14 Thread L. C. Hsieh
Seems I've not added my +1 yet.

+1

On Mon, Feb 13, 2023 at 10:02 AM Holden Karau  wrote:
>
> That’s legit, if the patch author isn’t comfortable with a backport then 
> let’s leave it be 
>
> On Mon, Feb 13, 2023 at 9:59 AM Dongjoon Hyun  wrote:
>>
>> Hi, All.
>>
>> As the author of that `Improvement` patch, I strongly disagree with giving 
>> the wrong idea that Python 3.11 is officially supported in Spark 3.3.
>>
>> I only developed and delivered it for Apache Spark 3.4.0 specifically as 
>> `Improvement`.
>>
>> We may want to backport it to branch-3.3, but that's another discussion 
>> topic because it's an `Improvement` rather than a blocker for any existing 
>> release branch.
>>
>> Please raise the backporting discussion thread after the 3.3.2 release if 
>> you want it in branch-3.3.
>>
>> We need to talk. :)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Feb 13, 2023 at 9:31 AM Chao Sun  wrote:
>>>
>>> +1
>>>
>>> On Mon, Feb 13, 2023 at 9:20 AM L. C. Hsieh  wrote:
>>> >
>>> > If it is not supported in Spark 3.3.x, it looks like an improvement for
>>> > Spark 3.4.
>>> > For such cases we usually do not backport. I think this is also why
>>> > the PR was not backported when it was merged.
>>> >
>>> > I'm okay if there is consensus to back port it.
>>> >
>>> > On Mon, Feb 13, 2023 at 9:08 AM Sean Owen  wrote:
>>> > >
>>> > > Does that change change the result for Spark 3.3.x?
>>> > > It looks like we do not support Python 3.11 in Spark 3.3.x, which is 
>>> > > one answer to whether this should be changed now.
>>> > > But if that's the only change that matters for Python 3.11 and makes it 
>>> > > work, sure I think we should back-port. It doesn't necessarily block a 
>>> > > release but if that's the case, it seems OK to include to me in a next 
>>> > > RC.
>>> > >
>>> > > On Mon, Feb 13, 2023 at 10:53 AM Bjørn Jørgensen 
>>> > >  wrote:
>>> > >>
>>> > >> There is a fix for Python 3.11: 
>>> > >> https://github.com/apache/spark/pull/38987
>>> > >> We should have this in more branches.
>>> > >>
>>> > >> On Mon, Feb 13, 2023 at 09:39, Bjørn Jørgensen wrote:
>>> > >>>
>>> > >>> On Manjaro it is Python 3.10.9
>>> > >>>
>>> > >>> On Ubuntu it is Python 3.11.1
>>> > >>>
>>> > >>>> On Mon, Feb 13, 2023 at 03:24, yangjie01 wrote:
>>> > >>>>
>>> > >>>> Which Python version do you use for testing? When I use the latest 
>>> > >>>> Python 3.11, I can reproduce similar test failures (43 tests in the 
>>> > >>>> sql module fail), but when I use Python 3.10, they succeed.
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> YangJie
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> From: Bjørn Jørgensen 
>>> > >>>> Date: Monday, February 13, 2023 05:09
>>> > >>>> To: Sean Owen 
>>> > >>>> Cc: "L. C. Hsieh" , Spark dev list 
>>> > >>>> 
>>> > >>>> Subject: Re: [VOTE] Release Spark 3.3.2 (RC1)
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> Tried it one more time and the same result.
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> On another box with Manjaro
>>> > >>>>
>>> > >>>> 
>>> > >>>> [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>>> > >>>> [INFO]
>>> > >>>> [INFO] Spark Project Parent POM ........... SUCCESS [01:50 min]
>>> > >>>> [INFO] Spark Project Tags ................. SUCCESS [ 17.359 s]
>>> > >>>> [INFO] Spark Project Sketch ... SUCCESS 
>>> > >>

Re: [VOTE][SPIP] Lazy Materialization for Parquet Read Performance Improvement

2023-02-13 Thread L. C. Hsieh
+1

On Mon, Feb 13, 2023 at 3:49 PM Mich Talebzadeh 
wrote:

> +1 for me
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 13 Feb 2023 at 23:18, huaxin gao  wrote:
>
>> +1
>>
>> On Mon, Feb 13, 2023 at 3:09 PM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Dongjoon
>>>
>>> On 2023/02/13 22:52:59 "L. C. Hsieh" wrote:
>>> > Hi all,
>>> >
>>> > I'd like to start the vote for SPIP: Lazy Materialization for Parquet
>>> > Read Performance Improvement.
>>> >
>>> > The high-level summary of the SPIP is that it proposes an improvement to
>>> > the Parquet reader with lazy materialization, which only materializes
>>> > (i.e. decompresses, decodes, etc.) necessary values. For Spark SQL filter
>>> > operations, evaluating the filters first and lazily materializing only
>>> > the used values can avoid wasted computation and improve read
>>> > performance.
>>> >
>>> > References:
>>> >
>>> > JIRA ticket https://issues.apache.org/jira/browse/SPARK-42256
>>> > SPIP doc
>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>> > Discussion thread
>>> > https://lists.apache.org/thread/5yf2ylqhcv94y03m7gp3mgf3q0fp6gw6
>>> >
>>> > Please vote on the SPIP for the next 72 hours:
>>> >
>>> > [ ] +1: Accept the proposal as an official SPIP
>>> > [ ] +0
>>> > [ ] -1: I don’t think this is a good idea because …
>>> >
>>> > Thank you!
>>> >
>>> > Liang-Chi Hsieh
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


[VOTE][SPIP] Lazy Materialization for Parquet Read Performance Improvement

2023-02-13 Thread L. C. Hsieh
Hi all,

I'd like to start the vote for SPIP: Lazy Materialization for Parquet
Read Performance Improvement.

The high-level summary of the SPIP is that it proposes an improvement to the
Parquet reader with lazy materialization, which only materializes (i.e.
decompresses, decodes, etc.) necessary values. For Spark-SQL filter
operations, evaluating the filters first and lazily materializing only
the used values can save wasted computation and improve the read
performance.
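
For intuition, here is a minimal sketch of the filter-then-materialize idea
(the names and structure below are illustrative assumptions, not the SPIP's
actual design; see the SPIP doc for that):

    // One row group: the filter column is decoded eagerly, while the other
    // columns expose an on-demand decoder and are touched only at row
    // positions that survive the predicate.
    case class LazyColumn(decode: Int => Any) // decodes one row position

    def readRowGroup(
        filterCol: Array[Int],
        otherCols: Map[String, LazyColumn],
        predicate: Int => Boolean): Seq[Map[String, Any]] = {
      val selected = filterCol.indices.filter(i => predicate(filterCol(i)))
      // Decompression/decoding work is spent only on the selected rows.
      selected.map(i => otherCols.map { case (n, c) => n -> c.decode(i) })
    }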

References:

JIRA ticket https://issues.apache.org/jira/browse/SPARK-42256
SPIP doc 
https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
Discussion thread
https://lists.apache.org/thread/5yf2ylqhcv94y03m7gp3mgf3q0fp6gw6

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thank you!

Liang-Chi Hsieh

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-13 Thread L. C. Hsieh
Hi Mich,

The title of this thread is "[DISCUSS]". We need to have a public
discussion on the SPIP proposal, collecting comments, before we can move
forward to call for a vote on it.


On Mon, Feb 13, 2023 at 2:35 PM Mich Talebzadeh 
wrote:

> Hi,
>
> I thought we already voted to go ahead with this proposal!
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 13 Feb 2023 at 20:41, kazuyuki tanimura 
> wrote:
>
>> Thank you Liang-Chi!
>>
>> Kazu
>>
>> On Feb 11, 2023, at 7:12 PM, L. C. Hsieh  wrote:
>>
>> Thanks all for your feedback.
>>
>> Given this positive feedback, if there is no other comments/discussion, I
>> will go to start a vote in the next few days.
>>
>> Thank you again!
>>
>> On Thu, Feb 2, 2023 at 10:12 AM kazuyuki tanimura <
>> ktanim...@apple.com.invalid> wrote:
>>
>>> Thank you all for +1s and reviewing the SPIP doc.
>>>
>>> Kazu
>>>
>>> On Feb 1, 2023, at 1:28 AM, Dongjoon Hyun 
>>> wrote:
>>>
>>> +1
>>>
>>> On Wed, Feb 1, 2023 at 12:52 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 1 Feb 2023 at 02:23, huaxin gao  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Jan 31, 2023 at 6:10 PM DB Tsai  wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Jan 31, 2023, at 4:16 PM, Yuming Wang  wrote:
>>>>>>
>>>>>> 
>>>>>> +1.
>>>>>>
>>>>>> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura <
>>>>>> ktanim...@apple.com.invalid> wrote:
>>>>>>
>>>>>>> Great! Much appreciated, Mitch!
>>>>>>>
>>>>>>> Kazu
>>>>>>>
>>>>>>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks, Kazu.
>>>>>>>
>>>>>>> I followed that template link and indeed as you pointed out it is a
>>>>>>> common template. If it works then it is what it is.
>>>>>>>
>>>>>>> I will be going through your design proposals and hopefully we can
>>>>>>> review it.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Mich
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread L. C. Hsieh
If it is not supported in Spark 3.3.x, it looks like an improvement in
Spark 3.4.
For such cases we usually do not backport. I think this is also why
the PR was not backported when it was merged.

I'm okay if there is consensus to back port it.

On Mon, Feb 13, 2023 at 9:08 AM Sean Owen  wrote:
>
> Does that change change the result for Spark 3.3.x?
> It looks like we do not support Python 3.11 in Spark 3.3.x, which is one 
> answer to whether this should be changed now.
> But if that's the only change that matters for Python 3.11 and makes it work, 
> sure I think we should back-port. It doesn't necessarily block a release but 
> if that's the case, it seems OK to me to include in a next RC.
>
> On Mon, Feb 13, 2023 at 10:53 AM Bjørn Jørgensen  
> wrote:
>>
> >> There is a fix for Python 3.11: https://github.com/apache/spark/pull/38987
>> We should have this in more branches.
>>
>> On Mon, Feb 13, 2023 at 09:39, Bjørn Jørgensen
>> wrote:
>>>
>>> On manjaro it is Python 3.10.9
>>>
>>> On ubuntu it is Python 3.11.1
>>>
>>> On Mon, Feb 13, 2023 at 03:24, yangjie01 wrote:
>>>>
>>>> Which Python version do you use for testing? When I use the latest Python
>>>> 3.11, I can reproduce similar test failures (43 tests of the sql module
>>>> fail), but when I use Python 3.10, they succeed
>>>>
>>>>
>>>>
>>>> YangJie
>>>>
>>>>
>>>>
>>>> From: Bjørn Jørgensen 
>>>> Date: Monday, February 13, 2023 05:09
>>>> To: Sean Owen 
>>>> Cc: "L. C. Hsieh" , Spark dev list 
>>>> Subject: Re: [VOTE] Release Spark 3.3.2 (RC1)
>>>>
>>>>
>>>>
>>>> Tried it one more time and the same result.
>>>>
>>>>
>>>>
>>>> On another box with Manjaro
>>>>
>>>> 
>>>> [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>>>> [INFO]
>>>> [INFO] Spark Project Parent POM ... SUCCESS [01:50 
>>>> min]
>>>> [INFO] Spark Project Tags . SUCCESS [ 
>>>> 17.359 s]
>>>> [INFO] Spark Project Sketch ... SUCCESS [ 
>>>> 12.517 s]
>>>> [INFO] Spark Project Local DB . SUCCESS [ 
>>>> 14.463 s]
>>>> [INFO] Spark Project Networking ... SUCCESS [01:07 
>>>> min]
>>>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  
>>>> 9.013 s]
>>>> [INFO] Spark Project Unsafe ... SUCCESS [  
>>>> 8.184 s]
>>>> [INFO] Spark Project Launcher . SUCCESS [ 
>>>> 10.454 s]
>>>> [INFO] Spark Project Core . SUCCESS [23:58 
>>>> min]
>>>> [INFO] Spark Project ML Local Library . SUCCESS [ 
>>>> 21.218 s]
>>>> [INFO] Spark Project GraphX ... SUCCESS [01:24 
>>>> min]
>>>> [INFO] Spark Project Streaming  SUCCESS [04:57 
>>>> min]
>>>> [INFO] Spark Project Catalyst . SUCCESS [08:00 
>>>> min]
>>>> [INFO] Spark Project SQL .. SUCCESS [  
>>>> 01:02 h]
>>>> [INFO] Spark Project ML Library ... SUCCESS [14:38 
>>>> min]
>>>> [INFO] Spark Project Tools  SUCCESS [  
>>>> 4.394 s]
>>>> [INFO] Spark Project Hive . SUCCESS [53:43 
>>>> min]
>>>> [INFO] Spark Project REPL . SUCCESS [01:16 
>>>> min]
>>>> [INFO] Spark Project Assembly . SUCCESS [  
>>>> 2.186 s]
>>>> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 
>>>> 16.150 s]
>>>> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:34 
>>>> min]
>>>> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [32:55 
>>>> min]
>>>> [INFO] Spark Project Examples . SUCCESS [ 
>>>> 23.800 s]
>>>> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [  
>>>> 7.3

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-11 Thread L. C. Hsieh
Thanks all for your feedback.

Given this positive feedback, if there is no other comments/discussion, I
will go to start a vote in the next few days.

Thank you again!

On Thu, Feb 2, 2023 at 10:12 AM kazuyuki tanimura
 wrote:

> Thank you all for +1s and reviewing the SPIP doc.
>
> Kazu
>
> On Feb 1, 2023, at 1:28 AM, Dongjoon Hyun  wrote:
>
> +1
>
> On Wed, Feb 1, 2023 at 12:52 AM Mich Talebzadeh 
> wrote:
>
>> +1
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 1 Feb 2023 at 02:23, huaxin gao  wrote:
>>
>>> +1
>>>
>>> On Tue, Jan 31, 2023 at 6:10 PM DB Tsai  wrote:
>>>
 +1

 Sent from my iPhone

 On Jan 31, 2023, at 4:16 PM, Yuming Wang  wrote:

 
 +1.

 On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura <
 ktanim...@apple.com.invalid> wrote:

> Great! Much appreciated, Mitch!
>
> Kazu
>
> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Thanks, Kazu.
>
> I followed that template link and indeed as you pointed out it is a
> common template. If it works then it is what it is.
>
> I will be going through your design proposals and hopefully we can
> review it.
>
> Regards,
>
> Mich
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages
> arising from such loss, damage or destruction.
>
>
>
>
> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura 
> wrote:
>
>> Thank you Mich. I followed the instruction at
>> https://spark.apache.org/improvement-proposals.html and used its
>> template.
>> While we are open to revising our design doc, it seems more like you
>> are proposing that the community change the instructions per se?
>>
>> Kazu
>>
>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> Thanks for these proposals. good suggestions. Is this style of
>> breaking down your approach standard?
>>
>> My view would be that perhaps it makes more sense to follow the
>> industry-established approach of breaking down
>> your technical proposal into:
>>
>>
>>1. Background
>>2. Objective
>>3. Scope
>>4. Constraints
>>5. Assumptions
>>6. Reporting
>>7. Deliverables
>>8. Timelines
>>9. Appendix
>>
>> Your current approach using below
>>
>> Q1. What are you trying to do? Articulate your objectives using
>> absolutely no jargon. What are you trying to achieve?
>> Q2. What problem is this proposal NOT designed to solve? What issues
>> is the suggested proposal not going to address?
>> Q3. How is it done today, and what are the limits of current practice?
>> Q4. What is new in your approach and why do you think it
>> will succeed?
>> Q5. Who cares? If you are successful, what difference will it make?
>> If your proposal succeeds, what tangible benefits will it add?
>> Q6. What are the risks?
>> Q7. How long will it take?
>> Q8. What are the midterm and final “exams” to check for success?
>>
>>
>> May not do justice to your proposal.
>>
>> HTH
>>
>> Mich
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>> for any loss, damage or destruction of data or any other property which 
>> may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
>> ktanim...@apple.com.invalid> wrote:
>>
>>> Hi everyone,
>>>
>>> I would like to start a discussion on “Lazy Materialization 

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-11 Thread L. C. Hsieh
Thank you for testing it.

I was going to run it again but still didn't see any errors.

I also checked CI (and looked again now) on branch-3.3 before cutting RC.

BTW, I didn't find an actual test failure (i.e. "- test_name ***
FAILED ***") in the log file.

Maybe it is due to the dev env? What dev env are you using to run the test?


On Sat, Feb 11, 2023 at 8:58 AM Bjørn Jørgensen
 wrote:
>
>
> ./build/mvn clean package
>
> Run completed in 1 hour, 18 minutes, 29 seconds.
> Total number of tests run: 11652
> Suites: completed 516, aborted 0
> Tests: succeeded 11609, failed 43, canceled 8, ignored 57, pending 0
> *** 43 TESTS FAILED ***
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [  3.418 
> s]
> [INFO] Spark Project Tags . SUCCESS [ 17.845 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 20.791 
> s]
> [INFO] Spark Project Local DB . SUCCESS [ 16.527 
> s]
> [INFO] Spark Project Networking ... SUCCESS [01:03 
> min]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  9.914 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [ 12.007 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  7.620 
> s]
> [INFO] Spark Project Core . SUCCESS [40:04 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 29.997 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [02:33 
> min]
> [INFO] Spark Project Streaming  SUCCESS [05:51 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [13:29 
> min]
> [INFO] Spark Project SQL .. FAILURE [  01:25 
> h]
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED
> [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> [INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> [INFO] Spark Avro . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> --------
> [INFO] Total time:  02:30 h
> [INFO] Finished at: 2023-02-11T17:32:45+01:00
>
> On Sat, Feb 11, 2023 at 06:01, L. C. Hsieh wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.3.2.
>>
>> The vote is open until Feb 15th 9AM (PST) and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.3.2-rc1 (commit
>> 5103e00c4ce5fcc4264ca9c4df12295d42557af6):
>> https://github.com/apache/spark/tree/v3.3.2-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1433/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-docs/
>>
>> The list of bug fixes going into 3.3.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12352299
>>
>> This release is using the release script of the tag v3.3.2-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by tak

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-11 Thread L. C. Hsieh
Hi Mridul,

Thanks for testing it.

I can see the artifact in
https://repository.apache.org/content/repositories/orgapachespark-1433/org/apache/spark/spark-mllib-local_2.13/3.3.2/.
Did I miss something?

Liang-Chi

On Sat, Feb 11, 2023 at 10:08 AM Mridul Muralidharan  wrote:
>
>
> Hi,
>
> The following file is missing in the staging repository - there is a 
> corresponding asc sig file, without the artifact.
> * 
> org/apache/spark/spark-mllib-local_2.13/3.3.2/spark-mllib-local_2.13-3.3.2-test-sources.jar
> Can we have this fixed please ?
>
> Rest of the signatures, digests, etc check out fine.
>
> Built and tested with "-Phive -Pyarn -Pmesos -Pkubernetes".
>
> Regards,
> Mridul
>
>
>
>
> On Fri, Feb 10, 2023 at 11:01 PM L. C. Hsieh  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.3.2.
>>
>> The vote is open until Feb 15th 9AM (PST) and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.3.2-rc1 (commit
>> 5103e00c4ce5fcc4264ca9c4df12295d42557af6):
>> https://github.com/apache/spark/tree/v3.3.2-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1433/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-docs/
>>
>> The list of bug fixes going into 3.3.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12352299
>>
>> This release is using the release script of the tag v3.3.2-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.2?
>> ===
>>
>> The current list of open tickets targeted at 3.3.2 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.2
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Release Spark 3.3.2 (RC1)

2023-02-10 Thread L. C. Hsieh
Please vote on releasing the following candidate as Apache Spark version 3.3.2.

The vote is open until Feb 15th 9AM (PST) and passes if a majority +1
PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v3.3.2-rc1 (commit
5103e00c4ce5fcc4264ca9c4df12295d42557af6):
https://github.com/apache/spark/tree/v3.3.2-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1433/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-docs/

The list of bug fixes going into 3.3.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12352299

This release is using the release script of the tag v3.3.2-rc1.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
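
As a concrete starting point, a quick sanity check of the RC from
spark-shell could look like this (a minimal sketch; the workload below is
arbitrary and only meant to exercise basic query execution):

    // Assumes spark-shell was launched from the extracted RC tarball,
    // where `spark` (a SparkSession) is predefined.
    val df = spark.range(0, 1000).selectExpr("id", "id % 7 AS k")
    assert(df.groupBy("k").count().count() == 7) // 7 distinct keys expected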

===
What should happen to JIRA tickets still targeting 3.3.2?
===

The current list of open tickets targeted at 3.3.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.3.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Time for release v3.3.2

2023-01-30 Thread L. C. Hsieh
Hi Spark devs,

As you know, it has been 4 months since Spark 3.3.1 was released on
2022/10, it seems a good time to think about next maintenance release,
i.e. Spark 3.3.2.

I'm thinking of the release of Spark 3.3.2 this Feb (2023/02).

What do you think?

I am willing to volunteer for Spark 3.3.2 if there is consensus about
this maintenance release.

Thank you.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread L. C. Hsieh
+1

On Thu, Jan 12, 2023 at 10:39 PM Jungtaek Lim
 wrote:
>
> Yes, exactly. I'm sorry to bring confusion - should have clarified action 
> items on the proposal.
>
> On Fri, Jan 13, 2023 at 3:31 PM Dongjoon Hyun  wrote:
>>
>> Then, could you elaborate `the proposed code change` specifically?
>> Maybe, usual deprecation warning logs and annotation on the API?
>>
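
For reference, annotation-based deprecation in Scala looks roughly like the
sketch below (illustrative only, not the change that was eventually merged):

    // Hypothetical stand-in for org.apache.spark.streaming.dstream.DStream.
    @deprecated("DStream is a legacy API; migrate to Structured Streaming.", "3.4.0")
    class DStream[T]

    // Any user code referencing the class then gets a compile-time
    // deprecation warning pointing at Structured Streaming.
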
>>
>> On Thu, Jan 12, 2023 at 10:05 PM Jungtaek Lim  
>> wrote:
>>>
>>> Maybe I need to clarify - my proposal is "explicitly" deprecating it, which 
>>> incurs code change for sure. Guidance on the Spark website is done already 
>>> as I mentioned - we updated the DStream doc page to mention that DStream is 
>>> a "legacy" project and users should move to SS. I don't feel this is 
>>> sufficient to keep users from using it, hence initiating this proposal.
>>>
>>> Sorry to make confusion. I just wanted to make sure the goal of the 
>>> proposal is not "removing" the API. The discussion on the removal of API 
>>> doesn't tend to go well, so I wanted to make sure I don't mean that.
>>>
>>> On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun  
>>> wrote:

 +1 for the proposal (guiding only without any code change).

 Thanks,
 Dongjoon.

 On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu  wrote:
>
> +1
>
>
> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das 
>  wrote:
>>
>> +1
>>
>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon  wrote:
>>>
>>> +1
>>>
>>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim 
>>>  wrote:

 bump for more visibility.

 On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim 
  wrote:
>
> Hi dev,
>
> I'd like to propose the deprecation of DStream in Spark 3.4, in favor 
> of promoting Structured Streaming.
> (Sorry for the late proposal, if we don't make the change in 3.4, we 
> will have to wait for another 6 months.)
>
> We have been focusing on Structured Streaming for years (across 
> multiple major and minor versions), and during the time we haven't 
> made any improvements for DStream. Furthermore, recently we updated 
> the DStream doc to explicitly say DStream is a legacy project.
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>
> The baseline of deprecation is that we don't see a particular use 
> case which only DStream solves. This is a different story with GraphX 
> and MLLIB, as we don't have replacements for that.
>
> The proposal does not mean we will remove the API soon, as the Spark 
> project has been making deprecation against public API. I don't 
> intend to propose the target version for removal. The goal is to 
> guide users to refrain from constructing a new workload with DStream. 
> We might want to go with this in future, but it would require a new 
> discussion thread at that time.
>
> What do you think?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Time for Spark 3.4.0 release?

2023-01-04 Thread L. C. Hsieh
+1

Thank you!

On Wed, Jan 4, 2023 at 9:13 AM Chao Sun  wrote:

> +1, thanks!
>
> Chao
>
> On Wed, Jan 4, 2023 at 1:56 AM Mridul Muralidharan 
> wrote:
>
>>
>> +1, Thanks !
>>
>> Regards,
>> Mridul
>>
>> On Wed, Jan 4, 2023 at 2:20 AM Gengliang Wang  wrote:
>>
>>> +1, thanks for driving the release!
>>>
>>>
>>> Gengliang
>>>
>>> On Tue, Jan 3, 2023 at 10:55 PM Dongjoon Hyun 
>>> wrote:
>>>
 +1

 Thank you!

 Dongjoon

 On Tue, Jan 3, 2023 at 9:44 PM Rui Wang  wrote:

> +1 to cut the branch starting from a workday!
>
> Great to see this is happening!
>
> Thanks Xinrong!
>
> -Rui
>
> On Tue, Jan 3, 2023 at 9:21 PM 416161...@qq.com 
> wrote:
>
>> +1, thank you Xinrong for driving this release!
>>
>> --
>> Ruifeng Zheng
>> ruife...@foxmail.com
>>
>> 
>>
>>
>>
>> -- Original --
>> *From:* "Hyukjin Kwon" ;
>> *Date:* Wed, Jan 4, 2023 01:15 PM
>> *To:* "Xinrong Meng";
>> *Cc:* "dev";
>> *Subject:* Re: Time for Spark 3.4.0 release?
>>
>> SGTM +1
>>
>> On Wed, Jan 4, 2023 at 2:13 PM Xinrong Meng 
>> wrote:
>>
>>> Hi All,
>>>
>>> Shall we cut *branch-3.4* on *January 16th, 2023*? We proposed
>>> January 15th per
>>> https://spark.apache.org/versioning-policy.html, but I would
>>> suggest we postpone one day since January 15th is a Sunday.
>>>
>>> I would like to volunteer as the release manager for *Apache Spark
>>> 3.4.0*.
>>>
>>> Thanks,
>>>
>>> Xinrong Meng
>>>
>>>


Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread L. C. Hsieh
Thanks, Chao!

On Wed, Nov 30, 2022 at 9:58 AM huaxin gao  wrote:
>
> Thanks Chao for driving the release!
>
> On Wed, Nov 30, 2022 at 9:24 AM Dongjoon Hyun  wrote:
>>
>> Thank you, Chao!
>>
>> On Wed, Nov 30, 2022 at 8:16 AM Yang,Jie(INF)  wrote:
>>>
>>> Thanks, Chao!
>>>
>>>
>>>
>>> From: Maxim Gekk 
>>> Date: Wednesday, November 30, 2022 19:40
>>> To: Jungtaek Lim 
>>> Cc: Wenchen Fan , Chao Sun , dev 
>>> , user 
>>> Subject: Re: [ANNOUNCE] Apache Spark 3.2.3 released
>>>
>>>
>>>
>>> Thank you, Chao!
>>>
>>>
>>>
>>> On Wed, Nov 30, 2022 at 12:42 PM Jungtaek Lim 
>>>  wrote:
>>>
>>> Thanks Chao for driving the release!
>>>
>>>
>>>
>>> On Wed, Nov 30, 2022 at 6:03 PM Wenchen Fan  wrote:
>>>
>>> Thanks, Chao!
>>>
>>>
>>>
>>> On Wed, Nov 30, 2022 at 1:33 AM Chao Sun  wrote:
>>>
>>> We are happy to announce the availability of Apache Spark 3.2.3!
>>>
>>> Spark 3.2.3 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.2 maintenance branch of Spark. We strongly
>>> recommend all 3.2 users to upgrade to this stable release.
>>>
>>> To download Spark 3.2.3, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-2-3.html
>>>
>>> We would like to acknowledge all community members for contributing to this
>>> release. This release would not have been possible without you.
>>>
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-14 Thread L. C. Hsieh
+1

Thanks Chao.

On Mon, Nov 14, 2022 at 6:55 PM Dongjoon Hyun  wrote:
>
> +1
>
> Thank you, Chao.
>
> On Mon, Nov 14, 2022 at 4:12 PM Chao Sun  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.2.3.
>>
>> The vote is open until 11:59pm Pacific time Nov 17th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.2.3-rc1 (commit
>> b53c341e0fefbb33d115ab630369a18765b7763d):
>> https://github.com/apache/spark/tree/v3.2.3-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.3-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1431/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.3-rc1-docs/
>>
>> The list of bug fixes going into 3.2.3 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12352105
>>
>> This release is using the release script of the tag v3.2.3-rc1.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.2.3?
>> ===
>> The current list of open tickets targeted at 3.2.3 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.2.3
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread L. C. Hsieh
Thank you for driving the release of Apache Spark 3.3.1, Yuming!

On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun  wrote:
>
> It's great. Thank you so much, Yuming!
>
> Dongjoon
>
> On Tue, Oct 25, 2022 at 11:23 PM Yuming Wang  wrote:
>>
>> We are happy to announce the availability of Apache Spark 3.3.1!
>>
>> Spark 3.3.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.3 maintenance branch of Spark. We strongly
>> recommend all 3.3 users to upgrade to this stable release.
>>
>> To download Spark 3.3.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-3-1.html
>>
>> We would like to acknowledge all community members for contributing to this
>> release. This release would not have been possible without you.
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-18 Thread L. C. Hsieh
+1

Thanks Yuming!

On Tue, Oct 18, 2022 at 11:28 AM Dongjoon Hyun  wrote:
>
> +1
>
> Thank you, Yuming and all!
>
> Dongjoon.
>
>
> On Tue, Oct 18, 2022 at 9:22 AM Yang,Jie(INF)  wrote:
>>
>> Tested Java 17 + Scala 2.13 with Maven and the tests passed; +1 for me
>>
>>
>>
>> From: Sean Owen 
>> Date: Monday, October 17, 2022 21:34
>> To: Yuming Wang 
>> Cc: dev 
>> Subject: Re: [VOTE] Release Spark 3.3.1 (RC4)
>>
>>
>>
>> +1 from me, same as last time
>>
>>
>>
>> On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.3.1.
>>
>> The vote is open until 11:59pm Pacific time October 21st and passes if a 
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org
>>
>> The tag to be voted on is v3.3.1-rc4 (commit 
>> fbbcf9434ac070dd4ced4fb9efe32899c6db12a9):
>> https://github.com/apache/spark/tree/v3.3.1-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1430
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-docs
>>
>> The list of bug fixes going into 3.3.1 can be found at the following URL:
>> https://s.apache.org/ttgz6
>>
>> This release is using the release script of the tag v3.3.1-rc4.
>>
>>
>> FAQ
>>
>> ==
>> What happened to v3.3.1-rc3?
>> ==
>> A performance regression (SPARK-40703) was found after tagging v3.3.1-rc3, 
>> which the Iceberg community hopes Spark 3.3.1 could fix.
>> So we skipped the vote on v3.3.1-rc3.
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.1?
>> ===
>> The current list of open tickets targeted at 3.3.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 3.3.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.2.3 Release?

2022-10-18 Thread L. C. Hsieh
+1

Thanks Chao!

On Tue, Oct 18, 2022 at 11:30 AM Dongjoon Hyun  wrote:
>
> +1
>
> Thank you for volunteering, Chao!
>
> Dongjoon.
>
>
> On Tue, Oct 18, 2022 at 9:55 AM Sean Owen  wrote:
>>
>> OK by me, if someone is willing to drive it.
>>
>> On Tue, Oct 18, 2022 at 11:47 AM Chao Sun  wrote:
>>>
>>> Hi All,
>>>
>>> It's been more than 3 months since 3.2.2 (tagged on Jul 11) was
>>> released. There are now 66 patches accumulated in branch-3.2, including
>>> 2 correctness issues.
>>>
>>> Is it a good time to start a new release? If there's no objection, I'd
>>> like to volunteer as the release manager for the 3.2.3 release, and
>>> start preparing the first RC next week.
>>>
>>> # Correctness issues
>>>
>>> SPARK-39833: Filtered parquet data frame count() and show() produce
>>> inconsistent results when spark.sql.parquet.filterPushdown is true
>>> SPARK-40002: Limit improperly pushed down through window using ntile
>>> function
>>>
>>> Best,
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread L. C. Hsieh
+1

Thanks Dongjoon.

On Wed, Oct 5, 2022 at 3:11 PM Jungtaek Lim
 wrote:
>
> +1
>
> On Thu, Oct 6, 2022 at 5:59 AM Chao Sun  wrote:
>>
>> +1
>>
>> > and specifically may allow us to finally move off of the ancient version 
>> > of Guava (?)
>>
>> I think the Guava issue comes from Hive 2.3 dependency, not Hadoop.
>>
>> On Wed, Oct 5, 2022 at 1:55 PM Xinrong Meng  wrote:
>>>
>>> +1.
>>>
>>> On Wed, Oct 5, 2022 at 1:53 PM Xiao Li  
>>> wrote:

 +1.

 Xiao

 On Wed, Oct 5, 2022 at 12:49 PM Sean Owen  wrote:
>
> I'm OK with this. It simplifies maintenance a bit, and specifically may 
> allow us to finally move off of the ancient version of Guava (?)
>
> On Mon, Oct 3, 2022 at 10:16 PM Dongjoon Hyun  
> wrote:
>>
>> Hi, All.
>>
>> I'm wondering if the following Apache Spark Hadoop2 Binary Distribution
>> is still used by someone in the community or not. If it's not used or 
>> not useful,
>> we may remove it from Apache Spark 3.4.0 release.
>>
>> 
>> https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
>>
>> Here is the background of this question.
>> Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
>> Spark community has been building and releasing with Java 8 only.
>> I believe that the user applications also use Java8+ in these days.
>> Recently, I received the following message from the Hadoop PMC.
>>
>>   > "if you really want to claim hadoop 2.x compatibility, then you have 
>> to
>>   > be building against java 7". Otherwise a lot of people with hadoop 
>> 2.x
>>   > clusters won't be able to run your code. If your projects are java8+
>>   > only, then they are implicitly hadoop 3.1+, no matter what you use
>>   > in your build. Hence: no need for branch-2 branches except
>>   > to complicate your build/test/release processes [1]
>>
>> If Hadoop2 binary distribution is no longer used as of today,
>> or incomplete somewhere due to Java 8 building, the following three
>> existing alternative Hadoop 3 binary distributions could be
>> the better official solution for old Hadoop 2 clusters.
>>
>> 1) Scala 2.12 and without-hadoop distribution
>> 2) Scala 2.12 and Hadoop 3 distribution
>> 3) Scala 2.13 and Hadoop 3 distribution
>>
>> In short, is there anyone who is using Apache Spark 3.3.0 Hadoop2 Binary 
>> distribution?
>>
>> Dongjoon
>>
>> [1] 
>> https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247



 --


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Time for Spark 3.3.1 release?

2022-09-12 Thread L. C. Hsieh
+1

Thanks Yuming!

On Mon, Sep 12, 2022 at 11:50 AM Dongjoon Hyun  wrote:
>
> +1
>
> Thanks,
> Dongjoon.
>
> On Mon, Sep 12, 2022 at 6:38 AM Yuming Wang  wrote:
>>
>> Hi, All.
>>
>>
>>
>> Since the Apache Spark 3.3.0 tag creation (Jun 10), 138 new patches, including 7 
>> correctness patches, have arrived at branch-3.3.
>>
>>
>>
>> Shall we make a new release, Apache Spark 3.3.1, as the second release at 
>> branch-3.3? I'd like to volunteer as the release manager for Apache Spark 
>> 3.3.1.
>>
>>
>>
>> All changes:
>>
>> https://github.com/apache/spark/compare/v3.3.0...branch-3.3
>>
>>
>>
>> Correctness issues:
>>
>> SPARK-40149: Propagate metadata columns through Project
>>
>> SPARK-40002: Don't push down limit through window using ntile (query shape sketched below)
>>
>> SPARK-39976: ArrayIntersect should handle null in left expression correctly
>>
>> SPARK-39833: Disable Parquet column index in DSv1 to fix a correctness issue 
>> in the case of overlapping partition and data columns
>>
>> SPARK-39061: Set nullable correctly for Inline output attributes
>>
>> SPARK-39887: RemoveRedundantAliases should keep aliases that make the output 
>> of projection nodes unique
>>
>> SPARK-38614: Don't push down limit through window that's using percent_rank
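
To make the "limit through window" class of issues concrete, the affected
query shape is roughly the following (an assumed illustration of the
SPARK-40002 / SPARK-38614 pattern, runnable in spark-shell; see the JIRAs
for the authoritative reproductions):

    // A LIMIT above a window function such as ntile() must not be pushed
    // below the window; otherwise the buckets would be computed over too
    // few rows.
    spark.sql(
      """SELECT * FROM (
        |  SELECT id, ntile(4) OVER (ORDER BY id) AS bucket
        |  FROM range(100)
        |) LIMIT 10""".stripMargin).show()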

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming three new PMC members

2022-08-09 Thread L. C. Hsieh
Congrats!

On Tue, Aug 9, 2022 at 5:38 PM Chao Sun  wrote:
>
> Congrats everyone!
>
> On Tue, Aug 9, 2022 at 5:36 PM Dongjoon Hyun  wrote:
> >
> > Congrat to all!
> >
> > Dongjoon.
> >
> > On Tue, Aug 9, 2022 at 5:13 PM Takuya UESHIN  wrote:
> > >
> > > Congratulations!
> > >
> > > On Tue, Aug 9, 2022 at 4:57 PM Hyukjin Kwon  wrote:
> > >>
> > >> Congrats everybody!
> > >>
> > >> On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan  
> > >> wrote:
> > >>>
> > >>>
> > >>> Congratulations !
> > >>> Great to have you join the PMC !!
> > >>>
> > >>> Regards,
> > >>> Mridul
> > >>>
> > >>> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan  
> > >>> wrote:
> > 
> >  Congratulations
> > 
> >  On Tue, Aug 9, 2022, 11:40 AM Xiao Li  wrote:
> > >
> > > Hi all,
> > >
> > > The Spark PMC recently voted to add three new PMC members. Join me in 
> > > welcoming them to their new roles!
> > >
> > > New PMC members: Huaxin Gao, Gengliang Wang and Maxim Gekk
> > >
> > > The Spark PMC
> > >
> > >
> > >
> > > --
> > > Takuya UESHIN
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Update Spark 3.4 Release Window?

2022-07-21 Thread L. C. Hsieh
I'm also +1 for Feb. 2023 (RC) and Jan. 2023 (Code freeze).

Liang-Chi

On Wed, Jul 20, 2022 at 2:02 PM Dongjoon Hyun  wrote:
>
> I fixed typos :)
>
> +1 for February 2023 (Release Candidate) and January 2023 (Code freeze).
>
> On 2022/07/20 20:59:30 Dongjoon Hyun wrote:
> > Thank you for initiating this discussion, Xinrong. I also agree with Sean.
> >
> > +1 for February 2023 (Release Candidate) and January 2021 (Code freeze).
> >
> > Dongjoon.
> >
> > On Wed, Jul 20, 2022 at 1:42 PM Sean Owen  wrote:
> > >
> > > I don't know any better than others when it will actually happen, though 
> > > historically, it's more like 7-8 months between minor releases. I might 
> > > therefore expect a release more like February 2023, and work backwards 
> > > from there. Doesn't really matter, this is just a public guess and can be 
> > > changed.
> > >
> > > On Wed, Jul 20, 2022 at 3:27 PM Xinrong Meng  
> > > wrote:
> > >>
> > >> Hi All,
> > >>
> > >> Since Spark 3.3.0 was released on June 16, 2022, shall we update the 
> > >> release window https://spark.apache.org/versioning-policy.html for Spark 
> > >> 3.4?
> > >>
> > >> A proposal is as follows:
> > >>
> > >> | October 15th 2022 | Code freeze. Release branch cut.
> > >> | Late October 2022 | QA period. Focus on bug fixes, tests, stability 
> > >> and docs. Generally, no new features merged.
> > >> | November 2022 | Release candidates (RC), voting, etc. until final 
> > >> release passes
> > >>
> > >> Thanks!
> > >>
> > >> Xinrong Meng
> > >>
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-11 Thread L. C. Hsieh
+1

On Mon, Jul 11, 2022 at 4:50 PM Hyukjin Kwon  wrote:
>
> +1
>
> On Tue, 12 Jul 2022 at 06:58, Dongjoon Hyun  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.2.2.
>>
>> The vote is open until July 15th 1AM (PST) and passes if a majority +1 PMC 
>> votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.2.2-rc1 (commit 
>> 78a5825fe266c0884d2dd18cbca9625fa258d7f7):
>> https://github.com/apache/spark/tree/v3.2.2-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1409/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-docs/
>>
>> The list of bug fixes going into 3.2.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12351232
>>
>> This release is using the release script of the tag v3.2.2-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.2.2?
>> ===
>>
>> The current list of open tickets targeted at 3.2.2 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 3.2.2
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> Dongjoon

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPIP] Spark Connect

2022-06-13 Thread L. C. Hsieh
+1

On Mon, Jun 13, 2022 at 5:41 PM Chao Sun  wrote:
>
> +1 (non-binding)
>
> On Mon, Jun 13, 2022 at 5:11 PM Hyukjin Kwon  wrote:
>>
>> +1
>>
>> On Tue, 14 Jun 2022 at 08:50, Yuming Wang  wrote:
>>>
>>> +1.
>>>
>>> On Tue, Jun 14, 2022 at 2:20 AM Matei Zaharia  
>>> wrote:

 +1, very excited about this direction.

 Matei

 On Jun 13, 2022, at 11:07 AM, Herman van Hovell 
  wrote:

 Let me kick off the voting...

 +1

 On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell  
 wrote:
>
> Hi all,
>
> I’d like to start a vote for SPIP: "Spark Connect"
>
> The goal of the SPIP is to introduce a Dataframe based client/server API 
> for Spark
>
> Please also refer to:
>
> - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark Connect 
> - A client and server interface for Apache Spark.
> - Design doc: Spark Connect - A client and server interface for Apache 
> Spark.
> - JIRA: SPARK-39375
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Kind Regards,
> Herman



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread L. C. Hsieh
+1

On Mon, Jun 13, 2022 at 5:07 PM Holden Karau  wrote:
>
> +1
>
> On Mon, Jun 13, 2022 at 4:51 PM Yuming Wang  wrote:
>>
>> +1 (non-binding)
>>
>> On Tue, Jun 14, 2022 at 7:41 AM Dongjoon Hyun  
>> wrote:
>>>
>>> +1
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>> On Mon, Jun 13, 2022 at 3:54 PM Chris Nauroth  wrote:

 +1 (non-binding)

 I repeated all checks I described for RC5:

 https://lists.apache.org/thread/ksoxmozgz7q728mnxl6c2z7ncmo87vls

 Maxim, thank you for your dedication on these release candidates.

 Chris Nauroth


 On Mon, Jun 13, 2022 at 3:21 PM Mridul Muralidharan  
 wrote:
>
>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>
> The test "SPARK-33084: Add jar support Ivy URI in SQL" in 
> sql.SQLQuerySuite fails; but other than that, the rest looks good.
>
> Regards,
> Mridul
>
>
>
> On Mon, Jun 13, 2022 at 4:25 PM Tom Graves  
> wrote:
>>
>> +1
>>
>> Tom
>>
>> On Thursday, June 9, 2022, 11:27:50 PM CDT, Maxim Gekk 
>>  wrote:
>>
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.3.0.
>>
>> The vote is open until 11:59pm Pacific time June 14th and passes if a 
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.3.0-rc6 (commit 
>> f74867bddfbcdd4d08076db36851e88b15e66556):
>> https://github.com/apache/spark/tree/v3.3.0-rc6
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1407
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/
>>
>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>
>> This release is using the release script of the tag v3.3.0-rc6.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.0?
>> ===
>> The current list of open tickets targeted at 3.3.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 3.3.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-07 Thread L. C. Hsieh
+1

Liang-Chi

On Tue, Jun 7, 2022 at 1:03 PM Gengliang Wang  wrote:
>
> +1 (non-binding)
>
> Gengliang
>
> On Tue, Jun 7, 2022 at 12:24 PM Thomas Graves  wrote:
>>
>> +1
>>
>> Tom Graves
>>
>> On Sat, Jun 4, 2022 at 9:50 AM Maxim Gekk
>>  wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark version 
>> > 3.3.0.
>> >
>> > The vote is open until 11:59pm Pacific time June 8th and passes if a 
>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.3.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v3.3.0-rc5 (commit 
>> > 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
>> > https://github.com/apache/spark/tree/v3.3.0-rc5
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1406
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-docs/
>> >
>> > The list of bug fixes going into 3.3.0 can be found at the following URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12350369
>> >
>> > This release is using the release script of the tag v3.3.0-rc5.
>> >
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.3.0?
>> > ===
>> > The current list of open tickets targeted at 3.3.0 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> > Version/s" = 3.3.0
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> > Maxim Gekk
>> >
>> > Software Engineer
>> >
>> > Databricks, Inc.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.3.0 (RC4)

2022-06-03 Thread L. C. Hsieh
It's fixed at https://github.com/apache/spark/pull/36762.
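
For anyone hitting the same ClassCastException elsewhere, here is a minimal
illustration of the Scala 2.13 collection change behind it (a sketch of the
general issue, not the actual patch):

    import scala.collection.mutable.ArrayBuffer

    // In Scala 2.13, scala.Seq is an alias for scala.collection.immutable.Seq,
    // so a mutable ArrayBuffer can no longer be used or cast where a Seq is
    // expected; an explicit copy is required.
    val buf = ArrayBuffer(1, 2, 3)
    val s: Seq[Int] = buf.toSeq // copies into an immutable Seq on 2.13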

On Fri, Jun 3, 2022 at 2:20 PM Sean Owen  wrote:
>
> Ah yeah, I think it's this change from 15 hrs ago. That needs to be .toSeq:
>
> https://github.com/apache/spark/commit/4a0f0ff6c22b85cb0fc1eef842da8dbe4c90543a#diff-01813c3e2e933ed573e4a93750107f004a86e587330cba5e91b5052fa6ade2a5R146
>
> On Fri, Jun 3, 2022 at 4:13 PM Sean Owen  wrote:
>>
>> In Scala 2.13, I'm getting errors like this:
>>
>>  analyzer should replace current_timestamp with literals *** FAILED ***
>>   java.lang.ClassCastException: class scala.collection.mutable.ArrayBuffer 
>> cannot be cast to class scala.collection.immutable.Seq 
>> (scala.collection.mutable.ArrayBuffer and scala.collection.immutable.Seq are 
>> in unnamed module of loader 'app')
>>   at 
>> org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.literals(ComputeCurrentTimeSuite.scala:146)
>> ...
>> - analyzer should replace current_date with literals *** FAILED ***
>>   java.lang.ClassCastException: class scala.collection.mutable.ArrayBuffer 
>> cannot be cast to class scala.collection.immutable.Seq 
>> (scala.collection.mutable.ArrayBuffer and scala.collection.immutable.Seq are 
>> in unnamed module of loader 'app')
>> ...
>>
>> I haven't investigated yet, just flagging in case anyone knows more about it 
>> immediately.
>>
>>
>> On Fri, Jun 3, 2022 at 9:54 AM Maxim Gekk 
>>  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 3.3.0.
>>>
>>> The vote is open until 11:59pm Pacific time June 7th and passes if a 
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.3.0-rc4 (commit 
>>> 4e3599bc11a1cb0ea9fc819e7f752d2228e54baf):
>>> https://github.com/apache/spark/tree/v3.3.0-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1405
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc4-docs/
>>>
>>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>
>>> This release is using the release script of the tag v3.3.0-rc4.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.3.0?
>>> ===
>>> The current list of open tickets targeted at 3.3.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>>> Version/s" = 3.3.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Introducing "Pandas API on Spark" component in JIRA, and use "PS" PR title component

2022-05-19 Thread L. C. Hsieh
+1. Thanks Hyukjin.

On Thu, May 19, 2022 at 10:14 AM Bryan Cutler  wrote:
>
> +1, sounds good
>
> On Wed, May 18, 2022 at 9:16 PM Dongjoon Hyun  wrote:
>>
>> +1
>>
>> Thank you for the suggestion, Hyukjin.
>>
>> Dongjoon.
>>
>> On Wed, May 18, 2022 at 11:08 AM Bjørn Jørgensen  
>> wrote:
>>>
>>> +1
>>> But can we have the PR title and PR label be the same: PS
>>>
>>> On Wed, May 18, 2022 at 18:57, Xinrong Meng 
>>> wrote:

 Great!

 It saves us from always specifying "Pandas API on Spark" in PR titles.

 Thanks!


 Xinrong Meng

 Software Engineer

 Databricks



 On Tue, May 17, 2022 at 1:08 AM Maciej  wrote:
>
> Sounds good!
>
> +1
>
> On 5/17/22 06:08, Yikun Jiang wrote:
> > It's a pretty good idea, +1.
> >
> > To be clear, in GitHub:
> >
> > - For each PR title: [SPARK-XXX][PYTHON][PS] The pandas-on-Spark PR
> > title
> > (*still keep [PYTHON]*, with [PS] newly added)
> >
> > - For PR labels: newly added: `PANDAS API ON SPARK`; still keep: `PYTHON`,
> > `CORE`
> > (*still keep `PYTHON` and `CORE`*, with `PANDAS API ON SPARK` newly added)
> > https://github.com/apache/spark/pull/36574
> > 
> >
> > Right?
> >
> > Regards,
> > Yikun
> >
> >
> > On Tue, May 17, 2022 at 11:26 AM Hyukjin Kwon  > > wrote:
> >
> > Hi all,
> >
> > What about we introduce a component in JIRA "Pandas API on Spark",
> > and use "PS"  (pandas-on-Spark) in PR titles? We already use "ps" in
> > many places when we: import pyspark.pandas as ps.
> > This is similar to "Structured Streaming" in JIRA, and "SS" in PR 
> > title.
> >
> > I think it'd be easier to track the changes here with that.
> > Currently it's a bit difficult to identify it from pure PySpark 
> > changes.
> >
>
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>>>
>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SIGMOD System Award for Apache Spark

2022-05-13 Thread L. C. Hsieh
This is awesome! Great congrats to everyone in the Spark community!

On Fri, May 13, 2022 at 10:57 AM Manolis Gemeliaris <
gemeliarismano...@gmail.com> wrote:

> Congratulations everyone !
>
> On Fri, May 13, 2022 at 8:06 PM, Xingbo Jiang <
> jiangxb1...@gmail.com> wrote:
>
>> Congratulations!
>>
>> On Fri, May 13, 2022 at 9:43 AM Xiao Li 
>> wrote:
>>
>>> Congratulations to everyone!
>>>
>>> Xiao
>>>
>>> On Fri, May 13, 2022 at 9:34 AM Dongjoon Hyun 
>>> wrote:
>>>
 Ya, it's really great! Congratulations to the whole community!

 Dongjoon.

 On Fri, May 13, 2022 at 8:12 AM Chao Sun  wrote:

> Huge congrats to the whole community!
>
> On Fri, May 13, 2022 at 1:56 AM Wenchen Fan 
> wrote:
>
>> Great! Congratulations to everyone!
>>
>> On Fri, May 13, 2022 at 10:38 AM Gengliang Wang 
>> wrote:
>>
>>> Congratulations to the whole spark community!
>>>
>>> On Fri, May 13, 2022 at 10:14 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Congrats Spark community!

 On Fri, May 13, 2022 at 10:40 AM Qian Sun 
 wrote:

> Congratulations !!!
>
> On May 13, 2022, at 3:44 AM, Matei Zaharia  wrote:
>
> Hi all,
>
> We recently found out that Apache Spark received the SIGMOD System Award
> this year, given by SIGMOD (the ACM’s data management research
> organization) to impactful real-world and research systems. This puts
> Spark in good company with some very impressive previous recipients. This
> award is really an achievement by the whole community, so I wanted to 
> say
> congrats to everyone who contributes to Spark, whether through code, 
> issue
> reports, docs, or other means.
>
> Matei
>
>
>
>>>
>>> --
>>>
>>>


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-04 Thread L. C. Hsieh
+1
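
For readers new to the thread: the proposal quoted below adds a catalog
plugin interface for views, analogous to TableCatalog in DataSourceV2. A
minimal, hypothetical sketch of the shape such an interface could take
(names and signatures are illustrative only; the actual API is specified in
the SPIP design doc and may differ):

    // Placeholder types for this sketch; in practice Spark's own
    // Identifier and view representation would be used.
    case class Identifier(namespace: Seq[String], name: String)
    case class View(ident: Identifier, sql: String)

    trait ViewCatalog {
      def loadView(ident: Identifier): View
      def createView(ident: Identifier, sql: String): View
      def alterView(ident: Identifier, newSql: String): View
      def dropView(ident: Identifier): Boolean
    }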

On Thu, Feb 3, 2022 at 7:25 PM Chao Sun  wrote:
>
> +1 (non-binding). Looking forward to this feature!
>
> On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue  wrote:
>>
>> +1 for the SPIP. I think it's well designed and it has worked quite well at 
>> Netflix for a long time.
>>
>> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge  wrote:
>>>
>>> Hi Spark community,
>>>
>>> I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
>>>
>>> The proposal is to add a ViewCatalog interface that can be used to load, 
>>> create, alter, and drop views in DataSourceV2.
>>>
>>> Please vote on the SPIP until Feb. 9th (Wednesday).
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>
>>
>>
>> --
>> Ryan Blue
>> Tabular

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread L. C. Hsieh
Thanks Huaxin for the 3.2.1 release!

On Fri, Jan 28, 2022 at 10:14 PM Dongjoon Hyun  wrote:
>
> Thank you again, Huaxin!
>
> Dongjoon.
>
> On Fri, Jan 28, 2022 at 6:23 PM DB Tsai  wrote:
>>
>> Thank you, Huaxin for the 3.2.1 release!
>>
>> Sent from my iPhone
>>
>> On Jan 28, 2022, at 5:45 PM, Chao Sun  wrote:
>>
>> 
>> Thanks Huaxin for driving the release!
>>
>> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng  wrote:
>>>
>>> It's Great!
>>> Congrats and thanks, huaxin!
>>>
>>>
>>> -- Original message --
>>> From: "huaxin gao" ;
>>> Date: Saturday, January 29, 2022, 9:07 AM
>>> To: "dev"; "user";
>>> Subject: [ANNOUNCE] Apache Spark 3.2.1 released
>>>
>>> We are happy to announce the availability of Spark 3.2.1!
>>>
>>> Spark 3.2.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.2 maintenance branch of Spark. We strongly
>>> recommend all 3.2 users to upgrade to this stable release.
>>>
>>> To download Spark 3.2.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-2-1.html
>>>
>>> We would like to acknowledge all community members for contributing to this
>>> release. This release would not have been possible without you.
>>>
>>> Huaxin Gao

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread L. C. Hsieh
Thank you, Shane.

On Mon, Dec 6, 2021 at 4:27 PM Holden Karau  wrote:
>
> Shane you kick ass thank you for everything you’ve done for us :) Keep on 
> rocking :)
>
> On Mon, Dec 6, 2021 at 4:24 PM Hyukjin Kwon  wrote:
>>
>> Thanks, Shane.
>>
>> On Tue, 7 Dec 2021 at 09:19, Dongjoon Hyun  wrote:
>>>
>>> I really want to thank you for all your help.
>>> You've done so many things for the Apache Spark community.
>>>
>>> Sincerely,
>>> Dongjoon
>>>
>>>
>>> On Mon, Dec 6, 2021 at 12:02 PM shane knapp ☠  wrote:

 hey everyone!

 after a marathon run of nearly a decade, we're finally going to be 
 shutting down {amp|rise}lab jenkins at the end of this month...

 the earliest snapshot i could find is from 2013 with builds for spark 0.7:
 https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/

 it's been a hell of a run, and i'm gonna miss randomly tweaking the build 
 system, but technology has moved on and running a dedicated set of servers 
 for just one open source project is just too expensive for us here at uc 
 berkeley.

 if there's interest, i'll fire up a zoom session and all y'alls can watch 
 me type the final command:

 systemctl stop jenkins

 feeling bittersweet,

 shane
 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][RESULT] SPIP: Row-level operations in Data Source V2

2021-11-16 Thread L. C. Hsieh
Sorry, it should be 13 +1 votes and no -1 or +0 votes; I missed Yufei
Gu's vote while editing the list.

Updated vote list:

Liang-Chi Hsieh*
Anton Okolnychyi
DB Tsai*
Yufei Gu
Huaxin Gao
Dongjoon Hyun*
Russell Spitzer
Mich Talebzadeh
Ryan Blue
Chao Sun
John Zhuge
Wenchen Fan*
Gengliang Wang

* = binding

On Tue, Nov 16, 2021 at 9:37 AM L. C. Hsieh  wrote:
>
> Hi all,
>
> The vote passed with the following 12 +1 votes and no -1 or +0 votes:
>
> Liang-Chi Hsieh*
> Anton Okolnychyi
> DB Tsai*
> Huaxin Gao
> Dongjoon Hyun*
> Russell Spitzer
> Mich Talebzadeh
> Ryan Blue
> Chao Sun
> John Zhuge
> Wenchen Fan*
> Gengliang Wang
>
> * = binding
>
> Thank you guys all for your feedback and votes.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE][RESULT] SPIP: Row-level operations in Data Source V2

2021-11-16 Thread L. C. Hsieh
Hi all,

The vote passed with the following 12 +1 votes and no -1 or +0 votes:

Liang-Chi Hsieh*
Anton Okolnychyi
DB Tsai*
Huaxin Gao
Dongjoon Hyun*
Russell Spitzer
Mich Talebzadeh
Ryan Blue
Chao Sun
John Zhuge
Wenchen Fan*
Gengliang Wang

* = binding

Thank you guys all for your feedback and votes.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] SPIP: Row-level operations in Data Source V2

2021-11-12 Thread L. C. Hsieh
Hi all,

I’d like to start a vote for SPIP: Row-level operations in Data Source V2.

The proposal is to add support for executing row-level operations
such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
execution should be the same across data sources and the best way to do
that is to implement it in Spark.

Right now, Spark can only parse and to some extent analyze DELETE, UPDATE,
MERGE commands. Data sources that support row-level changes have to build
custom Spark extensions to execute such statements. The goal of this effort
is to come up with a flexible and easy-to-use API that will work across
data sources.
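
To make the scope concrete, these are the kinds of statements the API is
meant to let v2 sources execute without custom extensions (a hedged sketch;
the catalog, table, and column names are made up for illustration):

    // DELETE: remove matching rows from a v2 table.
    spark.sql("DELETE FROM cat.db.events WHERE event_date < date'2020-01-01'")

    // UPDATE: rewrite matching rows in place.
    spark.sql("UPDATE cat.db.events SET status = 'archived' WHERE stale = true")

    // MERGE: upsert rows from a source relation into the target table.
    spark.sql(
      """MERGE INTO cat.db.events t
        |USING updates s
        |ON t.id = s.id
        |WHEN MATCHED THEN UPDATE SET *
        |WHEN NOT MATCHED THEN INSERT *""".stripMargin)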

Please also refer to:

   - Previous discussion in dev mailing list: [DISCUSS] SPIP:
Row-level operations in Data Source V2
   - JIRA: SPARK-35801 <https://issues.apache.org/jira/browse/SPARK-35801>
   - PR for handling DELETE statements:
<https://github.com/apache/spark/pull/33008>
   - Design doc
<https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/>

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-11-12 Thread L. C. Hsieh
Hi all,

From what I've seen, we are mostly in favor of the SPIP.

If there are no more comments or discussion on the SPIP doc, I will raise a vote soon.
Thanks.

On Tue, Nov 2, 2021 at 9:58 AM L. C. Hsieh  wrote:
>
> +1 for the idea to commit the work earlier.
>
> I think we will start the vote soon. Once it passes, we can submit the
> PRs.
>
> What do you think? Anton.
>
> On Mon, Nov 1, 2021 at 7:59 AM Wenchen Fan  wrote:
> >
> > The general idea looks great. This is indeed a complicated API and we 
> > probably need more time to evaluate the API design. It's better to commit 
> > this work earlier so that we have more time to verify it before the 3.3 
> > release. Maybe we can commit the group-based API first, then the 
> > delta-based one, as the delta-based API is significantly more convoluted.
> >
> > On Thu, Oct 28, 2021 at 12:53 AM L. C. Hsieh  wrote:
> >>
> >>
> >> Thanks for the initial feedback.
> >>
> >> I think the community was previously busy with work related to the Spark
> >> 3.2 release.
> >> As the 3.2 release is done, I'd like to bring this up to the surface again
> >> and seek more discussion and feedback.
> >>
> >> Thanks.
> >>
> >> On 2021/06/25 15:49:49, huaxin gao  wrote:
> >> > I took a quick look at the PR and it looks like a great feature to have. 
> >> > It
> >> > provides unified APIs for data sources to perform the commonly used
> >> > operations easily and efficiently, so users don't have to implement
> >> > customer extensions on their own. Thanks Anton for the work!
> >> >
> >> > On Thu, Jun 24, 2021 at 9:42 PM L. C. Hsieh  wrote:
> >> >
> >> > > Thanks Anton. I volunteer to be the shepherd of the SPIP. This is
> >> > > also my first time shepherding a SPIP, so please let me know if
> >> > > there is anything I can improve.
> >> > >
> >> > > These look like great features, and the rationale in the proposal
> >> > > makes sense. These operations are getting more common and more
> >> > > important in big data workloads. Instead of having individual data
> >> > > sources build custom extensions, it makes more sense to support the
> >> > > API in Spark.
> >> > >
> >> > > Please provide your thoughts about the proposal and the design. 
> >> > > Appreciate
> >> > > your feedback. Thank you!
> >> > >
> >> > > On 2021/06/24 23:53:32, Anton Okolnychyi  wrote:
> >> > > > Hey everyone,
> >> > > >
> >> > > > I'd like to start a discussion on adding support for executing 
> >> > > > row-level
> >> > > > operations such as DELETE, UPDATE, MERGE for v2 tables 
> >> > > > (SPARK-35801). The
> >> > > > execution should be the same across data sources and the best way to 
> >> > > > do
> >> > > > that is to implement it in Spark.
> >> > > >
> >> > > > Right now, Spark can only parse and to some extent analyze DELETE,
> >> > > UPDATE,
> >> > > > MERGE commands. Data sources that support row-level changes have to 
> >> > > > build
> >> > > > custom Spark extensions to execute such statements. The goal of this
> >> > > effort
> >> > > > is to come up with a flexible and easy-to-use API that will work 
> >> > > > across
> >> > > > data sources.
> >> > > >
> >> > > > Design doc:
> >> > > >
> >> > > https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
> >> > > >
> >> > > > PR for handling DELETE statements:
> >> > > > https://github.com/apache/spark/pull/33008
> >> > > >
> >> > > > Any feedback is more than welcome.
> >> > > >
> >> > > > Liang-Chi was kind enough to shepherd this effort. Thanks!
> >> > > >
> >> > > > - Anton
> >> > > >
> >> > >
> >> > > -
> >> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >> > >
> >> > >
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-11-02 Thread L. C. Hsieh
+1 for the idea to commit the work earlier.

I think we will start the vote soon. Once it passes, we can submit the PRs.

What do you think? Anton.
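
(To unpack the two API styles mentioned in the quoted message below, as I
understand them: a group-based API rewrites whole groups of data, e.g. the
files containing matched rows, copy-on-write style; a delta-based API emits
individual row-level changes for the source to apply, merge-on-read style. A
rough conceptual sketch for intuition only, not the SPIP's actual
interfaces:)

    // Hypothetical shapes only; real types come from the design doc.
    type Row = Map[String, Any]

    trait GroupBasedRewrite {
      // Copy-on-write: rewrite the affected groups (e.g. files) in full,
      // carrying over unmatched rows plus the modified ones.
      def replaceGroups(groupIds: Seq[String], newRows: Iterator[Row]): Unit
    }

    trait DeltaBasedWrite {
      // Merge-on-read: emit fine-grained row-level changes that the source
      // applies or reconciles at read time.
      def insert(row: Row): Unit
      def update(rowId: Row, newRow: Row): Unit
      def delete(rowId: Row): Unit
    }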

On Mon, Nov 1, 2021 at 7:59 AM Wenchen Fan  wrote:
>
> The general idea looks great. This is indeed a complicated API and we 
> probably need more time to evaluate the API design. It's better to commit 
> this work earlier so that we have more time to verify it before the 3.3 
> release. Maybe we can commit the group-based API first, then the delta-based 
> one, as the delta-based API is significantly more convoluted.
>
> On Thu, Oct 28, 2021 at 12:53 AM L. C. Hsieh  wrote:
>>
>>
>> Thanks for the initial feedback.
>>
>> I think the community was previously busy with work related to the Spark
>> 3.2 release.
>> As the 3.2 release is done, I'd like to bring this up to the surface again
>> and seek more discussion and feedback.
>>
>> Thanks.
>>
>> On 2021/06/25 15:49:49, huaxin gao  wrote:
>> > I took a quick look at the PR and it looks like a great feature to have. It
>> > provides unified APIs for data sources to perform the commonly used
>> > operations easily and efficiently, so users don't have to implement
>> > customer extensions on their own. Thanks Anton for the work!
>> >
>> > On Thu, Jun 24, 2021 at 9:42 PM L. C. Hsieh  wrote:
>> >
>> > > Thanks Anton. I volunteer to be the shepherd of the SPIP. This is also
>> > > my first time shepherding a SPIP, so please let me know if there is
>> > > anything I can improve.
>> > >
>> > > These look like great features, and the rationale in the proposal makes
>> > > sense. These operations are getting more common and more important in big
>> > > data workloads. Instead of having individual data sources build custom
>> > > extensions, it makes more sense to support the API in Spark.
>> > >
>> > > Please provide your thoughts about the proposal and the design. 
>> > > Appreciate
>> > > your feedback. Thank you!
>> > >
>> > > On 2021/06/24 23:53:32, Anton Okolnychyi  wrote:
>> > > > Hey everyone,
>> > > >
>> > > > I'd like to start a discussion on adding support for executing 
>> > > > row-level
>> > > > operations such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). 
>> > > > The
>> > > > execution should be the same across data sources and the best way to do
>> > > > that is to implement it in Spark.
>> > > >
>> > > > Right now, Spark can only parse and to some extent analyze DELETE,
>> > > UPDATE,
>> > > > MERGE commands. Data sources that support row-level changes have to 
>> > > > build
>> > > > custom Spark extensions to execute such statements. The goal of this
>> > > effort
>> > > > is to come up with a flexible and easy-to-use API that will work across
>> > > > data sources.
>> > > >
>> > > > Design doc:
>> > > >
>> > > https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>> > > >
>> > > > PR for handling DELETE statements:
>> > > > https://github.com/apache/spark/pull/33008
>> > > >
>> > > > Any feedback is more than welcome.
>> > > >
>> > > > Liang-Chi was kind enough to shepherd this effort. Thanks!
>> > > >
>> > > > - Anton
>> > > >
>> > >
>> > > -
>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > >
>> > >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-29 Thread L. C. Hsieh


I'll start with my +1.

On 2021/10/29 17:30:03, L. C. Hsieh  wrote: 
> Hi all,
> 
> I’d like to start a vote for SPIP: Storage Partitioned Join for Data Source 
> V2.
> 
> The proposal is to support a new type of join: storage partitioned join which
> covers bucket join support for DataSourceV2 but is more general. The goal
> is to let Spark leverage distribution properties reported by data sources and
> eliminate shuffle whenever possible.
> 
> Please also refer to:
> 
>- Previous discussion in dev mailing list: [DISCUSS] SPIP: Storage 
> Partitioned Join for Data Source V2
>
> <https://lists.apache.org/thread.html/r7dc67c3db280a8b2e65855cb0b1c86b524d4e6ae1ed9db9ca12cb2e6%40%3Cdev.spark.apache.org%3E>
>.
>- JIRA: SPARK-37166 <https://issues.apache.org/jira/browse/SPARK-37166>
>- Design doc 
> <https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE>
>  
> 
> Please vote on the SPIP for the next 72 hours:
> 
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-29 Thread L. C. Hsieh
Hi all,

I’d like to start a vote for SPIP: Storage Partitioned Join for Data Source V2.

The proposal is to support a new type of join: storage partitioned join which
covers bucket join support for DataSourceV2 but is more general. The goal
is to let Spark leverage distribution properties reported by data sources and
eliminate shuffle whenever possible.
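
As a rough illustration of the intended win (hypothetical tables; assumes
both sides report the same bucket transform on the join key through the v2
partitioning API):

    // Both v2 tables report bucket(128, order_id) as their partitioning.
    // With storage partitioned join, Spark can match the reported
    // distributions and plan the join without an Exchange on either side.
    val orders  = spark.table("cat.db.orders")   // bucket(128, order_id)
    val returns = spark.table("cat.db.returns")  // bucket(128, order_id)
    orders.join(returns, "order_id").explain()   // ideally, no shuffle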

Please also refer to:

   - Previous discussion in dev mailing list: [DISCUSS] SPIP: Storage
Partitioned Join for Data Source V2
<https://lists.apache.org/thread.html/r7dc67c3db280a8b2e65855cb0b1c86b524d4e6ae1ed9db9ca12cb2e6%40%3Cdev.spark.apache.org%3E>
   - JIRA: SPARK-37166 <https://issues.apache.org/jira/browse/SPARK-37166>
   - Design doc
<https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE>

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-29 Thread L. C. Hsieh


Thanks all for your inputs here!

The discussion seems to have settled, so I will be the shepherd for the SPIP 
and call for a vote on it in a new thread.

On 2021/10/28 13:05:53, Wenchen Fan  wrote: 
> Thanks for the explanation! It makes sense to always resolve the logical
> transforms to concrete implementations, and check the concrete
> implementations to decide compatible partitions. We can discuss more
> details in the PR later.
> 
> On Thu, Oct 28, 2021 at 4:14 AM Ryan Blue  wrote:
> 
> > The transform expressions in v2 are logical, not concrete implementations.
> > Even days may have different implementations -- the only expectation is
> > that the partitions are day-sized. For example, you could use a transform
> > that splits days at UTC 00:00, or uses some other day boundary.
> >
> > Because the expressions are logical, we need to resolve them to
> > implementations at some point, like Chao outlines. We can do that using a
> > FunctionCatalog, although I think it's worth considering adding an
> > interface so that a transform from a Table can be converted into a
> > `BoundFunction` directly. That is easier than defining a way for Spark to
> > query the function catalog.
> >
> > In any case, I'm sure it's easy to understand how this works once you get
> > a concrete implementation.
> >
> > On Wed, Oct 27, 2021 at 9:35 AM Wenchen Fan  wrote:
> >
> >> `BucketTransform` is a builtin partition transform in Spark, instead of a
> >> UDF from `FunctionCatalog`. Will Iceberg use UDF from `FunctionCatalog` to
> >> represent its bucket transform, or use the Spark builtin `BucketTransform`?
> >> I'm asking this because other v2 sources may also use the builtin
> >> `BucketTransform` but use a different bucket hash function. Or we can
> >> clearly define the bucket hash function of the builtin `BucketTransform` in
> >> the doc.
> >>
> >> On Thu, Oct 28, 2021 at 12:25 AM Ryan Blue  wrote:
> >>
> >>> Two v2 sources may return different bucket IDs for the same value, and
> >>> this breaks the phase 1 split-wise join.
> >>>
> >>> This is why the FunctionCatalog included a canonicalName method.
> >>> That method returns an identifier that can be used to compare whether two
> >>> bucket function instances are the same.
> >>>
> >>>
> >>>1. Can we apply this idea to partitioned file source tables
> >>>(non-bucketed) as well?
> >>>
> >>> What do you mean here? The design doc discusses transforms like days(ts)
> >>> that can be supported in the future. Is that what you’re asking about? Or
> >>> are you referring to v1 file sources? I think the goal is to support v2,
> >>> since v1 doesn’t have reliable behavior.
> >>>
> >>> Note that the initial implementation goal is to support bucketing since
> >>> that’s an easier case because both sides have the same number of
> >>> partitions. More complex storage-partitioned joins can be implemented 
> >>> later.
> >>>
> >>>
> >>>1. What if the table has many partitions? Shall we apply certain
> >>>join algorithms in the phase 1 split-wise join as well? Or even launch 
> >>> a
> >>>Spark job to do so?
> >>>
> >>> I think that this proposal opens up a lot of possibilities, like what
> >>> you’re suggesting here. It is a bit like AQE. We’ll need to come up with
> >>> heuristics for choosing how and when to use storage partitioning in joins.
> >>> As I said above, bucketing is a great way to get started because it fills
> >>> an existing gap. More complex use cases can be supported over time.
> >>>
> >>> Ryan
> >>>
> >>> On Wed, Oct 27, 2021 at 9:08 AM Wenchen Fan  wrote:
> >>>
>  IIUC, the general idea is to let each input split report its partition
>  value, and Spark can perform the join in two phases:
>  1. join the input splits from left and right tables according to their
>  partitions values and join keys, at the driver side.
>  2. for each joined input splits pair (or a group of splits), launch a
>  Spark task to join the rows.
> 
>  My major concern is about how to define "compatible partitions". Things
>  like `days(ts)` are straightforward: the same timestamp value always
>  results in the same partition value, in whatever v2 sources. `bucket(col,
>  num)` is tricky, as Spark doesn't define the bucket hash function. Two v2
>  sources may return different bucket IDs for the same value, and this 
>  breaks
>  the phase 1 split-wise join.
> 
>  And two questions for further improvements:
>  1. Can we apply this idea to partitioned file source tables
>  (non-bucketed) as well?
>  2. What if the table has many partitions? Shall we apply certain join
>  algorithms in the phase 1 split-wise join as well? Or even launch a Spark
>  job to do so?
> 
>  Thanks,
>  Wenchen
> 
>  On 

Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-10-27 Thread L. C. Hsieh


Thanks for the initial feedback.

I think the community was previously busy with work related to the Spark 3.2 
release.
As the 3.2 release is done, I'd like to bring this up to the surface again and 
seek more discussion and feedback.

Thanks.

On 2021/06/25 15:49:49, huaxin gao  wrote: 
> I took a quick look at the PR and it looks like a great feature to have. It
> provides unified APIs for data sources to perform the commonly used
> operations easily and efficiently, so users don't have to implement
> customer extensions on their own. Thanks Anton for the work!
> 
> On Thu, Jun 24, 2021 at 9:42 PM L. C. Hsieh  wrote:
> 
> > Thanks Anton. I volunteer to be the shepherd of the SPIP. This is also
> > my first time shepherding a SPIP, so please let me know if there is
> > anything I can improve.
> >
> > These look like great features, and the rationale in the proposal makes
> > sense. These operations are getting more common and more important in big
> > data workloads. Instead of having individual data sources build custom
> > extensions, it makes more sense to support the API in Spark.
> >
> > Please provide your thoughts about the proposal and the design. Appreciate
> > your feedback. Thank you!
> >
> > On 2021/06/24 23:53:32, Anton Okolnychyi  wrote:
> > > Hey everyone,
> > >
> > > I'd like to start a discussion on adding support for executing row-level
> > > operations such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
> > > execution should be the same across data sources and the best way to do
> > > that is to implement it in Spark.
> > >
> > > Right now, Spark can only parse and to some extent analyze DELETE,
> > UPDATE,
> > > MERGE commands. Data sources that support row-level changes have to build
> > > custom Spark extensions to execute such statements. The goal of this
> > effort
> > > is to come up with a flexible and easy-to-use API that will work across
> > > data sources.
> > >
> > > Design doc:
> > >
> > https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
> > >
> > > PR for handling DELETE statements:
> > > https://github.com/apache/spark/pull/33008
> > >
> > > Any feedback is more than welcome.
> > >
> > > Liang-Chi was kind enough to shepherd this effort. Thanks!
> > >
> > > - Anton
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread L. C. Hsieh
+1 for the SPIP. This is a great improvement and optimization!
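
One concrete case raised in the quoted discussion below is aggregation: a
bucketed table already co-locates all rows for a key, so a group-by on the
bucket column could skip the shuffle too. A hedged sketch with a made-up
table name:

    // If cat.db.clicks reports bucket(64, user_id) as its partitioning,
    // the aggregate below could be planned without a preceding Exchange.
    spark.table("cat.db.clicks")
      .groupBy("user_id")
      .count()
      .explain()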

On 2021/10/26 19:01:03, Erik Krogen  wrote: 
> It's great to see this SPIP going live. Once this is complete, it will
> really help Spark to play nicely with a broader data ecosystem (Hive,
> Iceberg, Trino, etc.), and it's great to see that besides just bringing the
> existing bucketed-join support to V2, we are also making the types of
> partitioning that can be accommodated more broad and leaving open pathways
> for future optimizations like partially clustered distributions.
> 
> Big thanks to Ryan and Chao!
> 
> On Tue, Oct 26, 2021 at 10:35 AM Cheng Su  wrote:
> 
> > +1 for this. This is an exciting step toward efficiently reading bucketed
> > tables from other systems (Hive, Trino & Presto)!
> >
> >
> >
> > Still looking at the details but having some early questions:
> >
> >
> >
> >1. Is migrating Hive table read path to data source v2, being a
> >prerequisite of this SPIP?
> >
> >
> >
> > The Hive table read path is currently a mix of data source v1 (for Parquet &
> > ORC file formats only) and the legacy Hive code path (HiveTableScanExec). In
> > the SPIP, I see we only make changes for data source v2, so I'm wondering
> > how this would work with the existing Hive table read path. In addition, just
> > FYI, support for writing Hive bucketed tables was merged into master recently
> > (SPARK-19256 has the details).
> >
> >
> >
> >1. Would aggregate work automatically after the SPIP?
> >
> >
> >
> > Another major benefit of having bucketed tables is avoiding the shuffle
> > before an aggregate. I just want to bring to our attention that it would be
> > great to consider aggregates as well in this proposal.
> >
> >
> >
> >1. Any major use cases in mind except Hive bucketed table?
> >
> >
> >
> > Just curious if there are any other use cases we are targeting as part of
> > the SPIP.
> >
> >
> >
> > Thanks,
> >
> > Cheng Su
> >
> >
> >
> >
> >
> >
> >
> > *From: *Ryan Blue 
> > *Date: *Tuesday, October 26, 2021 at 9:39 AM
> > *To: *John Zhuge 
> > *Cc: *Chao Sun , Wenchen Fan ,
> > Cheng Su , DB Tsai , Dongjoon Hyun <
> > dongjoon.h...@gmail.com>, Hyukjin Kwon , Wenchen Fan
> > , angers zhu , dev <
> > dev@spark.apache.org>, huaxin gao 
> > *Subject: *Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2
> >
> > Instead of commenting on the doc, could we keep discussion here on the dev
> > list please? That way more people can follow it and there is more room for
> > discussion. Comment threads have a very small area and easily become hard
> > to follow.
> >
> >
> >
> > Ryan
> >
> >
> >
> > On Tue, Oct 26, 2021 at 9:32 AM John Zhuge  wrote:
> >
> > +1  Nicely done!
> >
> >
> >
> > On Tue, Oct 26, 2021 at 8:08 AM Chao Sun  wrote:
> >
> > Oops, sorry. I just fixed the permission setting.
> >
> >
> >
> > Thanks everyone for the positive support!
> >
> >
> >
> > On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan  wrote:
> >
> > +1 to this SPIP and nice writeup of the design doc!
> >
> >
> >
> > Can we open comment permission in the doc so that we can discuss details
> > there?
> >
> >
> >
> > On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon  wrote:
> >
> > Seems making sense to me.
> >
> > Would be great to have some feedback from people such as @Wenchen Fan
> >  @Cheng Su  @angers zhu
> > .
> >
> >
> >
> >
> >
> > On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun 
> > wrote:
> >
> > +1 for this SPIP.
> >
> >
> >
> > On Sun, Oct 24, 2021 at 9:59 AM huaxin gao  wrote:
> >
> > +1. Thanks for lifting the current restrictions on bucket join and making
> > this more generalized.
> >
> >
> >
> > On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:
> >
> > +1 from me as well. Thanks Chao for doing so much to get it to this point!
> >
> >
> >
> > On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
> >
> > +1 on this SPIP.
> >
> > This is a more generalized version of bucketed tables and bucketed
> > joins which can eliminate very expensive data shuffles when joins, and
> > many users in the Apache Spark community have wanted this feature for
> > a long time!
> >
> > Thank you, Ryan and Chao, for working on this, and I look forward to
> > it as a new feature in Spark 3.3
> >
> > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> >
> > On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
> > >
> > > Hi,
> > >
> > > Ryan and I drafted a design doc to support a new type of join: storage
> > partitioned join which covers bucket join support for DataSourceV2 but is
> > more general. The goal is to let Spark leverage distribution properties
> > reported by data sources and eliminate shuffle whenever possible.
> > >
> > > Design doc:
> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> > (includes a POC link at the end)
> > >
> > > We'd like to start a discussion on the doc and any feedback is welcome!
> > >
> > > Thanks,
> > > Chao
> >
> >
> >
> >
> > --
> >
> > Ryan Blue
> >
> >
> >
> >
> 

Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-08 Thread L. C. Hsieh
+1

Looks good.

Liang-Chi

On 2021/10/08 16:16:12, Kent Yao  wrote: 
> +1 (non-binding)
> 
> BR,
> 
> Kent Yao
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> a spark enthusiast
> kyuubi (https://github.com/yaooqinn/kyuubi) is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top of
> Apache Spark (http://spark.apache.org/).
> spark-authorizer (https://github.com/yaooqinn/spark-authorizer): a Spark SQL
> extension which provides SQL Standard Authorization for Apache Spark.
> spark-postgres (https://github.com/yaooqinn/spark-postgres): a library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.
> itatchi (https://github.com/yaooqinn/spark-func-extras): a library that
> brings useful functions from various modern database management systems to
> Apache Spark.
> 
> Sent from NetEase Mail Master
> 
> -- Original message --
> From: huaxin gao 
> Date: October 8, 2021, 23:59
> To: Xinli shang 
> Cc: Chao Sun; Maxim Gekk; Mich Talebzadeh; dev
> Subject: Re: [VOTE] Release Spark 3.2.0 (RC7)
> 
> +1 (non-binding)
> 
> On Fri, Oct 8, 2021 at 8:27 AM Xinli shang  wrote:
> 
> +1 (non-binding)
> 
> On Fri, Oct 8, 2021 at 7:59 AM Chao Sun  wrote:
> 
> +1

Re: [ANNOUNCE] Apache Spark 3.0.3 released

2021-06-25 Thread L. C. Hsieh
Thanks Yi for the work!

On 2021/06/25 05:51:38, Yi Wu  wrote: 
> We are happy to announce the availability of Spark 3.0.3!
> 
> Spark 3.0.3 is a maintenance release containing stability fixes. This
> release is based on the branch-3.0 maintenance branch of Spark. We strongly
> recommend all 3.0 users to upgrade to this stable release.
> 
> To download Spark 3.0.3, head over to the download page:
> https://spark.apache.org/downloads.html
> 
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-3.html
> 
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
> 
> Yi
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-06-24 Thread L. C. Hsieh
Thanks Anton. I volunteer to be the shepherd of the SPIP. This is also my 
first time shepherding a SPIP, so please let me know if there is anything I 
can improve.

These look like great features, and the rationale in the proposal makes 
sense. These operations are getting more common and more important in big data 
workloads. Instead of having individual data sources build custom extensions, 
it makes more sense to support the API in Spark.

Please provide your thoughts about the proposal and the design. Appreciate your 
feedback. Thank you!

On 2021/06/24 23:53:32, Anton Okolnychyi  wrote: 
> Hey everyone,
> 
> I'd like to start a discussion on adding support for executing row-level
> operations such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
> execution should be the same across data sources and the best way to do
> that is to implement it in Spark.
> 
> Right now, Spark can only parse and to some extent analyze DELETE, UPDATE,
> MERGE commands. Data sources that support row-level changes have to build
> custom Spark extensions to execute such statements. The goal of this effort
> is to come up with a flexible and easy-to-use API that will work across
> data sources.
> 
> Design doc:
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
> 
> PR for handling DELETE statements:
> https://github.com/apache/spark/pull/33008
> 
> Any feedback is more than welcome.
> 
> Liang-Chi was kind enough to shepherd this effort. Thanks!
> 
> - Anton
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org