Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Wenchen Fan
+1 On Tue, May 14, 2024 at 8:19 AM Zhou Jiang wrote: > +1 (non-binding) > > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > >> Hi all, >> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >> >> Please also refer to: >> >>- Discussion thread: >>

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-13 Thread Wenchen Fan
ng again to fix this problem, but it needs to be in > collaboration with a committer since I cannot fully test the release > scripts. (This testing gap is what doomed my last attempt at fixing this > problem.) > > Nick > > > On May 13, 2024, at 12:18 AM, Wenchen Fan wrote:

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Wenchen Fan
+1 On Mon, May 13, 2024 at 10:30 AM Kent Yao wrote: > +1 > > Dongjoon Hyun wrote on Mon, May 13, 2024 at 08:39: > > > > +1 > > > > On Sun, May 12, 2024 at 3:50 PM huaxin gao > wrote: > >> > >> +1 > >> > >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: > >>> > >>> +1 > >>> > >>> On Sat, May 11, 2024

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Wenchen Fan
After finishing the 4.0.0-preview1 RC1, I have more experience with this topic now. In fact, the main job of the release process, building packages and documents, is tested in GitHub Actions jobs. However, the way we test them is different from what we do in the release scripts. 1. the execution

[VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 4.0.0-preview1. The vote is open until May 16 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 4.0.0-preview1 [ ] -1 Do not release this package
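
The passing rule quoted in this vote call can be written down as a small check. This is only a minimal sketch: it assumes that "majority" means more binding (PMC) +1 votes than binding -1 votes, and the function name vote_passes is illustrative, not part of any Apache tooling.

```python
def vote_passes(pmc_plus_ones: int, pmc_minus_ones: int) -> bool:
    """Check the rule from the vote call: a majority of +1 PMC votes,
    with a minimum of 3 +1 votes.

    Assumption: "majority" is read as more binding +1 than binding -1 votes.
    Non-binding votes are not counted here.
    """
    has_minimum = pmc_plus_ones >= 3
    has_majority = pmc_plus_ones > pmc_minus_ones
    return has_minimum and has_majority


# Three binding +1 votes and no -1 votes satisfy both conditions.
print(vote_passes(3, 0))
# Two binding +1 votes fall short of the minimum of three.
print(vote_passes(2, 0))
```

Non-binding votes, which appear throughout these threads, are informative but do not enter this tally.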

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Wenchen Fan
Thanks for leading this project! Let's move forward. On Fri, May 10, 2024 at 10:31 AM L. C. Hsieh wrote: > Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and > others if I miss those who are participating in the discussion. > > I suppose we have reached a consensus or close to

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
MB > > Dongjoon. > > > > On Thu, May 9, 2024 at 8:12 AM Wenchen Fan wrote: > >> I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776 >> >> On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun >> wrote: >> >>> In addition, FYI, I

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
>> Could you file an INFRA JIRA issue with the error message and context >> first, Wenchen? >> >> As you know, if we see something, we had better file a JIRA issue because >> it could be not only an Apache Spark project issue but also all ASF project >> issues. >>

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Wenchen Fan
Thanks for starting the discussion! To add a bit more color, we should at least add a test job to make sure the release script can produce the packages correctly. Today it's kind of being manually tested by the release manager each time, which slows down the release process. It's better if we can

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
>>>>> >>>>> On Wed, May 8, 2024 at 00:54, Holden Karau < >>>>> holden.ka...@gmail.com> wrote: >>>>> >>>>>> Indeed. We could conceivably build the release in CI/CD but the final >>>>>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Wenchen Fan
wrote: > +1 > > > > *From**: *Jungtaek Lim > *Date**: *Thursday, May 2, 2024 10:21 > *To**: *Holden Karau > *Cc**: *Chao Sun , Xiao Li , > Tathagata Das , Wenchen Fan < > cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas < > nicholas.cham...@gmail.com>,

Re: ASF board report draft for May

2024-05-06 Thread Wenchen Fan
The preview release also needs a vote. I'll try my best to cut the RC on Monday, but the actual release may take some time. Hopefully, we can get it out this week but if the vote fails, it will take longer as we need more RCs. On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun wrote: > +1 for

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
park as a service from a >> provider >> 2. Providers/Operators: There are some users that provide spark as a >> service for their internal(on-prem setup with yarn/k8s)/external(Something >> like EMR) customers >> 3. ? >> >> Perhaps we need to consider

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
modate the second group of users. > > On 1 May 2024, at 06:08, Wenchen Fan wrote: > > Hi all, > > It's exciting to see innovations keep happening in the Spark community and > Spark keeps evolving itself. To make these innovations available to more > users, it's important to help

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Wenchen Fan
a Preview release, > the faster we can start getting feedback for fixing things for a great > Spark 4.0 final release. > > So I urge the community to produce a Spark 4.0 Preview soon even if > certain features targeting the Delta 4.0 release are still incomplete. > > Thanks! > >

[DISCUSS] clarify the definition of behavior changes

2024-04-30 Thread Wenchen Fan
Hi all, It's exciting to see innovations keep happening in the Spark community and Spark keeps evolving itself. To make these innovations available to more users, it's important to help users upgrade to newer Spark versions easily. We've done a good job on it: the PR template requires the author

Re: Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Wenchen Fan
Yes, Spark has a shim layer to support all Hive versions. It shouldn't be an issue as many users create native Spark data source tables already today, by explicitly putting the `USING` clause in the CREATE TABLE statement. On Wed, May 1, 2024 at 12:56 AM Mich Talebzadeh wrote: > @Wenchen
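
To make the point about the `USING` clause concrete, the two statement shapes can be compared side by side. This is a hedged sketch: the helper create_table_sql and the table and column names are hypothetical, and the code only builds SQL text rather than running it against Spark.

```python
def create_table_sql(name, columns, provider=None):
    """Build a CREATE TABLE statement, optionally with an explicit USING clause.

    Hypothetical helper for illustration only; it returns SQL text.
    """
    cols = ", ".join(f"{col} {col_type}" for col, col_type in columns)
    sql = f"CREATE TABLE {name} ({cols})"
    if provider is not None:
        # An explicit provider is what the message calls
        # "putting the USING clause in the CREATE TABLE statement".
        sql += f" USING {provider}"
    return sql


explicit = create_table_sql("events", [("id", "INT"), ("name", "STRING")], "parquet")
implicit = create_table_sql("events", [("id", "INT"), ("name", "STRING")])
print(explicit)
print(implicit)
```

Only the first form names the data source explicitly; what the second form creates depends on Spark's configured defaults, which is the subject of the SPARK-46122 threads listed below.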

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Wenchen Fan
ave a more reasonable default behavior: creating Parquet tables (or whatever is specified by `spark.sql.sources.default`). On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan wrote: > @Mich Talebzadeh there seems to be a > misunderstanding here. The Spark native data source table is still stor

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Wenchen Fan
024 at 2:08 AM Mich Talebzadeh > wrote: > >> >> Hi @Wenchen Fan >> >> Thanks for your response. I believe we have not had enough time to >> "DISCUSS" this matter. >> >> Currently in order to make Spark take advantage of Hive, I create a soft >

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-28 Thread Wenchen Fan
@Mich Talebzadeh thanks for sharing your concern! Note: creating Spark native data source tables is usually Hive compatible as well, unless we use features that Hive does not support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to create Spark native table in this case,

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
as with any advice, quote "one test result is worth one-thousand > expert opinions (Wernher von > Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > On Thu, 25 Apr 2024 at 11:17, Wenchen Fan wrote: > >

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
+1 On Thu, Apr 25, 2024 at 2:46 PM Kent Yao wrote: > +1 > > Nit: the umbrella ticket is SPARK-44111, not SPARK-4. > > Thanks, > Kent Yao > > Dongjoon Hyun wrote on Thu, Apr 25, 2024 at 14:39: > > > > Hi, All. > > > > It's great to see community activities to polish 4.0.0 more and more. > > Thank you

Re: [DISCUSS] Spark 4.0.0 release

2024-04-17 Thread Wenchen Fan
add more context on the plan for transformWithState) > > > > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan wrote: > > Hi all, > > > > It's close to the previously proposed 4.0.0 release date (June 2024), > and I think it's time to prepare for it and discuss

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Wenchen Fan
+1 On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun wrote: > I'll start with my +1. > > - Checked checksum and signature > - Checked Scala/Java/R/Python/SQL Document's Spark version > - Checked published Maven artifacts > - All CIs passed. > > Thanks, > Dongjoon. > > On 2024/04/15 04:22:26

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-14 Thread Wenchen Fan
+1 On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun wrote: > I'll start from my +1. > > Dongjoon. > > On 2024/04/13 22:22:05 Dongjoon Hyun wrote: > > Please vote on SPARK-44444 to use ANSI SQL mode by default. > > The technical scope is defined in the following PR which is > > one line of code

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread Wenchen Fan
+1, the existing "NULL on error" behavior is terrible for data quality. I have one concern about error reporting with DataFrame APIs. Query execution is lazy and where the error happens can be far away from where the dataframe/column was created. We are improving it (PR

[DISCUSS] Spark 4.0.0 release

2024-04-12 Thread Wenchen Fan
collation support - Spark k8s operator versioning Please help to add more items to this list that are missed here. I would like to volunteer as the release manager for Apache Spark 4.0.0 if there is no objection. Thank you all for the great work that fills Spark 4.0! Wenchen Fan

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Wenchen Fan
It's good to reduce duplication between different native accelerators of Spark, and AFAIK there is already a project trying to solve it: https://substrait.io/ I'm not sure why we need to do this inside Spark, instead of doing the unification for a wider scope (for all engines, not only Spark).

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Wenchen Fan
+1 On Mon, Mar 11, 2024 at 5:26 PM Hyukjin Kwon wrote: > +1 > > On Mon, 11 Mar 2024 at 18:11, yangjie01 > wrote: > >> +1 >> >> >> >> Jie Yang >> >> >> >> *From**: *Haejoon Lee >> *Date**: *Monday, March 11, 2024 17:09 >> *To**: *Gengliang Wang >> *Cc**: *dev >> *Subject**: *Re: [VOTE] SPIP: Structured

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-19 Thread Wenchen Fan
+1, thanks for making the release! On Sat, Feb 17, 2024 at 3:54 AM Sean Owen wrote: > Yeah let's get that fix in, but it seems to be a minor test only issue so > should not block release. > > On Fri, Feb 16, 2024, 9:30 AM yangjie01 wrote: > >> Very sorry. When I was fixing `SPARK-45242 ( >>

Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-10 Thread Wenchen Fan
+1 On Thu, Jan 11, 2024 at 9:32 AM L. C. Hsieh wrote: > +1 > > On Wed, Jan 10, 2024 at 9:06 AM Bhuwan Sahni > wrote: > >> +1. This is a good addition. >> >> >> *Bhuwan Sahni* >> Staff Software Engineer >> >> bhuwan.sa...@databricks.com >> 500 108th Ave. NE >>

Re: [DISCUSS] SPIP: Testing Framework for Spark UI Javascript files

2023-11-21 Thread Wenchen Fan
+1, very useful! On Wed, Nov 22, 2023 at 10:29 AM Dongjoon Hyun wrote: > Thank you for proposing a new UI test framework for Apache Spark 4.0. > > It looks very useful. > > Thanks, > Dongjoon. > > > On Tue, Nov 21, 2023 at 1:51 AM Kent Yao wrote: > >> Hi Spark Dev, >> >> This is a call to

Re: [VOTE] SPIP: State Data Source - Reader

2023-10-23 Thread Wenchen Fan
+1 On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim wrote: > Starting with my +1 (non-binding). Thanks! > > On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim > wrote: > >> Hi all, >> >> I'd like to start the vote for SPIP: State Data Source - Reader. >> >> The high level summary of the SPIP is that we

Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-03 Thread Wenchen Fan
Congrats! On Wed, Oct 4, 2023 at 8:25 AM Hyukjin Kwon wrote: > Woohoo! > > On Tue, 3 Oct 2023 at 22:47, Hussein Awala wrote: > >> Congrats to all of you! >> >> On Tue 3 Oct 2023 at 08:15, Rui Wang wrote: >> >>> Congratulations! Well deserved! >>> >>> -Rui >>> >>> >>> On Mon, Oct 2, 2023 at

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Wenchen Fan
+1 On Tue, Sep 12, 2023 at 9:00 AM Yuanjian Li wrote: > +1 (non-binding) > > Yuanjian Li wrote on Mon, Sep 11, 2023 at 09:36: > >> @Peter Toth I've looked into the details of this >> issue, and it appears that it's neither a regression in version 3.5.0 nor a >> correctness issue. It's a bug related to a

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-31 Thread Wenchen Fan
Sorry for the last-minute bug report, but we found a regression in 3.5: the SQL INSERT command without a column list fills missing columns with NULL while Spark 3.4 does not allow it. According to the SQL standard, this shouldn't be allowed, and thus it is a regression in 3.5. The fix has been merged

Re: Spark writing API

2023-08-17 Thread Wenchen Fan
o Wenchen, > > On Wed, Aug 16, 2023 at 23:33 Wenchen Fan wrote: > >> > is there a way to hint to the downstream users on the number of rows >> expected to write? >> >> It will be very hard to do. Spark pipelines the execution (within shuffle >> boundaries) a

Re: Spark writing API

2023-08-16 Thread Wenchen Fan
> is there a way to hint to the downstream users on the number of rows expected to write? It will be very hard to do. Spark pipelines the execution (within shuffle boundaries) and we can't predict the number of final output rows. On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran wrote: > > > On

Re: What else could be removed in Spark 4?

2023-08-07 Thread Wenchen Fan
I think the principle is we should remove things that block us from supporting new things like Java 21, or come with a significant maintenance cost. If there is no benefit to removing deprecated APIs (just to keep the codebase clean?), I'd prefer to leave them there and not bother. On Tue, Aug 8,

Welcome two new Apache Spark committers

2023-08-06 Thread Wenchen Fan
Hi all, The Spark PMC recently voted to add two new committers. Please join me in welcoming them to their new role! - Peter Toth (Spark SQL) - Xiduo You (Spark SQL) They consistently make contributions to the project and clearly showed their expertise. We are very excited to have them join as

Re: [DISCUSS] SPIP: Python Data Source API

2023-06-20 Thread Wenchen Fan
In an ideal world, every data source you want to connect to already has a Spark data source implementation (either v1 or v2), then this Python API is useless. But I feel it's common that people want to do quick data exploration, and the target data system is not popular enough to have an existing

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
DataFrame view stores the logical plan, while SQL view stores SQL text. I don't think we can support this feature until we have a reliable way to materialize logical plans. On Sun, Jun 4, 2023 at 10:31 PM Mich Talebzadeh wrote: > Try sending it to dev@spark.apache.org (and join that group) > >

Re: Apache Spark 3.4.1 Release?

2023-06-09 Thread Wenchen Fan
+1 On Fri, Jun 9, 2023 at 8:52 PM Xinrong Meng wrote: > +1. Thank you Dongjoon! > > Thanks, > > Xinrong Meng > > Mridul Muralidharan wrote on Fri, Jun 9, 2023 at 5:22 AM: > >> >> +1, thanks Dongjoon ! >> >> Regards, >> Mridul >> >> On Thu, Jun 8, 2023 at 7:16 PM Jia Fan >> wrote: >> >>> +1 >>> >>>

Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-11 Thread Wenchen Fan
+1 On Tue, Apr 11, 2023 at 9:57 AM Yuming Wang wrote: > +1. > > On Tue, Apr 11, 2023 at 9:14 AM Yikun Jiang wrote: > >> +1 (non-binding) >> >> Also ran the docker image related test (signatures/standalone/k8s) with >> rc7: https://github.com/apache/spark-docker/pull/32 >> >> Regards, >> Yikun

Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-11 Thread Wenchen Fan
+1 On Tue, Apr 11, 2023 at 10:09 AM Hyukjin Kwon wrote: > +1 > > On Tue, 11 Apr 2023 at 11:04, Ruifeng Zheng wrote: > >> +1 (non-binding) >> >> Thank you for driving this release! >> >> -- >> Ruifeng Zheng >> ruife...@foxmail.com >> >>

Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-04-03 Thread Wenchen Fan
Sorry for the last-minute change, but we found two wrong behaviors and want to fix them before the release: https://github.com/apache/spark/pull/40641 We missed a corner case when the input index for `array_insert` is 0. It should fail as 0 is an invalid index.

Re: Time for release v3.3.2

2023-01-31 Thread Wenchen Fan
+1, thanks! On Tue, Jan 31, 2023 at 3:17 PM Maxim Gekk wrote: > +1 > > On Tue, Jan 31, 2023 at 10:12 AM John Zhuge wrote: > >> +1 Thanks Liang-Chi for driving the release! >> >> On Mon, Jan 30, 2023 at 10:26 PM Yuming Wang wrote: >> >>> +1 >>> >>> On Tue, Jan 31, 2023 at 12:18 PM yangjie01

Re: [VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-12-01 Thread Wenchen Fan
+1 On Thu, Dec 1, 2022 at 12:31 PM Shixiong Zhu wrote: > +1 > > > On Wed, Nov 30, 2022 at 8:04 PM Hyukjin Kwon wrote: > >> +1 >> >> On Thu, 1 Dec 2022 at 12:39, Mridul Muralidharan >> wrote: >> >>> >>> +1 >>> >>> Regards, >>> Mridul >>> >>> On Wed, Nov 30, 2022 at 8:55 PM Xingbo Jiang >>>

Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Wenchen Fan
+1 to improve the widely used micro-batch mode first. On Thu, Dec 1, 2022 at 8:49 AM Hyukjin Kwon wrote: > +1 > > On Thu, 1 Dec 2022 at 08:10, Shixiong Zhu wrote: > >> +1 >> >> This is exciting. I agree with Jerry that this SPIP and continuous >> processing are orthogonal. This SPIP itself

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Wenchen Fan
Thanks, Chao! On Wed, Nov 30, 2022 at 1:33 AM Chao Sun wrote: > We are happy to announce the availability of Apache Spark 3.2.3! > > Spark 3.2.3 is a maintenance release containing stability fixes. This > release is based on the branch-3.2 maintenance branch of Spark. We strongly > recommend

Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Wenchen Fan
+1, I'm looking forward to it! On Thu, Nov 17, 2022 at 9:44 AM Ye Zhou wrote: > +1 (non-binding) > Thanks for proposing this improvement to SHS, it resolves the main > performance issue within SHS. > > On Wed, Nov 16, 2022 at 1:15 PM Jungtaek Lim > wrote: > >> +1 >> >> Nice to see the chance

Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-16 Thread Wenchen Fan
+1 On Thu, Nov 17, 2022 at 10:20 AM Yang,Jie(INF) wrote: > +1,non-binding > > > > The test combination of Java 11 + Scala 2.12 and Java 11 + Scala 2.13 has > passed. > > > > Yang Jie > > > > *From**: *Chris Nauroth > *Date**: *Thursday, November 17, 2022 04:27 > *To**: *Yuming Wang > *Cc**:

Re: [DISCUSS] SPIP: Better Spark UI scalability and Driver stability for large applications

2022-11-15 Thread Wenchen Fan
This looks great! UI stability/scalability has been a pain point for a long time. On Sat, Nov 12, 2022 at 5:24 AM Gengliang Wang wrote: > Hi Everyone, > > I want to discuss the "Better Spark UI scalability and Driver stability > for large applications" proposal. Please find the links below: > >

Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-18 Thread Wenchen Fan
+1 On Wed, Oct 19, 2022 at 4:59 AM Chao Sun wrote: > +1. Thanks Yuming! > > Chao > > On Tue, Oct 18, 2022 at 1:18 PM Thomas graves wrote: > > > > +1. Ran internal test suite. > > > > Tom > > > > On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang wrote: > > > > > > Please vote on releasing the

Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-19 Thread Wenchen Fan
+1 On Mon, Sep 19, 2022 at 2:59 PM Yang,Jie(INF) wrote: > +1 (non-binding) > > > > Yang Jie > -- > *From:* Yikun Jiang > *Sent:* September 19, 2022 14:23:14 > *To:* Denny Lee > *Cc:* bo zhaobo; Yuming Wang; Kent Yao; Gengliang Wang; Hyukjin Kwon; > dev; zrf > *Subject:* Re:

Re: Non-deterministic function duplicated in final Spark plan

2022-08-01 Thread Wenchen Fan
This is a hard one. Spark duplicates the join child plan if it's a self-join because Spark does not support diamond-shaped query plans. Seems the only option here is to write the join child plan to a parquet table (or using a shuffle) and read it back. On Mon, Aug 1, 2022 at 4:46 PM Enrico Minack
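
The effect described here can be illustrated without Spark. The following is a minimal sketch in plain Python, not Spark code: child_plan stands in for the join child, and a counter stands in for a non-deterministic function, so that every evaluation yields different values. All names are hypothetical.

```python
import itertools

# Stand-in for a non-deterministic expression: every call returns a new value.
_counter = itertools.count()


def nondeterministic_tag(row):
    return (row, next(_counter))


def child_plan(rows):
    # Evaluated from scratch every time the "plan" is executed.
    return [nondeterministic_tag(row) for row in rows]


rows = ["a", "b"]

# Duplicated child plan: each side of the self-join evaluates it independently,
# so the two sides see different generated values and nothing matches.
left = child_plan(rows)
right = child_plan(rows)
duplicated_matches = [item for item in left if item in right]

# Materialized child: evaluate once, then reuse the stored result on both sides
# (analogous to writing the child to a table and reading it back).
materialized = child_plan(rows)
materialized_matches = [item for item in materialized if item in materialized]

print(len(duplicated_matches), len(materialized_matches))
```

Evaluating the child once and reusing the stored result is the plain-Python analogue of the workaround suggested in the message.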

Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-14 Thread Wenchen Fan
+1 On Wed, Jul 13, 2022 at 7:29 PM Yikun Jiang wrote: > +1 (non-binding) > > Checked out tag and built from source on Linux aarch64 and ran some basic > test. > > > Regards, > Yikun > > > On Wed, Jul 13, 2022 at 5:54 AM Mridul Muralidharan > wrote: > >> >> +1 >> >> Signatures, digests, etc

Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-08 Thread Wenchen Fan
It's better to keep all APIs working. But in this case, I really have no idea how to make these 4 APIs reasonable. For example, tableExists(dbName: String, tableName: String) currently checks if table "dbName.tableName" exists in the Hive metastore, and does not work with v2 catalogs at all. It's

Re: Apache Spark 3.2.2 Release?

2022-07-06 Thread Wenchen Fan
+1 On Thu, Jul 7, 2022 at 10:41 AM Xinrong Meng wrote: > +1 > > Thanks! > > > Xinrong Meng > > Software Engineer > > Databricks > > > On Wed, Jul 6, 2022 at 7:25 PM Xiao Li wrote: > >> +1 >> >> Xiao >> >> Cheng Su wrote on Wed, Jul 6, 2022 at 19:16: >> >>> +1 (non-binding) >>> >>> Thanks, >>> Cheng Su >>>

Re: [VOTE][SPIP] Spark Connect

2022-06-14 Thread Wenchen Fan
+1 On Tue, Jun 14, 2022 at 9:38 AM Ruifeng Zheng wrote: > +1 > > > -- Original Message -- > *From:* "huaxin gao" ; > *Sent:* Tuesday, June 14, 2022 8:47 AM > *To:* "L. C. Hsieh"; > *Cc:* "Spark dev list"; > *Subject:* Re: [VOTE][SPIP] Spark Connect > > +1 > > On Mon, Jun 13, 2022 at 5:42

Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Wenchen Fan
+1, tests are all green and there are no more blocker issues AFAIK. On Fri, Jun 10, 2022 at 12:27 PM Maxim Gekk wrote: > Please vote on releasing the following candidate as > Apache Spark version 3.3.0. > > The vote is open until 11:59pm Pacific time June 14th and passes if a > majority +1 PMC

Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-19 Thread Wenchen Fan
I think it should have been fixed by https://github.com/apache/spark/commit/0fdb6757946e2a0991256a3b73c0c09d6e764eed . Maybe the fix is not completed... On Thu, May 19, 2022 at 2:16 PM Kent Yao wrote: > Thanks, Maxim. > > Leave my -1 for this release candidate. > > Unfortunately, I don't know

Re: Unable to create view due to up cast error when migrating from Hive to Spark

2022-05-18 Thread Wenchen Fan
A view is essentially a SQL query. It's fragile to share views between Spark and Hive because different systems have different SQL dialects. They may interpret the view SQL query differently and introduce unexpected behaviors. In this case, Spark returns decimal type for gender * 0.3 - 0.1 but

Re: SIGMOD System Award for Apache Spark

2022-05-13 Thread Wenchen Fan
Great! Congratulations to everyone! On Fri, May 13, 2022 at 10:38 AM Gengliang Wang wrote: > Congratulations to the whole spark community! > > On Fri, May 13, 2022 at 10:14 AM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Congrats Spark community! >> >> On Fri, May 13, 2022 at

Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-10 Thread Wenchen Fan
I'd like to see an RC2 as well. There is kind of a correctness bug fixed after RC1 is cut: https://github.com/apache/spark/pull/36468 Users may hit this bug much more frequently if they enable ANSI mode. It's not a regression so I'd vote -0. On Wed, May 11, 2022 at 5:24 AM Thomas graves wrote:

Re: PR builder not working now

2022-04-19 Thread Wenchen Fan
Thank you, Hyukjin! On Wed, Apr 20, 2022 at 7:48 AM Dongjoon Hyun wrote: > It's great! Thank you. :) > > On Tue, Apr 19, 2022 at 4:42 PM Hyukjin Kwon wrote: > >> It's fixed now. >> >> On Tue, 19 Apr 2022 at 08:33, Hyukjin Kwon wrote: >> >>> It's still persistent. I will send an email to

Re: bazel and external/

2022-03-21 Thread Wenchen Fan
How about renaming it to `connectors` if docker is the only exception and will be moved out? On Sat, Mar 19, 2022 at 6:18 PM Alkis Evlogimenos wrote: > It looks like renaming the directory and moving components can be separate > steps. If there is consensus that connectors will move out, should

Re: Apache Spark 3.3 Release

2022-03-21 Thread Wenchen Fan
Just checked the release calendar, the planned RC cut date is April: [image: image.png] Let's revisit after 2 weeks then? On Mon, Mar 21, 2022 at 2:47 PM Wenchen Fan wrote: > Shall we revisit this list after a week? Ideally, they should be either > merged or rejected for 3.3, so that we c

Re: Apache Spark 3.3 Release

2022-03-21 Thread Wenchen Fan
Shall we revisit this list after a week? Ideally, they should be either merged or rejected for 3.3, so that we can cut rc1. We can still discuss them case by case at that time if there are exceptions. On Sat, Mar 19, 2022 at 5:27 AM Dongjoon Hyun wrote: > Thank you for your summarization. > > I

Re: Apache Spark 3.3 Release

2022-03-16 Thread Wenchen Fan
+1 to define an allowlist of features that we want to backport to branch 3.3. I also have a few in mind: complex type support in vectorized parquet reader: https://github.com/apache/spark/pull/34659 ; refine the DS v2 filter API for JDBC v2: https://github.com/apache/spark/pull/35768 ; a few new SQL

Re: Data correctness issue with Repartition + FetchFailure

2022-03-16 Thread Wenchen Fan
ous fix for > repartition works for deterministic data. With non-deterministic data, I > didn't find an API to pass DeterministicLevel to underlying rdd. > Do you plan to continue work on integration with SQL operators? If not, > I'm available to take a stab. > > On Mon, Mar 14, 20

Re: Data correctness issue with Repartition + FetchFailure

2022-03-14 Thread Wenchen Fan
We fixed the repartition correctness bug before, by sorting the data before doing round-robin partitioning. But the issue is that we need to propagate the isDeterministic property through SQL operators. On Tue, Mar 15, 2022 at 1:50 AM Jason Xu wrote: > Hi Reynold, do you suggest removing
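
Why sorting helps can be shown with a toy model of round-robin partitioning. This is a simplified sketch in plain Python, not Spark's implementation: it assumes a retried task sees the same rows in a different arrival order, and round_robin is a hypothetical helper.

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions in arrival order: row i goes to partition i % n."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


# The same four rows, as seen by a first attempt and by a retry after a
# fetch failure; only the arrival order differs (an assumption of this sketch).
first_attempt = [3, 1, 2, 4]
retry_attempt = [1, 4, 3, 2]

# Without sorting, the assignment depends on arrival order, so a retry can
# send rows to different partitions than the first attempt did.
unsorted_differs = round_robin(first_attempt, 2) != round_robin(retry_attempt, 2)

# Sorting first makes the assignment a function of the data alone.
sorted_same = (
    round_robin(sorted(first_attempt), 2) == round_robin(sorted(retry_attempt), 2)
)

print(unsorted_differs, sorted_same)
```

Sorting only makes the assignment reproducible when the rows themselves are the same on retry. If the input is non-deterministic, the sort cannot help, which is why the message says the determinism property has to be propagated through SQL operators.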

Re: [VOTE] Spark 3.1.3 RC4

2022-02-15 Thread Wenchen Fan
+1 On Tue, Feb 15, 2022 at 3:59 PM Yuming Wang wrote: > +1 (non-binding). > > On Tue, Feb 15, 2022 at 10:22 AM Ruifeng Zheng > wrote: > >> +1 (non-binding) >> >> checked the release script issue Dongjoon mentioned: >> >> curl -s >>

Re: [VOTE] Spark 3.1.3 RC3

2022-02-07 Thread Wenchen Fan
Shall we use the release scripts of branch 3.1 to release 3.1? On Fri, Feb 4, 2022 at 4:57 AM Holden Karau wrote: > Good catch Dongjoon :) > > This release candidate fails, but feel free to keep testing for any other > potential blockers. > > I’ll roll RC4 next week with the older release

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-07 Thread Wenchen Fan
+1 (binding) On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee wrote: > +1 (non-binding). Thanks John! > It's great to see ViewCatalog moving on, it's a nice feature. > > Terry Kim wrote on Sat, Feb 5, 2022 at 11:57: > >> +1 (non-binding). Thanks John! >> >> Terry >> >> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-24 Thread Wenchen Fan
+1 On Tue, Jan 25, 2022 at 10:13 AM Ruifeng Zheng wrote: > +1 (non-binding) > > > -- Original Message -- > *From:* "Kent Yao" ; > *Sent:* Tuesday, January 25, 2022 10:09 AM > *To:* "John Zhuge"; > *Cc:* "dev"; > *Subject:* Re: [VOTE] Release Spark 3.2.1 (RC2) > > +1, non-binding > > John

Re: Difference in behavior for Spark 3.0 vs Spark 3.1 "create database "

2022-01-11 Thread Wenchen Fan
Hopefully, this StackOverflow answer can solve your problem: https://stackoverflow.com/questions/47523037/how-do-i-configure-pyspark-to-write-to-hdfs-by-default Spark doesn't control the behavior of qualifying paths. It's decided by certain configs and/or config files. On Tue, Jan 11, 2022 at

Re: Time for Spark 3.2.1?

2021-12-06 Thread Wenchen Fan
+1 to make new maintenance releases for all 3.x branches. On Tue, Dec 7, 2021 at 8:57 AM Sean Owen wrote: > Always fine by me if someone wants to roll a release. > > It's been ~6 months since the last 3.0.x and 3.1.x releases, too; a new > release of those wouldn't hurt either, if any of our

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Wenchen Fan
Thanks, Shane! Really appreciate it! Wenchen On Tue, Dec 7, 2021 at 12:38 PM Xiao Li wrote: > Hi, Shane, > > Thank you for your work on it! > > Xiao > > > > > On Mon, Dec 6, 2021 at 6:20 PM L. C. Hsieh wrote: > >> Thank you, Shane. >> >> On Mon, Dec 6, 2021 at 4:27 PM Holden Karau wrote: >>

Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Wenchen Fan
It's useful to have a SQL API to specify table options, similar to the DataFrameReader API. However, I share the same concern from @Hyukjin Kwon and am not very comfortable with using hints to do it. In the PR, someone mentioned TVF. I think it's better than hints, but still has problems. For

Re: [FYI] Build and run tests on Java 17 for Apache Spark 3.3

2021-11-16 Thread Wenchen Fan
Great job! On Sat, Nov 13, 2021 at 11:18 AM Hyukjin Kwon wrote: > Awesome! > > On Sat, Nov 13, 2021 at 12:04 PM Xiao Li wrote: > >> Thank you! Great job! >> >> Xiao >> >> >> On Fri, Nov 12, 2021 at 7:02 PM Mridul Muralidharan >> wrote: >> >>> >>> Nice job ! >>> There are some nice API's which

Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-16 Thread Wenchen Fan
+1 On Mon, Nov 15, 2021 at 2:54 AM John Zhuge wrote: > +1 (non-binding) > > On Sun, Nov 14, 2021 at 10:33 AM Chao Sun wrote: > >> +1 (non-binding). Thanks Anton for the work! >> >> On Sun, Nov 14, 2021 at 10:01 AM Ryan Blue wrote: >> >>> +1 >>> >>> Thanks to Anton for all this great work! >>>

Re: Issue Upgrading to 3.2

2021-11-01 Thread Wenchen Fan
Function registration: > Catalog.expressions.foreach(f => { > val functionIdentifier = > FunctionIdentifier(f.getClass.getSimpleName.dropRight(1)) > val expressionInfo = new ExpressionInfo( > f.getClass.getCanonicalName, > functionIdentifier.database.orNull,

Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-11-01 Thread Wenchen Fan
The general idea looks great. This is indeed a complicated API and we probably need more time to evaluate the API design. It's better to commit this work earlier so that we have more time to verify it before the 3.3 release. Maybe we can commit the group-based API first, then the delta-based one,

Re: Issue Upgrading to 3.2

2021-11-01 Thread Wenchen Fan
Hi Adam, Thanks for reporting this issue! Do you have the full stacktrace or a code snippet to reproduce the issue at Spark side? It looks like a bug, but it's not obvious to me how this bug can happen. Thanks, Wenchen On Sat, Oct 30, 2021 at 1:08 AM Adam Binford wrote: > Hi devs, > > I'm

Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-31 Thread Wenchen Fan
+1 On Sat, Oct 30, 2021 at 8:58 AM Cheng Su wrote: > +1 > > > > Thanks, > > Cheng Su > > > > *From: *Holden Karau > *Date: *Friday, October 29, 2021 at 12:41 PM > *To: *DB Tsai > *Cc: *Dongjoon Hyun , Ryan Blue , > dev , huaxin gao > *Subject: *Re: [VOTE] SPIP: Storage Partitioned Join for

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-28 Thread Wenchen Fan
> `BoundFunction` directly. That is easier than defining a way for Spark to > query the function catalog. > > In any case, I'm sure it's easy to understand how this works once you get > a concrete implementation. > > On Wed, Oct 27, 2021 at 9:35 AM Wenchen Fan wrote: > >> `B

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
et started because it fills > an existing gap. More complex use cases can be supported over time. > > Ryan > > On Wed, Oct 27, 2021 at 9:08 AM Wenchen Fan wrote: > >> IIUC, the general idea is to let each input split report its partition >> value, and Spark can

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
>> >> >> Another major benefit for having bucketed table, is to avoid shuffle >> before aggregate. Just want to bring to our attention that it would be >> great to consider aggregate as well when doing this proposal. >> >> >> >>1. Any major us

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Wenchen Fan
+1 to this SPIP and nice writeup of the design doc! Can we open comment permission in the doc so that we can discuss details there? On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon wrote: > Seems making sense to me. > > Would be great to have some feedback from people such as @We

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Wenchen Fan
Yea the file naming is a bit confusing, we can fix it in the next release. 3.2 actually means 3.2 or higher, so not a big deal I think. Congrats and thanks! On Wed, Oct 20, 2021 at 3:44 AM Jungtaek Lim wrote: > Thanks to Gengliang for driving this huge release! > > On Wed, Oct 20, 2021 at 1:50

Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-10 Thread Wenchen Fan
+1 On Sat, Oct 9, 2021 at 2:36 PM angers zhu wrote: > +1 (non-binding) > > Cheng Pan wrote on Sat, Oct 9, 2021 at 2:06 PM: > >> +1 (non-binding) >> >> Integration test passed[1] with my project[2]. >> >> [1] >> https://github.com/housepower/spark-clickhouse-connector/runs/3834335017 >> [2]

Re: [SQL] When SQLConf vals gets own accessor defs?

2021-09-06 Thread Wenchen Fan
I think SQLConf doesn't need defs anymore. In the beginning, SQLConf lived in sql/core, so we have to add defs if the code in sql/catalyst needs to access configs. Now SQLConf is in sql/catalyst (this was done a few years ago), defs are only needed if we have some special logic that is not just

Re: [SQL] s.s.a.coalescePartitions.parallelismFirst true but recommends false

2021-09-06 Thread Wenchen Fan
This is correct. It's true by default so that AQE doesn't have performance regression. If you run a benchmark, larger parallelism usually means better performance. However, it's recommended to set it to false, so that AQE can give better resource utilization, which is good for a busy Spark

Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-26 Thread Wenchen Fan
OK, then I'd vote for TableViewCatalog, because 1. This is how Hive catalog works, and we need to migrate Hive catalog to the v2 API sooner or later. 2. Because of 1, TableViewCatalog is easy to support in the current table/view resolution framework. 3. It's better to avoid name conflicts between

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-25 Thread Wenchen Fan
what does a > repartition() call do if AQE is not enabled? this is essentially a new api > so would repartitionBySize or something be less confusing to users who > already use repartition(num_partitions). > > Tom > > On Monday, May 24, 2021, 12:30:20 PM CDT, Wenchen Fan > wrote: >

Re: Bridging gap between Spark UI and Code

2021-05-25 Thread Wenchen Fan
You can see the SQL plan node name in the DAG visualization. Please refer to https://spark.apache.org/docs/latest/web-ui.html for more details. If you still have any confusion, please let us know and we will keep improving the document. On Tue, May 25, 2021 at 4:41 AM mhawes wrote: > @Wenc

Re: SPIP: Catalog API for view metadata

2021-05-24 Thread Wenchen Fan
hen it happens well >> before table resolution. And, View and Table are very different objects; >> returning Object from this API doesn't make much sense. >> >> One extra RPC is not unreasonable, and the choice should be left to >> sources. That's the easiest place

Re: About Spark executs sqlscript

2021-05-24 Thread Wenchen Fan
It's not possible to load everything into memory. We should use a big query connector (should be existing already?) and register tables B and C as temp views in Spark. On Fri, May 14, 2021 at 8:50 AM bo zhao wrote: > Hi Team, > > I've followed Spark community for several years. This is my first

Re: Purpose of OffsetHolder as a LeafNode?

2021-05-24 Thread Wenchen Fan
It's just an immediate place holder to update the query plan in each micro-batch. On Sat, May 15, 2021 at 10:29 PM Jacek Laskowski wrote: > Hi, > > Just stumbled upon OffsetHolder [1] and am curious why it's a LeafNode? > What logical plan could it be part of? > > [1] >
