Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Spark will read data written with v2 encodings just fine. You just don't need to worry about making Spark produce v2. And you should probably also not produce v2 encodings from other systems. On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo wrote: > oops but so spark does not support parquet V2 atm

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Oops, so Spark does not support Parquet V2 at the moment? We have a use case where we need Parquet V2, as one of our components uses Parquet V2. On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue wrote: > Hi Prem, > > Parquet v1 is the default because v2 has not been finalized and adopted by > the

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Hi Prem, Parquet v1 is the default because v2 has not been finalized and adopted by the community. I highly recommend not using v2 encodings at this time. Ryan On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo wrote: > I am using spark 3.2.0 . but my spark package comes with parquet-mr 1.2.1 > which
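For anyone who still needs to experiment despite the advice above, the switch lives in parquet-mr rather than in Spark's own options: parquet-mr reads the Hadoop configuration key `parquet.writer.version` (values `PARQUET_1_0` / `PARQUET_2_0`). A hedged sketch of passing it through `spark-submit` (`my_job.py` is a placeholder):

```shell
# Sketch only: v1 remains the recommended default, per this thread.
# The spark.hadoop. prefix copies the key into the Hadoop Configuration
# that parquet-mr consults when choosing v1 vs v2 pages.
spark-submit \
  --conf spark.hadoop.parquet.writer.version=PARQUET_2_0 \
  my_job.py
```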

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
I am using Spark 3.2.0, but my Spark package comes with parquet-mr 1.2.1, which writes Parquet version 1, not version 2 :(. So I was looking into how to write Parquet version 2. On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh wrote: > Sorry you have a point there. It was released in

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Sorry, you have a point there. It was released in version 3.0.0. What version of Spark are you using? Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Thank you so much for the info! But do we have any release notes where it says Spark 2.4.0 onwards supports Parquet version 2? I was under the impression it was supported from Spark 3.0 onwards. On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh wrote: > Well if I am correct, Parquet version 2

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Well if I am correct, Parquet version 2 support was introduced in Spark version 2.4.0. Therefore, any version of Spark starting from 2.4.0 supports Parquet version 2. Assuming that you are using Spark version 2.4.0 or later, you should be able to take advantage of Parquet version 2 features. HTH

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Thank you for the information! I can use any version of parquet-mr to produce a Parquet file. Regarding the 2nd question: which version of Spark supports Parquet version 2? May I get the release notes where Parquet versions are mentioned? On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh wrote:

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Josh Rosen
+1 On Mon, Apr 15, 2024 at 11:26 AM Maciej wrote: > +1 > > Best regards, > Maciej Szymkiewicz > > Web: https://zero323.net > PGP: A30CEF0C31A501EC > > On 4/15/24 8:16 PM, Rui Wang wrote: > > +1, non-binding. > > Thanks Dongjoon for driving this! > > > -Rui > > On Mon, Apr 15, 2024 at 10:10 AM

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Parquet-mr is a Java library that provides functionality for working with Parquet files with Hadoop. It is therefore more geared towards working with Parquet files within the Hadoop ecosystem, particularly using MapReduce jobs. There is no definitive way to check exact compatible versions within

Re: Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Takuya UESHIN
+1 On Mon, Apr 15, 2024 at 11:17 AM Rui Wang wrote: > +1, non-binding. > > Thanks Dongjoon for driving this! > > > -Rui > > On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng wrote: > >> +1 >> >> Thank you @Dongjoon Hyun ! >> >> On Mon, Apr 15, 2024 at 6:33 AM beliefer wrote: >> >>> +1 >>> >>> >>> At

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Maciej
+1 Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC On 4/15/24 8:16 PM, Rui Wang wrote: +1, non-binding. Thanks Dongjoon for driving this! -Rui On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng wrote: +1 Thank you @Dongjoon Hyun

Re: Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Rui Wang
+1, non-binding. Thanks Dongjoon for driving this! -Rui On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng wrote: > +1 > > Thank you @Dongjoon Hyun ! > > On Mon, Apr 15, 2024 at 6:33 AM beliefer wrote: > >> +1 >> >> >> At 2024-04-15 15:54:07, "Peter Toth" wrote: >> >> +1 >> >> Wenchen Fan wrote (on Mon, Apr 15, 2024, 9:08):

Re: Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Xinrong Meng
+1 Thank you @Dongjoon Hyun ! On Mon, Apr 15, 2024 at 6:33 AM beliefer wrote: > +1 > > > At 2024-04-15 15:54:07, "Peter Toth" wrote: > > +1 > > Wenchen Fan wrote (on Mon, Apr 15, 2024, 9:08): > >> +1 >> >> On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun >> wrote: >> >>> I'll start from

Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Hello Team, May I know how to check which version of Parquet is supported by parquet-mr 1.2.1? Which version of parquet-mr supports Parquet version 2 (V2)? Which version of Spark supports Parquet version 2? May I get the release notes where Parquet versions are mentioned?

Re:Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread beliefer
+1 At 2024-04-15 15:54:07, "Peter Toth" wrote: +1 Wenchen Fan wrote (on Mon, Apr 15, 2024, 9:08): +1 On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun wrote: I'll start from my +1. Dongjoon. On 2024/04/13 22:22:05 Dongjoon Hyun wrote: > Please vote on SPARK-44444 to use ANSI SQL

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Peter Toth
+1 Wenchen Fan wrote (on Mon, Apr 15, 2024, 9:08): > +1 > > On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun wrote: > >> I'll start from my +1. >> >> Dongjoon. >> >> On 2024/04/13 22:22:05 Dongjoon Hyun wrote: >> > Please vote on SPARK-44444 to use ANSI SQL mode by default. >> > The

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Cheng Pan
+1, non-binding Thanks, Cheng Pan > On Apr 15, 2024, at 14:14, John Zhuge wrote: > > +1 (non-binding) > > On Sun, Apr 14, 2024 at 7:18 PM Jungtaek Lim > wrote: > +1 (non-binding), thanks Dongjoon. > > On Sun, Apr 14, 2024 at 7:22 AM Dongjoon Hyun wrote: > Please vote on SPARK-44444 to

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread John Zhuge
+1 (non-binding) On Sun, Apr 14, 2024 at 7:18 PM Jungtaek Lim wrote: > +1 (non-binding), thanks Dongjoon. > > On Sun, Apr 14, 2024 at 7:22 AM Dongjoon Hyun > wrote: > >> Please vote on SPARK-44444 to use ANSI SQL mode by default. >> The technical scope is defined in the following PR which is

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-14 Thread Dongjoon Hyun
I'll start with my +1. - Checked checksum and signature - Checked Scala/Java/R/Python/SQL Document's Spark version - Checked published Maven artifacts - All CIs passed. Thanks, Dongjoon. On 2024/04/15 04:22:26 Dongjoon Hyun wrote: > Please vote on releasing the following candidate as Apache

[VOTE] Release Spark 3.4.3 (RC2)

2024-04-14 Thread Dongjoon Hyun
Please vote on releasing the following candidate as Apache Spark version 3.4.3. The vote is open until April 18th 1AM (PDT) and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.4.3 [ ] -1 Do not release this package because

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-14 Thread Jungtaek Lim
+1 (non-binding), thanks Dongjoon. On Sun, Apr 14, 2024 at 7:22 AM Dongjoon Hyun wrote: > Please vote on SPARK-44444 to use ANSI SQL mode by default. > The technical scope is defined in the following PR which is > one line of code change and one line of migration guide. > > - DISCUSSION: >

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-14 Thread Wenchen Fan
+1 On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun wrote: > I'll start from my +1. > > Dongjoon. > > On 2024/04/13 22:22:05 Dongjoon Hyun wrote: > > Please vote on SPARK-44444 to use ANSI SQL mode by default. > > The technical scope is defined in the following PR which is > > one line of code

Re: [DISCUSS] Spark 4.0.0 release

2024-04-14 Thread Jungtaek Lim
W.r.t. state data source - reader (SPARK-45511 ), there are several follow-up tickets, but we don't plan to address them soon. The current implementation is the final shape for Spark 4.0.0, unless there are demands on the follow-up tickets. We

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-14 Thread yangjie01
+1 for me Jie Yang From: Mich Talebzadeh Date: Sunday, April 14, 2024, 15:41 To: Dongjoon Hyun , Spark dev list Subject: Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default +1 for me It makes it more compatible with the other ANSI SQL compliant products. Mich Talebzadeh, Technologist | Solutions

Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-14 Thread Hussein Awala
+1 (non-binding) to using an independent version for the Spark Kubernetes Operator with a compatibility matrix with Spark versions. On Fri, Apr 12, 2024 at 5:31 AM L. C. Hsieh wrote: > Hi all, > > Thanks for all discussions in the thread of "Versioning of Spark > Operator": >

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-14 Thread Christiano Anderson
+1 On 14/04/2024 00:22, Dongjoon Hyun wrote: Please vote on SPARK-44444 to use ANSI SQL mode by default. The technical scope is defined in the following PR which is one line of code change and one line of migration guide. - DISCUSSION:

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-14 Thread Mich Talebzadeh
+1 for me. It makes it more compatible with the other ANSI SQL compliant products. Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Xiao Li
+1 On Sat, Apr 13, 2024 at 17:21 huaxin gao wrote: > +1 > > On Sat, Apr 13, 2024 at 4:36 PM L. C. Hsieh wrote: > >> +1 >> >> On Sat, Apr 13, 2024 at 4:12 PM Hyukjin Kwon >> wrote: >> > >> > +1 >> > >> > On Sun, Apr 14, 2024 at 7:46 AM Chao Sun wrote: >> >> >> >> +1. >> >> >> >> This feature

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Denny Lee
+1 (non-binding) On Sat, Apr 13, 2024 at 7:49 PM huaxin gao wrote: > +1 > > On Sat, Apr 13, 2024 at 4:36 PM L. C. Hsieh wrote: > >> +1 >> >> On Sat, Apr 13, 2024 at 4:12 PM Hyukjin Kwon >> wrote: >> > >> > +1 >> > >> > On Sun, Apr 14, 2024 at 7:46 AM Chao Sun wrote: >> >> >> >> +1. >> >> >>

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread huaxin gao
+1 On Sat, Apr 13, 2024 at 4:36 PM L. C. Hsieh wrote: > +1 > > On Sat, Apr 13, 2024 at 4:12 PM Hyukjin Kwon wrote: > > > > +1 > > > > On Sun, Apr 14, 2024 at 7:46 AM Chao Sun wrote: > >> > >> +1. > >> > >> This feature is very helpful for guarding against correctness issues, > such as null

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Holden Karau
+1 -- even if it's not perfect, now is the time to change default values. On Sat, Apr 13, 2024 at 4:11 PM Hyukjin Kwon wrote: > +1 > > On Sun, Apr 14, 2024 at 7:46 AM Chao Sun wrote: > >> +1. >> >> This feature is very helpful for guarding against correctness issues, >> such as null results due

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread L. C. Hsieh
+1 On Sat, Apr 13, 2024 at 4:12 PM Hyukjin Kwon wrote: > > +1 > > On Sun, Apr 14, 2024 at 7:46 AM Chao Sun wrote: >> >> +1. >> >> This feature is very helpful for guarding against correctness issues, such >> as null results due to invalid input or math overflows. It’s been there for >> a

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Hyukjin Kwon
+1 On Sun, Apr 14, 2024 at 7:46 AM Chao Sun wrote: > +1. > > This feature is very helpful for guarding against correctness issues, such > as null results due to invalid input or math overflows. It’s been there for > a while now and it’s a good time to enable it by default as Spark enters > the

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Chao Sun
+1. This feature is very helpful for guarding against correctness issues, such as null results due to invalid input or math overflows. It’s been there for a while now and it’s a good time to enable it by default as Spark enters the next major release. On Sat, Apr 13, 2024 at 3:27 PM Dongjoon
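The behavior difference being voted on can be sketched in Spark SQL (a hedged illustration: `spark.sql.ansi.enabled` is the documented flag, and the error names are those used by recent Spark releases):

```sql
-- With ANSI mode off (the long-standing default), invalid operations
-- silently return NULL:
SET spark.sql.ansi.enabled = false;
SELECT 1 / 0;               -- NULL
SELECT CAST('abc' AS INT);  -- NULL

-- With ANSI mode on, the same queries fail at runtime instead,
-- surfacing the data-quality problem:
SET spark.sql.ansi.enabled = true;
SELECT 1 / 0;               -- raises DIVIDE_BY_ZERO
SELECT CAST('abc' AS INT);  -- raises CAST_INVALID_INPUT
```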

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Gengliang Wang
+1 On Sat, Apr 13, 2024 at 3:26 PM Dongjoon Hyun wrote: > I'll start from my +1. > > Dongjoon. > > On 2024/04/13 22:22:05 Dongjoon Hyun wrote: > > Please vote on SPARK-44444 to use ANSI SQL mode by default. > > The technical scope is defined in the following PR which is > > one line of code

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Dongjoon Hyun
I'll start from my +1. Dongjoon. On 2024/04/13 22:22:05 Dongjoon Hyun wrote: > Please vote on SPARK-44444 to use ANSI SQL mode by default. > The technical scope is defined in the following PR which is > one line of code change and one line of migration guide. > > - DISCUSSION: >

[VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Dongjoon Hyun
Please vote on SPARK-44444 to use ANSI SQL mode by default. The technical scope is defined in the following PR which is one line of code change and one line of migration guide. - DISCUSSION: https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz - JIRA:

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Dongjoon Hyun
Thank you for your opinions, Gengliang, Liang-Chi, Wenchen, Huaxin, Serge, Nicholas. To Nicholas, the Apache Spark community already decided not to pursue the PostgreSQL dialect. > I’m flagging this since Spark’s behavior differs in these cases from > Postgres, > as described in the ticket. Please

Support Avro rolling version upgrades using schema manager

2024-04-13 Thread Nimrod Ofek
Hi, Currently, Avro records are supported in Spark - but with the limitation that we must specify the input and output schema versions. For writing out an avro record that is fine - but for reading avro records, that is usually a problem since there are upgrades and changes - and the current

Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread Chao Sun
+1 On Fri, Apr 12, 2024 at 4:23 PM Xiao Li wrote: > +1 > > > > > On Fri, Apr 12, 2024 at 14:30 bo yang wrote: > >> +1 >> > >> On Fri, Apr 12, 2024 at 12:34 PM huaxin gao >> wrote: >> >>> +1 >>> >>> On Fri, Apr 12, 2024 at 9:07 AM Dongjoon Hyun >>> wrote: >>> +1 Thank you!

Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread Xiao Li
+1 On Fri, Apr 12, 2024 at 14:30 bo yang wrote: > +1 > > On Fri, Apr 12, 2024 at 12:34 PM huaxin gao > wrote: > >> +1 >> >> On Fri, Apr 12, 2024 at 9:07 AM Dongjoon Hyun >> wrote: >> >>> +1 >>> >>> Thank you! >>> >>> I hope we can customize `dev/merge_spark_pr.py` script per repository >>>

Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread bo yang
+1 On Fri, Apr 12, 2024 at 12:34 PM huaxin gao wrote: > +1 > > On Fri, Apr 12, 2024 at 9:07 AM Dongjoon Hyun wrote: > >> +1 >> >> Thank you! >> >> I hope we can customize `dev/merge_spark_pr.py` script per repository >> after this PR. >> >> Dongjoon. >> >> On 2024/04/12 03:28:36 "L. C. Hsieh"

Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread L. C. Hsieh
+1 Thank you, Dongjoon. Yea, We may need to customize the merge script for a particular repository. On Fri, Apr 12, 2024 at 9:07 AM Dongjoon Hyun wrote: > > +1 > > Thank you! > > I hope we can customize `dev/merge_spark_pr.py` script per repository after > this PR. > > Dongjoon. > > On

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread Nicholas Chammas
This is a side issue, but I’d like to bring people’s attention to SPARK-28024. Cases 2, 3, and 4 described in that ticket are still problems today on master (I just rechecked) even with ANSI mode enabled. Well, maybe not problems, but I’m flagging this since Spark’s behavior differs in these

Re: [DISCUSS] Spark 4.0.0 release

2024-04-12 Thread Dongjoon Hyun
Thank you for volunteering, Wenchen. Dongjoon. On 2024/04/12 15:11:04 Wenchen Fan wrote: > Hi all, > > It's close to the previously proposed 4.0.0 release date (June 2024), and I > think it's time to prepare for it and discuss the ongoing projects: > >- ANSI by default >- Spark Connect

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread serge rielau . com
+1, it's the wrapping on math overflows that does it for me. Sent from my iPhone On Apr 12, 2024, at 9:36 AM, huaxin gao wrote: +1 On Thu, Apr 11, 2024 at 11:18 PM L. C. Hsieh wrote: +1 I believe ANSI mode is well developed after many releases. No doubt it could

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread huaxin gao
+1 On Thu, Apr 11, 2024 at 11:18 PM L. C. Hsieh wrote: > +1 > > I believe ANSI mode is well developed after many releases. No doubt it > could be used. > Since it is very easy to disable it to restore to current behavior, I > guess the impact could be limited. > Do we have known the possible

Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread huaxin gao
+1 On Fri, Apr 12, 2024 at 9:07 AM Dongjoon Hyun wrote: > +1 > > Thank you! > > I hope we can customize `dev/merge_spark_pr.py` script per repository > after this PR. > > Dongjoon. > > On 2024/04/12 03:28:36 "L. C. Hsieh" wrote: > > Hi all, > > > > Thanks for all discussions in the thread of

Re: [VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-12 Thread Dongjoon Hyun
+1 Thank you! I hope we can customize `dev/merge_spark_pr.py` script per repository after this PR. Dongjoon. On 2024/04/12 03:28:36 "L. C. Hsieh" wrote: > Hi all, > > Thanks for all discussions in the thread of "Versioning of Spark > Operator":

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread Wenchen Fan
+1, the existing "NULL on error" behavior is terrible for data quality. I have one concern about error reporting with DataFrame APIs. Query execution is lazy and where the error happens can be far away from where the dataframe/column was created. We are improving it (PR

[DISCUSS] Spark 4.0.0 release

2024-04-12 Thread Wenchen Fan
Hi all, It's close to the previously proposed 4.0.0 release date (June 2024), and I think it's time to prepare for it and discuss the ongoing projects: - ANSI by default - Spark Connect GA - Structured Logging - Streaming state store data source - new data type VARIANT - STRING

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread L. C. Hsieh
+1 I believe ANSI mode is well developed after many releases. No doubt it could be used. Since it is very easy to disable it to restore the current behavior, I guess the impact could be limited. Do we know the possible impacts, such as what the major changes are (e.g., what kind of

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-11 Thread Gengliang Wang
+1, enabling Spark's ANSI SQL mode in version 4.0 will significantly enhance data quality and integrity. I fully support this initiative. > In other words, the current Spark ANSI SQL implementation becomes the first implementation for Spark SQL users to face at first while providing

Re: [PySpark]: DataFrameWriterV2.overwrite fails with spark connect

2024-04-11 Thread Ruifeng Zheng
Toki Takahashi, Thanks for reporting this, I created https://issues.apache.org/jira/browse/SPARK-47828 to track this bug. I will take a look. On Thu, Apr 11, 2024 at 10:11 PM Toki Takahashi wrote: > Hi Community, > > I get the following error when using Spark Connect in PySpark 3.5.1 > and

[VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-11 Thread L. C. Hsieh
Hi all, Thanks for all discussions in the thread of "Versioning of Spark Operator": https://lists.apache.org/thread/zhc7nb2sxm8jjxdppq8qjcmlf4rcsthh I would like to create this vote to get the consensus for versioning of the Spark Kubernetes Operator. The proposal is to use an independent

[DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-11 Thread Dongjoon Hyun
Hi, All. Thanks to you, we've been achieving many things and have on-going SPIPs. I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more narrowly by asking your opinions about Apache Spark's ANSI SQL mode. https://issues.apache.org/jira/browse/SPARK-44111 Prepare Apache Spark

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-04-11 Thread Jungtaek Lim
I'm still having a hard time reviewing this. I have been handling a bunch of other context right now, and the change is non-trivial to review in parallel. I see people were OK with the algorithm at a high level, but from a code perspective it's not easy to understand without knowledge of DRA. It would take

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space on nodes. https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage Local Storage Spark supports using volumes to spill
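The linked docs boil down to mounting a volume whose name begins with `spark-local-dir-`, which Spark then uses as local scratch space for spilling. A hedged `spark-submit` sketch (volume name and host paths are made-up placeholders):

```shell
# Sketch of the documented spark.kubernetes.*.volumes.* pattern:
# a volume named spark-local-dir-* is used by executors for shuffle spill.
spark-submit \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-spill.mount.path=/data/spill \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-spill.options.path=/mnt/fast-disk \
  my_job.py
```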

[PySpark]: DataFrameWriterV2.overwrite fails with spark connect

2024-04-11 Thread Toki Takahashi
Hi Community, I get the following error when using Spark Connect in PySpark 3.5.1 and writing with DataFrameWriterV2.overwrite. ``` > df.writeTo('db.table').overwrite(F.col('id')==F.lit(1)) ... SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) Expression with ID:

Re: [External] Re: Versioning of Spark Operator

2024-04-11 Thread Ofir Manor
A related question - what is the expected release cadence? At least for the next 12-18 months? Since this is a new subproject, I am personally hoping it would have a faster cadence at first, maybe once a month or once every couple of months... If so, that would affect versioning. Also, if it

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
" In the end for my usecase I started using pvcs and pvc aware scheduling along with decommissioning. So far performance is good with this choice." How did you do this? On Thu, Apr 11, 2024 at 04:13, Arun Ravi wrote: > Hi Everyone, > > I had explored IBM's and AWS's S3 shuffle plugins (some

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Holden Karau
On Wed, Apr 10, 2024 at 9:54 PM Binwei Yang wrote: > > Gluten currently already support Velox backend and Clickhouse backend. > data fusion support is also proposed but no one worked on it. > > Gluten isn't a POC. It's under actively developing but some companies > already used it. > > > On

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Binwei Yang
Gluten currently already supports the Velox backend and the ClickHouse backend. DataFusion support has also been proposed, but no one has worked on it. Gluten isn't a POC. It's under active development, but some companies already use it. On 2024/04/11 03:32:01 Dongjoon Hyun wrote: > I'm interested in your

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Binwei Yang
The Gluten Java part is pretty stable now. The development is more in the C++ code, the Velox code, as well as the ClickHouse backend. The SPIP doesn't plan to introduce the whole Gluten stack into Spark, but rather the way to serialize a Spark physical plan and be able to send it to a native backend, through JNI or gRPC.

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Binwei Yang
We (the Gluten and Arrow guys) actually did plan to put the plan conversion in the substrait-java repo. But to me it makes more sense to put it as part of the Spark repo. Native library and accelerator support will be more and more important in the future. On 2024/04/10 08:29:08 Wenchen Fan wrote: >

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Dongjoon Hyun
I'm interested in your claim. Could you elaborate or provide some evidence for your claim, *a door for all native libraries*, Binwei? For example, is there any POC for that claim? Maybe, did I miss something in that SPIP? Dongjoon. On Wed, Apr 10, 2024 at 8:19 PM Binwei Yang wrote: > > The

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Binwei Yang
The SPIP is not for the current Gluten, but opens a door for all native libraries and accelerator support. On 2024/04/11 00:27:43 Weiting Chen wrote: > Yes, the 1st Apache release(v1.2.0) for Gluten will be in September. > For Spark version support, currently Gluten v1.1.1 support Spark3.2 and

Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
Hi Everyone, I had explored IBM's and AWS's S3 shuffle plugins (some time back), and I had also explored AWS FSx for Lustre in a few of my production jobs, which have ~20TB of shuffle operations with 200-300 executors. What I have observed is that S3 and FSx behaviour was fine during the write phase, however I

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Weiting Chen
Yes, the 1st Apache release (v1.2.0) for Gluten will be in September. For Spark version support, Gluten v1.1.1 currently supports Spark 3.2 and 3.3. We are planning to support Spark 3.4 and 3.5 in Gluten v1.2.0. Spark 4.0 support for Gluten depends on the release schedule in the Spark community. On

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread L. C. Hsieh
+1 for Wenchen's point. I don't see a strong reason to pull these transformations into Spark instead of keeping them in third party packages/projects. On Wed, Apr 10, 2024 at 5:32 AM Wenchen Fan wrote: > > It's good to reduce duplication between different native accelerators of > Spark, and

Re: Versioning of Spark Operator

2024-04-10 Thread L. C. Hsieh
This approach makes sense to me. If Spark K8s operator is aligned with Spark versions, for example, it uses 4.0.0 now. Because these JIRA tickets are not actually targeting Spark 4.0.0, it will cause confusion and more questions, like when we are going to cut Spark release, should we include

Re: Versioning of Spark Operator

2024-04-10 Thread bo yang
Cool, looks like we have two options here. Option 1: Spark Operator and Connect Go Client versioning independent of Spark, e.g. starting with 0.1.0. Pros: they can evolve versions independently. Cons: people will need an extra step to decide the version when using Spark Operator and Connect Go

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Mich Talebzadeh
I read the SPIP. I have a number of points, if I may - Maturity of Gluten: as the excerpt mentions, Gluten is a young project, and its feature set and stability IMO are still under development. Integrating a non-core component could introduce risks if it is not fully mature - Complexity: integrating

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Wenchen Fan
It's good to reduce duplication between different native accelerators of Spark, and AFAIK there is already a project trying to solve it: https://substrait.io/ I'm not sure why we need to do this inside Spark, instead of doing the unification for a wider scope (for all engines, not only Spark).

Re: Versioning of Spark Operator

2024-04-10 Thread Dongjoon Hyun
Ya, that would work. Inevitably, I looked at Apache Flink K8s Operator's JIRA and GitHub repo. It looks reasonable to me. Although they share the same JIRA, they choose different patterns per place. 1. In POM file and Maven Artifact, independent version number. 1.8.0 2. Tag is also based on

Re: Versioning of Spark Operator

2024-04-10 Thread L. C. Hsieh
Yea, I guess, for example, the first release of Spark K8s Operator would be something like 0.1.0 instead of 4.0.0. It sounds hard to align with Spark versions because of that? On Tue, Apr 9, 2024 at 10:15 AM Dongjoon Hyun wrote: > > Ya, that's simple and possible. > > However, it may cause

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Holden Karau
I like the idea of improving the flexibility of Spark's physical plans and really anything that might reduce code duplication among the ~4 or so different accelerators. Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9

Re: Versioning of Spark Operator

2024-04-09 Thread L. C. Hsieh
For Spark Operator, I think the answer is yes. According to my impression, Spark Operator should be Spark version-agnostic. Zhou, please correct me if I'm wrong. I am not sure about the Spark Connector Go client, but if it is going to talk with Spark cluster, I guess it should be still related to

Re: Versioning of Spark Operator

2024-04-09 Thread Dongjoon Hyun
Do we have a compatibility matrix of Apache Connect Go client already, Bo? Specifically, I'm wondering which versions the existing Apache Spark Connect Go repository is able to support as of now. We know that it is supposed to be compatible always, but do we have a way to verify that actually

Re: Versioning of Spark Operator

2024-04-09 Thread bo yang
Thanks Liang-Chi for the Spark Operator work, and also the discussion here! For the Spark Operator and Connect Go Client, I am guessing they need to support multiple versions of Spark? e.g. the same Spark Operator may support running multiple versions of Spark, and the Connect Go Client might support

Re: Versioning of Spark Operator

2024-04-09 Thread Dongjoon Hyun
Ya, that's simple and possible. However, it may cause many confusions because it implies that new `Spark K8s Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic Versioning` policy like Apache Spark 4.0.0. In addition, `Versioning` is directly related to the Release Cadence.

Re: Versioning of Spark Operator

2024-04-09 Thread DB Tsai
Aligning with Spark releases is sensible, as it allows us to guarantee that the Spark operator functions correctly with the new version while also maintaining support for previous versions. DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1 > On Apr 9, 2024, at 9:45 AM, Mridul

Re: Versioning of Spark Operator

2024-04-09 Thread Mridul Muralidharan
I am trying to understand if we can simply align with Spark's version for this ? Makes the release and jira management much more simpler for developers and intuitive for users. Regards, Mridul On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun wrote: > Hi, Liang-Chi. > > Thank you for leading

Re: Versioning of Spark Operator

2024-04-09 Thread Dongjoon Hyun
Hi, Liang-Chi. Thank you for leading the Apache Spark K8s operator as a shepherd. I took a look at the `Apache Spark Connect Go` repo mentioned in the thread. Sadly, there is no release at all and no activity in the last 6 months. It seems to be the first time for the Apache Spark community to consider

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Dongjoon Hyun
Thank you for sharing, Jia. I have the same questions as in Weiting's previous thread. Do you think you can share the future milestones of Apache Gluten? I'm wondering when the first stable release will come and how we can coordinate across the ASF communities. > This project is still under

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-09 Thread Dongjoon Hyun
Thank you for sharing, Weiting. Do you think you can share the future milestone of Apache Gluten? I'm wondering when the first stable release will come and how we can coordinate across the ASF communities. > This project is still under active development now, and doesn't have a stable release. >

Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-09 Thread WeitingChen
Hi all, We are excited to introduce a new Apache incubating project called Gluten. Gluten serves as a middleware layer designed to offload Spark to native engines like Velox or ClickHouse. For more detailed information, please visit the project repository at

SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Ke Jia
Apache Spark currently lacks an official mechanism to support cross-platform execution of physical plans. The Gluten project offers a mechanism that utilizes the Substrait standard to convert and optimize Spark's physical plans. By introducing Gluten's plan conversion, validation, and fallback
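As a concrete illustration of the plan-offload mechanism the SPIP describes, here is a rough sketch of how Gluten is typically enabled on an existing Spark job via the plugin mechanism. The plugin class name, the bundle jar name, and the off-heap sizing below are assumptions based on Gluten's documentation at the time and may differ between Gluten releases; this is not authoritative configuration.

```shell
# Hypothetical spark-submit flags enabling Gluten's physical-plan offload
# to a native backend (e.g. Velox). Gluten intercepts supported operators,
# converts them via Substrait, and falls back to vanilla Spark otherwise.
spark-submit \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=4g \
  --jars gluten-velox-bundle.jar \
  your_app.py
```

Queries whose operators cannot be converted are expected to fall back to Spark's JVM execution, which is the validation-and-fallback behavior the SPIP refers to.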

Versioning of Spark Operator

2024-04-08 Thread L. C. Hsieh
Hi all, We've opened the dedicated repository of Spark Kubernetes Operator, and the first PR is created. Thank you for the review from the community so far. About the versioning of Spark Operator, there are questions. As we are using Spark JIRA, when we are going to merge PRs, we need to choose

Unsubscribe

2024-04-08 Thread bruce COTTMAN
- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: Apache Spark 3.4.3 (?)

2024-04-08 Thread Dongjoon Hyun
Thank you, Holden, Mridul, Kent, Liang-Chi, Mich, Jungtaek. I added `Target Version: 3.4.3` to SPARK-47318 and am going to continue to prepare for RC1 (April 15th). Dongjoon. - To unsubscribe e-mail:

Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi, First thanks everyone for their contributions I was going to reply to @Enrico Minack but noticed additional info. As I understand for example, Apache Uniffle is an incubating project aimed at providing a pluggable shuffle service for Spark. So basically, all these "external shuffle

Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celebron support S3/HDFS backends which is great. In the case someone is using S3/HDFS, I wonder what would be the advantages of using Celebron or Uniffle vs IBM shuffle service plugin or Cloud Shuffle Storage Plugin from AWS

Re: External Spark shuffle service for k8s

2024-04-08 Thread roryqi
Apache Uniffle (incubating) may be another solution. You can see https://github.com/apache/incubator-uniffle https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era Mich Talebzadeh wrote on Mon, Apr 8, 2024 at 07:15: > Splendid > > The
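For readers unfamiliar with how a remote shuffle service like Uniffle plugs into Spark, a minimal client-side sketch looks roughly like the following. The shuffle-manager class, coordinator property name, and jar name are assumptions recalled from the Uniffle README and may not match your Uniffle release; check the project docs before use.

```shell
# Hypothetical Spark client settings pointing shuffle writes/reads at a
# Uniffle cluster instead of local executor disks.
spark-submit \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager \
  --conf spark.rss.coordinator.quorum=coordinator-1:19999,coordinator-2:19999 \
  --jars rss-client-spark3-shaded.jar \
  your_app.py
```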

Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread Jungtaek Lim
Sounds like a plan. +1 (non-binding) Thanks for volunteering! On Sun, Apr 7, 2024 at 5:45 AM Dongjoon Hyun wrote: > Hi, All. > > Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85 > commits including important security and correctness patches like > SPARK-45580, SPARK-46092,

Fwd: Apache Spark 3.4.3 (?)

2024-04-07 Thread Mich Talebzadeh
Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct

Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread L. C. Hsieh
+1 Thanks Dongjoon! On Sun, Apr 7, 2024 at 1:56 AM Kent Yao wrote: > > +1, thank you, Dongjoon > > > Kent > > Holden Karau wrote on Sun, Apr 7, 2024 at 14:54: > > > > Sounds good to me :) > > > > Twitter: https://twitter.com/holdenkarau > > Books (Learning Spark, High Performance Spark, etc.): > >

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look. Cheers Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile

Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of an External Shuffle Service, Apache Celeborn might be a good option as a Remote Shuffle Service for Spark on K8s. There are some useful resources you might be interested in. [1] https://celeborn.apache.org/ [2] https://www.youtube.com/watch?v=s5xOtG6Venw [3]
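To make the suggestion concrete, wiring Spark on K8s to a Celeborn cluster is, in outline, a client-side configuration change. The class name, property name, endpoint, and jar name below are assumptions based on the Celeborn quick start and may vary across Celeborn and Spark versions; treat this as a sketch, not authoritative configuration.

```shell
# Hypothetical spark-submit flags routing shuffle data to a Celeborn
# cluster, decoupling shuffle storage from executor pods on K8s.
spark-submit \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
  --conf spark.celeborn.master.endpoints=celeborn-master:9097 \
  --jars celeborn-client-spark-3-shaded.jar \
  your_app.py
```

Because shuffle data then lives outside the executors, executor pods can be decommissioned without losing shuffle blocks, which is the main motivation for a remote shuffle service on K8s.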
