Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-11 Thread Gengliang Wang
+1, enabling Spark's ANSI SQL mode in version 4.0 will significantly enhance data quality and integrity. I fully support this initiative. > In other words, the current Spark ANSI SQL implementation becomes the first implementation Spark SQL users face while providing
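For context, a minimal PySpark sketch of the behavior difference at stake (assuming an active SparkSession named `spark`; the cast example is illustrative, not taken from the thread):

```python
# With ANSI mode off (the pre-4.0 default), an invalid cast silently yields NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT) AS v").show()   # v is NULL

# With ANSI mode on (the proposed 4.0 default), the same query raises an error
# instead of silently producing NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('abc' AS INT) AS v").show()   # raises a CAST_INVALID_INPUT error
```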

Re: [PySpark]: DataFrameWriterV2.overwrite fails with spark connect

2024-04-11 Thread Ruifeng Zheng
Toki Takahashi, Thanks for reporting this, I created https://issues.apache.org/jira/browse/SPARK-47828 to track this bug. I will take a look. On Thu, Apr 11, 2024 at 10:11 PM Toki Takahashi wrote: > Hi Community, > > I get the following error when using Spark Connect in PySpark 3.5.1 > and

[VOTE] Add new `Versions` in Apache Spark JIRA for Versioning of Spark Operator

2024-04-11 Thread L. C. Hsieh
Hi all, Thanks for all discussions in the thread of "Versioning of Spark Operator": https://lists.apache.org/thread/zhc7nb2sxm8jjxdppq8qjcmlf4rcsthh I would like to create this vote to get the consensus for versioning of the Spark Kubernetes Operator. The proposal is to use an independent

[DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-11 Thread Dongjoon Hyun
Hi, All. Thanks to you, we have achieved many things and have ongoing SPIPs. I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more narrowly by asking for your opinions about Apache Spark's ANSI SQL mode. https://issues.apache.org/jira/browse/SPARK-44111 Prepare Apache Spark
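If ANSI mode does become the 4.0 default, a hedged sketch of how users who depend on the legacy behavior could opt back out per session, using the existing spark.sql.ansi.enabled flag:

```python
from pyspark.sql import SparkSession

# Sketch only: opt back out of ANSI mode for a single session.
spark = (
    SparkSession.builder
    .config("spark.sql.ansi.enabled", "false")
    .getOrCreate()
)
```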

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-04-11 Thread Jungtaek Lim
I'm still having a hard time reviewing this. I'm juggling a lot of context right now, and the change is non-trivial to review in parallel. I see people were OK with the algorithm at a high level, but from a code perspective it's hard to understand without knowledge of DRA. It would take
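For readers without that DRA background, a hedged sketch of the existing (batch-oriented) dynamic allocation knobs that SPARK-24815 builds on; the option names are the documented ones, the values are purely illustrative:

```python
from pyspark.sql import SparkSession

# Illustrative values only; these are the existing dynamic allocation settings.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    # Needed when no external shuffle service is available (e.g. on Kubernetes).
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```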

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space on nodes. https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage Local Storage Spark supports using volumes to spill
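Per that "Local Storage" section, a volume mounted under a name starting with `spark-local-dir-` is used for spill; a hedged sketch with illustrative paths (normally passed as spark-submit --conf flags):

```python
from pyspark.sql import SparkSession

vol = "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-spill"

# Illustrative only: mount a hostPath volume as executor local/spill storage.
spark = (
    SparkSession.builder
    .config(f"{vol}.options.path", "/mnt/fast-disk")
    .config(f"{vol}.mount.path", "/mnt/fast-disk")
    .config(f"{vol}.mount.readOnly", "false")
    .getOrCreate()
)
```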

[PySpark]: DataFrameWriterV2.overwrite fails with spark connect

2024-04-11 Thread Toki Takahashi
Hi Community, I get the following error when using Spark Connect in PySpark 3.5.1 and writing with DataFrameWriterV2.overwrite. ``` > df.writeTo('db.table').overwrite(F.col('id')==F.lit(1)) ... SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) Expression with ID:
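A hedged, self-contained sketch of the reported failure (the remote address and table name are placeholders; assumes a Spark Connect server running Spark 3.5.1 and an existing v2 table):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Connect to a Spark Connect endpoint (address is a placeholder).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.createDataFrame([(1, "a")], ["id", "value"])

# Works with a classic (non-Connect) session; over Spark Connect this is the call
# reported to fail with InvalidPlanInput.
df.writeTo("db.table").overwrite(F.col("id") == F.lit(1))
```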

Re: [External] Re: Versioning of Spark Operator

2024-04-11 Thread Ofir Manor
A related question - what is the expected release cadence, at least for the next 12-18 months? Since this is a new subproject, I am personally hoping it would have a faster cadence at first, maybe once a month or once every couple of months... If so, that would affect versioning. Also, if it

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
" In the end for my usecase I started using pvcs and pvc aware scheduling along with decommissioning. So far performance is good with this choice." How did you do this? tor. 11. apr. 2024 kl. 04:13 skrev Arun Ravi : > Hi Everyone, > > I had to explored IBM's and AWS's S3 shuffle plugins (some