Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
Congratulations! Excellent work! On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu wrote: > Absolutely thrilled to see the project going open-source! Huge congrats to > Chao and the entire team on this milestone! > > Yufei > > > On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > >> Hi all, >> >> We are

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Yufei Gu
Absolutely thrilled to see the project going open-source! Huge congrats to Chao and the entire team on this milestone! Yufei On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Mich Talebzadeh
Sure, thanks for the clarification. I gather what you are alluding to is -- in a distributed environment, when one does operations that involve shuffling or repartitioning of data, the order in which this data is processed across partitions is not guaranteed. So when repartitioning a dataframe, the

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Jack Goodson
Apologies if it wasn't clear, I meant the difficulty of debugging, not floating-point precision :) On Wed, Feb 14, 2024 at 2:03 AM Mich Talebzadeh wrote: > Hi Jack, > > "most SQL engines suffer from the same issue..." > > Sure. This behavior is not a bug, but rather a consequence

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest, what are the differences in the approach between this and Gluten? On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via leveraging DataFusion

Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Chao Sun
Hi all, We are very happy to announce that Project Comet, a plugin to accelerate Spark query execution via leveraging DataFusion and Arrow, has now been open sourced under the Apache Arrow umbrella. Please check the project repo https://github.com/apache/arrow-datafusion-comet for more details if

Re: Heads-up: Update on Spark 3.5.1 RC

2024-02-13 Thread Dongjoon Hyun
Thank you for the update, Jungtaek. Dongjoon. On Tue, Feb 13, 2024 at 7:29 AM Jungtaek Lim wrote: > Hi, > > Just a heads-up since I didn't give an update for a week after the last > update from the discussion thread. > > I've been following the automated release process and encountered several

Re: Extracting Input and Output Partitions in Spark

2024-02-13 Thread Daniel Saha
This would be helpful for a few use cases. For context, my team works in the security space, and customers access data through a wrapper around Spark SQL connected to a Hive metastore. 1. When snapshot (non-partitioned) tables are queried, it’s not clear when the underlying snapshot was last updated.

Heads-up: Update on Spark 3.5.1 RC

2024-02-13 Thread Jungtaek Lim
Hi, Just a heads-up since I didn't give an update for a week after the last update from the discussion thread. I've been following the automated release process and encountered several issues. Maybe I will file JIRA tickets and follow PRs. Issues I figured out so far are 1) python library

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Mich Talebzadeh
Hi Jack, "most SQL engines suffer from the same issue..." Sure. This behavior is not a bug, but rather a consequence of the limitations of floating-point precision. The numbers involved in the example (see SPIP [SPARK-47024] Sum of floats/doubles may be incorrect depending on
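
The point about floating-point precision can be illustrated with a small sketch. This is plain Python, not Spark code, and the values and the helper name naive_sum are made up for illustration; it only assumes that a distributed sum may add the same numbers in a different order from one run to the next.

```python
import math
from functools import reduce
from operator import add


def naive_sum(values):
    """Add floats strictly left to right, rounding after every step.

    Hypothetical helper for illustration; it stands in for whatever
    order a distributed engine happens to combine its partial sums in.
    """
    return reduce(add, values, 0.0)


# The same three numbers, in two different orders (illustrative values).
values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]

# In the first order, 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost.
order_a = naive_sum(values)
# In the second order, the large terms cancel first, so the 1.0 survives.
order_b = naive_sum(reordered)

# math.fsum tracks the rounding error and is independent of the order.
exact_a = math.fsum(values)
exact_b = math.fsum(reordered)

print(order_a, order_b)
print(exact_a, exact_b)
```

Float addition is not associative, so the two orders give different totals here, while an error-compensated sum such as math.fsum returns the same value for both.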