Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
> Out of interest, what are the differences in the approach between this and
> Gluten?

Overall they are similar, although Gluten supports multiple backends,
including Velox and ClickHouse. One major difference is (obviously) that
Comet is based on DataFusion and Arrow and written in Rust, while
Gluten is mostly C++.
I haven't looked very deeply into Gluten yet, but there could be other
differences, such as how strictly the engine follows Spark's semantics,
table format support (Iceberg, Delta, etc.), the fallback mechanism
(coarse-grained fallback at the stage level or more fine-grained fallback
within stages), UDF support (Comet hasn't started on this yet),
shuffle support, memory management, etc.

Both engines are backed by very strong and vibrant open source
communities (Velox, ClickHouse, Arrow & DataFusion), so it's very
exciting to see how the projects will grow in the future.

Best,
Chao

On Tue, Feb 13, 2024 at 10:06 PM John Zhuge  wrote:
>
> Congratulations! Excellent work!
>
> On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
>>
>> Absolutely thrilled to see the project going open-source! Huge congrats to 
>> Chao and the entire team on this milestone!
>>
>> Yufei
>>
>>
>> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>>>
>>> Hi all,
>>>
>>> We are very happy to announce that Project Comet, a plugin to
>>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>>> has now been open sourced under the Apache Arrow umbrella. Please
>>> check the project repo
>>> https://github.com/apache/arrow-datafusion-comet for more details if
>>> you are interested. We'd love to collaborate with people from the open
>>> source community who share similar goals.
>>>
>>> Thanks,
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>
>
> --
> John Zhuge




Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread praveen sinha
Hi Chao,

Is there an example app, gist, or repo that can help me use this plugin? I
wanted to try out some real-time aggregate performance on top of Parquet and
Spark DataFrames.

Thanks and Regards
Praveen




Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
Hi Praveen,

We will add a "Getting Started" section in the README soon, but basically
comet-spark-shell in the repo should provide a basic tool to build Comet
and launch a Spark shell with it.

Note that we haven't open sourced several features yet, including shuffle
support, which the aggregate operation depends on. Please stay tuned!

Chao

