Re: Introducing Apache Gluten (incubating), a middle layer to offload Spark to native engines

2024-04-10 Thread Holden Karau
On Wed, Apr 10, 2024 at 9:54 PM Binwei Yang  wrote:

>
> Gluten already supports the Velox and ClickHouse backends. A DataFusion
> backend has also been proposed, but no one has worked on it yet.
>
> Gluten isn't a POC. It's under active development, and some companies
> already use it.
>
>
> On 2024/04/11 03:32:01 Dongjoon Hyun wrote:
> > I'm interested in your claim.
> >
> > Could you elaborate or provide some evidence for your claim, *a door for
> > all native libraries*, Binwei?
> >
> > For example, is there any POC for that claim? Maybe, did I miss something
> > in that SPIP?
>
I think the concern here is that there are multiple different layers to get
from Spark -> native code, and ideally any changes we introduce in Spark would
be for common functionality that is useful across them (e.g. DataFusion Comet,
Gluten, Photon*, etc.)


* Photon being harder to guess at since it's closed source.

> >
> > Dongjoon.
> >
> > On Wed, Apr 10, 2024 at 8:19 PM Binwei Yang  wrote:
> >
> > >
> > > The SPIP is not for current Gluten, but open a door for all native
> > > libraries and accelerators support.
> > >
> > > On 2024/04/11 00:27:43 Weiting Chen wrote:
> > > > Yes, the first Apache release (v1.2.0) of Gluten will be in September.
> > > > For Spark version support, Gluten v1.1.1 currently supports Spark 3.2
> > > > and 3.3.
> > > > We are planning to support Spark 3.4 and 3.5 in Gluten v1.2.0.
> > > > Spark 4.0 support in Gluten depends on the release schedule of the
> > > > Spark community.
> > > >
> > > > On 2024/04/09 07:14:13 Dongjoon Hyun wrote:
> > > > > Thank you for sharing, Weiting.
> > > > >
> > > > > Do you think you can share the future milestone of Apache Gluten?
> > > > > I'm wondering when the first stable release will come and how we can
> > > > > coordinate across the ASF communities.
> > > > >
> > > > > > This project is still under active development now, and doesn't
> > > > > > have a stable release.
> > > > > > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> > > > >
> > > > > In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached
> > > > > the end of support.
> > > > > 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
> > > > > scheduled for October.
> > > > >
> > > > > For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if
> > > > > there is something we need to do on the Spark side.
> > > > >
> > > > > Thanks,
> > > > > Dongjoon.
> > > > >
> > > > >
> > > > > On Mon, Apr 8, 2024 at 11:19 PM WeitingChen <weitingc...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We are excited to introduce a new Apache incubating project
> > > > > > called Gluten.
> > > > > > Gluten serves as a middleware layer designed to offload Spark to
> > > > > > native engines like Velox or ClickHouse.
> > > > > > For more detailed information, please visit the project repository
> > > > > > at https://github.com/apache/incubator-gluten
> > > > > >
> > > > > > Additionally, a new Spark SPIP related to Spark + Gluten
> > > > > > collaboration has been proposed at
> > > > > > https://issues.apache.org/jira/browse/SPARK-47773.
> > > > > > We eagerly await feedback from the Spark community.
> > > > > >
> > > > > > Thanks,
> > > > > > Weiting.
> > > > > >
> > > > > >

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Introducing Apache Gluten (incubating), a middle layer to offload Spark to native engines

2024-04-10 Thread Binwei Yang


Gluten already supports the Velox and ClickHouse backends. A DataFusion
backend has also been proposed, but no one has worked on it yet.

Gluten isn't a POC. It's under active development, and some companies already
use it.


On 2024/04/11 03:32:01 Dongjoon Hyun wrote:
> I'm interested in your claim.
> 
> Could you elaborate or provide some evidence for your claim, *a door for
> all native libraries*, Binwei?
> 
> For example, is there any POC for that claim? Maybe, did I miss something
> in that SPIP?
> 
> Dongjoon.
> 
> On Wed, Apr 10, 2024 at 8:19 PM Binwei Yang  wrote:
> 
> >
> > The SPIP is not for current Gluten, but open a door for all native
> > libraries and accelerators support.
> >
> > On 2024/04/11 00:27:43 Weiting Chen wrote:
> > > Yes, the first Apache release (v1.2.0) of Gluten will be in September.
> > > For Spark version support, Gluten v1.1.1 currently supports Spark 3.2
> > > and 3.3.
> > > We are planning to support Spark 3.4 and 3.5 in Gluten v1.2.0.
> > > Spark 4.0 support in Gluten depends on the release schedule of the
> > > Spark community.
> > >
> > > On 2024/04/09 07:14:13 Dongjoon Hyun wrote:
> > > > Thank you for sharing, Weiting.
> > > >
> > > > Do you think you can share the future milestone of Apache Gluten?
> > > > I'm wondering when the first stable release will come and how we can
> > > > coordinate across the ASF communities.
> > > >
> > > > > This project is still under active development now, and doesn't
> > > > > have a stable release.
> > > > > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> > > >
> > > > In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached
> > > > the end of support.
> > > > 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
> > > > scheduled for October.
> > > >
> > > > For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if
> > > > there is something we need to do on the Spark side.
> > > >
> > > > Thanks,
> > > > Dongjoon.
> > > >
> > > >
> > > > On Mon, Apr 8, 2024 at 11:19 PM WeitingChen wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We are excited to introduce a new Apache incubating project called
> > > > > Gluten.
> > > > > Gluten serves as a middleware layer designed to offload Spark to
> > > > > native engines like Velox or ClickHouse.
> > > > > For more detailed information, please visit the project repository at
> > > > > https://github.com/apache/incubator-gluten
> > > > >
> > > > > Additionally, a new Spark SPIP related to Spark + Gluten
> > > > > collaboration has been proposed at
> > > > > https://issues.apache.org/jira/browse/SPARK-47773.
> > > > > We eagerly await feedback from the Spark community.
> > > > >
> > > > > Thanks,
> > > > > Weiting.
> > > > >
> > > > >
> > > >
> > >



Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Binwei Yang


Gluten's Java part is pretty stable now. Most of the development is in the C++
code, the Velox code, and the ClickHouse backend.

The SPIP doesn't plan to introduce the whole Gluten stack into Spark, but
rather a way to serialize Spark's physical plan and send it to a native
backend, through JNI or gRPC. Currently Spark has no API for this. The
physical plan format could be Substrait or an extended Spark Connect plan.
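
To make this concrete, here is a rough sketch of what such an API could look
like. This is only illustrative: the trait name, method signatures, and the
SubstraitPlanConverter class below are hypothetical, not part of the SPIP or
of any existing Spark API.

  import org.apache.spark.sql.execution.SparkPlan

  // Hypothetical extension point: validate a physical plan and turn it into
  // bytes a native backend can consume (e.g. a Substrait protobuf). The
  // bytes would then be shipped to the backend over JNI or gRPC.
  trait PhysicalPlanSerializer {
    // Whether the whole plan tree can be handled by the native backend;
    // if not, execution falls back to vanilla Spark.
    def validate(plan: SparkPlan): Boolean

    // Serialize the plan into the agreed format.
    def serialize(plan: SparkPlan): Array[Byte]
  }

  // Outline of a Substrait-based implementation (bodies intentionally elided).
  class SubstraitPlanConverter extends PhysicalPlanSerializer {
    override def validate(plan: SparkPlan): Boolean =
      ??? // e.g. check that every operator in the tree has a Substrait mapping

    override def serialize(plan: SparkPlan): Array[Byte] =
      ??? // map each operator to a Substrait relation and emit the protobuf
  }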

On 2024/04/10 12:34:26 Mich Talebzadeh wrote:
> I read the SPIP. I have a number of points, if I may:
>
> - Maturity of Gluten: as the excerpt mentions, Gluten is an incubating
> project, and its feature set and stability IMO are still under development.
> Integrating a non-core component could introduce risks if it is not fully
> mature.
> - Complexity: integrating Gluten's functionalities into Spark might add
> complexity to the codebase, potentially increasing maintenance
> overhead. Users might need to learn about Gluten's functionalities and
> potential limitations for effective utilization.
> - Performance Overhead: the plan conversion process itself could introduce
> some overhead compared to native Spark execution. The effectiveness of
> performance optimizations from Gluten might vary depending on the specific
> engine and workload.
> - Potential compatibility issues: not all data processing engines might
> have complete support for the "Substrait standard", potentially limiting
> the universality of the approach. There could be edge cases where plan
> conversion or execution on a specific engine leads to unexpected behavior.
> - Security: if other engines have different security models or access
> controls, integrating them with Spark might require additional security
> considerations.
> - Integration and support in the cloud.
> 
> HTH
> 
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> Mich Talebzadeh,
> London
> United Kingdom
> 
> 
> view my Linkedin profile
>
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> 
> 
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner von Braun).
> 
> 
> On Wed, 10 Apr 2024 at 12:33, Wenchen Fan  wrote:
> 
> > It's good to reduce duplication between different native accelerators of
> > Spark, and AFAIK there is already a project trying to solve it:
> > https://substrait.io/
> >
> > I'm not sure why we need to do this inside Spark, instead of doing
> > the unification for a wider scope (for all engines, not only Spark).
> >
> >
> > On Wed, Apr 10, 2024 at 10:11 AM Holden Karau 
> > wrote:
> >
> >> I like the idea of improving the flexibility of Spark's physical plans
> >> and really anything that might reduce code duplication among the ~4 or so
> >> different accelerators.
> >>
> >> Twitter: https://twitter.com/holdenkarau
> >> Books (Learning Spark, High Performance Spark, etc.):
> >> https://amzn.to/2MaRAG9  
> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >>
> >>
> >> On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun 
> >> wrote:
> >>
> >>> Thank you for sharing, Jia.
> >>>
> >>> I have the same questions as in Weiting's previous thread.
> >>>
> >>> Do you think you can share the future milestone of Apache Gluten?
> >>> I'm wondering when the first stable release will come and how we can
> >>> coordinate across the ASF communities.
> >>>
> >>> > This project is still under active development now, and doesn't have a
> >>> stable release.
> >>> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> >>>
> >>> In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the
> >>> end of support.
> >>> 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
> >>> scheduled for October.
> >>>
> >>> For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if
> >>> there is something we need to do on the Spark side.
> >>>
> >> +1 I think any changes need to target 4.0
> >>
> >>>
> >>> Thanks,
> >>> Dongjoon.
> >>>
> >>>
> >>> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:
> >>>
>  Apache Spark currently lacks an official mechanism to support
>  cross-platform execution of physical plans. The Gluten project offers a
>  mechanism that utilizes the Substrait standard to convert and optimize
>  Spark's physical plans. By introducing Gluten's plan conversion,
>  validation, and fallback mechanisms into Spark, we can significantly
>  enhance the portability and interoperability of Spark's physical plans,
>  enabling them to operate across a broader spectrum of execution
>  environments without requiring users to migrate, while also improving
>  Spark's execution efficiency through the utilization of 

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Binwei Yang


We (the Gluten and Arrow folks) actually did plan to put the plan conversion
in the substrait-java repo. But to me it makes more sense to put it in the
Spark repo. Native library and accelerator support will become more and more
important in the future.

On 2024/04/10 08:29:08 Wenchen Fan wrote:
> It's good to reduce duplication between different native accelerators of
> Spark, and AFAIK there is already a project trying to solve it:
> https://substrait.io/
> 
> I'm not sure why we need to do this inside Spark, instead of doing
> the unification for a wider scope (for all engines, not only Spark).
> 
> 
> On Wed, Apr 10, 2024 at 10:11 AM Holden Karau 
> wrote:
> 
> > I like the idea of improving the flexibility of Spark's physical plans and
> > really anything that might reduce code duplication among the ~4 or so
> > different accelerators.
> >
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> > https://amzn.to/2MaRAG9  
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
> >
> > On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun 
> > wrote:
> >
> >> Thank you for sharing, Jia.
> >>
> >> I have the same questions as in Weiting's previous thread.
> >>
> >> Do you think you can share the future milestone of Apache Gluten?
> >> I'm wondering when the first stable release will come and how we can
> >> coordinate across the ASF communities.
> >>
> >> > This project is still under active development now, and doesn't have a
> >> stable release.
> >> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> >>
> >> In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the
> >> end of support.
> >> 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
> >> scheduled for October.
> >>
> >> For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if
> >> there is something we need to do on the Spark side.
> >>
> > +1 I think any changes need to target 4.0
> >
> >>
> >> Thanks,
> >> Dongjoon.
> >>
> >>
> >> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:
> >>
> >>> Apache Spark currently lacks an official mechanism to support
> >>> cross-platform execution of physical plans. The Gluten project offers a
> >>> mechanism that utilizes the Substrait standard to convert and optimize
> >>> Spark's physical plans. By introducing Gluten's plan conversion,
> >>> validation, and fallback mechanisms into Spark, we can significantly
> >>> enhance the portability and interoperability of Spark's physical plans,
> >>> enabling them to operate across a broader spectrum of execution
> >>> environments without requiring users to migrate, while also improving
> >>> Spark's execution efficiency through the utilization of Gluten's advanced
> >>> optimization techniques. And the integration of Gluten into Spark has
> >>> already shown significant performance improvements with ClickHouse and
> >>> Velox backends and has been successfully deployed in production by several
> >>> customers.
> >>>
> >>> References:
> >>> JIRA Ticket 
> >>> SPIP Doc
> >>> 
> >>>
> >>> Your feedback and comments are welcome and appreciated.  Thanks.
> >>>
> >>> Thanks,
> >>> Jia Ke
> >>>
> >>
> 




Re: Introducing Apache Gluten (incubating), a middle layer to offload Spark to native engines

2024-04-10 Thread Dongjoon Hyun
I'm interested in your claim.

Could you elaborate or provide some evidence for your claim, *a door for
all native libraries*, Binwei?

For example, is there any POC for that claim? Maybe, did I miss something
in that SPIP?

Dongjoon.

On Wed, Apr 10, 2024 at 8:19 PM Binwei Yang  wrote:

>
> The SPIP is not for the current Gluten, but opens a door for supporting all
> native libraries and accelerators.
>
> On 2024/04/11 00:27:43 Weiting Chen wrote:
> > Yes, the first Apache release (v1.2.0) of Gluten will be in September.
> > For Spark version support, Gluten v1.1.1 currently supports Spark 3.2
> > and 3.3.
> > We are planning to support Spark 3.4 and 3.5 in Gluten v1.2.0.
> > Spark 4.0 support in Gluten depends on the release schedule of the
> > Spark community.
> >
> > On 2024/04/09 07:14:13 Dongjoon Hyun wrote:
> > > Thank you for sharing, Weiting.
> > >
> > > Do you think you can share the future milestone of Apache Gluten?
> > > I'm wondering when the first stable release will come and how we can
> > > coordinate across the ASF communities.
> > >
> > > > This project is still under active development now, and doesn't have
> > > > a stable release.
> > > > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> > >
> > > In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the
> > > end of support.
> > > 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
> > > scheduled for October.
> > >
> > > For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if
> > > there is something we need to do on the Spark side.
> > >
> > > Thanks,
> > > Dongjoon.
> > >
> > >
> > > On Mon, Apr 8, 2024 at 11:19 PM WeitingChen wrote:
> > >
> > > > Hi all,
> > > >
> > > > We are excited to introduce a new Apache incubating project called
> > > > Gluten.
> > > > Gluten serves as a middleware layer designed to offload Spark to
> > > > native engines like Velox or ClickHouse.
> > > > For more detailed information, please visit the project repository at
> > > > https://github.com/apache/incubator-gluten
> > > >
> > > > Additionally, a new Spark SPIP related to Spark + Gluten collaboration
> > > > has been proposed at https://issues.apache.org/jira/browse/SPARK-47773.
> > > > We eagerly await feedback from the Spark community.
> > > >
> > > > Thanks,
> > > > Weiting.
> > > >
> > > >
> > >
> >


Re: Introducing Apache Gluten (incubating), a middle layer to offload Spark to native engines

2024-04-10 Thread Binwei Yang


The SPIP is not for the current Gluten, but opens a door for supporting all
native libraries and accelerators.

On 2024/04/11 00:27:43 Weiting Chen wrote:
> Yes, the first Apache release (v1.2.0) of Gluten will be in September.
> For Spark version support, Gluten v1.1.1 currently supports Spark 3.2 and 3.3.
> We are planning to support Spark 3.4 and 3.5 in Gluten v1.2.0.
> Spark 4.0 support in Gluten depends on the release schedule of the Spark
> community.
> 
> On 2024/04/09 07:14:13 Dongjoon Hyun wrote:
> > Thank you for sharing, Weiting.
> > 
> > Do you think you can share the future milestone of Apache Gluten?
> > I'm wondering when the first stable release will come and how we can
> > coordinate across the ASF communities.
> > 
> > > This project is still under active development now, and doesn't have a
> > stable release.
> > > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> > 
> > In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the
> > end of support.
> > 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
> > scheduled for October.
> >
> > For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if there
> > is something we need to do on the Spark side.
> > 
> > Thanks,
> > Dongjoon.
> > 
> > 
> > On Mon, Apr 8, 2024 at 11:19 PM WeitingChen  wrote:
> > 
> > > Hi all,
> > >
> > > We are excited to introduce a new Apache incubating project called Gluten.
> > > Gluten serves as a middleware layer designed to offload Spark to native
> > > engines like Velox or ClickHouse.
> > > For more detailed information, please visit the project repository at
> > > https://github.com/apache/incubator-gluten
> > >
> > > Additionally, a new Spark SPIP related to Spark + Gluten collaboration has
> > > been proposed at https://issues.apache.org/jira/browse/SPARK-47773.
> > > We eagerly await feedback from the Spark community.
> > >
> > > Thanks,
> > > Weiting.
> > >
> > >
> > 
> 



Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
Hi Everyone,

I explored IBM's and AWS's S3 shuffle plugins some time back, and I have also
explored AWS FSx for Lustre in a few of my production jobs, which have ~20TB
of shuffle operations with 200-300 executors. What I observed is that S3 and
FSx behaviour was fine during the write phase; however, I faced IOPS
throttling during the read phase (reads taking forever to complete). I think
this might be caused by the heavy use of shuffle index files (I didn't perform
any extensive research on this), so I believe the shuffle manager logic has to
be intelligent enough to reduce the fetching of files from the object store.
In the end, for my use case, I started using PVCs and PVC-aware scheduling
along with decommissioning. So far performance is good with this choice.
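
For anyone who wants to try the PVC route, below is a minimal sketch of the
settings involved, as I understand them from the Spark on Kubernetes docs
(the shuffle recovery plugin needs Spark 3.4+). Please treat the storage
class and size as placeholders to verify against your own cluster, not a
tested recipe.

  import org.apache.spark.SparkConf

  // Sketch: keep shuffle data on an on-demand PVC mounted as the executor's
  // local directory, and migrate/recover blocks when executors decommission.
  val conf = new SparkConf()
    // Graceful decommissioning with shuffle block migration.
    .set("spark.decommission.enabled", "true")
    .set("spark.storage.decommission.enabled", "true")
    .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
    // Mount an OnDemand PVC; the "spark-local-dir-" volume name prefix makes
    // Spark use it as a local (shuffle) directory.
    .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName", "OnDemand")
    .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass", "standard") // placeholder
    .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit", "500Gi") // placeholder
    .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path", "/data")
    .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly", "false")
    // Let the driver own and reuse PVCs, so a replacement executor can pick
    // up the shuffle data a decommissioned executor left on its volume.
    .set("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
    .set("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
    // Recover shuffle data found on the reused PVC (Spark 3.4+).
    .set("spark.shuffle.sort.io.plugin.class", "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO")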

Thank you

On Tue, 9 Apr 2024, 15:17 Mich Talebzadeh, 
wrote:

> Hi,
>
> First, thanks to everyone for their contributions.
>
> I was going to reply to @Enrico Minack but noticed additional info. As I
> understand it, for example, Apache Uniffle is an incubating project aimed at
> providing a pluggable shuffle service for Spark. So basically, what all
> these "external shuffle services" have in common is offloading shuffle data
> management to external services, thus reducing the memory and CPU overhead
> on Spark executors. That is great. While Uniffle and others enhance shuffle
> performance and scalability, it would be great to integrate them with the
> Spark UI. This may require additional development effort. I suppose the
> interest would be to have these external metrics incorporated into Spark
> with one look and feel. This may require customizing the UI to fetch and
> display metrics or statistics from the external shuffle services. Has any
> project done this?
>
> Thanks
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
> view my Linkedin profile
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner von Braun).
>
>
> On Mon, 8 Apr 2024 at 14:19, Vakaris Baškirov 
> wrote:
>
>> I see that both Uniffle and Celeborn support S3/HDFS backends, which is
>> great.
>> In the case someone is using S3/HDFS, I wonder what the advantages would
>> be of using Celeborn or Uniffle vs the IBM shuffle service plugin or the
>> Cloud Shuffle Storage Plugin from AWS?
>>
>> These plugins do not require deploying a separate service. Are there any
>> advantages to using Uniffle/Celeborn in the case of using an S3 backend, which
>> would require deploying a separate service?
>>
>> Thanks
>> Vakaris
>>
>> On Mon, Apr 8, 2024 at 10:03 AM roryqi  wrote:
>>
>>> Apache Uniffle (incubating) may be another solution.
>>> You can see
>>> https://github.com/apache/incubator-uniffle
>>>
>>> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>>>
>>> Mich Talebzadeh wrote on Mon, Apr 8, 2024 at 07:15:
>>>
 Splendid

 The configurations below can be used with k8s deployments of Spark.
 Spark applications running on k8s can utilize these configurations to
 seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3.

 For Google GCS we may have

 spark_config_gcs = {
     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
 }

 For Amazon S3 similar

 spark_config_s3 = {
     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
     "spark.hadoop.fs.s3a.secret.key": "secret_key",
 }


 To implement these configurations and enable Spark applications to
 interact with GCS and S3, I guess we can approach it this way

 1) Spark Repository Integration: These configurations need to be added
 to the Spark repository as part of the supported configuration options for
 k8s deployments.

 2) Configuration Settings: Users need to specify these configurations
 when submitting Spark applications to a Kubernetes cluster. They can
 include 

Re: Introducing Apache Gluten (incubating), a middle layer to offload Spark to native engines

2024-04-10 Thread Weiting Chen
Yes, the first Apache release (v1.2.0) of Gluten will be in September.
For Spark version support, Gluten v1.1.1 currently supports Spark 3.2 and 3.3.
We are planning to support Spark 3.4 and 3.5 in Gluten v1.2.0.
Spark 4.0 support in Gluten depends on the release schedule of the Spark
community.

On 2024/04/09 07:14:13 Dongjoon Hyun wrote:
> Thank you for sharing, Weiting.
> 
> Do you think you can share the future milestone of Apache Gluten?
> I'm wondering when the first stable release will come and how we can
> coordinate across the ASF communities.
> 
> > This project is still under active development now, and doesn't have a
> stable release.
> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> 
> In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the
> end of support.
> 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
> scheduled for October.
>
> For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if there
> is something we need to do on the Spark side.
> 
> Thanks,
> Dongjoon.
> 
> 
> On Mon, Apr 8, 2024 at 11:19 PM WeitingChen  wrote:
> 
> > Hi all,
> >
> > We are excited to introduce a new Apache incubating project called Gluten.
> > Gluten serves as a middleware layer designed to offload Spark to native
> > engines like Velox or ClickHouse.
> > For more detailed information, please visit the project repository at
> > https://github.com/apache/incubator-gluten
> >
> > Additionally, a new Spark SPIP related to Spark + Gluten collaboration has
> > been proposed at https://issues.apache.org/jira/browse/SPARK-47773.
> > We eagerly await feedback from the Spark community.
> >
> > Thanks,
> > Weiting.
> >
> >
> 




Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread L. C. Hsieh
+1 for Wenchen's point.

I don't see a strong reason to pull these transformations into Spark
instead of keeping them in third-party packages/projects.

On Wed, Apr 10, 2024 at 5:32 AM Wenchen Fan  wrote:
>
> It's good to reduce duplication between different native accelerators of 
> Spark, and AFAIK there is already a project trying to solve it: 
> https://substrait.io/
>
> I'm not sure why we need to do this inside Spark, instead of doing the 
> unification for a wider scope (for all engines, not only Spark).
>
>
> On Wed, Apr 10, 2024 at 10:11 AM Holden Karau  wrote:
>>
>> I like the idea of improving the flexibility of Spark's physical plans and
>> really anything that might reduce code duplication among the ~4 or so
>> different accelerators.
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun  wrote:
>>>
>>> Thank you for sharing, Jia.
>>>
>>> I have the same questions as in Weiting's previous thread.
>>>
>>> Do you think you can share the future milestone of Apache Gluten?
>>> I'm wondering when the first stable release will come and how we can 
>>> coordinate across the ASF communities.
>>>
>>> > This project is still under active development now, and doesn't have a 
>>> > stable release.
>>> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>>>
>>> In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the
>>> end of support.
>>> 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
>>> scheduled for October.
>>>
>>> For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if there
>>> is something we need to do on the Spark side.
>>
>> +1 I think any changes need to target 4.0
>>>
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:

 Apache Spark currently lacks an official mechanism to support 
 cross-platform execution of physical plans. The Gluten project offers a 
 mechanism that utilizes the Substrait standard to convert and optimize 
 Spark's physical plans. By introducing Gluten's plan conversion, 
 validation, and fallback mechanisms into Spark, we can significantly 
 enhance the portability and interoperability of Spark's physical plans, 
 enabling them to operate across a broader spectrum of execution 
 environments without requiring users to migrate, while also improving 
 Spark's execution efficiency through the utilization of Gluten's advanced 
 optimization techniques. And the integration of Gluten into Spark has 
 already shown significant performance improvements with ClickHouse and 
 Velox backends and has been successfully deployed in production by several 
 customers.

 References:
 JIRA Ticket
 SPIP Doc

 Your feedback and comments are welcome and appreciated.  Thanks.

 Thanks,
 Jia Ke




Re: Versioning of Spark Operator

2024-04-10 Thread L. C. Hsieh
This approach makes sense to me.

If the Spark K8s operator were aligned with Spark versions, for example, it
would use 4.0.0 now.
Because these JIRA tickets are not actually targeting Spark 4.0.0, that
would cause confusion and more questions, like: when we cut a Spark
release, should we include Spark operator JIRAs in the release notes, etc.

So I think an independent version number for Spark K8s operator would
be a better option.

If there are no more options or comments, I will create a vote later
to create new "Versions" in Apache Spark JIRA.

Thank you all.

On Wed, Apr 10, 2024 at 12:20 AM Dongjoon Hyun  wrote:
>
> Ya, that would work.
>
> Inevitably, I looked at Apache Flink K8s Operator's JIRA and GitHub repo.
>
> It looks reasonable to me.
>
> Although they share the same JIRA, they choose different patterns per place.
>
> 1. In the POM file and Maven artifact, an independent version number:
> `<version>1.8.0</version>`
>
> 2. Tag is also based on the independent version number
> https://github.com/apache/flink-kubernetes-operator/tags
> - release-1.8.0
> - release-1.7.0
>
> 3. JIRA Fixed Version is `kubernetes-operator-` prefix.
> https://issues.apache.org/jira/browse/FLINK-34957
> > Fix Version/s: kubernetes-operator-1.9.0
>
> Maybe, we can borrow this pattern.
>
> I guess we need a vote for any further decision because we need to create new 
> `Versions` in Apache Spark JIRA.
>
> Dongjoon.
>




Re: Versioning of Spark Operator

2024-04-10 Thread bo yang
Cool, looks like we have two options here.

Option 1: Spark Operator and Connect Go Client versioning independent of
Spark, e.g. starting with 0.1.0.
Pros: they can evolve versions independently.
Cons: people will need an extra step to decide the version when using Spark
Operator and Connect Go Client.

Option 2: Spark Operator and Connect Go Client versioning loosely related
to Spark, e.g. starting with the supported Spark version.
Pros: it might be easier for beginning users to choose a version when using
the Spark Operator and Connect Go Client.
Cons: it is uncertain how compatibility with Spark will evolve for the Spark
Operator and Connect Go Client, which may affect this version naming.

Right now, Connect Go Client uses Option 2, but can change to Option 1 if
needed.


On Wed, Apr 10, 2024 at 6:19 AM Dongjoon Hyun 
wrote:

> Ya, that would work.
>
> Inevitably, I looked at Apache Flink K8s Operator's JIRA and GitHub repo.
>
> It looks reasonable to me.
>
> Although they share the same JIRA, they choose different patterns per
> place.
>
> 1. In the POM file and Maven artifact, an independent version number:
> `<version>1.8.0</version>`
>
> 2. Tag is also based on the independent version number
> https://github.com/apache/flink-kubernetes-operator/tags
> - release-1.8.0
> - release-1.7.0
>
> 3. JIRA Fixed Version is `kubernetes-operator-` prefix.
> https://issues.apache.org/jira/browse/FLINK-34957
> > Fix Version/s: kubernetes-operator-1.9.0
>
> Maybe, we can borrow this pattern.
>
> I guess we need a vote for any further decision because we need to create
> new `Versions` in Apache Spark JIRA.
>
> Dongjoon.
>
>


Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Mich Talebzadeh
I read the SPIP. I have a number of points, if I may:

- Maturity of Gluten: as the excerpt mentions, Gluten is an incubating
project, and its feature set and stability IMO are still under development.
Integrating a non-core component could introduce risks if it is not fully
mature.
- Complexity: integrating Gluten's functionalities into Spark might add
complexity to the codebase, potentially increasing maintenance
overhead. Users might need to learn about Gluten's functionalities and
potential limitations for effective utilization.
- Performance Overhead: the plan conversion process itself could introduce
some overhead compared to native Spark execution. The effectiveness of
performance optimizations from Gluten might vary depending on the specific
engine and workload.
- Potential compatibility issues: not all data processing engines might
have complete support for the "Substrait standard", potentially limiting
the universality of the approach. There could be edge cases where plan
conversion or execution on a specific engine leads to unexpected behavior.
- Security: if other engines have different security models or access
controls, integrating them with Spark might require additional security
considerations.
- Integration and support in the cloud.

HTH

Technologist | Solutions Architect | Data Engineer  | Generative AI
Mich Talebzadeh,
London
United Kingdom


view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner von Braun).


On Wed, 10 Apr 2024 at 12:33, Wenchen Fan  wrote:

> It's good to reduce duplication between different native accelerators of
> Spark, and AFAIK there is already a project trying to solve it:
> https://substrait.io/
>
> I'm not sure why we need to do this inside Spark, instead of doing
> the unification for a wider scope (for all engines, not only Spark).
>
>
> On Wed, Apr 10, 2024 at 10:11 AM Holden Karau 
> wrote:
>
>> I like the idea of improving the flexibility of Spark's physical plans and
>> really anything that might reduce code duplication among the ~4 or so
>> different accelerators.
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for sharing, Jia.
>>>
>>> I have the same questions as in Weiting's previous thread.
>>>
>>> Do you think you can share the future milestone of Apache Gluten?
>>> I'm wondering when the first stable release will come and how we can
>>> coordinate across the ASF communities.
>>>
>>> > This project is still under active development now, and doesn't have a
>>> stable release.
>>> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>>>
>>> In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the
>>> end of support.
>>> 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
>>> scheduled for October.
>>>
>>> For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if
>>> there is something we need to do on the Spark side.
>>>
>> +1 I think any changes need to target 4.0
>>
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:
>>>
 Apache Spark currently lacks an official mechanism to support
 cross-platform execution of physical plans. The Gluten project offers a
 mechanism that utilizes the Substrait standard to convert and optimize
 Spark's physical plans. By introducing Gluten's plan conversion,
 validation, and fallback mechanisms into Spark, we can significantly
 enhance the portability and interoperability of Spark's physical plans,
 enabling them to operate across a broader spectrum of execution
 environments without requiring users to migrate, while also improving
 Spark's execution efficiency through the utilization of Gluten's advanced
 optimization techniques. And the integration of Gluten into Spark has
 already shown significant performance improvements with ClickHouse and
 Velox backends and has been successfully deployed in production by several
 customers.

 References:
 JIRA Ticket 
 SPIP Doc
 

 Your feedback and comments are welcome and appreciated.  Thanks.

 Thanks,
 Jia Ke

>>>


Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Wenchen Fan
It's good to reduce duplication between different native accelerators of
Spark, and AFAIK there is already a project trying to solve it:
https://substrait.io/

I'm not sure why we need to do this inside Spark, instead of doing
the unification for a wider scope (for all engines, not only Spark).


On Wed, Apr 10, 2024 at 10:11 AM Holden Karau 
wrote:

> I like the idea of improving the flexibility of Spark's physical plans and
> really anything that might reduce code duplication among the ~4 or so
> different accelerators.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for sharing, Jia.
>>
>> I have the same questions as in Weiting's previous thread.
>>
>> Do you think you can share the future milestone of Apache Gluten?
>> I'm wondering when the first stable release will come and how we can
>> coordinate across the ASF communities.
>>
>> > This project is still under active development now, and doesn't have a
>> stable release.
>> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>>
>> In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the
>> end of support.
>> 3.4 will have 3.4.3 next week, and 3.4.4 (another EOL release) is
>> scheduled for October.
>>
>> For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if
>> there is something we need to do on the Spark side.
>>
> +1 I think any changes need to target 4.0
>
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:
>>
>>> Apache Spark currently lacks an official mechanism to support
>>> cross-platform execution of physical plans. The Gluten project offers a
>>> mechanism that utilizes the Substrait standard to convert and optimize
>>> Spark's physical plans. By introducing Gluten's plan conversion,
>>> validation, and fallback mechanisms into Spark, we can significantly
>>> enhance the portability and interoperability of Spark's physical plans,
>>> enabling them to operate across a broader spectrum of execution
>>> environments without requiring users to migrate, while also improving
>>> Spark's execution efficiency through the utilization of Gluten's advanced
>>> optimization techniques. And the integration of Gluten into Spark has
>>> already shown significant performance improvements with ClickHouse and
>>> Velox backends and has been successfully deployed in production by several
>>> customers.
>>>
>>> References:
>>> JIRA Ticket 
>>> SPIP Doc
>>> 
>>>
>>> Your feedback and comments are welcome and appreciated.  Thanks.
>>>
>>> Thanks,
>>> Jia Ke
>>>
>>


Re: Versioning of Spark Operator

2024-04-10 Thread Dongjoon Hyun
Ya, that would work.

Inevitably, I looked at Apache Flink K8s Operator's JIRA and GitHub repo.

It looks reasonable to me.

Although they share the same JIRA, they choose different patterns per place.

1. In the POM file and Maven artifact, an independent version number:
`<version>1.8.0</version>`

2. Tag is also based on the independent version number
https://github.com/apache/flink-kubernetes-operator/tags
- release-1.8.0
- release-1.7.0

3. JIRA Fixed Version is `kubernetes-operator-` prefix.
https://issues.apache.org/jira/browse/FLINK-34957
> Fix Version/s: kubernetes-operator-1.9.0

Maybe, we can borrow this pattern.

I guess we need a vote for any further decision because we need to create
new `Versions` in Apache Spark JIRA.

Dongjoon.


Re: Versioning of Spark Operator

2024-04-10 Thread L. C. Hsieh
Yea, I guess, for example, the first release of Spark K8s Operator
would be something like 0.1.0 instead of 4.0.0.

It sounds hard to align with Spark versions because of that?


On Tue, Apr 9, 2024 at 10:15 AM Dongjoon Hyun  wrote:
>
> Ya, that's simple and possible.
>
> However, it may cause a lot of confusion because it implies that a new
> `Spark K8s Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same
> `Semantic Versioning` policy as Apache Spark 4.0.0.
>
> In addition, `Versioning` is directly related to the release cadence. It's
> unlikely for us to have `Spark K8s Operator` and `Spark Connect Go` releases
> at every Apache Spark maintenance release. For example, there are no commits
> in the Spark Connect Go repository.
>
> I believe the versioning and release cadence are related more to those
> subprojects' maturity.
>
> Dongjoon.
>
> On 2024/04/09 16:59:40 DB Tsai wrote:
> >  Aligning with Spark releases is sensible, as it allows us to guarantee 
> > that the Spark operator functions correctly with the new version while also 
> > maintaining support for previous versions.
> >
> > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> >
> > > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan  wrote:
> > >
> > >
> > > I am trying to understand if we can simply align with Spark's version
> > > for this?
> > > It makes the release and JIRA management much simpler for developers
> > > and intuitive for users.
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun wrote:
> > >> Hi, Liang-Chi.
> > >>
> > >> Thank you for leading Apache Spark K8s operator as a shepherd.
> > >>
> > >> I took a look at the `Apache Spark Connect Go` repo mentioned in the thread.
> > >> Sadly, there is no release at all and no activity in the last 6 months.
> > >> It seems to be the first time for the Apache Spark community to consider
> > >> these sister repositories (Go and K8s Operator).
> > >>
> > >> https://github.com/apache/spark-connect-go/commits/master/
> > >>
> > >> Dongjoon.
> > >>
> > >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > >> > Hi all,
> > >> >
> > >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > >> > and the first PR is created.
> > >> > Thank you for the review from the community so far.
> > >> >
> > >> > About the versioning of Spark Operator, there are questions.
> > >> >
> > >> > As we are using the Spark JIRA, when we go to merge PRs, we need to
> > >> > choose a Spark version. However, the Spark Operator is versioned
> > >> > differently than Spark. I'm wondering how we should deal with this?
> > >> >
> > >> > Not sure if Connect also has versioning different from Spark? If so,
> > >> > maybe we can follow what Connect does.
> > >> >
> > >> > Can someone who is familiar with Connect versioning give some 
> > >> > suggestions?
> > >> >
> > >> > Thank you.
> > >> >
> > >> > Liang-Chi
> > >> >