The Gluten Java part is pretty stable now. Most of the development is in the 
C++ code, in the Velox code as well as in the ClickHouse backend.

The SPIP doesn't plan to introduce the whole Gluten stack into Spark, but 
rather a way to serialize a Spark physical plan and send it to a native 
backend, through JNI or gRPC. Currently Spark has no API for this. The 
physical plan format could be Substrait or an extended Spark Connect plan. 
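To make the proposal concrete, here is a minimal sketch of the API shape being discussed: Spark serializes a physical plan into a backend-neutral byte format (e.g. Substrait protobuf) and hands it to a native backend over JNI or gRPC. All names here (PhysicalPlanSerializer, NativeBackend) are invented for illustration; no such API exists in Spark today, and a real plan would not be a plain string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical contract a native backend (Velox, ClickHouse, ...) would
// implement. A real backend would return columnar batches; here we just
// echo the plan to show the handoff round trip.
interface NativeBackend {
    String execute(byte[] serializedPlan);
}

// Stand-in for a Substrait (or extended Spark Connect) encoder of a
// Spark physical plan. Purely illustrative.
final class PhysicalPlanSerializer {
    static byte[] serialize(String planDescription) {
        return planDescription.getBytes(StandardCharsets.UTF_8);
    }
}

public class PlanHandoffSketch {
    public static void main(String[] args) {
        // The backend boundary could equally be a JNI call or a gRPC stub;
        // the point is that only serialized bytes cross it.
        NativeBackend backend =
            plan -> "executed: " + new String(plan, StandardCharsets.UTF_8);
        byte[] plan = PhysicalPlanSerializer.serialize("Scan -> Filter -> Aggregate");
        System.out.println(backend.execute(plan));
    }
}
```

The key design point is that the only coupling between Spark and the backend is the serialized plan format, which is why a standard like Substrait matters here.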

On 2024/04/10 12:34:26 Mich Talebzadeh wrote:
> I read the SPIP. I have a number of points, if I may:
> 
> - Maturity of Gluten: as the excerpt mentions, Gluten's feature set and
> stability are IMO still under development. Integrating a non-core component
> could introduce risks if it is not fully mature.
> - Complexity: integrating Gluten's functionality into Spark might add
> complexity to the codebase, increasing maintenance overhead. Users might
> need to learn about Gluten's functionality and limitations to use it
> effectively.
> - Performance overhead: the plan conversion process itself could introduce
> some overhead compared to native Spark execution. The effectiveness of
> Gluten's performance optimizations might vary depending on the specific
> engine and workload.
> - Potential compatibility issues: not all data processing engines may have
> complete support for the Substrait standard, potentially limiting the
> universality of the approach. There could be edge cases where plan
> conversion or execution on a specific engine leads to unexpected behavior.
> - Security: if other engines have different security models or access
> controls, integrating them with Spark might require additional security
> considerations.
> - Integration and support in the cloud.
> 
> HTH
> 
> Mich Talebzadeh
> Technologist | Solutions Architect | Data Engineer | Generative AI
> London, United Kingdom
> 
> LinkedIn: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> 
> On Wed, 10 Apr 2024 at 12:33, Wenchen Fan <cloud0...@gmail.com> wrote:
> 
> > It's good to reduce duplication between different native accelerators of
> > Spark, and AFAIK there is already a project trying to solve it:
> > https://substrait.io/
> >
> > I'm not sure why we need to do this inside Spark, instead of doing
> > the unification for a wider scope (for all engines, not only Spark).
> >
> >
> > On Wed, Apr 10, 2024 at 10:11 AM Holden Karau <holden.ka...@gmail.com>
> > wrote:
> >
> >> I like the idea of improving the flexibility of Spark's physical plans
> >> and really anything that might reduce code duplication among the ~4 or
> >> so different accelerators.
> >>
> >> Twitter: https://twitter.com/holdenkarau
> >> Books (Learning Spark, High Performance Spark, etc.):
> >> https://amzn.to/2MaRAG9
> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >>
> >>
> >> On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> >> wrote:
> >>
> >>> Thank you for sharing, Jia.
> >>>
> >>> I have the same questions as in Weiting's previous thread.
> >>>
> >>> Do you think you can share the future milestone of Apache Gluten?
> >>> I'm wondering when the first stable release will come and how we can
> >>> coordinate across the ASF communities.
> >>>
> >>> > This project is still under active development now, and doesn't have a
> >>> stable release.
> >>> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> >>>
> >>> In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached
> >>> end of support.
> >>> 3.4 will get 3.4.3 next week, and 3.4.4 (another EOL release) is
> >>> scheduled for October.
> >>>
> >>> For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only, if
> >>> there is something we need to do from the Spark side.
> >>>
> >> +1 I think any changes need to target 4.0
> >>
> >>>
> >>> Thanks,
> >>> Dongjoon.
> >>>
> >>>
> >>> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia <kejia1...@gmail.com> wrote:
> >>>
> >>>> Apache Spark currently lacks an official mechanism to support
> >>>> cross-platform execution of physical plans. The Gluten project offers a
> >>>> mechanism that utilizes the Substrait standard to convert and optimize
> >>>> Spark's physical plans. By introducing Gluten's plan conversion,
> >>>> validation, and fallback mechanisms into Spark, we can significantly
> >>>> enhance the portability and interoperability of Spark's physical plans,
> >>>> enabling them to operate across a broader spectrum of execution
> >>>> environments without requiring users to migrate, while also improving
> >>>> Spark's execution efficiency through Gluten's advanced optimization
> >>>> techniques. The integration of Gluten into Spark has already shown
> >>>> significant performance improvements with the ClickHouse and Velox
> >>>> backends and has been successfully deployed in production by several
> >>>> customers.
> >>>>
> >>>> References:
> >>>> JIRA Ticket <https://issues.apache.org/jira/browse/SPARK-47773>
> >>>> SPIP Doc
> >>>> <https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing>
> >>>>
> >>>> Your feedback and comments are welcome and appreciated.  Thanks.
> >>>>
> >>>> Thanks,
> >>>> Jia Ke
> >>>>
> >>>
> 
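The plan conversion, validation, and fallback mechanism described in the SPIP discussion can be sketched roughly as follows. All names here are invented for illustration and are not Gluten's actual API: the idea is that each physical operator is offered to the native backend, and any operator the backend cannot validate falls back to vanilla Spark execution instead of failing the whole query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class FallbackSketch {
    // Operators the hypothetical native backend claims to support.
    static final Set<String> NATIVE_SUPPORTED = Set.of("Scan", "Filter", "Project");

    // Returns a per-operator execution assignment: native where the
    // backend validates the operator, vanilla Spark otherwise.
    static List<String> planExecution(List<String> operators) {
        return operators.stream()
            .map(op -> NATIVE_SUPPORTED.contains(op) ? op + "@native" : op + "@spark")
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList("Scan", "Filter", "CustomUDF", "Project");
        // CustomUDF is not validated by the backend, so it alone falls
        // back to Spark while the rest of the plan stays native.
        System.out.println(planExecution(plan));
    }
}
```

In practice the fallback boundary also needs columnar-to-row transitions at each handoff point, which is part of the overhead the thread discusses; this sketch ignores that for brevity.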
