I read the SPIP. I have a number of points, if I may:

- Maturity of Gluten: as the excerpt mentions, Gluten is an incubating
project, and its feature set and stability are, in my opinion, still under
development. Integrating a non-core component could introduce risk if it is
not fully mature.
- Complexity: integrating Gluten's functionality into Spark might add
complexity to the codebase, potentially increasing maintenance
overhead. Users might also need to learn Gluten's functionality and its
limitations in order to use it effectively.
- Performance Overhead: the plan conversion process itself could introduce
some overhead compared to native Spark execution. The effectiveness of
Gluten's performance optimizations might vary depending on the specific
engine and workload.
- Potential compatibility issues: not all data processing engines might
have complete support for the Substrait standard, potentially limiting
the universality of the approach. There could be edge cases where plan
conversion or execution on a specific engine leads to unexpected behavior.
- Security: If other engines have different security models or access
controls, integrating them with Spark might require additional security
considerations.
- Cloud integration: integration and support in managed cloud environments
would also need to be considered.
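
To make the validation/fallback point above concrete, here is a minimal,
hypothetical sketch in plain Python. This is NOT Gluten's or Spark's real
API; all class and function names are invented for illustration. It models
per-node validation of a physical plan, offloading supported subtrees to a
"native" backend and falling back to vanilla Spark for the rest:

```python
# Hypothetical sketch only: these classes and names are invented and are
# not Gluten's or Spark's actual API.
from dataclasses import dataclass


class PlanNode:
    """Base class for toy physical-plan nodes."""


@dataclass
class Scan(PlanNode):
    table: str


@dataclass
class Filter(PlanNode):
    cond: str
    child: PlanNode


@dataclass
class Sort(PlanNode):
    keys: list
    child: PlanNode


# Pretend the native backend only supports scans and filters.
SUPPORTED = (Scan, Filter)


def plan(node: PlanNode) -> str:
    """Validate each node: offload supported subtrees to the "native"
    engine, otherwise fall back to vanilla Spark for that subtree."""
    if not isinstance(node, SUPPORTED):
        return f"sparkFallback({type(node).__name__})"
    if isinstance(node, Scan):
        return f"native(Scan({node.table}))"
    return f"native(Filter({node.cond}, {plan(node.child)}))"
```

In the real project the mechanism operates on Spark's SparkPlan trees and a
Substrait representation, with far more nuance than this toy model; the
point is simply that every node must be validated, and unsupported
operators must fall back cleanly, which is where the complexity and
overhead concerns above come from.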

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom


View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Wed, 10 Apr 2024 at 12:33, Wenchen Fan <cloud0...@gmail.com> wrote:

> It's good to reduce duplication between different native accelerators of
> Spark, and AFAIK there is already a project trying to solve it:
> https://substrait.io/
>
> I'm not sure why we need to do this inside Spark, instead of doing
> the unification for a wider scope (for all engines, not only Spark).
>
>
> On Wed, Apr 10, 2024 at 10:11 AM Holden Karau <holden.ka...@gmail.com>
> wrote:
>
>> I like the idea of improving the flexibility of Spark's physical plans and
>> really anything that might reduce code duplication among the ~4 or so
>> different accelerators.
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Thank you for sharing, Jia.
>>>
>>> I have the same questions as in Weiting's previous thread.
>>>
>>> Do you think you can share the future milestone of Apache Gluten?
>>> I'm wondering when the first stable release will come and how we can
>>> coordinate across the ASF communities.
>>>
>>> > This project is still under active development now, and doesn't have a
>>> stable release.
>>> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>>>
>>> In the Apache Spark community, Apache Spark 3.2 and 3.3 are at the end
>>> of support. And 3.4 will get 3.4.3 next week, with 3.4.4 (another EOL
>>> release) scheduled for October.
>>>
>>> For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only if
>>> there is something we need to do from Spark side.
>>>
>> +1 I think any changes need to target 4.0
>>
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia <kejia1...@gmail.com> wrote:
>>>
>>>> Apache Spark currently lacks an official mechanism to support
>>>> cross-platform execution of physical plans. The Gluten project offers a
>>>> mechanism that utilizes the Substrait standard to convert and optimize
>>>> Spark's physical plans. By introducing Gluten's plan conversion,
>>>> validation, and fallback mechanisms into Spark, we can significantly
>>>> enhance the portability and interoperability of Spark's physical plans,
>>>> enabling them to operate across a broader spectrum of execution
>>>> environments without requiring users to migrate, while also improving
>>>> Spark's execution efficiency through Gluten's advanced optimization
>>>> techniques. The integration of Gluten into Spark has already shown
>>>> significant performance improvements with the ClickHouse and Velox
>>>> backends and has been successfully deployed in production by several
>>>> customers.
>>>>
>>>> References:
>>>> JIRA Ticket <https://issues.apache.org/jira/browse/SPARK-47773>
>>>> SPIP Doc
>>>> <https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing>
>>>>
>>>> Your feedback and comments are welcome and appreciated.  Thanks.
>>>>
>>>> Thanks,
>>>> Jia Ke
>>>>
>>>
