[D] August 29, 2025: Weekly Status Update in Gluten [incubator-gluten]

via GitHub Fri, 29 Aug 2025 13:28:44 -0700


GitHub user GlutenPerfBot created a discussion: August 29, 2025: Weekly Status 
Update in Gluten


*This weekly update is generated by LLMs. You're welcome to join our 
[GitHub](https://github.com/apache/incubator-gluten/discussions) for in-depth 
discussions.*

## Overall Activity Summary
This week in the Gluten community has been highly productive, with a strong 
focus on advancing Flink integration, enhancing data lake write capabilities, 
and performing significant core code refactoring. Development on the Velox 
backend remains very active, with major strides in Iceberg support and shuffle 
performance. We also saw important build system improvements and continued 
efforts to simplify the codebase for better long-term maintainability.

## Key Ongoing Projects
Several key initiatives are driving the project forward, thanks to our 
dedicated contributors:

*   **Flink Integration Advancement:** The Flink backend is rapidly maturing. A 
major effort to support all remaining operators for the Nexmark benchmark is 
underway in #10548 by @shuai-xu. Several fixes and enhancements were also 
contributed, including a fix for JDK 17 compatibility in #10572 by @KevinyhZou.
*   **Data Lake Write Capabilities:** We are significantly expanding our data 
lake support. A key initiative is adding support for Iceberg partitioned writes 
in #10497 by @jinchengchenghh, which builds upon the recently merged support 
for Iceberg table overwrites.
*   **Core Engine Refactoring:** A consistent effort to improve code health and 
simplicity is being led by @beliefer, who submitted a series of PRs to remove 
redundant code and simplify core logic, such as in #10553 and #10545.
*   **Performance Optimizations:**
    *   A significant architectural improvement for the shuffle reader is 
proposed in #10499 by @marin-ma, which aims to merge input streams to 
accelerate sort-based shuffle reads.
    *   Support for columnar partial generate for Hive UDTFs was introduced in 
#10475 by @jiangjiangtian, reducing costly row-to-columnar conversions and 
improving performance.
*   **Tooling and Benchmark Enhancements:** The `gluten-it` integration testing 
framework is being improved by @zhztheplayer, with work to support Delta tables 
in benchmarks (#10562) and clean up Maven dependencies (#10563).

## Priority Items
We encourage the community to review and provide feedback on these important 
pull requests:

*   **#10497** by @jinchengchenghh: This PR adds critical support for 
partitioned writes in Iceberg. Community review is needed to validate this 
important data lake feature.
*   **#10499** by @marin-ma: This proposes a new `ColumnarShuffleReader` to 
boost shuffle performance. Feedback on this architectural change is highly 
valuable.
*   **#10475** by @jiangjiangtian: This large PR adds columnar partial 
generation for Hive UDTFs. Additional reviewers would help confirm its 
stability and performance benefits.
*   **#9107** by @Yifeng-Wang: This long-running PR to support 
`CharVarcharCodegenUtils` has seen extensive discussion and could use a final 
review to move it toward merging.

## Notable Discussions
Several important conversations are shaping the future of Gluten:

*   **#10188**: A significant proposal has been made to add a new "Omni" 
backend, which is highly optimized for ARM platforms. This could open up new 
possibilities for Gluten's ecosystem.
*   **#10407**: An important discussion is ongoing regarding the plan to drop 
support for Spark 3.2, which would allow for significant code simplification 
and modernization.
*   **#8849**: With the recent surge in Flink development, this discussion on 
overall Flink support remains a key place for strategic alignment and community 
input.

## Emerging Trends
Based on this week's activity, we've identified several key trends:

*   **Maturing Flink Support:** The focus has shifted from foundational work to 
enabling complex, real-world benchmarks like Nexmark, signaling that the Flink 
backend is becoming ready for more serious evaluation.
*   **Comprehensive Data Lake Functionality:** The project is moving beyond 
basic read support for data lake formats. The active development of advanced 
write operations for Iceberg positions Gluten as a more complete and powerful 
engine for modern data lakehouse architectures.
*   **Sustained Focus on Code Health:** The high volume of refactoring PRs 
demonstrates a strong commitment to reducing technical debt and improving the 
maintainability of the core engine, which is crucial for the project's 
long-term health.

## Good First Issues
Looking to make your first contribution to Gluten? These issues are 
well-defined and a great way to get started:

*   **#4730**: Add support for the `date_from_unix_date` function in the 
ClickHouse backend.
*   **#6807**: Implement the `split_part` function for the ClickHouse backend.
*   **#6812**: Add support for the `SparkPartitionID` function in the 
ClickHouse backend.
*   **#6814**: Implement the `MakeYMInterval` expression for the ClickHouse 
backend.

These issues are excellent entry points for contributors with some C++ and 
Scala/Java experience. Each involves implementing a single, well-scoped 
function, allowing you to get familiar with Gluten's expression framework and 
contribution process without needing to understand the entire system. Welcome 
to the community!
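For contributors sizing up the function issues above, the following is a minimal sketch of the behavior two of them are expected to match, written as plain Python reference functions. It is based on the documented Spark SQL semantics of `date_from_unix_date` and `split_part` as best understood here; the Python helper names and signatures are illustrative assumptions and are not Gluten or ClickHouse code. Check edge cases against the Spark documentation and the linked issues before relying on them.

```python
from datetime import date, timedelta
from typing import Optional


def date_from_unix_date(days: Optional[int]) -> Optional[date]:
    """Reference sketch: days since 1970-01-01 converted to a calendar date."""
    if days is None:
        return None  # assumed: null in, null out
    return date(1970, 1, 1) + timedelta(days=days)


def split_part(s: Optional[str], delimiter: Optional[str],
               part_num: Optional[int]) -> Optional[str]:
    """Reference sketch: return the 1-based part of s split by a literal delimiter.

    Assumed semantics: negative part_num counts from the end, an
    out-of-range part yields an empty string, 0 is an error, and an
    empty delimiter leaves the string unsplit.
    """
    if s is None or delimiter is None or part_num is None:
        return None
    if part_num == 0:
        raise ValueError("part_num must not be 0")
    parts = [s] if delimiter == "" else s.split(delimiter)
    # Map the 1-based (or negative) position to a 0-based list index.
    index = part_num - 1 if part_num > 0 else len(parts) + part_num
    if 0 <= index < len(parts):
        return parts[index]
    return ""


if __name__ == "__main__":
    print(date_from_unix_date(1))
    print(split_part("11.12.13", ".", 3))
    print(split_part("11.12.13", ".", -1))
```

A reference like this can double as a source of expected values when writing unit tests for a native implementation.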

GitHub link: https://github.com/apache/incubator-gluten/discussions/10586

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

