GitHub user GlutenPerfBot created a discussion: August 29, 2025: Weekly Status Update in Gluten
*This weekly update is generated by LLMs. You're welcome to join our [Github](https://github.com/apache/incubator-gluten/discussions) for in-depth discussions.*

## Overall Activity Summary

This week in the Gluten community has been highly productive, with a strong focus on advancing Flink integration, enhancing data lake write capabilities, and performing significant core code refactoring. Development on the Velox backend remains very active, with major strides in Iceberg support and shuffle performance. We also saw important build system improvements and continued efforts to simplify the codebase for better long-term maintainability.

## Key Ongoing Projects

Several key initiatives are driving the project forward, thanks to our dedicated contributors:

* **Flink Integration Advancement:** The Flink backend is rapidly maturing. A major effort to support all remaining operators for the Nexmark benchmark is underway in #10548 by @shuai-xu. Several fixes and enhancements were also contributed, including a fix for JDK 17 compatibility in #10572 by @KevinyhZou.
* **Data Lake Write Capabilities:** We are significantly expanding our data lake support. A key initiative is adding support for Iceberg partitioned writes in #10497 by @jinchengchenghh, which builds upon the recently merged support for Iceberg table overwrites.
* **Core Engine Refactoring:** A consistent effort to improve code health and simplicity is being led by @beliefer, who submitted a series of PRs to remove redundant code and simplify core logic, such as in #10553 and #10545.
* **Performance Optimizations:**
  * A significant architectural improvement for the shuffle reader is proposed in #10499 by @marin-ma, which aims to merge input streams to accelerate sort-based shuffle reads.
  * Support for columnar partial generate for Hive UDTFs was introduced in #10475 by @jiangjiangtian, reducing costly row-to-columnar conversions and improving performance.
* **Tooling and Benchmark Enhancements:** The `gluten-it` integration testing framework is being improved by @zhztheplayer, with work to support Delta tables in benchmarks (#10562) and clean up Maven dependencies (#10563).

## Priority Items

We encourage the community to review and provide feedback on these important pull requests:

* **#10497** by @jinchengchenghh: This PR adds critical support for partitioned writes in Iceberg. Community review is needed to validate this important data lake feature.
* **#10499** by @marin-ma: This proposes a new `ColumnarShuffleReader` to boost shuffle performance. Feedback on this architectural change is highly valuable.
* **#10475** by @jiangjiangtian: A large and impactful PR that adds columnar partial generation for Hive UDTFs. More eyes on this would help ensure its stability and performance benefits.
* **#9107** by @Yifeng-Wang: This long-running PR to support `CharVarcharCodegenUtils` has seen extensive discussion and could use a final review to move it toward merging.

## Notable Discussions

Several important conversations are shaping the future of Gluten:

* **#10188**: A significant proposal has been made to add a new "Omni" backend, which is highly optimized for ARM platforms. This could open up new possibilities for Gluten's ecosystem.
* **#10407**: An important discussion is ongoing regarding the plan to drop support for Spark 3.2, which would allow for significant code simplification and modernization.
* **#8849**: With the recent surge in Flink development, this discussion on overall Flink support remains a key place for strategic alignment and community input.

## Emerging Trends

Based on this week's activity, we've identified several key trends:

* **Maturing Flink Support:** The focus has shifted from foundational work to enabling complex, real-world benchmarks like Nexmark, signaling that the Flink backend is becoming ready for more serious evaluation.
* **Comprehensive Data Lake Functionality:** The project is moving beyond basic read support for data lake formats. The active development of advanced write operations for Iceberg positions Gluten as a more complete and powerful engine for modern data lakehouse architectures.
* **Sustained Focus on Code Health:** The high volume of refactoring PRs demonstrates a strong commitment to reducing technical debt and improving the maintainability of the core engine, which is crucial for the project's long-term health.

## Good First Issues

Looking to make your first contribution to Gluten? These issues are well-defined and a great way to get started:

* **#4730**: Add support for the `date_from_unix_date` function in the ClickHouse backend.
* **#6807**: Implement the `split_part` function for the ClickHouse backend.
* **#6812**: Add support for the `SparkPartitionID` function in the ClickHouse backend.
* **#6814**: Implement the `MakeYMInterval` expression for the ClickHouse backend.

These issues are excellent entry points for contributors with some C++ and Scala/Java experience. They involve implementing a single, well-scoped function, allowing you to get familiar with Gluten's expression framework and contribution process without needing to understand the entire system.

Welcome to the community

GitHub link: https://github.com/apache/incubator-gluten/discussions/10586
