malinjawi opened a new issue, #12025: URL: https://github.com/apache/gluten/issues/12025
### Background

This tracks the writer-support slice of the broader lakehouse connector goal: better Gluten support for Iceberg and Delta ingestion workloads.

The specific scope here is:

- improve writer support for partitioned writes
- improve writer support for non-partitioned writes
- improve writer support for sorted / layout-aware writes

This is intended to track execution and correctness work across Delta and Iceberg writer paths, separate from reader-side MOR improvements.

### Motivation

Lakehouse ingestion workloads are sensitive to writer behavior because file layout, partition handling, sorting/distribution, statistics, and commit metadata directly affect both write throughput and downstream read performance.

Gluten should keep more of these write paths native where possible, while preserving table-format metadata correctness.

### Scope

Areas to investigate and improve:

- non-partitioned Delta and Iceberg native writes
- partitioned Delta and Iceberg native writes
- sorted, clustered, or distribution-aware write paths where the table format exposes layout requirements
- command-driven writer paths such as Delta OPTIMIZE compaction, where applicable
- file-level statistics and metadata correctness for native writes
- unnecessary row/columnar transitions around write, commit, and stats collection paths
- benchmark coverage for ingestion-style workloads

### Expected outcomes

- native writer paths are used consistently for supported Delta/Iceberg write modes
- partition values, stats, file metadata, and commit logs remain correct
- sorted/layout-aware writes either stay native end-to-end where supported or expose clear fallback diagnostics
- regressions are covered by focused correctness tests
- performance impact is measured with write-heavy benchmarks

### Suggested test coverage

- Delta non-partitioned writes
- Delta partitioned writes
- Delta layout commands such as OPTIMIZE compaction and, separately, OPTIMIZE ZORDER where native expressions are supported
- Iceberg non-partitioned writes
- Iceberg partitioned writes
- Iceberg sorted/distribution-aware writes where applicable
- metadata validation for partition values, stats, and data files
- native/fallback plan assertions

### Suggested benchmark coverage

- write-heavy Delta and Iceberg ingestion workloads
- partitioned and non-partitioned variants
- sorted/layout-aware variants
- larger scale data where write volume dominates planning and commit overhead

### Non-goals

- reader-side MOR improvements
- Delta deletion vector scan integration
- broad native support for all table-format maintenance commands in a single patch

Those should be tracked separately so this issue can stay focused on writer support.
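
As a rough illustration of the "native/fallback plan assertions" item under suggested test coverage, the sketch below scans the text of a physical plan for row/columnar transition operators. It is not taken from the Gluten test suite. The helper names `find_transitions` and `assert_no_transitions`, the regular expression, and the operator names in the sample plans are all assumptions made for illustration, and a real test would more likely inspect the executed plan tree directly than match on plan text.

```python
import re

# Assumed naming pattern for transition operators (e.g. "ColumnarToRow",
# "RowToColumnar", or backend-prefixed variants). Adjust it to the operator
# names that actually appear in your plan output.
TRANSITION_PATTERN = re.compile(
    r"\b(\w*ColumnarToRow\w*|\w*RowTo\w*Columnar\w*)\b"
)


def find_transitions(plan_text: str) -> list[str]:
    """Return every row/columnar transition operator named in a plan string."""
    return [m.group(1) for m in TRANSITION_PATTERN.finditer(plan_text)]


def assert_no_transitions(plan_text: str) -> None:
    """Raise AssertionError if the plan text contains any transition operator."""
    found = find_transitions(plan_text)
    if found:
        raise AssertionError(f"unexpected row/columnar transitions: {found}")


# Hypothetical plan snippets, hand-written for this example only.
native_plan = """
NativeWriteFiles
+- NativeScan parquet
"""

fallback_plan = """
WriteFiles
+- ColumnarToRow
   +- NativeScan parquet
"""

assert_no_transitions(native_plan)      # passes silently
print(find_transitions(fallback_plan))  # lists the transition that was found
```

A check of this shape could back both outcomes listed above: a supported write mode asserts that no transitions appear, while an unsupported one asserts that the expected fallback operator is present.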
