malinjawi opened a new issue, #12025:
URL: https://github.com/apache/gluten/issues/12025

   ### Background
   
   This tracks the writer-support slice of the broader lakehouse connector 
goal: better Gluten support for Iceberg and Delta ingestion workloads.
   
   The specific scope here is:
   
   - improve writer support for partitioned writes
   - improve writer support for non-partitioned writes
   - improve writer support for sorted / layout-aware writes
   
   This is intended to track execution and correctness work across Delta and 
Iceberg writer paths, separate from reader-side MOR improvements.
   
   ### Motivation
   
   Lakehouse ingestion workloads are sensitive to writer behavior because file 
layout, partition handling, sorting/distribution, statistics, and commit 
metadata directly affect both write throughput and downstream read performance. 
Gluten should keep more of these write paths native where possible, while 
preserving table-format metadata correctness.
   
   ### Scope
   
   Areas to investigate and improve:
   
   - non-partitioned Delta and Iceberg native writes
   - partitioned Delta and Iceberg native writes
   - sorted, clustered, or distribution-aware write paths where the table 
format exposes layout requirements
   - command-driven writer paths such as Delta OPTIMIZE compaction, where 
applicable
   - file-level statistics and metadata correctness for native writes
   - unnecessary row/columnar transitions around write, commit, and stats 
collection paths
   - benchmark coverage for ingestion-style workloads
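
To make the row/columnar transition item concrete, here is a minimal sketch of a check that could run over the text of an executed plan. It assumes that transition operators appear in the plan string with names containing `ColumnarToRow` or matching `RowTo...Columnar`; that naming pattern, the helper names, and the sample plan text are illustrative assumptions, not verified Gluten output.

```python
import re

# Assumption: transition operators can be recognized by name in the plan text.
# The pattern is a guess at the naming convention and should be adjusted to
# whatever the backend actually prints.
TRANSITION_PATTERN = re.compile(r"\b(\w*ColumnarToRow\w*|RowTo\w*Columnar\w*)\b")


def find_transitions(plan_text):
    """Return the transition operator names found in a plan string, in order."""
    return [m.group(1) for m in TRANSITION_PATTERN.finditer(plan_text)]


def assert_max_transitions(plan_text, limit):
    """Fail if the plan contains more transitions than the test allows."""
    found = find_transitions(plan_text)
    if len(found) > limit:
        raise AssertionError(
            f"expected at most {limit} transitions, found {len(found)}: {found}"
        )
    return found


# Hand-written plan text for illustration only (hypothetical, not captured
# from a real run).
sample_plan = """\
Execute InsertIntoHadoopFsRelationCommand
+- ColumnarToRow
   +- ProjectExecTransformer
      +- RowToVeloxColumnar
         +- Scan ExistingRDD
"""

print(find_transitions(sample_plan))
```

A test for a fully native write path would call `assert_max_transitions` with the lowest limit that path is expected to reach, so that a newly introduced transition fails loudly instead of silently costing throughput.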
   
   ### Expected outcomes
   
   - native writer paths are used consistently for supported Delta/Iceberg 
write modes
   - partition values, stats, file metadata, and commit logs remain correct
   - sorted/layout-aware writes either stay native end-to-end where supported 
or expose clear fallback diagnostics
   - regressions are covered by focused correctness tests
   - performance impact is measured with write-heavy benchmarks
   
   ### Suggested test coverage
   
   - Delta non-partitioned writes
   - Delta partitioned writes
   - Delta layout commands such as OPTIMIZE compaction and, separately, 
OPTIMIZE ZORDER where native expressions are supported
   - Iceberg non-partitioned writes
   - Iceberg partitioned writes
   - Iceberg sorted/distribution-aware writes where applicable
   - metadata validation for partition values, stats, and data files
   - native/fallback plan assertions
   
   ### Suggested benchmark coverage
   
   - write-heavy Delta and Iceberg ingestion workloads
   - partitioned and non-partitioned variants
   - sorted/layout-aware variants
   - larger-scale data, where write volume dominates planning and commit overhead
   
   ### Non-goals
   
   - reader-side MOR improvements
   - Delta deletion vector scan integration
   - broad native support for all table-format maintenance commands in a single 
patch
   
   Those should be tracked separately so this issue can stay focused on writer 
support.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

