laskoviymishka opened a new issue, #832:
URL: https://github.com/apache/iceberg-go/issues/832

   ### Feature Request / Improvement
   Parent: #829 (v2 spec completion)
   All building blocks for compaction exist:
   - `PlanFiles` with delete file matching (scanner.go)
   - `ReadTasks` for pre-planned tasks (#781) — materializes rows with position 
(#825, #762) and equality (#818) deletes applied
   - `WriteRecords` with partitioned fanout (#622) and rolling file size (#759)
   - `ReplaceDataFilesWithDataFiles` / `AddDataFiles` (#723) for atomic commits
   - `SlicePacker` for bin-packing (internal/utils.go)
   Two things are missing: a top-level API that wires these pieces together, and 
the ability to remove delete files in the same commit.
   Without compaction, tables under update/delete workloads accumulate equality 
delete files (#809, #823), and read performance degrades with every commit.
   
   ### Key gap: delete file removal
   `overwriteFiles.deletedEntries()` in `snapshot_producers.go` explicitly 
keeps only `EntryContentData` entries. As a result, after compaction the 
position/equality delete files that covered the rewritten data files are left 
orphaned in the manifests. The overwrite producer needs to remove delete files 
alongside the data file replacement.
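
The difference between the current filter and the desired one can be sketched as follows. This is an illustrative model only: `manifestEntry`, `entryContent`, and this standalone `deletedEntries` function are hypothetical stand-ins written for the example, not the actual types or method in `snapshot_producers.go`.

```go
package main

import "fmt"

// entryContent is a hypothetical stand-in for the manifest entry content kind.
type entryContent int

const (
	contentData entryContent = iota
	contentPosDeletes
	contentEqDeletes
)

// manifestEntry is a hypothetical, reduced manifest entry.
type manifestEntry struct {
	Path    string
	Content entryContent
}

// deletedEntries returns the entries that should be marked as deleted in the
// new snapshot. With includeDeletes=false it models the data-only filter the
// issue describes; with includeDeletes=true it also returns position and
// equality delete files that the caller listed in toRemove.
func deletedEntries(entries []manifestEntry, toRemove map[string]struct{}, includeDeletes bool) []manifestEntry {
	var out []manifestEntry
	for _, e := range entries {
		if _, ok := toRemove[e.Path]; !ok {
			continue // not part of this rewrite
		}
		if e.Content != contentData && !includeDeletes {
			continue // data-only filter skips delete files
		}
		out = append(out, e)
	}
	return out
}

func main() {
	entries := []manifestEntry{
		{"data-a.parquet", contentData},
		{"data-b.parquet", contentData},
		{"pos-del-1.parquet", contentPosDeletes},
		{"eq-del-1.parquet", contentEqDeletes},
	}
	toRemove := map[string]struct{}{
		"data-a.parquet":    {},
		"pos-del-1.parquet": {},
		"eq-del-1.parquet":  {},
	}
	fmt.Println("data only:   ", len(deletedEntries(entries, toRemove, false)))
	fmt.Println("with deletes:", len(deletedEntries(entries, toRemove, true)))
}
```

Deciding which delete files are safe to list in `toRemove` is the harder part and is not modeled here; a delete file may only be dropped once no remaining data file is still covered by it.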
   
   ### Nice to have: CLI
   
   ```
       $ iceberg compact analyze db.events
       Compaction Plan for db.events
         Files scanned:        1,247
         Files to rewrite:        89   (7.1%)
         Compaction groups:        12
         Est. size change:      2.3 GB → 1.8 GB  (-22%)
       $ iceberg compact run db.events --partial-progress
       Compacting db.events...
         [1/12] date=2024-01-15: 12 files → 2 files ✓
         [2/12] date=2024-01-16:  8 files → 1 file  ✓
       Done. Rewrote 89 → 15 files. Removed 23 delete files.
   ```
   
   ### Related
   Compaction is the v2 approach to read-performance degradation under deletes. 
The v3 approach (deletion vectors) is tracked in #589: a puffin reader/writer 
exists, but scanner integration is still TBD. The two are complementary: DVs 
reduce write amplification, while compaction is still needed for file 
consolidation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

