laskoviymishka opened a new issue, #832:
URL: https://github.com/apache/iceberg-go/issues/832
### Feature Request / Improvement
Parent: #829 (v2 spec completion)
All building blocks for compaction exist:
- `PlanFiles` with delete file matching (scanner.go)
- `ReadTasks` for pre-planned tasks (#781) — materializes rows with position
(#825, #762) and equality (#818) deletes applied
- `WriteRecords` with partitioned fanout (#622) and rolling file size (#759)
- `ReplaceDataFilesWithDataFiles` / `AddDataFiles` (#723) for atomic commits
- `SlicePacker` for bin-packing (internal/utils.go)
What's missing is the top-level API that wires them together, and the
ability to remove delete files in the same commit.
Without compaction, tables under Update/Delete workloads accumulate equality
delete files (#809, #823) and read performance degrades with every commit.
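As a rough illustration of the planning step such a top-level API would need, the sketch below groups small files per partition and packs them toward a target size. It is a minimal sketch under stated assumptions: `dataFile`, `planGroups`, and both thresholds are hypothetical names made up for this example, and the hand-rolled first-fit packing merely stands in for `SlicePacker`, whose signature is not assumed here.

```go
package main

import (
	"fmt"
	"sort"
)

// dataFile is a hypothetical stand-in for a planned data file.
// It is NOT the iceberg-go DataFile type.
type dataFile struct {
	Path      string
	Partition string
	SizeBytes int64
}

// planGroups selects files smaller than smallThreshold, groups them by
// partition, and packs each partition's candidates into groups whose total
// size stays at or below targetBytes (first-fit decreasing). Groups with a
// single file are skipped in this sketch, since rewriting them merges nothing.
func planGroups(files []dataFile, smallThreshold, targetBytes int64) map[string][][]dataFile {
	byPartition := map[string][]dataFile{}
	for _, f := range files {
		if f.SizeBytes < smallThreshold {
			byPartition[f.Partition] = append(byPartition[f.Partition], f)
		}
	}

	plan := map[string][][]dataFile{}
	for part, candidates := range byPartition {
		// Largest first, so big files claim bins before small ones fill gaps.
		sort.SliceStable(candidates, func(i, j int) bool {
			return candidates[i].SizeBytes > candidates[j].SizeBytes
		})

		var bins [][]dataFile
		var binSizes []int64
		for _, f := range candidates {
			placed := false
			for i := range bins {
				if binSizes[i]+f.SizeBytes <= targetBytes {
					bins[i] = append(bins[i], f)
					binSizes[i] += f.SizeBytes
					placed = true
					break
				}
			}
			if !placed {
				bins = append(bins, []dataFile{f})
				binSizes = append(binSizes, f.SizeBytes)
			}
		}

		for _, b := range bins {
			if len(b) > 1 {
				plan[part] = append(plan[part], b)
			}
		}
	}
	return plan
}

func main() {
	files := []dataFile{
		{"a.parquet", "date=2024-01-15", 40},
		{"b.parquet", "date=2024-01-15", 30},
		{"c.parquet", "date=2024-01-15", 30},
		{"d.parquet", "date=2024-01-15", 200}, // already large, left alone
		{"e.parquet", "date=2024-01-16", 10},  // lone small file, skipped
	}
	for part, groups := range planGroups(files, 100, 100) {
		for _, g := range groups {
			fmt.Printf("%s: rewrite %d files\n", part, len(g))
		}
	}
}
```

In a real implementation each group would presumably be read via `ReadTasks`, rewritten via `WriteRecords`, and committed as one replacement. A file with many deletes applied may also be worth rewriting on its own, which this sketch does not model.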
### Key gap: delete file removal
`overwriteFiles.deletedEntries()` in `snapshot_producers.go` explicitly
filters to `EntryContentData` only. As a result, after compaction the position
and equality delete files that covered the rewritten data files stay in the
manifests as orphans. The overwrite producer needs to handle delete file
removal alongside data file replacement.
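The shape of the change can be sketched as follows. This is a simplified model, not the code in `snapshot_producers.go`: the `entry` type, the `content` constants, and the `AppliesTo` field are hypothetical. In particular, the real scope of an equality delete comes from partition and sequence number, not from an explicit path list. The sketch also assumes a delete file may only be dropped once every data file it applies to has been rewritten.

```go
package main

import "fmt"

// content is a hypothetical stand-in for the manifest entry content type.
type content int

const (
	contentData content = iota
	contentPosDeletes
	contentEqDeletes
)

// entry is a hypothetical, simplified manifest entry. AppliesTo lists the
// data file paths a delete file covers; it is an assumption of this sketch.
type entry struct {
	Path      string
	Content   content
	AppliesTo []string
}

// allRewritten reports whether every path is in the rewritten set.
func allRewritten(paths []string, rewritten map[string]bool) bool {
	for _, p := range paths {
		if !rewritten[p] {
			return false
		}
	}
	return true
}

// deletedEntries returns the entries to drop in the compaction commit:
// the rewritten data files, plus any delete file whose covered data files
// were all rewritten. A delete file that still covers a live data file is kept.
func deletedEntries(entries []entry, rewritten map[string]bool) []entry {
	var out []entry
	for _, e := range entries {
		switch e.Content {
		case contentData:
			if rewritten[e.Path] {
				out = append(out, e)
			}
		case contentPosDeletes, contentEqDeletes:
			if len(e.AppliesTo) > 0 && allRewritten(e.AppliesTo, rewritten) {
				out = append(out, e)
			}
		}
	}
	return out
}

func main() {
	entries := []entry{
		{Path: "a.parquet", Content: contentData},
		{Path: "b.parquet", Content: contentData},
		{Path: "c.parquet", Content: contentData},
		{Path: "pos-1.parquet", Content: contentPosDeletes, AppliesTo: []string{"a.parquet", "b.parquet"}},
		{Path: "eq-1.parquet", Content: contentEqDeletes, AppliesTo: []string{"b.parquet", "c.parquet"}},
	}
	rewritten := map[string]bool{"a.parquet": true, "b.parquet": true}
	for _, e := range deletedEntries(entries, rewritten) {
		fmt.Println("remove:", e.Path)
	}
}
```

Here `eq-1.parquet` is kept because it still covers `c.parquet`, which was not part of the rewrite.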
### Nice to have: CLI
```
$ iceberg compact analyze db.events
Compaction Plan for db.events
Files scanned: 1,247
Files to rewrite: 89 (7.1%)
Compaction groups: 12
Est. size change: 2.3 GB → 1.8 GB (-22%)
$ iceberg compact run db.events --partial-progress
Compacting db.events...
[1/12] date=2024-01-15: 12 files → 2 files ✓
[2/12] date=2024-01-16: 8 files → 1 file ✓
Done. Rewrote 89 → 15 files. Removed 23 delete files.
```
### Related
Compaction is the v2 approach to read-perf degradation under deletes. The v3
approach (deletion vectors) is tracked in #589: the puffin reader/writer
exists, scanner integration is TBD. The two approaches are complementary: DVs
reduce write amplification, but compaction is still needed for file
consolidation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]