iemejia opened a new issue, #12399: URL: https://github.com/apache/gluten/issues/12399
## Description

This is a tracking issue for a set of optimizations to the Delta Lake Deletion Vector (DV) processing path in the Velox backend.

When a Delta table receives MERGE, UPDATE, or DELETE operations, Delta writes Deletion Vectors (small sidecar files containing bitmaps of logically deleted row positions) instead of rewriting the affected data files. The physical data stays in place until compaction (OPTIMIZE) runs. This makes mutations fast, but shifts cost to reads: every subsequent query that touches files with pending deletions must load the DV bitmaps from storage and filter out deleted rows. Between compaction cycles, this cost is paid on every read query. On remote storage (ABFS, S3, HDFS), this cost is amplified because every storage operation involves a network round-trip.

The optimizations below target three layers of the DV processing path: query planning (JVM), row filtering (C++ native engine), and file I/O.

## Tracked PRs

| PR | Area | Description | CI | Status |
|---|---|---|---|---|
| [#12390](https://github.com/apache/gluten/pull/12390) | JVM (planning) | Eliminate redundant network calls during DV materialization: cache path resolution per partition, read raw DV bytes directly (skip Java deserialize/re-serialize), early-exit guard for non-Delta queries, fused rule execution | Green | Open, in review |
| [#12395](https://github.com/apache/gluten/pull/12395) | C++ (native engine) | Iterator-based DV bitmap filtering: replace per-row `contains()` with `move_equalorlarger()` iterator so cost scales with actual deletions, not total rows | Green | Open, in review |
| [#12389](https://github.com/apache/gluten/pull/12389) | C++ (plan converter) | Remove double `dynamic_pointer_cast` and unnecessary `std::string` copy of DV data in `parseDeltaSplitInfo` | Green | Open, in review |
| TBD | JVM/C++ (file I/O) | Enable file handle cache by default with TTL-based eviction, wire previously dead-code TTL config to Velox cache | -- | Not yet submitted |
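
To illustrate the idea behind the iterator-based filtering row for #12395, here is a minimal sketch. It is not the code from the PR: a sorted `std::vector<uint64_t>` stands in for the Roaring bitmap, `std::binary_search` plays the role of a per-row `contains()` probe, and `std::lower_bound` plays the role of seeking with `move_equalorlarger()`. The names `DeletedRows`, `keepMaskPerRow`, and `keepMaskWithIterator` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a deletion vector: sorted, unique deleted row positions.
using DeletedRows = std::vector<uint64_t>;

// Baseline: probe the deletion set once for every row in [begin, end).
// The number of probes equals the number of rows, however few are deleted.
std::vector<uint8_t> keepMaskPerRow(const DeletedRows& deleted,
                                    uint64_t begin, uint64_t end) {
  assert(begin <= end);
  std::vector<uint8_t> keep(end - begin, 1);
  for (uint64_t row = begin; row < end; ++row) {
    if (std::binary_search(deleted.begin(), deleted.end(), row)) {
      keep[row - begin] = 0;
    }
  }
  return keep;
}

// Iterator-based: seek once to the first deleted position >= begin, then
// visit only the deleted positions that fall inside the batch. Work on the
// deletion set scales with the deletions in range; the mask itself is
// still filled once with "keep".
std::vector<uint8_t> keepMaskWithIterator(const DeletedRows& deleted,
                                          uint64_t begin, uint64_t end) {
  assert(begin <= end);
  std::vector<uint8_t> keep(end - begin, 1);
  auto it = std::lower_bound(deleted.begin(), deleted.end(), begin);
  for (; it != deleted.end() && *it < end; ++it) {
    keep[*it - begin] = 0;
  }
  return keep;
}
```

Both functions return the same mask. The second touches the deletion set only where deletions exist, so a sparse DV benefits most, which is the direction one would expect from the description in the table.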

## Measured improvements

**DV bitmap filtering (C++, PR #12395):**

| Deletion density | Speedup |
|---|---|
| 1% (sparse, typical after MERGE/UPDATE) | 198x |
| 10% (moderate) | 10x |
| 50% (dense) | 2x |

**DV materialization (JVM, PR #12390):**

- Projected up to 20x faster on ABFS by eliminating redundant HTTP round-trips per file
- Non-Delta queries: 22x faster rule evaluation via early-exit guard

**File handle caching (not yet submitted):**

- Estimated 40-70% improvement for repeated scans of many small files on remote storage

## Was this issue authored or co-authored using generative AI tooling?

No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
