iemejia opened a new issue, #12399:
URL: https://github.com/apache/gluten/issues/12399

   ## Description
   
   This is a tracking issue for a set of optimizations to the Delta Lake 
Deletion Vector (DV) processing path in the Velox backend.
   
   When a Delta table receives MERGE, UPDATE, or DELETE operations, Delta 
writes Deletion Vectors -- small sidecar files containing bitmaps of logically 
deleted row positions -- instead of rewriting the affected data files. The 
physical data stays in place until compaction (OPTIMIZE) runs. This makes 
mutations fast, but shifts cost to reads: every subsequent query that touches 
files with pending deletions must load the DV bitmaps from storage and filter 
out deleted rows. Between compaction cycles, this cost is paid on every read 
query.
   
   On remote storage (ABFS, S3, HDFS), this cost is amplified because every 
storage operation involves a network round-trip. The optimizations below target 
three layers of the DV processing path: query planning (JVM), row filtering 
(C++ native engine), and file I/O.
   
   ## Tracked PRs
   
   | PR | Area | Description | CI | Status |
   |---|---|---|---|---|
   | [#12390](https://github.com/apache/gluten/pull/12390) | JVM (planning) | Eliminate redundant network calls during DV materialization: cache path resolution per partition, read raw DV bytes directly (skip Java deserialize/re-serialize), early-exit guard for non-Delta queries, fused rule execution | Green | Open, in review |
   | [#12395](https://github.com/apache/gluten/pull/12395) | C++ (native engine) | Iterator-based DV bitmap filtering: replace per-row `contains()` with `move_equalorlarger()` iterator so cost scales with actual deletions, not total rows | Green | Open, in review |
   | [#12389](https://github.com/apache/gluten/pull/12389) | C++ (plan converter) | Remove double `dynamic_pointer_cast` and unnecessary `std::string` copy of DV data in `parseDeltaSplitInfo` | Green | Open, in review |
   | TBD | JVM/C++ (file I/O) | Enable file handle cache by default with TTL-based eviction, wire previously dead-code TTL config to Velox cache | -- | Not yet submitted |
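
To illustrate the idea behind the iterator-based filtering row, here is a minimal sketch, not the code from PR #12395. It assumes a sorted `std::vector<uint64_t>` as a stand-in for the DV bitmap and uses `std::lower_bound` in the role the table attributes to `move_equalorlarger()`. The names `DeletedRows`, `keepMaskPerRow`, and `keepMaskWithIterator` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a deletion vector: sorted positions of logically deleted rows.
using DeletedRows = std::vector<uint64_t>;

// Baseline: one membership probe per row, so the number of probes
// equals the batch size regardless of how few rows are deleted.
std::vector<bool> keepMaskPerRow(const DeletedRows& deleted,
                                 uint64_t begin, uint64_t end) {
  std::vector<bool> keep(end - begin, true);
  for (uint64_t row = begin; row < end; ++row) {
    if (std::binary_search(deleted.begin(), deleted.end(), row)) {
      keep[row - begin] = false;
    }
  }
  return keep;
}

// Iterator-based: seek once to the first deleted position >= begin,
// then visit only the deleted positions that fall inside [begin, end).
std::vector<bool> keepMaskWithIterator(const DeletedRows& deleted,
                                       uint64_t begin, uint64_t end) {
  assert(begin <= end);
  assert(std::is_sorted(deleted.begin(), deleted.end()));
  std::vector<bool> keep(end - begin, true);
  for (auto it = std::lower_bound(deleted.begin(), deleted.end(), begin);
       it != deleted.end() && *it < end; ++it) {
    keep[*it - begin] = false;
  }
  return keep;
}
```

Both functions return the same mask; the second one does work proportional to the number of deletions in the batch (plus one seek), which is the property the PR description relies on.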
   
   ## Measured improvements
   
   **DV bitmap filtering (C++, PR #12395):**
   
   | Deletion density | Speedup |
   |---|---|
   | 1% (sparse, typical after MERGE/UPDATE) | 198x |
   | 10% (moderate) | 10x |
   | 50% (dense) | 2x |
   
   **DV materialization (JVM, PR #12390):**
   - Projected up to 20x faster on ABFS by eliminating redundant HTTP 
round-trips per file
   - Non-Delta queries: 22x faster rule evaluation via early-exit guard
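
The per-partition caching of path resolution can be pictured as plain memoization. The sketch below is an assumption about the general shape of that optimization, not the JVM code from PR #12390, and it is written in C++ only to stay consistent with the other examples here. `PerPartitionPathCache` and its `resolve` callback are hypothetical names; the callback stands in for whatever remote call resolves a DV path.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Memoizes an expensive resolver so it runs at most once per partition key.
class PerPartitionPathCache {
 public:
  using Resolver = std::function<std::string(const std::string&)>;

  explicit PerPartitionPathCache(Resolver resolve)
      : resolve_(std::move(resolve)) {
    assert(static_cast<bool>(resolve_));  // a resolver must be supplied
  }

  // Returns the cached result, invoking the resolver only on a miss.
  const std::string& resolved(const std::string& partitionKey) {
    auto it = cache_.find(partitionKey);
    if (it == cache_.end()) {
      it = cache_.emplace(partitionKey, resolve_(partitionKey)).first;
    }
    return it->second;
  }

 private:
  Resolver resolve_;
  std::unordered_map<std::string, std::string> cache_;
};
```

With many files in the same partition, the resolver (and its network round-trip) is paid once instead of once per file.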
   
   **File handle caching (not yet submitted):**
   - Estimated 40-70% improvement for repeated scans of many small files on 
remote storage
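
As a rough picture of what TTL-based eviction for file handles means, here is a minimal sketch. It is not Velox's file handle cache and does not reflect any Gluten or Velox configuration keys. `FileHandle`, `TtlHandleCache`, and the `open` callback are hypothetical, and the sketch ignores capacity limits and thread safety.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical handle type; a real one would own a remote file descriptor.
struct FileHandle {
  std::string path;
};

class TtlHandleCache {
 public:
  using Clock = std::chrono::steady_clock;
  using Opener = std::function<std::shared_ptr<FileHandle>(const std::string&)>;

  TtlHandleCache(Clock::duration ttl, Opener open)
      : ttl_(ttl), open_(std::move(open)) {
    assert(ttl_ > Clock::duration::zero());
  }

  // Reuses a cached handle while it is younger than the TTL; otherwise
  // reopens the file. `now` is a parameter so expiry can be exercised
  // deterministically.
  std::shared_ptr<FileHandle> get(const std::string& path,
                                  Clock::time_point now = Clock::now()) {
    auto it = entries_.find(path);
    if (it != entries_.end() && now - it->second.openedAt < ttl_) {
      return it->second.handle;
    }
    auto handle = open_(path);
    entries_[path] = Entry{handle, now};
    return handle;
  }

 private:
  struct Entry {
    std::shared_ptr<FileHandle> handle;
    Clock::time_point openedAt;
  };

  Clock::duration ttl_;
  Opener open_;
  std::unordered_map<std::string, Entry> entries_;
};
```

Repeated scans of the same file within the TTL skip the open call; once the TTL has elapsed, the next access reopens the file and replaces the stale entry.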
   
   ## Was this issue authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

