iemejia opened a new pull request, #12395:
URL: https://github.com/apache/gluten/pull/12395
## What changes are proposed in this pull request?
Replace per-row `contains()` calls with an iterator-based scan in
`DeltaDeletionVectorReader::applyDeletionFilter()`.
**Before:** O(batch_size) calls to `Roaring64Map::contains()`, each
performing a `std::map` lookup (O(log N) on the high-32-bit buckets) plus a
Roaring container check per row position.
**After:** O(deletions_in_range) iterator advances using
`move_equalorlarger()` to seek directly to the first deleted row in the batch
range, then sequential `++iterator` advances (O(1) amortized within a Roaring
container) until the range end.
### Benchmark results (1M row file, batches of 4096, scanning full file)
| Deletion % | Baseline (ms) | Optimized (ms) | Speedup |
|---|---|---|---|
| **1% (sparse)** | **9.90** | **0.050** | **198x** |
| 10% (moderate) | 4.06 | 0.398 | 10.2x |
| 50% (dense) | 3.82 | 1.99 | 1.9x |
| 90% (very dense) | 4.37 | 3.64 | 1.2x |
For the common case of sparse deletions (1%, typical for Delta
MERGE/UPDATE), this is a **198x improvement**.
The baseline O(batch_size) approach spent constant time (~4ms per 1M rows)
regardless of deletion density because it always probed every row. The
optimized O(deletions) approach scales linearly with actual deletions.
## How was this patch tested?
- 21 C++ unit tests pass (12 existing + 9 new corner cases):
- All rows deleted / no rows deleted
- First and last row of batch deleted
- Large 64-bit offsets (multi-bucket Roaring64Map)
- Single-row batches (deleted / not deleted)
- Dense/sparse mixed deletion patterns
- Batch range entirely before / after all deletions
- Partial overlap with deletion range
- New `BM_ApplyDeletionFilter` benchmark added to `DeltaBitmapBenchmark.cc`
with 5 scenarios (sparse/moderate/dense/very-dense/large-batch).
## Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude claude-opus-4.6
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]