junegunn commented on PR #8001: URL: https://github.com/apache/hbase/pull/8001#issuecomment-4151701496
I found a regression with this patch. When scanning across many rows where each row has only one `DeleteFamily` (or `DeleteColumn`) marker, scan time increases by ~50% compared to master.

**The seek triggered by this optimization is more expensive than a simple skip when there's nothing to skip over.** The optimization helps when multiple delete markers accumulate for the same row or column. But in the common case of one delete per row, the seek is wasted, and the overhead adds up across many rows.

Benchmark data (scan time at 300K iterations, `DeleteFamily` on different rows):

- master: ~0.2s
- HBASE-29039-alt: ~0.3s

<img width="1152" height="960" alt="image" src="https://github.com/user-attachments/assets/2f1aa636-5fdf-49d7-82e4-bdf4a3c8a603" />

One possible approach: only seek on the second (or n-th) delete marker for the same scope. The first one would `SKIP` as before. If a second, redundant one appears, it signals accumulation and we switch to seek. This way:

- One delete per row (common case): always skips, no regression
- Accumulated deletes (the case we're optimizing): first one skips, rest seek

Would this kind of heuristic make sense?
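To make the proposed heuristic concrete, here is a minimal standalone sketch of the decision logic. It is not code from the patch or from HBase: the class `DeleteSeekHeuristic`, the `Action` enum, the `seekThreshold` parameter, and the use of a plain `String` as the "scope" (row, or row plus column) are all hypothetical names chosen for illustration. It also assumes that delete markers for the same scope arrive contiguously, as they would in a sorted scan.

```java
import java.util.Objects;

/** Hypothetical sketch of the "seek only on the n-th delete marker" idea; not HBase code. */
final class DeleteSeekHeuristic {
    /** Simplified stand-in for the matcher's decision; not HBase's real match codes. */
    enum Action { SKIP, SEEK }

    private final int seekThreshold; // n: seek once this many markers were seen in one scope
    private String currentScope;     // scope of the most recent delete marker
    private int markersInScope;      // delete markers seen so far in currentScope

    DeleteSeekHeuristic(int seekThreshold) {
        if (seekThreshold < 1) {
            throw new IllegalArgumentException("seekThreshold must be >= 1");
        }
        this.seekThreshold = seekThreshold;
    }

    /** Called once per delete marker; scope identifies the row (or row and column). */
    Action onDeleteMarker(String scope) {
        if (!Objects.equals(scope, currentScope)) {
            // A new scope starts: forget what was counted for the previous one.
            currentScope = scope;
            markersInScope = 0;
        }
        markersInScope++;
        // Below the threshold a plain skip is assumed cheaper; at or above it, seek.
        return markersInScope >= seekThreshold ? Action.SEEK : Action.SKIP;
    }

    public static void main(String[] args) {
        DeleteSeekHeuristic heuristic = new DeleteSeekHeuristic(2);
        String[] scopes = {"row1", "row2", "row3", "row3", "row3"};
        for (String scope : scopes) {
            System.out.println(scope + " -> " + heuristic.onDeleteMarker(scope));
        }
    }
}
```

With a threshold of 2, rows that carry a single marker only ever produce `SKIP`, while a row with accumulated markers switches to `SEEK` from the second marker on. In a real scanner the seek would jump past the remaining markers, so the counter would rarely advance further; the threshold itself would need to be tuned against benchmarks like the one above.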
