junegunn commented on PR #8001:
URL: https://github.com/apache/hbase/pull/8001#issuecomment-4151701496

   I found a regression with this patch. When scanning across many rows where 
each row has only one `DeleteFamily` (or `DeleteColumn`) marker, scan time 
increases by ~50% compared to master. **The seek triggered by this 
optimization is more expensive than a simple skip when there's nothing to skip 
over.**
   
   The optimization helps when multiple delete markers accumulate for the same 
row or column. But for the common case of one delete per row, the seek is 
wasted and the overhead adds up across many rows.
   
   Benchmark data (scan time at 300K iterations, `DeleteFamily` on different 
rows):
   - master: ~0.2s
   - HBASE-29039-alt: ~0.3s
   
   <img width="1152" height="960" alt="image" 
src="https://github.com/user-attachments/assets/2f1aa636-5fdf-49d7-82e4-bdf4a3c8a603"
 />
   
   One possible approach: only seek on the second (or n-th) delete marker for 
the same scope. The first one would `SKIP` as before. If a second one appears 
(redundant), it signals accumulation and we switch to seek. This way:
   - One delete per row (common case): always skips, no regression
   - Accumulated deletes (the case we're optimizing): first one skips, rest seek
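
   To make the idea concrete, here is a minimal, standalone sketch of the 
counting logic. It is not taken from the patch or from HBase: the class name, 
the `Action` enum, the `onDeleteMarker` method, and the string-valued scope 
key are all hypothetical stand-ins, and the threshold `n` is left 
configurable.

```java
import java.util.Objects;

/** Hypothetical sketch of "seek only on the n-th delete marker"; not HBase code. */
final class DeleteMarkerSeekHeuristic {

    /** Hypothetical stand-in for the matcher's decision, not HBase's real enum. */
    enum Action { SKIP, SEEK }

    // n: the first (n - 1) markers in a scope are skipped, the n-th and later seek.
    private final int seekThreshold;
    // Scope of the most recent delete marker (assumed: row key, or row plus column).
    private String currentScope;
    // Number of delete markers seen so far in currentScope.
    private int markersInScope;

    DeleteMarkerSeekHeuristic(int seekThreshold) {
        if (seekThreshold < 1) {
            throw new IllegalArgumentException("seekThreshold must be >= 1");
        }
        this.seekThreshold = seekThreshold;
    }

    /** Called once per delete marker, in scan order. */
    Action onDeleteMarker(String scope) {
        Objects.requireNonNull(scope, "scope");
        if (scope.equals(currentScope)) {
            markersInScope++;      // another marker in the same scope: accumulation
        } else {
            currentScope = scope;  // new scope: restart the count
            markersInScope = 1;
        }
        return markersInScope >= seekThreshold ? Action.SEEK : Action.SKIP;
    }
}
```

   With `seekThreshold = 2`, a scan that sees one marker per row only ever 
gets `SKIP`, while a second marker in the same scope returns `SEEK`. The 
extra state is one reference and one counter per scanner.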
   
   Would this kind of heuristic make sense?


