kevinrr888 commented on issue #5733:
URL: https://github.com/apache/accumulo/issues/5733#issuecomment-3133415437
I no longer think this is a workable ticket.
Let's look at the first approach. We:
* Use DataSketches to compute the midpoint of an RFile as it is written
* Use these midpoints, weighted by RFile size, to compute split points when automatic split points are calculated
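As a rough illustration of the first bullet, here is a stdlib-only stand-in (hypothetical class and method names; the real Apache DataSketches `ItemsSketch` keeps only a bounded sample and returns *estimated* quantiles, whereas this stand-in tracks every row exactly):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative stand-in for a quantiles sketch: record each row as the RFile
// is written, then report the median row ("midpoint") when the file closes.
class MidpointTracker {
    private final List<String> rows = new ArrayList<>();

    // Called once per key as the RFile is written.
    void update(String row) {
        rows.add(row);
    }

    // The real sketch would return an estimated median without storing rows.
    String midpoint() {
        Collections.sort(rows);
        return rows.get(rows.size() / 2);
    }
}

public class MidpointDemo {
    public static void main(String[] args) {
        MidpointTracker rf2 = new MidpointTracker();
        for (String row : new String[] {"4", "5", "6"}) {
            rf2.update(row);
        }
        // The (midpoint, fileSize) pair would be stored in metadata so later
        // split computations need not re-read the RFile.
        System.out.println(rf2.midpoint()); // prints 5
    }
}
```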
A detrimental downside of this is that the midpoints and weights become outdated after subsequent splits and merges, and there does not seem to be a way to address this without opening and reading RFiles again at split time (which is exactly what we were trying to avoid). Here is an example to outline this:
------------------------
Tablet0
Range: (-inf, inf)
rf1 with rows 1, 2, 3
mid point of rf1 = 2
rf2 with rows 4, 5, 6
mid point of rf2 = 5
rf3 with rows 7, 8, 9
mid point of rf3 = 8
------------------------
|
| split at 5
V
------------------------
Tablet1
Range: (-inf, 5]
rf1 with rows 1, 2, 3
mid point of rf1 = 2
rf2 with rows 4, 5, 6 <--- NOTE: this rfile has keys which lie beyond the range
mid point of rf2 = 5
------------------------
Tablet2
Range: (5, inf)
rf2 with rows 4, 5, 6 <--- NOTE: this rfile has keys which lie beyond the range
mid point of rf2 = 5
rf3 with rows 7, 8, 9
mid point of rf3 = 8
------------------------
When we split a tablet in the middle of some rfile, both child tablets are assigned that rfile (rf2 in this case). Let's say an automatic split is then calculated for Tablet2, and assume rf2 is the larger of rf2 and rf3. The stored midpoint for rf2 would then be the preferred split, so we would split Tablet2 at 5. This is wrong, as 5 lies outside Tablet2's range.
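A minimal sketch of this failure mode, with made-up file sizes as weights:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StaleMidpointDemo {
    // Prefer the stored midpoint belonging to the heaviest file.
    static String chooseSplit(Map<String, Long> midpointWeights) {
        return midpointWeights.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .orElseThrow().getKey();
    }

    public static void main(String[] args) {
        // Tablet2 holds rf2 and rf3; midpoints were computed before the split.
        Map<String, Long> candidates = new LinkedHashMap<>();
        candidates.put("5", 900L); // rf2: stale midpoint, file size as weight
        candidates.put("8", 400L); // rf3

        String split = chooseSplit(candidates);
        // Tablet2's range is (5, inf), so a valid split must be > "5".
        boolean inRange = split.compareTo("5") > 0;
        System.out.println(split + " inRange=" + inRange); // prints 5 inRange=false
    }
}
```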
To address this, after the split at 5 we would have to, for each tablet, open rf2, compute a new midpoint restricted to the tablet's range, and also estimate how much of rf2 lies within that range so we can use an appropriate weight. The added complexity, plus still needing to open rfiles for splits, may be a deal-breaker for the approach: the entire benefit was avoiding opening and reading RFiles at split time.
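A hedged sketch of what that correction entails (hypothetical method; the point is that the rows can only come from re-reading rf2):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RangeAwareRecompute {
    // Keep only rows strictly greater than the tablet's exclusive low bound,
    // then return the median of what remains.
    static String rangeMidpoint(List<String> rows, String lowExclusive) {
        List<String> inRange = new ArrayList<>();
        for (String row : rows) {
            if (row.compareTo(lowExclusive) > 0) {
                inRange.add(row);
            }
        }
        Collections.sort(inRange);
        return inRange.get(inRange.size() / 2);
    }

    public static void main(String[] args) {
        // Getting these rows requires opening and reading rf2 again,
        // which is the cost the design was trying to avoid.
        List<String> rf2Rows = List.of("4", "5", "6");
        // Tablet2's range is (5, inf); the recomputed midpoint now lies inside it.
        System.out.println(rangeMidpoint(rf2Rows, "5")); // prints 6
    }
}
```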
Let's look at the second approach.
> An alternative to this would be simply to continue to compute the
> midpoints at the last moment instead of precomputing them and storing them in
> the metadata, but make it more efficient by using datasketches to do it in
> fewer passes over the indexes.
The current approach already does only one pass over the indexes: see `SplitUtils.findSplits`. Maybe the current implementation could still benefit from using DataSketches within that single pass? After some rough timing measurements averaged over 5 runs of the old approach and of an approach using DataSketches, I found that a pass over the keys using DataSketches (updating the `ItemsSketch` with `update()` on each iteration) takes over 2x as long as the current single-pass implementation on average. Overall, `findSplits` using the new approach takes about 20% longer on average because of this.
There are also other problems with using DataSketches here. For example, shortening the row becomes trickier than in the original approach. The current approach computes the longest common length of the previous row and the current row; this is harder with DataSketches, since it only produces estimates and there is no simple way to get the row preceding a computed split point. Another issue is that not all split candidates will be accepted as splits: they need to pass the `rowPredicate`, which is also complicated when using DataSketches.
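For context, the row-shortening step relies on having the previous row in sorted order, which a single ordered pass provides for free but a quantiles sketch does not hand back. A simplified sketch of the idea (illustrative only; see `SplitUtils.findSplits` for the real logic):

```java
public class ShortenDemo {
    // A split row can be shortened to one character past the longest common
    // prefix of the previous row and the chosen row: the result is still
    // greater than the previous row and no greater than the chosen row.
    static String shorten(String previous, String chosen) {
        int common = 0;
        int limit = Math.min(previous.length(), chosen.length());
        while (common < limit && previous.charAt(common) == chosen.charAt(common)) {
            common++;
        }
        return chosen.substring(0, Math.min(common + 1, chosen.length()));
    }

    public static void main(String[] args) {
        System.out.println(shorten("row_aaaa", "row_bbbb")); // prints row_b
    }
}
```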
For these reasons, I don't think this ticket is workable, and it may be closed unless anyone has other ideas or suggestions.