kevinrr888 commented on issue #5733:
URL: https://github.com/apache/accumulo/issues/5733#issuecomment-3133415437
I no longer think this is a workable ticket.
Let's look at the first approach. We:
* Use DataSketches to compute the midpoint of an RFile as it is written
* Use these midpoints, weighted by RFile size, to compute split points when automatic split points are calculated
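As a rough illustration of the first bullet, here is a stdlib-only stand-in (hypothetical class and method names; the real Apache DataSketches `ItemsSketch` keeps only a bounded sample and returns *estimated* quantiles, whereas this stand-in tracks every row exactly):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative stand-in for a quantiles sketch: record each row as the RFile
// is written, then report the median row ("midpoint") when the file closes.
class MidpointTracker {
    private final List<String> rows = new ArrayList<>();

    // Called once per key as the RFile is written.
    void update(String row) {
        rows.add(row);
    }

    // The real sketch would return an estimated median without storing rows.
    String midpoint() {
        Collections.sort(rows);
        return rows.get(rows.size() / 2);
    }
}

public class MidpointDemo {
    public static void main(String[] args) {
        MidpointTracker rf2 = new MidpointTracker();
        for (String row : new String[] {"4", "5", "6"}) {
            rf2.update(row);
        }
        // The (midpoint, fileSize) pair would be stored in metadata so later
        // split computations need not re-read the RFile.
        System.out.println(rf2.midpoint()); // prints 5
    }
}
```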
A detrimental downside of this is that the midpoints and weights become outdated after subsequent splits and merges, and there does not seem to be a way to address this without opening and reading RFiles again at split time (which is exactly what we were trying to avoid). Here is an example to outline this:
------------------------
Tablet0
Range: (-inf, inf)
rf1 with rows 1, 2, 3
mid point of rf1 = 2
rf2 with rows 4, 5, 6
mid point of rf2 = 5
rf3 with rows 7, 8, 9
mid point of rf3 = 8
------------------------
|
| split at 5
V
------------------------
Tablet1
Range: (-inf, 5]
rf1 with rows 1, 2, 3
mid point of rf1 = 2
rf2 with rows 4, 5, 6 <--- NOTE: this rfile has keys which lie beyond the range
mid point of rf2 = 5
------------------------
Tablet2
Range: (5, inf)
rf2 with rows 4, 5, 6 <--- NOTE: this rfile has keys which lie beyond the range
mid point of rf2 = 5
rf3 with rows 7, 8, 9
mid point of rf3 = 8
------------------------
When we split a tablet in the middle of some rfile, both child tablets are assigned that rfile (rf2 in this case). Let's say an automatic split is then calculated for Tablet2, and assume rf2 is the larger of rf2 and rf3. The stored midpoint for rf2 would then be the preferred split, so we would split Tablet2 at 5. This is wrong, as 5 lies outside Tablet2's range.
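A minimal sketch of this failure mode, with made-up file sizes as weights:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StaleMidpointDemo {
    // Prefer the stored midpoint belonging to the heaviest file.
    static String chooseSplit(Map<String, Long> midpointWeights) {
        return midpointWeights.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .orElseThrow().getKey();
    }

    public static void main(String[] args) {
        // Tablet2 holds rf2 and rf3; midpoints were computed before the split.
        Map<String, Long> candidates = new LinkedHashMap<>();
        candidates.put("5", 900L); // rf2: stale midpoint, file size as weight
        candidates.put("8", 400L); // rf3

        String split = chooseSplit(candidates);
        // Tablet2's range is (5, inf), so a valid split must be > "5".
        boolean inRange = split.compareTo("5") > 0;
        System.out.println(split + " inRange=" + inRange); // prints 5 inRange=false
    }
}
```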
To address this, after the split at 5 we would have to, for each tablet, open rf2, compute a new midpoint restricted to the tablet's range, and also estimate how much of rf2 lies within that range so we can use an appropriate weight. The added complexity, plus still needing to open rfiles for splits, may be a deal-breaker for the approach: the entire benefit was avoiding opening and reading RFiles at split time.
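A hedged sketch of what that correction entails (hypothetical method; the point is that the rows can only come from re-reading rf2):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RangeAwareRecompute {
    // Keep only rows strictly greater than the tablet's exclusive low bound,
    // then return the median of what remains.
    static String rangeMidpoint(List<String> rows, String lowExclusive) {
        List<String> inRange = new ArrayList<>();
        for (String row : rows) {
            if (row.compareTo(lowExclusive) > 0) {
                inRange.add(row);
            }
        }
        Collections.sort(inRange);
        return inRange.get(inRange.size() / 2);
    }

    public static void main(String[] args) {
        // Getting these rows requires opening and reading rf2 again,
        // which is the cost the design was trying to avoid.
        List<String> rf2Rows = List.of("4", "5", "6");
        // Tablet2's range is (5, inf); the recomputed midpoint now lies inside it.
        System.out.println(rangeMidpoint(rf2Rows, "5")); // prints 6
    }
}
```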
Let's look at the second approach.
> An alternative to this would be simply to continue to compute the
> midpoints at the last moment instead of precomputing them and storing them in
> the metadata, but make it more efficient by using datasketches to do it in
> fewer passes over the indexes.
The current approach already does only one pass over the indexes: see `SplitUtils.findSplits`. Maybe the current implementation could still benefit from using DataSketches within that single pass? After some rough timing measurements averaged over 5 runs of the old approach and of an approach using DataSketches, I found that a pass over the keys using DataSketches (updating the `ItemsSketch` with `update()` on each iteration) takes over 2x as long as the current single-pass implementation on average. Overall, `findSplits` using the new approach takes about 20% longer on average because of this.
There are also other problems with using DataSketches here. For example, shortening the row becomes trickier than in the original approach. The current approach computes the longest common length of the previous row and the current row; this is harder with DataSketches, since it only produces estimates and there is no simple way to get the row preceding a computed split point. Another issue is that not all split candidates will be accepted as splits: they need to pass the `rowPredicate`, which is also complicated when using DataSketches.
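For context, the row-shortening step relies on having the previous row in sorted order, which a single ordered pass provides for free but a quantiles sketch does not hand back. A simplified sketch of the idea (illustrative only; see `SplitUtils.findSplits` for the real logic):

```java
public class ShortenDemo {
    // A split row can be shortened to one character past the longest common
    // prefix of the previous row and the chosen row: the result is still
    // greater than the previous row and no greater than the chosen row.
    static String shorten(String previous, String chosen) {
        int common = 0;
        int limit = Math.min(previous.length(), chosen.length());
        while (common < limit && previous.charAt(common) == chosen.charAt(common)) {
            common++;
        }
        return chosen.substring(0, Math.min(common + 1, chosen.length()));
    }

    public static void main(String[] args) {
        System.out.println(shorten("row_aaaa", "row_bbbb")); // prints row_b
    }
}
```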
For these reasons, I don't think this ticket is workable, and it may be closed unless anyone has other ideas or suggestions.