[GitHub] [accumulo] cshannon opened a new pull request, #3286: Add support for storing ranges in DataFileValue

via GitHub Sat, 08 Apr 2023 09:10:21 -0700


cshannon opened a new pull request, #3286:
URL: https://github.com/apache/accumulo/pull/3286


   This adds support for storing a collection of ranges associated with a data 
file in the Accumulo metadata table. This will make it possible to fence off 
RFiles by range in the future so we can accomplish no-chop merges. The 
previous CSV format of the serialized DataFileValue has been converted to JSON, 
and the ranges are Base64 encoded. 
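
   To illustrate the Base64 step, here is a minimal sketch of round-tripping 
binary range bounds through a text-safe encoding. It is not the code from this 
PR: `RowBounds` is a hypothetical stand-in for Accumulo's Range, and the 
length-prefixed byte layout is an assumption made only for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical sketch: RowBounds stands in for Accumulo's Range and is not the real class.
class RangeBase64Sketch {

  static final class RowBounds {
    final byte[] startRow;
    final byte[] endRow;

    RowBounds(byte[] startRow, byte[] endRow) {
      this.startRow = startRow;
      this.endRow = endRow;
    }
  }

  // Write both rows as length-prefixed binary (assumed layout), then Base64 the
  // bytes so arbitrary binary values can sit inside a JSON string.
  static String encode(RowBounds range) {
    ByteBuffer buf = ByteBuffer.allocate(8 + range.startRow.length + range.endRow.length);
    buf.putInt(range.startRow.length).put(range.startRow);
    buf.putInt(range.endRow.length).put(range.endRow);
    return Base64.getEncoder().encodeToString(buf.array());
  }

  // Reverse of encode: Base64 text back to bytes, then read the two rows.
  static RowBounds decode(String encoded) {
    ByteBuffer buf = ByteBuffer.wrap(Base64.getDecoder().decode(encoded));
    byte[] start = new byte[buf.getInt()];
    buf.get(start);
    byte[] end = new byte[buf.getInt()];
    buf.get(end);
    return new RowBounds(start, end);
  }

  public static void main(String[] args) {
    // Bytes that would be awkward in CSV or raw JSON: NUL, 0xff, and a double quote.
    RowBounds original = new RowBounds(new byte[] {0x00, (byte) 0xff, '"'}, new byte[] {'z'});
    RowBounds copy = decode(encode(original));
    System.out.println(Arrays.equals(original.startRow, copy.startRow)
        && Arrays.equals(original.endRow, copy.endRow));
  }
}
```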
   
   This is the second part of #1327 and will provide a mechanism to store 
ranges. Note that this PR doesn't actually update Accumulo to start storing 
ranges yet; it just adds support for doing so. When this PR and #3246 are 
ready, a follow-on task would be to go back and update the merge code to no 
longer run chop compactions but to start populating ranges in the metadata, etc.
   
   The goal here is to be able to optionally store ranges for a file in 
metadata. An empty range list just means the entire file is used (infinite 
range). This is nice because it keeps us backwards compatible with previous 
metadata that didn't include a range, and it means we don't have to store 
anything unless we need to fence the file off in the future (such as when 
merging).
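
   The "empty list means the whole file" rule can be sketched as follows. This 
is only an illustration under assumptions: `RowRange` and `isVisible` are 
hypothetical names, rows are simplified to strings, and the exclusive-start / 
inclusive-end convention is chosen for the example rather than taken from this 
PR.

```java
import java.util.List;

// Hypothetical sketch of the fencing semantics; not Accumulo's actual classes.
class FencedFileSketch {

  // Simplified range over row strings: startRow is exclusive, endRow is
  // inclusive, and null means unbounded on that side (an assumed convention).
  static final class RowRange {
    final String startRow;
    final String endRow;

    RowRange(String startRow, String endRow) {
      this.startRow = startRow;
      this.endRow = endRow;
    }

    boolean contains(String row) {
      boolean afterStart = startRow == null || row.compareTo(startRow) > 0;
      boolean beforeEnd = endRow == null || row.compareTo(endRow) <= 0;
      return afterStart && beforeEnd;
    }
  }

  // An empty list means the file is not fenced, so every row is visible.
  // Old metadata entries with no ranges therefore behave exactly as before.
  static boolean isVisible(List<RowRange> ranges, String row) {
    if (ranges.isEmpty()) {
      return true;
    }
    return ranges.stream().anyMatch(r -> r.contains(row));
  }

  public static void main(String[] args) {
    List<RowRange> fenced = List.of(new RowRange("a", "f"));
    System.out.println(isVisible(List.of(), "m")); // unfenced file
    System.out.println(isVisible(fenced, "c"));    // inside the fence
    System.out.println(isVisible(fenced, "m"));    // outside the fence
  }
}
```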
   
   One question I had was what the size and numEntries values inside 
DataFileValue should be now. Will those values stay as they currently are (size 
and numEntries for the entire file), or do we need to set them based on the 
ranges provided? I.e., do we need to reduce the size somehow to reflect the 
fenced-off portion rather than the actual file? I don't think so. I think we 
can leave the values alone (applying to the whole file) and just get what we 
need out of the ranges. Those values are used for things like splits, where I 
would still expect you'd want to take the entire file into account, but maybe 
someone else with more background can comment on that.
   
   A couple of other notes about this PR:
   
   1. I made it a draft PR for now, as this should really be targeted at a 3.1 
branch and right now main is just 3.0. I also wanted to get some initial 
feedback.
   2. The DataFileValue format is now a JSON string when serialized into the 
metadata table instead of a CSV. The old format is detected by catching the 
JSON parsing exception and falling back to the legacy format.
   3. Ranges are encoded as Base64. Ranges are complex, with keys and a lot of 
binary values, so the simplest and most compact way to store one was to encode 
the entire Range into a byte array and then encode that as Base64.
   4. I left the equals/hashCode methods in DataFileValue alone for now, so 
they still only compare the size and number of entries. I wasn't sure if we 
wanted to add the range comparison to equals yet; it probably depends on how 
we intend to use it.
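
   The fallback described in note 2 can be sketched roughly like this. It is 
not the implementation in this PR: the JSON field names, the regex-based 
`parseJson` (a stand-in for a real JSON library), and the assumed legacy layout 
of `size,numEntries` are all hypothetical and used only to show the 
try-JSON-then-legacy flow.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "try JSON first, fall back to legacy CSV".
class DataFileValueFormatSketch {
  final long size;
  final long numEntries;
  final String encodedRanges; // raw contents of the JSON array; empty when unfenced

  DataFileValueFormatSketch(long size, long numEntries, String encodedRanges) {
    this.size = size;
    this.numEntries = numEntries;
    this.encodedRanges = encodedRanges;
  }

  // Assumed JSON layout, for illustration only. A real implementation would
  // use a JSON library instead of a regular expression.
  private static final Pattern JSON = Pattern.compile(
      "\\{\"size\":(\\d+),\"numEntries\":(\\d+),\"ranges\":\\[(.*)\\]\\}");

  // Throws IllegalArgumentException when the input is not in the assumed JSON form.
  static DataFileValueFormatSketch parseJson(String s) {
    Matcher m = JSON.matcher(s);
    if (!m.matches()) {
      throw new IllegalArgumentException("not JSON: " + s);
    }
    return new DataFileValueFormatSketch(
        Long.parseLong(m.group(1)), Long.parseLong(m.group(2)), m.group(3));
  }

  // Assumed legacy layout: "size,numEntries"; any further fields are ignored here.
  static DataFileValueFormatSketch parseLegacy(String s) {
    String[] parts = s.split(",");
    return new DataFileValueFormatSketch(
        Long.parseLong(parts[0]), Long.parseLong(parts[1]), "");
  }

  // New format first; a parse failure is treated as "this is old metadata".
  static DataFileValueFormatSketch decode(String s) {
    try {
      return parseJson(s);
    } catch (IllegalArgumentException e) {
      return parseLegacy(s);
    }
  }

  public static void main(String[] args) {
    DataFileValueFormatSketch legacy = decode("1024,10");
    DataFileValueFormatSketch json = decode("{\"size\":2048,\"numEntries\":20,\"ranges\":[]}");
    System.out.println(legacy.size + " " + legacy.numEntries);
    System.out.println(json.size + " " + json.numEntries);
  }
}
```

   One trade-off of this approach is that every legacy entry pays for a failed 
JSON parse before the fallback runs, which is simple but relies on exceptions 
for format detection.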
   
   


