cshannon opened a new pull request, #3286: URL: https://github.com/apache/accumulo/pull/3286
This adds support for storing a collection of ranges associated with a data file in the Accumulo metadata table. In the future this will allow RFiles to be fenced off by range so that we can accomplish no-chop merges. The previous CSV format of the serialized DataFileValue has been converted to JSON, and the ranges are Base64 encoded.

This is the second part of #1327 and provides a mechanism to store ranges. Note that this PR doesn't actually update Accumulo to start storing ranges yet; it just adds support for doing so. When this PR and #3246 are ready, a follow-on task would be to go back and update the merge code so that it no longer runs chop compactions and instead starts populating ranges in the metadata, etc.

The goal here is to be able to optionally store ranges for a file in the metadata. An empty range list just means the entire file is used (infinite range). This is nice because it keeps us backwards compatible with previous metadata that didn't include a range, and it means we don't have to store anything unless we need to fence the file off in the future (such as when merging).

One question I had was what the size and numEntries values inside DataFileValue should be now. Will those values stay as they currently are (size and numEntries for the entire file), or do we need to set them based on the ranges provided? I.e., do we need to try to reduce the size somehow based on the fenced-off values vs. the actual values? I don't think so. I think we can leave the values alone (so they apply to the whole file) and just get what we need out of the ranges. Those values are used for things like splits, where I would still think you'd want to take the entire file into account, but maybe someone with more background can comment on that.

A couple of other notes about this PR:

1. I made it a draft PR for now, as this should really be targeted at a 3.1 branch and right now main is just 3.0. I also wanted to get some initial feedback.
2. The DataFileValue format is now a JSON string when serialized into the metadata table, instead of a CSV. The old format is detected by catching a JSON parsing exception and falling back to the legacy format.
3. Ranges are encoded as Base64. Ranges are complex, with keys and a lot of binary values, so the simplest and most compact way to store a range was to encode the entire Range into a byte array and then encode that as Base64.
4. I left the equals/hashCode methods in DataFileValue alone for now, and they only compare the size and number of entries. I wasn't sure if we wanted to add the range comparison to equals yet; it probably depends on how we intend to use it.
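
As a rough illustration of notes 2 and 3 (this is not the code in this PR), the sketch below shows one way a value could be written as JSON with Base64-encoded ranges and read back with a fallback to the legacy CSV form. Everything in it is an assumption made for illustration: the class `FencedFileValueSketch`, the nested `FileValue` type, the JSON field names, and the legacy layout of `size,numEntries`. Ranges are treated as opaque byte arrays, and a regular expression stands in for a real JSON parser so the example stays standard-library only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FencedFileValueSketch {

  /** Hypothetical stand-in for DataFileValue with an optional list of ranges. */
  public static final class FileValue {
    public final long size;
    public final long numEntries;
    // Each range is held as opaque, already-serialized bytes.
    // An empty list means the whole file is used (infinite range).
    public final List<byte[]> ranges;

    public FileValue(long size, long numEntries, List<byte[]> ranges) {
      this.size = size;
      this.numEntries = numEntries;
      this.ranges = ranges;
    }
  }

  // Accepts only the exact shape written by encode(); a real JSON parser is more lenient.
  private static final Pattern NEW_FORMAT = Pattern.compile(
      "\\{\"size\":(\\d+),\"numEntries\":(\\d+),\"ranges\":\\[(.*)\\]\\}");

  /** Writes the value as a JSON string, with each range Base64 encoded. */
  public static String encode(FileValue v) {
    StringBuilder sb = new StringBuilder();
    sb.append("{\"size\":").append(v.size)
        .append(",\"numEntries\":").append(v.numEntries)
        .append(",\"ranges\":[");
    for (int i = 0; i < v.ranges.size(); i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append('"').append(Base64.getEncoder().encodeToString(v.ranges.get(i))).append('"');
    }
    return sb.append("]}").toString();
  }

  /** Parses the new format, throwing IllegalArgumentException if the input is not in it. */
  private static FileValue parseNew(String s) {
    Matcher m = NEW_FORMAT.matcher(s);
    if (!m.matches()) {
      throw new IllegalArgumentException("not in the new format: " + s);
    }
    List<byte[]> ranges = new ArrayList<>();
    String body = m.group(3);
    if (!body.isEmpty()) {
      // The standard Base64 alphabet has no commas or quotes, so this split is safe.
      for (String quoted : body.split(",")) {
        ranges.add(Base64.getDecoder().decode(quoted.substring(1, quoted.length() - 1)));
      }
    }
    return new FileValue(Long.parseLong(m.group(1)), Long.parseLong(m.group(2)), ranges);
  }

  /** Tries the new format first and falls back to the assumed legacy CSV on failure. */
  public static FileValue decode(String s) {
    try {
      return parseNew(s);
    } catch (IllegalArgumentException e) {
      // Assumed legacy layout "size,numEntries"; any further fields are ignored here.
      String[] parts = s.split(",");
      return new FileValue(Long.parseLong(parts[0]), Long.parseLong(parts[1]), new ArrayList<>());
    }
  }

  public static void main(String[] args) {
    List<byte[]> ranges = new ArrayList<>();
    ranges.add("row1".getBytes(StandardCharsets.UTF_8));
    String json = encode(new FileValue(10, 2, ranges));
    System.out.println(json);
    System.out.println(decode(json).ranges.size());
    System.out.println(decode("1024,50").ranges.isEmpty());
  }
}
```

Because an old-style value decodes to an empty range list, existing metadata entries would read back as "use the whole file" without needing to be rewritten, which is the backwards-compatibility property described above.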
