[ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470674#comment-13470674 ]
Keith Turner commented on ACCUMULO-790:
---------------------------------------

For timestamps we could experiment with subtracting the previous timestamp and writing out the result using writeVLong().

> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: compression, file, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within
> dimensions of the key. However, we could probably do better, since we know
> that our data is sorted lexicographically, we can make a reasonable
> assumption that we will get better compression if we only store the fact that
> a key (or portion of a key) has a common prefix with the previous key, even
> if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to
> store flags that show whether an element of the key is exactly the same as
> the previous, or if it is different. We can change the semantics of these
> flags to store three states per element of the key: exact match as previous
> key, has a common prefix as previous key, no relative key compression. If we
> don't want to add a byte to store 2 bits for 3 states per element, we can
> just take the ordinal value of the unused 7 bits of the delete flag field and
> map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the
> current key when reading the RFile, we can interpret the first bytes of that
> element as a VInt expressing the length of the common prefix relative to the
> previous key's same element.
> Because this will add at least one byte to the length of that element, we
> will not want to use the common prefix compression if the common prefix is
> less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common),
> we'd select the no compression flag for that element.
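
The two ideas in this thread (a per-element common-prefix flag with a 2-byte minimum, and writing timestamps as a delta from the previous timestamp) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the RFile implementation: the class name `RelativeKeySketch`, the flag constants, and the one-flag-byte-per-element layout are hypothetical, and the hand-written varint is a stand-in for Hadoop's `writeVInt`/`writeVLong` that does not reproduce their wire format.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch; names and layout are assumptions, not Accumulo's RFile code.
final class RelativeKeySketch {
    // The three per-element states proposed in the issue (values are made up).
    static final int NO_COMPRESSION = 0;
    static final int SAME_AS_PREVIOUS = 1;
    static final int COMMON_PREFIX = 2;

    // The issue suggests skipping prefix compression below 2 shared bytes.
    static final int MIN_PREFIX = 2;

    // Number of leading bytes that prev and cur have in common.
    static int commonPrefixLength(byte[] prev, byte[] cur) {
        int max = Math.min(prev.length, cur.length);
        int i = 0;
        while (i < max && prev[i] == cur[i]) {
            i++;
        }
        return i;
    }

    // Pick one of the three states for a single key element.
    static int chooseFlag(byte[] prev, byte[] cur) {
        int p = commonPrefixLength(prev, cur);
        if (p == prev.length && p == cur.length) {
            return SAME_AS_PREVIOUS;
        }
        return p >= MIN_PREFIX ? COMMON_PREFIX : NO_COMPRESSION;
    }

    // Unsigned varint, 7 payload bits per byte (stand-in for writeVInt/writeVLong).
    static void writeUnsignedVarLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Encode one element relative to the same element of the previous key.
    // For simplicity a whole byte holds the flag here; the issue proposes
    // packing the flags into unused bits of the delete flag byte instead.
    static byte[] encodeElement(byte[] prev, byte[] cur) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int flag = chooseFlag(prev, cur);
        out.write(flag);
        if (flag == COMMON_PREFIX) {
            int p = commonPrefixLength(prev, cur);
            writeUnsignedVarLong(out, p);              // shared prefix length
            writeUnsignedVarLong(out, cur.length - p); // remaining suffix length
            out.write(cur, p, cur.length - p);         // suffix bytes only
        } else if (flag == NO_COMPRESSION) {
            writeUnsignedVarLong(out, cur.length);
            out.write(cur, 0, cur.length);
        }
        // SAME_AS_PREVIOUS needs no payload beyond the flag.
        return out.toByteArray();
    }

    // Write a timestamp as the difference from the previous one. The delta is
    // zigzag-mapped so that small negative differences also stay short.
    static void writeTimestampDelta(ByteArrayOutputStream out, long prevTs, long ts) {
        long delta = ts - prevTs;
        long zigzag = (delta << 1) ^ (delta >> 63);
        writeUnsignedVarLong(out, zigzag);
    }
}
```

With these assumptions, an element `row_0002` following `row_0001` shares 7 bytes and encodes in 4 bytes (flag, prefix length, suffix length, one suffix byte) rather than the 10 bytes needed without compression. Two timestamps that differ by a few milliseconds encode as a single delta byte.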