[ https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289671#comment-14289671 ]
Adam Fuchs commented on ACCUMULO-652:
-------------------------------------

At this point I don't have any plans to complete this for 1.7. The best option
is probably to find someone to hand this off to.

Current status is that there is an ACCUMULO-652 branch with a prototype
implementation. This will likely need another iteration on the API, as well as
additional performance and correctness testing. There are also some ugly parts
with the RFile configuration that may lead to a bigger RFile configuration
improvement project.

> support block-based filtering within RFile
> ------------------------------------------
>
>                 Key: ACCUMULO-652
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>             Fix For: 1.7.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> If we keep some stats about what is in an RFile block, we might be able to
> efficiently [O(log N)], with high probability, implement filters that
> currently require linear table scans. Two use cases of this include timestamp
> range filtering (e.g. give me everything from last Tuesday) and cell-level
> security filtering (e.g. give me everything that I can see with my
> authorizations).
> For the timestamp range filter, we can keep minimum and maximum timestamps
> across all keys used in a block within the index entry for that block. For
> the cell-level security filter, we can keep an aggregate label. This could be
> done using a simplified disjunction of all of the labels in the block. The
> extra block statistics information can propagate up the index hierarchy as
> well, giving nice performance characteristics for finding the next matching
> entry in a file.
> In general, this is a heuristic technique that is good if data tends to
> naturally cluster in blocks with respect to the way it is queried. Testing
> its efficacy will require closely emulating real-world use cases -- tests
> like the continuous ingest test will not be sufficient. We will have to test
> for a few things:
> # The cost of storing the extra stats in the index is not too high.
> # The performance benefit for common use cases is significant.
> # We don't introduce any unacceptable worst-case behavior, like bloating
> the index to ridiculous proportions for any data set.
> Eventually this will all need to be exposed through the Iterator API to be
> useful, which will be another ticket.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
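
To make the timestamp-range idea in the description concrete, here is a minimal sketch in Python. It is not taken from the ACCUMULO-652 branch and does not use any Accumulo API: the names IndexNode, build_index and scan_range, the (row, timestamp) entry shape, and the block_size and fanout parameters are all hypothetical. It assumes each index entry stores the minimum and maximum timestamp of everything beneath it, and that a scan skips any subtree whose range cannot overlap the query.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Hypothetical simplified entry: (row, timestamp). Real RFile keys carry more.
Entry = Tuple[str, int]


@dataclass
class IndexNode:
    """Index entry holding min/max timestamp stats for its whole subtree."""
    min_ts: int
    max_ts: int
    children: List["IndexNode"] = field(default_factory=list)
    entries: List[Entry] = field(default_factory=list)  # non-empty only for data blocks


def build_index(entries: List[Entry], block_size: int = 2, fanout: int = 2) -> IndexNode:
    """Cut sorted entries into blocks, then propagate stats up the hierarchy."""
    if not entries:
        raise ValueError("need at least one entry")
    # Leaf level: one node per data block, with that block's min/max timestamp.
    level = []
    for i in range(0, len(entries), block_size):
        block = entries[i:i + block_size]
        stamps = [ts for _, ts in block]
        level.append(IndexNode(min(stamps), max(stamps), entries=block))
    # Upper levels: each parent aggregates the stats of its children.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append(IndexNode(min(c.min_ts for c in group),
                                     max(c.max_ts for c in group),
                                     children=group))
        level = parents
    return level[0]


def scan_range(node: IndexNode, lo: int, hi: int,
               blocks_read: List[IndexNode]) -> Iterator[Entry]:
    """Yield entries with lo <= timestamp <= hi, skipping non-overlapping subtrees."""
    if node.max_ts < lo or node.min_ts > hi:
        return  # nothing under this index entry can match
    if not node.children:
        blocks_read.append(node)  # record which data blocks were actually opened
        for entry in node.entries:
            if lo <= entry[1] <= hi:
                yield entry
    else:
        for child in node.children:
            yield from scan_range(child, lo, hi, blocks_read)


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("c", 3), ("d", 4),
            ("e", 100), ("f", 101), ("g", 5), ("h", 6)]
    root = build_index(data)
    opened: List[IndexNode] = []
    print(list(scan_range(root, 100, 200, opened)), "blocks read:", len(opened))
```

In this toy data set only one of the four blocks overlaps the query range, so the other three are never opened. If timestamps were spread evenly across blocks, every block would overlap and nothing would be skipped, which is the clustering caveat the description raises.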