[ https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Fuchs updated ACCUMULO-652:
--------------------------------

    Fix Version/s: 1.5.0

> support block-based filtering within RFile
> ------------------------------------------
>
>                 Key: ACCUMULO-652
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>             Fix For: 1.5.0
>
>
> If we keep some statistics about the contents of each RFile block, we might
> be able to implement filters that currently require linear table scans so
> that, with high probability, they run efficiently [O(log N)]. Two use cases
> are timestamp range filtering (e.g. give me everything from last Tuesday) and
> cell-level security filtering (e.g. give me everything that I can see with my
> authorizations).
> For the timestamp range filter, we can keep the minimum and maximum
> timestamps across all keys in a block within the index entry for that block.
> For the cell-level security filter, we can keep an aggregate label, which
> could be a simplified disjunction of all of the labels in the block. The
> extra block statistics can also propagate up the index hierarchy, giving nice
> performance characteristics for finding the next matching entry in a file.
> In general, this is a heuristic technique that works well if data tends to
> cluster naturally in blocks with respect to the way it is queried. Testing
> its efficacy will require closely emulating real-world use cases -- tests
> like the continuous ingest test will not be sufficient. We will have to test
> for a few things:
> # The cost of storing the extra stats in the index is not too high.
> # The performance benefit for common use cases is significant.
> # We don't introduce any unacceptable worst-case behavior, like bloating
> the index to ridiculous proportions for any data set.
> Eventually this will all need to be exposed through the Iterator API to be
> useful, which will be another ticket.
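
The block-statistics idea in the description can be sketched roughly as follows. This is only an illustration in Python, not Accumulo code: `Stats`, `Node`, `build_leaf`, `build_index` and `scan` are hypothetical names, a plain in-memory tree stands in for the RFile index, and each entry is assumed to carry a single-token visibility label (real labels are boolean expressions), so the "aggregate label" is approximated by the set of labels seen in a block.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterator, List, Tuple

# Hypothetical entry layout: (key, timestamp, single-token visibility label).
Entry = Tuple[str, int, str]


@dataclass(frozen=True)
class Stats:
    min_ts: int
    max_ts: int
    labels: FrozenSet[str]  # stand-in for a simplified disjunction of labels

    def merge(self, other: "Stats") -> "Stats":
        # Combine child stats so they can propagate up the index hierarchy.
        return Stats(min(self.min_ts, other.min_ts),
                     max(self.max_ts, other.max_ts),
                     self.labels | other.labels)

    def may_match(self, ts_lo: int, ts_hi: int, auths: FrozenSet[str]) -> bool:
        # Heuristic test: False means "certainly nothing here",
        # True means "possibly something here" (false positives allowed).
        overlaps = self.max_ts >= ts_lo and self.min_ts <= ts_hi
        visible = bool(self.labels & auths)
        return overlaps and visible


@dataclass
class Node:
    stats: Stats
    children: list  # Entry items for a data block, Node items for an index level
    is_leaf: bool


def build_leaf(entries: List[Entry]) -> Node:
    stats = Stats(min(e[1] for e in entries),
                  max(e[1] for e in entries),
                  frozenset(e[2] for e in entries))
    return Node(stats, list(entries), True)


def build_index(children: List[Node]) -> Node:
    stats = children[0].stats
    for child in children[1:]:
        stats = stats.merge(child.stats)
    return Node(stats, children, False)


def scan(node: Node, ts_lo: int, ts_hi: int, auths: FrozenSet[str],
         blocks_read: List[Node]) -> Iterator[Entry]:
    # Skip a whole subtree or block when its stats rule out any match.
    if not node.stats.may_match(ts_lo, ts_hi, auths):
        return
    if node.is_leaf:
        blocks_read.append(node)  # record which data blocks were actually read
        for key, ts, label in node.children:
            if ts_lo <= ts <= ts_hi and label in auths:
                yield key, ts, label
    else:
        for child in node.children:
            yield from scan(child, ts_lo, ts_hi, auths, blocks_read)
```

A block is only read when its stats admit a possible match, so the benefit depends on how well the data clusters: a subtree whose merged stats are too broad is still descended into even if none of its blocks end up matching, which is the heuristic behavior the description warns about.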