[ https://issues.apache.org/jira/browse/HBASE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Purtell updated HBASE-2571: ---------------------------------- Issue Type: New Feature (was: Sub-task) Parent: (was: HBASE-2000) > Coprocessors: Minitables > ------------------------ > > Key: HBASE-2571 > URL: https://issues.apache.org/jira/browse/HBASE-2571 > Project: HBase > Issue Type: New Feature > Components: coprocessors > Reporter: Andrew Purtell > > From > http://turing.cs.washington.edu/papers/dataprojects-google-sigmodrecord08.pdf > : > {quote} > MINITABLES: SAMPLING BIGTABLE > Alberto Lerner and S. Muthukrishnan > [...] Because of [BigTable] semantics and storing scheme, skipping N rows is > not feasible without actually reading them. Even finding the count of rows in > a Bigtable at any point in time can be done only probabilistically. On the > bright side, since Bigtable does not provide a relational query engine, we do > not need to consider what are suitable sampling methods for various > relational operators (like joins) or take into account how sampling errors > compound with increasing levels of query composition. > _Uniform Random Sampling_ > Our sampling scheme extracts and presents a sample of a Bigtable's rows as if > it were a Bigtable itself, which we call a Minitable. The rationale here is > that code written to run against a Bigtable can run unchanged against a > sample thereof. Our sampling is based on a hash scheme. We pick a convenient > hash function that maps the rowname space into a very large keyspace (e.g., a > ax+b mod p function, where p is as large as 2128). The rows falling into the > first fp keys where f is the relative sample size (it is a fraction), would > belong in the sample. Formally, we pick a hash function h : R −> 0..p and if > h(r) E [0, fp−1], then add r to the sample. It is easy to see that the > expected size of the sample is f * 100% of the Bigtable rowcount independent > of the rowcount, and the probability that a particular row r is in the sample > is f, as desired. This hash-based sampling method supports maintenance of the > sample with each Bigtable mutation (insert, update, or deletion). Only the > system may forward relevant mutations from the Bigtable to the Minitable. > Otherwise, the latter would behave as just any other Bigtable: it could be > backed up and even be replicated. We are currently deploying Minitables in > the repository of documents that the crawling system generates. Several > Minitables, each with a different sample factor, allow that system to compute > aggregates much faster and more often. > _Biased Sampling_ > Uniform random sampling is quite useful but some scenarios require biased > sampling methods. We are currently working on one such extension that we call > Mask Sampling. In this scheme, the decision to select a row to the sample is > still based on its rowname but now a user may specify a mask m over it. The > mask, which can be a regular expression that matches portions of a rowname, > is used to group rows together. Two rows belong to a same group if their > masks result in the same string. Mask sampling guarantees that if a group is > selected to the sample, that group will be adequately represented there, > regardless of that group's relative size. > {quote} > Clearly minitables can be constructed on the fly by a coprocessor attached to > the source table. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira