[ 
https://issues.apache.org/jira/browse/HBASE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-2571:
----------------------------------

    Issue Type: New Feature  (was: Sub-task)
        Parent:     (was: HBASE-2000)

> Coprocessors: Minitables
> ------------------------
>
>                 Key: HBASE-2571
>                 URL: https://issues.apache.org/jira/browse/HBASE-2571
>             Project: HBase
>          Issue Type: New Feature
>          Components: coprocessors
>            Reporter: Andrew Purtell
>
> From 
> http://turing.cs.washington.edu/papers/dataprojects-google-sigmodrecord08.pdf 
> :
> {quote}
> MINITABLES: SAMPLING BIGTABLE
> Alberto Lerner and S. Muthukrishnan
> [...] Because of [BigTable] semantics and storing scheme, skipping N rows is 
> not feasible without actually reading them. Even finding the count of rows in 
> a Bigtable at any point in time can be done only probabilistically. On the 
> bright side, since Bigtable does not provide a relational query engine, we do 
> not need to consider what are suitable sampling methods for various 
> relational operators (like joins) or take into account how sampling errors 
> compound with increasing levels of query composition. 
> _Uniform Random Sampling_
> Our sampling scheme extracts and presents a sample of a Bigtable's rows as if 
> it were a Bigtable itself, which we call a Minitable. The rationale here is 
> that code written to run against a Bigtable can run unchanged against a 
> sample thereof. Our sampling is based on a hash scheme. We pick a convenient 
> hash function that maps the rowname space into a very large keyspace (e.g., a 
> ax+b mod p function, where p is as large as 2128). The rows falling into the 
> first fp keys where f is the relative sample size (it is a fraction), would 
> belong in the sample. Formally, we pick a hash function h : R −> 0..p and if 
> h(r) E [0, fp−1], then add r to the sample. It is easy to see that the 
> expected size of the sample is f * 100% of the Bigtable rowcount independent 
> of the rowcount, and the probability that a particular row r is in the sample 
> is f, as desired. This hash-based sampling method supports maintenance of the 
> sample with each Bigtable mutation (insert, update, or deletion). Only the 
> system may forward relevant mutations from the Bigtable to the Minitable. 
> Otherwise, the latter would behave as just any other Bigtable: it could be 
> backed up and even be replicated. We are currently deploying Minitables in 
> the repository of documents that the crawling system generates. Several 
> Minitables, each with a different sample factor, allow that system to compute 
> aggregates much faster and more often.
> _Biased Sampling_
> Uniform random sampling is quite useful but some scenarios require biased 
> sampling methods. We are currently working on one such extension that we call 
> Mask Sampling. In this scheme, the decision to select a row to the sample is 
> still based on its rowname but now a user may specify a mask m over it. The 
> mask, which can be a regular expression that matches portions of a rowname, 
> is used to group rows together. Two rows belong to a same group if their 
> masks result in the same string. Mask sampling guarantees that if a group is 
> selected to the sample, that group will be adequately represented there, 
> regardless of that group's relative size.
> {quote}
> Clearly minitables can be constructed on the fly by a coprocessor attached to 
> the source table.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to