You rock, Chris. I was hoping I would not have to re-invent the "code".
This is exactly what I am looking for. Now all I need is a working single-node version of an Accumulo VM to try it out on ... Cloudera CDH4.3 doesn't work with Accumulo 1.4.3 (thanks Bill!), so I am looking for something simple to work with that runs out of the box :-)

On Wed, Feb 5, 2014 at 5:44 AM, Chris Bennight [via Apache Accumulo] wrote:

> If it's for the input to some algorithm (machine learning, etc.) I'm
> assuming it *is* important to have that 25% be representative of the
> entire population.
>
> HBase implements a simple strategy with a RandomRowFilter [1] that could
> trivially be adapted to an Accumulo filter (Iterator). The caveat being
> it's going to be essentially a full table scan each time - set a
> percentage, and then randomly choose if each key is accepted or not.
> Note that if each of your "values" (i.e. the granularity you want to
> accept or reject groups on) is more than one key value, you will want
> to use something like the WholeRowIterator first to aggregate them,
> then test for accept/reject. You probably don't want to use the
> WholeRowIterator as is, as you would want to test/reject on the full
> key, and only aggregate if it passes - but you can use it as a pattern.
>
> If you want something faster then I think you are going to have to
> generate and keep some population statistics / summaries on ingest, and
> query those. This will add more sampling error based on the granularity
> of your summaries - but you should be able to quantify that with
> standard error propagation.
>
> [1] https://github.com/apache/hbase/blob/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/RandomRowFilter.java
>
> On Tue, Feb 4, 2014 at 10:39 PM, cprigano <[hidden email]> wrote:
>
> > Good questions all!
> > I want to start by just taking a percentile of rows in a table,
> > similar to the split used to construct training, cross-validation
> > and testing sets. I am a machine learning person and want to be able
> > to take, say, a 25% random sample of rows in a table (I may not know
> > the size, and the percentile should be settable). Starting with the
> > easiest assumption, that all rows are the same "type", will get
> > things started. I can then move to more exotic scenarios. Accumulo is
> > a new nut for me to crack and I would very much like your thoughts.
> > Thanks mate!
> >
> > On Tue, Feb 4, 2014 at 7:27 PM, Chris Bennight [via Apache Accumulo]
> > <[hidden email]> wrote:
> >
> > > I'm assuming you want a random selection of entries in Accumulo -
> > > so say a random selection of keys/values?
> > >
> > > How are your keys formatted (conceptually is fine); is there some
> > > sort of regularity to them? (I.e. can you calculate ahead of time a
> > > random distribution of keys without validating which keys are
> > > present)?
> > >
> > > If you can't calculate the key distribution ahead of time, are you
> > > keeping any statistics (or could you) on ingest (cardinality,
> > > distribution, etc.) - and finally, how rigorous and performant do
> > > you need this random sampling to be? Do you just want
> > > representative data, or are you trying to do something like
> > > BlinkDB [1] (allow people to specify confidence intervals on
> > > queries, and only sample enough data to meet the requisite
> > > uncertainty requirements)?
> > >
> > > [1] http://blinkdb.org/
> > >
> > > Chris
> > >
> > > On Sat, Feb 1, 2014 at 3:58 PM, cprigano <[hidden email]> wrote:
> > >
> > > > I am looking at writing an Accumulo iterator to return a random
> > > > sample of a percentile of a table.
> > > >
> > > > I would appreciate any suggestions.
> > > >
> > > > Thanks,
> > > >
> > > > Chris
> > > >
> > > > --
> > > > View this message in context:
> > > > http://apache-accumulo.1065345.n5.nabble.com/Accumulo-iterator-to-return-a-random-sample-of-a-percentile-of-a-table-tp7354.html
> > > > Sent from the Developers mailing list archive at Nabble.com.

--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Accumulo-iterator-to-return-a-random-sample-of-a-percentile-of-a-table-tp7354p7417.html
Sent from the Developers mailing list archive at Nabble.com.
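
For anyone following the thread, here is a rough sketch of the RandomRowFilter-style strategy Chris describes: walk the entries in row order, make one random accept/reject draw per row, and only gather the row's entries if the draw passes. This is plain Python over an in-memory list, not Accumulo code; `Entry` and `sample_rows` are hypothetical names made up for illustration, and a real version would presumably be a Java iterator running server-side. It still reads every entry, so it is effectively a full scan, as noted above.

```python
import random
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for an Accumulo entry: (row, column, value).
Entry = Tuple[str, str, str]


def sample_rows(
    entries: Iterable[Entry], fraction: float, rng: random.Random
) -> Iterator[List[Entry]]:
    """Yield whole rows, keeping each row with probability `fraction`.

    Assumes `entries` arrive sorted by row, the way a table scan
    would return them.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    for _, group in groupby(entries, key=lambda e: e[0]):
        # One draw per row: decide first, and only materialize the
        # row's entries if it is accepted.
        if rng.random() < fraction:
            yield list(group)
```

Calling `sample_rows(entries, 0.25, random.Random())` would keep roughly a quarter of the rows without needing to know the table size in advance. The number of rows returned varies from run to run (each row is an independent draw), so the sample is only approximately 25%.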
