[ https://issues.apache.org/jira/browse/DATAFU-3?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Hayes closed DATAFU-3.
------------------------------
    Resolution: Won't Do

Closing this as it is quite old and there have been no updates.

> Bootstrap sum UDF
> -----------------
>
>                 Key: DATAFU-3
>                 URL: https://issues.apache.org/jira/browse/DATAFU-3
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Josh Wills
>            Priority: Major
>
> There was a Sawzall table called bootstrapsum that I used to find handy for
> some of the analysis work I did at teh goog:
> http://szl.googlecode.com/svn/trunk/src/emitters/szlbootstrapsum.cc
> It would be nice to have it back again in the Hadoop ecosystem. There was a
> good blog post about the utility of Poisson bootstraps for random forests
> here:
> http://blog.cloudera.com/blog/2013/02/how-to-resample-from-a-large-data-set-in-parallel-with-r-on-hadoop/
> ...but it's useful in all sorts of nerdy stats contexts (e.g., computing
> confidence intervals for experiments.) I'm open to the particular structure
> of the function; it could either have:
> 1) A constructor that took in the number of bootstrap samples to create and
> then a call() method that took in a counting variable and a weighting
> variable, or
> 2) Three args to the call method (num samples, counting variable, and
> weighting variable, in some order.)
> The return type would be a bag of tuples, (index: int, sum: T) where the type
> of the sum would depend on the input type of the counting variable. index = 0
> would always be the actual sum computed, while the rest of the indices would
> be numbered 1..numSamples for each of the different bootstrap samples.
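For readers unfamiliar with the idea, the requested behaviour can be sketched roughly as follows. This is not DataFu code (the issue was closed without an implementation) and not a port of the Sawzall emitter. It is a minimal Python illustration, and several things in it are assumptions: the names `bootstrap_sum` and `poisson1` are hypothetical, each row is resampled with an independent Poisson(1) draw per bootstrap sample, and the weighting variable is treated as a plain multiplier on the counting variable, since the issue does not say how the weight is meant to be used.

```python
import math
import random


def poisson1(rng):
    """Draw one Poisson(1) variate using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def bootstrap_sum(rows, num_samples, seed=None):
    """Return [(index, sum), ...] for rows of (count, weight) pairs.

    Index 0 holds the actual sum; indices 1..num_samples hold the
    Poisson-bootstrap resampled sums.
    ASSUMPTION: the weight simply multiplies the count. The original
    issue leaves the role of the weighting variable unspecified.
    """
    rng = random.Random(seed)
    sums = [0] * (num_samples + 1)
    for count, weight in rows:
        value = count * weight
        sums[0] += value  # the actual, unresampled sum
        for i in range(1, num_samples + 1):
            # Each bootstrap sample includes this row Poisson(1) times.
            sums[i] += poisson1(rng) * value
    return list(enumerate(sums))


if __name__ == "__main__":
    rows = [(1, 1), (2, 3), (4, 1)]
    for index, total in bootstrap_sum(rows, num_samples=5, seed=42):
        print(index, total)
```

Because every row draws its resampling counts independently, the loop needs only a single pass over the data and no global shuffle, which is what makes the Poisson variant attractive for MapReduce-style jobs. Under the same assumptions, a rough confidence interval for the sum could be read off the percentiles of the values at indices 1..numSamples.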