----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16911/#review31989 -----------------------------------------------------------
Overall this looks pretty good to me. The one open question I have is about breaking API compatibility. If we are going to release 2.0 with a org.apache.datafu packaging change then this seems okay to me. src/java/datafu/pig/sampling/SimpleRandomSample.java <https://reviews.apache.org/r/16911/#comment60743> One thing I was curious about was whether Pig uses separate instances of Initial in a case like this: sampled = FOREACH (GROUP data ALL) GENERATE SRS(data, 0.3), SRS(data, 0.4); I couldn't find this documented but it did use separate instances in the test. It may be worth adding a test case like this. src/java/datafu/pig/sampling/SimpleRandomSample.java <https://reviews.apache.org/r/16911/#comment60741> _totalNumItems isn't actually the real count right? It's just the count for this particular map task. Should update the comment. I suppose if one's lower bound is low enough (relative to the true value) it's possible for this new count to be higher. src/java/datafu/pig/sampling/SimpleRandomSample.java <https://reviews.apache.org/r/16911/#comment60745> info is probably better here - Matthew Hayes On Jan. 15, 2014, 6:14 p.m., Xiangrui Meng wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/16911/ > ----------------------------------------------------------- > > (Updated Jan. 15, 2014, 6:14 p.m.) > > > Review request for DataFu and Matthew Hayes. > > > Repository: datafu > > > Description > ------- > > In the current implementation, SRS takes the sampling probability in the > constructor of the UDF, while SRSWR takes the sample size in the function > call. The attached patch updates SRS to make it consistent with SRSWR. > After the patch, SRS takes a bag of items, a desired sampling probability, > and optionally a lower bound of the size of the population as the inputs, > while SRSWR takes a bag of items, a desired sample size, and optionally a > lower bound of the size of the population as the inputs. > Another benefit of the patch is that user doesn't have to create multiple > instances of the UDF to sample with different probabilities. > > It is a rewrite of the UDF, so better check the new file directly instead of > the diff. > > > Diffs > ----- > > src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 > test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d > > Diff: https://reviews.apache.org/r/16911/diff/ > > > Testing > ------- > > unit tests > > > Thanks, > > Xiangrui Meng > >
