> On Jan. 16, 2014, 2:07 a.m., Matthew Hayes wrote: > > src/java/datafu/pig/sampling/SimpleRandomSample.java, line 163 > > <https://reviews.apache.org/r/16911/diff/1/?file=424023#file424023line163> > > > > One thing I was curious about was whether Pig uses separate instances > > of Initial in a case like this: > > > > sampled = FOREACH (GROUP data ALL) GENERATE SRS(data, 0.3), SRS(data, > > 0.4); > > > > I couldn't find this documented but it did use separate instances in > > the test. It may be worth adding a test case like this.
Added a test case. Actually, Pig may generate many Initial instances, one per record. So I changed the random number generator to static. Otherwise, the generated values are not "independent". > On Jan. 16, 2014, 2:07 a.m., Matthew Hayes wrote: > > src/java/datafu/pig/sampling/SimpleRandomSample.java, line 223 > > <https://reviews.apache.org/r/16911/diff/1/?file=424023#file424023line223> > > > > _totalNumItems isn't actually the real count right? It's just the > > count for this particular map task. Should update the comment. I suppose > > if one's lower bound is low enough (relative to the true value) it's > > possible for this new count to be higher. Changed the name to _localCount. Yes, this is for the case when the provided lower bound is not good enough. > On Jan. 16, 2014, 2:07 a.m., Matthew Hayes wrote: > > src/java/datafu/pig/sampling/SimpleRandomSample.java, line 410 > > <https://reviews.apache.org/r/16911/diff/1/?file=424023#file424023line410> > > > > info is probably better here Changed to System.out. - Xiangrui ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16911/#review31989 ----------------------------------------------------------- On Jan. 15, 2014, 6:14 p.m., Xiangrui Meng wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/16911/ > ----------------------------------------------------------- > > (Updated Jan. 15, 2014, 6:14 p.m.) > > > Review request for DataFu and Matthew Hayes. > > > Repository: datafu > > > Description > ------- > > In the current implementation, SRS takes the sampling probability in the > constructor of the UDF, while SRSWR takes the sample size in the function > call. The attached patch updates SRS to make it consistent with SRSWR. > After the patch, SRS takes a bag of items, a desired sampling probability, > and optionally a lower bound of the size of the population as the inputs, > while SRSWR takes a bag of items, a desired sample size, and optionally a > lower bound of the size of the population as the inputs. > Another benefit of the patch is that user doesn't have to create multiple > instances of the UDF to sample with different probabilities. > > It is a rewrite of the UDF, so better check the new file directly instead of > the diff. > > > Diffs > ----- > > src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 > test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d > > Diff: https://reviews.apache.org/r/16911/diff/ > > > Testing > ------- > > unit tests > > > Thanks, > > Xiangrui Meng > >
