> On Jan. 16, 2014, 2:07 a.m., Matthew Hayes wrote:
> > src/java/datafu/pig/sampling/SimpleRandomSample.java, line 163
> > <https://reviews.apache.org/r/16911/diff/1/?file=424023#file424023line163>
> >
> >     One thing I was curious about was whether Pig uses separate instances 
> > of Initial in a case like this:
> >     
> >     sampled = FOREACH (GROUP data ALL) GENERATE SRS(data, 0.3), SRS(data, 
> > 0.4);
> >     
> >     I couldn't find this documented but it did use separate instances in 
> > the test.  It may be worth adding a test case like this.

Added a test case. Actually, Pig may generate many Initial instances, one per 
record. So I changed the random number generator to static. Otherwise, the 
generated values are not "independent".


> On Jan. 16, 2014, 2:07 a.m., Matthew Hayes wrote:
> > src/java/datafu/pig/sampling/SimpleRandomSample.java, line 223
> > <https://reviews.apache.org/r/16911/diff/1/?file=424023#file424023line223>
> >
> >     _totalNumItems isn't actually the real count right?  It's just the 
> > count for this particular map task.  Should update the comment.  I suppose 
> > if one's lower bound is low enough (relative to the true value) it's 
> > possible for this new count to be higher.

Changed the name to _localCount. Yes, this is for the case when the provided 
lower bound is not good enough.


> On Jan. 16, 2014, 2:07 a.m., Matthew Hayes wrote:
> > src/java/datafu/pig/sampling/SimpleRandomSample.java, line 410
> > <https://reviews.apache.org/r/16911/diff/1/?file=424023#file424023line410>
> >
> >     info is probably better here

Changed to System.out.


- Xiangrui


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/#review31989
-----------------------------------------------------------


On Jan. 15, 2014, 6:14 p.m., Xiangrui Meng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16911/
> -----------------------------------------------------------
> 
> (Updated Jan. 15, 2014, 6:14 p.m.)
> 
> 
> Review request for DataFu and Matthew Hayes.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> In the current implementation, SRS takes the sampling probability in the 
> constructor of the UDF, while SRSWR takes the sample size in the function 
> call. The attached patch updates SRS to make it consistent with SRSWR.
> After the patch, SRS takes a bag of items, a desired sampling probability, 
> and optionally a lower bound of the size of the population as the inputs, 
> while SRSWR takes a bag of items, a desired sample size, and optionally a 
> lower bound of the size of the population as the inputs.
> Another benefit of the patch is that user doesn't have to create multiple 
> instances of the UDF to sample with different probabilities.
> 
> It is a rewrite of the UDF, so better check the new file directly instead of 
> the diff.
> 
> 
> Diffs
> -----
> 
>   src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 
>   test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d 
> 
> Diff: https://reviews.apache.org/r/16911/diff/
> 
> 
> Testing
> -------
> 
> unit tests
> 
> 
> Thanks,
> 
> Xiangrui Meng
> 
>

Reply via email to