-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/#review31989
-----------------------------------------------------------


Overall this looks pretty good to me.  The one open question I have is about 
breaking API compatibility.  If we are going to release 2.0 with a 
org.apache.datafu packaging change then this seems okay to me.


src/java/datafu/pig/sampling/SimpleRandomSample.java
<https://reviews.apache.org/r/16911/#comment60743>

    One thing I was curious about was whether Pig uses separate instances of 
Initial in a case like this:
    
    sampled = FOREACH (GROUP data ALL) GENERATE SRS(data, 0.3), SRS(data, 0.4);
    
    I couldn't find this documented but it did use separate instances in the 
test.  It may be worth adding a test case like this.



src/java/datafu/pig/sampling/SimpleRandomSample.java
<https://reviews.apache.org/r/16911/#comment60741>

    _totalNumItems isn't actually the real count right?  It's just the count 
for this particular map task.  Should update the comment.  I suppose if one's 
lower bound is low enough (relative to the true value) it's possible for this 
new count to be higher.



src/java/datafu/pig/sampling/SimpleRandomSample.java
<https://reviews.apache.org/r/16911/#comment60745>

    info is probably better here


- Matthew Hayes


On Jan. 15, 2014, 6:14 p.m., Xiangrui Meng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16911/
> -----------------------------------------------------------
> 
> (Updated Jan. 15, 2014, 6:14 p.m.)
> 
> 
> Review request for DataFu and Matthew Hayes.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> In the current implementation, SRS takes the sampling probability in the 
> constructor of the UDF, while SRSWR takes the sample size in the function 
> call. The attached patch updates SRS to make it consistent with SRSWR.
> After the patch, SRS takes a bag of items, a desired sampling probability, 
> and optionally a lower bound of the size of the population as the inputs, 
> while SRSWR takes a bag of items, a desired sample size, and optionally a 
> lower bound of the size of the population as the inputs.
> Another benefit of the patch is that user doesn't have to create multiple 
> instances of the UDF to sample with different probabilities.
> 
> It is a rewrite of the UDF, so better check the new file directly instead of 
> the diff.
> 
> 
> Diffs
> -----
> 
>   src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 
>   test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d 
> 
> Diff: https://reviews.apache.org/r/16911/diff/
> 
> 
> Testing
> -------
> 
> unit tests
> 
> 
> Thanks,
> 
> Xiangrui Meng
> 
>

Reply via email to