[ 
https://issues.apache.org/jira/browse/PIG-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640042#comment-13640042
 ] 

Vicki Fu commented on PIG-3224:
-------------------------------

Hi Gianmarco,
Please correct me if my understanding is wrong.

The algorithm for reservoir sampling (not the big-data version) should be:
Assume that you have the memory to store k elements. Store the first k elements 
in memory in an array. Now when you receive the nth element (where n > k), 
generate a random integer r between 1 and n, inclusive. If r > k, discard the 
nth element. Otherwise, replace the rth element in the array with the nth 
element. This approach ensures that at any stage your array contains k elements 
that are selected uniformly at random from the input elements received so far.
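
To make the steps above concrete, here is a minimal single-machine sketch in 
Python. The function name reservoir_sample and its parameters are only 
illustrative and are not taken from Pig's code base.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Keep k items chosen uniformly at random from an iterable.

    Illustrative helper only; the name and signature are assumptions.
    """
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Store the first k elements as they arrive.
            reservoir.append(item)
        else:
            # r is drawn uniformly from 1..n, both ends inclusive.
            r = rng.randint(1, n)
            if r <= k:
                # Replace the rth stored element (1-based) with the nth one.
                reservoir[r - 1] = item
            # If r > k, the nth element is discarded.
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1000), 5))
```

If the input has fewer than k elements, the sketch simply returns all of them.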

When we need to consider big data, the input data M is split into N blocks 
across different nodes, and we can run the algorithm above on each block in 
parallel. So it should be the same: each element will still be kept with even 
probability.
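
A rough sketch of how the parallel case could be put together, in Python. Each 
block is sampled independently, and the per-block reservoirs are then merged. 
The merge step, which weights each block by how many of its elements have not 
been drawn yet, is an assumption on my part and is not taken from Pig; without 
some weighting, blocks of different sizes would not contribute evenly. The 
names sample_block and merge_reservoirs are hypothetical.

```python
import random


def sample_block(block, k, rng):
    """Reservoir-sample one block; return (reservoir, elements_seen)."""
    reservoir, n = [], 0
    for n, item in enumerate(block, start=1):
        if n <= k:
            reservoir.append(item)
        else:
            r = rng.randint(1, n)  # inclusive on both ends
            if r <= k:
                reservoir[r - 1] = item
    return reservoir, n


def merge_reservoirs(parts, k, rng):
    """Merge per-block (reservoir, elements_seen) pairs into one sample of k.

    A block is picked with probability proportional to the number of its
    elements not yet drawn, then one item is taken from its reservoir.
    """
    pools = [list(res) for res, _ in parts]
    remaining = [seen for _, seen in parts]
    merged = []
    while len(merged) < k and sum(remaining) > 0:
        i = rng.choices(range(len(pools)), weights=remaining)[0]
        merged.append(pools[i].pop(rng.randrange(len(pools[i]))))
        remaining[i] -= 1
    return merged


if __name__ == "__main__":
    rng = random.Random()
    blocks = [range(0, 10), range(10, 40), range(40, 43)]  # N = 3 blocks
    parts = [sample_block(b, 5, rng) for b in blocks]      # one per node
    print(merge_reservoirs(parts, 5, rng))
```

In this sketch each node only needs to send back its reservoir and the count 
of elements it saw, so the merge stays cheap compared with the sampling.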



reference: http://en.wikipedia.org/wiki/Reservoir_sampling
                
> Reservoir sampling
> ------------------
>
>                 Key: PIG-3224
>                 URL: https://issues.apache.org/jira/browse/PIG-3224
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>              Labels: gsoc2013
>
> Implement a reservoir sampling option, or make it the default ( 
> http://en.wikipedia.org/wiki/Reservoir_sampling ) in Pig's SAMPLE operator.
