[
https://issues.apache.org/jira/browse/PIG-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640042#comment-13640042
]
Vicki Fu commented on PIG-3224:
-------------------------------
Hi Gianmarco,
Please correct me if my understanding is wrong.
The algorithm for reservoir sampling should be as follows (non-big-data version):
Assume that you have enough memory to store k elements. Store the first k elements
in an array in memory. When you receive the nth element (where n > k),
generate a random integer r, uniform between 1 and n inclusive. If r > k, discard the
nth element. Otherwise, replace the rth element in the array with the nth element. This
approach ensures that at any stage the array contains k elements that
are selected uniformly at random from the input elements received so far.
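
The steps above can be sketched in Python as follows. This is only a minimal
illustration of the single-machine version, not Pig code; the function name
reservoir_sample and the rng parameter are made up for the example.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Store the first k elements as they arrive.
            reservoir.append(item)
        else:
            # Uniform integer in [1, n], both ends inclusive.
            r = rng.randint(1, n)
            if r <= k:
                # Replace the r-th element (1-based) with the n-th element.
                reservoir[r - 1] = item
            # Otherwise (r > k) the n-th element is discarded.
    return reservoir

sample = reservoir_sample(range(1000), 10)
```

If the input has fewer than k elements, the sketch simply returns all of them.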
For big data, where the input data M is split into N blocks on
different nodes, we can run the algorithm above on each block in parallel, so it
should be the same. That should keep each element evenly sampled.
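
The comment above does not say how the per-block samples would be combined.
One possible way, sketched below purely as an assumption, is to merge two
samples while weighting each pick by how many elements of each block are
still unchosen, so that blocks of different sizes are treated fairly. The
name merge_reservoirs is hypothetical, and random.sample stands in for the
per-block reservoir step.

```python
import random

def merge_reservoirs(sample_a, n_a, sample_b, n_b, k, rng=random):
    """Combine two uniform samples into one uniform sample of size k.

    sample_a and sample_b are assumed to be uniform samples of size
    min(k, n) drawn from blocks that held n_a and n_b elements.
    """
    a, b = list(sample_a), list(sample_b)
    rng.shuffle(a)
    rng.shuffle(b)
    merged = []
    while len(merged) < k and (n_a + n_b) > 0:
        # Pick from block A with probability n_a / (n_a + n_b), where the
        # counts track the elements of each block that are not yet chosen.
        if rng.random() * (n_a + n_b) < n_a:
            merged.append(a.pop())
            n_a -= 1
        else:
            merged.append(b.pop())
            n_b -= 1
    return merged

# Two blocks of different sizes, each sampled independently.
block_a = list(range(0, 1000))
block_b = list(range(1000, 1100))
k = 10
sample_a = random.sample(block_a, k)   # stand-in for a per-block reservoir
sample_b = random.sample(block_b, k)
merged = merge_reservoirs(sample_a, len(block_a), sample_b, len(block_b), k)
```

With N blocks, the same merge could be applied pairwise, carrying the
combined element count forward at each step.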
Reference: http://en.wikipedia.org/wiki/Reservoir_sampling
> Reservoir sampling
> ------------------
>
> Key: PIG-3224
> URL: https://issues.apache.org/jira/browse/PIG-3224
> Project: Pig
> Issue Type: New Feature
> Reporter: Gianmarco De Francisci Morales
> Labels: gsoc2013
>
> Implement a reservoir sampling option, or make it the default (
> http://en.wikipedia.org/wiki/Reservoir_sampling ) in Pig's SAMPLE operator.