[ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236576#comment-16236576 ]

OlgaK commented on DATAFU-63:
-----------------------------

Hello everybody,
as discussed with [~eyal], it'd be better to discuss the issue on this board, so 
everybody involved can take part. First I'd like to clarify the task. 
We need a function which returns a random sample of size `k` 
(a number of elements) or `p` (a fraction or percentage) of a set of size `n`.  
Does that sound like the aim of this ticket?
The first question: should this function return a uniform random sample, or is 
it desirable that the function also handle samples drawn from other 
distributions, specified as an extra parameter, not just the uniform one? 
That would be substantially more complicated, taking map/reduce 
or parallel processing into account. 
If we just want k (or p·n) elements from a set, why complicate 
simple stuff? 
```
// p and k are related: given p, one can always get k as k = ceil(p * n)
Set<Integer> sampledIdx = new HashSet<>();
Random rnd = new Random();
while (sampledIdx.size() < k) {
  // sample positions, not values: a Set of values would collapse duplicates
  sampledIdx.add(rnd.nextInt(n)); // repeated indices are rejected by the Set
}
```
In case of parallel processing, when `k % number_of_parallel_threads != 0`, round 
up in each thread, then eliminate the excess from the sample in the reducer.
Am I missing something?
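The reducer-side trimming step could look like this minimal Java sketch (class and method names are hypothetical, not DataFu code): each mapper oversamples, and the reducer cuts the merged list back to exactly k items uniformly at random:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical reducer-side helper: each of m mappers oversamples
// ceil(k / m) items; the reducer trims the merged sample back to k.
public class TrimMergedSample {
    // Keep exactly k of the merged items, each k-subset equally likely.
    static <T> List<T> trimToK(List<T> merged, int k, Random rnd) {
        List<T> copy = new ArrayList<>(merged);
        Collections.shuffle(copy, rnd); // uniform random order
        return new ArrayList<>(copy.subList(0, Math.min(k, copy.size())));
    }

    public static void main(String[] args) {
        // Simulate 3 mappers that each drew ceil(10 / 3) = 4 items.
        List<Integer> merged = new ArrayList<>();
        for (int i = 0; i < 12; i++) merged.add(i);
        System.out.println(trimToK(merged, 10, new Random(42)).size()); // 10
    }
}
```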
To save processing time, check the borders (0, n) first; in those cases no random 
number is required and the result is immediate. In case `k > n/2`, instead 
of adding elements, start from `sample = input`
and subtract `n-k` elements.
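Putting the border checks and the complement trick together, a minimal single-machine sketch (all names hypothetical, assuming sampling without replacement) could be:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of the border checks and the complement trick:
// for k == 0 or k == n the answer is immediate, and for k > n/2 it is
// cheaper to draw the n-k elements to remove than the k to keep.
public class SampleByCount {
    static <T> List<T> sample(List<T> input, int k, Random rnd) {
        int n = input.size();
        if (k <= 0) return new ArrayList<>();      // border: empty sample
        if (k >= n) return new ArrayList<>(input); // border: whole input
        boolean complement = k > n / 2;
        int draws = complement ? n - k : k;
        Set<Integer> picked = new HashSet<>();
        while (picked.size() < draws) {
            picked.add(rnd.nextInt(n)); // duplicate indices rejected by the Set
        }
        List<T> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // keep picked indices normally, or unpicked ones in complement mode
            if (picked.contains(i) != complement) out.add(input.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 10; i++) input.add(i);
        System.out.println(sample(input, 7, new Random(42)).size()); // 7
    }
}
```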
What do you think?                    

> SimpleRandomSample by a fixed number
> ------------------------------------
>
>                 Key: DATAFU-63
>                 URL: https://issues.apache.org/jira/browse/DATAFU-63
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
>            Priority: Major
>
> SimpleRandomSample currently supports random sampling by probability; it does 
> not support sampling a fixed number of items. ReservoirSample may do the 
> work, but since it relies on an in-memory priority queue, memory issues may 
> occur if we are going to sample a huge number of items, e.g. sampling 100M 
> from 100G of data. 
> Suggested approach is to create a new class "SimpleRandomSampleByCount" that 
> uses Manuver's rejection threshold to reject items whose weight exceeds the 
> threshold as we go from mapper to combiner to reducer. The major part of 
> the algorithm will be very similar to SimpleRandomSample, except that we do 
> not use Bernstein's inequality to accept items, and we replace the probability 
> with p = k / n, where k is the number of items to sample and n is the total 
> number of items seen locally in the mapper, combiner, and reducer.
> Quote this requirement from others:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a 
> specified number of rows from grouped data? I’m currently doing this, since 
> it appears that the SAMPLE operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>            FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, 
> secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the 
> ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do 
> this?
> Thanks,
> "



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
