jian wang created DATAFU-63:
-------------------------------

             Summary: SimpleRandomSample by a fixed number
                 Key: DATAFU-63
                 URL: https://issues.apache.org/jira/browse/DATAFU-63
             Project: DataFu
          Issue Type: New Feature
            Reporter: jian wang


SimpleRandomSample currently supports random sampling by probability only; it 
does not support sampling a fixed number of items. ReservoirSample can do this, 
but because it relies on an in-memory priority queue, it may run out of memory 
when the sample itself is huge, e.g. sampling 100M items from 100G data. 

The suggested approach is to create a new class "SimpleRandomSampleByCount" 
that uses Manuver's rejection threshold to reject items whose weight exceeds 
the threshold as we go from mapper to combiner to reducer. Most of the 
algorithm will be very similar to SimpleRandomSample, except that we do not use 
Bernstein's inequality to accept items, and we replace the probability with 
p = k / n, where k is the number of items to sample and n is the total number 
of items seen locally in the mapper, combiner, or reducer.
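
A rough sketch of the idea, in Python rather than as a Pig UDF, may help make 
the stages concrete. Everything here is an assumption for illustration: the 
function names are hypothetical, the "weight" is taken to be a uniform random 
key per item, and the threshold formula is a ScaSRS-style bound with failure 
probability delta, not something specified in this issue. Each stage computes 
p = k / n from its local count and drops items whose key is above the 
threshold; the last stage keeps the k smallest keys.

```python
import heapq
import math
import random


def rejection_threshold(k, n, delta=1e-4):
    """Keys above this value are very unlikely to be among the k smallest of n.

    Assumed formula (illustrative): p + g + sqrt(g^2 + 2*g*p), g = -ln(delta)/n.
    """
    if n <= k:
        return 1.0  # fewer items than requested: keep everything
    p = k / n
    g = -math.log(delta) / n
    return min(1.0, p + g + math.sqrt(g * g + 2.0 * g * p))


def map_stage(items, k, rng):
    """Assign a uniform random key to each item and reject keys above the threshold."""
    n = len(items)
    q = rejection_threshold(k, n)
    kept = []
    for item in items:
        key = rng.random()
        if key < q:
            kept.append((key, item))
    return n, kept


def merge_stage(partials, k):
    """Combiner/reducer step: sum the local counts, then re-apply a tighter threshold."""
    n = sum(count for count, _ in partials)
    q = rejection_threshold(k, n)
    kept = [pair for _, pairs in partials for pair in pairs if pair[0] < q]
    return n, kept


def finish(partial, k):
    """Final step: the sample is the (at most) k survivors with the smallest keys."""
    _, kept = partial
    smallest = heapq.nsmallest(k, kept, key=lambda pair: pair[0])
    return [item for _, item in smallest]


if __name__ == "__main__":
    rng = random.Random(42)
    data = list(range(100000))
    partitions = [data[i::10] for i in range(10)]  # ten simulated mappers
    mapped = [map_stage(part, 100, rng) for part in partitions]
    sample = finish(merge_stage(mapped, 100), 100)
    print(len(sample))
```

Only the survivors of each stage are held in memory, so with these assumed 
thresholds a stage buffers somewhat more than k items instead of everything it 
has seen. With probability around delta a stage could reject too much and the 
final sample would come up short, so a real implementation would need to 
decide how to handle that case.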

This requirement comes from the following question from another user:

"Hi folks,

Question: does anybody know if there is a quicker way to randomly sample a 
specified number of rows from grouped data? I’m currently doing this, since it 
appears that the SAMPLE operator doesn’t work inside FOREACH statements:

photosGrouped = GROUP photos BY farm;

agg = FOREACH photosGrouped {
  rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
  ordered_rnds = ORDER rnds BY rnd;
  limitSet = LIMIT ordered_rnds 5000;
  GENERATE group AS farm,
           FLATTEN(limitSet.(photo_id, server, secret))
             AS (photo_id, server, secret);
};

This approach seems clumsy, and appears to run quite slowly (I’m assuming the 
ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do 
this?

Thanks,
"



--
This message was sent by Atlassian JIRA
(v6.2#6252)
