[ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243464#comment-16243464 ]

Eyal Allweil commented on DATAFU-63:
------------------------------------

Hi Olga,

I'll try to answer as many of your questions as I can, and hopefully someone 
can correct me if I'm off. The purpose of the ticket is to add sampling by a 
fixed size k, since SimpleRandomSample already works with a proportion p. I 
think the samples are expected to be uniformly random.
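
To make the distinction concrete, here is a rough Java sketch (mine, not 
DataFu code) of the two contracts: sampling by a proportion p only controls 
the output size in expectation, while sampling by a size k must return 
exactly k items.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Rough sketch (illustrative, not DataFu code) of the two sampling contracts.
public class SamplingContracts {

  // By proportion p (SimpleRandomSample's contract): each item is kept
  // independently, so the output size is only p * n in expectation.
  static <T> List<T> byProportion(Iterable<T> input, double p, Random rng) {
    List<T> out = new ArrayList<>();
    for (T item : input) {
      if (rng.nextDouble() < p) {
        out.add(item);
      }
    }
    return out;
  }

  // By fixed size k (what this ticket asks for), shown here as classic
  // reservoir sampling: exactly min(k, n) items, uniform without
  // replacement, at the cost of O(k) memory.
  static <T> List<T> bySize(Iterable<T> input, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    long seen = 0;
    for (T item : input) {
      seen++;
      if (reservoir.size() < k) {
        reservoir.add(item);
      } else {
        long j = (long) (rng.nextDouble() * seen); // uniform index in [0, seen)
        if (j < k) {
          reservoir.set((int) j, item);
        }
      }
    }
    return reservoir;
  }
}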

The code sample you provided assumes we have a way to access all of the input 
by index, which is not true for the DataBag we receive in UDFs. Specifically, 
when writing an Algebraic UDF, each invocation of the Intermediate step 
receives only part of the input. That's why SimpleRandomSample iterates over 
the entire input and only passes on values intended for the sample.
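
To illustrate that contract, here is a bare skeleton of an Algebraic UDF 
(the class name and placeholder bodies are mine, not DataFu's implementation):

import java.io.IOException;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Skeleton of Pig's Algebraic contract for a hypothetical SampleByCount UDF.
// The point: Initial and Intermed each receive only a fragment of the bag,
// so the UDF can only stream over what it is handed - it can never index
// into the whole input.
public class SampleByCount extends EvalFunc<Tuple> implements Algebraic {

  public String getInitial()  { return Initial.class.getName(); }
  public String getIntermed() { return Intermed.class.getName(); }
  public String getFinal()    { return Final.class.getName(); }

  // Fallback path when Pig cannot use the combiner: conceptually the same
  // three stages run back to back over the full input.
  public Tuple exec(Tuple input) throws IOException {
    return new Final().exec(new Intermed().exec(new Initial().exec(input)));
  }

  public static class Initial extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
      // Map side: iterate over one fragment of the bag and emit only
      // candidate sample members (plus whatever bookkeeping the merge needs).
      return input; // placeholder
    }
  }

  public static class Intermed extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
      // Combiner: merge partial candidate sets - again only a subset of data.
      return input; // placeholder
    }
  }

  public static class Final extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
      // Reducer: merge all surviving candidates and emit the final sample.
      return input; // placeholder
    }
  }
}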

As for the optimization for large sample sizes - though it sounds like a good 
idea in general, in practice I don't think people take samples that are larger 
than half of their initial data. If this is true - and I'd be glad if someone 
else could chime in - I would forgo this optimization for simplicity's sake.

Thanks for looking into this!



> SimpleRandomSample by a fixed number
> ------------------------------------
>
>                 Key: DATAFU-63
>                 URL: https://issues.apache.org/jira/browse/DATAFU-63
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it 
> does not support randomly sampling a fixed number of items. ReservoirSample 
> may do the work, but since it relies on an in-memory priority queue, memory 
> issues may arise if we are going to sample a huge number of items, e.g. 
> sampling 100M items from 100G of data. 
> The suggested approach is to create a new class "SimpleRandomSampleByCount" 
> that uses Manuver's rejection threshold to reject items whose weight exceeds 
> the threshold as we go from mapper to combiner to reducer. The majority of 
> the algorithm will be very similar to SimpleRandomSample, except that we do 
> not use Bernstein's theory to accept items and instead use the probability 
> p = k / n, where k is the number of items to sample and n is the total 
> number of items seen locally in the mapper, combiner, and reducer.
> Quoting this requirement from others:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a 
> specified number of rows from grouped data? I’m currently doing this, since 
> it appears that the SAMPLE operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>            FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the 
> ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do 
> this?
> Thanks,
> "


