[
https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227836#comment-14227836
]
Matthew Hayes commented on DATAFU-63:
-------------------------------------
Okay, I've read through [the
paper|http://jmlr.org/proceedings/papers/v28/meng13a.pdf]. So what you are
describing is Algorithm 6. This UDF seems worth implementing, and it looks
fairly straightforward given that much of the code already exists in
SimpleRandomSample.
> SimpleRandomSample by a fixed number
> ------------------------------------
>
> Key: DATAFU-63
> URL: https://issues.apache.org/jira/browse/DATAFU-63
> Project: DataFu
> Issue Type: New Feature
> Reporter: jian wang
> Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability, but it
> does not support sampling a fixed number of items. ReservoirSample can do
> this, but because it relies on an in-memory priority queue, it may run out of
> memory when sampling a huge number of items, e.g. sampling 100M from 100G of
> data.
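
To make the memory concern concrete, here is a minimal Python sketch of fixed-size sampling with a priority queue. It is not DataFu's ReservoirSample code: the helper name `sample_with_heap` and the choice of uniform random keys are assumptions made for illustration. Once full, the heap holds k entries, so memory grows linearly with the requested sample size.

```python
import heapq
import random


def sample_with_heap(items, k, rng=None):
    """Return up to k items chosen uniformly at random (hypothetical helper).

    Each item gets a uniform random key, and the k items with the smallest
    keys are kept. The heap never holds more than k entries, so memory use
    is O(k) no matter how many items are scanned.
    """
    rng = rng or random.Random()
    heap = []  # max-heap on key, emulated by storing negated keys
    for index, item in enumerate(items):
        key = rng.random()
        # The index breaks ties so the items themselves are never compared.
        entry = (-key, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif heap and key < -heap[0][0]:
            # The new key beats the largest key currently kept: swap it in.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]
```

With k in the hundreds of millions, the heap alone would need to hold that many entries on a single node, which is the memory pressure described above.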
> The suggested approach is to create a new class, "SimpleRandomSampleByCount",
> that uses Maurer's rejection threshold to reject items whose weight exceeds
> the threshold as we go from mapper to combiner to reducer. Most of the
> algorithm will be very similar to SimpleRandomSample, except that we do not
> use Bernstein's inequality to accept items, and we replace the probability p
> with k / n, where k is the number of items to sample and n is the total
> number of items seen locally in the mapper, combiner, or reducer.
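
The staged rejection idea can be sketched roughly as follows. This is a minimal Python illustration, not the proposed UDF and not the paper's algorithm verbatim. The function names, the `delta` parameter, and the threshold formula are assumptions: the threshold here comes from a multiplicative Chernoff lower-tail bound, so that at least k of n uniform keys fall below it with probability at least 1 - delta. Each stage also carries the count of items it has seen, so that later stages can compute p = k / n.

```python
import math
import random


def rejection_threshold(k, n, delta=1e-6):
    """Upper bound on the k-th smallest of n uniform keys (assumed formula).

    Fewer than k keys fall below the returned value with probability
    at most delta, by a Chernoff lower-tail bound.
    """
    if n <= k:
        return 1.0  # too few items to reject anything safely
    p = k / n
    gamma = -math.log(delta) / n
    return min(1.0, p + gamma + math.sqrt(gamma * gamma + 2.0 * gamma * p))


def map_stage(items, k, rng):
    """Mapper: key every item, then drop keys above the local threshold.

    This sketch buffers the partition to learn n; a real UDF would need
    to avoid materializing it.
    """
    keyed = [(rng.random(), item) for item in items]
    q = rejection_threshold(k, len(keyed))
    return len(keyed), [pair for pair in keyed if pair[0] <= q]


def merge_stage(partials, k):
    """Combiner or reducer: merge partial results and prune again."""
    n = sum(count for count, _ in partials)
    survivors = [pair for _, kept in partials for pair in kept]
    q = rejection_threshold(k, n)
    return n, [pair for pair in survivors if pair[0] <= q]


def final_sample(partials, k):
    """Reducer: keep the k items with the smallest keys among the survivors."""
    _, survivors = merge_stage(partials, k)
    survivors.sort(key=lambda pair: pair[0])
    return [item for _, item in survivors[:k]]


rng = random.Random(42)
partitions = [range(0, 5000), range(5000, 12000)]
partials = [map_stage(part, 100, rng) for part in partitions]
sample = final_sample(partials, 100)  # 100 distinct items from 12000
```

Because a local n is never larger than the global n, a local threshold is at least as permissive as the global one, so pruning early only discards items that are very unlikely to be among the k smallest keys overall. In the rare case that fewer than k items survive, this sketch simply returns fewer than k items.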
> This requirement is quoted from another user:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a
> specified number of rows from grouped data? I’m currently doing this, since
> it appears that the SAMPLE operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>     rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>     ordered_rnds = ORDER rnds BY rnd;
>     limitSet = LIMIT ordered_rnds 5000;
>     GENERATE group AS farm,
>              FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the
> ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do
> this?
> Thanks,
> "