[jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number

Eyal Allweil (JIRA) Sun, 19 Nov 2017 05:12:17 -0800

    [ 
https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258480#comment-16258480
 ]


Eyal Allweil commented on DATAFU-63:
------------------------------------

I wonder if the gradlew script is there because it can be "just used" on some 
systems. On one computer I couldn't find a gradle installation at all - maybe I 
skipped the bootstrapping step and the script worked for me. I'm no gradle 
expert so I don't know if this is feasible. If not, it sounds like a good idea 
to open a new issue for removing the gradlew, following your suggestion.

I looked at the code you linked, and I'll try again to answer your questions 
(in this issue and in the comments in the code).

The way the Algebraic interface works in Pig is that Pig calls 
EvalFunc.getInitial to get the name of the (inner) class with the Initial 
implementation, and creates it. You can see this code in POUserFunc in the 
Apache Pig project, it's not part of DataFu. The Initial class is then 
instantiated, and the exec there is called. The output is gathered and then the 
same process happens for Intermediate, and the output from Initial becomes the 
input for Intermediate. Finally, all these outputs from various mappers are 
sent to a reducer and the getFinal method is used to instantiate the Final 
class.

This means the implementations you've suggested would only work if this UDF was 
implementing the EvalFunc or AccumulatorEvalFunc interfaces. But in that case 
we already have a UDF that gives us a comparable implementation in DataFu - 
[ReservoirSample|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/sampling/ReservoirSample.java],
 so I don't think it's what is desired in this particular Jira issue.

This brings us back to what Matthew wrote way back in 2014 - implementing 
algorithm 6 from [this 
paper|http://jmlr.org/proceedings/papers/v28/meng13a.pdf]. I think what that 
algorithm provides is a way to generate a _k_-sized sample without ever holding 
all k items in memory in the mapper (which is the limitation for 
ReservoirSample, it won't work if _k_ is too big). 

Does that make sense?

> SimpleRandomSample by a fixed number
> ------------------------------------
>
>                 Key: DATAFU-63
>                 URL: https://issues.apache.org/jira/browse/DATAFU-63
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability, it does 
> not support random sample a fixed number of items. ReserviorSample may do the 
> work but since it relies on an in-memory priority queue, memory issue may 
> happen if we are going to sample a huge number of items, eg: sample 100M from 
> 100G data. 
> Suggested approach is to create a new class "SimpleRandomSampleByCount" that 
> uses Manuver's rejection threshold to reject items whose weight exceeds the 
> threshold as we go from mapper to combiner to reducer. The majority part of 
> the algorithm will be very similar to SimpleRandomSample, except that we do 
> not use Berstein's theory to accept items and replace probability p = k / n,  
> k is the number of items to sample, n is the total number of items local in 
> mapper, combiner and reducer.
> Quote this requirement from others:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a 
> specified number of rows from grouped data? I’m currently doing this, since 
> it appears that the SAMPLE operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>            FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, 
> secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the 
> ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do 
> this?
> Thanks,
> "



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number

Reply via email to