[ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260133#comment-16260133 ]

OlgaK commented on DATAFU-63:
-----------------------------

Some notes to keep track of (eventually these could be added to the docs).
To build with a Gradle version installed under <home>:
1. export GRADLE_USER_HOME=/where/gradle/installed/
2. edit gradle/wrapper/gradle-wrapper.properties and adjust the last line to 
point to the installed Gradle version (see the sketch after this list)

The build fails under Java 8; the code requires Java 7:
3. unset JAVA_HOME if it points to Java 8 (my ancient system still has Java 7 
as the default, even though it's already Java 9 time; I was surprised that my 
ancient system isn't ancient enough for full compatibility with this code)
4. now build as described in the docs: `./gradlew clean assemble`
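
For step 2, a sketch of what the last line of 
gradle/wrapper/gradle-wrapper.properties could look like when pointed at a 
locally installed distribution (the path and version here are assumptions; 
substitute whatever is actually installed):

    # assumed local path and version; adjust to your installation
    distributionUrl=file\:/where/gradle/installed/gradle-2.14.1-bin.zip

distributionUrl is the key the wrapper resolves and runs; a file URL keeps it 
from fetching anything over the network.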

It appears that DataBag has no remove() or similar method: 
https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/data/DataBag.html Am I 
right?
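
If that's the case, the usual workaround is to rebuild the bag rather than 
mutate it. A minimal sketch using Pig's org.apache.pig.data API, with a 
hypothetical keep() predicate standing in for whatever selection rule the 
sampler applies:

    import java.util.Iterator;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    public class BagFilterSketch {
        // DataBag exposes add() and iterator() but no remove(), so we
        // copy the tuples we want to keep into a fresh bag and drop
        // the original.
        public static DataBag filterBag(DataBag input) {
            DataBag output = BagFactory.getInstance().newDefaultBag();
            Iterator<Tuple> it = input.iterator();
            while (it.hasNext()) {
                Tuple t = it.next();
                if (keep(t)) {
                    output.add(t);
                }
            }
            return output;
        }

        // Hypothetical predicate; the real selection rule (e.g. a
        // rejection test) would go here.
        private static boolean keep(Tuple t) {
            return true;
        }
    }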

I can build the code with my added module; I just need to figure out what to 
do when one can't remove elements, yet the sum over partitions of 
`ceil(k / num_of_partitions)` (or `(int) k / num_of_partitions`) comes out 
with some excess.
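
To make the excess concrete: with k = 10 and 3 partitions, ceil(10 / 3) = 4 
per partition sums to 12, i.e. 2 extra items. A small sketch (my own 
illustration, not existing DataFu code) of one way to size per-partition 
quotas so they sum to exactly k, by giving the remainder to the first 
k mod n partitions:

    public class QuotaSketch {
        // Per-partition sample quotas that sum to exactly k: every
        // partition gets floor(k / n), and the first (k % n)
        // partitions take one extra item.
        public static int[] quotas(int k, int numPartitions) {
            int base = k / numPartitions;       // floor
            int remainder = k % numPartitions;  // leftover items
            int[] q = new int[numPartitions];
            for (int i = 0; i < numPartitions; i++) {
                q[i] = base + (i < remainder ? 1 : 0);
            }
            return q;
        }

        public static void main(String[] args) {
            // quotas(10, 3) -> [4, 3, 3]: sums to 10, no excess.
            System.out.println(java.util.Arrays.toString(quotas(10, 3)));
        }
    }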



> SimpleRandomSample by a fixed number
> ------------------------------------
>
>                 Key: DATAFU-63
>                 URL: https://issues.apache.org/jira/browse/DATAFU-63
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it does 
> not support randomly sampling a fixed number of items. ReservoirSample may do 
> the work, but since it relies on an in-memory priority queue, memory issues 
> may arise if we are going to sample a huge number of items, e.g. sampling 
> 100M items from 100G of data. 
> Suggested approach is to create a new class "SimpleRandomSampleByCount" that 
> uses Manuver's rejection threshold to reject items whose weight exceeds the 
> threshold as we go from mapper to combiner to reducer. The majority of the 
> algorithm will be very similar to SimpleRandomSample, except that we do not 
> use Bernstein's theory to accept items, and we replace the probability with 
> p = k / n, where k is the number of items to sample and n is the total number 
> of items local to the mapper, combiner, and reducer.
> Quote this requirement from others:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a 
> specified number of rows from grouped data? I’m currently doing this, since 
> it appears that the SAMPLE operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>            FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, 
> secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the 
> ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do 
> this?
> Thanks,
> "


