[jira] [Comment Edited] (DATAFU-63) SimpleRandomSample by a fixed number

2017-11-13 Thread OlgaK (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249870#comment-16249870
 ] 

OlgaK edited comment on DATAFU-63 at 11/13/17 5:47 PM:
---

I'm on Linux/Fedora. I did not modify the gradlew file manually; I just ran 
`gradle -b bootstrap.gradle`, as the docs instruct, using my Gradle 3.1. 
To remove a file from the repo: move the file somewhere else, commit, then 
move it back and add it to .gitignore, especially if the file is 
platform/version dependent and, per the docs, should be generated locally. 
The changes in the file are substantial, about 1/3 of the file. For example:
{noformat}git diff gradlew
diff --git a/gradlew b/gradlew
index 16f..9aa616c 100755
--- a/gradlew
+++ b/gradlew
@@ -6,12 +6,30 @@
 ##
 ##
 
-# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to p
-DEFAULT_JVM_OPTS="-XX:MaxPermSize=512m"
+# Attempt to set APP_HOME
+# Resolve links: $0 may be a link
+PRG="$0"
+# Need this for relative symlinks.
+while [ -h "$PRG" ] ; do
+ls=`ls -ld "$PRG"`
+link=`expr "$ls" : '.*-> \(.*\)$'`
+if expr "$link" : '/.*' > /dev/null; then
+PRG="$link"
+else
+PRG=`dirname "$PRG"`"/$link"
+fi
+done
.
{noformat}


> SimpleRandomSample by a fixed number
> 
>
> Key: DATAFU-63
> URL: https://issues.apache.org/jira/browse/DATAFU-63
> Project: DataFu
> Issue Type: New Feature
> Reporter: jian wang
> Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it does 
> not support randomly sampling a fixed number of items. ReservoirSample may do 
> the work, but since it relies on an in-memory priority queue, memory issues 
> may arise when sampling a huge number of items, e.g. sampling 100M items from 
> 100G of data. 
> The suggested approach is to create a new class "SimpleRandomSampleByCount" 
> that uses Manuver's rejection threshold to reject items whose weight exceeds 
> the threshold as we go from mapper to combiner to reducer. Most of the 
> algorithm will be very similar to SimpleRandomSample, except that we do not 
> use Bernstein's theory to accept items, and we replace the probability with 
> p = k / n, where k is the number of items to sample and n is the total number 
> of items local to the mapper, combiner, or reducer.
> Quoting this requirement from another user:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a 
> specified number of rows from grouped data? I’m currently doing this, since 
> it appears that the SAMPLE operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>     FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the 
> ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do 
> this?
> Thanks,
> "
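
The threshold-rejection idea in the description above can be sketched outside of Pig. The following is a minimal single-process illustration in Python, not the DataFu implementation: the function names, the `delta` failure probability, and the threshold formula (derived here from a Chernoff-style lower-tail bound) are all assumptions made for the example. Each item gets a uniform random key, items whose key exceeds the threshold are rejected early, and the k smallest surviving keys form the sample.

```python
import heapq
import math
import random


def rejection_threshold(k, n, delta=1e-4):
    """Key value above which an item can be rejected early.

    Assumption (not taken from the issue): the threshold comes from a
    Chernoff-style lower-tail bound, chosen so that at least k of n uniform
    keys fall at or below it with probability of at least 1 - delta.
    """
    p = k / n
    gamma = -math.log(delta) / n
    return min(1.0, p + gamma + math.sqrt(gamma * gamma + 2.0 * gamma * p))


def sample_by_count(items, k, rng, delta=1e-4):
    """Return min(k, len(items)) items chosen uniformly without replacement."""
    items = list(items)
    n = len(items)
    if k <= 0:
        return []
    if k >= n:
        return items
    threshold = rejection_threshold(k, n, delta)
    # Attach a uniform random key to every item (by index, so items
    # themselves never need to be comparable).
    keyed = [(rng.random(), i) for i in range(n)]
    # Early rejection: only keys at or below the threshold are kept.
    survivors = [(key, i) for key, i in keyed if key <= threshold]
    if len(survivors) < k:
        # Rare case: too few survivors, so fall back to considering all keys.
        survivors = keyed
    # The k smallest keys overall are a simple random sample of size k.
    chosen = heapq.nsmallest(k, survivors)
    return [items[i] for _, i in chosen]
```

For example, `sample_by_count(range(10000), 100, random.Random(7))` returns 100 distinct values. With k = 100 and n = 10000 the threshold is roughly 0.015, so only a small fraction of the items survive the rejection step before the final selection. In a mapper/combiner/reducer setting each stage would apply the same rejection to its local items, but how n and the threshold are shared across stages is left open here.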



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number

2017-11-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249793#comment-16249793
 ] 

Eyal Allweil commented on DATAFU-63:


Hi [~cur4so],

I'll quickly answer your last comment - I'll get to the previous one as soon as 
I can. We do indeed still use Gradle 2.4 in the master branch. We're [about to 
update to Gradle 3.5|https://issues.apache.org/jira/browse/DATAFU-125], but it 
hasn't been merged yet.

However, when I did the gradle bootstrapping, it didn't modify my _gradlew_ 
file - what OS are you on? (BTW - we can't add it to the gitignore because it's 
checked into the repository, and you can't ignore files that are checked in)
