[jira] [Comment Edited] (DATAFU-63) SimpleRandomSample by a fixed number
[ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249870#comment-16249870 ]

OlgaK edited comment on DATAFU-63 at 11/13/17 5:47 PM:
-------------------------------------------------------

I'm on Linux/Fedora. I have not modified the gradlew file manually; I only ran `gradle -b bootstrap.gradle` as the docs instruct, using my Gradle 3.1. To remove a file from the repo: move the file somewhere else, commit, then move it back and add it to .gitignore. That is especially appropriate if the file is platform/version dependent and, per the docs, should be generated locally. The changes in the file are substantial, about 1/3 of the file. For example:

{noformat}
git diff gradlew
diff --git a/gradlew b/gradlew
index 16f..9aa616c 100755
--- a/gradlew
+++ b/gradlew
@@ -6,12 +6,30 @@
 ##
 ##
-# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to p
-DEFAULT_JVM_OPTS="-XX:MaxPermSize=512m"
+# Attempt to set APP_HOME
+# Resolve links: $0 may be a link
+PRG="$0"
+# Need this for relative symlinks.
+while [ -h "$PRG" ] ; do
+    ls=`ls -ld "$PRG"`
+    link=`expr "$ls" : '.*-> \(.*\)$'`
+    if expr "$link" : '/.*' > /dev/null; then
+        PRG="$link"
+    else
+        PRG=`dirname "$PRG"`"/$link"
+    fi
+done
{noformat}


> SimpleRandomSample by a fixed number
> ------------------------------------
>
>                 Key: DATAFU-63
>                 URL: https://issues.apache.org/jira/browse/DATAFU-63
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it does
> not support randomly sampling a fixed number of items. ReservoirSample may do
> the work, but since it relies on an in-memory priority queue, memory issues
> may arise if we are going to sample a huge number of items, e.g. sampling
> 100M items from 100G of data.
>
> The suggested approach is to create a new class "SimpleRandomSampleByCount"
> that uses Manuver's rejection threshold to reject items whose weight exceeds
> the threshold as we go from mapper to combiner to reducer. The majority of
> the algorithm will be very similar to SimpleRandomSample, except that we do
> not use Bernstein's theory to accept items, and we replace the probability
> with p = k / n, where k is the number of items to sample and n is the total
> number of items seen locally in the mapper, combiner, and reducer.
>
> Quoting this requirement from others:
>
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a
> specified number of rows from grouped data? I'm currently doing this, since
> it appears that the SAMPLE operator doesn't work inside FOREACH statements:
>
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>     rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>     ordered_rnds = ORDER rnds BY rnd;
>     limitSet = LIMIT ordered_rnds 5000;
>     GENERATE group AS farm,
>              FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, secret);
> };
>
> This approach seems clumsy, and appears to run quite slowly (I'm assuming the
> ORDER/LIMIT isn't great for performance). Is there a less awkward way to do
> this?
> Thanks,
> "

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
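The fixed-count sampling the issue asks for is classically handled by reservoir sampling, which keeps memory bounded at k items and accepts each later item with exactly the p = k / n probability mentioned above. As an illustrative sketch only (Python, not DataFu's Java implementation):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniformly sample up to k items from a stream of unknown length.

    After n items have been seen, each item has been retained with
    probability k/n -- the acceptance rate described in the issue --
    while memory stays bounded at k items.
    """
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(n)        # uniform index in [0, n)
            if j < k:                   # true with probability k/n
                reservoir[j] = item     # evict a random current member
    return reservoir

sample = reservoir_sample(range(1_000_000), 5)
print(len(sample))  # 5
```

The DATAFU-63 proposal differs in that the sample must also merge correctly across mappers, combiners, and reducers, which is why the issue describes a rejection threshold rather than this single-pass form.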
[jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number
[ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249870#comment-16249870 ]

OlgaK commented on DATAFU-63:
-----------------------------

I'm on Linux/Fedora. I have not modified the gradlew file manually; I only ran `gradle -b bootstrap.gradle` as the docs instruct, using my Gradle 3.1. To remove a file from the repo: move the file somewhere else, commit, then move it back and add it to .gitignore. That is especially appropriate if the file is platform/version dependent and, per the docs, should be generated locally. The changes in the file are substantial, about 1/3 of the file. For example:

{noformat}
git diff gradlew
diff --git a/gradlew b/gradlew
index 16f..9aa616c 100755
--- a/gradlew
+++ b/gradlew
@@ -6,12 +6,30 @@
 ##
 ##
-# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to p
-DEFAULT_JVM_OPTS="-XX:MaxPermSize=512m"
+# Attempt to set APP_HOME
+# Resolve links: $0 may be a link
+PRG="$0"
+# Need this for relative symlinks.
+while [ -h "$PRG" ] ; do
+    ls=`ls -ld "$PRG"`
+    link=`expr "$ls" : '.*-> \(.*\)$'`
+    if expr "$link" : '/.*' > /dev/null; then
+        PRG="$link"
+    else
+        PRG=`dirname "$PRG"`"/$link"
+    fi
+done
{noformat}
[jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number
[ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249793#comment-16249793 ]

Eyal Allweil commented on DATAFU-63:
------------------------------------

Hi [~cur4so],

I'll quickly answer your last comment - I'll get to the previous one as soon as I can. We do indeed still use Gradle 2.4 in the master branch. We're [about to update to Gradle 3.5|https://issues.apache.org/jira/browse/DATAFU-125], but it hasn't been merged yet. However, when I did the gradle bootstrapping, it didn't modify my _gradlew_ file - what OS are you on?

(BTW - we can't add it to the gitignore, because it's checked into the repository, and you can't ignore files that are checked in.)