[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140897#comment-14140897 ]

Apache Spark commented on SPARK-3250:
-------------------------------------

User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/2455

> More Efficient Sampling
> -----------------------
>
>          Key: SPARK-3250
>          URL: https://issues.apache.org/jira/browse/SPARK-3250
>      Project: Spark
>   Issue Type: Improvement
>   Components: Spark Core
>     Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O(n) operation. A
> number of stochastic algorithms achieve speedups by exploiting O(k)
> sampling, where k is the number of data points to sample. Examples of such
> algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient
> Descent with mini-batching.
>
> More efficient sampling may be achievable by packing partitions with an
> ArrayBuffer or other data structure supporting random access. Since many of
> these stochastic algorithms perform repeated rounds of sampling, it may be
> feasible to perform a transformation to change the backing data structure
> followed by multiple rounds of sampling.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
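The O(k) idea in the issue description can be sketched as follows. This is a hypothetical Python illustration, not Spark's implementation: after a one-time O(n) pass that packs the partition into a random-access structure, each subsequent draw is a single index lookup, so a round of k samples costs O(k).

```python
import random

def sample_k_with_replacement(partition, k, rng=random):
    """Draw k samples (with replacement) from a random-access sequence.

    Each draw is one index lookup, so repeated rounds of sampling
    avoid rescanning the full partition.
    """
    n = len(partition)
    return [partition[rng.randrange(n)] for _ in range(k)]

# One-time O(n) transformation: materialize an iterator-backed
# partition into a list so it supports random access in later rounds.
packed = list(range(1000))
batch = sample_k_with_replacement(packed, 10)
```

For algorithms like MiniBatch KMeans that sample every iteration, the packing cost is amortized across all subsequent rounds.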
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139965#comment-14139965 ]

Erik Erlandson commented on SPARK-3250:
---------------------------------------

PR: https://github.com/apache/spark/pull/2455
[SPARK-3250] Implement Gap Sampling optimization for random sampling
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130880#comment-14130880 ]

RJ Nowling commented on SPARK-3250:
-----------------------------------

Great work! If these performance improvements hold up when implemented in Spark, this could offer minibatch methods a fighting chance. In particular, we would only need to count the elements once, and then we'd have faster sampling for the subsequent iterations, especially if the underlying data structures can be coerced into Arrays when we do the copy.
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130809#comment-14130809 ]

Erik Erlandson commented on SPARK-3250:
---------------------------------------

I developed prototype iterator classes for "fast gap sampling," with and without replacement. The code, testing rig, and test results can be seen here:
https://gist.github.com/erikerlandson/05db1f15c8d623448ff6

I also wrote up some discussion of the algorithms here:

Faster Random Samples With Gap Sampling
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
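The core of the without-replacement (Bernoulli) case can be sketched as follows. This is a minimal Python illustration of the gap-sampling idea, not the actual prototype code from the gist: with acceptance probability p per element, the gap between accepted elements is geometrically distributed, so the sampler draws one gap, skips that many elements, and emits the next, replacing one RNG call per element with one per accepted element.

```python
import itertools
import math
import random

def gap_sample_iter(it, p, rng=random):
    """Lazily yield a Bernoulli(p) sample from an iterator.

    Between accepted elements we drop a geometrically distributed
    number of items (inverse-transform sampling of the gap) instead
    of testing each element individually.

    Note: uses None as an end-of-iterator sentinel, so this sketch
    assumes the data contains no None values.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    log_q = math.log(1.0 - p)
    while True:
        # gap ~ Geometric(p): P(gap = g) = (1-p)^g * p
        gap = int(math.log(1.0 - rng.random()) / log_q)
        # Drop 'gap' elements, then take the next one as a sample.
        nxt = next(itertools.islice(it, gap, gap + 1), None)
        if nxt is None:
            return
        yield nxt
```

Because the sampler only skips forward, the output preserves input order and contains no duplicates, matching the behavior of per-element Bernoulli filtering.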
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115213#comment-14115213 ]

RJ Nowling commented on SPARK-3250:
-----------------------------------

Very clever! Once it's verified to sample correctly, it would make a nice incremental improvement to the current sampling in Spark.

Can you try different data sizes? It shouldn't change the O(n) scaling, but it would be good to verify.
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114789#comment-14114789 ]

Erik Erlandson commented on SPARK-3250:
---------------------------------------

I did some experiments with sampling that models the gaps between samples (so one can use iterator.drop between samples). The results are here:
https://gist.github.com/erikerlandson/66b42d96500589f25553

There appears to be a crossover point in efficiency around sampling probability p = 0.3: densities below 0.3 are best handled with the new gap-based logic, while higher sampling densities are better served by the traditional filter-based logic. I need to run more tests, but the first results are promising; at low sampling densities the improvement is large.
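The crossover observation suggests a simple dispatch on the sampling probability. The sketch below is a hypothetical Python illustration (the 0.3 threshold is the empirical value reported above, and the function names are invented for this example): below the threshold, skip geometric gaps; at or above it, fall back to the per-element coin flip.

```python
import math
import random

# Empirical crossover reported in the experiments above; below this,
# gap-based sampling wins, above it, filtering wins.
GAP_SAMPLING_THRESHOLD = 0.3

def filter_sample(data, p, rng):
    """Traditional per-element coin flip: one RNG draw per element."""
    return [x for x in data if rng.random() < p]

def gap_sample(data, p, rng):
    """Skip geometric gaps: roughly one RNG draw per accepted element."""
    log_q = math.log(1.0 - p)
    out, i, n = [], 0, len(data)
    while True:
        # gap ~ Geometric(p), via inverse-transform sampling
        i += int(math.log(1.0 - rng.random()) / log_q)
        if i >= n:
            return out
        out.append(data[i])
        i += 1

def sample(data, p, rng=random):
    """Pick the cheaper strategy based on sampling probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    if p >= GAP_SAMPLING_THRESHOLD:
        return filter_sample(data, p, rng)
    return gap_sample(data, p, rng)
```

Both branches produce an unbiased Bernoulli(p) sample in input order; only the expected number of RNG calls differs, which is what drives the crossover.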