RJ Nowling created SPARK-3250:
---------------------------------

             Summary: More Efficient Sampling
                 Key: SPARK-3250
                 URL: https://issues.apache.org/jira/browse/SPARK-3250
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: RJ Nowling


Sampling, as currently implemented in Spark, is an O(n) operation, where n is
the number of elements in the RDD.  A number of stochastic algorithms achieve
speedups by exploiting O(k) sampling, where k is the number of data points to
sample.  Examples of such algorithms include mini-batch k-means and stochastic
gradient descent with mini-batching.
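
To make the O(n) versus O(k) distinction concrete, here is a minimal sketch in
plain Python.  It is not Spark code; the function names bernoulli_sample and
index_sample are hypothetical and exist only for this illustration.  The first
function draws one random number per element, so its cost grows with the size
of the data.  The second assumes the data supports random access and touches
only the k elements it returns.

```python
import random

def bernoulli_sample(items, fraction, rng):
    """O(n): one random draw per element, however few elements are kept."""
    return [x for x in items if rng.random() < fraction]

def index_sample(items, k, rng):
    """O(k): draw k distinct positions, then look each one up directly.

    Assumes `items` supports len() and integer indexing (random access).
    """
    return [items[i] for i in rng.sample(range(len(items)), k)]

rng = random.Random(42)
data = list(range(1_000_000))

# Visits all 1,000,000 elements to keep roughly 100 of them.
scan = bernoulli_sample(data, 100 / len(data), rng)

# Visits exactly 100 elements.
direct = index_sample(data, 100, rng)
```

Note that the Bernoulli scan returns a sample whose size varies from run to
run, while the index-based version returns exactly k elements.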

More efficient sampling may be achievable by packing each partition into an
ArrayBuffer or another data structure that supports random access.  Since many
of these stochastic algorithms perform repeated rounds of sampling, it may be
feasible to transform the backing data structure once and then amortize that
cost over multiple rounds of sampling.
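
The pack-once, sample-many idea can be sketched as follows.  This is plain
Python rather than Spark code, and it is only one possible reading of the
proposal: the class PackedPartitions and its methods are hypothetical, Python
lists stand in for ArrayBuffer, and the partitions are modeled as ordinary
iterators living in a single process.

```python
import bisect
import random
from itertools import accumulate

class PackedPartitions:
    """Hypothetical container: partitions materialized for random access."""

    def __init__(self, partitions):
        # One-time O(n) pass: copy every partition iterator into a list.
        self.parts = [list(p) for p in partitions]
        # Cumulative end offsets, used to map a global index to a partition.
        self.ends = list(accumulate(len(p) for p in self.parts))
        self.total = self.ends[-1] if self.ends else 0

    def sample(self, k, rng):
        """Return k distinct elements, drawn without replacement."""
        picks = []
        for g in rng.sample(range(self.total), k):
            p = bisect.bisect_right(self.ends, g)    # partition holding index g
            start = self.ends[p - 1] if p > 0 else 0
            picks.append(self.parts[p][g - start])   # direct lookup, no scan
        return picks

rng = random.Random(0)

# Four partitions of 1,000 elements each, supplied as one-shot iterators.
packed = PackedPartitions(iter(range(i * 1000, (i + 1) * 1000)) for i in range(4))

# Ten sampling rounds reuse the same packed structure.
batches = [packed.sample(32, rng) for _ in range(10)]
```

Each round costs O(k log P) for P partitions because of the binary search,
which is independent of n.  The trade-off is that every partition must be held
fully in memory, whereas an iterator-based scan does not require that.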



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
