Re: Does filter on an RDD scan every data item ?

dsiegel Wed, 03 Dec 2014 12:30:57 -0800

>> nsareen wrote
>>> 1) Does filter function scan every element saved in RDD? if my RDD
>>> represents 10 Million rows, and if i want to work on only 1000 of them,
>>> is there an efficient way of filtering the subset without having to scan
>>> every element ?


Using .take(1000) may give you a biased sample, since it simply returns
the first 1000 elements of the RDD.
You may want to consider sampling your RDD instead (with or without
replacement), using a seed for randomization, with .takeSample(), e.g.

rdd.takeSample(false, 1000, 1)

This returns an Array, from which you could create another RDD.
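
A minimal sketch of that round trip, in case it helps. The object name
TakeSampleSketch, the helper sampleAsRdd, the local master setting and the
stand-in data are all made up for illustration; only takeSample and
parallelize are the actual Spark calls being shown.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object TakeSampleSketch {

  // Hypothetical helper: draw n elements without replacement using a fixed
  // seed, then turn the resulting driver-side Array back into an RDD.
  def sampleAsRdd[T: ClassTag](rdd: RDD[T], n: Int, seed: Long): RDD[T] = {
    val sample: Array[T] = rdd.takeSample(withReplacement = false, num = n, seed = seed)
    rdd.sparkContext.parallelize(sample)
  }

  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders for a real cluster config.
    val conf = new SparkConf().setAppName("take-sample-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Small stand-in for the 10 million row RDD from the question.
      val rdd = sc.parallelize(1 to 100000)
      val sampled = sampleAsRdd(rdd, 1000, 1L)
      println(sampled.count())
    } finally {
      sc.stop()
    }
  }
}
```

Note that the sample is collected on the driver before it is parallelized
again, so this only makes sense when the sample is small. If you would
rather stay distributed, rdd.sample(withReplacement, fraction, seed) returns
an RDD directly, though it takes a fraction rather than an exact count.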





