Hi,

For your first question, I think you can use

sc.parallelize(rdd.take(1000))

rdd.take(1000) only scans as many partitions as needed to collect the first
1000 elements, and parallelize turns that result back into a small RDD.

For your second question, I am not sure, but I don't think we can restrict a
filter to certain partitions without scanning every element.
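
To make the trade-off concrete, here is a toy, plain-Python sketch (not the
Spark API; all names are illustrative) of how a hash partitioner assigns keys
to partitions. A generic filter has to visit every partition, even though a
known partitioner could in principle locate the single partition a key hashes
to:

```python
# Toy illustration of hash partitioning (plain Python, not Spark).
def partition_for(key, num_partitions):
    # Same idea as Spark's HashPartitioner: hash(key) mod numPartitions.
    return hash(key) % num_partitions

num_partitions = 4
data = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]

# Lay the records out the way a hash-partitioned key/value RDD would.
partitions = [[] for _ in range(num_partitions)]
for key, value in data:
    partitions[partition_for(key, num_partitions)].append((key, value))

# A generic filter scans every element of every partition...
scanned_all = [kv for part in partitions for kv in part if kv[0] == "a"]

# ...whereas knowing the partitioner would let us scan one partition only.
target = partitions[partition_for("a", num_partitions)]
scanned_one = [kv for kv in target if kv[0] == "a"]

assert scanned_all == scanned_one
```

Spark's filter is the first case: it runs on every partition regardless of the
partitioner.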

Cheers
Gen


nsareen wrote
> Hi,
> 
> I wanted some clarity on the functioning of the filter function of RDDs.
> 
> 1) Does the filter function scan every element saved in the RDD? If my
> RDD represents 10 million rows, and I want to work on only 1000 of them,
> is there an efficient way of filtering the subset without having to scan
> every element?
> 
> 2) If my RDD represents a key/value data set, when I filter this data
> set of 10 million rows, can I specify that the search should be
> restricted to only those partitions which contain specific keys? Will
> Spark run my filter operation on all partitions if the partitions are
> done by key, irrespective of whether the key exists in a partition or
> not?

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20174.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
