take(1000) merely returns the first 1000 elements of an RDD; I don't imagine that's what the OP means. filter() is how you select a subset of elements to work with. Yes, this requires evaluating the predicate on all 10M elements, at least once. I don't think you could avoid that in general in any system.
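To make the distinction concrete, here is a plain-Python sketch (no Spark needed) of the two semantics: a take-like operation can stop early, while a filter-like operation has to evaluate its predicate on every element. The function names are illustrative, not Spark API:

```python
def take(elements, n):
    """Like RDD.take(n): collect the first n elements, stopping early."""
    out = []
    for x in elements:
        if len(out) == n:
            break  # no need to look at the rest of the data
        out.append(x)
    return out

def filter_all(elements, predicate):
    """Like RDD.filter(p): the predicate runs on every element once."""
    return [x for x in elements if predicate(x)]

data = range(10_000_000)
first_1000 = take(data, 1000)                         # touches 1000 elements
evens = filter_all(range(100), lambda x: x % 2 == 0)  # scans all 100
```

The key point is that take() bounds the work by its argument, while filter() is bounded only by the size of the dataset.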
You might be able to take advantage of additional info you have. For example, if you use a particular partitioner and you know that the elements of interest are only in one partition, you could create a more efficient version with mapPartitions that simply drops the other partitions.

Same answer to your second question. It sounds like you expect that Spark has an index over keys, but it does not; it has no special notion of where your keys are or what they are. This changes a bit if you mean you are using the SQL APIs, but it doesn't sound like you are.

On Tue, Dec 2, 2014 at 10:17 AM, Gen <gen.tan...@gmail.com> wrote:
> Hi,
>
> For your first question, I think that we can use
> sc.parallelize(rdd.take(1000))
>
> For your second question, I am not sure. But I don't think that we can
> restrict filter to certain partitions without scanning every element.
>
> Cheers
> Gen
>
>
> nsareen wrote
>> Hi,
>>
>> I wanted some clarity into the functioning of the filter function of RDD.
>>
>> 1) Does the filter function scan every element saved in the RDD? If my RDD
>> represents 10 million rows, and I want to work on only 1000 of them, is
>> there an efficient way of filtering the subset without having to scan
>> every element?
>>
>> 2) If my RDD represents a key/value data set, when I filter this data
>> set of 10 million rows, can I specify that the search should be restricted
>> to only partitions which contain specific keys? Will Spark run my filter
>> operation on all partitions if the partitions are done by key,
>> irrespective of whether the key exists in a partition or not?
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20174.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
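To illustrate the partition-pruning idea: if you control the partitioner, you know which partition can contain a given key, so you only need to scan that one. This is a plain-Python simulation (no Spark; all names hypothetical) of what you might do in real Spark code with your own partitioner and mapPartitionsWithIndex:

```python
# Simulate a hash-partitioned key/value RDD as a list of lists.
NUM_PARTITIONS = 4

def partition_of(key):
    """The partitioning rule; must match the rule used to build the data."""
    return hash(key) % NUM_PARTITIONS

def build_partitions(pairs):
    """Place each (key, value) pair into its partition."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for k, v in pairs:
        parts[partition_of(k)].append((k, v))
    return parts

def lookup(partitions, key):
    """Scan only the single partition that can hold `key`, analogous to
    a mapPartitionsWithIndex that drops every other partition."""
    return [v for k, v in partitions[partition_of(key)] if k == key]

pairs = [(i, i * i) for i in range(1000)]
parts = build_partitions(pairs)
```

Note that even here, every element *within* the chosen partition is still scanned; you avoid only the other partitions, and only because you brought the knowledge of the partitioning scheme yourself. Spark does not track it for you.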
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org