take(1000) merely takes the first 1000 elements of an RDD. I don't
imagine that's what the OP means. filter() is how you select a subset
of elements to work with. Yes, this requires evaluating the predicate
on all 10M elements, at least once. I don't think you could avoid that
in general, in any system.
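
A minimal sketch of the difference (assumes an RDD[Int] called "rdd";
the predicate is made up):

  // take(1000): just returns the first 1000 elements; no predicate involved
  val first1000 = rdd.take(1000)

  // filter(): the predicate is evaluated on every one of the 10M elements
  val multiplesOf7 = rdd.filter(x => x % 7 == 0)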

You might be able to take advantage of additional info you have. For
example, if you have a particular partitioning scheme and you know that
the elements of interest are only in one partition, you could create a
more efficient version with mapPartitionsWithIndex that simply drops
the other partitions, as in the sketch below.
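
Something like this, purely as a sketch (it assumes the elements you
want happen to live in partition 3; the iterators of the other
partitions are never consumed, so their elements are never evaluated):

  // Keep only partition 3; drop the rest without touching their elements
  val subset = rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 3) iter else Iterator.empty
  }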

Same answer to your second question. It sounds like you expect that
Spark somehow has an index over keys, but it does not. It has no
special notion of where your keys are or what they are.
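
To illustrate (sketch only; "pairRdd" and the key are made up), a
filter by key is still a full scan. The closest thing to an exception
is lookup(), which, when the pair RDD has a known partitioner, only
scans the one partition that could contain the key:

  // No index: this predicate runs on every (key, value) pair
  val forKey = pairRdd.filter { case (k, _) => k == "someKey" }

  // With a partitioner set (e.g. after partitionBy), lookup() scans
  // only the single partition the key maps to
  val values = pairRdd.lookup("someKey")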

This changes a bit if you mean you are using the SQL APIs, since Spark
SQL can push a filter down to data sources that support it, but it
doesn't sound like you are.
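
For completeness, a sketch of that path (assumes a table "events" has
already been registered with the SQLContext): a WHERE clause can be
pushed down to sources that support it, such as Parquet, so
non-matching data may be skipped rather than scanned:

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val result = sqlContext.sql("SELECT * FROM events WHERE key = 'someKey'")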

On Tue, Dec 2, 2014 at 10:17 AM, Gen <gen.tan...@gmail.com> wrote:
> Hi,
>
> For your first question, I think that we can use
> sc.parallelize(rdd.take(1000))
>
> For your second question, I am not sure, but I don't think that we can
> restrict the filter to certain partitions without scanning every element.
>
> Cheers
> Gen
>
>
> nsareen wrote
>> Hi,
>>
>> I wanted some clarity on the functioning of the filter function on RDDs.
>>
>> 1) Does the filter function scan every element saved in the RDD? If my RDD
>> represents 10 million rows, and I want to work on only 1000 of them, is
>> there an efficient way of filtering the subset without having to scan
>> every element?
>>
>> 2) Suppose my RDD represents a key/value data set. When I filter this data
>> set of 10 million rows, can I specify that the search should be restricted
>> to only those partitions which contain specific keys? Will Spark run my
>> filter operation on all partitions if the partitions are done by key,
>> irrespective of whether the key exists in a partition or not?
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
