I'm not sure the sample is what I was looking for.

As mentioned in another post above, this is what I'm looking for:

1) My RDD contains tuples of the structure Tuple2<CustomTuple, Double>.
2) Each CustomTuple is a combination of string IDs, e.g.
CustomTuple.dimensionOne="AE232323"
CustomTuple.dimensionTwo="BE232323"
CustomTuple.dimensionThree="CE232323"
and so on.
3) CustomTuple overrides equals and hashCode, which is how unique objects are
identified: two distinct objects are considered equal when their dimensionOne,
dimensionTwo, and dimensionThree values match.
4) The Double is a numeric value.
5) I want to create an RDD of 50-100 million or more such tuples in Spark, and
this data set can grow over time.
6) My web application would request processing of a subset of these millions of
rows. The processing is nothing but aggregation / arithmetic functions over
this data set. We felt Spark would be the right candidate to process this in a
distributed fashion and would also help with scalability going forward. Where
we are stuck is this: if the application requests a subset comprising, say,
100 thousand tuples, we would have to construct that many CustomTuple objects,
pass them from the Spark driver program to the filter function, and the filter
would then scan all 100 million rows to generate the subset (see the sketch
after this list).
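To make that concrete, below is a rough, minimal sketch of the setup and of
what we do today. CustomTuple and its fields are as described above; the app
name, the sample values, and the SubsetByFilter class are just illustrative.

import java.io.Serializable;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Key class: equality and hashing are defined over the three dimension IDs.
class CustomTuple implements Serializable {
    String dimensionOne, dimensionTwo, dimensionThree;

    CustomTuple(String d1, String d2, String d3) {
        dimensionOne = d1;
        dimensionTwo = d2;
        dimensionThree = d3;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CustomTuple)) return false;
        CustomTuple t = (CustomTuple) o;
        return Objects.equals(dimensionOne, t.dimensionOne)
            && Objects.equals(dimensionTwo, t.dimensionTwo)
            && Objects.equals(dimensionThree, t.dimensionThree);
    }

    @Override
    public int hashCode() {
        return Objects.hash(dimensionOne, dimensionTwo, dimensionThree);
    }
}

public class SubsetByFilter {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "subset-by-filter");

        // In production this RDD holds 50-100 million tuples.
        JavaPairRDD<CustomTuple, Double> data = sc.parallelizePairs(Arrays.asList(
            new Tuple2<CustomTuple, Double>(
                new CustomTuple("AE232323", "BE232323", "CE232323"), 1.5),
            new Tuple2<CustomTuple, Double>(
                new CustomTuple("AE999999", "BE999999", "CE999999"), 2.5)));

        // The web application asks for ~100 thousand keys; they are built on
        // the driver and shipped into the filter closure.
        Set<CustomTuple> requested = new HashSet<>(Arrays.asList(
            new CustomTuple("AE232323", "BE232323", "CE232323")));

        // This is the part we are worried about: filter has to visit every one
        // of the millions of tuples just to pick out the requested subset.
        JavaPairRDD<CustomTuple, Double> subset =
            data.filter(t -> requested.contains(t._1()));

        System.out.println("subset size = " + subset.count());
        sc.stop();
    }
}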

I was under the assumption that, since Spark allows key/value storage, there
would be some indexing of the stored keys that would help Spark locate the
matching objects without a full scan.
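For example, this is roughly what I had in mind, sketched with the standard
partitionBy / lookup calls on a pair RDD and reusing the data RDD from the
sketch above. The 200-partition count and the helper method names are
arbitrary; I don't know whether this is the recommended approach.

import java.util.List;

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

class KeyedLookupSketch {
    // Partition once by key; hash-partitioning uses CustomTuple.hashCode.
    static JavaPairRDD<CustomTuple, Double> partitionByKey(
            JavaPairRDD<CustomTuple, Double> data) {
        return data.partitionBy(new HashPartitioner(200)).cache();
    }

    // On a partitioned pair RDD, lookup() runs a job against only the single
    // partition that can contain this key, instead of scanning every partition.
    static List<Double> lookupOne(JavaPairRDD<CustomTuple, Double> byKey,
            CustomTuple key) {
        return byKey.lookup(key);
    }
}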





