I'm not sure the sample is what I was looking for. As mentioned in another post above, this is what I'm looking for:
1) My RDD contains elements of type Tuple2<CustomTuple, Double>.
2) Each CustomTuple is a combination of string ids, e.g. CustomTuple.dimensionOne="AE232323", CustomTuple.dimensionTwo="BE232323", CustomTuple.dimensionThree="CE232323", and so on.
3) CustomTuple overrides equals and hashCode so that two distinct objects are identified as the same key when their dimensionOne, dimensionTwo and dimensionThree values match (a sketch of the class is below).
4) The Double is a numeric value.
5) I want to create an RDD of 50-100 million or more such tuples in Spark, and that number can grow over time.
6) My web application would request processing of a subset of these millions of rows. The processing is nothing but aggregation / arithmetic functions over this data set.

We felt Spark would be the right candidate to process this in a distributed fashion and would also help with scalability in the future. Where we are stuck is that, when the application requests a subset comprising, say, 100 thousand tuples, we would have to construct that many CustomTuple objects and pass them via the Spark driver program to the filter function, which in turn would scan all 100 million rows to generate the subset (see the filter sketch after the class below). I was under the assumption that, since Spark allows key/value storage, there would be some indexing of the stored keys that would help Spark locate the objects.
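
To make points 2-4 concrete, here is a minimal sketch of the class I am describing (the field names are the ones above; everything else is just illustrative):

import java.io.Serializable;
import java.util.Objects;

public class CustomTuple implements Serializable {
    public String dimensionOne;
    public String dimensionTwo;
    public String dimensionThree;

    public CustomTuple(String d1, String d2, String d3) {
        this.dimensionOne = d1;
        this.dimensionTwo = d2;
        this.dimensionThree = d3;
    }

    // Two distinct objects are equal when all three dimension ids match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CustomTuple)) return false;
        CustomTuple other = (CustomTuple) o;
        return Objects.equals(dimensionOne, other.dimensionOne)
            && Objects.equals(dimensionTwo, other.dimensionTwo)
            && Objects.equals(dimensionThree, other.dimensionThree);
    }

    // Hash is derived from the same three ids, so equal objects hash alike.
    @Override
    public int hashCode() {
        return Objects.hash(dimensionOne, dimensionTwo, dimensionThree);
    }
}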
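
And this is roughly how the subset request would have to be served today (SubsetFilter and selectSubset are hypothetical names I made up for the sketch; the point is that the filter has no index to use and touches every row):

import java.util.HashSet;
import java.util.Set;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class SubsetFilter {
    // Select the rows whose key is in the set requested by the web application.
    public static JavaPairRDD<CustomTuple, Double> selectSubset(
            JavaSparkContext sc,
            JavaPairRDD<CustomTuple, Double> all,
            Set<CustomTuple> requestedKeys) {
        // Ship the ~100 thousand requested keys to the executors once,
        // rather than serializing them into every task closure.
        final Set<CustomTuple> keySet = new HashSet<>(requestedKeys);
        final Broadcast<Set<CustomTuple>> keys = sc.broadcast(keySet);
        // This filter still scans all 100 million rows: membership in the
        // broadcast set is checked per element, there is no key index.
        return all.filter(t -> keys.value().contains(t._1()));
    }
}

If there were a way to partition or index the RDD by the CustomTuple key so that Spark could go straight to the relevant rows instead of scanning everything, that is exactly what I was hoping existed.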