The problem is that I have an RDD of about 10M rows, and it keeps growing. Every time we want to run a query and compute over a subset of the data, we have to use filter and then some aggregation. As far as I know, filter goes through every partition and every row of the RDD, which may not be efficient at all.
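To illustrate, here is roughly what we do today (the key/value schema below is just a made-up stand-in for our data, not the real one):

import org.apache.spark.{SparkConf, SparkContext}

object FilterThenAggregate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-then-aggregate").setMaster("local[*]"))

    // Hypothetical keyed RDD of (accountId, amount); stands in for the ~10M-row RDD.
    val rows = sc.parallelize(Seq((1L, 10.0), (2L, 25.0), (2L, 5.0), (3L, 40.0)))

    // Current approach: filter scans every partition and every row,
    // then the aggregation runs on whatever survives the predicate.
    val total = rows
      .filter { case (accountId, _) => accountId == 2L }
      .map(_._2)
      .reduce(_ + _)

    println(s"total for account 2: $total")
    sc.stop()
  }
}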
Since Spark already has OrderedRDDFunctions, I don't see why such a function would be so difficult to implement. Cassandra/HBase have had this for years: they can fetch data from only the relevant partitions based on your row key. Scala's TreeMap has a range function that does the same. I think people have been looking for this for a while; I've seen several posts asking about it, e.g. http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-td20170.html#a26048

By the way, I assume there

Thanks,
Nirav
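P.S. For reference, this is the kind of range lookup I mean with Scala's TreeMap (toy data, purely for illustration):

import scala.collection.immutable.TreeMap

// TreeMap keeps keys sorted, so a range query touches only the
// relevant slice instead of scanning every entry.
val m = TreeMap(1 -> "a", 5 -> "b", 10 -> "c", 20 -> "d")

// Keys in [5, 20) -- returns TreeMap(5 -> "b", 10 -> "c")
val slice = m.range(5, 20)
println(slice)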