The problem is that I have an RDD of about 10M rows, and it keeps growing. Every
time we want to query and compute on a subset of the data, we have to filter
first and then run some aggregation. As I understand it, filter goes through
every partition and every row of the RDD, which may not be efficient at all.
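
To make it concrete, here is a minimal sketch of the pattern I mean (the
dataset, key range, and names are just illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object FullScanFilter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("full-scan").setMaster("local[*]"))

    // Pair RDD keyed by a Long rowkey, standing in for the ~10M-row dataset.
    val rdd = sc.parallelize(1L to 10000000L).map(k => (k, s"row-$k"))

    // filter() evaluates the predicate against every row of every partition,
    // even though only the keys in [1000, 2000) are actually needed.
    val subset = rdd.filter { case (k, _) => k >= 1000L && k < 2000L }
    println(subset.count())

    sc.stop()
  }
}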

Since Spark already has OrderedRDDFunctions, I don't see why such a function
would be difficult to implement. Cassandra/HBase have had this for years: they
can fetch data only from certain partitions based on your rowkey. Scala's
TreeMap has a range method that does the same.
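
In fact, with a RangePartitioner you can already hand-roll something close to
this. A rough sketch, not an existing Spark API (rangeFilter and its bounds are
my own names): sort once, then skip every partition whose key range cannot
overlap [lower, upper) before filtering the survivors row by row.

import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.RDD

def rangeFilter(rdd: RDD[(Long, String)],
                lower: Long, upper: Long): RDD[(Long, String)] = {
  // Pay the sort cost once; subsequent range queries reuse the layout.
  val partitioner = new RangePartitioner(rdd.partitions.length, rdd)
  val sorted = rdd.repartitionAndSortWithinPartitions(partitioner)

  // The partitioner knows which partition each bound falls into, so every
  // partition outside [firstPart, lastPart] can be ignored entirely.
  val firstPart = partitioner.getPartition(lower)
  val lastPart  = partitioner.getPartition(upper)

  sorted.mapPartitionsWithIndex { (idx, iter) =>
    if (idx < firstPart || idx > lastPart) Iterator.empty
    else iter.filter { case (k, _) => k >= lower && k < upper }
  }
}

This still launches a no-op task for each skipped partition (cheap, since the
input iterator is never consumed); a proper implementation could wrap the
sorted RDD in a PartitionPruningRDD so those tasks are never scheduled at all.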

I think people have been looking for this for a while; I've seen several posts
asking about it.

http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-td20170.html#a26048

By the way, I assume there
Thanks
Nirav
