Re: How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

2016-04-02 Thread Nirav Patel
@Ilya Ganelin, not sure how zipWithIndex() will do less than an O(n) scan; the Spark docs don't mention anything about it. I found a solution with Spark 1.5.2's OrderedRDDFunctions: it has a filterByRange API. Thanks. On Sun, Jan 24, 2016 at 10:27 PM, Sonal Goyal wrote: > One thing
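A minimal sketch of the filterByRange approach mentioned above, assuming Spark 1.5+ in Scala; the key type, sample data, and range bounds are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FilterByRangeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filterByRange").setMaster("local[*]"))

    // A pair RDD keyed by, say, an event timestamp (hypothetical data).
    val events: RDD[(Long, String)] =
      sc.parallelize(Seq((1L, "a"), (42L, "b"), (100L, "c"), (500L, "d")))

    // sortByKey installs a RangePartitioner, which is what lets
    // filterByRange skip partitions whose key range cannot overlap.
    val sorted = events.sortByKey()

    // Only partitions overlapping [40, 200] are scanned; the rest are pruned.
    val slice = sorted.filterByRange(40L, 200L)

    slice.collect().foreach(println) // (42,b), (100,c)
    sc.stop()
  }
}

Note the pruning only kicks in when the RDD carries a RangePartitioner (e.g. after sortByKey); on an unpartitioned pair RDD, filterByRange degrades to a plain full filter.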

Re: How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

2016-01-24 Thread Ilya Ganelin
The solution I normally use is to zipWithIndex() and then use the filter operation. Filter is an O(m) operation where m is the size of your partition, not an O(N) operation. -Ilya Ganelin On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel wrote: > Problem is I have RDD of
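A minimal sketch of the zipWithIndex() + filter pattern described above, in Scala; the sample rows and the index window are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object ZipWithIndexFilterExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("zipWithIndexFilter").setMaster("local[*]"))

    val rows = sc.parallelize(Seq("r0", "r1", "r2", "r3", "r4"), numSlices = 2)

    // zipWithIndex assigns a stable, 0-based index to every element.
    // Note: it runs a Spark job first to count elements per partition.
    val indexed = rows.zipWithIndex() // RDD[(String, Long)]

    // Keep only the rows whose index falls inside the window [1, 3].
    val window = indexed
      .filter { case (_, i) => i >= 1L && i <= 3L }
      .map { case (row, _) => row }

    window.collect().foreach(println) // r1, r2, r3
    sc.stop()
  }
}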

Re: How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

2016-01-24 Thread Sonal Goyal
One thing you can also look at is saving your data in a way that can be accessed through file patterns, e.g. by hour, zone, etc., so that you only load what you need. On Jan 24, 2016 10:00 PM, "Ilya Ganelin" wrote: > The solution I normally use is to zipWithIndex() and then use
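A minimal sketch of the file-pattern layout suggested above, in Scala; the directory scheme and the glob are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object PathPartitionedExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pathPartitioned").setMaster("local[*]"))

    // Write: one directory per hour, e.g. /data/events/2016-01-24/hour=10
    val hour10 = sc.parallelize(Seq("event-a", "event-b"))
    hour10.saveAsTextFile("/data/events/2016-01-24/hour=10")

    // Read: a Hadoop path glob loads only hours 10-12; files for other
    // hours are never opened, so there is nothing to filter out.
    val slice = sc.textFile("/data/events/2016-01-24/hour={10,11,12}")
    println(slice.count())

    sc.stop()
  }
}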

How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

2016-01-23 Thread Nirav Patel
The problem is I have an RDD of about 10M rows and it keeps growing. Every time we want to query and compute on a subset of the data, we have to use filter and then some aggregation. As far as I know, filter goes through each partition and every row of the RDD, which may not be efficient at all. Spark
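For concreteness, a minimal sketch of the filter-then-aggregate pattern being described, in Scala; the record shape and the predicate are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object FullScanFilterExample {
  case class Event(zone: Int, value: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fullScanFilter").setMaster("local[*]"))

    val data = sc.parallelize(Seq(Event(1, 2.0), Event(2, 3.5), Event(1, 1.5)))

    // filter has no index to consult: it evaluates the predicate on
    // every row in every partition, even when only one zone is wanted.
    val subsetSum = data.filter(_.zone == 1).map(_.value).sum()

    println(subsetSum) // 3.5
    sc.stop()
  }
}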