@Ilya Ganelin, I'm not sure how zipWithIndex() will do less than an O(n) scan.
The Spark docs don't mention anything about it.
I found a solution in Spark 1.5.2's OrderedRDDFunctions: it has a filterByRange
API.
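For illustration, here is a plain-Python sketch (not actual Spark code) of why filterByRange can beat a plain filter: a range-partitioned RDD knows each partition's key bounds, so whole partitions outside the requested range are skipped without scanning their rows.

```python
def filter_by_range(partitions, lower, upper):
    """partitions: list of lists of (key, value) pairs, sorted by key and
    range-partitioned so each partition covers a contiguous key range."""
    out = []
    for part in partitions:
        if not part:
            continue
        lo, hi = part[0][0], part[-1][0]
        # Partition pruning: skip partitions entirely outside [lower, upper].
        if hi < lower or lo > upper:
            continue
        # Only the surviving (boundary) partitions need a per-row scan.
        out.extend((k, v) for k, v in part if lower <= k <= upper)
    return out

parts = [[(1, "a"), (2, "b")], [(5, "c"), (7, "d")], [(9, "e"), (12, "f")]]
print(filter_by_range(parts, 5, 9))  # [(5, 'c'), (7, 'd'), (9, 'e')]
```

The first partition is never scanned at all, which is the saving filterByRange offers over a full filter.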
Thanks
On Sun, Jan 24, 2016 at 10:27 PM, Sonal Goyal wrote:
> One thing
The solution I normally use is to zipWithIndex() and then use the filter
operation. Filter is an O(m) operation where m is the size of your
partition, not an O(N) operation.
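As a plain-Python sketch of the approach described above (not Spark code): zipWithIndex pairs each row with a global index, and filtering on that index then selects a contiguous slice of the dataset.

```python
def zip_with_index(rows):
    # Spark's RDD.zipWithIndex assigns indices partition by partition;
    # for a single in-memory sequence this is just enumerate, emitting
    # (row, index) pairs in the same order Spark does.
    return [(row, i) for i, row in enumerate(rows)]

rows = ["a", "b", "c", "d", "e"]
indexed = zip_with_index(rows)

# Keep only rows whose index falls in [1, 3].
subset = [row for row, i in indexed if 1 <= i <= 3]
print(subset)  # ['b', 'c', 'd']
```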
-Ilya Ganelin
On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel wrote:
> Problem is I have RDD of
One thing you can also look at is to save your data in a way that can be
accessed through file patterns, e.g. by hour, zone, etc., so that you only
load what you need.
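A small plain-Python sketch of this file-pattern idea (the `hour=NN` directory layout below is a hypothetical example): write data bucketed by hour so a later job opens only the files it needs, much as `sc.textFile("data/hour=10/*")` would load a single bucket in Spark.

```python
import glob
import os
import tempfile

base = tempfile.mkdtemp()
events = [(9, "login"), (10, "click"), (10, "buy"), (11, "logout")]

# Write one directory per hour, mirroring a partitioned on-disk layout.
for hour, event in events:
    d = os.path.join(base, f"hour={hour:02d}")
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "part-0"), "a") as f:
        f.write(event + "\n")

# Later, load only hour 10: files for other hours are never opened.
wanted = []
for path in glob.glob(os.path.join(base, "hour=10", "*")):
    with open(path) as f:
        wanted.extend(f.read().split())
print(wanted)  # ['click', 'buy']
```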
On Jan 24, 2016 10:00 PM, "Ilya Ganelin" wrote:
> The solution I normally use is to zipWithIndex() and then use
The problem is I have an RDD of about 10M rows, and it keeps growing. Every
time we want to perform a query and compute on a subset of the data, we have
to use filter and then some aggregation. As I understand it, filter goes
through each partition and every row of the RDD, which may not be efficient
at all.