Hi, I have an RDD that represents data over a time interval and I want
to select a subinterval of my data and partition it by day
based on a Unix timestamp field in the data.
What is the best way to do this with Spark?

I have currently implemented two solutions, both of which seem suboptimal.
Solution 1 is to filter the subinterval from the overall data set,
and then to filter each day out of this filtered data set.
However, this means the subset is scanned once per day, so the
same data gets filtered many times.
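For reference, here is a minimal sketch of Solution 1. The `Event` record
type, the `payload` field, and the `splitByDay` helper are hypothetical
stand-ins for my actual schema; timestamps are assumed to be in seconds,
with day boundaries aligned to the interval start:

import org.apache.spark.rdd.RDD

// Hypothetical record type for illustration: `ts` is a Unix timestamp
// in seconds; the real data has its own schema.
case class Event(ts: Long, payload: String)

object Solution1 {
  val SecondsPerDay = 86400L

  // Filter the subinterval once, then filter that subset again for each
  // day. Every per-day filter rescans the whole subset, which is the
  // repeated work described above.
  def splitByDay(events: RDD[Event], start: Long, end: Long): Seq[RDD[Event]] = {
    val subset = events.filter(e => e.ts >= start && e.ts < end)
    val numDays = ((end - start + SecondsPerDay - 1) / SecondsPerDay).toInt
    (0 until numDays).map { day =>
      val dayStart = start + day * SecondsPerDay
      val dayEnd   = dayStart + SecondsPerDay
      subset.filter(e => e.ts >= dayStart && e.ts < dayEnd)
    }
  }
}

Caching `subset` (via `subset.cache()`) would at least avoid recomputing
the interval filter on each pass, but each per-day filter still scans
every record in the subset.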

Solution 2 is to map the objects into a pair RDD where the
key is the day's index within the interval, then group by
key, collect, and parallelize each resulting group.
However, I worry that collecting large data sets onto the
driver will be a serious performance bottleneck.
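A sketch of Solution 2, reusing the hypothetical `Event` type from the
previous sketch; the `collect()` call is what materializes every group on
the driver:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Solution2 {
  val SecondsPerDay = 86400L

  // Key each record by its day index within the interval, group by that
  // key, pull the groups back to the driver, then redistribute each
  // group as its own RDD.
  def splitByDay(sc: SparkContext, events: RDD[Event],
                 start: Long, end: Long): Seq[RDD[Event]] = {
    events
      .filter(e => e.ts >= start && e.ts < end)
      .map(e => ((e.ts - start) / SecondsPerDay, e)) // key = day index
      .groupByKey()
      .collect()                                     // all groups to the driver
      .map { case (_, dayEvents) => sc.parallelize(dayEvents.toSeq) }
  }
}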

A small query using Solution 1 takes 13 seconds to run, and the same
query using Solution 2 takes 10 seconds to run,
but I think this can be further improved.
Does anybody have any suggestions on the best way to separate
a subset of data by day?

Thanks,
Brandon.


