Hi, I have an RDD that represents data over a time interval, and I want to select a subinterval of that data and partition it by day, based on a unix time field in each record. What is the best way to do this with Spark?
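To be concrete, the data looks roughly like this (a minimal sketch; the Event type, its field names, and the subinterval helper are hypothetical stand-ins for my actual schema):

import org.apache.spark.rdd.RDD

// Hypothetical record shape: each element carries a unix timestamp in seconds.
case class Event(time: Long, payload: String)

val secondsPerDay = 86400L

// Select the subinterval [start, end), both given as unix times in seconds.
def subinterval(data: RDD[Event], start: Long, end: Long): RDD[Event] =
  data.filter(e => e.time >= start && e.time < end)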
I have currently implemented two solutions, both of which seem suboptimal.

Solution 1 is to filter the subinterval out of the overall data set, and then to filter each day out of that subinterval in turn. However, this re-filters the same subinterval data once for every day it contains.

Solution 2 is to map the records into a pair RDD whose key is the index of the day within the interval, then group by key, collect, and parallelize each resulting group. However, I worry that collecting large data sets will become a serious performance bottleneck.

(Minimal sketches of both solutions appear at the end of this message.) A small query takes 13 seconds with Solution 1 and 10 seconds with Solution 2, but I think this can be further improved. Does anybody have any suggestions on the best way to separate a subset of data by day?

Thanks,
Brandon
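P.S. For concreteness, here are minimal sketches of the two solutions, reusing the hypothetical Event type and subinterval helper from above (simplified, not my exact code; both assume the interval spans whole days):

// Solution 1: filter the subinterval once, then filter it again per day.
// Every per-day filter re-scans the whole subinterval.
def splitByDayFiltering(data: RDD[Event], start: Long, end: Long): Seq[RDD[Event]] = {
  val sub = subinterval(data, start, end)
  val numDays = ((end - start) / secondsPerDay).toInt
  (0 until numDays).map { d =>
    val dayStart = start + d * secondsPerDay
    sub.filter(e => e.time >= dayStart && e.time < dayStart + secondsPerDay)
  }
}

// Solution 2: key each record by its day index within the interval, group by
// key, collect the groups to the driver, and re-parallelize each group.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD implicits for groupByKey

def splitByDayGrouping(sc: SparkContext, data: RDD[Event],
                       start: Long, end: Long): Seq[RDD[Event]] = {
  subinterval(data, start, end)
    .map(e => (((e.time - start) / secondsPerDay).toInt, e))
    .groupByKey()
    .collect()                          // pulls every group onto the driver
    .sortBy { case (day, _) => day }
    .map { case (_, events) => sc.parallelize(events.toSeq) }
    .toSeq
}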