Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
If you are on the 1.0.0 release, you can also try converting your RDD to a SchemaRDD and running the groupBy there. The Spark SQL optimizer may yield better results. It's worth a try at least.

On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta soumya.sima...@gmail.com wrote:
> Solution 2 is to map the

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
ssimanta wrote:
> Solution 2 is to map the objects into a pair RDD where the key is the
> number of the day in the interval, then group by key, collect, and
> parallelize the resulting grouped data.

However, I worry collecting large data sets is going to be a serious performance bottleneck. Why do
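A minimal sketch of the "Solution 2" shape described above, using plain Python lists as a stand-in for the Spark RDD API (the record layout, `interval_start` value, and payloads are all hypothetical): each record is keyed by its day number within the interval, then grouped. In Spark this would be `rdd.map(...)` followed by `groupByKey()`; the bottleneck the reply worries about is the subsequent `collect()`, which pulls every group back to the driver before re-parallelizing.

```python
from collections import defaultdict

# Assumed example data: (timestamp_seconds, payload) pairs.
# interval_start marks day 0 of the interval of interest.
interval_start = 1404950400
records = [
    (interval_start + 10, "a"),
    (interval_start + 86400 + 5, "b"),
    (interval_start + 20, "c"),
]

def day_index(ts):
    # Number of whole days since the start of the interval.
    return (ts - interval_start) // 86400

# Equivalent of rdd.map(lambda r: (day_index(r[0]), r[1])).groupByKey();
# calling .collect() on that result is what brings all groups to the driver.
grouped = defaultdict(list)
for ts, payload in records:
    grouped[day_index(ts)].append(payload)

# grouped == {0: ["a", "c"], 1: ["b"]}
```

The grouping itself is cheap; the cost concern in the thread is that collecting and re-parallelizing each group round-trips the whole data set through the driver.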

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
> I think my best option is to partition my data in directories by day
> before running my Spark application, and then direct my Spark application
> to load RDDs from each directory when I want to load a date range. How
> does this sound?

If your upstream system can write data by day, then it makes
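A small sketch of the directory-per-day layout being proposed, assuming one directory named by ISO date under a base path (the `hdfs:///events` base path and naming scheme are illustrative assumptions, not from the thread). Spark's `SparkContext.textFile` accepts a comma-separated list of paths, so the generated list can be joined and loaded as a single RDD.

```python
from datetime import date, timedelta

def day_dirs(base, start, end):
    """Return one directory path per day in the inclusive range [start, end]."""
    out = []
    d = start
    while d <= end:
        out.append(f"{base}/{d.isoformat()}")
        d += timedelta(days=1)
    return out

paths = day_dirs("hdfs:///events", date(2014, 7, 10), date(2014, 7, 12))
# paths == ["hdfs:///events/2014-07-10",
#           "hdfs:///events/2014-07-11",
#           "hdfs:///events/2014-07-12"]

# In Spark (sketch): sc.textFile(",".join(paths)) loads the whole date
# range as one RDD; loading each path separately yields one RDD per day.
```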

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
Sean Owen-2 wrote:
> Can you not just filter the range you want, then groupBy timestamp/86400?
> That sounds like your solution 1 and is about as fast as it gets, I think.
> Are you thinking you would have to filter out each day individually from
> there, and that's why it would be slow?

I don't
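The filter-then-group approach Sean Owen suggests, sketched on plain Python lists (the records and range bounds are made-up example values; in Spark this would be `rdd.filter(...)` followed by `groupBy(lambda r: r[0] // 86400)`): one pass keeps only the desired time range, then integer division by 86400 buckets the survivors by epoch day.

```python
# Assumed example data: (timestamp_seconds, value) pairs.
records = [(100, "x"), (86405, "y"), (200000, "z"), (50, "w")]
lo, hi = 0, 172800  # keep only the first two epoch days

# Step 1: filter the range you want (rdd.filter in Spark).
in_range = [r for r in records if lo <= r[0] < hi]

# Step 2: group by timestamp // 86400, i.e. the epoch day number.
by_day = {}
for ts, v in in_range:
    by_day.setdefault(ts // 86400, []).append(v)

# by_day == {0: ["x", "w"], 1: ["y"]}  -- (200000, "z") was filtered out
```

One filter plus one groupBy is a single shuffle over the already-reduced data, which is why it is "about as fast as it gets" compared with filtering each day out individually.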

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Sean Owen
On Fri, Jul 11, 2014 at 10:53 PM, bdamos a...@adobe.com wrote:
> I didn't make it clear in my first message that I want to obtain an RDD
> instead of an Iterable, and will be doing map-reduce-like operations on
> the data by day. My problem is that groupBy returns an
> RDD[(K, Iterable[T])], but I
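The snag above is that `groupBy` yields an `RDD[(K, Iterable[T])]`, where each day's data is a plain `Iterable`, not an RDD. A common workaround (not stated in the thread, so treat it as an assumption) is to skip `groupBy` and instead run one `filter` per day key, producing a genuinely separate RDD per day. A plain-Python sketch of that per-key filtering:

```python
# Assumed example data: (timestamp_seconds, value) pairs.
records = [(100, "x"), (86405, "y"), (50, "w")]

# The distinct day keys present in the data.
day_keys = sorted({ts // 86400 for ts, _ in records})

# One collection per day, built by filtering on the day key.
# In Spark (sketch): {k: rdd.filter(lambda r: r[0] // 86400 == k)
#                     for k in day_keys}
# -- each value would then be a real RDD supporting further map/reduce.
per_day = {k: [r for r in records if r[0] // 86400 == k] for k in day_keys}

# per_day == {0: [(100, "x"), (50, "w")], 1: [(86405, "y")]}
```

The trade-off is one full scan of the source RDD per day key, so this is only attractive when the number of days is small or the source RDD is cached.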