If you are on the 1.0.0 release, you can also try converting your RDD to a
SchemaRDD and running the groupBy there. The Spark SQL optimizer may yield
better results. It's worth a try at least.
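Something roughly like this, untested, with a made-up Event(day, value)
record type standing in for your actual data:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type; substitute your real schema.
    case class Event(day: Int, value: Double)

    val sc = new SparkContext("local", "groupByDay")
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Event] -> SchemaRDD

    val events = sc.parallelize(Seq(Event(0, 1.0), Event(0, 2.0), Event(1, 3.0)))
    events.registerAsTable("events")

    // Let the Spark SQL optimizer plan the aggregation.
    val byDay = sqlContext.sql("SELECT day, SUM(value) FROM events GROUP BY day")
    byDay.collect().foreach(println)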
On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
ssimanta wrote
Solution 2 is to map the objects into a pair RDD where the
key is the number of the day in the interval, then group by
key, collect, and parallelize the resulting grouped data.
However, I worry collecting large data sets is going to be
a serious performance bottleneck.
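In code, that solution would look roughly like this (untested; assumes
timestamps are epoch seconds and the payload is just a String):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "solution2")

    // Hypothetical (timestamp-in-seconds, payload) records.
    val records = sc.parallelize(Seq((0L, "a"), (90000L, "b"), (90060L, "c")))

    // Key each record by its day number in the interval, then group.
    val byDay = records.map { case (ts, payload) => (ts / 86400L, payload) }
                       .groupByKey()

    // collect() pulls every group onto the driver -- the suspected
    // bottleneck -- and parallelize() redistributes each day as its own RDD.
    val perDay = byDay.collect().map { case (day, items) =>
      (day, sc.parallelize(items.toSeq))
    }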
Why do
I think my best option is to partition my data into directories by day
before running my Spark application, and then direct
my Spark application to load RDDs from each directory when
I want to load a date range. How does this sound?
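The loading side of that could look roughly like this (untested; the
hdfs:///events/2014-07-01/ path layout is made up):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local", "loadRange")

    // Hypothetical layout: one directory per day, e.g. hdfs:///events/2014-07-01/
    def loadDay(day: String): RDD[String] =
      sc.textFile(s"hdfs:///events/$day/")

    // Load a date range as one RDD, or keep the per-day RDDs for per-day jobs.
    val days = Seq("2014-07-01", "2014-07-02", "2014-07-03")
    val perDay = days.map(d => d -> loadDay(d)).toMap
    val range  = sc.union(days.map(loadDay))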
If your upstream system can write data by day then it makes
Sean Owen-2 wrote
Can you not just filter the range you want, then groupBy
timestamp / 86400? That sounds like your solution 1 and is about as
fast as it gets, I think. Are you thinking you would have to filter
out each day individually from there, and that's why it would be slow?
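That is, something like this (untested; again assuming epoch-second
timestamps), with no collect() involved:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "filterGroupBy")

    // Hypothetical (timestamp-in-seconds, payload) records.
    val records = sc.parallelize(Seq((0L, "a"), (90000L, "b"), (200000L, "c")))

    val (start, end) = (0L, 172800L)  // date range of interest, in seconds

    // One filter over the range, one shuffle to group by day.
    val grouped = records
      .filter { case (ts, _) => ts >= start && ts < end }
      .groupBy { case (ts, _) => ts / 86400L }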
I don't
On Fri, Jul 11, 2014 at 10:53 PM, bdamos a...@adobe.com wrote:
I didn't make it clear in my first message that I want to obtain an RDD
instead of an Iterable, and will be doing map-reduce-like operations on the
data by day. My problem is that groupBy returns an RDD[(K, Iterable[T])],
but I