Process time series RDD after sortByKey

Shuai Zheng Mon, 09 Mar 2015 15:42:18 -0700

Hi All,


I am processing some time series data. For one day, it might has 500GB, then
for each hour, it is around 20GB data.

 

I need to sort the data before I start process. Assume I can sort them
successfully

 

dayRDD.sortByKey

 

but after that, I might have thousands of partitions (to make the sort
successfully), might be 1000 partitions. And then I try to process the data
by hour (not need exactly one hour, but some kind of similar time frame).
And I can't just re-partition size to 24 because then one partition might be
too big to fit into memory (if it is 20GB). So is there any way for me to
just can process underlying partitions by certain order? Basically I want to
call mapPartitionsWithIndex with a range of index?

 

Anyway to do it? Hope I describe my issue clear. J

 

Regards,

 

Shuai

Process time series RDD after sortByKey

Reply via email to