Re: Process time series RDD after sortByKey

Zhan Zhang Mon, 09 Mar 2015 17:47:58 -0700

Does the code flow similar to following work for you, which processes each 
partition of an RDD sequentially?


while( iterPartition < RDD.partitions.length) {
      val res = sc.runJob(this, (it: Iterator[T]) => somFunc, iterPartition, 
allowLocal = true)
      Some other function after processing one partition.
      iterPartition += 1
}

You can refer RDD.take for example.

Thanks.

Zhan Zhang

On Mar 9, 2015, at 3:41 PM, Shuai Zheng 
<szheng.c...@gmail.com<mailto:szheng.c...@gmail.com>> wrote:

Hi All,

I am processing some time series data. For one day, it might has 500GB, then 
for each hour, it is around 20GB data.

I need to sort the data before I start process. Assume I can sort them 
successfully

dayRDD.sortByKey

but after that, I might have thousands of partitions (to make the sort 
successfully), might be 1000 partitions. And then I try to process the data by 
hour (not need exactly one hour, but some kind of similar time frame). And I 
can’t just re-partition size to 24 because then one partition might be too big 
to fit into memory (if it is 20GB). So is there any way for me to just can 
process underlying partitions by certain order? Basically I want to call 
mapPartitionsWithIndex with a range of index?

Anyway to do it? Hope I describe my issue clear… :)

Regards,

Shuai

Re: Process time series RDD after sortByKey

Reply via email to