Does the code flow similar to following work for you, which processes each
partition of an RDD sequentially?
while( iterPartition < RDD.partitions.length) {
val res = sc.runJob(this, (it: Iterator[T]) => somFunc, iterPartition,
allowLocal = true)
Some other function after processing one partition.
iterPartition += 1
}
You can refer RDD.take for example.
Thanks.
Zhan Zhang
On Mar 9, 2015, at 3:41 PM, Shuai Zheng
<[email protected]<mailto:[email protected]>> wrote:
Hi All,
I am processing some time series data. For one day, it might has 500GB, then
for each hour, it is around 20GB data.
I need to sort the data before I start process. Assume I can sort them
successfully
dayRDD.sortByKey
but after that, I might have thousands of partitions (to make the sort
successfully), might be 1000 partitions. And then I try to process the data by
hour (not need exactly one hour, but some kind of similar time frame). And I
can’t just re-partition size to 24 because then one partition might be too big
to fit into memory (if it is 20GB). So is there any way for me to just can
process underlying partitions by certain order? Basically I want to call
mapPartitionsWithIndex with a range of index?
Anyway to do it? Hope I describe my issue clear… :)
Regards,
Shuai