Hi:

There is a case where I am using a reduce operation:

I need to reduce a large dataset made up of billions of key-value records.

The requirements are as follows (a sketch of the per-key logic is below):

First, group by key, and the records with the same key need to be in
order of one field called "date" in the value.

Second, within records with the same key, every operation on one record
needs to depend on the result of processing the previous one, and the
first record starts from the original state.
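
To make this concrete, here is a minimal sketch of the per-key logic,
with placeholder names (Record, State, and the transition are just
examples, not my real types):

    // Placeholder types for illustration only.
    case class Record(date: String, value: Double)
    case class State(acc: Double)

    // Within one key group: sort by "date", then fold sequentially so
    // that each step depends on the state from the previous record.
    def processGroup(records: Seq[Record]): State =
      records.sortBy(_.date).foldLeft(State(0.0)) { (state, rec) =>
        State(state.acc + rec.value) // placeholder transition
      }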


Some attempts:

1. groupByKey + map: because of the bad performance, the CPU goes to
100%, so executor heartbeats time out and throw "Lost executor" errors,
or the application hangs. Roughly, it looks like the sketch below.
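
Attempt 1, reusing the placeholder Record/State/processGroup from the
sketch above:

    import org.apache.spark.rdd.RDD

    // groupByKey pulls every value for a key into one in-memory
    // iterable on a single executor before sorting and folding it.
    def attempt1(rdd: RDD[(String, Record)]): RDD[(String, State)] =
      rdd.groupByKey()
         .mapValues(values => processGroup(values.toSeq))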


2. aggregateByKey:

    def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
        combOp: (U, U) => U): RDD[(K, U)]

About aggregateByKey: are all the records with the same key in the same
partition? Is the zeroValue applied to the first record among all records
with the same key, or to the first record in each partition? If it is
the former, the combOp function does nothing!

I also tried the overload that takes the second "numPartitions"
parameter, passing the number of distinct keys to it. But the number of
keys is so large that all the tasks were killed. My calls look roughly
like the sketch after this paragraph.
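
For reference, my aggregateByKey calls look roughly like this (the zero
value and the seqOp/combOp bodies are placeholders, not my real logic):

    import org.apache.spark.rdd.RDD

    def attempt2(rdd: RDD[(String, Record)]): RDD[(String, State)] =
      rdd.aggregateByKey(State(0.0))(
        (st, rec) => State(st.acc + rec.value), // seqOp: fold a record into the state
        (a, b)    => State(a.acc + b.acc)       // combOp: merge partial states
      )

    // The variant with an explicit partition count:
    def attempt2WithPartitions(rdd: RDD[(String, Record)],
                               numKeys: Int): RDD[(String, State)] =
      rdd.aggregateByKey(State(0.0), numKeys)(
        (st, rec) => State(st.acc + rec.value),
        (a, b)    => State(a.acc + b.acc)
      )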


What should I do in this case?

I'm asking for advice here...

Thank you.
