Hi:
There is a case where I need to run a reduce operation over a large dataset of billions of key-value records, with two requirements:

1. Group the records by key, and within each key the records must be ordered by a field called "date" in the value.
2. Within each key, processing a record depends on the result of processing the previous record; the first record starts from an initial state.

What I have tried:

1. groupByKey + map: performance is so bad (CPU at 100%) that executor heartbeats time out and the job throws "Lost executor" errors, or the application simply hangs.

2. aggregateByKey:

    def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
        combOp: (U, U) => U): RDD[(K, U)]

My questions about aggregateByKey: Are all the records with the same key in the same partition? Is the zeroValue applied once to the first record among all records with the same key, or once per partition? If it is the former, the combOp function would have nothing to do. I also tried passing the number of distinct keys as the second "numPartitions" parameter, but the number of keys is so large that all the tasks were killed.

What should I do with this case? I'm asking for advice online... Thank you.
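For concreteness, here is a minimal sketch of the groupByKey + map attempt described above. The `Record` and `State` types and the `step` transition function are illustrative placeholders, since the real record layout and per-record logic are not given in the post:

```scala
import org.apache.spark.rdd.RDD

// Illustrative types: the real record and state types are not in the post.
case class Record(date: Long, payload: String)
case class State(value: String)

// Hypothetical per-record transition: the new state depends on the previous result.
def step(prev: State, rec: Record): State =
  State(prev.value + rec.payload)

val initial = State("") // the "original state" for the first record of each key

def sequentialPerKey(rdd: RDD[(String, Record)]): RDD[(String, State)] =
  rdd.groupByKey().mapValues { records =>
    // Sort by date, then fold sequentially so that each step
    // sees the result of processing the previous record.
    records.toSeq.sortBy(_.date).foldLeft(initial)(step)
  }
```

Note that groupByKey materializes every value for a key in executor memory before the fold runs, which is consistent with the memory and CPU pressure reported above for very large keys.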
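On the zeroValue question: in Spark's aggregateByKey, the zeroValue is used as the initial accumulator once per key *per partition*; seqOp folds the values within one partition, and combOp then merges the per-partition partial results, which is why combOp is not a no-op. A small sketch using a per-key sum as a stand-in (the order-dependent computation from the post does not fit this pattern unless combOp can correctly merge partial results):

```scala
import org.apache.spark.SparkContext

// Sketch: summing values per key, to show where zeroValue, seqOp
// and combOp each run. Sum is a stand-in for the real computation.
def sumPerKey(sc: SparkContext): Unit = {
  val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)), numSlices = 2)

  val sums = rdd.aggregateByKey(0)(   // zeroValue: once per key per partition
    (acc, v) => acc + v,              // seqOp: folds values within one partition
    (acc1, acc2) => acc1 + acc2       // combOp: merges partial sums across partitions
  )
  sums.collect().foreach(println)     // e.g. (a,3) and (b,3)
}
```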