I'm not sure I follow exactly what you want to do.

You should try to sort the RDD by (yourKey, date); this ensures that all the
records with the same key are in the same partition.
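
A minimal sketch of that secondary sort, assuming a pair RDD of (yourKey,
value) where the value carries the date field — the Record type, the field
names and the partitioner below are placeholders, not your actual types:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

case class Record(date: Long, payload: String)   // placeholder value type

// Partition on yourKey only, so every record of a given key lands in the same
// partition; the sort inside each partition then orders by (yourKey, date).
class KeyOnlyPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(compositeKey: Any): Int = {
    val (key, _) = compositeKey.asInstanceOf[(String, Long)]
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

def sortByKeyAndDate(rdd: RDD[(String, Record)], parts: Int): RDD[((String, Long), Record)] =
  rdd.map { case (k, r) => ((k, r.date), r) }    // composite (yourKey, date) key
     .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(parts))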

Your problem after that is that you want to aggregate only on yourKey, and if
you change the key of the sorted RDD you lose the partitioning.

Depending on the size of the result, you can use an aggregate building a map
of results by yourKey, or use mapPartitions to output an RDD (in this case,
set the number of partitions high enough that each partition fits in memory);
a sketch of the latter follows.
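
A rough mapPartitions sketch of that second option, run on the RDD sorted as
above (it reuses the placeholder Record type); initialState and step are
hypothetical stand-ins for your per-record operation, where each record is
processed from the result of the previous one:

import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

case class State(acc: Double)                    // hypothetical running state

def initialState: State = State(0.0)             // "original state" for a key's first record
def step(prev: State, r: Record): State =        // hypothetical per-record operation
  State(prev.acc + r.payload.length)

// One sequential pass per partition: records arrive grouped by key and ordered
// by date, so each key's records are folded in order and one result per key is emitted.
def foldPerKey(sorted: RDD[((String, Long), Record)]): RDD[(String, State)] =
  sorted.mapPartitions { iter =>
    val out = ArrayBuffer.empty[(String, State)]
    var currentKey: String = null
    var state: State = initialState
    for (((key, _), record) <- iter) {
      if (key != currentKey) {
        if (currentKey != null) out += ((currentKey, state))   // previous key is finished
        currentKey = key
        state = initialState
      }
      state = step(state, record)
    }
    if (currentKey != null) out += ((currentKey, state))
    out.iterator    // results are buffered, hence the "partition must fit in memory" caveat
  }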

see
http://apache-spark-user-list.1001560.n3.nabble.com/Folding-an-RDD-in-order-td16577.html

2015-09-15 11:25 GMT+08:00 毕岩 <biyan900...@gmail.com>:

> Hi:
>
>
> There is such one case about using reduce operation like that:
>
>
> I Need to reduce large data made up of billions of records with a
> Key-Value pair.
>
> For the following:
>
>     *First, group by Key, and the records with the same Key need to be in
> order of one field called “date” in the Value*
>
> *    Second, in records with the same Key, every operation on one record
> needs to depend on the result of dealing with the previous one, and the first
> one starts from the original state.*
>
>
> Some attempts:
>
> 1. groupByKey + map:  because of the bad performance, CPU goes to 100%, so
> executor heartbeats time out and “Lost executor” errors are thrown, or the
> application hangs…
>
>
> 2. AggregateByKey:
>
> * def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,*
>
> *    combOp: (U, U) => U): RDD[(K, U)]*
>
> About aggregateByKey: are all the records with the same Key in the same
> partition? Is the zeroValue applied to the first record among all records
> with the same Key, or once in each partition? If it is the former, the
> combOp function does nothing!
>
> I tried the overload with the second “numPartitions” parameter, passing the
> number of keys to it. But the number of keys is so large that all the tasks
> were killed.
>
>
> What should I do with this case ?
>
> I'm asking for advice online...
>
> Thank you.
>



-- 
Alexis GILLAIN
