Hi Alexis: Of course, it's very useful to me, especially the part about the operations after the sort is done.
And I still have one question: how do I choose a decent number of partitions, if it does not need to equal the number of keys?

> On Sep 15, 2015, at 3:41 PM, Alexis Gillain <alexis.gill...@googlemail.com> wrote:
>
> Sorry, I made a typo in my previous message: you can't
> sortByKey(yourkey, date) and have all the records of your keys in the same
> partition.
>
> So you can sortByKey(yourkey) with a decent number of partitions (it doesn't
> have to be the number of keys). After that, the records of a key are grouped
> in one partition but not yet sorted by date.
>
> Then you use mapPartitions to copy the partition into a List, sort it
> by (yourkey, date), and use this list to compute whatever you want. The main
> constraint is that a partition must fit in memory.
>
> Hope this helps.
>
> 2015-09-15 13:50 GMT+08:00 <biyan900...@gmail.com>:
> Hi Alexis:
>
> Thank you for your reply.
>
> My case is that each operation on one record depends on a value that
> was set while operating on the previous record.
>
> So your advice is that I can use "sortByKey". Does "sortByKey" put all
> records with the same key in one partition? Do I need to pass the
> "numPartitions" parameter, or does it do that even if I don't?
>
> If that works and I add "aggregate" to handle my case, I think the combOp
> function in the parameter list of the aggregate API won't be executed.
> Is my understanding wrong?
>
>> On Sep 15, 2015, at 12:47 PM, Alexis Gillain <alexis.gill...@googlemail.com> wrote:
>>
>> I'm not sure about what you want to do.
>>
>> You should try to sort the RDD by (yourKey, date); it ensures that all the
>> records of a key are in the same partition.
>>
>> Your problem after that is that you want to aggregate only on yourKey, and
>> if you change the key of the sorted RDD you lose the partitioning.
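For readers following along, here is a minimal sketch of the approach Alexis describes (sortByKey with a fixed partition count, then mapPartitions sorting each partition by (key, date) before the stateful scan). The sample data, the partition count of 8, and the placeholder comment are illustrative assumptions, not part of the original thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sorted-scan"))

// Hypothetical input: (key, (date, value)) pairs.
val records = sc.parallelize(Seq(
  ("k1", ("2015-09-01", 10)), ("k1", ("2015-09-02", 20)),
  ("k2", ("2015-09-01", 5))
))

// 1. sortByKey with a partition count that need not equal the key count.
//    Records of a key land in one partition, but are not yet ordered by date.
val byKey = records.sortByKey(ascending = true, numPartitions = 8)

// 2. mapPartitions: materialize each partition, sort it by (key, date),
//    then run the per-record computation in order. Note the whole
//    partition must fit in memory, as Alexis points out.
val result = byKey.mapPartitions { iter =>
  iter.toList
    .sortBy { case (k, (date, _)) => (k, date) }
    .iterator // replace with the actual stateful per-key computation
}
```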
>>
>> Depending on the size of the result, you can use an aggregate building a
>> map of results by yourKey, or use mapPartitions to output an RDD (in this
>> case set the number of partitions high enough to allow each partition to
>> fit in memory).
>>
>> See
>> http://apache-spark-user-list.1001560.n3.nabble.com/Folding-an-RDD-in-order-td16577.html
>>
>> 2015-09-15 11:25 GMT+08:00 毕岩 <biyan900...@gmail.com>:
>> Hi:
>>
>> There is a case about using a reduce operation like this:
>>
>> I need to reduce a large dataset made up of billions of records, each a
>> key-value pair.
>>
>> The requirements are the following:
>>
>> First, group by key, and the records with the same key need to be in
>> order of one field in the value, called "date".
>>
>> Second, among records with the same key, every operation on one record
>> depends on the result of processing the previous one, and the first one
>> starts from the original state.
>>
>> Some attempts:
>>
>> 1. groupByKey + map: because of the bad performance, CPU goes to 100%, so
>> executor heartbeats time out and throw "Lost executor" errors, or the
>> application hangs...
>>
>> 2. aggregateByKey:
>>
>> def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
>>     combOp: (U, U) => U): RDD[(K, U)]
>>
>> About aggregateByKey: are all the records with the same key in the same
>> partition? Is the zeroValue applied to the first record among all records
>> with the same key, or once in each partition? If it is the former, the
>> combOp function does nothing!
>>
>> I tried to pass the number of keys to the second "numPartitions"
>> parameter, but the number of keys is so large that all the tasks were
>> killed.
>>
>> What should I do with this case?
>>
>> I'm asking for advice online...
>>
>> Thank you.
>>
>>
>> --
>> Alexis GILLAIN
>
>
> --
> Alexis GILLAIN
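To close the loop on the aggregateByKey question in the original post: the zeroValue is applied once per key *per partition*, seqOp runs within each partition, and combOp merges the per-partition results, so combOp does execute whenever a key's records span more than one partition. A small sketch of these semantics (the local master, sample data, and List accumulator are assumptions for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("agg-semantics").setMaster("local[2]"))

// Two partitions, so records of key "a" are split across them.
val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)), 2)

val perKey = rdd.aggregateByKey(List.empty[Int])(
  (acc, v) => v :: acc, // seqOp: folds records within one partition,
                        // starting from a fresh zeroValue in each
  (l, r) => l ::: r     // combOp: merges partial results across partitions
)
```

This also explains why aggregateByKey alone does not solve the date-ordering requirement: neither seqOp nor combOp sees the records in any guaranteed order, which is why the sorted mapPartitions approach above is suggested instead.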