Re: Spark aggregateByKey Issues

2015-09-15 Thread biyan900116
Hi Alexis: Of course, it's very useful to me, especially the part about the operations after the sort is done. And I still have one question: how do I set a decent number of partitions, if it does not need to equal the number of keys? > On Sep 15, 2015, at 3:41 PM, Alexis Gillain
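
A small sketch of where that partition count plugs in, assuming a hypothetical (String, Long) composite key and a placeholder value of 200 partitions to be tuned; sortByKey takes the count as an explicit argument, so it is independent of the number of keys:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  def main(args: Array[String]): Unit = {
    // local[*] master is an assumption so the sketch runs standalone
    val sc = new SparkContext(new SparkConf().setAppName("partition-count-sketch").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq((("k1", 20150914L), 1.0), (("k1", 20150915L), 2.0)))
    // numPartitions is chosen freely; it need not match the number of distinct keys
    val sorted = pairs.sortByKey(ascending = true, numPartitions = 200)
    println(sorted.partitions.length)
    sc.stop()
  }
}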

Re: Spark aggregateByKey Issues

2015-09-15 Thread Alexis Gillain
That's the tricky part. If the volume of data is always the same, you can find a value by testing and tuning. If the volume of data can vary, you can use the number of records in your file divided by the number of records you think can fit in memory. Either way, the distribution of your records can still impact the
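
A rough sketch of that heuristic: derive the partition count from the total record count divided by how many records you estimate fit comfortably in one partition's memory. The recordsPerPartition figure is an assumption you have to tune for your own data and executors:

import org.apache.spark.rdd.RDD

def estimatePartitions(rdd: RDD[_], recordsPerPartition: Long): Int = {
  val total = rdd.count()   // costs one extra pass over the data
  math.max(1, math.ceil(total.toDouble / recordsPerPartition).toInt)
}

// e.g. val numPartitions = estimatePartitions(records, recordsPerPartition = 1000000L)

Note that this only sets the average partition size; as the message says, a skewed key distribution can still overload individual partitions.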

Re: Spark aggregateByKey Issues

2015-09-14 Thread Alexis Gillain
I'm not sure about what you want to do. You should try to sort the RDD by (yourKey, date); it ensures that all records with the same key end up in the same partition. Your problem after that is that you want to aggregate only on yourKey, and if you change the key of the sorted RDD you lose the partitioning. Depending on
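
One common way to get "sorted by (yourKey, date) with each key's records kept in one partition" is a custom partitioner that hashes only on the key, combined with repartitionAndSortWithinPartitions. This is a sketch under assumed types (String key, Long date, Double value), not the exact code from the thread:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hash only on the business key, so every (key, date) pair for a given key
// lands in the same partition; the within-partition sort orders by (key, date).
class KeyOnlyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(compositeKey: Any): Int = {
    val (key, _) = compositeKey.asInstanceOf[(String, Long)]
    (key.hashCode % numPartitions + numPartitions) % numPartitions   // non-negative modulo
  }
}

def sortWithinKeys(pairs: RDD[((String, Long), Double)], partitions: Int): RDD[((String, Long), Double)] =
  pairs.repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(partitions))

Because the aggregation key (yourKey) differs from the sort key (yourKey, date), this avoids re-keying the sorted RDD and losing the partitioning that the message warns about.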

Spark aggregateByKey Issues

2015-09-14 Thread 毕岩
Hi: I have a case involving a reduce operation: I need to reduce a large dataset made up of billions of Key-Value records, as follows: *First, group by Key, and the records with the same Key need to be ordered by one field called "date" in the Value* *
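
A minimal sketch of the requirement as stated, using a hypothetical Record(key, date, value) shape: group by Key and order each group by date. Note that groupByKey materializes a whole group in memory, which is the pain point the rest of the thread works around:

import org.apache.spark.{SparkConf, SparkContext}

case class Record(key: String, date: Long, value: Double)

object GroupThenSortByDate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-then-sort").setMaster("local[*]"))
    val records = sc.parallelize(Seq(
      Record("k1", 20150915L, 2.0),
      Record("k1", 20150914L, 1.0),
      Record("k2", 20150914L, 3.0)
    ))
    val grouped = records
      .map(r => (r.key, r))
      .groupByKey()                        // pulls each key's records onto one executor
      .mapValues(_.toSeq.sortBy(_.date))   // then sorts that group by the date field
    grouped.collect().foreach(println)
    sc.stop()
  }
}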

Re: Spark aggregateByKey Issues

2015-09-14 Thread biyan900116
Hi Alexis: Thank you for your reply. My case is that the operation on each record needs to depend on a value that is set by the operation on the previous record. So your advice is that I can use "sortByKey". "sortByKey" will put all records with the same Key in one partition. Need I
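
A hedged sketch of the sequential dependency described here: once the records are sorted by (key, date) and co-partitioned by key, each partition can be walked once in order, carrying a running value from one record to the next. The running sum is only a placeholder for the real per-record update rule:

import org.apache.spark.rdd.RDD

def sequentialPerKey(sorted: RDD[((String, Long), Double)]): RDD[((String, Long), Double)] =
  sorted.mapPartitions { iter =>
    var currentKey: String = null
    var carried: Double = 0.0                 // value produced by the previous record
    iter.map { case ((key, date), value) =>
      if (key != currentKey) {                // new key: reset the carried state
        currentKey = key
        carried = 0.0
      }
      carried += value                        // placeholder for the real dependency
      ((key, date), carried)
    }
  }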