Each folder should have no duplicates. Duplicates only exist across different folders. 

The logic inside the reduceByKey is to keep only the longest string value for each key. 
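
To be concrete, here is a toy spark-shell example with made-up keys; it is the same reduce as in the code below:

val sample = sc.parallelize(Seq(("k1", "abc"), ("k1", "abcdef"), ("k2", "x")))
// For each key, keep whichever value is longer.
val longest = sample.reduceByKey((a, b) => if (a.length > b.length) a else b)
longest.collect()   // -> (k1,abcdef), (k2,x) in some order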

The current problem is that I am exceeding the largest frame size when trying to 
write to HDFS: the frame is around 500 MB, while the configured limit is 80 MB. 
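
If it turns out the limit is the Akka frame size (spark.akka.frameSize, value in MB), I could try raising it when creating the context, or via --conf on spark-submit/spark-shell. The numbers below are just a guess:

import org.apache.spark.{SparkConf, SparkContext}

// Guessing at the relevant knob: spark.akka.frameSize is in MB.
val conf = new SparkConf()
  .setAppName("dedupe")
  .set("spark.akka.frameSize", "512")   // well above the ~500 MB frame; current setting is 80
val sc = new SparkContext(conf)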

Sent from my iPhone

> On Jun 14, 2015, at 02:10, ayan guha <guha.a...@gmail.com> wrote:
> 
> Can you do the dedupe process locally for each file first and then globally? 
> Also, I did not fully get the logic of the part inside reduceByKey. Can you 
> kindly explain?
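> 
> Something like this is what I have in mind -- just a sketch, reusing your 
> folders array and the same longest-value reduce:
> 
> // Dedupe inside each folder first, then union and dedupe globally.
> val perFolder = folders.map { f =>
>   sc.textFile(f)
>     .map(x => (x.split("\t")(0), x.split("\t")(1)))
>     .reduceByKey((a, b) => if (a.length > b.length) a else b)
> }
> val global = perFolder.reduce(_ union _)
>   .reduceByKey((a, b) => if (a.length > b.length) a else b)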
> 
>> On 14 Jun 2015 13:58, "Gavin Yue" <yue.yuany...@gmail.com> wrote:
>> I have 10 folders, each with 6000 files. Each folder is roughly 500 GB, so 
>> about 5 TB of data in total. 
>> 
>> The data is formatted as key \t value. After the union, I want to remove the 
>> duplicates among keys, so each key is unique and has only one value. 
>> 
>> Here is what I am doing. 
>> 
>> val folders = Array("folder1", "folder2", ..., "folder10") 
>> 
>> // Load the first folder as (key, value) pairs split on tab.
>> var rawData = sc.textFile(folders(0)).map(x => (x.split("\t")(0), x.split("\t")(1))) 
>> 
>> // Union in the remaining folders one by one.
>> for (a <- 1 to folders.length - 1) { 
>>   rawData = rawData.union(sc.textFile(folders(a)).map(x => (x.split("\t")(0), x.split("\t")(1)))) 
>> } 
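>> 
>> By the way, I believe textFile also accepts a comma-separated list of paths, 
>> so the loop above could probably be collapsed into one call (not verified): 
>> 
>> // Load all 10 folders in a single textFile call instead of the union loop.
>> val rawDataAll = sc.textFile(folders.mkString(","))
>>   .map(x => (x.split("\t")(0), x.split("\t")(1)))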
>> 
>> val nodups = rawData.reduceByKey((a, b) => if (a.length > b.length) a else b) 
>> nodups.saveAsTextFile("/nodups") 
>> 
>> Is there anything I could do to make this process faster? Right now my 
>> process dies when writing the output to HDFS. 
>> 
>> 
>> Thank you !
