Can you do the dedupe process locally within each file first, and then globally?
Also, I did not fully get the logic of the part inside reduceByKey. Could you
kindly explain?
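The local-then-global idea can be sketched in plain Python (a minimal sketch, not the poster's actual job: lines are assumed to be "key \t value", the merge rule is "keep the longest value per key" as described later in the thread, and in Spark the two stages would map to a per-partition pass plus a reduceByKey):

```python
def longest(a, b):
    # Merge rule from the thread: keep the longer of two values for a key.
    return a if len(a) >= len(b) else b

def dedupe_local(lines):
    # First pass: collapse duplicate keys within a single file.
    seen = {}
    for line in lines:
        key, value = line.split("\t", 1)
        seen[key] = longest(seen[key], value) if key in seen else value
    return seen

def dedupe_global(per_file_dicts):
    # Second pass: merge the already-deduped per-file results,
    # so the global shuffle only sees one record per key per file.
    merged = {}
    for d in per_file_dicts:
        for key, value in d.items():
            merged[key] = longest(merged[key], value) if key in merged else value
    return merged
```

The payoff of the two-stage shape is that the global step shuffles at most one record per key per file instead of every raw line.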
On 14 Jun 2015 13:58, Gavin Yue yue.yuany...@gmail.com wrote:
I have 10 folders, each with 6000 files. Each folder is roughly 500GB.
Each folder should have no dups; dups exist only across different folders.
The logic inside is to take only the longest string value for each key.
The current problem is exceeding the largest frame size when trying to write to
HDFS: the frame needed is about 500 MB while the configured limit is 80 MB.
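The function passed to reduceByKey just has to pick between two values for the same key; Spark applies it pairwise, so it must be associative, which "take the longer string" is. A minimal sketch (plain Python standing in for Spark; `reduce_by_key` is an illustrative stand-in, not a Spark API):

```python
def keep_longest(v1, v2):
    # reduceByKey calls this pairwise on values that share a key.
    # It must be associative and commutative, which this is.
    return v1 if len(v1) >= len(v2) else v2

# In Spark this would be: pairs.reduceByKey(keep_longest)
# Simulated here over plain (key, value) tuples:
def reduce_by_key(pairs, fn):
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v
    return out
```

Because the function is associative, Spark can apply it map-side (combining within a partition) before shuffling, which reduces the data that crosses the network.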
If your job is dying due to out-of-memory errors in the post-shuffle stage,
I'd consider the following approach for implementing de-duplication /
distinct():
- Use sortByKey() to perform a full sort of your dataset.
- Use mapPartitions() to iterate through each partition of the sorted
dataset, emitting a single record per run of identical keys.
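The sort-then-scan step can be sketched as follows (plain Python standing in for the body of mapPartitions; it assumes sortByKey has already placed all records for a key next to each other within the iterator):

```python
from itertools import groupby

def dedupe_sorted_partition(records):
    """records: an iterator of (key, value) tuples already sorted by key,
    as they would arrive inside mapPartitions after sortByKey.
    Emits one record per key, keeping the longest value, while holding
    only one key's group in memory at a time instead of a whole hash map."""
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, max((v for _, v in group), key=len)
```

This trades the big post-shuffle hash map of reduceByKey for a streaming scan, which is why it helps when the post-shuffle stage runs out of memory.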
I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so
5TB of data in total.
The data is formatted as "key \t value". After the union, I want to remove
the duplicates among keys, so each key should be unique and have only one
value.
Here is what I am doing.
folders =