If your job is dying due to out-of-memory errors in the post-shuffle stage,
I'd consider the following approach for implementing de-duplication /
distinct():

- Use sortByKey() to perform a full sort of your dataset.
- Use mapPartitions() to iterate through each partition of the sorted
dataset, consuming contiguous runs of records with the same key and
performing your aggregation / de-duplication logic (a sketch follows this
list).
- If you expect your workload to benefit from pre-shuffle de-duplication,
implement it using mapPartitions() before the sortByKey() transformation.

This is effectively a way of performing external aggregation using the
existing Spark core APIs.

In the longer term, I would recommend using DataFrames for this, since
there are planned optimizations for distinct queries that could make a
large difference here (e.g. the Tungsten data-layout optimizations,
forthcoming Tungsten record-sorting optimizations, etc.).
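For reference, a minimal sketch of the DataFrame route (Spark 1.4 API);
the column names are my assumptions, and note that dropDuplicates() keeps
an arbitrary row per key, so reproducing the "longest value wins" rule
would take extra work (e.g. a groupBy on key plus a join back):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// rawData: RDD[(String, String)] as in your snippet below.
val df = rawData.toDF("key", "value")

// One arbitrary row per key; semantics differ from your reduceByKey.
val deduped = df.dropDuplicates(Seq("key"))
deduped.rdd
  .map(r => r.getString(0) + "\t" + r.getString(1))
  .saveAsTextFile("/nodups")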

On Sat, Jun 13, 2015 at 10:49 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so
> 5TB of data in total.
>
> The data is formatted as key \t value. After the union, I want to remove
> the duplicates among keys, so that each key is unique and has only one
> value.
>
> Here is what I am doing.
>
> val folders = Array("folder1", "folder2", /* ... */ "folder10")
>
> var rawData = sc.textFile(folders(0))
>   .map(x => (x.split("\t")(0), x.split("\t")(1)))
>
> for (a <- 1 until folders.length) {
>   rawData = rawData.union(
>     sc.textFile(folders(a)).map(x => (x.split("\t")(0), x.split("\t")(1))))
> }
>
> val nodups = rawData.reduceByKey((a, b) => if (a.length > b.length) a else b)
> nodups.saveAsTextFile("/nodups")
>
> Is there anything I could do to make this process faster? Right now my
> process dies when writing the output to HDFS.
>
>
> Thank you!
>
