If your job is dying due to out-of-memory errors in the post-shuffle stage, I'd consider the following approach for implementing de-duplication / distinct():
- Use sortByKey() to perform a full sort of your dataset.
- Use mapPartitions() to iterate through each partition of the sorted dataset, consuming contiguous runs of records with the same key and performing your aggregation / de-duplication logic.
- If you think that your workload will benefit from pre-shuffle de-duplication, implement it using mapPartitions() before the sortByKey() transformation.

This is effectively a way of performing external aggregation using the existing Spark core APIs (a rough sketch of the pattern appears at the end of this message).

In the longer term, I would recommend using DataFrames for this, since there are planned optimizations for distinct queries that could make a large difference here (e.g. the Tungsten data layout optimizations, the forthcoming Tungsten record-sorting optimizations, etc.).

On Sat, Jun 13, 2015 at 10:49 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> I have 10 folders, each with 6000 files. Each folder is roughly 500 GB, so
> 5 TB of data in total.
>
> The data is formatted as key \t value. After the union, I want to remove
> the duplicates among keys, so each key should be unique and have only one
> value.
>
> Here is what I am doing:
>
> val folders = Array("folder1", "folder2", ..., "folder10")
>
> var rawData = sc.textFile(folders(0)).map(x => (x.split("\t")(0),
>   x.split("\t")(1)))
>
> for (a <- 1 to folders.length - 1) {
>   rawData = rawData.union(sc.textFile(folders(a)).map(x =>
>     (x.split("\t")(0), x.split("\t")(1))))
> }
>
> val nodups = rawData.reduceByKey((a, b) =>
>   if (a.length > b.length) a else b
> )
> nodups.saveAsTextFile("/nodups")
>
> Anything I could do to make this process faster? Right now my process dies
> when writing the output to HDFS.
>
> Thank you!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/What-is-most-efficient-to-do-a-large-union-and-remove-duplicates-tp23303.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
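
To make the pattern concrete, here is a rough, untested sketch of the external aggregation against the core API. It assumes (String, String) records as in your snippet; pickLonger is just a stand-in for your "keep the longer value" reduce function:

import org.apache.spark.rdd.RDD

// Stand-in for your combining logic: keep the longer of two values.
def pickLonger(a: String, b: String): String =
  if (a.length > b.length) a else b

def externalDedup(rawData: RDD[(String, String)]): RDD[(String, String)] = {
  // Optional pre-shuffle de-duplication: collapse duplicate keys within
  // each input partition before any data crosses the network. Note that
  // this map is bounded by the number of distinct keys in one partition,
  // so only do this if that fits in memory.
  val combined = rawData.mapPartitions { iter =>
    val seen = scala.collection.mutable.HashMap.empty[String, String]
    iter.foreach { case (k, v) =>
      seen(k) = seen.get(k).map(pickLonger(_, v)).getOrElse(v)
    }
    seen.iterator
  }

  // Full sort: after this, all records with the same key are contiguous
  // within a single partition.
  val sorted = combined.sortByKey()

  // Stream through each sorted partition, merging each contiguous run of
  // equal keys; only the current run's state is held in memory.
  sorted.mapPartitions { iter =>
    new Iterator[(String, String)] {
      private val buffered = iter.buffered
      def hasNext: Boolean = buffered.hasNext
      def next(): (String, String) = {
        var (key, value) = buffered.next()
        while (buffered.hasNext && buffered.head._1 == key) {
          value = pickLonger(value, buffered.next()._2)
        }
        (key, value)
      }
    }
  }
}

The key point is that mapPartitions consumes the partition as an iterator, so the merge step never materializes more than one key's run at a time, rather than building a large hash map of keys after the shuffle.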
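
For the DataFrame route, something along these lines should work with the Spark 1.4-era API (also untested; note that max() here compares values lexicographically, so it only approximates your "longer value" logic, and you'd substitute whatever aggregate actually fits):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.max

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// One row per key; the aggregate decides which value survives.
val deduped = rawData.toDF("key", "value")
  .groupBy("key")
  .agg(max("value").as("value"))

If any surviving value per key is acceptable, dropDuplicates(Seq("key")) would be simpler still.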