I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so about 5TB of data in total.
The data is formatted as key \t value. After the union, I want to remove duplicates among the keys, so that each key is unique and has only one value. Here is what I am doing:

    val folders = Array("folder1", "folder2", ..., "folder10")
    var rawData = sc.textFile(folders(0)).map { line =>
      val cols = line.split("\t")
      (cols(0), cols(1))
    }
    for (a <- 1 until folders.length) {
      rawData = rawData.union(sc.textFile(folders(a)).map { line =>
        val cols = line.split("\t")
        (cols(0), cols(1))
      })
    }
    val nodups = rawData.reduceByKey((a, b) => if (a.length > b.length) a else b)
    nodups.saveAsTextFile("/nodups")

Is there anything I could do to make this process faster? Right now the job dies when writing the output to HDFS. Thank you!
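P.S. One variant I have been meaning to try, in case it helps: sc.textFile accepts a comma-separated list of paths, so the union loop can be dropped entirely, and each line can be split once instead of twice. This is only a sketch, not tested at this scale; the folder names follow the naming described above, and the partition count (5000) is a placeholder I would still have to tune for ~5TB of input.

    // Folder names assumed to follow the folder1..folder10 pattern above.
    val folders = (1 to 10).map(i => s"folder$i").toArray

    // sc.textFile takes a comma-separated path list, so no union loop.
    val rawData = sc.textFile(folders.mkString(","))
      .map { line =>
        // Split on the first tab only; assumes every line contains a tab,
        // as the original code also did.
        val tab = line.indexOf('\t')
        (line.substring(0, tab), line.substring(tab + 1))
      }

    // An explicit partition count spreads the shuffle and the final HDFS
    // write across more tasks; 5000 is a guess, not a tuned number.
    val nodups = rawData.reduceByKey(
      (a, b) => if (a.length > b.length) a else b,
      5000)

    nodups.saveAsTextFile("/nodups")

Since the job dies at the write stage, the partition count feeding saveAsTextFile is probably the first knob to turn: too few partitions means huge tasks that can run out of memory, too many means a flood of tiny part files.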