I have 10 folders, each with 6000 files. Each folder is roughly 500 GB, so
about 5 TB of data in total.

The data is formatted as key \t value (tab-separated). After the union, I want
to remove the duplicates among keys, so each key is unique and has only one
value.
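
For example (made-up keys and values), input lines like

  apple\t1234567890
  apple\t98765
  banana\t42

should come out after deduplication as

  apple\t1234567890
  banana\t42

keeping the longer value for a duplicate key, which is what the reduceByKey
below does.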

Here is what I am doing. 

val folders = Array("folder1", "folder2", ..., "folder10")

var rawData = sc.textFile(folders(0))
  .map(x => (x.split("\t")(0), x.split("\t")(1)))

for (a <- 1 until folders.length) {
  rawData = rawData.union(
    sc.textFile(folders(a)).map(x => (x.split("\t")(0), x.split("\t")(1))))
}

val nodups = rawData.reduceByKey((a, b) =>
  if (a.length > b.length) a else b   // keep the longer value for duplicate keys
)
nodups.saveAsTextFile("/nodups")
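
For reference, here is a sketch of a variant I was considering (untested at
this scale): it reads all folders with a single textFile call, since
sc.textFile accepts a comma-separated list of paths, and splits each line only
once. The output path and variable names are just placeholders.

// Sketch (untested at this scale): read every folder in one textFile call
// and split each line once instead of twice.
val allPaths = folders.mkString(",")     // sc.textFile accepts comma-separated paths
val pairs = sc.textFile(allPaths).map { line =>
  val cols = line.split("\t", 2)         // limit 2: split on the first tab only
  (cols(0), cols(1))
}
val nodupsAlt = pairs.reduceByKey((a, b) => if (a.length > b.length) a else b)
nodupsAlt.saveAsTextFile("/nodups_alt")  // placeholder output path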

Is there anything I could do to make this process faster? Right now my job
dies when writing the output to HDFS.


Thank you!


