Hi, I have a Spark job and a dataset of 0.5 million items. Each item undergoes some computation (a join against a shared external dataset, if that matters) and produces an RDD containing 20-500 result items. I would now like to combine all of these RDDs and run a subsequent job on the result. What I have found is that the computation itself is quite fast, but combining the RDDs takes much longer.
    val result = data     // 0.5M data items
      .map(compute(_))    // produces an RDD - fast
      .reduce(_ ++ _)     // combining RDDs - slow

I have also tried collecting the results from compute(_) and using a flatMap, but that is also slow. Is there a way to do this efficiently? I'm considering writing the result to HDFS and reading it back from disk for the next job, but I'm not sure whether that is the preferred approach in Spark. Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Combining-Many-RDDs-tp22243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
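For concreteness, the flatMap variant I tried looks roughly like this (computeLocal is a hypothetical version of compute that returns a plain local collection rather than an RDD; the real signature may differ):

    // Sketch only, under the assumption that the per-item work can
    // return a local Seq[Result] instead of a small RDD. A single
    // flatMap then produces one RDD directly, instead of 0.5M small
    // RDDs that have to be merged afterwards.
    val result = data.flatMap(item => computeLocal(item)) // RDD[Result]

In my case this was also slow, which is why I'm asking whether going through HDFS is the better option.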