sparkx <y...@yang-cs.com> writes:

> Hi,
>
> I have a Spark job and a dataset of 0.5 million items. Each item performs
> some computation (joining against a shared external dataset, if that
> matters) and produces an RDD containing 20-500 result items. I would then
> like to combine all these RDDs and run a follow-up job. What I have found
> is that the computation itself is quite fast, but combining the RDDs takes
> much longer:
>
>     val result = data      // 0.5M data items
>       .map(compute(_))     // Produces an RDD - fast
>       .reduce(_ ++ _)      // Combining RDDs - slow
>
> I have also tried collecting the results from compute(_) and using a
> flatMap, but that is also slow.
>
> Is there a way to do this efficiently? I'm thinking about writing the
> result to HDFS and reading it back from disk for the next job, but I'm
> not sure whether that's the preferred approach in Spark.
Are you looking for SparkContext.union() [1]?

Note that it does not perform well with the Spark Cassandra connector, so
I am not sure whether it will help in your case.

Thanks and Regards,
Noorul

[1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
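For reference, a minimal sketch of what the sc.union() approach might look
like, assuming (as in the original snippet) that `data` is a driver-side
collection and `compute` returns an RDD per item. The SparkContext setup
and the toy compute() below are illustrative stand-ins, not the poster's
actual code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("union-sketch"))

    // Stand-ins for the poster's code: a driver-side collection of items
    // and a compute() that yields a small RDD for each item.
    val data: Seq[Int] = 1 to 1000
    def compute(item: Int): RDD[Int] = sc.parallelize(Seq(item, item * 2))

    // reduce(_ ++ _) builds a chain of binary UnionRDDs, one per item;
    // sc.union builds a single n-ary UnionRDD over all of them at once.
    val result: RDD[Int] = sc.union(data.map(compute))

The pairwise reduce is slow partly because it creates a lineage chain half
a million unions deep on the driver; sc.union flattens that into one union
over the whole sequence of RDDs.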