Hi,

I have a Spark job and a dataset of 0.5 million items. For each item I
perform a computation (a join against a shared external dataset, if that
matters) that produces an RDD of 20-500 result items. I would now like to
combine all of these RDDs and run a next job on the combined result. What
I have found is that the per-item computation itself is quite fast, but
combining the resulting RDDs takes much longer.

    val result = data       // 0.5M data items
      .map(compute(_))      // produces an RDD per item - fast
      .reduce(_ ++ _)       // combines the RDDs - slow
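      // For context, the shapes involved (roughly, in my setup):
      //   data    - a local Seq of the 0.5M input items (driver side)
      //   compute - Item => RDD[ResultItem]
      //   result  - a single RDD built up from ~0.5M unions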

I have also tried collecting the results of compute(_) and flattening
them with a flatMap, but that is also slow.
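
Roughly, that attempt looked like this (sc and ResultItem stand in for
the names in my actual code):

    // Run each small RDD's job eagerly and flatten on the driver.
    // This triggers one Spark job per input item.
    val collected: Seq[ResultItem] =
      data.flatMap(item => compute(item).collect())
    // Re-distribute the flattened results for the next job.
    val result = sc.parallelize(collected)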

Is there a way to do this efficiently? I'm thinking about writing the
combined result to HDFS and reading it back from disk for the next job,
but I'm not sure whether that is a preferred pattern in Spark.
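
What I have in mind there would look something like the following (the
path is just an example):

    // Persist the combined result, then start the next job from disk.
    result.saveAsObjectFile("hdfs:///tmp/combined-result")
    val reloaded = sc.objectFile[ResultItem]("hdfs:///tmp/combined-result")
    // ... run the next job on `reloaded` ...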

Thank you.


