Yang Chen <y...@yang-cs.com> writes:

> Hi Noorul,
>
> Thank you for your suggestion. I tried that, but ran out of memory. I did
> some searching and found suggestions that we should avoid rdd.union (
> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
> ). I will try to come up with some other way.
>
I think you are using rdd.union(), but I was referring to
SparkContext.union(). I am not sure how many RDDs you have, but I had no
memory issues when I used it to combine 2000 RDDs. That said, I did have
other performance issues with the spark-cassandra-connector.

Thanks and Regards
Noorul

> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noo...@noorul.com> wrote:
>
>> sparkx <y...@yang-cs.com> writes:
>>
>> > Hi,
>> >
>> > I have a Spark job and a dataset of 0.5 million items. Each item performs
>> > some sort of computation (joining a shared external dataset, if that
>> > matters) and produces an RDD containing 20-500 result items. Now I would
>> > like to combine all these RDDs and run a next job. What I have found is
>> > that the computation itself is quite fast, but combining these RDDs takes
>> > much longer.
>> >
>> > val result = data    // 0.5M data items
>> >   .map(compute(_))   // Produces an RDD - fast
>> >   .reduce(_ ++ _)    // Combining RDDs - slow
>> >
>> > I have also tried to collect the results from compute(_) and use a
>> > flatMap, but that is also slow.
>> >
>> > Is there a way to do this efficiently? I'm thinking about writing the
>> > result to HDFS and reading it back from disk for the next job, but am not
>> > sure whether that is a preferred way in Spark.
>> >
>>
>> Are you looking for SparkContext.union() [1]?
>>
>> This did not perform well with the spark-cassandra-connector, so I am
>> not sure whether it will help you.
>>
>> Thanks and Regards
>> Noorul
>>
>> [1]
>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
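
To illustrate the difference being discussed: `rdds.reduce(_ ++ _)` builds a
chain of roughly N-1 nested two-way UnionRDDs (a deep lineage the driver must
track), whereas SparkContext.union builds a single flat UnionRDD over all of
the inputs at once. A minimal sketch, assuming a local SparkContext; the names
`data` and `compute` from the original post are replaced here by a toy
`parallelize` example:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("union-sketch").setMaster("local[*]"))

    // Stand-in for the many small RDDs produced per input item.
    val rdds: Seq[RDD[Int]] =
      (1 to 100).map(i => sc.parallelize(Seq(i, i * 2)))

    // Slow pattern from the post: pairwise unions, deep lineage.
    // val combined = rdds.reduce(_ ++ _)

    // Flat alternative: one UnionRDD over the whole sequence.
    val combined: RDD[Int] = sc.union(rdds)

    println(combined.count())
    sc.stop()
  }
}
```

Both forms produce the same elements; the difference is in the DAG the driver
has to build and schedule, which is why the pairwise reduce gets slow (and
memory-hungry on the driver) as the number of RDDs grows.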