Hi Kelvin,

Thank you. That works for me. I wrote my own joins that produced Scala
collections instead of using rdd.join.

Regards,
Yang
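A minimal sketch of the kind of hand-rolled join Yang describes, assuming
the shared external dataset fits in memory and can be broadcast as a map
keyed on the join key; Item, external, and key are illustrative names, not
from the original job:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastJoinSketch {
      // Illustrative item type; the real job's types are not shown in the thread.
      case class Item(key: Int, payload: String)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-sketch"))

        val data: Seq[Item] = (1 to 500000).map(i => Item(i % 1000, s"item-$i"))
        val external: Map[Int, String] = (0 until 1000).map(k => k -> s"ext-$k").toMap

        // Ship the external dataset to every executor once, instead of using rdd.join.
        val externalBc = sc.broadcast(external)

        // One RDD for the whole computation: the join is done per item against the
        // broadcast map and returns a plain Scala collection, which flatMap flattens.
        val result = sc.parallelize(data).flatMap { item =>
          externalBc.value.get(item.key).map(ext => (item.payload, ext)).toSeq
        }

        println(result.count())
        sc.stop()
      }
    }

This keeps a single RDD end to end, which matches Kelvin's suggestion below.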
On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu <2dot7kel...@gmail.com> wrote:

> Hi, I used union() before and yes, it can be slow sometimes. I _guess_ your
> variable 'data' is a Scala collection and compute() returns an RDD. Right?
> If yes, I tried the approach below to operate on only one RDD during the
> whole computation (yes, I also saw that too many RDDs hurt performance).
>
> Change compute() to return a Scala collection instead of an RDD:
>
>     val result = sc.parallelize(data) // Create and partition the 0.5M items in a single RDD.
>       .flatMap(compute(_))            // You still have only one RDD, with each item
>                                       // already joined with the external data.
>
> Hope this helps.
>
> Kelvin
>
> On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <y...@yang-cs.com> wrote:
>
>> Hi Mark,
>>
>> That's true, but neither way lets me combine the RDDs, so I have to
>> avoid unions.
>>
>> Thanks,
>> Yang
>>
>> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> RDD#union is not the same thing as SparkContext#union.
>>>
>>> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <y...@yang-cs.com> wrote:
>>>
>>>> Hi Noorul,
>>>>
>>>> Thank you for your suggestion. I tried that, but ran out of memory. I
>>>> did some searching and found suggestions that rdd.union should be avoided
>>>> (http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark).
>>>> I will try to come up with some other way.
>>>>
>>>> Thank you,
>>>> Yang
>>>>
>>>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noo...@noorul.com> wrote:
>>>>
>>>>> sparkx <y...@yang-cs.com> writes:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I have a Spark job and a dataset of 0.5 million items. For each item I
>>>>> > perform some computation (joining a shared external dataset, if that
>>>>> > matters) that produces an RDD of 20-500 result items. Now I would like
>>>>> > to combine all these RDDs and run the next job on the result. What I
>>>>> > have found is that the computation itself is quite fast, but combining
>>>>> > the RDDs takes much longer:
>>>>> >
>>>>> >     val result = data    // 0.5M data items
>>>>> >       .map(compute(_))   // Produces an RDD - fast
>>>>> >       .reduce(_ ++ _)    // Combining RDDs - slow
>>>>> >
>>>>> > I have also tried collecting the results from compute(_) and using a
>>>>> > flatMap, but that is also slow.
>>>>> >
>>>>> > Is there a way to do this efficiently? I'm thinking about writing the
>>>>> > result to HDFS and reading it back from disk for the next job, but I'm
>>>>> > not sure whether that's a preferred way in Spark.
>>>>>
>>>>> Are you looking for SparkContext.union() [1]?
>>>>>
>>>>> It does not perform well with the Spark Cassandra connector, so I am not
>>>>> sure whether it will help you.
>>>>>
>>>>> Thanks and Regards
>>>>> Noorul
>>>>>
>>>>> [1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext

--
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang
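As a footnote to Mark's point above: folding with RDD#union nests one
UnionRDD per step, while SparkContext#union builds a single flat UnionRDD
over all inputs. A small sketch (rdds is an illustrative name):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object UnionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("union-sketch"))

        val rdds: Seq[RDD[Int]] = (1 to 100).map(i => sc.parallelize(Seq(i, i * 2)))

        // RDD#union applied pairwise: a deeply nested lineage, one UnionRDD per step.
        val pairwise: RDD[Int] = rdds.reduce(_ union _)

        // SparkContext#union: one flat union over all inputs.
        val flat: RDD[Int] = sc.union(rdds)

        println(pairwise.toDebugString) // long, nested lineage
        println(flat.toDebugString)     // single union over 100 parents
        sc.stop()
      }
    }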