In your code, you are combining large sets with expressions like (set1 ++ set2).size, which is not a good idea because the whole merged set has to be built in memory.
(rdd1 ++ rdd2).distinct is the equivalent operation on RDDs (follow it with count() to get the size), and it is computed in a distributed manner. I am not sure, though, whether your computation on keyed sets can feasibly be transformed into RDD operations.

Regards,
Kevin

On Tue Jan 20 2015 at 1:57:52 PM Kevin Jung <itsjb.j...@samsung.com> wrote:

> As far as I know, the tasks before calling saveAsTextFile are
> transformations, so they are computed lazily. The saveAsTextFile action
> then performs all the transformations, and your Set[String] grows at that
> point. If you have only a few keys, each set becomes a large collection,
> and this easily causes an OOM when your executor memory and memory
> fraction settings are not suitable for the computation.
> If you only want the collection counts by key, you can use countByKey(),
> or map() the RDD[(String, Set[String])] to an RDD[(String, Long)] after
> creating the hoge RDD so that reduceByKey collects only the counts per key.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248p21251.html
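
For reference, here is a minimal sketch of the distributed approach discussed in this thread. It assumes the data can be expressed as an RDD of (key, value) pairs instead of an RDD[(String, Set[String])], which may not hold for the original job. The object name DistinctCounts, the helper distinctCountsByKey, the sample data, and the local master are all hypothetical and only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // needed for pair-RDD functions on Spark versions before 1.3
import org.apache.spark.rdd.RDD

object DistinctCounts {
  // Hypothetical helper: number of distinct values per key across two pair RDDs.
  def distinctCountsByKey(a: RDD[(String, String)],
                          b: RDD[(String, String)]): RDD[(String, Long)] = {
    // Union the two RDDs and drop duplicate (key, value) pairs in a distributed way,
    // instead of merging a Set[String] per key in memory.
    val distinctPairs = (a ++ b).distinct()
    // Replace each value with 1 and sum per key, so only Longs are combined.
    distinctPairs.mapValues(_ => 1L).reduceByKey(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run on one machine.
    val conf = new SparkConf().setAppName("distinct-counts").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for the real input.
      val pairs1 = sc.parallelize(Seq(("a", "x"), ("a", "y"), ("b", "x")))
      val pairs2 = sc.parallelize(Seq(("a", "y"), ("a", "z"), ("b", "x")))
      distinctCountsByKey(pairs1, pairs2)
        .collect()
        .foreach { case (key, n) => println(s"$key -> $n") }
    } finally {
      sc.stop()
    }
  }
}
```

With this sample data, key "a" has three distinct values and key "b" has one. The design choice is to never hold a whole Set[String] for a key on a single executor: deduplication happens on individual pairs, and the final reduceByKey only sums numbers.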