In your code, you are combining large sets in memory, e.g.
(set1 ++ set2).size
which is not a good idea.

(rdd1 ++ rdd2).distinct
is an equivalent implementation that computes in a distributed manner.
I'm not sure, though, whether your computations on keyed sets can be
transformed into RDD operations.
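
For example, the distributed version of the count above would look
something like this (a minimal sketch; the sample data and app name are
placeholders, and it assumes rdd1 and rdd2 are RDD[String]):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("DistinctCountSketch").setMaster("local[*]"))
val rdd1 = sc.parallelize(Seq("a", "b", "c"))
val rdd2 = sc.parallelize(Seq("b", "c", "d"))

// Union then distinct runs as a distributed shuffle, so no single JVM
// has to hold the merged set in memory.
val n = (rdd1 ++ rdd2).distinct().count() // 4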

Regards,
Kevin


On Tue Jan 20 2015 at 1:57:52 PM Kevin Jung <itsjb.j...@samsung.com> wrote:

> As far as I know, the operations before the call to saveAsTextFile are
> transformations, so they are computed lazily. The saveAsTextFile action
> then runs all of the transformations, and that is when your Set[String]
> grows. If you have only a few keys, it builds a very large collection,
> which easily causes an OOM when your executor memory and memory-fraction
> settings are not sized for this computation. If all you want is the
> collection count per key, you can use countByKey(), or map the
> RDD[(String, Set[String])] to an RDD[(String, Long)] after creating the
> hoge RDD, so that reduceByKey aggregates only the counts per key.
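>
> For illustration, a minimal sketch of that counting approach (hoge is
> the RDD name from the original post; the sample data and app name here
> are made up):
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> val sc = new SparkContext(
>   new SparkConf().setAppName("CountsByKeySketch").setMaster("local[*]"))
> val hoge = sc.parallelize(Seq(
>   ("k1", Set("a", "b")), ("k2", Set("c")), ("k1", Set("d"))))
>
> // Shrink each record to a Long before the shuffle; reduceByKey then
> // sums per-key sizes instead of merging whole Sets in memory.
> val counts = hoge.mapValues(_.size.toLong).reduceByKey(_ + _)
> counts.collect().foreach(println) // e.g. (k1,3), (k2,1); order may vary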
