Re: How to compute RDD[(String, Set[String])] that include large Set
Kevin (Sangwoo) Kim wrote:

> If keys are not too many, you can do like this:
>
> val data = List(
>   ("A", Set(1,2,3)),
>   ("A", Set(1,2,4)),
>   ("B", Set(1,2,3))
> )
> val rdd = sc.parallelize(data)
> rdd.persist()
> rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
> rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
> rdd.unpersist()
>
> ==
>
> data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
> rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
> res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
> res334: Long = 4
> res335: Long = 3
> res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
>
> Regards,
> Kevin

Wow, got it! Good solution. Fortunately, I know which keys have a large Set, so I was able to adopt this approach. Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248p21275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: How to compute RDD[(String, Set[String])] that include large Set
Great to hear you got a solution!! Cheers!

Kevin

On Wed Jan 21 2015 at 11:13:44 AM jagaximo takuya_seg...@dwango.co.jp wrote:

> Wow, got it! Good solution. Fortunately, I know which keys have a large Set, so I was able to adopt this approach. Thanks!
>
> (snip)
Re: How to compute RDD[(String, Set[String])] that include large Set
Instead of counted.saveAsTextFile("/path/to/save/dir"), what happens if you call counted.collect? If you still face the same issue, please paste the stack trace here.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248p21250.html
Re: How to compute RDD[(String, Set[String])] that include large Set
As far as I know, the operations before saveAsTextFile is called are transformations, so they are lazily computed. The saveAsTextFile action then triggers all of the transformations, and your Set[String] grows at that point. This creates a very large collection when you have only a few keys, which easily causes an OOM if your executor memory and memory fraction settings are not suited to this computation. If you only want element counts by key, you can use countByKey(), or map() the RDD[(String, Set[String])] to RDD[(String, Long)] after creating the hoge RDD, so that reduceByKey aggregates only the counts per key.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248p21251.html
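[Editor's aside: the map()-then-reduceByKey shape suggested above can be sketched on plain Scala collections. This is a local analogue only, with hypothetical sample data; on a real RDD the groupBy/map pair below would be reduceByKey(_ + _).]

```scala
// Hypothetical sample data mirroring the thread's RDD[(String, Set[String])]
val hoge = List(
  ("A", Set("x", "y", "z")),
  ("A", Set("x", "w")),
  ("B", Set("x"))
)

// Map each (key, set) to (key, size), then sum the sizes per key --
// locally mimicking rdd.map { case (k, s) => (k, s.size.toLong) }.reduceByKey(_ + _)
val counts: Map[String, Long] = hoge
  .map { case (k, s) => (k, s.size.toLong) }
  .groupBy(_._1)
  .map { case (k, pairs) => (k, pairs.map(_._2).sum) }

// counts("A") == 5, counts("B") == 1
```

Note that this sums set sizes, so an element appearing in two sets under the same key is counted twice; it is a count of elements, not of unique elements.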
Re: How to compute RDD[(String, Set[String])] that include large Set
In your code, you're computing unions of large sets, like (set1 ++ set2).size, which is not a good idea. (rdd1 ++ rdd2).distinct is an equivalent implementation and will compute in a distributed manner. I'm not very sure whether your computation on keyed sets is feasible to transform into RDDs.

Regards,
Kevin

On Tue Jan 20 2015 at 1:57:52 PM Kevin Jung itsjb.j...@samsung.com wrote:

> (snip)
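[Editor's aside: the difference Kevin points at can be sketched locally with hypothetical sets. Both shapes give the same answer; the point is that in Spark only the second one distributes the deduplication instead of materializing the whole union inside one JVM.]

```scala
// Two hypothetical overlapping sets
val set1 = (1 to 1000).toSet
val set2 = (500 to 1500).toSet

// Union materialized in a single JVM: memory grows with the combined size,
// which is what (set1 ++ set2).size does inside one task
val mergedSize = (set1 ++ set2).size

// The recommended shape, mimicked locally: flatten to elements and
// deduplicate, analogous to (rdd1 ++ rdd2).distinct.count in Spark
val distinctCount = (set1.toList ++ set2.toList).distinct.size

// Both equal 1500 here (elements 1..1500); only the second scales out in Spark
```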
Re: How to compute RDD[(String, Set[String])] that include large Set
That is what I want to do: get a unique count for each key. So using map() or countByKey() does not give a unique count (because duplicate strings are likely to be counted)...

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248p21254.html
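[Editor's aside: one scalable shape for exact unique counts per key, sketched on plain collections with hypothetical data, is to flatten to (key, element) pairs before deduplicating. In Spark this would be flatMap, then distinct(), then countByKey(), which avoids building one giant Set per key.]

```scala
// Hypothetical data: "x" repeats across sets under key A
val data = List(
  ("A", Set("x", "y")),
  ("A", Set("x", "z")),
  ("B", Set("x"))
)

// Flatten to (key, element) pairs, deduplicate the pairs, then count per key --
// locally mimicking
//   rdd.flatMap { case (k, s) => s.map(e => (k, e)) }.distinct().countByKey()
val uniqueCounts: Map[String, Int] = data
  .flatMap { case (k, s) => s.map(e => (k, e)) }
  .distinct
  .groupBy(_._1)
  .map { case (k, pairs) => (k, pairs.size) }

// uniqueCounts("A") == 3 (x, y, z each counted once), uniqueCounts("B") == 1
```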
Re: How to compute RDD[(String, Set[String])] that include large Set
If keys are not too many, you can do like this:

val data = List(
  ("A", Set(1,2,3)),
  ("A", Set(1,2,4)),
  ("B", Set(1,2,3))
)
val rdd = sc.parallelize(data)
rdd.persist()
rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
rdd.unpersist()

==

data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
res334: Long = 4
res335: Long = 3
res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66

Regards,
Kevin

On Tue Jan 20 2015 at 2:53:22 PM jagaximo takuya_seg...@dwango.co.jp wrote:

> (snip)