Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread jagaximo
Kevin (Sangwoo) Kim wrote
 If keys are not too many,
 you can do it like this:
 
 val data = List(
   ("A", Set(1,2,3)),
   ("A", Set(1,2,4)),
   ("B", Set(1,2,3))
 )
 val rdd = sc.parallelize(data)
 rdd.persist()
 
 rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
 rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
 rdd.unpersist()
 
 ==
 data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
 rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
 res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
 res334: Long = 4
 res335: Long = 3
 res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
 
 Regards,
 Kevin

Wow, got it! That's a good solution.
Fortunately, I know which keys have large Sets, so I was able to adopt this
approach.
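
Roughly, I just loop the same filter over that known list of keys (a sketch;
largeKeys is a placeholder for my real key list, and rdd is persisted as in the
example above):

val largeKeys = Seq("A", "B")   // placeholder for the keys known to carry big Sets
val uniqueCounts = largeKeys.map { key =>
  key -> rdd.filter(_._1 == key).flatMap(_._2).distinct().count()
}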

thanks!







Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread Kevin (Sangwoo) Kim
Great to hear you got a solution!!
Cheers!

Kevin

On Wed Jan 21 2015 at 11:13:44 AM jagaximo takuya_seg...@dwango.co.jp
wrote:

 Kevin (Sangwoo) Kim wrote
  If keys are not too many,
  you can do it like this:
 
  val data = List(
    ("A", Set(1,2,3)),
    ("A", Set(1,2,4)),
    ("B", Set(1,2,3))
  )
  val rdd = sc.parallelize(data)
  rdd.persist()
 
  rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
  rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
  rdd.unpersist()
 
  ==
  data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
  rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
  res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
  res334: Long = 4
  res335: Long = 3
  res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
 
  Regards,
  Kevin

 Wow, got it! That's a good solution.
 Fortunately, I know which keys have large Sets, so I was able to adopt this
 approach.

 thanks!








Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Pankaj Narang
Instead of counted.saveAsText("/path/to/save/dir"), what happens if you call
counted.collect?


If you still face the same issue, please paste the stack trace here.






Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin Jung
As far as I know, the operations before calling saveAsText are transformations,
so they are lazily computed. The saveAsText action then triggers all of those
transformations, and that is when your Set[String] grows. With only a few keys
this produces a very large collection, which easily causes an OOM when your
executor memory and memory-fraction settings are not sized for it.
If you only want counts by key, you can use countByKey(), or map() the
RDD[(String, Set[String])] to an RDD[(String, Long)] after creating the hoge
RDD, so that reduceByKey aggregates only counts per key.
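
A rough spark-shell sketch of that count-only idea (assuming hoge is your
RDD[(String, Set[String])]); note this gives the total element count per key,
not the distinct count:

val sizes  = hoge.map { case (key, set) => (key, set.size.toLong) }  // sizes only, Sets never merged
val counts = sizes.reduceByKey(_ + _)                                // total element count per key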






Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin (Sangwoo) Kim
In your code you're combining large sets on the driver, like
(set1 ++ set2).size
which is not a good idea.

(rdd1 ++ rdd2).distinct
is the equivalent operation and is computed in a distributed manner.
I'm not entirely sure whether your per-key set computations can be expressed as
RDD operations, though.
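
A tiny spark-shell sketch of the contrast, with toy data just to show the shape
(set1, set2, rdd1, rdd2 are illustrative names, not taken from your code):

val set1 = Set("a", "b", "c")
val set2 = Set("b", "c", "d")
val rdd1 = sc.parallelize(set1.toSeq)
val rdd2 = sc.parallelize(set2.toSeq)

(set1 ++ set2).size                 // merges both Sets in driver memory, then counts
(rdd1 ++ rdd2).distinct().count()   // same answer, deduplicated across the cluster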

Regards,
Kevin


On Tue Jan 20 2015 at 1:57:52 PM Kevin Jung itsjb.j...@samsung.com wrote:

 As far as I know, the operations before calling saveAsText are transformations,
 so they are lazily computed. The saveAsText action then triggers all of those
 transformations, and that is when your Set[String] grows. With only a few keys
 this produces a very large collection, which easily causes an OOM when your
 executor memory and memory-fraction settings are not sized for it.
 If you only want counts by key, you can use countByKey(), or map() the
 RDD[(String, Set[String])] to an RDD[(String, Long)] after creating the hoge
 RDD, so that reduceByKey aggregates only counts per key.







Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread jagaximo
What I want to do is get a unique count for each key, so map() or countByKey()
won't give me a unique count (duplicate strings would likely be counted)...







Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin (Sangwoo) Kim
If keys are not too many,
you can do it like this:

val data = List(
  ("A", Set(1,2,3)),
  ("A", Set(1,2,4)),
  ("B", Set(1,2,3))
)
val rdd = sc.parallelize(data)
rdd.persist()

rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
rdd.unpersist()

==
data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
res334: Long = 4
res335: Long = 3
res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
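
If the keys are too many for per-key filters, one alternative sketch (using the
same rdd as above) is to explode each Set into (key, element) pairs,
deduplicate, and count per key:

val pairs = rdd.flatMap { case (key, set) => set.map(v => (key, v)) }  // (key, element)
val uniquePerKey = pairs.distinct().map { case (key, _) => (key, 1L) }.reduceByKey(_ + _)
uniquePerKey.collect()   // e.g. Array((A,4), (B,3))

After distinct(), countByKey() also works when the number of keys is small
enough to bring back to the driver as a Map.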

Regards,
Kevin



On Tue Jan 20 2015 at 2:53:22 PM jagaximo takuya_seg...@dwango.co.jp
wrote:

 What I want to do is get a unique count for each key, so map() or countByKey()
 won't give me a unique count (duplicate strings would likely be counted)...



