Re: Splitting RDD and Grouping together to perform computation
Hi, thanks Nan Zhu. I tried to implement your suggestion on the following scenario: I have an RDD of, say, 24 elements. When I partitioned it into two groups of 12 elements each, the order of elements within the partitions was lost; elements are partitioned randomly. I need to preserve the order, so that the first 12 elements go into the 1st partition and the next 12 elements into the 2nd partition. Please help me maintain the order of the original sequence even after partitioning. Any solution?

Before partition (RDD): 64 29186 16059 9143 6439 6155 9187 18416 25565 30420 33952 38302 43712 47092 48803 52687 56286 57471 63429 70715 75995 81878 80974 71288 48556

After partition, group 1 (12 elements): 64, 29186, 18416, 30420, 33952, 38302, 43712, 47092, 56286, 81878, 80974, 71288, 48556

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-and-Grouping-together-to-perform-computation-tp3153p3447.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: Splitting RDD and Grouping together to perform computation
I think you should sort each RDD.

-Original Message- From: yh18190 [mailto:yh18...@gmail.com] Sent: March-28-14 4:44 PM To: u...@spark.incubator.apache.org Subject: Re: Splitting RDD and Grouping together to perform computation [quoted text snipped]
RE: Splitting RDD and Grouping together to perform computation
I say you need to remap so you have a key for each tuple that you can sort on. Then call rdd.sortByKey(true), like this:

mystream.transform(rdd => rdd.sortByKey(true))

For this function to be available you need to import org.apache.spark.rdd.OrderedRDDFunctions.

-Original Message- From: yh18190 [mailto:yh18...@gmail.com] Sent: March-28-14 5:02 PM To: u...@spark.incubator.apache.org Subject: RE: Splitting RDD and Grouping together to perform computation

Hi, here is my code for the given scenario. Could you please let me know where to sort? I mean, on what basis do we have to sort, so that the partitions maintain the order of the original sequence?

val res2 = reduced_hccg.map(_._2)  // gives an RDD of numbers
res2.foreach(println)

val result = res2.mapPartitions(p => {
  val l = p.toList
  val approx = new ListBuffer[(Int, Int)]  // needs: import scala.collection.mutable.ListBuffer
  for (i <- 0 until l.length - 1 by 2) {
    println(l(i), l(i + 1))
    approx += ((l(i), l(i + 1)))
  }
  approx.toList.iterator
})
result.foreach(println)
Re: Splitting RDD and Grouping together to perform computation
From the gist of it, it seems like you need to override the default partitioner to control how your data is distributed among partitions. Take a look at the different Partitioners available (e.g. HashPartitioner, RangePartitioner); if none of these gets you the desired result, you might want to provide your own.

On Fri, Mar 28, 2014 at 2:08 PM, Adrian Mocanu amoc...@verticalscope.com wrote: [quoted text snipped]
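To make the custom-partitioner idea concrete, here is a minimal sketch of the assignment rule such a partitioner could apply; `partitionFor` is an illustrative name, not Spark API. In Spark itself you would key each element by its position and wrap this rule in a class extending org.apache.spark.Partitioner. Shown here as a plain function so the rule can be tried without a cluster:

```scala
// Order-preserving assignment rule: with `total` elements and
// `numPartitions` partitions, the element at position i goes to
// partition i * numPartitions / total, so consecutive runs of the
// original sequence stay together in the same partition.
def partitionFor(index: Long, total: Long, numPartitions: Int): Int =
  ((index * numPartitions) / total).toInt

// With 24 elements and 2 partitions: indices 0..11 -> partition 0,
// indices 12..23 -> partition 1, matching the split asked for above.
val assignments = (0L until 24L).map(i => partitionFor(i, 24L, 2))
```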
RE: Splitting RDD and Grouping together to perform computation
Hi Adrian, thanks for the suggestion. Could you please modify the part of my code where I need to do so? I apologise for the inconvenience; I am new to Spark and couldn't apply it appropriately. I would be thankful to you.
RE: Splitting RDD and Grouping together to perform computation
Not sure how to change your code, because you'd need to generate the keys where you get the data. Sorry about that. I can tell you where to put the code to remap and sort, though:

import org.apache.spark.rdd.OrderedRDDFunctions

val res2 = reduced_hccg.map(_._2)
  .map(x => (newkey, x))  // newkey = whatever key you generate to sort on
  .sortByKey(true)
  // and if you want to remove the key you used for sorting, remap:
  .map(x => x._2)

res2.foreach(println)

val result = res2.mapPartitions(p => {
  val l = p.toList
  val approx = new ListBuffer[(Int, Int)]
  for (i <- 0 until l.length - 1 by 2) {
    println(l(i), l(i + 1))
    approx += ((l(i), l(i + 1)))
  }
  approx.toList.iterator
})
result.foreach(println)

-Original Message- From: yh18190 [mailto:yh18...@gmail.com] Sent: March-28-14 5:17 PM To: u...@spark.incubator.apache.org Subject: RE: Splitting RDD and Grouping together to perform computation [quoted text snipped]
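In case it helps to see the keying step end to end: a common way to get a sort key that reflects the original order is to zip each value with its index. On an RDD (in Spark versions that provide it) the analogous pipeline would be roughly `rdd.zipWithIndex().map(_.swap).sortByKey(true).map(_._2)` — that's a sketch, not the only way. The snippet below runs the same keying/sorting logic on a plain Scala List so it can be tried without a cluster:

```scala
// Key each value by its original position, scramble the order (standing in
// for a shuffle losing the order), then sort on the key and drop it.
val data     = List(64, 29186, 16059, 9143, 6439, 6155)
val keyed    = data.zipWithIndex.map { case (v, i) => (i, v) } // (index, value)
val shuffled = scala.util.Random.shuffle(keyed)                // order lost, keys kept
val restored = shuffled.sortBy(_._1).map(_._2)                 // original order back
```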
Re: Splitting RDD and Grouping together to perform computation
We need someone who can explain, with a short code snippet on the given example, so that we get a clear idea of indexing in RDDs. Guys, please help us.
Re: Splitting RDD and Grouping together to perform computation
Partition your input into an even number of partitions, then use mapPartitions to operate on each Iterator[Int]. Maybe there are more efficient ways….

Best,
-- Nan Zhu

On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:

Hi, I have a large data set of numbers (an RDD) and want to perform a computation on groups of two values at a time. For example, 1, 2, 3, 4, 5, 6, 7... is an RDD. Can I group the RDD into (1,2), (3,4), (5,6)...? and perform the respective computations in an efficient manner? Since we don't have a way to index elements directly as with a for loop, i.e. (i, i+1)..., is there a way to resolve this problem? Please suggest; I would be really thankful to you.

View this message in context: Splitting RDD and Grouping together to perform computation (http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-and-Grouping-together-to-perform-computation-tp3153.html) Sent from the Apache Spark User List mailing list archive (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com.
Re: Splitting RDD and Grouping together to perform computation
I didn’t group the integers, but process them in groups of two. Partition:

scala> val a = sc.parallelize(List(1, 2, 3, 4), 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

Then process each partition, taking the elements within the partition in groups of 2:

scala> import scala.collection.mutable.ListBuffer
scala> a.mapPartitions(p => { val l = p.toList;
     |   val ret = new ListBuffer[Int]
     |   for (i <- 0 until l.length by 2) {
     |     ret += l(i) + l(i + 1)
     |   }
     |   ret.toList.iterator
     | })
res7: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at mapPartitions at <console>:16

scala> res7.collect
res10: Array[Int] = Array(3, 7)

Best,
-- Nan Zhu

On Monday, March 24, 2014 at 8:40 PM, Nan Zhu wrote: [quoted text snipped]
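One caveat with the mapPartitions approach is that pairs never span a partition boundary, so it relies on each partition holding an even number of elements. An alternative that forms the (1,2), (3,4), … pairs regardless of where partition boundaries fall is to key each element by index / 2 and group on that key — on an RDD, roughly `rdd.zipWithIndex().map { case (v, i) => (i / 2, v) }` followed by groupByKey (a sketch, and groupByKey does shuffle). The same grouping logic on a plain List:

```scala
// Pair consecutive elements by grouping on index / 2:
// indices 0,1 -> key 0; indices 2,3 -> key 1; and so on.
val nums = List(1, 2, 3, 4, 5, 6)
val pairs = nums.zipWithIndex
  .groupBy { case (_, i) => i / 2 }          // Map[key, List[(value, index)]]
  .toList
  .sortBy(_._1)                              // restore pair order
  .map { case (_, group) => group.map(_._1) } // keep just the values
// pairs: List(List(1, 2), List(3, 4), List(5, 6))
```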