Re: Convert from RDD[Object] to RDD[Array[Object]]
And if you can relax your constraints even further to only require RDD[List[Int]], then it's even simpler:

    rdd.mapPartitions(_.grouped(batchedDegree))

On Sat, Jul 12, 2014 at 6:26 PM, Aaron Davidson wrote:
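To see the grouped() approach end to end — a minimal sketch, assuming a live SparkContext named sc and a batch size of 100 (both illustrative, not from the thread):

    // Two explicit partitions of 105 elements each, to make the result deterministic.
    val rdd = sc.parallelize(1 to 210, 2)
    // grouped() batches within each partition; map each batch to an array if
    // RDD[Array[Int]] is specifically required.
    val batched = rdd.mapPartitions(_.grouped(100).map(_.toArray))
    batched.map(_.length).collect()  // Array(100, 5, 100, 5):
                                     // one smaller trailing batch per partition

As the sizes show, batch boundaries follow partition boundaries, which is exactly the relaxation both replies in this thread assume.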
Re: Convert from RDD[Object] to RDD[Array[Object]]
If you don't really care about the batchedDegree, but rather just want to do operations over some set of elements rather than one at a time, then just use mapPartitions().

Otherwise, if you really do want batches of a certain size and you are able to relax the constraints slightly, one option is to construct these batches within each partition. For instance:

    val batchedRDD = rdd.mapPartitions { iter: Iterator[Int] =>
      new Iterator[Array[Int]] {
        def hasNext: Boolean = iter.hasNext
        def next(): Array[Int] = {
          // Pull up to batchedDegree elements directly off the underlying
          // iterator. Reusing iter after iter.take(n) is unspecified in
          // Scala, so an explicit loop is the safe way to build each batch.
          val buf = scala.collection.mutable.ArrayBuffer[Int]()
          while (buf.size < batchedDegree && iter.hasNext) buf += iter.next()
          buf.toArray
        }
      }
    }

This approach is efficient in that it does not load the entire partition into memory, just enough to construct each batch. However, there will be one smaller batch at the end of each partition (rather than just one over the entire dataset).

On Sat, Jul 12, 2014 at 6:03 PM, Parthus wrote:
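Since the batching iterator above is plain Scala, its behavior can be sanity-checked locally without a cluster — a small illustrative test, not from the thread:

    // Wrap a single 210-element "partition" and inspect the batch sizes.
    val batchedDegree = 100
    val iter = (1 to 210).iterator
    val batches = new Iterator[Array[Int]] {
      def hasNext: Boolean = iter.hasNext
      def next(): Array[Int] = {
        val buf = scala.collection.mutable.ArrayBuffer[Int]()
        while (buf.size < batchedDegree && iter.hasNext) buf += iter.next()
        buf.toArray
      }
    }
    println(batches.map(_.length).toList)  // prints List(100, 100, 10)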
Convert from RDD[Object] to RDD[Array[Object]]
Hi there,

I have a bunch of data in an RDD, which I previously processed one element at a time. For example, there was an RDD denoted by "data: RDD[Object]" and I processed it using "data.map(...)". However, I got a new requirement to process the data in a batched way. That means I need to convert the RDD from RDD[Object] to RDD[Array[Object]] and then process it, i.e. fill out this function: def convert2array(inputs: RDD[Object], batchedDegree: Int): RDD[Array[Object]] = {...}.

I hope that after the conversion, each element of the new RDD is an array of elements of the previous RDD. The parameter "batchedDegree" specifies how many elements are batched together. For example, if I have 210 elements in the previous RDD and batchedDegree is 100, the result of the conversion function should be an RDD with 3 elements. Each element is an array: the first two arrays contain elements 1~100 and 101~200, and the third contains elements 201~210.

I was wondering if anybody could help me complete this Scala function in an efficient way. Thanks a lot.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Convert-from-RDD-Object-to-RDD-Array-Object-tp9530.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
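For reference, the exact global boundaries this example asks for (100/100/10 regardless of partitioning) won't fall out of per-partition batching unless every partition's size is a multiple of batchedDegree. One possible sketch — illustrative only, using Int in place of Object, and paying for a shuffle — keys each element by its global batch number:

    import org.apache.spark.rdd.RDD

    // Illustrative sketch, not from the thread. zipWithIndex assigns each
    // element a global position; grouping by position / batchedDegree yields
    // exact batches. groupByKey shuffles, and each batch (at most
    // batchedDegree elements) is materialized in memory, which is fine for
    // modest batch sizes.
    def convert2array(inputs: RDD[Int], batchedDegree: Int): RDD[Array[Int]] =
      inputs.zipWithIndex()
        .map { case (x, i) => (i / batchedDegree, (i, x)) }
        .groupByKey()
        .map { case (_, batch) =>
          batch.toSeq.sortBy(_._1).map(_._2).toArray  // restore order within the batch
        }

The batches themselves may come back in any order after the shuffle; a sortByKey() before the final map can be added if the order of the batches also matters.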