And if you can relax your constraints even further to only require RDD[List[Int]], then it's even simpler:
rdd.mapPartitions(_.grouped(batchedDegree))

On Sat, Jul 12, 2014 at 6:26 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> If you don't really care about the batchedDegree, but rather just want to
> do operations over some set of elements rather than one at a time, then
> just use mapPartitions().
>
> Otherwise, if you really do want batches of a certain size and you are
> able to relax the constraints slightly, one option is to construct these
> batches within each partition. For instance:
>
> val batchedRDD = rdd.mapPartitions { iter: Iterator[Int] =>
>   new Iterator[Array[Int]] {
>     def hasNext: Boolean = iter.hasNext
>     def next(): Array[Int] = {
>       // Drain up to batchedDegree elements from iter by hand; reusing
>       // iter after calling iter.take() on it is undefined behavior.
>       val batch = scala.collection.mutable.ArrayBuffer.empty[Int]
>       while (batch.size < batchedDegree && iter.hasNext) batch += iter.next()
>       batch.toArray
>     }
>   }
> }
>
> This function is efficient in that it does not load the entire partition
> into memory, just enough to construct each batch. However, there will be
> one smaller batch at the end of each partition (rather than just one over
> the entire dataset).
>
>
> On Sat, Jul 12, 2014 at 6:03 PM, Parthus <peng.wei....@gmail.com> wrote:
>
>> Hi there,
>>
>> I have a bunch of data in an RDD, which I previously processed one
>> element at a time. For example, there was an RDD denoted by "data:
>> RDD[Object]" and I processed it using "data.map(...)". However, I got a
>> new requirement to process the data in a batched way. This means that I
>> need to convert the RDD from RDD[Object] to RDD[Array[Object]] and then
>> process it, i.e. to fill in this function: def convert2array(inputs:
>> RDD[Object], batchedDegree: Int): RDD[Array[Object]] = {...}.
>>
>> I hope that after the conversion, each element of the new RDD is an
>> array of elements from the previous RDD. The parameter "batchedDegree"
>> specifies how many elements are batched together. For example, if I have
>> 210 elements in the previous RDD and batchedDegree is 100, the result of
>> the conversion function should be an RDD with 3 elements. Each element
>> is an array: the first two arrays contain elements 1~100 and 101~200,
>> and the third contains elements 201~210.
>>
>> I was wondering if anybody could help me complete this Scala function in
>> an efficient way. Thanks a lot.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Convert-from-RDD-Object-to-RDD-Array-Object-tp9530.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
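
For reference, the grouped one-liner from the top of the thread drops straight
into the convert2array signature the question asks for. Below is a minimal
sketch; the generic type parameter T, the ClassTag bound, and the SparkContext
named sc (as in spark-shell) are assumptions for illustration, not code taken
from the thread:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Generic version of the convert2array signature from the question.
    // Each partition is batched independently, so the final batch of each
    // partition may hold fewer than batchedDegree elements.
    def convert2array[T: ClassTag](inputs: RDD[T], batchedDegree: Int): RDD[Array[T]] =
      inputs.mapPartitions(_.grouped(batchedDegree).map(_.toArray))

    // Usage, assuming an existing SparkContext named sc:
    val rdd = sc.parallelize(1 to 210, numSlices = 1)
    val batched = convert2array(rdd, 100)
    batched.map(_.length).collect()  // Array(100, 100, 10) with one partition

With more than one partition, each partition contributes its own trailing
short batch, exactly as noted in Aaron's reply above.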