And if you can relax your constraints even further, to only require an
RDD[Seq[Int]], then it's even simpler:

rdd.mapPartitions(_.grouped(batchedDegree))

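For example, a minimal end-to-end sketch (the SparkContext sc, the sample data,
and the batchedDegree value of 100 are assumptions for illustration):

// Hypothetical input: 210 integers spread across 2 partitions.
val rdd = sc.parallelize(1 to 210, 2)
val batchedDegree = 100

// grouped() batches each partition lazily, so at most one batch is
// materialized at a time; the result is an RDD[Seq[Int]].
val batched = rdd.mapPartitions(_.grouped(batchedDegree))

// Each element is now one batch, e.g. compute a per-batch sum.
batched.map(_.sum).collect()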

On Sat, Jul 12, 2014 at 6:26 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> If you don't really care about the exact batchedDegree, but just want to
> do operations over a set of elements rather than one at a time, then just
> use mapPartitions().
>
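> For instance, a minimal sketch, assuming an RDD[Int] named rdd, where each
> partition is processed as one batch:
>
> rdd.mapPartitions { iter =>
>   // Materialize this partition's elements and process them together,
>   // emitting one result per partition.
>   val elems = iter.toArray
>   Iterator(elems.sum)
> }
>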
> Otherwise, if you really do want fixed-size batches and you are able to
> relax the constraints slightly, one option is to construct these batches
> within each partition. For instance:
>
> val batchedRDD = rdd.mapPartitions { iter: Iterator[Int] =>
>   new Iterator[Array[Int]] {
>     // There is another batch as long as the partition's iterator has elements left.
>     def hasNext: Boolean = iter.hasNext
>     // Each batch pulls up to batchedDegree elements off the underlying iterator.
>     def next(): Array[Int] = {
>       iter.take(batchedDegree).toArray
>     }
>   }
> }
>
> This function is efficient in that it does not load the entire partition
> into memory, just enough to construct each batch. However, there will be
> one smaller batch at the end of each partition (rather than just one over
> the entire dataset).
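>
> The resulting batchedRDD can then be operated on one batch at a time, for
> example (a made-up aggregation just to show the shape):
>
> batchedRDD.map(batch => batch.sum).collect()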
>
>
>
> On Sat, Jul 12, 2014 at 6:03 PM, Parthus <peng.wei....@gmail.com> wrote:
>
>> Hi there,
>>
>> I have a bunch of data in an RDD, which I previously processed one element
>> at a time. For example, there was an RDD denoted by "data: RDD[Object]",
>> and I processed it using "data.map(...)". However, I got a new requirement
>> to process the data in a batched way. That means I need to convert the RDD
>> from RDD[Object] to RDD[Array[Object]] and then process it, i.e. fill out
>> this function: def convert2array(inputs: RDD[Object], batchedDegree:
>> Int): RDD[Array[Object]] = {...}.
>>
>> I hope that after the conversion, each element of the new RDD is an array
>> of elements from the previous RDD. The parameter "batchedDegree" specifies
>> how many elements are batched together. For example, if the previous RDD
>> has 210 elements and batchedDegree is 100, the result of the conversion
>> function should be an RDD with 3 elements. Each element is an array: the
>> first two arrays contain elements 1~100 and 101~200, and the third contains
>> elements 201~210.
>>
>> I was wondering if anybody could help me complete this Scala function in
>> an efficient way. Thanks a lot.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Convert-from-RDD-Object-to-RDD-Array-Object-tp9530.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
