Re: Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Aaron Davidson
If you don't really care about the exact batchedDegree, but just want to
operate on groups of elements rather than one at a time, then simply use
mapPartitions().
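
For instance, a minimal sketch (the per-partition sum here is just a stand-in
for whatever whole-group operation you actually need):

val perPartitionResults = rdd.mapPartitions { iter =>
  // Operate on all elements of a partition at once; here we simply sum them.
  Iterator(iter.sum)
}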

Otherwise, if you really do want certain-sized batches and you are able to
relax the constraints slightly, one option is to construct these batches
within each partition. For instance:

val batchedRDD = rdd.mapPartitions { iter: Iterator[Int] =>
  new Iterator[Array[Int]] {
    def hasNext: Boolean = iter.hasNext
    def next(): Array[Int] = {
      // Each call materializes at most batchedDegree elements of the partition.
      iter.take(batchedDegree).toArray
    }
  }
}

This approach is efficient in that it never loads an entire partition into
memory; it materializes only enough elements to construct each batch. However,
there will be one smaller batch at the end of each partition (rather than just
one over the entire dataset).
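
For example, running the snippet above on 210 integers split across two
partitions makes those tail batches visible (a hypothetical local run; sc is
an existing SparkContext, and demo/demoBatches are illustrative names):

// 2 partitions of 105 elements each; with batches of 100, each partition
// yields one full batch plus one 5-element tail batch.
val demo = sc.parallelize(1 to 210, numSlices = 2)
val demoBatches = demo.mapPartitions { iter =>
  new Iterator[Array[Int]] {
    def hasNext: Boolean = iter.hasNext
    def next(): Array[Int] = iter.take(100).toArray
  }
}
demoBatches.map(_.length).collect()  // Array(100, 5, 100, 5)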



On Sat, Jul 12, 2014 at 6:03 PM, Parthus peng.wei@gmail.com wrote:

 Hi there,

 I have a bunch of data in an RDD, which I previously processed one element at
 a time. For example, there was an RDD denoted by data: RDD[Object], and I
 processed it using data.map(...). However, I now have a new requirement to
 process the data in a batched way. This means I need to convert the RDD from
 RDD[Object] to RDD[Array[Object]] and then process it, i.e., to fill out this
 function: def convert2array(inputs: RDD[Object], batchedDegree: Int):
 RDD[Array[Object]] = {...}.

 I hope that after the conversion, each element of the new RDD is an array of
 the previous RDD's elements. The parameter batchedDegree specifies how many
 elements are batched together. For example, with batchedDegree = 100 and 210
 elements in the previous RDD, the result of the conversion function should be
 an RDD with 3 elements, each of which is an array: the first two arrays
 contain elements 1~100 and 101~200, and the third contains elements 201~210.

 I was wondering if anybody could help me complete this Scala function in an
 efficient way. Thanks a lot.
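
 One possible sketch of such a function (an illustration, not from the
 original post; it assumes elements must keep their original order within
 each batch, and it pays for exact global batches with a shuffle):

 import scala.reflect.ClassTag
 import org.apache.spark.rdd.RDD
 // on Spark 1.x, also: import org.apache.spark.SparkContext._

 def convert2array[T: ClassTag](inputs: RDD[T],
                                batchedDegree: Int): RDD[Array[T]] = {
   inputs.zipWithIndex()                                  // (element, global index)
     .map { case (x, i) => (i / batchedDegree, (i, x)) }  // key by batch number
     .groupByKey()                                        // shuffle each batch together
     .map { case (_, group) =>
       group.toSeq.sortBy(_._1).map(_._2).toArray         // restore original order
     }
 }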



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Convert-from-RDD-Object-to-RDD-Array-Object-tp9530.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Mark Hamstra
And if you can relax your constraints even further to only require
RDD[List[Int]], then it's even simpler:

rdd.mapPartitions(_.grouped(batchedDegree))
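
And if you do still need arrays rather than lists, the same one-liner extends
naturally (a sketch along the same lines):

rdd.mapPartitions(_.grouped(batchedDegree).map(_.toArray))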

