Here is what I think:
mapPartitions is for a specialized map that is called only once for each
partition. The entire content of the respective partitions is available as a
sequential stream of values via the input argument (Iterarator[T]). The
combined result iterators are automatically
Map partitions works as follows :
1) For each partition of your RDD, it provides an iterator over the values
within that partition
2) You then define a function that operates on that iterator
Thus if you do the following:
val parallel = sc.parallelize(1 to 10, 3)
parallel.mapPartitions( x =
List(x.next).iterator is giving you the first element from each partition,
which would be 1, 4 and 7 respectively.
On 3/18/15, 10:19 AM, ashish.usoni ashish.us...@gmail.com wrote:
I am trying to understand about mapPartitions but i am still not sure how
it
works
in the below example it create
Unlike a map() wherein your task is acting on a row at a time, with
mapPartitions(), the task is passed the entire content of the partition in
an iterator. You can then return back another iterator as the output. I
don't do scala, but from what I understand from your code snippet... The
iterator