Unlike a map() wherein your task is acting on a row at a time, with
mapPartitions(), the task is passed the entire content of the partition in
an iterator. You can then return back another iterator as the output. I
don't do scala, but from what I understand from your code snippet... The
iterator x
List(x.next).iterator is giving you the first element from each partition,
which would be 1, 4 and 7 respectively.
On 3/18/15, 10:19 AM, "ashish.usoni" wrote:
>I am trying to understand about mapPartitions but i am still not sure how
>it
>works
>
>in the below example it create three partition
>
Map partitions works as follows :
1) For each partition of your RDD, it provides an iterator over the values
within that partition
2) You then define a function that operates on that iterator
Thus if you do the following:
val parallel = sc.parallelize(1 to 10, 3)
parallel.mapPartitions( x => x.m
Here is what I think:
mapPartitions is for a specialized map that is called only once for each
partition. The entire content of the respective partitions is available as a
sequential stream of values via the input argument (Iterarator[T]). The
combined result iterators are automatically converte