Doing cleanup in an iterator like that assumes the iterator always gets
fully read, which is not necessarily the case (RDD.take, for example, does not).
Instead I would use mapPartitionsWithContext, in which case you can write a
function of the form:
f: (TaskContext, Iterator[T]) => Iterator[U]
now
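A minimal sketch of that function shape. To keep the snippet runnable without a Spark installation, TaskContext is stubbed with a tiny stand-in class; the assumption is that the real Spark 1.x API offers `TaskContext.addOnCompleteCallback(f: () => Unit)`, which Spark invokes when the task finishes, whether or not the iterator was fully drained:

```scala
object CleanupSketch {
  // Stand-in for the one TaskContext method this sketch uses (hypothetical
  // stub; in a real job Spark hands you its own TaskContext).
  class FakeTaskContext {
    private var callbacks = List.empty[() => Unit]
    def addOnCompleteCallback(f: () => Unit): Unit = callbacks ::= f
    def complete(): Unit = callbacks.foreach(f => f()) // Spark calls this at task end
  }

  val events = scala.collection.mutable.ArrayBuffer.empty[String]

  // f: (TaskContext, Iterator[T]) => Iterator[U]
  def f(ctx: FakeTaskContext, partition: Iterator[Int]): Iterator[Int] = {
    events += "setup" // e.g. open a connection
    // Cleanup is registered, not placed after the map, so it runs at task
    // completion even if the iterator is only partially consumed.
    ctx.addOnCompleteCallback(() => events += "cleanup")
    partition.map(_ * 2)
  }

  def main(args: Array[String]): Unit = {
    val ctx = new FakeTaskContext
    val out = f(ctx, Iterator(1, 2, 3)).take(1).toList // partial read, like RDD.take
    ctx.complete()
    println(out)                  // List(2)
    println(events.mkString(",")) // setup,cleanup
  }
}
```

Even though only one element was pulled, the cleanup callback still fires when the task completes.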
On Thu, Jul 31, 2014 at 2:11 AM, Tobias Pfeiffer wrote:
> rdd.mapPartitions { partition =>
>   // Some setup code here
>   val result = partition.map(yourfunction)
>
>   // Some cleanup code here
>   result
> }
Yes, I realized that after I hit send. You definitely have to store
and return the result of the map.
Hi,
On Thu, Jul 31, 2014 at 2:23 AM, Sean Owen wrote:
>
> ... you can run setup code before mapping a bunch of records, and
> after, like so:
>
> rdd.mapPartitions { partition =>
>   // Some setup code here
>   partition.map(yourfunction)
>   // Some cleanup code here
> }
>
Please be careful: partition.map() is evaluated lazily, so as written the
cleanup code will run before any record has actually been processed.
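The pitfall can be demonstrated with plain Scala iterators, no Spark needed (a hedged sketch; the `events` log is only there to make the evaluation order visible):

```scala
object LazyMapPitfall {
  val events = scala.collection.mutable.ArrayBuffer.empty[String]

  def process(partition: Iterator[Int]): Iterator[Int] = {
    events += "setup"
    // Iterator.map is lazy: nothing runs until the result is consumed.
    val mapped = partition.map { x => events += s"map($x)"; x * 2 }
    events += "cleanup" // logged immediately, before any element is mapped!
    mapped
  }

  def main(args: Array[String]): Unit = {
    events.clear()
    val result = process(Iterator(1, 2)).toList
    println(result)                // List(2, 4)
    println(events.mkString(",")) // setup,cleanup,map(1),map(2)
  }
}
```

The cleanup line is reached before a single record is mapped, so closing a connection there would break the processing that follows.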
Thanks to both of you for your input. Looks like I'll play with the
mapPartitions function to start porting MapReduce algorithms to Spark.
On Wed, Jul 30, 2014 at 1:23 PM, Sean Owen wrote:
> Really, the analog of a Mapper is not map(), but mapPartitions(). Instead
> of:
>
> rdd.map(yourFunction)
Really, the analog of a Mapper is not map(), but mapPartitions(). Instead of:
rdd.map(yourFunction)
... you can run setup code before mapping a bunch of records, and
after, like so:
rdd.mapPartitions { partition =>
  // Some setup code here
  partition.map(yourFunction)
  // Some cleanup code here
}
Use mapPartitions to get the equivalent of a Hadoop Mapper's setup/cleanup functionality.
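Putting the thread's conclusions together, here is one hedged sketch of a Mapper-like setup/map/cleanup pattern that keeps the ordering correct by materializing the partition before cleanup. The trade-off, as noted above, is that this only works when a partition fits in memory (and, per the first reply, callback-based cleanup is the safer option when the iterator may not be fully read):

```scala
object MapperStylePartition {
  // Shaped like the body you would pass to rdd.mapPartitions.
  def processPartition[T, U](partition: Iterator[T])(f: T => U): Iterator[U] = {
    // Setup code here (e.g. open a connection).
    // Force evaluation so every record is processed before cleanup runs;
    // assumes the partition fits in memory.
    val result = partition.map(f).toList
    // Cleanup code here (e.g. close the connection).
    result.iterator
  }

  def main(args: Array[String]): Unit = {
    println(processPartition(Iterator(1, 2, 3))(_ + 1).toList) // List(2, 3, 4)
  }
}
```

In a real job this body would be passed to `rdd.mapPartitions(processPartition(_)(yourFunction))`.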
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-tp10968p10969.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.