Re: Keep state inside map function

2014-10-28 Thread Koert Kuipers
Doing cleanup in an iterator like that assumes the iterator always gets fully read, which is not necessarily the case (for example, RDD.take does not). Instead I would use mapPartitionsWithContext, in which case you can write a function of the form f: (TaskContext, Iterator[T]) => Iterator[U]. Now
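
For reference, a minimal sketch of the same idea built on the public TaskContext API rather than mapPartitionsWithContext (which was a developer API and was later deprecated); it assumes a Spark version where TaskContext.get and the function form of addTaskCompletionListener are available, and the Connection class is a made-up stand-in for whatever resource needs closing.

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

// Hypothetical per-partition resource, for illustration only.
class Connection {
  def send(s: String): Unit = println(s)
  def close(): Unit = println("closed")
}

def mapWithCleanup(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { partition =>
    val conn = new Connection()                    // setup, once per partition
    // Register cleanup with the task context so it runs when the task
    // finishes, even if the iterator is never fully consumed (e.g. RDD.take).
    TaskContext.get().addTaskCompletionListener { (_: TaskContext) =>
      conn.close()
    }
    partition.map { record =>
      conn.send(record)
      record
    }
  }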

Re: Keep state inside map function

2014-07-31 Thread Sean Owen
On Thu, Jul 31, 2014 at 2:11 AM, Tobias Pfeiffer wrote:
> rdd.mapPartitions { partition =>
>   // Some setup code here
>   val result = partition.map(yourfunction)
>
>   // Some cleanup code here
>   result
> }

Yes, I realized that after I hit send. You definitely have to store and return the
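
For reference, a cleaned-up sketch of the store-and-return variant being quoted; openResource and closeResource are made-up placeholders for real setup and cleanup. Note that map on a Scala Iterator is lazy, so cleanup placed inline like this still runs before any record is actually processed, which is why the task-completion-listener approach in the reply above is more robust.

import org.apache.spark.rdd.RDD

// Hypothetical setup/cleanup helpers, for illustration only.
def openResource(): Unit = println("setup")
def closeResource(): Unit = println("cleanup")

def doubled(rdd: RDD[Int]): RDD[Int] =
  rdd.mapPartitions { partition =>
    openResource()                      // setup code here
    val result = partition.map(_ * 2)   // builds a *lazy* mapped iterator
    closeResource()                     // cleanup code here -- note this runs
                                        // before any record is processed
    result                              // the mapped iterator must be returned
  }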

Re: Keep state inside map function

2014-07-30 Thread Tobias Pfeiffer
Hi,

On Thu, Jul 31, 2014 at 2:23 AM, Sean Owen wrote:
> ... you can run setup code before mapping a bunch of records, and
> after, like so:
>
> rdd.mapPartitions { partition =>
>   // Some setup code here
>   partition.map(yourfunction)
>   // Some cleanup code here
> }

Please be careful

Re: Keep state inside map function

2014-07-30 Thread Kevin
Thanks to both of you for your inputs. Looks like I'll play with the mapPartitions function to start porting MapReduce algorithms to Spark.

On Wed, Jul 30, 2014 at 1:23 PM, Sean Owen wrote:
> Really, the analog of a Mapper is not map(), but mapPartitions(). Instead
> of:
>
> rdd.map(yourFun

Re: Keep state inside map function

2014-07-30 Thread Sean Owen
Really, the analog of a Mapper is not map(), but mapPartitions(). Instead of:

rdd.map(yourFunction)

... you can run setup code before mapping a bunch of records, and after, like so:

rdd.mapPartitions { partition =>
  // Some setup code here
  partition.map(yourfunction)
  // Some cleanup code
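
To make the Mapper analogy concrete, here is a minimal sketch contrasting per-record and per-partition setup; the Parser class is a made-up example of the kind of state a Hadoop Mapper would normally build once in setup().

import org.apache.spark.rdd.RDD

// Hypothetical, expensive-to-construct helper, for illustration only.
class Parser {
  def parse(line: String): Array[String] = line.split(",")
}

// map: a new Parser is built for every single record.
def perRecord(rdd: RDD[String]): RDD[Array[String]] =
  rdd.map(line => new Parser().parse(line))

// mapPartitions: one Parser per partition, reused for all of its records.
def perPartition(rdd: RDD[String]): RDD[Array[String]] =
  rdd.mapPartitions { partition =>
    val parser = new Parser()       // setup code here, once per partition
    partition.map(parser.parse)     // reuse it across the whole partition
  }

Whether a matching cleanup step is also needed, and how to make it run reliably, is exactly what the replies above discuss.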

Re: Keep state inside map function

2014-07-30 Thread aaronjosephs
Use mapPartitions to get the equivalent functionality to Hadoop.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-tp10968p10969.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.