Re: preservesPartitioning

Matei Zaharia Thu, 17 Jul 2014 00:15:14 -0700

Hi Kamal,

This is not what preservesPartitioning does -- actually what it means is that 
if the RDD has a Partitioner set (which means it's an RDD of key-value pairs 
and the keys are grouped into a known way, e.g. hashed or range-partitioned), 
your map function is not changing the partition of keys. This lets the job 
scheduler know that downstream operations, like joins or reduceByKey, can be 
optimized assuming that all the data for a given partition is located on the 
same machine. In both cases though, your function f operates on each partition.

Just in case it's not clear, each RDD is composed of multiple blocks that are 
called the partitions. Each partition may be located on a different machine. 
mapPartitions is a way for you to operate on a whole partition at once, which 
is useful if you want to amortize a certain cost across the elements (e.g. you 
open a database connection and test each of them against the database). If you 
just want to see each element once and don't care about sharing stuff across 
them, use map().

Matei

On Jul 17, 2014, at 12:02 AM, Kamal Banga <banga.ka...@gmail.com> wrote:

> Hi All,
> 
> The function mapPartitions in RDD.scala takes a boolean parameter 
> preservesPartitioning. It seems if that parameter is passed as false, the 
> passed function f will operate on the data only once, whereas if it's passed 
> as true the function will operate on each partition of the data. 
> 
> In my case, whatever boolean value I pass, f operates on each partition of 
> data. 
> 
> Any help, regarding why I am getting this unexpected behaviour?

Re: preservesPartitioning

Reply via email to