Hi Kamal, This is not what preservesPartitioning does -- actually what it means is that if the RDD has a Partitioner set (which means it's an RDD of key-value pairs and the keys are grouped into a known way, e.g. hashed or range-partitioned), your map function is not changing the partition of keys. This lets the job scheduler know that downstream operations, like joins or reduceByKey, can be optimized assuming that all the data for a given partition is located on the same machine. In both cases though, your function f operates on each partition.
Just in case it's not clear, each RDD is composed of multiple blocks that are called the partitions. Each partition may be located on a different machine. mapPartitions is a way for you to operate on a whole partition at once, which is useful if you want to amortize a certain cost across the elements (e.g. you open a database connection and test each of them against the database). If you just want to see each element once and don't care about sharing stuff across them, use map(). Matei On Jul 17, 2014, at 12:02 AM, Kamal Banga <banga.ka...@gmail.com> wrote: > Hi All, > > The function mapPartitions in RDD.scala takes a boolean parameter > preservesPartitioning. It seems if that parameter is passed as false, the > passed function f will operate on the data only once, whereas if it's passed > as true the function will operate on each partition of the data. > > In my case, whatever boolean value I pass, f operates on each partition of > data. > > Any help, regarding why I am getting this unexpected behaviour?