Using a map() and a mapPartitions() on the same DF at the same time doesn't make much sense to me. Also, without complete information, I am assuming that your partitioning strategy is defined or influenced by the map operation. In that case you can create a hashmap holding the map value for each partition, broadcast the hashmap, run mapPartitionsWithIndex, and in each partition retrieve the required value from the map by partition index.
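
A rough sketch of that idea, written in plain Python so it runs without a Spark cluster. The partitions are modelled as lists, and the names partition_values and process_partition, along with the numbers, are made up for illustration. In Spark the dict would be built on the driver and shipped with sc.broadcast(...), and the function would be passed to RDD.mapPartitionsWithIndex, which calls it with the partition index and an iterator over that partition's rows.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical per-partition values, keyed by partition index.
# In Spark this dict would be wrapped in a broadcast variable.
partition_values: Dict[int, int] = {0: 100, 1: 200}


def process_partition(index: int, rows: Iterable[int]) -> Iterator[int]:
    """Same shape as a function passed to mapPartitionsWithIndex."""
    # Look up the value that belongs to this partition only.
    # With a broadcast variable this would read from its .value instead.
    local_value = partition_values[index]
    for row in rows:
        yield row + local_value


# Stand-in for an RDD with two partitions.
partitions: List[List[int]] = [[1, 2], [3]]
result = [list(process_partition(i, p)) for i, p in enumerate(partitions)]
```

Because the lookup is keyed by partition index, two partitions on the same executor can share one broadcast map and still read different values.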
Rohit

On 18-Nov-2016 2:27 AM, Zsolt Tóth <toth.zsolt....@gmail.com> wrote:

Any comment on this one?

On 16 Nov 2016 at 12:59 PM, "Zsolt Tóth" <toth.zsolt....@gmail.com> wrote:

Hi,

I need to run a map() and a mapPartitions() on my input DF. As a side effect of the map(), a partition-local variable should be updated, and that variable is then used in the mapPartitions(). I can't use a broadcast variable, because it is shared between partitions on the same executor. Where can I define this variable?

I could run a single mapPartitions() that defines the variable, iterates over the input (just as the map() would), collects the result into an ArrayList, and then uses the list's iterator (and the updated partition-local variable) as the input of the transformation that the original mapPartitions() performed. It feels, however, that this is less efficient than running map() + mapPartitions(), because I need to keep the ArrayList (which is basically all the data in the partition) in memory.

Thanks,
Zsolt