Using a map() and a mapPartitions() on the same DF at the same time doesn't 
make much sense to me.
Also, without complete information, I am assuming that you have some partition 
strategy that is defined or influenced by the map operation. In that case you 
can build a hashmap of the map values for each partition (keyed by partition 
index), broadcast the hashmap, and then use mapPartitionsWithIndex so that each 
partition retrieves its own value from the map.
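
A minimal sketch of that two-pass idea, written in plain Java so it runs without a cluster. The partitions are simulated as a list of lists, the "broadcast" is just a shared read-only map, and the per-partition value (the partition maximum) is a hypothetical stand-in for whatever the map() side effect would compute. The class and method names are illustrative and not part of Spark. In a real job, the second pass would presumably be the function handed to mapPartitionsWithIndex on the RDD, with the map shipped via a broadcast variable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerPartitionLookup {
    // Pass 1: derive one value per partition, keyed by partition index.
    // Assumption: the value of interest is the partition maximum.
    static Map<Integer, Integer> computePartitionValues(List<List<Integer>> partitions) {
        Map<Integer, Integer> values = new HashMap<>();
        for (int index = 0; index < partitions.size(); index++) {
            int max = Integer.MIN_VALUE;
            for (int x : partitions.get(index)) {
                max = Math.max(max, x);
            }
            values.put(index, max);
        }
        return values;
    }

    // Pass 2: the work done for one partition. It looks up the value that
    // belongs to its own index, so partitions on the same executor do not
    // interfere with each other even though the map itself is shared.
    static List<Integer> transformPartition(int index, List<Integer> rows,
                                            Map<Integer, Integer> lookup) {
        int partitionValue = lookup.get(index);
        List<Integer> out = new ArrayList<>();
        for (int x : rows) {
            out.add(partitionValue - x); // hypothetical transformation
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 5, 3),
                Arrays.asList(10, 2));
        Map<Integer, Integer> lookup = computePartitionValues(partitions);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println(i + " -> " + transformPartition(i, partitions.get(i), lookup));
        }
    }
}
```

The trade-off is that the input is traversed twice, so it only pays off if the per-partition values are small and the input is cached or cheap to recompute.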

Rohit

On 18-Nov-2016 2:27 AM, Zsolt Tóth <toth.zsolt....@gmail.com> wrote:

Any comment on this one?

On 16 Nov 2016, 12:59 PM, "Zsolt Tóth" <toth.zsolt....@gmail.com> wrote:
Hi,

I need to run a map() and a mapPartitions() on my input DF. As a side effect of 
the map(), a partition-local variable should be updated, which is then used in 
the mapPartitions() afterwards.
I can't use a broadcast variable, because it is shared between partitions on 
the same executor.

Where can I define this variable?
I could run a single mapPartitions() that defines the variable, iterates over 
the input (just as the map() would do), collects the results into an ArrayList, 
and then uses the list's iterator (and the updated partition-local variable) as 
the input of the transformation that the original mapPartitions() did.

However, it feels like this is less efficient than running 
map()+mapPartitions(), because I need to hold the ArrayList (which is basically 
the whole data of the partition) in memory.
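
The single-mapPartitions() approach described above could look roughly like the following sketch. It is plain Java with no Spark dependency: processPartition has the iterator-in, iterator-out shape that a mapPartitions() function body would have, and the doubling step, the running sum, and the final subtraction are hypothetical placeholders for the real map(), partition-local variable, and second transformation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SinglePassPartition {
    // Intended to be called once per partition, so every local variable
    // here is partition-local by construction.
    static Iterator<Integer> processPartition(Iterator<Integer> input) {
        int runningSum = 0;                       // the partition-local variable
        List<Integer> mapped = new ArrayList<>(); // buffers the whole partition

        // Stage 1: what map() would have done, plus the side effect.
        while (input.hasNext()) {
            int doubled = input.next() * 2;       // hypothetical map() logic
            runningSum += doubled;
            mapped.add(doubled);
        }

        // Stage 2: what the original mapPartitions() would have done, now
        // that the variable has its final value. The stream is evaluated
        // lazily, so no second full copy of the partition is built.
        final int total = runningSum;
        return mapped.stream().map(m -> total - m).iterator();
    }

    public static void main(String[] args) {
        Iterator<Integer> out = processPartition(Arrays.asList(1, 2, 3).iterator());
        while (out.hasNext()) {
            System.out.println(out.next());
        }
    }
}
```

The buffer is the cost mentioned above: if the second stage needs the final value of the variable, the first stage has to finish over the whole partition before any output can be produced, so the choice is between buffering the partition once or reading the input twice.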

Thanks,
Zsolt
