foreach absolutely runs on the executors, not the driver. For sending data to an 
external system you should likely use foreachPartition, so you can open one 
connection per partition and batch the output. And if you want to limit the 
parallelism of the output action, you can call coalesce first.
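A minimal sketch of that batching pattern, written against a plain Iterator so it needs no Spark dependency (in a real job this would be the body passed to rdd.coalesce(n).foreachPartition { records => ... }; sendBatch and the record names are hypothetical stand-ins for your actual client API):

```scala
// Sketch: batch records inside each partition before sending them out.
def sendPartition(records: Iterator[String], batchSize: Int = 100): Int = {
  var batchesSent = 0
  // Iterator.grouped yields batches lazily, so the whole partition
  // never has to fit in memory at once.
  records.grouped(batchSize).foreach { batch =>
    // sendBatch(batch)  // hypothetical external-system call goes here
    batchesSent += 1
  }
  batchesSent
}

// 250 records in batches of 100 -> 3 batches
val demo = (1 to 250).map(i => s"record-$i").iterator
println(sendPartition(demo))
```

Opening the connection once per partition (rather than per record) is the main point of foreachPartition; the grouped call on top of that keeps each individual write bounded.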

What makes you think foreach is running on the driver?

From: Alexandre Rodrigues
Date: Thursday, July 2, 2015 at 12:32 PM
To: "user@spark.apache.org"
Subject: Fwd: map vs foreach for sending data to external system

Hi Spark devs,

I'm coding a spark job and at a certain point in execution I need to send some 
data present in an RDD to an external system.

val myRdd = ....

myRdd.foreach { record =>
  sendToWhtv(record)
}

The thing is that foreach forces materialization of the RDD and it seems to be 
executed on the driver program, which is not very beneficial in my case. So I 
changed the logic to a map (or mapPartitions, but it's the same).

val newRdd = myRdd.map { record =>
  sendToWhtv(record)
}
newRdd.count()

My understanding is that map is a transformation, so I then have to force 
materialization by invoking some action (such as count). Is this the correct 
way to do this kind of distributed foreach, or is there some other function 
that achieves it without implying a data transformation or a returned RDD?


Thanks,
Alex
