foreach absolutely runs on the executors. For sending data to an external system, you should likely use foreachPartition so that you can batch the output and amortize connection setup per partition. Also, if you want to limit the parallelism of the output action, you can use coalesce.
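A minimal sketch of that pattern, assuming hypothetical createConnection and sendBatch helpers standing in for whatever client your external system provides:

```scala
myRdd
  .coalesce(10)                      // optional: cap the number of output tasks
  .foreachPartition { records =>
    // Runs on the executor: open one connection per partition,
    // not one per record (and never on the driver).
    val conn = createConnection()    // hypothetical helper
    try {
      records.grouped(100).foreach { batch =>
        sendBatch(conn, batch)       // hypothetical helper: send records in batches
      }
    } finally {
      conn.close()
    }
  }
```

foreachPartition gives you the partition's records as an iterator, so the connection and batching logic run once per partition rather than once per record.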
What makes you think foreach is running on the driver?

From: Alexandre Rodrigues
Date: Thursday, July 2, 2015 at 12:32 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Fwd: map vs foreach for sending data to external system

Hi Spark devs,

I'm coding a Spark job, and at a certain point in execution I need to send some data present in an RDD to an external system.

    val myRdd = ....

    myRdd.foreach { record =>
      sendToWhtv(record)
    }

The thing is that foreach forces materialization of the RDD, and it seems to be executed on the driver program, which is not very beneficial in my case. So I changed the logic to a map (mapPartitions, but it's the same):

    val newRdd = myRdd.map { record =>
      sendToWhtv(record)
    }
    newRdd.count()

My understanding is that map is a transformation operation, so I then have to force materialization by invoking some action (such as count). Is this the correct way to do this kind of distributed foreach, or is there another function that achieves it without implying a data transformation or a returned RDD?

Thanks,
Alex