foreach is listed as an action [1]. I guess an *action* just means an operation that forces materialization of the RDD.
I just noticed much faster executions with map, although I don't like the map approach. I'll look at it with new eyes if foreach is the way to go.

[1] https://spark.apache.org/docs/latest/programming-guide.html#actions

Thanks guys!

--
Alexandre Rodrigues

On Thu, Jul 2, 2015 at 5:37 PM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:

> *"The thing is that foreach forces materialization of the RDD and it seems
> to be executed on the driver program"*
> What makes you think that? No, foreach is run in the executors
> (distributed) and not in the driver.
>
> 2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues <
> alex.jose.rodrig...@gmail.com>:
>
>> Hi Spark devs,
>>
>> I'm coding a Spark job and at a certain point in execution I need to send
>> some data present in an RDD to an external system.
>>
>> val myRdd = ....
>>
>> myRdd.foreach { record =>
>>   sendToWhtv(record)
>> }
>>
>> The thing is that foreach forces materialization of the RDD and it seems
>> to be executed on the driver program, which is not very beneficial in my
>> case. So I changed the logic to a map (mapPartitions, but it's the
>> same).
>>
>> val newRdd = myRdd.map { record =>
>>   sendToWhtv(record)
>> }
>> newRdd.count()
>>
>> My understanding is that map is a transformation operation and then I
>> have to force materialization by invoking some action (such as count). Is
>> this the correct way to do this kind of distributed foreach, or is there any
>> other function to achieve this that doesn't necessarily imply a data
>> transformation or a returned RDD?
>>
>> Thanks,
>> Alex
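
For anyone reading this thread later, here is a minimal sketch of a distributed side effect that needs neither map nor a throwaway count(). It uses foreachPartition, which like foreach is an action that runs on the executors, and opens one client per partition instead of one per record. WhtvClient is a hypothetical stand-in (the real sendToWhtv is not shown in the thread), local[2] is only there so the sketch runs on a single machine, and longAccumulator assumes Spark 2.0 or later (the 2015-era API was sc.accumulator).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for the external system's client; not part of Spark
// and not the sendToWhtv from the thread, whose implementation is unknown.
final class WhtvClient {
  def send(record: String): Unit = println(s"sent $record")
  def close(): Unit = ()
}

object DistributedForeachSketch {

  // Sends every record from the executors and returns how many were sent.
  def sendAll(sc: SparkContext, records: Seq[String]): Long = {
    val myRdd = sc.parallelize(records, numSlices = 2)

    // Accumulator (Spark 2.0+) so the driver can see how many records
    // the executors actually processed.
    val sent = sc.longAccumulator("records sent")

    // foreachPartition is an action: it triggers the job and the function
    // below runs once per partition on the executors, not on the driver.
    myRdd.foreachPartition { partition =>
      val client = new WhtvClient // created on the executor, so never serialized
      try {
        partition.foreach { record =>
          client.send(record)
          sent.add(1L)
        }
      } finally {
        client.close()
      }
    }

    sent.value.longValue()
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for a single-machine run.
    val conf = new SparkConf().setAppName("distributed-foreach-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val count = sendAll(sc, Seq("a", "b", "c", "d"))
      println(s"records sent: $count")
    } finally {
      sc.stop()
    }
  }
}
```

In local mode the println output shows up in the driver console only because the executors share the driver's JVM; on a real cluster it would go to the executor logs. Creating the client inside the partition function is the main design choice here: it avoids serializing a connection object into the closure and amortizes the setup cost over all records in the partition.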