Re: map vs foreach for sending data to external system

Alexandre Rodrigues Thu, 02 Jul 2015 10:00:47 -0700

foreach is listed as an action [1]. I guess an *action* just means that it
forces materialization of the RDD.


I just noticed much faster executions with map, although I don't like the
map approach. I'll look at it with new eyes if foreach is the way to go.
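
For reference, a minimal sketch of what the foreach route could look like.
It uses foreachPartition, which is also an action and runs on the executors,
so a client can be opened once per partition instead of once per record.
WhtvClient is a hypothetical stand-in for whatever client the external system
needs, and the local[2] master and the sample data are assumptions made only
so the example runs on its own.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForeachPartitionSketch {
  // Hypothetical stand-in for the external system's client (not a real API).
  class WhtvClient {
    def send(record: String): Unit = println(s"sent $record")
    def close(): Unit = ()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads, for illustration only.
    val conf = new SparkConf().setAppName("foreach-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: sample data standing in for the real RDD.
      val myRdd = sc.parallelize(1 to 100).map(_.toString)

      // foreachPartition is an action: it triggers the job, runs the closure
      // on the executors, and returns Unit, so no extra RDD or count() is needed.
      myRdd.foreachPartition { records =>
        val client = new WhtvClient // created on the executor, once per partition
        try records.foreach(client.send)
        finally client.close()
      }
    } finally {
      sc.stop()
    }
  }
}
```

Creating the client inside the closure keeps it out of the serialized task,
which matters when the real client is not serializable.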

[1] – https://spark.apache.org/docs/latest/programming-guide.html#actions

Thanks guys!

--
Alexandre Rodrigues

On Thu, Jul 2, 2015 at 5:37 PM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:

>
>
> *"The thing is that foreach forces materialization of the RDD and it seems
> to be executed on the driver program"*
> What makes you think that? No, foreach is run in the executors
> (distributed) and not in the driver.
>
> 2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues <
> alex.jose.rodrig...@gmail.com>:
>
>> Hi Spark devs,
>>
>> I'm coding a spark job and at a certain point in execution I need to send
>> some data present in an RDD to an external system.
>>
>> val myRdd = ....
>>
>> myRdd.foreach { record =>
>>   sendToWhtv(record)
>> }
>>
>> The thing is that foreach forces materialization of the RDD and it seems
>> to be executed on the driver program, which is not very beneficial in my
>> case. So I changed the logic to a map (mapPartitions, but it's the
>> same).
>>
>> val newRdd = myRdd.map { record =>
>>   sendToWhtv(record)
>> }
>> newRdd.count()
>>
>> My understanding is that map is a transformation operation and then I
>> have to force materialization by invoking some action (such as count). Is
>> this the correct way to do this kind of distributed foreach or is there any
>> other function to achieve this that doesn't necessarily imply a data
>> transformation or a returned RDD?
>>
>>
>> Thanks,
>> Alex
>>
>>
>
