I noticed that toJavaRDD causes a computation on the DataFrame, so is it
considered an action, even though logically it's a transformation?
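
One way to tell the two apart is to count how often the mapping function runs
before and after the final step. The sketch below is only an analogy in plain
java.util.stream, not Spark code: map plays the role of a lazy transformation
and reduce the role of an action. The class name LazyVsEager and the sample
values are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Hypothetical demo class: an analogy for lazy vs. eager steps, not Spark itself.
class LazyVsEager {

    /** Map-function call counts before and after the terminal step, plus the result. */
    static final class Result {
        final int callsBeforeAction;
        final int callsAfterAction;
        final int sum;

        Result(int callsBeforeAction, int callsAfterAction, int sum) {
            this.callsBeforeAction = callsBeforeAction;
            this.callsAfterAction = callsAfterAction;
            this.sum = sum;
        }
    }

    static Result run() {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4); // made-up sample data
        AtomicInteger calls = new AtomicInteger();

        // "Transformation": only describes the work; the lambda has not run yet.
        Stream<Integer> mapped = rows.stream().map(x -> {
            calls.incrementAndGet();
            return x * 10;
        });
        int before = calls.get();

        // "Action": the terminal reduce pulls every element through the map.
        int sum = mapped.reduce(0, Integer::sum);
        int after = calls.get();

        return new Result(before, after, sum);
    }

    public static void main(String[] args) {
        Result r = run();
        System.out.println("map calls before reduce: " + r.callsBeforeAction);
        System.out.println("map calls after reduce:  " + r.callsAfterAction);
        System.out.println("sum: " + r.sum);
    }
}
```

If a step were eager, the first counter would already be non-zero when that
step returns. The same kind of check may help decide whether toJavaRDD()
itself runs anything or whether the work only starts at reduce().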
On Nov 4, 2015 6:51 PM, "Aliaksei Tsyvunchyk" <atsyvunc...@exadel.com>
wrote:

> Hello folks,
>
> Recently I noticed unexpectedly heavy network traffic between the driver
> program and the worker node.
> While debugging, I found that it is caused by the following block
> of code:
>
> ---------- Java ----------
> DataFrame etpvRecords = context.sql(" SOME SQL query here");
> Mapper m = new Mapper(localValue, ProgramId::toProgId);
> return etpvRecords
>                 .toJavaRDD()
>                 .map(m::mapHutPutViewingRow)
>                 .reduce(Reducer::reduce);
> ---------- Java ----------
>
> I'm using a debugger breakpoint and OS X nettop to monitor traffic between
> the processes. Before reaching the toJavaRDD() line I see 500 Kb of traffic,
> and after executing that line I see 2.2 Mb. But when I check the
> size of the result of the reduce function, it is 10 Kb.
> So .toJavaRDD() seems to make the worker process return the dataset to the
> driver process, and the subsequent map/reduce seems to run on the driver.
>
> I definitely did not expect this, so I have two questions:
> 1.  Is it really expected behavior that DataFrame.toJavaRDD causes the whole
> dataset to be returned to the driver, or am I doing something wrong?
> 2.  What is the expected way to transform a DataFrame using
> custom Java map/reduce functions when the standard SQL features do not
> cover all my needs?
>
> Env: Spark 1.5.1 in standalone mode (1 master, 1 worker sharing the same
> machine). Java 1.8.0_60.
