Thanks for your feedback. Yes, I am aware of the stage design, and Silvio,
what you are describing is essentially a map-side join, which is not
applicable when both RDDs are quite large.
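(For reference, the map-side join idea can be sketched without Spark at all.
This is a minimal, illustrative plain-Java version, not Spark code: the small
side is held in memory as a map, the way a broadcast variable would make it
available to each task, and the large side is streamed past it with no
shuffle. All class and method names here are made up for the sketch.)

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastJoinSketch {
    // Map-side join: the small side fits in memory, so each task can hold
    // a full copy of it (in Spark this would be a broadcast variable) and
    // join its partition of the large side locally, without a shuffle.
    static List<String> joinLocally(Map<Integer, String> smallSide,
                                    List<Map.Entry<Integer, String>> largeSidePartition) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> rec : largeSidePartition) {
            String match = smallSide.get(rec.getKey());
            if (match != null) { // inner-join semantics: keep only matching keys
                out.add(rec.getKey() + ":" + rec.getValue() + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> small = new HashMap<>();
        small.put(1, "a");
        small.put(2, "b");
        List<Map.Entry<Integer, String>> large = Arrays.asList(
            new AbstractMap.SimpleEntry<>(1, "x"),
            new AbstractMap.SimpleEntry<>(3, "y"));
        // Only key 1 appears on both sides, so only it survives the join.
        System.out.println(joinLocally(small, large));
    }
}
```

This only works when the small side genuinely fits in each executor's memory;
with two large RDDs, as in your case, there is no way around the shuffle.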

It appears that in

rdd.join(...).mapToPair(f)

f is piggybacked inside the join stage (right in the reducers, I believe),

whereas in

rdd.join(...).mapPartitionsToPair(f)

f is executed in a different stage. This is surprising because, at least
intuitively, the difference between mapToPair and mapPartitionsToPair is
that the former follows a push model whereas the latter pulls records out
of the iterator (*I suspect there are other technical reasons*). If anyone
knows the depths of the problem, that would be of great help.
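(To make the push-vs-pull distinction concrete, here is an illustrative
plain-Java sketch of the two calling conventions, with no Spark dependency
and made-up names: the per-record style has the framework push each record
into f, while the per-partition style hands f the whole iterator so f
decides when to pull.)

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class MapVsMapPartitionsSketch {
    // Per-record style (like mapToPair): the framework drives the loop
    // and pushes each record into f, one at a time.
    static <A, B> List<B> mapPerRecord(List<A> partition, Function<A, B> f) {
        List<B> out = new ArrayList<>();
        for (A a : partition) {
            out.add(f.apply(a));
        }
        return out;
    }

    // Per-partition style (like mapPartitionsToPair): f receives the
    // partition's iterator and pulls records itself, so it can keep
    // per-partition state (buffers, connections, ...) across records.
    static <A, B> List<B> mapPerPartition(List<A> partition,
                                          Function<Iterator<A>, Iterator<B>> f) {
        List<B> out = new ArrayList<>();
        Iterator<B> it = f.apply(partition.iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

For a pure element-wise f the two produce the same results; the difference
is only in who controls the iteration.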

Best,
Marius

On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito <silvio.fior...@granturing.com>
wrote:

>   One thing you could do is a broadcast join. You take your smaller RDD
> and save it as a broadcast variable. Then run a map operation to perform
> the join and whatever else you need to do. This removes the shuffle stage,
> but you still have to collect the smaller RDD and broadcast it. Whether
> it's worth it depends on the size of your data.
>
>   From: Marius Danciu
> Date: Friday, July 3, 2015 at 3:13 AM
> To: user
> Subject: Optimizations
>
>   Hi all,
>
>  If I have something like:
>
>  rdd.join(...).mapPartitionToPair(...)
>
>  It looks like mapPartitionsToPair runs in a different stage than join.
> Is there a way to piggyback this computation inside the join stage? ...
> such that each resulting partition after the join is passed to
> the mapPartitionsToPair function, all running in the same stage without any
> other costs.
>
>  Best,
> Marius
>
