Hi,

You need to run an action (e.g. count, take, saveAs*) for your RDDs to
actually be cached, since cache/persist are lazy. You might also want to
use coalesce instead of repartition to avoid shuffling.

Thanks,
Deng

On Mon, Nov 2, 2015 at 5:53 PM, Sushrut Ikhar <sushrutikha...@gmail.com>
wrote:

> Hi,
> I need to split an RDD into 3 different RDDs using the filter
> transformation.
> I have cached the original RDD before using filter.
> The input is lopsided, leaving some executors with heavy load and others
> with less, so I have repartitioned it.
>
> *DAG-lineage I expected:*
>
> I/P RDD  -->  MAP RDD --> SHUFFLE RDD (repartition) -->
>
> *MAP RDD (cache)* --> FILTER RDD1 --> MAP1 --> UNION RDD --> O/P RDD
>                                --> FILTER RDD2 --> MAP2
>                                --> FILTER RDD3 --> MAP3
>
> *DAG-lineage I observed:*
>
> I/P RDD  -->  MAP RDD -->
>
> SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD1 --> MAP1
> SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD2 --> MAP2
> SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD3 --> MAP3
> -->
>
> UNION RDD --> O/P RDD
>
> Also, the Spark UI shows that no RDD partitions are actually being cached.
>
> How do I split it then without shuffling thrice?
> Regards,
>
> Sushrut Ikhar
> about.me/sushrutikhar
> <https://about.me/sushrutikhar?promo=email_sig>
>
>
