Hi,
I need to split an RDD into 3 different RDDs using the filter transformation.
I have cached the original RDD before using filter.
The input is skewed, leaving some executors heavily loaded and others
lightly loaded, so I have repartitioned it.

*DAG-lineage I expected:*

I/P RDD  -->  MAP RDD --> SHUFFLE RDD (repartition) -->

*MAP RDD (cache)* --> FILTER RDD1 --> MAP1 --> UNION RDD --> O/P RDD
                               --> FILTER RDD2 --> MAP2
                               --> FILTER RDD3 --> MAP3
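
In code, the lineage I expected corresponds roughly to the sketch below. It is a
simplified illustration, not my actual job: the input data, the map functions,
the filter predicates, the object name and the partition count of 8 are all
placeholder assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical stand-alone example; all names and values are placeholders.
object SplitByFilterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("split-by-filter").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // I/P RDD: placeholder data standing in for the real, skewed input.
    val input = sc.parallelize(1 to 1000)

    // MAP RDD --> SHUFFLE RDD (repartition) --> MAP RDD (cache)
    val balanced = input
      .map(x => x * 2)                    // placeholder first map
      .repartition(8)                     // the single shuffle that rebalances the data
      .map(x => (x % 3, x))               // placeholder second map: tag each record with a key
      .persist(StorageLevel.MEMORY_ONLY)  // only marks the RDD; partitions are stored when the first action runs

    // Three filter + map branches, all reading from the same persisted RDD.
    val out1 = balanced.filter(_._1 == 0).map(_._2 + 1)
    val out2 = balanced.filter(_._1 == 1).map(_._2 + 2)
    val out3 = balanced.filter(_._1 == 2).map(_._2 + 3)

    // UNION RDD --> O/P RDD
    val output = sc.union(Seq(out1, out2, out3))

    // The action that triggers the whole lineage.
    println(output.count())

    sc.stop()
  }
}
```

Here the persisted RDD is the one produced after the repartition, which is the
point at which I expected the three filter branches to share a single shuffle.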

*DAG-lineage I observed:*

I/P RDD  -->  MAP RDD -->

SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD1 --> MAP1
SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD2 --> MAP2
SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD3 --> MAP3 -->

UNION RDD --> O/P RDD

Also, the Spark UI shows that no RDD partitions are actually being cached.

How do I split it, then, without shuffling three times?
Regards,

Sushrut Ikhar
<https://about.me/sushrutikhar?promo=email_sig>
