My setup: Spark 2.1 running on a 3-node YARN cluster with 160 GB total memory, dynamic allocation turned on, spark.executor.memory=6G, spark.executor.cores=6.

First, I read the Hive tables orders (329 MB) and lineitems (1.43 GB) and do a left outer join. Next, I apply 7 different filter conditions to the joined dataset (something like var line1 = joinedDf.filter("l_linenumber = 1"), var line2 = joinedDf.filter("l_linenumber = 2"), etc.). Because I filter the joined dataset multiple times, I thought a persist(MEMORY_ONLY) would help here, since the joined dataset fits fully in memory. A simplified sketch of the code is at the end of this question.

1. I noticed that with persist, the Spark job takes longer to run than without persist (3.5 min vs 3.3 min). With persist, the DAG shows that a single stage was created for the persist and the other downstream jobs wait for it to complete. Does that mean persist is a blocking call? Or do stages in other jobs start processing as the persisted blocks become available?

2. In the non-persist case, different jobs create different stages to read the same data. The data is read multiple times in different stages, yet this still turns out to be faster than the persist case.

3. With larger datasets, persist actually causes the executors to run out of memory (java.lang.OutOfMemoryError: Java heap space). Without persist, the Spark jobs complete just fine. I looked at some other suggestions here: Spark java.lang.OutOfMemoryError: Java heap space. I tried increasing/decreasing executor cores, persisting with DISK_ONLY, increasing the number of partitions, and modifying the storage fraction, but nothing seems to help with the executor memory issues.

I would appreciate it if someone could explain how persist works, in what cases it is faster than not persisting, and, more importantly, how to go about troubleshooting the out-of-memory issues. Thanks.
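For reference, here is a simplified sketch of what the code looks like (the join key and exact column names are approximate, not the real schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("orders-lineitems-join")
  .enableHiveSupport()
  .getOrCreate()

// Read the two Hive tables
val orders = spark.table("orders")        // ~329 MB
val lineitems = spark.table("lineitems")  // ~1.43 GB

// Left outer join (join key shown here is approximate)
val joinedDf = orders.join(
  lineitems,
  orders("o_orderkey") === lineitems("l_orderkey"),
  "left_outer")

// Cache the joined dataset, since it is filtered seven times below
joinedDf.persist(StorageLevel.MEMORY_ONLY)

// Seven independent filters on the same joined dataset
var line1 = joinedDf.filter("l_linenumber = 1")
var line2 = joinedDf.filter("l_linenumber = 2")
// ... and so on up to l_linenumber = 7

// Each downstream action triggers its own job
line1.count()
line2.count()
```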