It would be hard to guess what is going on without looking at the code. It 
looks like the driver program goes into a long stop-the-world GC pause. That 
should not happen on the machine running the driver if all you are doing is 
reading data from HDFS, performing a series of transformations, and writing 
the result back to HDFS.
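
One way to confirm this (just a sketch on my part, assuming you can add a bit 
of code to the driver) is to log the cumulative GC time the driver JVM has 
accumulated, via the standard JMX beans, and see whether it jumps by minutes 
at a time:

    import java.lang.management.ManagementFactory
    import scala.collection.JavaConverters._

    // Total time (in ms) the driver JVM has spent in garbage collection so far.
    val gcTimeMs = ManagementFactory.getGarbageCollectorMXBeans.asScala
      .map(_.getCollectionTime)
      .sum
    println(s"Driver GC time so far: $gcTimeMs ms")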

Perhaps the program is not actually using Spark in cluster mode, but running 
Spark in local mode?
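
A quick sanity check (again just a sketch, assuming a SparkContext named sc 
is in scope; with Spark 2.x it would be spark.sparkContext):

    // Should print the cluster master, e.g. "yarn" or "spark://host:7077",
    // rather than "local[*]".
    println(sc.master)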

Mohammed
Author: Big Data Analytics with Spark
<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Khaled Hammouda [mailto:khaled.hammo...@kik.com]
Sent: Tuesday, June 14, 2016 10:23 PM
To: user
Subject: Spark SQL driver memory keeps rising

I'm having trouble with a Spark SQL job in which I run a series of SQL 
transformations on data loaded from HDFS.

The first two stages load data from the HDFS input without issues, but later 
stages that require shuffles cause the driver memory to keep rising until it 
is exhausted. At that point the driver stalls, the Spark UI stops responding, 
and I can't even kill the driver with ^C; I have to forcibly kill the process.

I think I'm allocating enough memory to the driver: driver memory is 44 GB, 
and spark.driver.memoryOverhead is 4.5 GB. When I look at the memory usage, 
the driver memory before the shuffle starts is at about 2.4 GB (the virtual 
memory size of the driver process is about 50 GB). Once the stages that 
require a shuffle start, I can see the driver memory rising fast to about 
47 GB, at which point everything stops responding.
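
For reference, those figures come from watching the driver process itself; a 
rough sketch of how the on-heap side could be checked from inside the driver 
(illustrative only, not code from my job) would be:

    // Compare on-heap usage with the process size seen from the OS, to tell
    // whether the growth is on the JVM heap or in off-heap/native memory.
    val rt = Runtime.getRuntime
    val usedGb = (rt.totalMemory - rt.freeMemory) / 1e9
    val maxGb  = rt.maxMemory / 1e9
    println(f"Driver heap: $usedGb%.1f GB used of $maxGb%.1f GB max")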

I'm not invoking any output operation that collects data at the driver. I 
just call .cache() on a couple of DataFrames, since they get used more than 
once in the SQL transformations, but those should be cached on the workers. 
Then I write the final result to a Parquet file, but the job never gets to 
this final stage.
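
The job looks roughly like this (a simplified sketch with placeholder names, 
paths, and SQL rather than the actual code; it assumes the Spark 2.x 
SparkSession API, with 1.6 it would be sqlContext and registerTempTable):

    // Load input from HDFS (the input format here is an assumption).
    val input = spark.read.parquet("hdfs:///path/to/input")
    input.createOrReplaceTempView("input")

    // Intermediate result reused by several later queries, so it is cached
    // (on the executors, not the driver).
    val intermediate = spark.sql("SELECT * FROM input")  // real transformations go here
    intermediate.cache()
    intermediate.createOrReplaceTempView("intermediate")

    // Shuffle-heavy stages start around here; the job never reaches the write.
    val result = spark.sql("SELECT * FROM intermediate")
    result.write.parquet("hdfs:///path/to/output")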

What could possibly be causing the driver memory to rise that fast when no data 
is being collected at the driver?

Thanks,
Khaled
