It would be hard to guess what could be going on without looking at the code. It looks like the driver program goes into a long stop-the-world GC pause. This should not happen on the machine running the driver program if all you are doing is reading data from HDFS, performing a bunch of transformations, and writing the result back into HDFS.

Perhaps the program is not actually using Spark in cluster mode, but is running Spark in local mode?
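One quick way to confirm is to log the master URL the application is actually bound to (a minimal sketch, assuming `sc` is the job's SparkContext):

    // Log the master the running application is bound to. "local[*]" or
    // "local[N]" means every task, including the shuffle, runs inside the
    // driver JVM; "yarn", "spark://..." or "mesos://..." means the work
    // is distributed to executors.
    println("master = " + sc.master)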
Mohammed
Author: Big Data Analytics with Spark <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Khaled Hammouda [mailto:khaled.hammo...@kik.com]
Sent: Tuesday, June 14, 2016 10:23 PM
To: user
Subject: Spark SQL driver memory keeps rising

I'm having trouble with a Spark SQL job in which I run a series of SQL transformations on data loaded from HDFS.

The first two stages load data from the HDFS input without issues, but later stages that require shuffles cause the driver memory to keep rising until it is exhausted. Then the driver stalls, the Spark UI stops responding, and I can't even kill the driver with ^C; I have to forcibly kill the process.

I think I'm allocating enough memory to the driver: driver memory is 44 GB, and spark.driver.memoryOverhead is 4.5 GB. When I look at the memory usage, the driver memory before the shuffle starts is at about 2.4 GB (the virtual memory size for the driver process is about 50 GB). Once the stages that require a shuffle start, I can see the driver memory rising fast to about 47 GB, and then everything stops responding.

I'm not invoking any output operation that collects data at the driver. I just call .cache() on a couple of dataframes since they get used more than once in the SQL transformations, but those should be cached on the workers. Then I write the final result to a Parquet file, but the job doesn't get to this final stage.

What could possibly be causing the driver memory to rise that fast when no data is being collected at the driver?

Thanks,
Khaled
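For reference, the kind of job described above would look roughly like the sketch below; the paths, table names, and SQL are placeholders, and the submit flags assume YARN with the memory settings mentioned in the message (the exact overhead property key varies by Spark version and cluster manager):

    // Placeholder submit command:
    //   spark-submit --master yarn --deploy-mode client \
    //     --driver-memory 44g \
    //     --conf spark.driver.memoryOverhead=4500m \
    //     --class TransformJob transform-job.jar

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object TransformJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TransformJob"))
        val sqlContext = new SQLContext(sc)

        // Load inputs from HDFS (placeholder paths).
        val events = sqlContext.read.parquet("hdfs:///data/events")
        val users  = sqlContext.read.parquet("hdfs:///data/users")

        // DataFrames reused by several transformations are cached; the
        // cached blocks live in executor memory, not on the driver.
        events.cache()
        users.cache()
        events.registerTempTable("events")
        users.registerTempTable("users")

        // Shuffle-heavy step (join + aggregation), placeholder SQL.
        val result = sqlContext.sql(
          """SELECT u.country, COUNT(*) AS cnt
            |FROM events e JOIN users u ON e.user_id = u.id
            |GROUP BY u.country""".stripMargin)

        // Write the final result back to HDFS as Parquet.
        result.write.parquet("hdfs:///output/result")
      }
    }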