We run Spark on a general-purpose HPC cluster (using standalone mode and the 
HPC scheduler), and are currently on Spark 1.6.1. One of our primary users has 
been testing various storage and other parameters for Spark, which involves 
doing multiple shuffles and starting and shutting down many applications 
serially on a single cluster instance. He is using PySpark (via Jupyter 
notebooks); the Python version is 2.7.6. 
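
For context, each notebook kernel drives its own Spark application roughly 
like the sketch below (the exact wiring in his notebooks may differ; the 
master URL and app name here are placeholders, not our actual values): 

    from pyspark import SparkConf, SparkContext

    # One Spark application per Jupyter kernel, against the standalone master.
    conf = (SparkConf()
            .setMaster("spark://master-host:7077")   # placeholder master URL
            .setAppName("storage-parameter-test"))   # placeholder app name
    sc = SparkContext(conf=conf)

    # ... run the shuffle-heavy test workload ...

    # Shutting down the kernel ends the application and tears down its executors.
    sc.stop()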

We have been seeing multiple HPC nodes hard-lock in this scenario, always at 
the termination of a Jupyter kernel (i.e., a Spark application). The symptom 
is that the load on the node keeps climbing. We have determined this is due to 
iowait in background processes (namely puppet and facter, clean-up scripts, 
etc.). What he sees is that when he starts a new kernel (application), the 
executors on those nodes never start. We can no longer ssh into the nodes, 
and no commands can be run on them; everything goes into iowait. The only 
remedy is a hard reset of the nodes. 

Obviously this is very disruptive, both to us sysadmins and to him. We have a 
limited number of HPC nodes that are permitted to run Spark clusters, so this 
is a big problem. 

I have attempted to limit the background processes, but it doesn't seem to 
matter; it can be any process that attempts I/O on the boot drive. He has 
tried various things on the Spark side (limiting the CPU cores used by Spark, 
reducing the memory, etc.), but we have been unable to find a solution, or 
really, a cause. 
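
To give a concrete idea of the knobs he has been turning, the attempts amount 
to settings along these lines (the values are illustrative only, not the exact 
numbers from his runs, and the master URL is again a placeholder): 

    from pyspark import SparkConf, SparkContext

    # Illustrative resource limits of the kind that were tried; the actual
    # values varied from run to run.
    conf = (SparkConf()
            .setMaster("spark://master-host:7077")
            .setAppName("storage-parameter-test")
            .set("spark.cores.max", "8")           # cap the total cores the application may use
            .set("spark.executor.memory", "4g"))   # reduce per-executor memory
    sc = SparkContext(conf=conf)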

Has anyone seen anything like this? Any ideas where to look next? 

Thanks, 
Ken