I am no expert, but some naive thoughts...

1. How many HPC nodes do you have? How many of them crash (what do you
mean by "multiple")? Do all of them crash?
2. What things are you running on Puppet? Can't you switch it off while
you test Spark? You can switch off Facter as well (a rough command sketch
follows below the quoted message). By the way, the iowait you observe on
these applications might be because they have lower priority than Spark,
so they are waiting for Spark to finish. The real bottleneck might be
Spark and not these background processes.

3. Limiting CPUs and memory for Spark might have the opposite effect on
iowait, as more Spark work would have to go to disk because of the
reduced memory and CPU (a pyspark sketch of such caps also follows below
the quote).

4. Of course, you might have to give more info on what kind of
applications you are running on Spark, as they might be the main culprit.

Deepak

Hey

Namaskara~Nalama~Guten Tag~Bonjour

--
Keigu

Deepak
73500 12833
www.simtree.net, dee...@simtree.net
deic...@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more: http://www.gridrepublic.org"

On Thu, Jun 16, 2016 at 5:10 PM, Carlile, Ken <carli...@janelia.hhmi.org>
wrote:

> We run Spark on a general-purpose HPC cluster (using standalone mode and
> the HPC scheduler), and are currently on Spark 1.6.1. One of the primary
> users has been testing various storage and other parameters for Spark,
> which involves doing multiple shuffles and shutting down and starting
> many applications serially on a single cluster instance. He is using
> pyspark (via jupyter notebooks). Python version is 2.7.6.
>
> We have been seeing multiple HPC node hard locks in this scenario, all
> at the termination of a jupyter kernel (read: Spark application). The
> symptom is that the load on the node keeps going higher. We have
> determined this is because of iowait on background processes (namely
> puppet and facter, cleanup scripts, etc.). What he sees is that when he
> starts a new kernel (application), the executor on those nodes will not
> start. We can no longer ssh into the nodes, and no commands can be run
> on them; everything goes into iowait. The only solution is to do a hard
> reset on the nodes.
>
> Obviously this is very disruptive, both to us sysadmins and to him. We
> have a limited number of HPC nodes that are permitted to run Spark
> clusters, so this is a big problem.
>
> I have attempted to limit the background processes, but it doesn’t seem
> to matter; it can be any process that attempts I/O on the boot drive. He
> has tried various things (limiting CPU cores used by Spark, reducing the
> memory, etc.), but we have been unable to find a solution, or really, a
> cause.
>
> Has anyone seen anything like this? Any ideas where to look next?
>
> Thanks,
> Ken
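
Re: point 2, a rough sketch of what I mean by switching Puppet off and
watching who is actually hitting the disk. This assumes a systemd host
with the sysstat package installed; service and package names may differ
on your distribution.

    # Pause Puppet runs with a note (re-enable later with --enable);
    # facter is normally only invoked by agent runs, so this covers it too
    puppet agent --disable "Spark iowait testing"
    # Or stop the agent service outright on a systemd host
    systemctl stop puppet

    # Then watch per-process disk I/O (5-second samples) while a jupyter
    # kernel terminates, to see which processes are actually writing
    pidstat -d 5

If the heavy writers turn out to be the Spark java/python processes while
puppet, facter, and the cleanup scripts are merely blocked, the background
jobs are victims rather than the cause.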
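
Re: point 3, for reference, the kind of cap I mean would look something
like this from a notebook. A minimal pyspark sketch for Spark 1.6.x in
standalone mode; the app name, master URL, and values are placeholders,
not recommendations.

    # Cap the cores and memory a single application takes from the
    # standalone cluster (Spark 1.6.x).
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("storage-parameter-test")    # placeholder name
            .setMaster("spark://master-host:7077")   # placeholder master URL
            .set("spark.cores.max", "8")             # total cores for this app
            .set("spark.executor.memory", "16g"))    # per-executor heap
    sc = SparkContext(conf=conf)

But as in point 3, the smaller you make the executors, the more shuffle
data spills to local disk, so this could raise iowait instead of lowering
it.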