Well, my only guess (and it is just a guess, as I don't have access to the machines that require a hard reset) is that the system is running into some kind of race condition while accessing the disk, cannot recover from it, and hence hangs. That is admittedly a vague statement, and it will probably take some trial and error to figure out exactly why the system is hanging. Also, I believe you are using HDFS as the data storage; HDFS relaxes some POSIX requirements in exchange for faster data access, and I wonder whether that is part of the cause.

Hey

Namaskara~Nalama~Guten Tag~Bonjour

--
Keigu

Deepak
73500 12833
www.simtree.net, dee...@simtree.net
deic...@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more: http://www.gridrepublic.org"
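For what it's worth, one way to narrow down which tasks are actually wedged during these episodes is to list the processes sitting in uninterruptible sleep (D state), since those are what push the load average up while a node is stuck in iowait. A minimal sketch, assuming a Linux node with /proc and Python 2.7 available; it has to be run on an affected node while it is still reachable, and the script itself is illustrative rather than something from this thread:

#!/usr/bin/env python
# Minimal sketch: list processes stuck in uninterruptible sleep ("D" state),
# i.e. the tasks that drive the load average up while a node is wedged on iowait.
# Reads /proc directly, so it only works on Linux.
import os

def d_state_processes():
    stuck = []
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/stat' % pid) as f:
                data = f.read()
        except (IOError, OSError):
            continue  # process exited between listdir() and open()
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain spaces,
        # so locate the closing parenthesis instead of naively splitting.
        comm = data[data.index('(') + 1:data.rindex(')')]
        state = data[data.rindex(')') + 2]
        if state == 'D':
            stuck.append((int(pid), comm))
    return stuck

if __name__ == '__main__':
    for pid, comm in d_state_processes():
        print('%6d  %s' % (pid, comm))

If the D-state list right after a jupyter kernel exits is dominated by processes touching the boot drive (puppet, facter, the cleanup scripts), that at least confirms which device is wedged, even if it doesn't yet say which process caused it.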
On Thu, Jun 16, 2016 at 10:54 PM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:

> Hi Deepak,
>
> Yes, that's about the size of it. The Spark job isn't filling the disk by any stretch of the imagination; in fact, the only thing writing to the disk from Spark in certain of these instances is the logging.
>
> Thanks,
> —Ken
>
> On Jun 16, 2016, at 12:17 PM, Deepak Goel <deic...@gmail.com> wrote:
>
> I guess what you are saying is:
>
> 1. The nodes work perfectly OK, with no iowait, before the Spark job.
> 2. After you have run the Spark job and killed it, the iowait persists.
>
> So it seems the Spark job is altering the disk in such a way that other programs can't access it after the Spark job is killed. (A naive thought:) I wonder if the Spark job fills up the disk so that no other program on your node can write to it, hence the iowait.
>
> Also, facter normally just reads information from your system, so it shouldn't block it. Perhaps there are some other background scripts running on your node that are writing to the disk.
>
> Deepak
>
> On Thu, Jun 16, 2016 at 5:56 PM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:
>
>> 1. There are 320 nodes in total, with 96 dedicated to Spark. In this particular case, 21 are in the Spark cluster. In typical Spark usage, maybe 1-3 nodes will crash in a day, with probably an average of 4-5 Spark clusters running at a given time. In THIS case, 7-12 nodes will crash simultaneously on application termination (not Spark cluster termination, but termination of a Spark application/jupyter kernel).
>> 2. I've turned off puppet, no effect. I've not fully disabled facter. The iowait persists after the scheduler kills the Spark job (that still works, at least).
>> 3. He's attempted to run with 15 cores out of 16 and 25GB of RAM out of 128. He still lost nodes.
>> 4. He's currently running storage benchmarking tests, which consist mainly of shuffles.
>>
>> Thanks!
>> Ken
>>
>> On Jun 16, 2016, at 8:00 AM, Deepak Goel <deic...@gmail.com> wrote:
>>
>> I am no expert, but some naive thoughts...
>>
>> 1. How many HPC nodes do you have? How many of them crash (what do you mean by multiple)? Do all of them crash?
>>
>> 2. What things are you running on Puppet? Can't you switch it off and test Spark? You could also switch off facter. By the way, your observation that there is iowait on these applications might be because they have lower priority than Spark, so they are waiting for Spark to finish; the real bottleneck might be Spark and not these background processes.
>>
>> 3. Limiting CPUs and memory for Spark might have the opposite effect on iowait, since more of the Spark processing would have to access the disk due to the reduced memory and CPU.
>>
>> 4. Of course, you might have to give more info on what kind of applications you are running on Spark, as they might be the main culprit.
>>
>> Deepak
>>
>> On Thu, Jun 16, 2016 at 5:10 PM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:
>>
>>> We run Spark on a general-purpose HPC cluster (using standalone mode and the HPC scheduler), and are currently on Spark 1.6.1. One of the primary users has been testing various storage and other parameters for Spark, which involves doing multiple shuffles and shutting down and starting many applications serially on a single cluster instance. He is using pyspark (via jupyter notebooks). The Python version is 2.7.6.
>>>
>>> We have been seeing multiple HPC node hard locks in this scenario, all at the termination of a jupyter kernel (read: Spark application). The symptom is that the load on the node keeps climbing. We have determined this is because of iowait on background processes (namely puppet and facter, cleanup scripts, etc.). What he sees is that when he starts a new kernel (application), the executor on those nodes will not start. We can no longer ssh into the nodes, and no commands can be run on them; everything goes into iowait. The only solution is a hard reset of the nodes.
>>>
>>> Obviously this is very disruptive, both to us sysadmins and to him. We have a limited number of HPC nodes that are permitted to run Spark clusters, so this is a big problem.
>>>
>>> I have attempted to limit the background processes, but it doesn't seem to matter; it can be any process that attempts IO on the boot drive. He has tried various things (limiting the CPU cores used by Spark, reducing the memory, etc.), but we have been unable to find a solution, or really, a cause.
>>>
>>> Has anyone seen anything like this? Any ideas where to look next?
>>>
>>> Thanks,
>>> Ken
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
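Since the hangs show up as iowait on the boot drive and the benchmarking workload is shuffle-heavy, one thing worth double-checking is where Spark's shuffle and spill files actually land. Below is a minimal pyspark sketch for Spark 1.6 in standalone mode showing how the resource caps mentioned in this thread and the scratch directory can be set from the application side. The core and memory figures simply echo the limits described above, "storage-benchmark" is just an example application name, and /scratch/spark is a placeholder for a local disk that is not the boot drive.

from pyspark import SparkConf, SparkContext

# Sketch only: cap the application's resources and keep shuffle/spill
# traffic off the boot drive. Adjust values and paths for your nodes.
conf = (SparkConf()
        .setAppName("storage-benchmark")
        .set("spark.executor.cores", "15")    # leave a core per node for the OS and background jobs
        .set("spark.executor.memory", "25g")  # leave headroom for the OS page cache
        # Shuffle files and spills go to spark.local.dir. In standalone mode,
        # a SPARK_LOCAL_DIRS environment variable set on the workers overrides
        # this application-level value, so check both places.
        .set("spark.local.dir", "/scratch/spark"))

sc = SparkContext(conf=conf)

If spark.local.dir (or SPARK_LOCAL_DIRS on the workers) already points at a separate data disk, then the boot-drive iowait is coming from something else, such as the executor and worker logging, which is noted above as the only Spark write activity in some of these cases.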