Well, my only guess (it is just a guess, as I don't have access to the
machines that require a hard reset): the system is running into some kind
of race condition while accessing the disk and is not able to resolve
it, hence it is hanging (well, this is a pretty vague statement, but it
seems it
Hi Deepak,
Yes, that’s about the size of it. The Spark job isn’t filling the disk by any stretch of the imagination; in fact, in some of these instances the only thing writing to the disk from Spark is the logging.
Thanks,
—Ken
On Jun 16, 2016, at 12:17 PM,
I guess what you are saying is:
1. The nodes work perfectly OK, with no I/O wait, before the Spark job.
2. After you have run the Spark job and killed it, the I/O wait persists.
So it seems the Spark job is altering the disk in such a way that
other programs can't access the disk after the Spark job is
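One quick way to check whether the I/O wait really does persist after the job is killed is to watch the kernel's iowait counter directly. This is a generic Linux sketch, not something from the original thread; it samples field 6 of the "cpu" line in /proc/stat:

```shell
#!/bin/sh
# Sample the cumulative iowait jiffies (field 6 of the "cpu" line in
# /proc/stat) twice, one second apart, and print the delta. A node that
# keeps showing a large delta long after the Spark job was killed
# matches the "I/O wait persists" symptom described above.
sample_iowait() { awk '/^cpu /{print $6}' /proc/stat; }
before=$(sample_iowait)
sleep 1
after=$(sample_iowait)
echo "iowait jiffies in the last second: $((after - before))"
```

If sysstat/procps are installed, `iostat -x 1` or `vmstat 1` give the same signal with more detail (including which device is blocked).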
1. There are 320 nodes in total, with 96 dedicated to Spark. In this particular case, 21 are in the Spark cluster. In typical Spark usage, maybe 1-3 nodes will crash in a day, with an average of 4-5 Spark clusters running at a given time. In THIS case, 7-12 nodes will crash
I am no expert, but some naive thoughts...
1. How many HPC nodes do you have? How many of them crash (What do you mean
by multiple)? Do all of them crash?
2. What are you running on Puppet? Can't you switch it off and test
Spark? Also, you can switch off Facter. Btw, your observation that
We run Spark on a general-purpose HPC cluster (using standalone mode and the
HPC scheduler), and are currently on Spark 1.6.1. One of the primary users has
been testing various storage and other parameters for Spark, which involves
doing multiple shuffles and shutting down and starting many