Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Deepak Goel
Well, my only guess (it is just a guess, as I don't have access to the machines which require a hard reset): the system is running into some kind of race condition while accessing the disk, and is not able to resolve it; hence it is hanging (well, this is a pretty vague statement, but it seems it
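If the hang really is processes blocked on disk I/O, one quick check is to look for processes stuck in uninterruptible disk sleep (the "D" state that drives up io wait). A minimal sketch, assuming psutil is installed on the affected node:

```python
# Minimal sketch: list processes stuck in uninterruptible disk sleep
# (the "D" state that shows up as io wait). Assumes psutil is installed.
import psutil

for proc in psutil.process_iter():
    try:
        if proc.status() == psutil.STATUS_DISK_SLEEP:
            print("D-state process: pid=%d name=%s" % (proc.pid, proc.name()))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass  # process exited or is off-limits; skip it
```

A node hanging on a race in the disk path would typically show such processes accumulating and never leaving the D state.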

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Carlile, Ken
Hi Deepak, Yes, that's about the size of it. The Spark job isn't filling the disk by any stretch of the imagination; in fact, the only thing that's writing to the disk from Spark in certain of these instances is the logging. Thanks, —Ken On Jun 16, 2016, at 12:17 PM,
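One way to back up the "it's only the logging" observation is to total the bytes each Spark-related process has actually written. A rough sketch, assuming psutil on a Linux worker; matching on "spark" in the command line is an assumption about how the workers appear in the process table:

```python
# Rough sketch: total write_bytes for Spark-related processes on a Linux
# worker, to confirm how much the job is actually writing to disk.
# Assumes psutil is installed; filtering on "spark" in the command line
# is an assumption about how the executors show up.
import psutil

total = 0
for proc in psutil.process_iter():
    try:
        cmdline = " ".join(proc.cmdline()).lower()
        if "spark" in cmdline:
            io = proc.io_counters()  # Linux: backed by /proc/<pid>/io
            print("pid=%d write_bytes=%d" % (proc.pid, io.write_bytes))
            total += io.write_bytes
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
print("total bytes written by Spark processes: %d" % total)
```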

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Deepak Goel
I guess what you are saying is: 1. The nodes work perfectly OK, with no io wait, before the Spark job. 2. After you have run the Spark job and killed it, the io wait persists. So it seems the Spark job is altering the disk in such a way that other programs can't access the disk after the Spark job is
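To pin that down, one could sample system-wide io wait before the job, while it runs, and again after the job is killed. A minimal sketch, assuming psutil on Linux (the iowait field is Linux-only); the sample count and interval are arbitrary:

```python
# Minimal sketch: sample system-wide iowait so it can be compared
# before a Spark job, during it, and after the job is killed.
# Assumes psutil on Linux; sample count and interval are arbitrary.
import psutil

def sample_iowait(samples=10, interval=1.0):
    readings = []
    for _ in range(samples):
        cpu = psutil.cpu_times_percent(interval=interval)
        readings.append(cpu.iowait)
    return readings

readings = sample_iowait()
print("iowait %%: min=%.1f max=%.1f avg=%.1f"
      % (min(readings), max(readings), sum(readings) / len(readings)))
```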

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Carlile, Ken
1. There are 320 nodes in total, with 96 dedicated to Spark. In this particular case, 21 are in the Spark cluster. In typical Spark usage, maybe 1-3 nodes will crash in a day, with probably an average of 4-5 Spark clusters running at a given time. In THIS case, 7-12 nodes will crash

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Deepak Goel
I am no expert, but some naive thoughts... 1. How many HPC nodes do you have? How many of them crash (what do you mean by multiple)? Do all of them crash? 2. What things are you running through Puppet? Can't you switch it off and test Spark? You can also switch off Facter. Btw, your observation that
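Temporarily taking Puppet out of the picture for a test run could look something like the sketch below; it assumes the standard `puppet agent` CLI is present on the nodes and that the script runs with sufficient privileges, and the lock message is arbitrary:

```python
# Sketch: disable the Puppet agent on a node for the duration of a
# Spark test, then re-enable it. Assumes the standard `puppet agent`
# CLI; must run with sufficient privileges on each worker node.
import subprocess

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

run(["puppet", "agent", "--disable", "testing Spark io wait"])
try:
    # ... run the Spark test while Puppet is out of the picture ...
    pass
finally:
    run(["puppet", "agent", "--enable"])
```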

Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Carlile, Ken
We run Spark on a general-purpose HPC cluster (using standalone mode and the HPC scheduler), and are currently on Spark 1.6.1. One of the primary users has been testing various storage and other parameters for Spark, which involves doing multiple shuffles and shutting down and starting many
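For readers trying to reproduce this, the described workload amounts to repeatedly starting an application against the standalone master, forcing a shuffle, and tearing the application down again. A minimal PySpark sketch; the master URL, iteration count, data sizes, and spark.local.dir value are all placeholders, with spark.local.dir standing in for one of the storage parameters being varied:

```python
# Minimal sketch of the failure-inducing workload: start a standalone
# Spark application, force a shuffle, stop it, and repeat. Master URL,
# iteration count, data sizes, and spark.local.dir are placeholders.
from pyspark import SparkConf, SparkContext

MASTER = "spark://spark-master:7077"  # placeholder standalone master URL

for i in range(20):  # many short-lived applications in a row
    conf = (SparkConf()
            .setMaster(MASTER)
            .setAppName("shuffle-stress-%d" % i)
            .set("spark.local.dir", "/scratch/spark"))  # storage param under test
    sc = SparkContext(conf=conf)
    try:
        rdd = sc.parallelize(range(1000000), 64)
        # groupBy on a small key space forces a full shuffle
        counts = rdd.groupBy(lambda x: x % 16).mapValues(len).collect()
        print("run %d: %d groups" % (i, len(counts)))
    finally:
        sc.stop()  # tear the application down before starting the next
```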