Well, my only guess (and it is just a guess, since I don't have access to the
machines that require a hard reset) is that the system is running into some
kind of race condition while accessing the disk and cannot recover from it,
hence the hang. That is admittedly vague; it will probably take some trial and
error to figure out exactly why the system hangs. Also, I believe you are
using HDFS as your data storage. HDFS relaxes some POSIX requirements in
exchange for faster data access, and I wonder if that is related.
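
If you can still get onto a node while it is degrading (before ssh stops
working), one thing that might narrow it down is to see which processes are
stuck in uninterruptible disk sleep ('D' state), since those are what drive
the load average up during iowait. A rough sketch in Python (mine, not part
of your setup; it only reads the standard Linux /proc files, so nothing here
is specific to Spark):

import os

def processes_in_d_state():
    """Return (pid, command) pairs for processes in 'D' (uninterruptible sleep)."""
    stuck = []
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/stat' % pid) as f:
                data = f.read()
        except (IOError, OSError):
            continue  # process exited while we were scanning
        # /proc/<pid>/stat looks like "pid (comm) state ..."; comm may contain
        # spaces, so find the closing parenthesis and take the next field.
        rparen = data.rfind(')')
        comm = data[data.find('(') + 1:rparen]
        state = data[rparen + 2:].split()[0]
        if state == 'D':
            stuck.append((pid, comm))
    return stuck

if __name__ == '__main__':
    for pid, comm in processes_in_d_state():
        print('%s %s' % (pid, comm))

If the stuck processes are all ones touching the boot drive, that would point
at the local disk rather than HDFS.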

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, dee...@simtree.net
deic...@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"

On Thu, Jun 16, 2016 at 10:54 PM, Carlile, Ken <carli...@janelia.hhmi.org>
wrote:

> Hi Deepak,
>
> Yes, that’s about the size of it. The Spark job isn’t filling the disk by
> any stretch of the imagination; in fact, in some of these instances the only
> thing writing to the disk from Spark is the logging.
>
> Thanks,
> —Ken
>
> On Jun 16, 2016, at 12:17 PM, Deepak Goel <deic...@gmail.com> wrote:
>
> I guess what you are saying is:
>
> 1. The nodes work perfectly fine, with no iowait, before the Spark job runs.
> 2. After you have run the Spark job and killed it, the iowait persists.
>
> So it seems the Spark job is leaving the disk in a state where other
> programs can't access it after the job is killed. (A naive thought:) I
> wonder if the Spark job fills up the disk so that no other program on the
> node can write to it, hence the iowait.
>
> Also, facter normally only reads from the system, so it shouldn't block
> anything by itself. Perhaps there are other background scripts on the node
> that are writing to the disk.
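>
> A quick way to test the "fills up the disk" guess (my sketch, not something
> from your cluster; adjust the paths to wherever Spark's local dirs and logs
> actually live on your nodes):
>
> import os
>
> def pct_free(path):
>     # statvfs reports free/total space in filesystem blocks
>     st = os.statvfs(path)
>     return 100.0 * st.f_bavail / st.f_blocks
>
> for mount in ('/', '/tmp', '/var/log'):  # hypothetical mount points
>     print('%s: %.1f%% free' % (mount, pct_free(mount)))
>
> If those numbers stay healthy while the iowait climbs, the problem is
> probably contention on the disk rather than a full disk.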
>
>
> On Thu, Jun 16, 2016 at 5:56 PM, Carlile, Ken <carli...@janelia.hhmi.org>
> wrote:
>
>> 1. There are 320 nodes in total, with 96 dedicated to Spark. In this
>> particular case, 21 are in the Spark cluster. In typical Spark usage, maybe
>> 1-3 nodes will crash in a day, with probably an average of 4-5 Spark
>> clusters running at a given time. In THIS case, 7-12 nodes will crash
>> simultaneously on application termination (not Spark cluster termination,
>> but termination of a Spark application/jupyter kernel)
>> 2. I’ve turned off puppet, no effect. I’ve not fully disabled facter. The
>> iowait persists after the scheduler kills the Spark job (that still works,
>> at least)
>> 3. He’s attempted to run with 15 cores out of 16 and 25GB of RAM out of
>> 128. He still lost nodes.
>> 4. He’s currently running storage benchmarking tests, which consist
>> mainly of shuffles.
>>
>> Thanks!
>> Ken
>>
>> On Jun 16, 2016, at 8:00 AM, Deepak Goel <deic...@gmail.com> wrote:
>>
>> I am no expert, but some naive thoughts...
>>
>> 1. How many HPC nodes do you have, and how many of them crash (what do you
>> mean by "multiple")? Do all of them crash?
>>
>> 2. What are you managing with Puppet? Can you switch it off and test Spark
>> without it? You can also switch off Facter. By the way, the iowait you see
>> on those applications might simply be because they have lower priority than
>> Spark, so they are waiting for Spark to finish; the real bottleneck might be
>> Spark rather than the background processes.
>>
>> 3. Limiting CPUs and memory for Spark might actually make iowait worse,
>> since Spark processes will spill more to disk when they have less memory
>> and CPU to work with (a rough config sketch follows below).
>>
>> 4. Of course, you may need to share more about the kind of applications you
>> are running on Spark, since they might be the main culprit.
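>>
>> For point 3, here is a rough sketch (mine, with made-up values and a
>> hypothetical /scratch/spark path) of where those knobs usually live in
>> pyspark. The point is that less executor memory tends to mean more spill
>> to spark.local.dir, i.e. more disk traffic rather than less:
>>
>> from pyspark import SparkConf, SparkContext
>>
>> conf = (SparkConf()
>>         .set('spark.executor.cores', '8')           # cap cores per executor
>>         .set('spark.executor.memory', '24g')        # less memory => more spill
>>         .set('spark.local.dir', '/scratch/spark'))  # keep spill off the boot drive
>> sc = SparkContext(conf=conf)
>>
>> In standalone mode the SPARK_LOCAL_DIRS environment variable on the workers
>> may override spark.local.dir, so that is worth checking as well.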
>>
>> Deepak
>>
>>
>> On Thu, Jun 16, 2016 at 5:10 PM, Carlile, Ken <carli...@janelia.hhmi.org>
>> wrote:
>>
>>> We run Spark on a general purpose HPC cluster (using standalone mode and
>>> the HPC scheduler), and are currently on Spark 1.6.1. One of the primary
>>> users has been testing various storage and other parameters for Spark,
>>> which involves doing multiple shuffles and shutting down and starting many
>>> applications serially on a single cluster instance. He is using pyspark
>>> (via jupyter notebooks). Python version is 2.7.6.
>>>
>>> We have been seeing multiple HPC node hard locks in this scenario, all
>>> at the termination of a jupyter kernel (i.e., a Spark application). The
>>> symptom is that the load on the node keeps climbing. We have determined
>>> this is because of iowait on background processes (namely puppet, facter,
>>> cleanup scripts, etc.). What he sees is that when he starts a new kernel
>>> (application), the executors on those nodes will not start. We can no
>>> longer ssh into the nodes, and no commands can be run on them; everything
>>> goes into iowait. The only solution is a hard reset of the nodes.
>>>
>>> Obviously this is very disruptive, both to us sysadmins and to him. We
>>> have a limited number of HPC nodes that are permitted to run spark
>>> clusters, so this is a big problem.
>>>
>>> I have attempted to limit the background processes, but it doesn’t seem
>>> to matter; it can be any process that attempts I/O on the boot drive. He has
>>> tried various things (limiting CPU cores used by Spark, reducing the
>>> memory, etc.), but we have been unable to find a solution, or really, a
>>> cause.
>>>
>>> Has anyone seen anything like this? Any ideas where to look next?
>>>
>>> Thanks,
>>> Ken
>>>
>>
>>
>
>
