1. There are 320 nodes in total, 96 of which are dedicated to Spark; in this particular case, 21 are in the Spark cluster. In typical Spark usage, perhaps 1-3 nodes crash in a day, with an average of 4-5 Spark clusters running at any given time. In THIS case, 7-12 nodes crash simultaneously on application termination (not Spark cluster termination, but termination of a Spark application/Jupyter kernel).
2. I’ve turned off Puppet, with no effect. I have not fully disabled Facter. The iowait persists after the scheduler kills the Spark job (that still works, at least).
3. He’s attempted to run with 15 of the 16 cores and 25 GB of the 128 GB of RAM. He still lost nodes.
4. He’s currently running storage benchmarking tests, which consist mainly of shuffles.
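For reference, the resource caps described in point 3 would look something like this in spark-defaults.conf (a sketch only; the values mirror the 15-core/25 GB limits mentioned above, and `spark.executor.cores` / `spark.executor.memory` are the standard Spark properties for per-executor caps):

```
# Cap each executor at 15 of the node's 16 cores and 25g of its 128 GB RAM
spark.executor.cores   15
spark.executor.memory  25g
```

Note these limit the JVM heap, not off-heap or page-cache usage, so heavy shuffles can still drive I/O pressure well past them.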
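To watch whether the lingering iowait from point 2 is actually climbing on an affected node, here's a minimal sketch that samples the cumulative iowait share from /proc/stat (Linux-only; the function name and layout are my own, and the field positions follow the proc(5) man page: user, nice, system, idle, iowait, ...):

```python
def cpu_iowait_fraction(statfile="/proc/stat"):
    """Return the cumulative fraction of CPU time spent in iowait since boot."""
    with open(statfile) as f:
        fields = f.readline().split()  # aggregate "cpu" line
    values = [int(v) for v in fields[1:]]
    total = sum(values)
    iowait = values[4]  # 5th counter is iowait per proc(5)
    return iowait / total if total else 0.0

if __name__ == "__main__":
    # Run this periodically after killing the Spark job; a growing share
    # suggests the stuck I/O persists independently of the application.
    print(f"cumulative iowait share: {cpu_iowait_fraction():.4%}")
```

Sampling it twice a few seconds apart and diffing would show the instantaneous rate rather than the since-boot average.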
Thanks!
Ken
- Spark crashes worker nodes with multiple application starts Carlile, Ken