I’m attempting to use Spark 2.3.1 (spark-2.3.1-bin-hadoop2.7.tgz) in cluster 
mode and running into some issues. This is a cluster where we've had success 
using Spark 2.2.0 (spark-2.2.0-bin-hadoop2.7.tgz), and I'm simply upgrading our 
nodes with the new Spark 2.3.1 package and testing it out.

Some version information:

Spark v2.3.1
DC/OS v1.10.0
Mesos v1.4.0
Dispatcher: Docker, mesosphere/spark:2.3.1-2.2.1-2-hadoop-2.6 (Docker image 
from https://github.com/mesosphere/spark-build), installed roughly as sketched 
below
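
For context, the dispatcher is deployed as a DC/OS Universe package. The sketch 
below is roughly how we install it through the DC/OS CLI; the service name and 
the "docker-image" option key are from memory, so treat them as assumptions 
rather than a copy of our actual options file.

# Rough sketch of the dispatcher install via the DC/OS CLI.
# The service name and "docker-image" key are assumptions, not our exact options.
cat > spark-options.json <<'EOF'
{
  "service": {
    "name": "spark",
    "docker-image": "mesosphere/spark:2.3.1-2.2.1-2-hadoop-2.6"
  }
}
EOF
dcos package install spark --options=spark-options.json --yes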

This is a multi-node cluster. I'm submitting a job that uses the sample 
spark-pi jar included in the distribution (roughly as sketched below). Some 
spark-submit runs complete without issue. On other runs, execution begins with 
a burst of TASK_LOST messages, followed by the BlockManager attempting to 
remove a handful of non-existent executors. I can also see the driver/scheduler 
entering a tight loop of SUBSCRIBE requests to the master.mesos service. The 
request volume and frequency are so high that the Mesos master stops responding 
to other requests, eventually runs out of memory, and systemd restarts the 
failed process. If only one job is running and it manages to start an executor 
(exactly one started in my sample logs), the job will eventually complete. 
However, if I deploy multiple jobs (five seemed to do the trick), I've seen 
cases where none of them complete, and the cluster suffers cascading failures 
because the master, flooded with REGISTER requests from the numerous Spark 
driver frameworks, stops servicing other API requests.
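
For reference, the jobs go through the dispatcher in cluster mode. The actual 
submissions use the REST API (see the spark-submit JSON gist below), but an 
equivalent spark-submit invocation would look roughly like the following; the 
dispatcher host, jar URL, and resource settings here are placeholders, not our 
real values.

# Roughly equivalent cluster-mode submission against the Mesos dispatcher.
# Dispatcher host, jar URL, and resource settings are placeholders.
./bin/spark-submit \
  --master mesos://<dispatcher-host>:7077 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.cores.max=4 \
  --conf spark.executor.memory=1g \
  http://<artifact-host>/spark-examples_2.11-2.3.1.jar 100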

Logs:
Problematic run (stdout, stderr, mesos.master logs): 
https://gist.github.com/davidhesson/791cb3101db2521a51478ff4e2d22841
Successful run (stdout, stderr; for comparison): 
https://gist.github.com/davidhesson/66e32196834b849cd2919dba8275cd4a
Snippet of flood of subscribes hitting master node: 
https://gist.github.com/davidhesson/2c5d22e4f87fad85ce975bc074289136
Spark submit JSON: 
https://gist.github.com/davidhesson/c0c77dffe48965650fd5bbb078731900
