Hi,

We are observing with some regularity that our Spark jobs, running as Mesos frameworks, hoard resources and do not release them, resulting in resource starvation for all jobs running on the Mesos cluster.
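For reference, the jobs set their resource caps roughly like this (a trimmed-down Scala sketch; the object name, app name, master URL and surrounding scaffolding are placeholders, not our actual code):

import org.apache.spark.{SparkConf, SparkContext}

object FooStreamingJob {                 // placeholder name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FooStreaming")
      .setMaster("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos") // placeholder ZK quorum
      .set("spark.cores.max", "4")        // cap on total cores for the job
      .set("spark.executor.memory", "3g") // memory per executor

    val sc = new SparkContext(conf)
    // ... actual streaming / batch logic elided ...
    sc.stop()
  }
}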
For example, this is a job that has spark.cores.max = 4 and spark.executor.memory = "3g", and yet this is what it is holding:

ID                    Framework      Host                   CPUs   Mem
…5050-16506-1146497   FooStreaming   dnode-4.hdfs.private   7      13.4 GB
…5050-16506-1146495   FooStreaming   dnode-0.hdfs.private   1      6.4 GB
…5050-16506-1146491   FooStreaming   dnode-5.hdfs.private   7      11.9 GB
…5050-16506-1146449   FooStreaming   dnode-3.hdfs.private   7      4.9 GB
…5050-16506-1146247   FooStreaming   dnode-1.hdfs.private   0.5    5.9 GB
…5050-16506-1146226   FooStreaming   dnode-2.hdfs.private   3      7.9 GB
…5050-16506-1144069   FooStreaming   dnode-3.hdfs.private   1      8.7 GB
…5050-16506-1133091   FooStreaming   dnode-5.hdfs.private   1      1.7 GB
…5050-16506-1133090   FooStreaming   dnode-2.hdfs.private   5      5.2 GB
…5050-16506-1133089   FooStreaming   dnode-1.hdfs.private   6.5    6.3 GB
…5050-16506-1133088   FooStreaming   dnode-4.hdfs.private   1      251 MB
…5050-16506-1133087   FooStreaming   dnode-0.hdfs.private   6.4    6.8 GB

The only way to release the resources is to manually find the process on the cluster and kill it. The jobs showing this behavior are usually streaming jobs, but batch jobs show it as well; we run more streaming jobs than batch jobs, so the stats are biased.

Any idea what's going on here? Hopefully it's some very bad, ugly bug that has already been fixed and that will push us to upgrade our infra?

Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0

-kr, Gerard.