Hi Gerard, isn't this the same issue as this? https://issues.apache.org/jira/browse/MESOS-1688
On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas <gerard.m...@gmail.com> wrote:
> Hi,
>
> We are observing with certain regularity that our Spark jobs, as Mesos
> frameworks, are hoarding resources and not releasing them, resulting in
> resource starvation for all jobs running on the Mesos cluster.
>
> For example, this is a job that has spark.cores.max = 4 and
> spark.executor.memory = "3g":
>
> ID                     Framework      Host                   CPUs   Mem
> …5050-16506-1146497    FooStreaming   dnode-4.hdfs.private   7      13.4 GB
> …5050-16506-1146495    FooStreaming   dnode-0.hdfs.private   1      6.4 GB
> …5050-16506-1146491    FooStreaming   dnode-5.hdfs.private   7      11.9 GB
> …5050-16506-1146449    FooStreaming   dnode-3.hdfs.private   7      4.9 GB
> …5050-16506-1146247    FooStreaming   dnode-1.hdfs.private   0.5    5.9 GB
> …5050-16506-1146226    FooStreaming   dnode-2.hdfs.private   3      7.9 GB
> …5050-16506-1144069    FooStreaming   dnode-3.hdfs.private   1      8.7 GB
> …5050-16506-1133091    FooStreaming   dnode-5.hdfs.private   1      1.7 GB
> …5050-16506-1133090    FooStreaming   dnode-2.hdfs.private   5      5.2 GB
> …5050-16506-1133089    FooStreaming   dnode-1.hdfs.private   6.5    6.3 GB
> …5050-16506-1133088    FooStreaming   dnode-4.hdfs.private   1      251 MB
> …5050-16506-1133087    FooStreaming   dnode-0.hdfs.private   6.4    6.8 GB
>
> The only way to release the resources is by manually finding the process
> in the cluster and killing it. The jobs are often streaming, but batch
> jobs show this behavior as well. We have more streaming jobs than batch
> jobs, so the stats are biased.
>
> Any ideas of what's up here? Hopefully some very bad, ugly bug that has
> been fixed already and that will urge us to upgrade our infra?
>
> Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0
>
> -kr, Gerard.
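For reference, a minimal sketch of how a Spark 1.1.0 job with the settings quoted above (spark.cores.max = 4, spark.executor.memory = "3g") would typically be wired up against Mesos. The app name, the Mesos master URL, and the explicit spark.mesos.coarse line are illustrative assumptions, not taken from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the configuration described in the thread (Spark 1.1.0 on Mesos).
    // Property names/values come from the mail; app name and master URL are placeholders.
    object FooStreamingJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("FooStreaming")
          .setMaster("mesos://zk://zk-host:2181/mesos") // hypothetical master URL
          .set("spark.cores.max", "4")                  // cluster-wide core cap for this app
          .set("spark.executor.memory", "3g")           // memory per executor
          // Fine-grained mode is the 1.1.0 default: each Spark task runs as a Mesos task,
          // so CPU shares should be handed back as tasks finish.
          .set("spark.mesos.coarse", "false")

        val sc = new SparkContext(conf)
        // ... job logic ...
        sc.stop() // stopping the context should unregister the framework and release its resources
      }
    }

If jobs exit (or crash) without reaching sc.stop(), the framework can stay registered with the Mesos master, which would look like the hoarding described above.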