> Any ideas of what's up here? Hopefully it's some very bad, ugly bug that has been fixed already and that will urge us to upgrade our infra?
>
> Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0

Could be https://issues.apache.org/jira/browse/MESOS-1688 (fixed in Mesos 0.21).
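Before (or after) upgrading, it may help to confirm from the master itself which framework is actually holding the resources, independently of the UI. Below is a minimal sketch, not taken from the thread: it pulls the master's state endpoint and prints per-framework totals. The master URL is a placeholder, and the JSON field names ("frameworks", "name", "resources", "cpus", "mem") are assumptions about the state.json layout of a 0.20-era master, so check them against your own output.

import scala.io.Source
import scala.util.parsing.json.JSON

object WhoHoldsTheResources {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL -- substitute your Mesos master host:port.
    val state = JSON.parseFull(
      Source.fromURL("http://mesos-master:5050/master/state.json").mkString)

    // Field names are assumptions about the 0.20-era state.json layout.
    for {
      root       <- state.toList
      frameworks <- root.asInstanceOf[Map[String, Any]].get("frameworks").toList
      framework  <- frameworks.asInstanceOf[List[Map[String, Any]]]
    } {
      val resources = framework("resources").asInstanceOf[Map[String, Any]]
      println(s"${framework("name")}: " +
        s"cpus=${resources.getOrElse("cpus", "?")} mem=${resources.getOrElse("mem", "?")}")
    }
  }
}

Running this periodically (or before and after killing a suspect job) would show whether the totals per framework ever drop back down, which is the behavior described below.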
On Mon, Jan 26, 2015 at 2:45 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi Jörn,
>
> A memory leak in the job would be contained within the resources reserved for it, wouldn't it?
> And the job holding resources is not always the same. Sometimes it's one of the Streaming jobs, sometimes it's a heavy batch job that runs every hour.
> It looks to me like whatever is causing the issue is participating in the Mesos resource offer protocol, and my first suspect would be the Mesos scheduler in Spark. (The table above is the "Offers" tab from the Mesos UI.)
>
> Are there any other factors involved in the offer acceptance/rejection between Mesos and a scheduler?
>
> What do you think?
>
> -kr, Gerard.
>
> On Mon, Jan 26, 2015 at 11:23 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Hi,
>>
>> What do your jobs do? Ideally post source code, but some description would already be helpful so we can support you.
>>
>> Memory leaks can have several causes - it may not be Spark at all.
>>
>> Thank you.
>>
>> On 26 Jan 2015 at 22:28, "Gerard Maas" <gerard.m...@gmail.com> wrote:
>>
>> > (It looks like the list didn't like an HTML table in the previous email. My apologies for any duplicates.)
>> >
>> > Hi,
>> >
>> > We are observing with some regularity that our Spark jobs, running as Mesos frameworks, are hoarding resources and not releasing them, resulting in resource starvation for all jobs running on the Mesos cluster.
>> >
>> > For example, this is a job that has spark.cores.max = 4 and spark.executor.memory = "3g":
>> >
>> > | ID                  | Framework    | Host                 | CPUs | Mem     |
>> > | …5050-16506-1146497 | FooStreaming | dnode-4.hdfs.private | 7    | 13.4 GB |
>> > | …5050-16506-1146495 | FooStreaming | dnode-0.hdfs.private | 1    | 6.4 GB  |
>> > | …5050-16506-1146491 | FooStreaming | dnode-5.hdfs.private | 7    | 11.9 GB |
>> > | …5050-16506-1146449 | FooStreaming | dnode-3.hdfs.private | 7    | 4.9 GB  |
>> > | …5050-16506-1146247 | FooStreaming | dnode-1.hdfs.private | 0.5  | 5.9 GB  |
>> > | …5050-16506-1146226 | FooStreaming | dnode-2.hdfs.private | 3    | 7.9 GB  |
>> > | …5050-16506-1144069 | FooStreaming | dnode-3.hdfs.private | 1    | 8.7 GB  |
>> > | …5050-16506-1133091 | FooStreaming | dnode-5.hdfs.private | 1    | 1.7 GB  |
>> > | …5050-16506-1133090 | FooStreaming | dnode-2.hdfs.private | 5    | 5.2 GB  |
>> > | …5050-16506-1133089 | FooStreaming | dnode-1.hdfs.private | 6.5  | 6.3 GB  |
>> > | …5050-16506-1133088 | FooStreaming | dnode-4.hdfs.private | 1    | 251 MB  |
>> > | …5050-16506-1133087 | FooStreaming | dnode-0.hdfs.private | 6.4  | 6.8 GB  |
>> >
>> > The only way to release the resources is by manually finding the process in the cluster and killing it. The jobs are often streaming jobs, but batch jobs show this behavior as well. We have more streaming jobs than batch jobs, so the stats are biased.
>> > Any ideas of what's up here? Hopefully it's some very bad, ugly bug that has been fixed already and that will urge us to upgrade our infra?
>> >
>> > Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0
>> >
>> > -kr, Gerard.
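For reference, the spark.cores.max / spark.executor.memory caps quoted above would come from a driver configuration along these lines. This is only a minimal Spark 1.1.0-style sketch: the app name and master URL are placeholders, and spark.mesos.coarse is an assumption, since the thread does not say whether the jobs ran in fine-grained or coarse-grained Mesos mode.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the caps quoted in the thread, applied through SparkConf.
val conf = new SparkConf()
  .setAppName("FooStreaming")              // placeholder app name (framework name from the table)
  .setMaster("mesos://mesos-master:5050")  // placeholder Mesos master URL
  .set("spark.cores.max", "4")             // CPU cap quoted in the thread
  .set("spark.executor.memory", "3g")      // executor memory cap quoted in the thread
  .set("spark.mesos.coarse", "true")       // assumption: coarse-grained mode; not stated in the thread

val sc = new SparkContext(conf)

With caps like these in place, the per-framework totals in the "Offers" table above should not keep climbing; the fact that they do is what points at the offer handling rather than at the jobs themselves.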