Thanks a lot.

After reading MESOS-1688, I still don't understand how or why a job would
hoard and hold on to so many resources, even in the presence of that bug.
Looking at the release notes, I think this ticket could be relevant to
preventing the behavior we're seeing:
[MESOS-186] - Resource offers should be rescinded after some configurable
timeout
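
If I'm reading the release notes right, that ticket was addressed by adding an
--offer_timeout flag to the master, so offers that a framework sits on for too
long get rescinded automatically. Something along these lines is what we plan
to try on the dev master (the zk/quorum values are elided here and the timeout
value is just a first guess on our side):

  mesos-master --zk=zk://... \
               --quorum=... \
               --offer_timeout=5mins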

Bottom line: we're following your advice and testing Mesos 0.21 on dev, with a
roll-out to our prod platforms to follow.

Thanks!!

-kr, Gerard.


On Tue, Jan 27, 2015 at 9:15 PM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Gerard,
>
> As others have mentioned, I believe you're hitting MESOS-1688. Can you
> upgrade to the latest Mesos release (0.21.1) and let us know if it resolves
> your problem?
>
> Thanks,
>
> Tim
>
> On Tue, Jan 27, 2015 at 10:39 AM, Sam Bessalah <samkiller....@gmail.com>
> wrote:
>
>> Hi Gerard,
>> isn't this the same issue as this one?
>> https://issues.apache.org/jira/browse/MESOS-1688
>>
>> On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas <gerard.m...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We are observing with some regularity that our Spark jobs, running as
>>> Mesos frameworks, are hoarding resources and not releasing them, resulting
>>> in resource starvation for all jobs running on the Mesos cluster.
>>>
>>> For example, this is a job that has spark.cores.max = 4 and
>>> spark.executor.memory = "3g":
>>>
>>> ID                    Framework     Host                   CPUs  Mem
>>> …5050-16506-1146497   FooStreaming  dnode-4.hdfs.private   7     13.4 GB
>>> …5050-16506-1146495   FooStreaming  dnode-0.hdfs.private   1     6.4 GB
>>> …5050-16506-1146491   FooStreaming  dnode-5.hdfs.private   7     11.9 GB
>>> …5050-16506-1146449   FooStreaming  dnode-3.hdfs.private   7     4.9 GB
>>> …5050-16506-1146247   FooStreaming  dnode-1.hdfs.private   0.5   5.9 GB
>>> …5050-16506-1146226   FooStreaming  dnode-2.hdfs.private   3     7.9 GB
>>> …5050-16506-1144069   FooStreaming  dnode-3.hdfs.private   1     8.7 GB
>>> …5050-16506-1133091   FooStreaming  dnode-5.hdfs.private   1     1.7 GB
>>> …5050-16506-1133090   FooStreaming  dnode-2.hdfs.private   5     5.2 GB
>>> …5050-16506-1133089   FooStreaming  dnode-1.hdfs.private   6.5   6.3 GB
>>> …5050-16506-1133088   FooStreaming  dnode-4.hdfs.private   1     251 MB
>>> …5050-16506-1133087   FooStreaming  dnode-0.hdfs.private   6.4   6.8 GB
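>>>
>>> For reference, those caps are set on the driver side; a trimmed-down
>>> sketch of how we configure it (the Mesos master URL below is just a
>>> placeholder, not our real ensemble):
>>>
>>> import org.apache.spark.{SparkConf, SparkContext}
>>>
>>> val conf = new SparkConf()
>>>   .setAppName("FooStreaming")
>>>   .setMaster("mesos://zk://...")       // placeholder master URL
>>>   .set("spark.cores.max", "4")         // cap on total cores for the job
>>>   .set("spark.executor.memory", "3g")  // memory per executor
>>> val sc = new SparkContext(conf)
>>>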
>>> The only way to release the resources is to manually find the process on
>>> the cluster and kill it. The affected jobs are usually streaming jobs, but
>>> batch jobs show the same behavior; we simply run more streaming than batch
>>> jobs, so the stats are biased.
>>> Any idea what's going on here? Hopefully it's some very bad, ugly bug that
>>> has already been fixed and that will urge us to upgrade our infra?
>>>
>>> Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0
>>>
>>> -kr, Gerard.
>>>
>>
>>
>
