[ https://issues.apache.org/jira/browse/MESOS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timothy St. Clair resolved MESOS-1746.
--------------------------------------

    Resolution: Fixed
 Fix Version/s: 0.21.0

commit 8538eed683eea99a340ff5272205113db0580a25
Author: Chengwei Yang <chengwei.yang...@gmail.com>
Date:   Wed Oct 15 14:12:26 2014 -0500

    Delete framework data in TaskStatus to avoid OOM

    A bug was found where Spark uses TaskStatus.data to transfer computed
    results, so the RES memory of mesos-master keeps growing quickly until
    the process is killed by the OOM killer.

    Review: https://reviews.apache.org/r/25184

> clear TaskStatus data to avoid OOM
> ----------------------------------
>
>                 Key: MESOS-1746
>                 URL: https://issues.apache.org/jira/browse/MESOS-1746
>             Project: Mesos
>          Issue Type: Bug
>         Environment: mesos-0.19.0
>            Reporter: Chengwei Yang
>            Assignee: Chengwei Yang
>             Fix For: 0.21.0
>
>
> Spark on Mesos may use TaskStatus to transfer computed results between the
> worker and the scheduler, as in the following source code (Spark 1.0.2):
> {code}
> val serializedResult = {
>   if (serializedDirectResult.limit >= execBackend.akkaFrameSize() -
>       AkkaUtils.reservedSizeBytes) {
>     logInfo("Storing result for " + taskId + " in local BlockManager")
>     val blockId = TaskResultBlockId(taskId)
>     env.blockManager.putBytes(
>       blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
>     ser.serialize(new IndirectTaskResult[Any](blockId))
>   } else {
>     logInfo("Sending result for " + taskId + " directly to driver")
>     serializedDirectResult
>   }
> }
> {code}
> In our test environment, we increased akkaFrameSize from its default value
> (10MB) to 128MB, and this caused our mesos-master process to be OOM-killed
> within tens of minutes when running Spark tasks in fine-grained mode.
> As the code shows, any result smaller than akkaFrameSize minus
> AkkaUtils.reservedSizeBytes is sent directly to the driver inside
> TaskStatus.data, passing through mesos-master. So even with akkaFrameSize
> set back to the default (10MB), mesos-master is still very likely to run
> out of memory, only more slowly.
> I therefore think it is better to delete the data from TaskStatus: this
> field is intended only for the framework running on top of Mesos, and the
> master has no use for it.
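The idea behind the fix can be illustrated with a minimal Scala sketch. It assumes a simplified master that keeps only the most recent TaskStatus per task; TaskStatus, StatusStore and stripData are hypothetical names chosen for illustration, and the actual change lives in the C++ mesos-master, which is not reproduced here. In this sketch only the retained copy loses its data field, while the update forwarded to the scheduler keeps it, so the framework still receives its result.

```scala
import scala.collection.mutable

// Hypothetical model: TaskStatus, StatusStore and stripData are illustrative
// names, not the Mesos API. The real master is C++ and stores protobufs.
case class TaskStatus(taskId: String, state: String, data: Array[Byte])

class StatusStore(stripData: Boolean) {
  // Latest retained status per task id.
  private val latest = mutable.Map.empty[String, TaskStatus]

  // Retains a copy of the update and returns the update to forward to the
  // scheduler. Only the retained copy loses its payload; the forwarded one
  // keeps it unchanged.
  def update(status: TaskStatus): TaskStatus = {
    val kept =
      if (stripData) status.copy(data = Array.emptyByteArray) else status
    latest(status.taskId) = kept
    status
  }

  // Total payload bytes the store is still holding on to.
  def retainedDataBytes: Long =
    latest.valuesIterator.map(_.data.length.toLong).sum
}

val keepAll = new StatusStore(stripData = false)
val stripped = new StatusStore(stripData = true)

for (i <- 1 to 100) {
  // Each task gets its own 1 KiB stand-in for a serialized result.
  val status = TaskStatus(s"task-$i", "TASK_FINISHED", new Array[Byte](1024))
  keepAll.update(status)
  val forwarded = stripped.update(status)
  assert(forwarded.data.length == 1024) // the scheduler still gets the payload
}

println(keepAll.retainedDataBytes)  // prints 102400
println(stripped.retainedDataBytes) // prints 0
```

Without stripping, retained memory grows with both the number of tasks and the size of each result, which matches the growth described above once results approach a 128MB frame size; with stripping, it stays independent of result size.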
-- This message was sent by Atlassian JIRA (v6.3.4#6332)