[ https://issues.apache.org/jira/browse/MESOS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374121#comment-15374121 ]
ASF GitHub Bot commented on MESOS-3307:
---------------------------------------

Github user jfarrell commented on the issue:

    https://github.com/apache/mesos/pull/82

    Closing per request at https://s.apache.org/V8r3

> Configurable size of completed task / framework history
> -------------------------------------------------------
>
>                 Key: MESOS-3307
>                 URL: https://issues.apache.org/jira/browse/MESOS-3307
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Ian Babrou
>            Assignee: Kevin Klues
>              Labels: mesosphere
>             Fix For: 0.24.2, 0.25.1, 0.26.1, 0.27.0
>
>
> We are trying to make Mesos work with multiple frameworks and mesos-dns at
> the same time. The goal is to have a set of frameworks per team / project on
> a single Mesos cluster.
> At this point our Mesos state.json is at 4 MB and it takes a while to
> assemble. 5 mesos-dns instances hit state.json every 5 seconds, effectively
> pushing mesos-master CPU usage through the roof. It's at 100%+ all the time.
> Here's the problem:
> {noformat}
> mesos λ curl -s http://mesos-master:5050/master/state.json | jq .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n
>       1 "20150606-001827-252388362-5050-5982-0003"
>      16 "20150606-001827-252388362-5050-5982-0005"
>      18 "20150606-001827-252388362-5050-5982-0029"
>      73 "20150606-001827-252388362-5050-5982-0007"
>     141 "20150606-001827-252388362-5050-5982-0009"
>     154 "20150820-154817-302720010-5050-15320-0000"
>     289 "20150606-001827-252388362-5050-5982-0004"
>     510 "20150606-001827-252388362-5050-5982-0012"
>     666 "20150606-001827-252388362-5050-5982-0028"
>     923 "20150116-002612-269165578-5050-32204-0003"
>    1000 "20150606-001827-252388362-5050-5982-0001"
>    1000 "20150606-001827-252388362-5050-5982-0006"
>    1000 "20150606-001827-252388362-5050-5982-0010"
>    1000 "20150606-001827-252388362-5050-5982-0011"
>    1000 "20150606-001827-252388362-5050-5982-0027"
> mesos λ fgrep 1000 -r src/master
> src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 100000;
> src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = 1000;
> {noformat}
> Active tasks are just 6% of the state.json response:
> {noformat}
> mesos λ cat ~/temp/mesos-state.json | jq -c . | wc
>        1   14796 4138942
> mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc
>       16      37  252774
> {noformat}
> I see four options that can improve the situation:
> 1. Add a query string param to exclude completed tasks from state.json and
> use it in mesos-dns and similar tools. There is no need for mesos-dns to know
> about completed tasks; it's just extra load on the master and mesos-dns.
> 2. Make the history size configurable.
> 3. Make JSON serialization faster. With 10000s of tasks, even without history
> it would take a lot of time to serialize tasks for mesos-dns. Doing it every
> 60 seconds instead of every 5 seconds isn't really an option.
> 4. Create an event bus for the Mesos master. Marathon has one and it'd be
> nice to have it in Mesos. This way mesos-dns could avoid polling master state
> and switch to listening for events.
> All can be done independently.
> Note to mesosphere folks: please start distributing debug symbols with your
> distribution. I have been asking for this for a while and it is really
> helpful:
> https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501
> Perf report for leading master:
> !http://i.imgur.com/iz7C3o0.png!
> I'm on 0.23.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
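
For anyone who wants to reproduce the two measurements from the issue description without jq, here is a minimal Python sketch. It assumes only the state.json layout implied by the jq filters above (`frameworks[].completed_tasks[].framework_id` and `frameworks[].tasks`). The function names are made up for illustration, and the size figure is only an approximation of what `jq -c . | wc` reports.

```python
import json
from collections import Counter


def completed_task_counts(state):
    """Count completed tasks per framework_id (like `sort | uniq -c`)."""
    counts = Counter()
    for framework in state.get("frameworks", []):
        for task in framework.get("completed_tasks", []):
            counts[task["framework_id"]] += 1
    return counts


def active_task_share(state):
    """Approximate fraction of the compact JSON size taken by active tasks."""
    compact = {"separators": (",", ":")}
    total = len(json.dumps(state, **compact))
    active = sum(
        len(json.dumps(framework.get("tasks", []), **compact))
        for framework in state.get("frameworks", [])
    )
    return active / total
```

To use it, load a saved copy of the state (for example the `~/temp/mesos-state.json` file from the description) with `json.load` and pass the resulting dict to both functions. With the numbers quoted above, the share comes out at about 252774 / 4138942, or roughly 6%.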