[ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anand Mazumdar updated MESOS-3302:
----------------------------------
    Comment: was deleted

(was: [~guoger] Thanks for testing out the new API. Here are the answers to your queries:

- Restart leading master
  - For a non-HA cluster, the behavior is expected. The scheduler library does not currently follow a redirect but merely relies on the {{detector}} to let it know of a new master. So, the behavior is expected and works correctly for an HA cluster, as you pointed out.
  - We want to fix this behavior, i.e. ensure there is a delay upon (re-)connection: https://issues.apache.org/jira/browse/MESOS-5359
- Restart agent
  - Currently, the long lived framework does not support moving existing tasks across agents. However, it would be good to test that the executor is correctly recovered upon agent restart with checkpointing enabled. If checkpointing is disabled, the executor should kill itself.
  - Also, restarting the agent with {{--http_command_executor}} enabled/disabled should still successfully recover all the executors.
- Emulate network partitions
  - I am assuming that when you say "the framework hangs", you just mean that it does not have anything to do?
  - "However there was once that agent keeps launching new tasks without framework being aware of it during partition." This is expected. If a framework is partitioned from the master after sending {{LAUNCH}} messages, the agent would still go ahead and launch them. The framework would receive the status updates for the running tasks upon re-registering, since the agent keeps retrying the updates every 10 mins. We currently do not implement any reconciliation in the long running framework.
  - Also, it would be good to test the other one-way partition, i.e. the framework is partitioned away from the master.

Also, to reduce noise here on this improvement JIRA, should we create a Google doc with the testing details and link it to the JIRA? I would also add my own testing details to that doc so that everything is consolidated in one place.
If it's easier for you, I can create the doc myself and you can then add the details to it. Let me know what works for you.
)

> Scheduler API v1 improvements
> -----------------------------
>
>                 Key: MESOS-3302
>                 URL: https://issues.apache.org/jira/browse/MESOS-3302
>             Project: Mesos
>          Issue Type: Epic
>            Reporter: Marco Massenzio
>              Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the
> {{HTTP API}} MVP epic (MESOS-2288) which was released initially with Mesos
> {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what
> we would regard as "Production-ready" state in preparation for the {{1.0.0}}
> release.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)