[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements
[ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300780#comment-15300780 ]

Anand Mazumdar commented on MESOS-3302:
---------------------------------------

Created https://issues.apache.org/jira/browse/MESOS-5457 to track this.

> Scheduler API v1 improvements
> -----------------------------
>
>         Key: MESOS-3302
>         URL: https://issues.apache.org/jira/browse/MESOS-3302
>     Project: Mesos
>  Issue Type: Epic
>    Reporter: Marco Massenzio
>    Assignee: Marco Massenzio
>      Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the
> {{HTTP API}} MVP epic (MESOS-2288), which was initially released with Mesos
> {{0.24.0}}. The tasks/stories here cover the work needed to bring API v1 to
> what we would regard as a "production-ready" state, in preparation for the
> {{1.0.0}} release.
[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements
[ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300660#comment-15300660 ]

Vinod Kone commented on MESOS-3302:
-----------------------------------

[~anandmazumdar] Why don't you create the ticket and start the doc? [~guoger] and others can add to it.
[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements
[ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300279#comment-15300279 ]

Anand Mazumdar commented on MESOS-3302:
---------------------------------------

[~guoger] Thanks for testing out the new API. Here are the answers to your queries:

- Restart leading master
  - For a non-HA cluster, the behavior is expected. The scheduler library does not currently follow a redirect but merely relies on the {{detector}} to let it know of a new master. So the behavior is expected, and it works correctly for an HA cluster, as you pointed out.
  - We do want to fix the reconnection behavior, i.e., ensure there is a delay upon (re-)connection: https://issues.apache.org/jira/browse/MESOS-5359
- Restart agent
  - Currently, the long-lived framework does not support moving existing tasks across agents. However, it would be good to test that the executor is correctly recovered upon agent restart with checkpointing enabled. If checkpointing is disabled, it should kill itself.
  - Also, restarting the agent with {{--http_command_executor}} enabled/disabled should still successfully recover all the executors.
- Emulate network partitions
  - I am assuming that when you say "the framework hangs", you just mean that it does not have anything to do?
  - "However there was once that agent keeps launching new tasks without framework being aware of it during partition." This is expected. If a framework is partitioned from the master after sending {{LAUNCH}} messages, the agent would still go ahead and launch them. The framework would receive the status updates for the running tasks upon re-registering, since the agent keeps retrying the updates every 10 minutes. We currently do not implement any reconciliation in the long-running framework (a sketch of an explicit reconciliation call follows below).
  - Also, it would be good to test the other one-way partition, i.e., the framework being partitioned away from the master.

Also, to reduce noise here on this improvement JIRA, should we create a Google doc with the testing details and link it to the JIRA? I would add my own testing details to that doc as well, so everything is consolidated in one place. If it's easier for you, I can create the doc myself and you can then add your details to it. Let me know what works for you.
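For illustration only, here is a minimal sketch of what an explicit reconciliation call against the v1 scheduler endpoint could look like. This is not the long-lived framework's actual code; the master address, framework ID, and stream ID below are placeholders, and an empty task list requests implicit reconciliation (status updates for every task the master knows about for this framework).

{code:python}
# Hedged sketch: explicit reconciliation against the v1 scheduler API.
# The master address, framework ID, and stream ID are placeholders.
import requests

SCHEDULER_API = "http://localhost:5050/api/v1/scheduler"  # assumed master address
FRAMEWORK_ID = "<framework id from the SUBSCRIBED event>"  # placeholder
STREAM_ID = "<Mesos-Stream-Id header from the SUBSCRIBE response>"  # placeholder

call = {
    "framework_id": {"value": FRAMEWORK_ID},
    "type": "RECONCILE",
    # An empty task list asks for implicit reconciliation, i.e. status
    # updates for all tasks the master knows about for this framework.
    "reconcile": {"tasks": []},
}

response = requests.post(
    SCHEDULER_API,
    json=call,
    headers={"Mesos-Stream-Id": STREAM_ID},
)

# The master acknowledges the call; the resulting TASK_* status updates
# arrive as UPDATE events on the original SUBSCRIBE event stream.
print(response.status_code)
{code}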
[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements
[ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300176#comment-15300176 ]

Jay Guo commented on MESOS-3302:
--------------------------------

[~vinodkone] We are manually testing the HTTP APIs now; here are some observations:

*Cluster setup:*
* Bring up 3 masters, 3 agents, 3 ZooKeepers.
* Agents are started with the --use_http_command_executor flag (which uses the HTTP command executor).
* Start the long-lived framework (which uses the HTTP scheduler API).

*Test cases:*

* Restart leading master
_The framework is started with {{--master=}}. Therefore, it always talks to a fixed master, regardless of whether that master is the leader or a follower._
*Expected:* {{307 Temporary Redirect}}; the scheduler handles the redirect and talks to the real leading master, transparently to the framework.
*Actual:* It reports this back to the framework. Is this intended behaviour? On the other hand, when the framework is started with {{--master=zk://...}} it correctly handles master detection and resumes once a new leading master is elected, although master detection happens continuously without a break. Should we consider introducing an interval? (A small probe for observing the redirect is sketched after this comment.)

* Restart agent
*Expected:* Workload is migrated to other agents if the current agent is down for longer than the timeout and is therefore removed. If the agent is resurrected within the timeout, it resumes the tasks.
*Actual:* The framework keeps waiting for the agent to recover. It does resume working if the agent is back in time; otherwise, it keeps waiting indefinitely. I guess this is reasonable, since the long-lived framework declines other offers, which will not be offered to this framework again. I don't see an option to expire the decline-offer filter, though, or am I missing something? There are also cases where the agent resumes running tasks for a little while and is then _asked to terminate_ by the master. This is somewhat flaky; it needs further investigation.

* Restart long-lived framework
*Expected:* Recover
*Actual:* Recover

* Restart all masters at once
Same behaviour as _restarting the leading master_.

* Emulate network partitions (one-way and two-way) between the long-lived framework and the master
_The network partition is emulated at the TCP layer using the iptables rule {{iptables -A INPUT -p tcp -s --dport 5050 -j DROP}}._
** One-way: Master <--X-- Framework. In most cases it works as expected: the framework simply hangs, and the agent keeps resending messages since acknowledgements are blocked. When the block is lifted, everything resumes working. However, there was one run where the agent kept launching new tasks during the partition without the framework being aware of it. We need to find a way to reproduce it; I guess it has something to do with the state at the moment the network is cut.
** Two-way: WIP

* Restart leading ZooKeeper: WIP

* Restart all ZooKeepers at once: WIP
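As a minimal way to observe the redirect behaviour described above, one can POST a SUBSCRIBE call directly to a non-leading master with automatic redirects disabled. This is only a sketch; the non-leading master address and framework name below are made up for illustration.

{code:python}
# Hedged sketch: probing the 307 redirect from a non-leading master.
# The master address and framework name are placeholders.
import requests

NON_LEADER = "http://master-2.example.org:5050/api/v1/scheduler"  # assumed address

subscribe = {
    "type": "SUBSCRIBE",
    "subscribe": {
        "framework_info": {"user": "root", "name": "redirect-probe"},
    },
}

# Disable automatic redirect handling so the 307 stays visible, and stream
# the response because a successful SUBSCRIBE keeps the connection open.
response = requests.post(
    NON_LEADER,
    json=subscribe,
    allow_redirects=False,
    stream=True,
)

if response.status_code == 307:
    # The Location header points at the current leading master.
    print("Redirected to leader:", response.headers.get("Location"))
else:
    print("Status:", response.status_code)
{code}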
[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements
[ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170171#comment-15170171 ]

Vinod Kone commented on MESOS-3302:
-----------------------------------

Made the following changes:
* Renamed this epic to "Scheduler API v1 improvements"
* Moved executor-related tickets to the new "Executor API v1" epic
* Moved operator-endpoint-related tickets to the new "Operator API v1" epic