[ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300279#comment-15300279 ]

Anand Mazumdar commented on MESOS-3302:
---------------------------------------

[~guoger] Thanks for testing out the new API. Here are the answers to your 
queries:

- Restart leading master
  - For a non-HA cluster, the behavior is expected. The scheduler library does 
not currently follow a redirect; it relies solely on the {{detector}} to learn 
of a new master. So the behavior is expected, and it works correctly for an HA 
cluster, as you pointed out.
  - We want to fix the reconnection behavior, i.e. ensure there is a delay upon 
(re-)connection. https://issues.apache.org/jira/browse/MESOS-5359
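
As a rough illustration of what a delay upon (re-)connection could look like, 
here is a minimal Python sketch. It is not the scheduler library's code: 
{{backoff_delays}}, {{base}}, {{cap}} and the jittered exponential policy are 
all assumptions chosen for illustration, and MESOS-5359 may settle on a 
different policy.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one delay (in seconds) per reconnection attempt.

    Hypothetical helper: each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so retries spread out over time
    and never exceed `cap`.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        upper = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, upper))
    return delays


if __name__ == "__main__":
    # Seeded only so that repeated runs print the same schedule.
    for i, delay in enumerate(backoff_delays(5, rng=random.Random(42))):
        print(f"attempt {i}: wait {delay:.2f}s before reconnecting")
```

The random jitter is there so that many schedulers reconnecting to a restarted 
master do not all retry at the same instant.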

- Restart agent
  - Currently, the long lived framework does not support moving existing tasks 
across agents. However, it would be good to test that the executor is correctly 
recovered upon agent restart with checkpointing enabled. If checkpointing is 
disabled, the executor should kill itself.
  - Also, restarting the agent with {{--http_command_executor}} enabled or 
disabled should still successfully recover all the executors.

- Emulate network partitions
  - I am assuming that when you say "the framework hangs", you just mean that 
it does not have anything to do?
  - "However there was once that agent keeps launching new tasks without 
framework being aware of it during partition."
      This is expected. If a framework is partitioned from the master after 
sending {{LAUNCH}} messages, the agent would still go ahead and launch the 
tasks. The framework would receive the status updates for the running tasks 
upon re-registering, since the agent keeps retrying the updates every 10 mins. 
We currently do not implement any reconciliation in the long running framework.
  - Also, it would be good to test the other one way partition, i.e. the 
framework is partitioned away from the master.
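
The retry behavior described above can be sketched as a toy simulation. This 
is Python for illustration only, not Mesos code: {{deliver_updates}}, 
{{is_connected}} and the tick-based clock are hypothetical, and the retry 
interval is a parameter (the 10 mins mentioned above would correspond to 
{{retry_interval=600}} with one tick per second).

```python
def deliver_updates(pending, is_connected, retry_interval, max_ticks):
    """Simulate an agent retrying status updates until they are delivered.

    pending:        status updates awaiting delivery
    is_connected:   function tick -> bool, True when the framework is reachable
    retry_interval: ticks between retry attempts (must be positive)
    max_ticks:      last tick at which a retry may happen
    Returns a list of (tick, update) pairs in delivery order.
    """
    delivered = []
    queue = list(pending)
    for tick in range(0, max_ticks + 1, retry_interval):
        if not queue:
            break
        if is_connected(tick):
            # The partition has healed: every queued update goes through.
            delivered.extend((tick, update) for update in queue)
            queue.clear()
        # Otherwise the updates stay queued until the next retry.
    return delivered


if __name__ == "__main__":
    # The framework is partitioned away until tick 25.
    result = deliver_updates(
        pending=["task-1 TASK_RUNNING", "task-2 TASK_RUNNING"],
        is_connected=lambda tick: tick >= 25,
        retry_interval=10,
        max_ticks=60,
    )
    print(result)
```

In this run the retries happen at ticks 0, 10, 20 and 30, so both updates are 
delivered at tick 30, the first retry after the partition heals.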

Also, to reduce noise on this improvement JIRA, how about we create a Google 
doc with the testing details and link it to the JIRA? I would also add the 
details of the testing I have done to that doc, so that everything is 
consolidated in one place. If it's easier for you, I can create the doc myself 
and you can then add your details to it. Let me know what works for you.


> Scheduler API v1 improvements
> -----------------------------
>
>                 Key: MESOS-3302
>                 URL: https://issues.apache.org/jira/browse/MESOS-3302
>             Project: Mesos
>          Issue Type: Epic
>            Reporter: Marco Massenzio
>              Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the 
> {{HTTP API}} MVP epic (MESOS-2288) which was released initially with Mesos 
> {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what 
> we would regard as "Production-ready" state in preparation for the {{1.0.0}} 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
