[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements

2016-05-25 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300780#comment-15300780
 ] 

Anand Mazumdar commented on MESOS-3302:
---

Created https://issues.apache.org/jira/browse/MESOS-5457 to track this.

> Scheduler API v1 improvements
> -
>
> Key: MESOS-3302
> URL: https://issues.apache.org/jira/browse/MESOS-3302
> Project: Mesos
>  Issue Type: Epic
>Reporter: Marco Massenzio
>  Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the 
> {{HTTP API}} MVP epic (MESOS-2288) which was released initially with Mesos 
> {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what 
> we would regard as "Production-ready" state in preparation for the {{1.0.0}} 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements

2016-05-25 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300660#comment-15300660
 ] 

Vinod Kone commented on MESOS-3302:
---

[~anandmazumdar] Why don't you create the ticket and start the doc? [~guoger] 
and others can add to it.

> Scheduler API v1 improvements
> -
>
> Key: MESOS-3302
> URL: https://issues.apache.org/jira/browse/MESOS-3302
> Project: Mesos
>  Issue Type: Epic
>Reporter: Marco Massenzio
>  Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the 
> {{HTTP API}} MVP epic (MESOS-2288) which was released initially with Mesos 
> {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what 
> we would regard as "Production-ready" state in preparation for the {{1.0.0}} 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements

2016-05-25 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300286#comment-15300286
 ] 

Anand Mazumdar commented on MESOS-3302:
---

Jay Guo Thanks for testing out the new API. Here are the answer to your queries:
* Restart leading master
** For a non HA cluster, the behavior is expected. The scheduler library does 
not currently follow a redirect but merely relies on the detector to let it 
know of a new master. So, the behavior is expected and correctly works for a HA 
cluster as you pointed out.
** We want to fix the behavior i.e. ensure there is a delay upon 
(re-)connection. https://issues.apache.org/jira/browse/MESOS-5359

* Restart agent
** Currently, the long lived framework does not support moving existing tasks 
across agents. However, it would be good to test that the executor is correctly 
recovered upon agent restart with checkpointing enabled. If checkpointing is 
disabled, it should kill itself.
** Also, restarting the agent with --http_command_executor enabled/disabled, 
should still successfully recover all the executors.

* Emulate network partitions
** I am assuming that when you say "the framework hangs", you just means that 
it does not have anything to do?
** "However there was once that agent keeps launching new tasks without 
framework being aware of it during partition."
This is expected. If a framework is partitioned from the master after sending 
LAUNCH messages, the agent would still go ahead and launch them. The framework 
would receive the status updates for the running tasks upon re-registering 
since then agent keeps retrying the updates every 10 mins. We currently do not 
implement any reconciliation in the long running framework.
** Also, it would be good to test the other one way partition, i.e. the 
framework is partitioned away from the master.

To reduce noise here on this improvement JIRA, we should create a google doc 
with the testing details and link it to the JIRA? I would also add the testing 
details done by me to that doc and consolidate them at one place. If it's 
easier for you, I can create the doc myself and you can then add the details to 
it. Let me know what works for you.

> Scheduler API v1 improvements
> -
>
> Key: MESOS-3302
> URL: https://issues.apache.org/jira/browse/MESOS-3302
> Project: Mesos
>  Issue Type: Epic
>Reporter: Marco Massenzio
>  Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the 
> {{HTTP API}} MVP epic (MESOS-2288) which was released initially with Mesos 
> {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what 
> we would regard as "Production-ready" state in preparation for the {{1.0.0}} 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements

2016-05-25 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300279#comment-15300279
 ] 

Anand Mazumdar commented on MESOS-3302:
---

[~guoger] Thanks for testing out the new API. Here are the answer to your 
queries:

- Restart leading master
  - For a non HA cluster, the behavior is expected. The scheduler library does 
not currently follow a redirect but merely relies on the {{detector}} to let it 
know of a new master. So, the behavior is expected and correctly works for a HA 
cluster as you pointed out.
  - We want to fix the behavior i.e. ensure there is a delay upon 
(re-)connection. https://issues.apache.org/jira/browse/MESOS-5359

- Restart agent
- Currently, the long lived framework does not support moving existing tasks 
across agents. However, it would be good to test that the executor is correctly 
recovered upon agent restart with checkpointing enabled. If checkpointing is 
disabled, it should kill itself.
- Also, restarting the agent with {{--http_command_executor}} enabled/disabled, 
should still successfully recover all the executors.

- Emulate network partitions
  -  I am assuming that when you say "the framework hangs", you just means that 
it does not have anything to do?
  - "However there was once that agent keeps launching new tasks without 
framework being aware of it during partition."
  This is expected. If a framework is partitioned from the master after 
sending  {{LAUNCH}} messages, the agent would still go ahead and launch them. 
The framework would receive the status updates for the running tasks upon 
re-registering since then agent keeps retrying the updates every 10 mins. We 
currently do not implement any reconciliation in the long running framework.
  - Also, it would be good to test the other one way partition, i.e. the 
framework is partitioned away from the master.

Also, to reduce noise here on this improvement JIRA, we should create a google 
doc with the testing details and link it to the JIRA? I would also add the 
testing details done by me to that doc and consolidate them at one place. If 
it's easier for you, I can create the doc myself and you can then add the 
details to it. Let me know what works for you.


> Scheduler API v1 improvements
> -
>
> Key: MESOS-3302
> URL: https://issues.apache.org/jira/browse/MESOS-3302
> Project: Mesos
>  Issue Type: Epic
>Reporter: Marco Massenzio
>  Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the 
> {{HTTP API}} MVP epic (MESOS-2288) which was released initially with Mesos 
> {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what 
> we would regard as "Production-ready" state in preparation for the {{1.0.0}} 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements

2016-05-25 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300176#comment-15300176
 ] 

Jay Guo commented on MESOS-3302:


[~vinodkone]

We are manually testing HTTP APIs now and here are some observations:

*Cluster setup:*
* Bring up 3 masters, 3 agents, 3 zookeepers
* Agents should be started with --use_http_command_executor flag (which uses 
http command executor)
* Start long lived framework (which uses http scheduler api)

*Test cases:*
* Restart leading master
_The framework is started with {{--master=}}. Therefore, it always 
talks to fixed master no matter being leader or follower._ 
*Expected:* {{307 Temporary Redirect}} and scheduler actually handles redirect 
and talks to real leader master, and these should be transparent to framework
*Actual:* It reports this back to framework.
Is this intended behaviour? On the other hand, when framework is started with 
--master=zk://... it correctly handles master detection and resumes when new 
leader master is elected. Although master detection happens continuously 
without a break. Do we consider to introduce an interval?

* Restart agent
*Expected:* Workload is migrated to other agents if current agent is down for a 
period longer than timeout, therefore removed. If agent is resurrected within 
the timeout, it resumes the tasks.
*Actual:* Framework keeps waiting for the agent to recover. It does resume 
working if agent is back in time. Otherwise, it keeps waiting indefinitely.
I guess this is reasonable since that long-lived-framework declines other 
offers, which will not be offered again to this framework. I don't see there's 
an option to expire the decline-offer-filter though, or am I missing something?
There are also chances that the agent resumes running tasks for a little while 
and then _asked to terminate_ by master. This is somewhat flaky, need to 
investigate further.

* Restart long lived framework
*Expected:* Recover
*Actual:* Recover

* Restart all masters at once
Same behaviour as _restarting leading master_

* Emulate network partitions (1 way - 2 way) between long lived framework and 
master
_network partition is emulated at tcp layer using iptables rule {{iptables -A 
INPUT -p tcp -s  -dport 5050 -j DROP}}
** One-way: Master <--X-- Framework
For most cases it works as expected: framework simply hangs. Agent keeps 
resending messages since acknowledgements are blocked. When block is lifted, 
everything resumes to work. However there was once that agent keeps launching 
new tasks without framework being aware of it during partition. Need to find a 
way to reproduce it. I guess it has something to do with the status when 
network is cut.
** Two-way: WIP

* Restart leading Zookeeper
WIP

* Restart all Zookeepers at once
WIP

> Scheduler API v1 improvements
> -
>
> Key: MESOS-3302
> URL: https://issues.apache.org/jira/browse/MESOS-3302
> Project: Mesos
>  Issue Type: Epic
>Reporter: Marco Massenzio
>  Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the 
> {{HTTP API}} MVP epic (MESOS-2288) which was released initially with Mesos 
> {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what 
> we would regard as "Production-ready" state in preparation for the {{1.0.0}} 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements

2016-02-26 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170171#comment-15170171
 ] 

Vinod Kone commented on MESOS-3302:
---

Made the following changes:

Renamed this epic to "Scheduler API v1 improvements"

Moved executor related tickets to new "Executor API v1" epic

Moved operator endpoint related tickets to new "Operator API v1" epic

> Scheduler API v1 improvements
> -
>
> Key: MESOS-3302
> URL: https://issues.apache.org/jira/browse/MESOS-3302
> Project: Mesos
>  Issue Type: Epic
>Reporter: Marco Massenzio
>Assignee: Marco Massenzio
>  Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the 
> {{HTTP API}} MVP epic (MESOS-2288) which was released initially with Mesos 
> {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what 
> we would regard as "Production-ready" state in preparation for the {{1.0.0}} 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)