[jira] [Commented] (MESOS-9501) Mesos executor fails to terminate and gets stuck after agent host reboot.

2019-01-11 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740141#comment-16740141
 ] 

Qian Zhang commented on MESOS-9501:
---

RR: https://reviews.apache.org/r/69705/

> Mesos executor fails to terminate and gets stuck after agent host reboot.
> -
>
> Key: MESOS-9501
> URL: https://issues.apache.org/jira/browse/MESOS-9501
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Meng Zhu
>Assignee: Qian Zhang
>Priority: Critical
>
> When an agent host reboots, all of its containers are gone but the agent will 
> still try to recover from its checkpointed state after reboot.
> The agent will soon discover that all the cgroup hierarchies are gone and 
> assume (correctly) that the containers are destroyed.
> However, when trying to terminate the executor, the agent will first try to 
> wait for the exit status of its container:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
> The agent does so by calling `waitpid` on the checkpointed child process pid.
> If, after the agent host reboot, a new process with the same pid gets
> spawned, then the agent will wait for the wrong process. The wait can stay
> stuck until that unrelated process exits, see `ReaperProcess::wait()`:
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
> This will block the executor termination as well as future task status
> updates (e.g. the master might still think the task is running).
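
One way to guard against the pid reuse described above is to record the host's boot ID next to the checkpointed pid and to skip the wait when the boot ID has changed. The following is a minimal sketch under that assumption, not the change posted for review; `CheckpointedExecutor`, `shouldReap`, and the boot ID strings are hypothetical names used only for illustration.

```cpp
#include <string>
#include <sys/types.h>

// Hypothetical checkpoint record: the executor's pid plus the boot ID of the
// host at the time the pid was written (illustrative, not a Mesos type).
struct CheckpointedExecutor {
  pid_t pid;
  std::string bootId;
};

// Returns true only if the checkpointed pid can still refer to the process
// the agent forked. After a reboot the boot ID differs, so any process that
// now owns the pid is unrelated and must not be waited on.
bool shouldReap(const CheckpointedExecutor& checkpoint,
                const std::string& currentBootId) {
  return checkpoint.pid > 0 && checkpoint.bootId == currentBootId;
}
```

On Linux the current boot ID can be read from `/proc/sys/kernel/random/boot_id`; how the agent would obtain and checkpoint it is left open here.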



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2019-01-11 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740239#comment-16740239
 ] 

Benjamin Bannier commented on MESOS-9223:
-

Reviews:

 [r/69606/|https://reviews.apache.org/r/69606/]
 [r/69719/|https://reviews.apache.org/r/69719/]

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle
> launch failures or task errors of its standalone containers well enough. If,
> e.g., an RP container fails to come up during node start, a warning would be
> logged, but an operator still needs to detect degraded functionality,
> manually check the state of containers with {{GET_CONTAINERS}}, and decide
> whether the agent needs restarting; I suspect they do not always have
> enough context for this decision. It would be better if the provider
> either enforced a restart by failing over the whole agent or retried the
> operation (optionally up to some maximum number of retries).
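
The two options named above (bounded retries, then failing over the agent) can be combined. The sketch below shows one possible shape, assuming the launch attempt is exposed as a callable returning success or failure; `LaunchOutcome` and `launchWithRetries` are hypothetical names, not part of the storage local resource provider.

```cpp
#include <functional>

// Outcome of trying to bring up a provider container (hypothetical type).
enum class LaunchOutcome { Launched, FailOverAgent };

// Calls `launch` until it succeeds or `maxRetries` additional attempts have
// been used up. If every attempt fails, the caller is told to fail over the
// agent instead of leaving the provider silently degraded.
LaunchOutcome launchWithRetries(const std::function<bool()>& launch,
                                int maxRetries) {
  for (int attempt = 0; attempt <= maxRetries; ++attempt) {
    if (launch()) {
      return LaunchOutcome::Launched;
    }
  }
  return LaunchOutcome::FailOverAgent;
}
```

A real implementation would likely add a backoff between attempts; it is omitted here to keep the control flow visible.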





[jira] [Commented] (MESOS-9394) Maintenance of machine A causes "Removing offers" for machine B.

2019-01-11 Thread longfei (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740255#comment-16740255
 ] 

longfei commented on MESOS-9394:


[~arojas] Hi, could you take a look at this, please?

I think 
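{code:cpp}
// Only reset machines that are not already UP, so that machines outside the
// maintenance schedule keep their outstanding offers.
if (master->machines[id].info.mode() != MachineInfo::UP) {
  master->machines[id].info.set_mode(MachineInfo::UP);
  master->updateUnavailability(id, None());
}
{code}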
will fix this.
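
The effect of the guard can be modelled in isolation. The sketch below is a simplified stand-in rather than Mesos code: `Master`, `Mode`, and `resetUnscheduled` are hypothetical, and `updateUnavailability` merely records which machines would have had their offers removed.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

enum class Mode { UP, DRAINING, DOWN };

// Simplified stand-in for the master's per-machine state (hypothetical).
struct Master {
  std::map<std::string, Mode> machines;
  // Machines whose offers would have been removed.
  std::vector<std::string> unavailabilityUpdates;

  void updateUnavailability(const std::string& id) {
    unavailabilityUpdates.push_back(id);
  }
};

// Machines dropped from the schedule return to UP; machines that were
// already UP are left alone, so their outstanding offers survive.
void resetUnscheduled(Master& master, const std::set<std::string>& scheduled) {
  for (auto& entry : master.machines) {
    if (scheduled.count(entry.first) > 0) {
      continue;
    }
    if (entry.second != Mode::UP) {
      entry.second = Mode::UP;
      master.updateUnavailability(entry.first);
    }
  }
}
```

With machines A and B in UP mode and only A scheduled, B triggers no unavailability update in this model, which is the behaviour the issue asks for.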

> Maintenance of machine A causes "Removing offers" for machine B.
> 
>
> Key: MESOS-9394
> URL: https://issues.apache.org/jira/browse/MESOS-9394
> Project: Mesos
>  Issue Type: Bug
>Reporter: longfei
>Priority: Major
>  Labels: maintenance
>
> If I schedule machine A in a maintenance call, the logic in
> "___updateMaintenanceSchedule" will check all the master's machines.
> Another machine (say machine B) that is not in the maintenance schedule will
> be set to UP mode and trigger a call to "updateUnavailability". This results
> in removing all offers of slaves on machine B.
> If I am using these offers to run some tasks, these tasks would be lost with
> REASON_INVALID_OFFERS.
> I think a maintenance schedule should not affect machines not in it. Is that
> right?





[jira] [Assigned] (MESOS-9394) Maintenance of machine A causes "Removing offers" for machine B.

2019-01-11 Thread longfei (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

longfei reassigned MESOS-9394:
--

Assignee: longfei

> Maintenance of machine A causes "Removing offers" for machine B.
> 
>
> Key: MESOS-9394
> URL: https://issues.apache.org/jira/browse/MESOS-9394
> Project: Mesos
>  Issue Type: Bug
>Reporter: longfei
>Assignee: longfei
>Priority: Major
>  Labels: maintenance
>
> If I schedule machine A in a maintenance call, the logic in
> "___updateMaintenanceSchedule" will check all the master's machines.
> Another machine (say machine B) that is not in the maintenance schedule will
> be set to UP mode and trigger a call to "updateUnavailability". This results
> in removing all offers of slaves on machine B.
> If I am using these offers to run some tasks, these tasks would be lost with
> REASON_INVALID_OFFERS.
> I think a maintenance schedule should not affect machines not in it. Is that
> right?





[jira] [Assigned] (MESOS-5189) SSLTest.ProtocolMismatch is slow

2019-01-11 Thread Till Toenshoff (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff reassigned MESOS-5189:
-

Assignee: Benjamin Bannier  (was: Till Toenshoff)

> SSLTest.ProtocolMismatch is slow
> 
>
> Key: MESOS-5189
> URL: https://issues.apache.org/jira/browse/MESOS-5189
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: mesosphere
>
> For me {{SSLTest.ProtocolMismatch}} currently takes more than 8 seconds for 
> an unoptimized build under OS X.





[jira] [Assigned] (MESOS-5189) SSLTest.ProtocolMismatch is slow

2019-01-11 Thread Till Toenshoff (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff reassigned MESOS-5189:
-

Assignee: Till Toenshoff

> SSLTest.ProtocolMismatch is slow
> 
>
> Key: MESOS-5189
> URL: https://issues.apache.org/jira/browse/MESOS-5189
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Reporter: Benjamin Bannier
>Assignee: Till Toenshoff
>Priority: Major
>  Labels: mesosphere
>
> For me {{SSLTest.ProtocolMismatch}} currently takes more than 8 seconds for 
> an unoptimized build under OS X.





[jira] [Commented] (MESOS-9394) Maintenance of machine A causes "Removing offers" for machine B.

2019-01-11 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740647#comment-16740647
 ] 

Benno Evers commented on MESOS-9394:


Both the analysis and the proposed change look correct to me - the current 
behaviour certainly does not match what the documentation at 
http://mesos.apache.org/documentation/latest/maintenance/#scheduling-maintenance
 suggests.

[~carlone], if you want to keep credit for the fix, I'd suggest you go ahead and 
post a patch to reviewboard; otherwise, if you prefer, I can do that for you.

> Maintenance of machine A causes "Removing offers" for machine B.
> 
>
> Key: MESOS-9394
> URL: https://issues.apache.org/jira/browse/MESOS-9394
> Project: Mesos
>  Issue Type: Bug
>Reporter: longfei
>Assignee: longfei
>Priority: Major
>  Labels: maintenance
>
> If I schedule machine A in a maintenance call, the logic in
> "___updateMaintenanceSchedule" will check all the master's machines.
> Another machine (say machine B) that is not in the maintenance schedule will
> be set to UP mode and trigger a call to "updateUnavailability". This results
> in removing all offers of slaves on machine B.
> If I am using these offers to run some tasks, these tasks would be lost with
> REASON_INVALID_OFFERS.
> I think a maintenance schedule should not affect machines not in it. Is that
> right?





[jira] [Comment Edited] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace

2019-01-11 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740881#comment-16740881
 ] 

Jie Yu edited comment on MESOS-9518 at 1/11/19 11:40 PM:
-

Also need this for newer kernels:
https://reviews.apache.org/r/69727


was (Author: jieyu):
Also need this:
https://reviews.apache.org/r/69727

> CNI_NETNS should not be set for orphan containers that do not have network 
> namespace
> 
>
> Key: MESOS-9518
> URL: https://issues.apache.org/jira/browse/MESOS-9518
> Project: Mesos
>  Issue Type: Bug
>  Components: cni
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> We introduced a new agent flag in MESOS-9492 so that CNI configs can be
> persisted across reboot. This lets some CNI plugins clean up the IPs
> allocated to containers after a sudden reboot of the host (not all CNI
> plugins need this).
> It's important to unset the `CNI_NETNS` environment variable after reboot when
> invoking the CNI plugin "DEL" command so that it conforms to the spec:
> {noformat}
> When CNI_NETNS and/or prevResult are not provided, the plugin should clean up 
> as many resources as possible (e.g. releasing IPAM allocations) and return a 
> successful response.
> {noformat}
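
To illustrate the requirement quoted above, here is a minimal sketch of how the environment for a "DEL" invocation could be assembled so that `CNI_NETNS` is omitted when the namespace is gone. The variable names `CNI_COMMAND`, `CNI_CONTAINERID`, `CNI_IFNAME`, `CNI_PATH`, and `CNI_NETNS` come from the CNI spec; the function `cniDelEnvironment` and its parameters are hypothetical and do not mirror the Mesos isolator.

```cpp
#include <map>
#include <string>

// Builds the environment for a CNI "DEL" invocation. `netnsPath` is empty
// when the container's network namespace no longer exists (e.g. after a
// host reboot); in that case CNI_NETNS is left out entirely.
std::map<std::string, std::string> cniDelEnvironment(
    const std::string& containerId,
    const std::string& ifName,
    const std::string& pluginPath,
    const std::string& netnsPath) {
  std::map<std::string, std::string> env = {
    {"CNI_COMMAND", "DEL"},
    {"CNI_CONTAINERID", containerId},
    {"CNI_IFNAME", ifName},
    {"CNI_PATH", pluginPath},
  };
  if (!netnsPath.empty()) {
    env["CNI_NETNS"] = netnsPath;
  }
  return env;
}
```

Leaving the variable out, rather than setting it to an empty string, matches the spec's wording "not provided".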





[jira] [Commented] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace

2019-01-11 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740881#comment-16740881
 ] 

Jie Yu commented on MESOS-9518:
---

Also need this:
https://reviews.apache.org/r/69727

> CNI_NETNS should not be set for orphan containers that do not have network 
> namespace
> 
>
> Key: MESOS-9518
> URL: https://issues.apache.org/jira/browse/MESOS-9518
> Project: Mesos
>  Issue Type: Bug
>  Components: cni
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Major
>
> We introduced a new agent flag in MESOS-9492 so that CNI configs can be
> persisted across reboot. This lets some CNI plugins clean up the IPs
> allocated to containers after a sudden reboot of the host (not all CNI
> plugins need this).
> It's important to unset the `CNI_NETNS` environment variable after reboot when
> invoking the CNI plugin "DEL" command so that it conforms to the spec:
> {noformat}
> When CNI_NETNS and/or prevResult are not provided, the plugin should clean up 
> as many resources as possible (e.g. releasing IPAM allocations) and return a 
> successful response.
> {noformat}


