[jira] [Commented] (MESOS-9501) Mesos executor fails to terminate and gets stuck after agent host reboot.
[ https://issues.apache.org/jira/browse/MESOS-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740141#comment-16740141 ]

Qian Zhang commented on MESOS-9501:
-----------------------------------

RR: https://reviews.apache.org/r/69705/

> Mesos executor fails to terminate and gets stuck after agent host reboot.
> -------------------------------------------------------------------------
>
>                 Key: MESOS-9501
>                 URL: https://issues.apache.org/jira/browse/MESOS-9501
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.5.1, 1.6.1, 1.7.0
>            Reporter: Meng Zhu
>            Assignee: Qian Zhang
>            Priority: Critical
>
> When an agent host reboots, all of its containers are gone, but the agent will
> still try to recover from its checkpointed state after the reboot.
> The agent will soon discover that all the cgroup hierarchies are gone and
> assume (correctly) that the containers are destroyed.
> However, when trying to terminate the executor, the agent will first try to
> wait for the exit status of its container:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
> The agent does so by calling `waitpid` on the checkpointed child process pid.
> If, after the agent host reboot, a new process with the same pid gets spawned,
> then the parent will wait for the wrong child process. This could get stuck
> until the wrongly waited-for process somehow exits, see `ReaperProcess::wait()`:
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
> This will block the executor termination as well as future task status updates
> (e.g. the master might still think the task is running).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
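The pid-reuse hazard described in MESOS-9501 can be illustrated with a small guard. This is only a sketch under stated assumptions and not the change posted in the review request above: the `CheckpointedExecutor` struct, its fields, and `canReapCheckpointedPid` are hypothetical names. It assumes a boot identifier was stored next to the pid (on Linux one can be read from /proc/sys/kernel/random/boot_id).

```cpp
#include <string>
#include <sys/types.h>

// Hypothetical checkpoint record. The field names are illustrative and are
// not taken from the Mesos source.
struct CheckpointedExecutor {
  pid_t pid;           // pid recorded before the agent stopped
  std::string bootId;  // boot identifier of the host when the pid was recorded
};

// Returns true only when the checkpointed pid can still refer to the original
// process, i.e. the host has not rebooted since the checkpoint was written.
// After a reboot the same pid may belong to an unrelated process, so waiting
// on it could block until that unrelated process exits.
bool canReapCheckpointedPid(
    const CheckpointedExecutor& checkpoint,
    const std::string& currentBootId)
{
  return checkpoint.pid > 0 && checkpoint.bootId == currentBootId;
}
```

A caller would skip the wait and treat the container as already terminated whenever this returns false.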
[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors
[ https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740239#comment-16740239 ]

Benjamin Bannier commented on MESOS-9223:
-----------------------------------------

Reviews:
[r/69606/|https://reviews.apache.org/r/69606/]
[r/69719/|https://reviews.apache.org/r/69719/]

> Storage local provider does not sufficiently handle container launch failures
> or errors
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-9223
>                 URL: https://issues.apache.org/jira/browse/MESOS-9223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, storage
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Critical
>
> The storage local resource provider as currently implemented does not handle
> launch failures or task errors of its standalone containers well enough. If,
> e.g., an RP container fails to come up during node start, a warning is
> logged, but an operator still needs to detect degraded functionality,
> manually check the state of containers with {{GET_CONTAINERS}}, and decide
> whether the agent needs restarting; I suspect they do not always have
> enough context for this decision. It would be better if the provider
> either enforced a restart by failing over the whole agent, or retried the
> operation (optionally: up to some maximum number of retries).
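The "retry up to some maximum number of retries" option mentioned in MESOS-9223 could look roughly like the following. This is a generic sketch, not the code under review: `launchWithRetries` and its parameters are hypothetical, and the launch operation is modelled as a plain callable that reports success or failure.

```cpp
#include <functional>

// Calls `launch` until it succeeds or until the initial attempt plus
// `maxRetries` retries have all failed. Returns true on success.
// Hypothetical helper; a real provider would launch asynchronously and
// would probably back off between attempts.
bool launchWithRetries(const std::function<bool()>& launch, int maxRetries)
{
  for (int attempt = 0; attempt <= maxRetries; ++attempt) {
    if (launch()) {
      return true;
    }
  }
  return false;
}
```

When the helper returns false, the caller can fall back to the other option named in the issue and fail over the whole agent instead of only logging a warning.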
[jira] [Commented] (MESOS-9394) Maintenance of machine A causes "Removing offers" for machine B.
[ https://issues.apache.org/jira/browse/MESOS-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740255#comment-16740255 ]

longfei commented on MESOS-9394:
--------------------------------

[~arojas] Hi, would you take a look at this, please?
I think
{code:java}
if (master->machines[id].info.mode() != MachineInfo::UP) {
  master->machines[id].info.set_mode(MachineInfo::UP);
  master->updateUnavailability(id, None());
}
{code}
will fix this.

> Maintenance of machine A causes "Removing offers" for machine B.
> ----------------------------------------------------------------
>
>                 Key: MESOS-9394
>                 URL: https://issues.apache.org/jira/browse/MESOS-9394
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: longfei
>            Priority: Major
>              Labels: maintenance
>
> If I schedule machine A in a maintenance call, the logic in
> "___updateMaintenanceSchedule" will check all the master's machines.
> Another machine (say machine B) that is not in the maintenance schedule will
> be set to UP mode and "updateUnavailability" will be called for it. This
> results in removing all offers of slaves on machine B.
> If I am using these offers to run some tasks, these tasks will be lost with
> REASON_INVALID_OFFERS.
> I think a maintenance schedule should not affect machines not in it. Is that
> right?
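The effect of the guard proposed in the comment above can be shown on a toy model. This is a simplified sketch, not Mesos code: the `Master`, `Machine`, and `Mode` types below are stand-ins invented for illustration, and the handling of scheduled machines is an assumption made only so the example is complete.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

enum class Mode { UP, DRAINING };

struct Machine {
  Mode mode = Mode::UP;
};

// Toy stand-in for the master's machine table (hypothetical, not Mesos code).
struct Master {
  std::map<std::string, Machine> machines;

  // Records every machine whose offers would be removed.
  std::vector<std::string> unavailabilityUpdates;

  void updateUnavailability(const std::string& id)
  {
    unavailabilityUpdates.push_back(id);
  }

  void updateMaintenanceSchedule(const std::set<std::string>& scheduled)
  {
    for (auto& entry : machines) {
      const std::string& id = entry.first;
      Machine& machine = entry.second;

      if (scheduled.count(id) > 0) {
        // Assumed behaviour for scheduled machines: start draining once.
        if (machine.mode == Mode::UP) {
          machine.mode = Mode::DRAINING;
          updateUnavailability(id);
        }
        continue;
      }

      // The proposed guard: only touch unscheduled machines that are not
      // already UP, so machines that were never scheduled keep their offers.
      if (machine.mode != Mode::UP) {
        machine.mode = Mode::UP;
        updateUnavailability(id);
      }
    }
  }
};
```

With machines A and B registered and only A scheduled, the model records an update for A alone; B is already UP and is left untouched.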
[jira] [Assigned] (MESOS-9394) Maintenance of machine A causes "Removing offers" for machine B.
[ https://issues.apache.org/jira/browse/MESOS-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

longfei reassigned MESOS-9394:
------------------------------

    Assignee: longfei

> Maintenance of machine A causes "Removing offers" for machine B.
> ----------------------------------------------------------------
>
>                 Key: MESOS-9394
>                 URL: https://issues.apache.org/jira/browse/MESOS-9394
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: longfei
>            Assignee: longfei
>            Priority: Major
>              Labels: maintenance
>
> If I schedule machine A in a maintenance call, the logic in
> "___updateMaintenanceSchedule" will check all the master's machines.
> Another machine (say machine B) that is not in the maintenance schedule will
> be set to UP mode and "updateUnavailability" will be called for it. This
> results in removing all offers of slaves on machine B.
> If I am using these offers to run some tasks, these tasks will be lost with
> REASON_INVALID_OFFERS.
> I think a maintenance schedule should not affect machines not in it. Is that
> right?
[jira] [Assigned] (MESOS-5189) SSLTest.ProtocolMismatch is slow
[ https://issues.apache.org/jira/browse/MESOS-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Toenshoff reassigned MESOS-5189:
------------------------------------

    Assignee: Benjamin Bannier  (was: Till Toenshoff)

> SSLTest.ProtocolMismatch is slow
> --------------------------------
>
>                 Key: MESOS-5189
>                 URL: https://issues.apache.org/jira/browse/MESOS-5189
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, test
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Major
>              Labels: mesosphere
>
> For me {{SSLTest.ProtocolMismatch}} currently takes more than 8 seconds for
> an unoptimized build under OS X.
[jira] [Assigned] (MESOS-5189) SSLTest.ProtocolMismatch is slow
[ https://issues.apache.org/jira/browse/MESOS-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Toenshoff reassigned MESOS-5189:
------------------------------------

    Assignee: Till Toenshoff

> SSLTest.ProtocolMismatch is slow
> --------------------------------
>
>                 Key: MESOS-5189
>                 URL: https://issues.apache.org/jira/browse/MESOS-5189
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, test
>            Reporter: Benjamin Bannier
>            Assignee: Till Toenshoff
>            Priority: Major
>              Labels: mesosphere
>
> For me {{SSLTest.ProtocolMismatch}} currently takes more than 8 seconds for
> an unoptimized build under OS X.
[jira] [Commented] (MESOS-9394) Maintenance of machine A causes "Removing offers" for machine B.
[ https://issues.apache.org/jira/browse/MESOS-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740647#comment-16740647 ]

Benno Evers commented on MESOS-9394:
------------------------------------

Both the analysis and the proposed change look correct to me - the current behaviour certainly does not match what the documentation at http://mesos.apache.org/documentation/latest/maintenance/#scheduling-maintenance suggests.

[~carlone], if you want to keep credit for the fix, I'd suggest you go ahead and post a patch to reviewboard; otherwise, if you prefer, I can do that for you.

> Maintenance of machine A causes "Removing offers" for machine B.
> ----------------------------------------------------------------
>
>                 Key: MESOS-9394
>                 URL: https://issues.apache.org/jira/browse/MESOS-9394
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: longfei
>            Assignee: longfei
>            Priority: Major
>              Labels: maintenance
>
> If I schedule machine A in a maintenance call, the logic in
> "___updateMaintenanceSchedule" will check all the master's machines.
> Another machine (say machine B) that is not in the maintenance schedule will
> be set to UP mode and "updateUnavailability" will be called for it. This
> results in removing all offers of slaves on machine B.
> If I am using these offers to run some tasks, these tasks will be lost with
> REASON_INVALID_OFFERS.
> I think a maintenance schedule should not affect machines not in it. Is that
> right?
[jira] [Comment Edited] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace
[ https://issues.apache.org/jira/browse/MESOS-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740881#comment-16740881 ]

Jie Yu edited comment on MESOS-9518 at 1/11/19 11:40 PM:
---------------------------------------------------------

Also need this for newer kernels:
https://reviews.apache.org/r/69727

was (Author: jieyu):
Also need this:
https://reviews.apache.org/r/69727

> CNI_NETNS should not be set for orphan containers that do not have network
> namespace
> --------------------------------------------------------------------------
>
>                 Key: MESOS-9518
>                 URL: https://issues.apache.org/jira/browse/MESOS-9518
>             Project: Mesos
>          Issue Type: Bug
>          Components: cni
>    Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>            Reporter: Jie Yu
>            Assignee: Jie Yu
>            Priority: Major
>
> We introduced a new agent flag in MESOS-9492 so that CNI configs can be
> persisted across reboots. This allows some CNI plugins to clean up the IPs
> allocated to the containers after a sudden reboot of the host (not all CNI
> plugins need this).
> It's important to unset the `CNI_NETNS` environment variable after a reboot
> when invoking the CNI plugin "DEL" command so that it conforms to the spec:
> {noformat}
> When CNI_NETNS and/or prevResult are not provided, the plugin should clean up
> as many resources as possible (e.g. releasing IPAM allocations) and return a
> successful response.
> {noformat}
[jira] [Commented] (MESOS-9518) CNI_NETNS should not be set for orphan containers that do not have network namespace
[ https://issues.apache.org/jira/browse/MESOS-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740881#comment-16740881 ]

Jie Yu commented on MESOS-9518:
-------------------------------

Also need this:
https://reviews.apache.org/r/69727

> CNI_NETNS should not be set for orphan containers that do not have network
> namespace
> --------------------------------------------------------------------------
>
>                 Key: MESOS-9518
>                 URL: https://issues.apache.org/jira/browse/MESOS-9518
>             Project: Mesos
>          Issue Type: Bug
>          Components: cni
>    Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>            Reporter: Jie Yu
>            Assignee: Jie Yu
>            Priority: Major
>
> We introduced a new agent flag in MESOS-9492 so that CNI configs can be
> persisted across reboots. This allows some CNI plugins to clean up the IPs
> allocated to the containers after a sudden reboot of the host (not all CNI
> plugins need this).
> It's important to unset the `CNI_NETNS` environment variable after a reboot
> when invoking the CNI plugin "DEL" command so that it conforms to the spec:
> {noformat}
> When CNI_NETNS and/or prevResult are not provided, the plugin should clean up
> as many resources as possible (e.g. releasing IPAM allocations) and return a
> successful response.
> {noformat}
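The rule described in MESOS-9518 (omit `CNI_NETNS` on "DEL" when the container no longer has a network namespace) can be sketched as a small environment builder. This is an illustrative sketch rather than the patch under review: `delEnvironment` is a hypothetical helper, an empty `netnsPath` is used here to mean "no namespace", and only a subset of the CNI environment variables is shown.

```cpp
#include <map>
#include <string>

// Builds the environment for invoking a CNI plugin with the DEL command.
// Hypothetical helper: an empty `netnsPath` means the container has no
// network namespace left, e.g. an orphan container after a host reboot.
std::map<std::string, std::string> delEnvironment(
    const std::string& containerId,
    const std::string& ifName,
    const std::string& netnsPath)
{
  std::map<std::string, std::string> environment;
  environment["CNI_COMMAND"] = "DEL";
  environment["CNI_CONTAINERID"] = containerId;
  environment["CNI_IFNAME"] = ifName;

  // Leave CNI_NETNS unset when no namespace exists, so that the plugin takes
  // the best-effort cleanup path quoted from the spec above.
  if (!netnsPath.empty()) {
    environment["CNI_NETNS"] = netnsPath;
  }

  return environment;
}
```

For an orphan container the caller passes an empty path and the resulting map has no `CNI_NETNS` key, whereas a live container gets its namespace path passed through unchanged.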