[jira] [Commented] (MESOS-10042) Mesos UI template not always rendered

2020-01-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020546#comment-17020546
 ] 

Benjamin Mahler commented on MESOS-10042:
-

[~milipili] is this transient, or does it get stuck like that? Can you make a 
screen recording?

> Mesos UI template not always rendered
> -
>
> Key: MESOS-10042
> URL: https://issues.apache.org/jira/browse/MESOS-10042
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.9.0
> Environment: Linux Vivaldi & Firefox
> ubuntu 18.04
>Reporter: Damien Gerard
>Priority: Minor
> Attachments: image-2019-11-27-17-34-29-733.png, 
> image-2019-11-27-17-37-18-679.png, image-2019-11-27-17-39-06-984.png, 
> image-2019-11-27-17-39-16-491.png, image-2019-11-27-17-39-37-341.png, 
> image-2019-11-27-17-39-44-306.png
>
>
> When opening the webui directly or when switching tabs (by clicking on 
> "Frameworks"/"Agents"/whatever back to the main page), the page is not always 
> rendered (see below).
>   !image-2019-11-27-17-39-37-341.png!
> Also, the cluster name is never replaced (same as in our Mesos 1.6), even if 
> --cluster "some-value" is set.
>   !image-2019-11-27-17-39-44-306.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020545#comment-17020545
 ] 

Benjamin Mahler commented on MESOS-10068:
-

The first thing to note is that we don't yet have a formalized agent lifecycle 
in the API: we have AgentAdded / AgentRemoved, but internally there is also the 
notion of disconnecting, becoming unreachable, and getting transitioned to 
gone. So the API and the internals are somewhat mismatched here and, more 
broadly than this particular ticket, we would need to make them consistent to 
have events that make sense.

[~daltonmatos] It looks like the reason you're seeing no AGENT_REMOVED is that 
the agent became unreachable, and we don't send it in that case. The first 
case goes through a different path: we were never able to communicate with the 
agent, but we don't know that, and the agent retries its registration. Upon 
seeing this, we remove the previous version of that agent and try to register 
the new one. You may see this repeating itself over and over.

[~greggomann] looks like we don't send AGENT_REMOVED when an agent is marked as 
gone? Seems like a bug due to {{__removeSlave}} being used for both marking 
unreachable and gone?



> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal 
> state
> ---
>
> Key: MESOS-10068
> URL: https://issues.apache.org/jira/browse/MESOS-10068
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.3, 1.8.2, 1.9.1
>Reporter: Dalton Matos Coelho Barreto
>Priority: Major
> Attachments: master-full-logs.log
>
>
> Hello,
>  
> Looking at the documentation of the master {{/api/v1}} endpoint, the 
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are 
> supported for this endpoint, but when a new agent joins the cluster an 
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped, the {{AGENT_REMOVED}} event is 
> not received by clients subscribed to the master API.
>  
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}} and {{1.9.1}}, all 
> using the docker image {{mesos/mesos-centos}}.
> The only way I saw an {{AGENT_REMOVED}} event was when a new agent joined the 
> cluster but the master couldn't communicate with this agent. In this specific 
> test there was a firewall blocking port {{5051}} on the slave, that is, 
> nobody was able to talk to the slave on port {{5051}}.
>  
> h2. Here are the steps to reproduce the problem
>  * Start a new mesos master
>  * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
>  ** 
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1
> {noformat}
>  * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
>  * Stop this slave;
>  * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the 
> field {{active=false}}.
>  * Wait for the mesos master to stop listing this slave, that is, until 
> {{/slaves?slave_id=AGENT_ID}} returns an empty response;
> Even after the empty response, the event never reaches the subscriber.
>  
> The mesos master logs show this:
> {noformat}
> I1213 15:03:10.338935 13 master.cpp:1297] Agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964) disconnected
> I1213 15:03:10.339089 13 master.cpp:3399] Disconnecting agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> I1213 15:03:10.339207 13 master.cpp:3418] Deactivating agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.726670 15 process.cpp:1917] Failed to send 
> 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to 
> connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1   {noformat}
>  
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>  
> I will attach the full master logs also.
>  
> Do you think this could be a bug?
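
The subscriber side of the reproduction steps above can be sketched in Python. This is only a minimal sketch: it assumes the streaming response of {{/api/v1}} is RecordIO-framed JSON (each record is its length in bytes, a newline, then that many bytes). The helper names iter_records, agent_events and frame, and the sample events, are hypothetical and not taken from the ticket.

```python
import io
import json
from typing import BinaryIO, Iterator


def iter_records(stream: BinaryIO) -> Iterator[bytes]:
    """Yield RecordIO records: '<length>\n' followed by <length> bytes.

    Assumes a buffered stream whose read(n) returns n bytes unless it hits EOF.
    """
    while True:
        header = stream.readline()
        if not header:
            return  # stream closed by the master
        length = int(header.strip())
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("stream ended in the middle of a record")
        yield payload


def agent_events(stream: BinaryIO) -> Iterator[dict]:
    """Keep only the agent events a subscriber would be watching for."""
    for record in iter_records(stream):
        event = json.loads(record)
        if event.get("type") in ("AGENT_ADDED", "AGENT_REMOVED"):
            yield event


def frame(event: dict) -> bytes:
    """Frame an event the same way, to build sample input (hypothetical data)."""
    body = json.dumps(event).encode()
    return str(len(body)).encode() + b"\n" + body


# Sample stream standing in for the HTTP response body.
sample = io.BytesIO(
    frame({"type": "SUBSCRIBED"})
    + frame({"type": "AGENT_ADDED"})
    + frame({"type": "TASK_ADDED"})
)
seen = [event["type"] for event in agent_events(sample)]
print(seen)
```

In the scenario reported above, a filter like this would show the AGENT_ADDED event when the slave starts, and nothing further after the slave is stopped.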





[jira] [Commented] (MESOS-1807) Disallow executors with cpu only or memory only resources

2020-01-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020530#comment-17020530
 ] 

Benjamin Mahler commented on MESOS-1807:


[~Charle] the command executor is a special case, it's implicitly generated and 
we oversubscribe a little bit to make room for it: 
https://github.com/apache/mesos/blob/1.9.0/src/slave/slave.cpp#L6663-L6676

I think the expectation for CUSTOM or the new DEFAULT executors is that they 
specify their resource requirements. Since it didn't break any backwards 
compatibility, we enforce it for the new DEFAULT case: 
https://github.com/apache/mesos/blob/1.9.0/src/master/validation.cpp#L1842-L1859

[~greggomann] is also working on cpu/mem requests vs limits (see MESOS-10001), 
so that may provide you with the flexibility you desire depending on what 
you're looking to do.

> Disallow executors with cpu only or memory only resources
> -
>
> Key: MESOS-1807
> URL: https://issues.apache.org/jira/browse/MESOS-1807
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Priority: Major
> Attachments: Screenshot 2015-07-28 14.40.35.png
>
>
> Currently the master allows executors to be launched with only cpus or only 
> memory, but we shouldn't allow that.
> This is because an executor is an actual Unix process that is launched by the 
> slave. If an executor doesn't specify cpus, what should the cpu limits be for 
> that executor when there are no tasks running on it? If no cpu limits are set, 
> it might starve other executors/tasks on the slave, violating isolation 
> guarantees. The same goes for memory. Moreover, the current 
> containerizer/isolator code will throw failures when using such an executor, 
> e.g., when the last task on the executor finishes and Containerizer::update() 
> is called with 0 cpus or 0 mem.
> According to a source code [TODO | 
> https://github.com/apache/mesos/blob/0226620747e1769434a1a83da547bfc3470a9549/src/master/validation.cpp#L400]
>  this should also include checking whether requested resources are greater 
> than MIN_CPUS/MIN_BYTES.
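
The check proposed above can be illustrated with a short Python sketch. It is not the code in validation.cpp. The function name validate_executor_resources, the dict-based resource representation, and the numeric values given to MIN_CPUS and MIN_BYTES are all assumptions made for illustration.

```python
from typing import Optional

# Placeholder minimums: the real values are defined in the Mesos source and
# are only assumed here.
MIN_CPUS = 0.01
MIN_BYTES = 32 * 1024 * 1024


def validate_executor_resources(resources: dict) -> Optional[str]:
    """Return an error message, or None if the executor resources look valid.

    `resources` is a hypothetical mapping such as
    {"cpus": 0.1, "mem_bytes": 64 * 1024 * 1024}.
    """
    cpus = resources.get("cpus")
    mem = resources.get("mem_bytes")

    # Reject executors that specify only cpus or only memory (or neither).
    if cpus is None or mem is None:
        return "executor must specify both cpus and mem"

    # Treating a value equal to the minimum as acceptable is an assumption.
    if cpus < MIN_CPUS:
        return f"executor cpus {cpus} is below MIN_CPUS {MIN_CPUS}"
    if mem < MIN_BYTES:
        return f"executor mem {mem} is below MIN_BYTES {MIN_BYTES}"
    return None


cpu_only = validate_executor_resources({"cpus": 0.1})
complete = validate_executor_resources(
    {"cpus": 0.1, "mem_bytes": 64 * 1024 * 1024}
)
print(cpu_only)
print(complete)
```

Returning a message rather than raising keeps the sketch close to a validation step that collects errors. Whether the real check should also cover implicitly generated command executors is outside what this sketch models.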





[jira] [Comment Edited] (MESOS-10084) Detecting whether executor is generated for command task should work when the launcher_dir changes

2020-01-21 Thread Benjamin Bannier (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015863#comment-17015863
 ] 

Benjamin Bannier edited comment on MESOS-10084 at 1/21/20 12:21 PM:


Reviews:
https://reviews.apache.org/r/72033/
https://reviews.apache.org/r/72034/
https://reviews.apache.org/r/72035/


was (Author: bbannier):
Reviews:
https://reviews.apache.org/r/72002/
https://reviews.apache.org/r/72003/

> Detecting whether executor is generated for command task should work when the 
> launcher_dir changes
> --
>
> Key: MESOS-10084
> URL: https://issues.apache.org/jira/browse/MESOS-10084
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Sekretenko
>Assignee: Benjamin Bannier
>Priority: Critical
>
> As currently implemented, on recovery the Mesos agent determines that an 
> executor was generated for a command task by comparing the executor command 
> with the current path to the Mesos executor:
> https://github.com/apache/mesos/blob/1.7.x/src/slave/slave.cpp#L9635
> During an upgrade of a production cluster we observed this check break because 
> the new launcher_dir differed from the one of the checkpointed executor.
> This can cause problems of various kinds: for example, after such an upgrade, 
> the Mesos master can begin to treat the checkpointed command executors as 
> subject to resource quota.
> Design considerations:
>  - The proper solution is to checkpoint a flag indicating whether the executor 
> is a command/docker one.
>  - For a correct upgrade from older Mesos versions, we will need some kind of 
> workaround to detect command executors after the upgrade; the workaround logic 
> should be skipped if there is a checkpointed flag.


