[ 
https://issues.apache.org/jira/browse/MESOS-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939054#comment-14939054
 ] 

Gabriel Hartmann edited comment on MESOS-3479 at 9/30/15 11:29 PM:
-------------------------------------------------------------------

[~haosd...@gmail.com]:  I'm seeing this issue as well.  The config is like this:
"gracePeriodSeconds": 300,
"intervalSeconds": 60,
"timeoutSeconds": 60,
"maxConsecutiveFailures": 3

The first 3 attempts fail quickly, one after another.  The 4th attempt takes 
more than 60s before it eventually fails or times out.  While it is still 
running, a 5th attempt is started (it succeeds).  All of this occurs before the 
grace period expires.  The 5th attempt is the last one: no more health checks 
are made, and Marathon never receives a health check report.

Is there an ETA for a fix?  It's very disruptive to have frameworks that never 
converge to a healthy or unhealthy state.  In this case Marathon sees the 
framework as having 1 running task, with 0 staging, 0 healthy, and 0 unhealthy.


was (Author: gabriel.hartm...@gmail.com):
[~haosd...@gmail.com]:  I'm seeing this issue as well.  The config is like this:
"gracePeriodSeconds": 300,
"intervalSeconds": 60,
"timeoutSeconds": 60,
"maxConsecutiveFailures": 3

We fail early on 3 times in a row.  Then the 4th attempt takes more than 60s to 
eventually fail/timeout.  While it's running a 5th attempt is started (it 
succeeds).  All this occurs before expiration of the grace period.  The 5th 
attempt is the last attempt.  No more health checks are made.  Marathon never 
receives a health check report.

Is there an ETA for a fix for this?  It's very disruptive to frameworks 
converging to a healthy or unhealthy state.  Marathon in this case will see the 
framework as having 1 running task, with 0 staging, 0 healthy, and 0 unhealthy. 

> COMMAND Health Checks are not executed if the timeout is exceeded
> -----------------------------------------------------------------
>
>                 Key: MESOS-3479
>                 URL: https://issues.apache.org/jira/browse/MESOS-3479
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.23.0
>            Reporter: Matthias Veit
>            Assignee: haosdent
>            Priority: Critical
>
> The issue first appeared as a Marathon bug; see here for reference: 
> https://github.com/mesosphere/marathon/issues/2179.
> A COMMAND health check is defined with a timeout of 20 seconds.
> The command itself takes longer than 20 seconds to execute.
> Current behavior: 
> - The Mesos health check process gets killed, but the defined command 
> process is not (in the example the curl command returns after 21 seconds).
> - The check attempt is considered healthy if the timeout is exceeded
> - The health check stops and is not executed any longer
> Expected behavior: 
> - The defined health check command is killed when the timeout is exceeded
> - The check attempt is considered unhealthy if the timeout is exceeded
> - The health check does not stop 
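
The expected behavior listed above can be illustrated with a minimal Python sketch. This is not the Mesos health checker implementation; the names run_check and health_check_loop are hypothetical, and the sketch assumes a POSIX system and a check given as a shell command string. It shows the three points: the command (including any children, such as curl) is killed on timeout, a timed-out attempt counts as unhealthy, and the loop keeps running afterwards.

```python
import os
import signal
import subprocess
import time


def run_check(command, timeout_seconds):
    """Run one command check (hypothetical helper). True means healthy."""
    # start_new_session=True makes the shell a session and process group
    # leader, so the shell and its children share a group id equal to
    # proc.pid (POSIX only).
    proc = subprocess.Popen(
        command,
        shell=True,
        start_new_session=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 within the timeout counts as healthy.
        return proc.wait(timeout=timeout_seconds) == 0
    except subprocess.TimeoutExpired:
        # Kill the whole process group, not just the wrapper shell,
        # then reap the shell so no zombie is left behind.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        # A timed-out attempt is reported as unhealthy.
        return False


def health_check_loop(command, interval_seconds, timeout_seconds,
                      attempts, report=print):
    """Run a fixed number of checks; a timeout never ends the loop."""
    results = []
    for attempt in range(attempts):
        healthy = run_check(command, timeout_seconds)
        report(healthy)  # every attempt produces a report
        results.append(healthy)
        if attempt < attempts - 1:
            time.sleep(interval_seconds)
    return results
```

With the values from the comment above, a call might look like health_check_loop("curl -f http://localhost:8080/health", 60, 60, 5), where the URL is a placeholder. A real checker would loop indefinitely and apply gracePeriodSeconds and maxConsecutiveFailures on top of these per-attempt results; the fixed attempt count here only keeps the sketch finite.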



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
