[ 
https://issues.apache.org/jira/browse/MESOS-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-5910:
-------------------------------------
    Description: 
The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of 
events as they occur.  This is going to be extremely useful for monitoring and 
management jobs, as they can now have timely information about Mesos's 
operation without requiring repeated polling or other ugly solutions.

Unfortunately, the SUBSCRIBE call only returns events from the time the call is 
made.  This means that no consumer can reliably subscribe to "all events"; 
if the application goes offline (network blip, code upgrade, etc.), all events 
during that downtime are lost.

You could instead have a cluster of applications receiving the events and 
coordinating to deduplicate them to increase reliability, but this pushes a lot 
of complexity into clients, and I suspect most users would not do this 
correctly and would potentially lose events.

It would be extremely useful for a single client to be able to get a reliable 
event stream without requiring a single HTTP connection to be 100% available.

One possible solution is to assign every event an ID and extend the API to 
take a "start position" in the log.  The API immediately streams out all events 
from the start position up to the tail of the log, and then continues emitting 
new events as they occur.  This provides a reliable way for a consumer to get 
"at least once" semantics on events.  The caveat is that the consumer may only 
be down for as long as the master retains event history, but this is a much 
easier pill to swallow.  This is similar to etcd's "watch" API, if you are 
looking for an actual implementation to reference.
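
To make the proposal concrete, here is a rough sketch of the idea in Python.  
It is not Mesos code: the names EventLog, Consumer, read_from and the 
retention parameter are all hypothetical, and the log is a plain in-memory 
buffer standing in for whatever history the master would actually keep.

```python
from collections import deque
from typing import Deque, List, Tuple


class EventLog:
    """Hypothetical bounded event log; each appended event gets the next ID."""

    def __init__(self, retention: int) -> None:
        # Only the newest `retention` events are kept (assumed retention policy).
        self._events: Deque[Tuple[int, str]] = deque(maxlen=retention)
        self._next_id = 1

    def append(self, payload: str) -> int:
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, payload))  # evicts the oldest when full
        return event_id

    def read_from(self, start_id: int) -> List[Tuple[int, str]]:
        """Return every retained event with ID >= start_id.

        Raises LookupError if start_id has already been evicted, so the
        caller learns that it missed events instead of silently skipping them.
        """
        oldest = self._events[0][0] if self._events else self._next_id
        if start_id < oldest:
            raise LookupError(
                f"event {start_id} is no longer retained; oldest is {oldest}"
            )
        return [(i, p) for (i, p) in self._events if i >= start_id]


class Consumer:
    """Hypothetical client that remembers the last event ID it handled."""

    def __init__(self) -> None:
        self.last_seen = 0
        self.handled: List[str] = []

    def poll(self, log: EventLog) -> None:
        # Resume right after the last handled event, e.g. after a reconnect.
        for event_id, payload in log.read_from(self.last_seen + 1):
            if event_id <= self.last_seen:
                continue  # duplicate delivery is allowed under at-least-once
            self.handled.append(payload)
            self.last_seen = event_id
```

A consumer that was offline calls poll() again with its stored last_seen and 
picks up where it left off.  If it was down longer than the retention window, 
read_from raises and the client has to resynchronize its state some other way 
rather than assume it saw everything.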


> Operator SUBSCRIBE api should provide a method to get all events without 
> requiring 100% uptime
> ----------------------------------------------------------------------------------------------
>
>                 Key: MESOS-5910
>                 URL: https://issues.apache.org/jira/browse/MESOS-5910
>             Project: Mesos
>          Issue Type: Improvement
>          Components: HTTP API, json api
>    Affects Versions: 1.0.0
>            Reporter: Steven Schlansker


