Steven Schlansker created MESOS-5910:
----------------------------------------
             Summary: Operator SUBSCRIBE api should provide a method to get all events without requiring 100% uptime
                 Key: MESOS-5910
                 URL: https://issues.apache.org/jira/browse/MESOS-5910
             Project: Mesos
          Issue Type: Improvement
          Components: HTTP API, json api
    Affects Versions: 1.0.0
            Reporter: Steven Schlansker


The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of events as they occur. This is going to be extremely useful for monitoring and management jobs, as they can now have timely information about Mesos's operation without requiring repeated polling or other ugly solutions.

Unfortunately, the SUBSCRIBE stream always starts at the time the call is made. This means that no consumer can reliably subscribe to "all events"; if the application goes offline (network blip, code upgrade, etc.), all events during that downtime are lost.

You could instead have a cluster of applications receiving the events and coordinating to deduplicate them to increase reliability, but this pushes a lot of complexity into clients, and I suspect most users would not do this correctly and would potentially lose events.

It would be extremely useful for a single client to be able to get a reliable event stream without requiring a single HTTP connection to be 100% available.

One possible solution is to assign every event an ID. Then, extend the API to take a "start position" in the log. The API immediately streams out all events from the start position up to the tail of the log, and then continues emitting new events as they occur. This provides a reliable way for a consumer to get "at least once" semantics on events.

The caveat is that the consumer may only be down for as long as the master retains event history, but this is a much easier pill to swallow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
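
To make the proposal above concrete, here is a minimal sketch of the "event ID plus start position" idea in Python. It is not Mesos code and not the Operator API: `Event`, `EventLog`, `Consumer`, `read_from`, and the `retention` parameter are all hypothetical names chosen for illustration, and the sketch polls an in-memory log rather than holding a streaming HTTP connection.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: int  # hypothetical: increasing ID assigned when the event is logged
    payload: str


class EventLog:
    """Bounded in-memory event history (illustrative only, not a Mesos class)."""

    def __init__(self, retention: int):
        # A deque with maxlen drops the oldest event once the log is full.
        self._events = deque(maxlen=retention)
        self._next_id = 0

    def append(self, payload: str) -> Event:
        event = Event(self._next_id, payload)
        self._events.append(event)
        self._next_id += 1
        return event

    def read_from(self, start_id: int) -> list:
        """Return every retained event whose ID is >= start_id."""
        oldest = self._events[0].event_id if self._events else self._next_id
        if start_id < oldest:
            # The consumer was away longer than the retention window.
            raise LookupError(f"events before id {oldest} are no longer retained")
        return [e for e in self._events if e.event_id >= start_id]


class Consumer:
    """Tracks the next ID it needs, so it can resume after downtime."""

    def __init__(self):
        self.next_id = 0
        self.seen = []

    def poll(self, log: EventLog) -> None:
        for event in log.read_from(self.next_id):
            self.seen.append(event.payload)   # process first...
            self.next_id = event.event_id + 1  # ...then advance the position


log = EventLog(retention=4)
consumer = Consumer()

log.append("a")
log.append("b")
consumer.poll(log)            # consumer is connected and sees a, b

for payload in ("c", "d", "e"):
    log.append(payload)       # consumer is "offline" while these arrive

consumer.poll(log)            # on reconnect it resumes from its saved position
```

Because the consumer advances `next_id` only after processing an event, a crash between those two steps replays that event on restart, which is the "at least once" behaviour described above. A consumer whose saved position has already fallen out of the retention window gets a `LookupError` in this sketch, which corresponds to the caveat about how long the master retains history.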