Re: Specialized operator recovery

Sanjay Pujare Mon, 24 Apr 2017 10:56:51 -0700

+1.

If we didn't have to worry about backward compatibility, I would even make
this new behavior the default (so maybe in v4.0 of Apex?).


On Mon, Apr 24, 2017 at 8:57 AM, Pramod Immaneni <[email protected]>
wrote:

> In a failure scenario, when a container fails, it is redeployed along with
> all the operators in it. The operators downstream of these operators are
> also redeployed within their containers. The operators are restored from
> their checkpoint and connect to the appropriate point in the stream
> according to the processing mode. In at-least-once mode, for example, the
> data is replayed from the same checkpoint.
>
> Restoring operator state from a checkpoint can be a costly operation,
> depending on the size of the state. In some use cases, depending on the
> operator logic, an operator whose state is left as is after an upstream
> failure, rather than being restored from the checkpoint, will still produce
> the same results when the data is replayed starting after the last fully
> processed window. This is true of some operators in batch use cases. In
> such cases the engine can keep the operator state as it was before the
> upstream failure by reusing the same operator instance and resetting only
> the streams and the window to the window after the last fully processed
> window, which guarantees at-least-once processing of tuples. If the
> container where the operator itself is running goes down, the operator
> would of course need to be restored from the checkpoint.
>
> I would like to propose adding the ability for a user to explicitly
> identify operators as being of this type, along with the corresponding
> functionality in the engine to handle their recovery as described above:
> not restoring their state from the checkpoint, reusing the instance, and
> restoring the stream to the window after the last fully processed window
> for the operator. When operators are not identified as being of this type,
> the default behavior stays what it is today and nothing changes.
>
> I have done some prototyping on the engine side to ensure that this is
> possible with our current code base without requiring a massive overhaul.
> This covers, in particular, restoring the operator instance within the
> Node in the streaming container and re-establishing the subscriber stream
> at a window in the buffer server that the publisher (upstream) hasn't yet
> reached, since it would be restarting from its checkpoint. I have been able
> to get it all working successfully.
>
> Thanks
>
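
For readers skimming the thread, the recovery rule proposed in the quoted
message can be sketched roughly as below. This is only an illustration, not
engine code: RecoveryPlanner, Action, Plan and plan() are hypothetical names
that do not exist in Apex, and window ids are treated as consecutive longs
purely to keep the example small. Resuming from the window after the
checkpointed window in the default case is also a simplifying assumption.

```java
// Hypothetical sketch of the proposed recovery decision; none of these
// names are part of the Apex API.
final class RecoveryPlanner {

    enum Action { RESTORE_FROM_CHECKPOINT, REUSE_INSTANCE_AND_RESET_STREAMS }

    // What the engine would do for one operator affected by a failure.
    static final class Plan {
        final Action action;
        final long resumeWindowId; // first window the operator processes again

        Plan(Action action, long resumeWindowId) {
            this.action = action;
            this.resumeWindowId = resumeWindowId;
        }
    }

    // ownContainerFailed: the container hosting this operator went down.
    // markedStateReusable: the user identified the operator as being of the
    //   special type described in the proposal.
    // Window ids are assumed to be consecutive for illustration only.
    static Plan plan(boolean ownContainerFailed,
                     boolean markedStateReusable,
                     long checkpointWindowId,
                     long lastFullyProcessedWindowId) {
        if (!ownContainerFailed && markedStateReusable) {
            // Upstream failure only: keep the live instance and its state,
            // and reset the streams to the window after the last fully
            // processed window.
            return new Plan(Action.REUSE_INSTANCE_AND_RESET_STREAMS,
                            lastFullyProcessedWindowId + 1);
        }
        // Default behavior (unchanged): restore state from the checkpoint
        // and replay from there.
        return new Plan(Action.RESTORE_FROM_CHECKPOINT, checkpointWindowId + 1);
    }
}
```

With a checkpoint at window 10 and a last fully processed window of 17, a
marked operator whose own container survived would resume at window 18
with its state intact. An unmarked operator, or one whose own container
failed, would be restored from the checkpoint and resume at window 11.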
