[
https://issues.apache.org/jira/browse/APEXCORE-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155605#comment-16155605
]
Thomas Weise commented on APEXCORE-714:
---------------------------------------
The description refers to "reusing the instance and restoring the stream to the
window after the last fully processed window for the operator". Which does not
apply to AT_MOST_ONCE. I think that the separate "recovery mode" may be
confusing to users. Why not make it an explicit variation of AT_LEAST_ONCE?
Also note that currently existing EXACTLY_ONCE mode needs to be removed in the
future.
> Reusable instance operator recovery
> -----------------------------------
>
> Key: APEXCORE-714
> URL: https://issues.apache.org/jira/browse/APEXCORE-714
> Project: Apache Apex Core
> Issue Type: Improvement
> Reporter: Pramod Immaneni
> Assignee: Pramod Immaneni
>
> In a failure scenario, when a container fails, it is redeployed along with
> all the operators in it. The operators downstream to these operators are also
> redeployed within their containers. The operators are restored from their
> checkpoint and connect to the appropriate point in the stream according to
> the processing mode. In at least once mode, for example, the data is replayed
> from the same checkpoint
> Restoring an operator state from checkpoint could turn out to be a costly
> operation depending on the size of the state. In some use cases, based on the
> operator logic, when there is an upstream failure, without restoring the
> operator from checkpoint and reusing the current instance, will still produce
> the same results with the data replayed from the last fully processed window.
> The operator state can remain the same as it was before the upstream failure
> by reusing the same operator instance from before and only the streams and
> window reset to the window after the last fully processed window to guarantee
> the at least once processing of tuples. If the container where the operator
> itself is running goes down, it would need to be restored from the checkpoint
> of course. This scenario occurs in some batch use cases with operators that
> have a large state.
> I would like to propose adding the ability for a user to explicitly identify
> operators to be of this type and the corresponding functionality in the
> engine to handle their recovery in the way described above by not restoring
> their state from checkpoint, reusing the instance and restoring the stream to
> the window after the last fully processed window for the operator. When
> operators are not identified to be of this type, the default behavior is what
> it is today and nothing changes.
> I have done some prototyping on the engine side to ensure that this is
> possible with our current code base without requiring a massive overhaul,
> especially the restoration of the operator instance within the Node in the
> streaming container, the re-establishment of the subscriber stream to a
> window in the buffer server where the publisher (upstream) hasn't yet reached
> as it would be restarting from checkpoint and have been able to get it all
> working successfully.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)