[
https://issues.apache.org/jira/browse/APEXCORE-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15991331#comment-15991331
]
ASF GitHub Bot commented on APEXCORE-714:
-----------------------------------------
GitHub user PramodSSImmaneni opened a pull request:
https://github.com/apache/apex-core/pull/522
APEXCORE-714 Adding a new recovery mode where the operator instance before
a failure event can be reused when recovering from an upstream operator failure
[Review Only]
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/PramodSSImmaneni/apex-core APEXCORE-714
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/apex-core/pull/522.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #522
----
commit 72f56dda9d4d244bbbf23ccde657435b94267362
Author: Pramod Immaneni <[email protected]>
Date: 2017-03-08T03:29:02Z
APEXCORE-714 Adding a new recovery mode where the operator instance before
a failure event can be reused when recovering from an upstream operator failure
----
> Reusable instance operator recovery
> -----------------------------------
>
> Key: APEXCORE-714
> URL: https://issues.apache.org/jira/browse/APEXCORE-714
> Project: Apache Apex Core
> Issue Type: Improvement
> Reporter: Pramod Immaneni
> Assignee: Pramod Immaneni
>
> In a failure scenario, when a container fails, it is redeployed along with
> all the operators in it. The operators downstream of these operators are
> also redeployed within their own containers. The operators are restored from
> their checkpoints and connect to the appropriate point in the stream
> according to the processing mode. In at-least-once mode, for example, the
> data is replayed from the checkpoint onwards.
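> For context, this is roughly how an application pins down the processing
> mode of an operator today. It is a minimal sketch against the existing Apex
> API; the WordGenerator/WordCounter operators are placeholders written for
> this sketch, not library operators:
>
>   import org.apache.hadoop.conf.Configuration;
>
>   import com.datatorrent.api.Context.OperatorContext;
>   import com.datatorrent.api.DAG;
>   import com.datatorrent.api.DefaultInputPort;
>   import com.datatorrent.api.DefaultOutputPort;
>   import com.datatorrent.api.InputOperator;
>   import com.datatorrent.api.Operator.ProcessingMode;
>   import com.datatorrent.api.StreamingApplication;
>   import com.datatorrent.common.util.BaseOperator;
>
>   public class Application implements StreamingApplication
>   {
>     // Placeholder input operator, just to make the sketch self-contained.
>     public static class WordGenerator extends BaseOperator implements InputOperator
>     {
>       public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();
>
>       @Override
>       public void emitTuples()
>       {
>         output.emit("word");
>       }
>     }
>
>     // Placeholder downstream operator with some state worth preserving.
>     public static class WordCounter extends BaseOperator
>     {
>       private long count;
>
>       public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
>       {
>         @Override
>         public void process(String tuple)
>         {
>           count++;
>         }
>       };
>     }
>
>     @Override
>     public void populateDAG(DAG dag, Configuration conf)
>     {
>       WordGenerator upstream = dag.addOperator("generator", new WordGenerator());
>       WordCounter downstream = dag.addOperator("counter", new WordCounter());
>       dag.addStream("words", upstream.output, downstream.input);
>       // At-least-once: after a failure the downstream operator is restored
>       // from checkpoint and the data is replayed from that point onwards.
>       dag.setAttribute(downstream, OperatorContext.PROCESSING_MODE,
>           ProcessingMode.AT_LEAST_ONCE);
>     }
>   }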
> Restoring an operator's state from a checkpoint could turn out to be a
> costly operation depending on the size of the state. In some use cases,
> depending on the operator logic, when there is an upstream failure, reusing
> the current operator instance without restoring it from checkpoint will
> still produce the same results, with the data replayed from the window after
> the last fully processed window. The operator state can remain the same as
> it was before the upstream failure by reusing the same operator instance,
> with only the streams and the window reset to the window after the last
> fully processed window, to guarantee at-least-once processing of the tuples.
> If the container where the operator itself is running goes down, it would of
> course need to be restored from the checkpoint. This scenario occurs in some
> batch use cases with operators that have a large state. The sketch below
> summarizes the intended recovery decision.
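> The following is an illustrative sketch only, not actual engine code; the
> types and field names are simplified stand-ins for the engine's
> deploy/recovery bookkeeping:
>
>   // Simplified stand-in for the deploy/recovery information of an operator.
>   class RecoveryInfo
>   {
>     boolean upstreamFailureOnly;     // failure happened upstream, not in this container
>     boolean reuseInstance;           // operator is marked reusable-instance
>     long lastFullyProcessedWindowId; // last window the operator fully processed
>     long checkpointWindowId;         // window id of the latest checkpoint
>   }
>
>   // Simplified stand-in for the node hosting the operator instance.
>   interface OperatorNode
>   {
>     // Rewind the input streams to the given window, keeping the live
>     // operator instance and its in-memory state.
>     void rewindStreams(long windowId);
>
>     // Restore the operator state from checkpoint and reconnect the
>     // streams at the checkpointed window.
>     void restoreFromCheckpoint(long windowId);
>   }
>
>   class RecoverySketch
>   {
>     void recover(RecoveryInfo info, OperatorNode node)
>     {
>       if (info.upstreamFailureOnly && info.reuseInstance) {
>         // Keep the instance and its state; replay tuples from the window
>         // after the last fully processed one (at-least-once guarantee).
>         node.rewindStreams(info.lastFullyProcessedWindowId + 1);
>       } else {
>         // The operator's own container failed, or default behavior:
>         // full restore from checkpoint.
>         node.restoreFromCheckpoint(info.checkpointWindowId);
>       }
>     }
>   }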
> I would like to propose adding the ability for a user to explicitly mark
> operators as being of this type, along with the corresponding functionality
> in the engine to handle their recovery in the way described above: not
> restoring their state from checkpoint, reusing the existing instance, and
> restoring the stream to the window after the last fully processed window for
> the operator. When operators are not marked as being of this type, the
> default behavior is what it is today and nothing changes.
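> On the user-facing side, this could look like the snippet below, continuing
> the populateDAG() sketch above. The attribute name is hypothetical, for
> illustration only; the actual name and API are subject to the review of this
> pull request:
>
>   // Hypothetical attribute; the final name/API would be settled in review.
>   dag.setAttribute(downstream,
>       OperatorContext.REUSE_INSTANCE_ON_UPSTREAM_FAILURE, true);
>   // Operators without the attribute keep today's checkpoint-restore recovery.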
> I have done some prototyping on the engine side to ensure that this is
> possible with our current code base without requiring a massive overhaul.
> This covered, in particular, reusing the operator instance within the Node
> in the streaming container, and re-establishing the subscriber stream in the
> buffer server at a window that the publisher (upstream) has not yet reached,
> as the publisher would be restarting from its checkpoint. I have been able
> to get it all working successfully.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)