[ https://issues.apache.org/jira/browse/APEXCORE-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15991331#comment-15991331 ]
ASF GitHub Bot commented on APEXCORE-714:
-----------------------------------------

GitHub user PramodSSImmaneni opened a pull request:

    https://github.com/apache/apex-core/pull/522

APEXCORE-714 Adding a new recovery mode where the operator instance before a failure event can be reused when recovering from an upstream operator failure [Review Only]

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/PramodSSImmaneni/apex-core APEXCORE-714

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-core/pull/522.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #522

----
commit 72f56dda9d4d244bbbf23ccde657435b94267362
Author: Pramod Immaneni <pra...@datatorrent.com>
Date:   2017-03-08T03:29:02Z

    APEXCORE-714 Adding a new recovery mode where the operator instance before a failure event can be reused when recovering from an upstream operator failure
----

> Reusable instance operator recovery
> -----------------------------------
>
>                 Key: APEXCORE-714
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-714
>             Project: Apache Apex Core
>          Issue Type: Improvement
>            Reporter: Pramod Immaneni
>            Assignee: Pramod Immaneni
>
> In a failure scenario, when a container fails, it is redeployed along with all the operators in it. The operators downstream of these operators are also redeployed within their containers. The operators are restored from their checkpoints and connect to the appropriate point in the stream according to the processing mode. In at-least-once mode, for example, the data is replayed from the same checkpoint.
>
> Restoring an operator's state from a checkpoint can be a costly operation, depending on the size of the state.
> In some use cases, depending on the operator logic, an upstream failure can be handled without restoring the operator from its checkpoint: reusing the current instance, with the data replayed from the window after the last fully processed window, will still produce the same results. The operator state can remain the same as it was before the upstream failure by reusing the same operator instance, with only the streams and window reset to the window after the last fully processed window, to guarantee at-least-once processing of tuples. If the container where the operator itself is running goes down, it would of course need to be restored from its checkpoint. This scenario occurs in some batch use cases with operators that have a large state.
>
> I would like to propose adding the ability for a user to explicitly identify operators as being of this type, along with the corresponding functionality in the engine to handle their recovery in the way described above: not restoring their state from checkpoint, reusing the instance, and resetting the stream to the window after the last fully processed window for the operator. When operators are not identified as being of this type, the default behavior remains what it is today and nothing changes.
>
> I have done some prototyping on the engine side to ensure that this is possible with our current code base without requiring a massive overhaul, especially the restoration of the operator instance within the Node in the streaming container and the re-establishment of the subscriber stream at a window in the buffer server that the publisher (upstream) hasn't yet reached, since it would be restarting from checkpoint. I have been able to get it all working successfully.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
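The recovery choice described in the issue can be illustrated with a small, self-contained sketch. This is not the actual Apex engine code: the names (`SumOperator`, `REUSE_INSTANCE`, `recoverAfterUpstreamFailure`) are hypothetical and simplified, and it only models the decision between "keep the live instance and replay from the window after the last fully processed window" versus the default "restore state from checkpoint and replay from the checkpointed window".

```java
// Hypothetical sketch of the recovery mode proposed in APEXCORE-714.
// Not the real Apex API; all names here are illustrative only.
public class ReusableRecoverySketch {
    // Stands in for the proposed per-operator marking.
    static final boolean REUSE_INSTANCE = true;

    static class SumOperator {
        long state;                    // in-memory state (a running sum)
        long checkpointedState;        // state saved at the last checkpoint
        long lastFullyProcessedWindow; // highest window fully processed
        long checkpointWindow;         // window at which the checkpoint was taken

        void processWindow(long windowId, long tuple) {
            state += tuple;
            lastFullyProcessedWindow = windowId;
        }

        void checkpoint(long windowId) {
            checkpointedState = state;
            checkpointWindow = windowId;
        }

        // Returns the window id from which the upstream must replay.
        long recoverAfterUpstreamFailure(boolean reuseInstance) {
            if (reuseInstance) {
                // Proposed mode: keep the current instance and its state;
                // replay only from the window after the last fully
                // processed one (at-least-once is preserved).
                return lastFullyProcessedWindow + 1;
            }
            // Default mode: discard in-memory state, restore from the
            // checkpoint, and replay everything after the checkpoint.
            state = checkpointedState;
            return checkpointWindow + 1;
        }
    }

    public static void main(String[] args) {
        SumOperator op = new SumOperator();
        op.processWindow(1, 10);
        op.checkpoint(1);           // checkpoint taken at window 1
        op.processWindow(2, 20);
        op.processWindow(3, 30);    // windows 2 and 3 processed since then

        // Upstream fails: with instance reuse, only window 4 onward is
        // replayed and the state built during windows 2-3 is kept.
        long replayFrom = op.recoverAfterUpstreamFailure(REUSE_INSTANCE);
        System.out.println("replayFrom=" + replayFrom + " state=" + op.state);
    }
}
```

The sketch shows why this helps operators with a large state: the reuse path avoids deserializing `checkpointedState` entirely and replays fewer windows, while the default path must roll the state back and replay from the checkpoint.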