[ https://issues.apache.org/jira/browse/APEXCORE-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155592#comment-16155592 ]
Pramod Immaneni commented on APEXCORE-714: ------------------------------------------ @tweise Since this kind of recovery can apply for any of the "processing modes", using a separate recovery mode attribute for this. > Reusable instance operator recovery > ----------------------------------- > > Key: APEXCORE-714 > URL: https://issues.apache.org/jira/browse/APEXCORE-714 > Project: Apache Apex Core > Issue Type: Improvement > Reporter: Pramod Immaneni > Assignee: Pramod Immaneni > > In a failure scenario, when a container fails, it is redeployed along with > all the operators in it. The operators downstream to these operators are also > redeployed within their containers. The operators are restored from their > checkpoint and connect to the appropriate point in the stream according to > the processing mode. In at least once mode, for example, the data is replayed > from the same checkpoint > Restoring an operator state from checkpoint could turn out to be a costly > operation depending on the size of the state. In some use cases, based on the > operator logic, when there is an upstream failure, without restoring the > operator from checkpoint and reusing the current instance, will still produce > the same results with the data replayed from the last fully processed window. > The operator state can remain the same as it was before the upstream failure > by reusing the same operator instance from before and only the streams and > window reset to the window after the last fully processed window to guarantee > the at least once processing of tuples. If the container where the operator > itself is running goes down, it would need to be restored from the checkpoint > of course. This scenario occurs in some batch use cases with operators that > have a large state. > I would like to propose adding the ability for a user to explicitly identify > operators to be of this type and the corresponding functionality in the > engine to handle their recovery in the way described above by not restoring > their state from checkpoint, reusing the instance and restoring the stream to > the window after the last fully processed window for the operator. When > operators are not identified to be of this type, the default behavior is what > it is today and nothing changes. > I have done some prototyping on the engine side to ensure that this is > possible with our current code base without requiring a massive overhaul, > especially the restoration of the operator instance within the Node in the > streaming container, the re-establishment of the subscriber stream to a > window in the buffer server where the publisher (upstream) hasn't yet reached > as it would be restarting from checkpoint and have been able to get it all > working successfully. -- This message was sent by Atlassian JIRA (v6.4.14#64029)