[ 
https://issues.apache.org/jira/browse/APEXCORE-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15991331#comment-15991331
 ] 

ASF GitHub Bot commented on APEXCORE-714:
-----------------------------------------

GitHub user PramodSSImmaneni opened a pull request:

    https://github.com/apache/apex-core/pull/522

    APEXCORE-714 Adding a new recovery mode where the operator instance before 
a failure event can be reused when recovering from an upstream operator failure 
[Review Only]

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/PramodSSImmaneni/apex-core APEXCORE-714

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-core/pull/522.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #522
    
----
commit 72f56dda9d4d244bbbf23ccde657435b94267362
Author: Pramod Immaneni <pra...@datatorrent.com>
Date:   2017-03-08T03:29:02Z

    APEXCORE-714 Adding a new recovery mode where the operator instance before 
a failure event can be reused when recovering from an upstream operator failure

----


> Reusable instance operator recovery
> -----------------------------------
>
>                 Key: APEXCORE-714
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-714
>             Project: Apache Apex Core
>          Issue Type: Improvement
>            Reporter: Pramod Immaneni
>            Assignee: Pramod Immaneni
>
> In a failure scenario, when a container fails, it is redeployed along with 
> all the operators in it. The operators downstream to these operators are also 
> redeployed within their containers. The operators are restored from their 
> checkpoint and connect to the appropriate point in the stream according to 
> the processing mode. In at least once mode, for example, the data is replayed 
> from the same checkpoint
> Restoring an operator state from checkpoint could turn out to be a costly 
> operation depending on the size of the state. In some use cases, based on the 
> operator logic, when there is an upstream failure, without restoring the 
> operator from checkpoint and reusing the current instance, will still produce 
> the same results with the data replayed from the last fully processed window. 
> The operator state can remain the same as it was before the upstream failure 
> by reusing the same operator instance from before and only the streams and 
> window reset to the window after the last fully processed window to guarantee 
> the at least once processing of tuples. If the container where the operator 
> itself is running goes down, it would need to be restored from the checkpoint 
> of course. This scenario occurs in some batch use cases with operators that 
> have a large state.
> I would like to propose adding the ability for a user to explicitly identify 
> operators to be of this type and the corresponding functionality in the 
> engine to handle their recovery in the way described above by not restoring 
> their state from checkpoint, reusing the instance and restoring the stream to 
> the window after the last fully processed window for the operator. When 
> operators are not identified to be of this type, the default behavior is what 
> it is today and nothing changes.
> I have done some prototyping on the engine side to ensure that this is 
> possible with our current code base without requiring a massive overhaul, 
> especially the restoration of the operator instance within the Node in the 
> streaming container, the re-establishment of the subscriber stream to a 
> window in the buffer server where the publisher (upstream) hasn't yet reached 
> as it would be restarting from checkpoint and have been able to get it all 
> working successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to