[jira] [Commented] (APEXCORE-714) Reusable instance operator recovery

Pramod Immaneni (JIRA) Wed, 06 Sep 2017 09:06:09 -0700

    [ 
https://issues.apache.org/jira/browse/APEXCORE-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155592#comment-16155592
 ]


Pramod Immaneni commented on APEXCORE-714:
------------------------------------------

@tweise Since this kind of recovery can apply for any of the "processing 
modes", using a separate recovery mode attribute for this.

> Reusable instance operator recovery
> -----------------------------------
>
>                 Key: APEXCORE-714
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-714
>             Project: Apache Apex Core
>          Issue Type: Improvement
>            Reporter: Pramod Immaneni
>            Assignee: Pramod Immaneni
>
> In a failure scenario, when a container fails, it is redeployed along with 
> all the operators in it. The operators downstream to these operators are also 
> redeployed within their containers. The operators are restored from their 
> checkpoint and connect to the appropriate point in the stream according to 
> the processing mode. In at least once mode, for example, the data is replayed 
> from the same checkpoint
> Restoring an operator state from checkpoint could turn out to be a costly 
> operation depending on the size of the state. In some use cases, based on the 
> operator logic, when there is an upstream failure, without restoring the 
> operator from checkpoint and reusing the current instance, will still produce 
> the same results with the data replayed from the last fully processed window. 
> The operator state can remain the same as it was before the upstream failure 
> by reusing the same operator instance from before and only the streams and 
> window reset to the window after the last fully processed window to guarantee 
> the at least once processing of tuples. If the container where the operator 
> itself is running goes down, it would need to be restored from the checkpoint 
> of course. This scenario occurs in some batch use cases with operators that 
> have a large state.
> I would like to propose adding the ability for a user to explicitly identify 
> operators to be of this type and the corresponding functionality in the 
> engine to handle their recovery in the way described above by not restoring 
> their state from checkpoint, reusing the instance and restoring the stream to 
> the window after the last fully processed window for the operator. When 
> operators are not identified to be of this type, the default behavior is what 
> it is today and nothing changes.
> I have done some prototyping on the engine side to ensure that this is 
> possible with our current code base without requiring a massive overhaul, 
> especially the restoration of the operator instance within the Node in the 
> streaming container, the re-establishment of the subscriber stream to a 
> window in the buffer server where the publisher (upstream) hasn't yet reached 
> as it would be restarting from checkpoint and have been able to get it all 
> working successfully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (APEXCORE-714) Reusable instance operator recovery

Reply via email to