[
https://issues.apache.org/jira/browse/APEXCORE-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684121#comment-15684121
]
Tushar Gosavi commented on APEXCORE-575:
----------------------------------------
We have observed that application relaunch takes long time.
The one major reason for delay in application startup during relaunch
is time taken to copy state of exisitng application to new application.
This state could grow in GBs and copy is performed in single thread before
new application is submitted to Yarn.
The state of previous application constists
- jars
- stram checkpoint/recovery file.
- events
- container file
- stats recording if enabled.
- operator checkpoints
- operator data.
We could avoid copying debugging data like stat recording which could
run in TB for long running application and is not required for functioning of
new application.
Similarly operator checkpoints could be read in parallel when they are
launched for first time, This will also help in copying only required
checkpoints and will be
done in parallel by multiple containers/threads.
For operator data stored in application directory, we could copy it
completely for now, but in future we could provide an callback which will allow
operator
partition to read only required state from previous location.
> Improve application relaunch time.
> ----------------------------------
>
> Key: APEXCORE-575
> URL: https://issues.apache.org/jira/browse/APEXCORE-575
> Project: Apache Apex Core
> Issue Type: Improvement
> Reporter: Tushar Gosavi
> Assignee: Tushar Gosavi
>
> Improve application relaunch time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)