[ https://issues.apache.org/jira/browse/APEXCORE-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684250#comment-15684250 ]
Tushar Gosavi commented on APEXCORE-575:
----------------------------------------
In my tests, relaunching an application with 2 GB of checkpointed state (130
operators) takes around 2 minutes. Most of that time (106 seconds) is spent
copying this state.
{code}
241.3 K  723.8 K  datatorrent/apps/application_1478656869152_0327/bval-jsr303-0.5.jar
2.0 G    5.9 G    datatorrent/apps/application_1478656869152_0327/checkpoints
228.4 K  685.1 K  datatorrent/apps/application_1478656869152_0327/commons-beanutils-1.9.2.jar
{code}
{code}
16/11/21 01:26:34 INFO stram.StramClient: Restart from hdfs://node18.morado.com:8020/user/tushar/datatorrent/apps/application_1478656869152_0327
16/11/21 01:26:35 INFO stram.FSRecoveryHandler: Creating hdfs://node18.morado.com:8020/user/tushar/datatorrent/apps/application_1478656869152_0330/recovery/log
16/11/21 01:28:20 INFO stram.StramClient: copy of old state took 106398 ms << Time taken
16/11/21 01:28:21 INFO stram.StramClient: Set the environment for the application master
{code}
In some cases a downstream operator keeps crashing while the upstream operator
keeps taking checkpoints, which are not purged because of the downstream
failures. A relaunch is sometimes used to recover from this failure, and
copying the old application files can then delay the relaunch.
For the file-based storage agent, we could avoid the copy by keeping a
reference to the old checkpoint directory and using it for reading only, while
writing new checkpoints to the new application directory.
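A minimal sketch of this idea, to make the read and write paths concrete. It is
not the Apex storage agent API: the class name FallbackCheckpointStore, its
methods, and the "<operatorId>_<hex windowId>" file naming are hypothetical, and
it uses java.nio on a local file system instead of the Hadoop FileSystem API so
that it runs standalone.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the Apex StorageAgent interface.
// Reads fall back to the old application's checkpoint directory;
// writes only ever go to the new application's directory.
class FallbackCheckpointStore {
    private final Path oldDir; // previous application's checkpoints (read only)
    private final Path newDir; // relaunched application's checkpoints

    FallbackCheckpointStore(Path oldDir, Path newDir) {
        this.oldDir = oldDir;
        this.newDir = newDir;
    }

    // Assumed naming scheme for illustration only.
    private static String fileName(int operatorId, long windowId) {
        return operatorId + "_" + Long.toHexString(windowId);
    }

    // New checkpoints are written to the new directory, never the old one.
    void save(int operatorId, long windowId, byte[] state) {
        try {
            Files.createDirectories(newDir);
            Files.write(newDir.resolve(fileName(operatorId, windowId)), state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Prefer the new directory; otherwise read from the old one in place.
    byte[] load(int operatorId, long windowId) {
        String name = fileName(operatorId, windowId);
        Path fresh = newDir.resolve(name);
        Path source = Files.exists(fresh) ? fresh : oldDir.resolve(name);
        try {
            return Files.readAllBytes(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path oldDir = Files.createTempDirectory("old-checkpoints");
        Path newDir = Files.createTempDirectory("new-checkpoints");
        // State left behind by the previous application instance.
        Files.write(oldDir.resolve(fileName(1, 10L)), new byte[] {1, 2, 3});

        FallbackCheckpointStore store = new FallbackCheckpointStore(oldDir, newDir);
        // Recovery reads the old checkpoint without copying it first.
        System.out.println("recovered bytes: " + store.load(1, 10L).length);
        // The next checkpoint lands only in the new directory.
        store.save(1, 11L, new byte[] {4, 5});
        System.out.println("old dir untouched: "
                + !Files.exists(oldDir.resolve(fileName(1, 11L))));
    }
}
```

With this layout the relaunch only has to record the old directory's location,
so its cost would no longer depend on the size of the checkpointed state. The
old directory has to stay in place until the new application has written
checkpoints that supersede everything it still reads from there.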
> Improve application relaunch time.
> ----------------------------------
>
> Key: APEXCORE-575
> URL: https://issues.apache.org/jira/browse/APEXCORE-575
> Project: Apache Apex Core
> Issue Type: Improvement
> Reporter: Tushar Gosavi
> Assignee: Tushar Gosavi
>
> Improve application relaunch time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)