In case of application failure, we would like to be able to restart the
application quickly while keeping the old state for failure analysis. The
same problem also arises when we want to start from a savepoint, since we
would need to copy the state from the savepoint to the application.
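
For illustration only, the selective, parallel copy proposed in the quoted
mail could look roughly like the sketch below. It is not Apex code: the class
SelectiveStateCopy, the method copyState and the directory name "recordings"
are made up for the example, and it uses java.nio on a local filesystem
instead of the Hadoop FileSystem API so that it stays self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of Apex): copies an application directory
// with a thread pool while skipping debug-only top-level subdirectories.
public class SelectiveStateCopy {

  // Copies every regular file under oldAppDir to newAppDir, except files
  // whose first path component is listed in skipTopLevel.
  // Returns the number of files copied.
  public static int copyState(Path oldAppDir, Path newAppDir,
      Set<String> skipTopLevel, int threads) throws Exception {
    List<Path> files;
    try (Stream<Path> walk = Files.walk(oldAppDir)) {
      files = walk.filter(Files::isRegularFile)
          .filter(p -> !skipTopLevel.contains(
              oldAppDir.relativize(p).getName(0).toString()))
          .collect(Collectors.toList());
    }
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (Path src : files) {
        Path dst = newAppDir.resolve(oldAppDir.relativize(src).toString());
        pending.add(pool.submit(() -> {
          try {
            Files.createDirectories(dst.getParent());
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // wait for each copy and surface any failure
      }
      return pending.size();
    } finally {
      pool.shutdown();
    }
  }
}
```

A call such as copyState(oldDir, newDir, skipSet, 8), with skipSet containing
"recordings", would leave the stats recordings behind and copy everything
else with eight threads. Reading operator checkpoints lazily from the
containers would still need support inside the platform itself.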

-Tushar.



On Tue, Sep 20, 2016 at 8:34 PM, Sandesh Hegde <sand...@datatorrent.com> wrote:
> How about re-launching the app from the same location?
>
> If they want to keep the state, we can provide a savepoint feature.
>
> On Tue, Sep 20, 2016 at 4:39 AM Tushar Gosavi <tus...@datatorrent.com>
> wrote:
>
>> We have observed that application relaunch takes a long time.
>> One major reason for the delay in application startup during relaunch
>> is the time taken to copy the state of the existing application to the
>> new application. This state can grow to GBs, and the copy is performed in
>> a single thread before the new application is submitted to YARN.
>>
>> The state of the previous application consists of:
>> - jars
>> - stram checkpoint/recovery file
>> - events
>> - container file
>> - stats recording, if enabled
>> - operator checkpoints
>> - operator data
>>
>> We could avoid copying debugging data such as stats recordings, which can
>> run into TBs for a long-running application and are not required for the
>> new application to function.
>>
>> Similarly, operator checkpoints could be read in parallel when the
>> operators are launched for the first time. This would also ensure that
>> only the required checkpoints are copied, and the copying would be done
>> in parallel by multiple containers/threads.
>>
>> For operator data stored in the application directory, we could copy it
>> completely for now, but in the future we could provide a callback that
>> allows an operator partition to read only the required state from the
>> previous location.
>>
>> Let me know your thoughts on this.
>>
>> Regards,
>> - Tushar.
>>
