Re: When should the RETAIN_ON_CANCELLATION option be used?

2018-09-25 Thread vino yang
Hi Henry,

Your understanding is correct. Checkpoints exist for recovery purposes.
If you cancel a job, Flink assumes there is no point in keeping the
checkpoint. If you want to recover after a cancel, you should use
cancel with savepoint. So, by default, you don't need to manually clean up
checkpoint metadata unless you plan to use externalized checkpoints.
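
To make the behaviour discussed in this thread concrete, here is a minimal
sketch in plain Java. It is only a model: the class
`CheckpointRetentionSketch`, the `TerminalStatus` enum and the method
`checkpointAvailable` are hypothetical names invented for illustration and
are not part of Flink. Only the two option names come from the documentation
quoted in this thread, and the sketch assumes externalized checkpoints are
enabled.

```java
// Hypothetical model of the behaviour described in this thread; not Flink's API.
final class CheckpointRetentionSketch {

    /** Mirrors the two option names quoted from the Flink 1.6 documentation. */
    enum Cleanup { RETAIN_ON_CANCELLATION, DELETE_ON_CANCELLATION }

    /** Only the two terminal states discussed in this thread. */
    enum TerminalStatus { CANCELED, FAILED }

    /** Returns true if the externalized checkpoint is still available after the job ends. */
    static boolean checkpointAvailable(Cleanup cleanup, TerminalStatus status) {
        if (status == TerminalStatus.FAILED) {
            // Both options keep the checkpoint when the job fails.
            return true;
        }
        // On cancellation, only RETAIN_ON_CANCELLATION keeps it.
        return cleanup == Cleanup.RETAIN_ON_CANCELLATION;
    }

    public static void main(String[] args) {
        for (Cleanup c : Cleanup.values()) {
            for (TerminalStatus s : TerminalStatus.values()) {
                System.out.println(c + " + " + s + " -> available=" + checkpointAvailable(c, s));
            }
        }
    }
}
```

With DELETE_ON_CANCELLATION the only case without a checkpoint is a clean
cancel, which is exactly the case where cancel with savepoint applies.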

Thanks, vino.


Re: When should the RETAIN_ON_CANCELLATION option be used?

2018-09-25 Thread 徐涛
Hi Vino,
So I will use the default setting of DELETE_ON_CANCELLATION. When the
program is cancelled, the checkpoint will be deleted; when the program
fails, the checkpoint will not be deleted, so I will still have a
checkpoint to resume from.
Please correct me if I am wrong.

Thanks.

Best 
Henry


Re: When should the RETAIN_ON_CANCELLATION option be used?

2018-09-25 Thread vino yang
Hi Henry,

I have added my comments inline, in blue, to your original email.

Thanks, vino.

徐涛 wrote on Tue, Sep 25, 2018 at 12:56 PM:

> Hi Vino,
> What is the definition of job cancel and job fail, and what is the
> difference between them?
> Can I say that if the program is shut down manually, it is a job cancel,
> and if the program is shut down due to some error, it is a job fail?
>
>
This is not entirely true: a manually triggered cancel may also lead to
failure. Think of it this way: if a person triggers the cancel and every
task instance is canceled correctly, then the job's final status is
canceled. If the job terminates because of some anomaly, its final status
is failed.
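
The rule above can be written down as a tiny decision function. This is a
sketch only: `JobOutcomeSketch`, `FinalStatus` and `finalStatus` are
hypothetical names, not Flink classes, and the model covers just the two
outcomes discussed here (it ignores jobs that finish normally).

```java
// Hypothetical model of how the final job status is decided; not Flink's API.
final class JobOutcomeSketch {

    /** The two terminal outcomes discussed in this thread. */
    enum FinalStatus { CANCELED, FAILED }

    /**
     * @param cancelRequested         a person triggered the cancel
     * @param allTasksCanceledCleanly every task instance was canceled correctly
     */
    static FinalStatus finalStatus(boolean cancelRequested, boolean allTasksCanceledCleanly) {
        if (cancelRequested && allTasksCanceledCleanly) {
            return FinalStatus.CANCELED;
        }
        // Any anomaly, including one during a requested cancel, ends in FAILED.
        return FinalStatus.FAILED;
    }

    public static void main(String[] args) {
        System.out.println(finalStatus(true, true));
        System.out.println(finalStatus(true, false));
        System.out.println(finalStatus(false, false));
    }
}
```

Note the middle case: a cancel that does not complete cleanly still counts
as a failure in this model, which is the point made above.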


> This is important because it is the prerequisite for the following
> question:
>
> The Flink 1.6 documentation says:
> "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: Retain the
> checkpoint when the job is cancelled. Note that you have to manually clean
> up the checkpoint state after cancellation in this case.
> ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: Delete the
> checkpoint when the job is cancelled. The checkpoint state will only be
> available if the job fails."
> But it does not say whether the checkpoint will be retained on failure.
> If checkpoints are handled on failure the same way as on cancellation,
> then I have to use RETAIN_ON_CANCELLATION, because otherwise the
> checkpoint will be deleted when the job fails.
> If checkpoints are not deleted on failure, then I am safe when the job
> fails.
>

In the configuration there are two enumeration classes,
`CheckpointRetentionPolicy` and `ExternalizedCheckpointCleanup`; you need
to consider which configuration you want to use. Your main concern is
ExternalizedCheckpointCleanup, which controls cleanup of the metadata for
externalized checkpoints. Are you sure you want to use it? By default,
Flink manages checkpoint cleanup itself, i.e. checkpoints are not
externalized.
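
As a rough illustration of how the two enumerations relate, here is a
standalone Java sketch. The enums are local stand-ins that reuse the Flink
class names mentioned above; the constant names of
`CheckpointRetentionPolicy` and the mapping itself reflect my reading of the
Flink source and should be treated as an assumption to verify against your
Flink version. `RetentionMappingSketch` and `toRetentionPolicy` are
hypothetical names.

```java
import java.util.Optional;

// Standalone model; the enums below are local stand-ins, not imports from Flink.
final class RetentionMappingSketch {

    /** User-facing option for externalized checkpoints. */
    enum ExternalizedCheckpointCleanup { RETAIN_ON_CANCELLATION, DELETE_ON_CANCELLATION }

    /** Internal policy (constant names assumed, see lead-in). */
    enum CheckpointRetentionPolicy {
        RETAIN_ON_CANCELLATION, RETAIN_ON_FAILURE, NEVER_RETAIN_AFTER_TERMINATION
    }

    /** An empty Optional models the default: checkpoints are not externalized. */
    static CheckpointRetentionPolicy toRetentionPolicy(Optional<ExternalizedCheckpointCleanup> cleanup) {
        if (!cleanup.isPresent()) {
            return CheckpointRetentionPolicy.NEVER_RETAIN_AFTER_TERMINATION;
        }
        return cleanup.get() == ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION
                ? CheckpointRetentionPolicy.RETAIN_ON_FAILURE
                : CheckpointRetentionPolicy.RETAIN_ON_CANCELLATION;
    }

    public static void main(String[] args) {
        System.out.println(toRetentionPolicy(Optional.empty()));
        for (ExternalizedCheckpointCleanup c : ExternalizedCheckpointCleanup.values()) {
            System.out.println(c + " -> " + toRetentionPolicy(Optional.of(c)));
        }
    }
}
```

Read this way, DELETE_ON_CANCELLATION does not mean "never retain": under
the stated assumption it corresponds to retaining the checkpoint on failure.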


> Best
> Henry
>
>


Re: When should the RETAIN_ON_CANCELLATION option be used?

2018-09-24 Thread 徐涛
Hi Vino,
What is the definition of job cancel and job fail, and what is the
difference between them?
Can I say that if the program is shut down manually, it is a job cancel,
and if the program is shut down due to some error, it is a job fail?

This is important because it is the prerequisite for the following
question:

The Flink 1.6 documentation says:
"ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: Retain the
checkpoint when the job is cancelled. Note that you have to manually clean up
the checkpoint state after cancellation in this case.
ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: Delete the
checkpoint when the job is cancelled. The checkpoint state will only be
available if the job fails."
But it does not say whether the checkpoint will be retained on failure.
If checkpoints are handled on failure the same way as on cancellation,
then I have to use RETAIN_ON_CANCELLATION, because otherwise the checkpoint
will be deleted when the job fails.
If checkpoints are not deleted on failure, then I am safe when the job
fails.

Best 
Henry   




Re: When should the RETAIN_ON_CANCELLATION option be used?

2018-09-24 Thread vino yang
Hi Henry,

To answer your questions:

What is the definition of job cancel and job fail, and what is the
difference between them?

> Both cancellation and failure cause the job to enter a terminal state.
But cancellation is triggered manually and the job terminates normally,
while failure is usually a passive termination caused by an exception.

If I use the DELETE_ON_CANCELLATION option, will I have a checkpoint to
resume the program from in this case?

> No. If you use externalized checkpoints, you cannot resume from them
after the job has been cancelled.

I mean, suppose I can guarantee that a savepoint is always taken before a
manual cancellation. If I use the DELETE_ON_CANCELLATION option for
checkpoints, is there any chance that I will not have a checkpoint to
recover from?

> According to the latest source code, savepoints are not affected by
CheckpointRetentionPolicy; they need to be cleaned up manually.

Thanks, vino.



Re: When should the RETAIN_ON_CANCELLATION option be used?

2018-09-24 Thread 徐涛
Hi All,
I mean, suppose I can guarantee that a savepoint is always taken before a
manual cancellation. If I use the DELETE_ON_CANCELLATION option for
checkpoints, is there any chance that I will not have a checkpoint to
recover from?
Thanks a lot.

Best
Henry




When should the RETAIN_ON_CANCELLATION option be used?

2018-09-24 Thread 徐涛
Hi All,
The Flink documentation says:
DELETE_ON_CANCELLATION: “Delete the checkpoint when the job is
cancelled. The checkpoint state will only be available if the job fails.”
What is the definition of job cancel and job fail, and what is the
difference between them? Suppose I run the program on YARN and, after a few
days, the YARN application fails for some reason.
If I use the DELETE_ON_CANCELLATION option, will I have a checkpoint to
resume the program from in this case?

If checkpoints are only deleted when I cancel the program, I can
always take a savepoint before cancellation. So it seems that I only ever
need to set DELETE_ON_CANCELLATION.
I cannot find a case where RETAIN_ON_CANCELLATION should be used.


Best
Henry