Thanks Till, I will continue to follow this issue and see what we can do.

Best regards,
Anyang

Till Rohrmann <trohrm...@apache.org> wrote on Wed, Sep 11, 2019, at 5:12 PM:

> Suggestion 1 makes sense. For the quick termination, I think we need to
> think a bit more to find a good solution that also supports strict
> SLA requirements.
>
> Cheers,
> Till
>
> On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu <huanyang1...@gmail.com> wrote:
>
>> Hi Till,
>>
>> Some of our online batch jobs have strict SLA requirements and must not
>> be stuck for a long time, so we took a blunt approach and made the job
>> exit immediately. Waiting for the connection to recover is the better
>> solution. Perhaps we should add a timeout while waiting for the JM to
>> restore the connection?
>>
>> Regarding suggestion 1 (making the interval configurable): we have
>> already implemented it internally, and we would be happy to contribute
>> it back to the community if possible.
>>
>> Best regards,
>> Anyang
>>
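The timeout idea above could look roughly like the following sketch. Everything here is an illustrative assumption, not existing Flink API: `ReconnectWait`, `awaitReconnect`, and the future (which would be completed when the JM re-registers) are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not Flink code: wait a bounded time for the JM
// connection to be restored before deciding to fail fast.
public class ReconnectWait {

    // Returns true if the JM reconnected within timeoutMs, false otherwise.
    public static boolean awaitReconnect(CompletableFuture<Void> reconnected, long timeoutMs) {
        try {
            reconnected.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;  // JM came back within the deadline
        } catch (TimeoutException e) {
            return false; // deadline exceeded: caller may now fail fast for strict SLAs
        } catch (InterruptedException | ExecutionException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A strict-SLA job could treat a `false` result as fatal and exit immediately, while other jobs simply keep waiting for the JM to come back.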
Till Rohrmann <trohrm...@apache.org> wrote on Mon, Sep 9, 2019, at 3:09 PM:
>>
>>> Hi Anyang,
>>>
>>> I don't think we can take your proposal, because it means that whenever
>>> we want to call notifyAllocationFailure while there is a connection
>>> problem between the RM and the JM, we fail the whole cluster. A robust
>>> and resilient system should not do that: connection problems are
>>> expected and need to be handled gracefully. Instead, if one deems the
>>> notifyAllocationFailure message to be very important, one should keep
>>> it and deliver it to the JM once it has reconnected.
>>>
>>> Cheers,
>>> Till
>>>
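Till's suggestion of keeping the notifyAllocationFailure message and replaying it after the JM reconnects could be sketched as below. `PendingFailureBuffer` and all names in it are hypothetical illustrations, not Flink code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Flink code: instead of failing the cluster when
// the JM is unreachable, the RM queues the notification per job and replays
// it once the JM re-registers.
public class PendingFailureBuffer {

    public record PendingNotification(String jobId, String allocationId, String cause) {}

    private final Map<String, List<PendingNotification>> pendingPerJob = new HashMap<>();

    // Called when notifyAllocationFailure finds no registered JM for jobId.
    public void buffer(String jobId, String allocationId, String cause) {
        pendingPerJob.computeIfAbsent(jobId, k -> new ArrayList<>())
                .add(new PendingNotification(jobId, allocationId, cause));
    }

    // Called once the JM for jobId re-registers; the caller forwards each entry.
    public List<PendingNotification> drainOnReconnect(String jobId) {
        List<PendingNotification> pending = pendingPerJob.remove(jobId);
        return pending == null ? List.of() : pending;
    }
}
```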
>>> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <huanyang1...@gmail.com>
>>> wrote:
>>>
>>>> Hi Peter,
>>>>
>>>> For our online batch jobs, there is a scenario where the number of
>>>> failed containers reaches MAXIMUM_WORKERS_FAILURE_RATE but the client
>>>> does not exit immediately (the probability of losing the JM rises
>>>> sharply when thousands of containers are being started). We found that
>>>> a JM disconnection (the reason for the JM loss is unknown) causes
>>>> notifyAllocationFailure to have no effect.
>>>>
>>>> Since FLINK-13184
>>>> <https://jira.apache.org/jira/browse/FLINK-13184> introduced starting
>>>> containers with multiple threads, the JM disconnection situation has
>>>> been alleviated. To make the client exit immediately and reliably, we
>>>> use the following code to decide whether to call onFatalError when a
>>>> MaximumFailedTaskManagerExceedingException occurs:
>>>>
>>>> @Override
>>>> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>>>>     validateRunsInMainThread();
>>>>
>>>>     JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>>>>     if (jobManagerRegistration != null) {
>>>>         // JM is connected: forward the allocation failure as usual.
>>>>         jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>>>>     } else {
>>>>         // JM is lost: optionally treat this as fatal so the client exits immediately.
>>>>         if (exitProcessOnJobManagerTimedout) {
>>>>             ResourceManagerException exception = new ResourceManagerException(
>>>>                 "Job Manager is lost, can not notify allocation failure.");
>>>>             onFatalError(exception);
>>>>         }
>>>>     }
>>>> }
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Anyang
>>>>
>>>>
