Thanks Till, I will continue to follow this issue and see what we can do. Best regards, Anyang
Till Rohrmann <trohrm...@apache.org> 于2019年9月11日周三 下午5:12写道: > Suggestion 1 makes sense. For the quick termination I think we need to > think a bit more about it to find a good solution also to support strict > SLA requirements. > > Cheers, > Till > > On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu <huanyang1...@gmail.com> wrote: > >> Hi Till, >> >> Some of our online batch tasks have strict SLA requirements, and they are >> not allowed to be stuck for a long time. Therefore, we take a rude way to >> make the job exit immediately. The way to wait for connection recovery is a >> better solution. Maybe we need to add a timeout to wait for JM to restore >> the connection? >> >> For suggestion 1, make interval configurable, given that we have done it, >> and if we can, we hope to give back to the community. >> >> Best regards, >> Anyang >> >> Till Rohrmann <trohrm...@apache.org> 于2019年9月9日周一 下午3:09写道: >> >>> Hi Anyang, >>> >>> I think we cannot take your proposal because this means that whenever we >>> want to call notifyAllocationFailure when there is a connection problem >>> between the RM and the JM, then we fail the whole cluster. This is >>> something a robust and resilient system should not do because connection >>> problems are expected and need to be handled gracefully. Instead if one >>> deems the notifyAllocationFailure message to be very important, then one >>> would need to keep it and tell the JM once it has connected back. >>> >>> Cheers, >>> Till >>> >>> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <huanyang1...@gmail.com> >>> wrote: >>> >>>> Hi Peter, >>>> >>>> For our online batch task, there is a scene where the failed Container >>>> reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately >>>> exit (the probability of JM loss is greatly improved when thousands of >>>> Containers is to be started). It is found that the JM disconnection (the >>>> reason for JM loss is unknown) will cause the notifyAllocationFailure not >>>> to take effect. >>>> >>>> After the introduction of FLINK-13184 >>>> <https://jira.apache.org/jira/browse/FLINK-13184> to start the >>>> container with multi-threaded, the JM disconnection situation has been >>>> alleviated. In order to stably implement the client immediate exit, we use >>>> the following code to determine whether call onFatalError when >>>> MaximumFailedTaskManagerExceedingException is occurd: >>>> >>>> @Override >>>> public void notifyAllocationFailure(JobID jobId, AllocationID >>>> allocationId, Exception cause) { >>>> validateRunsInMainThread(); >>>> >>>> JobManagerRegistration jobManagerRegistration = >>>> jobManagerRegistrations.get(jobId); >>>> if (jobManagerRegistration != null) { >>>> >>>> jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, >>>> cause); >>>> } else { >>>> if (exitProcessOnJobManagerTimedout) { >>>> ResourceManagerException exception = new >>>> ResourceManagerException("Job Manager is lost, can not notify allocation >>>> failure."); >>>> onFatalError(exception); >>>> } >>>> } >>>> } >>>> >>>> >>>> Best regards, >>>> >>>> Anyang >>>> >>>>