[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-434257204 @azagrebin OK, agree. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-434220095 Hi @azagrebin I am currently busy with several other PRs and hope it will be merged into Flink 1.7.0. When those PRs are fixed, I will come back and reorganize it. I think some logic of this PR is available, but the counting logic may need to be refactored. Can we put it aside for some time? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-423940208 @azagrebin thanks for your suggestion, I agree and reconsider more details. Considering that @tillrohrmann has said that the current checkpoint exception handler should also be implemented in `CheckpointCoordinator`. > I understand why you implemented it the way you did. I think the `setFailOnCheckpointingErrors` should actually also be handled by the `CheckpointCoordinator`/`JM` and not on the Task level (but this is a different story). This looks a little bit like a shortcut we made back in the days. Do we need to take this refactoring into account? Because this PR is actually a supplement to the checkpoint exception handler. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-423453821 hi @tillrohrmann what do you think about the latest implementation? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-422427072 hi @tillrohrmann I have refactored this PR and counted the failure number in the `CheckpointCoordinator`. I think I should push the implementation to let you estimate. I have tested the counter's run path, but I don't know if it is the [right way](https://github.com/apache/flink/pull/6567/files#diff-a38ea0fa799bdaa0b354d80cd8368c60R1010) of failing the `ExecutionGraph` . And maybe the test case I added have too much assert, I will reduce it. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-421948370 @tillrohrmann I originally wanted to do it in the checkpoint coordinator when I first implemented it, but when I touched setFailOnCheckpointingErrors, I was guided and turned to the TM. I have also thought about this issue recently and it seems that it is more reasonable to implement on the checkpoint coordinator. Well, I will implement it again based on the checkpoint coordinator. To @klion26 thanks for your opinion, will change the implementation. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-421931626 @tweise Any comments or opinions? The current count implementation is for a single sub task instance, not the job level. Maybe @tillrohrmann 's idea is more reasonable? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-421282760 Hi @tillrohrmann , Currently it is applied to the sub task instance (TM) instead of the checkpoint coordinator (JM) as a complement to the setFailOnCheckpointingErrors method, since setFailOnCheckpointingErrors only provides the user with yes or no choices. But it sounds reasonable to apply the count on the Job globally. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-419137130 @azagrebin thanks for your suggestion, refactored this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-419073463 @tillrohrmann and @zentol if you have time, can you have a look at this PR? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-418295546 @azagrebin thanks for your suggestion, I have refactored this PR, please review again~ This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-415303507 @tillrohrmann When you have free time, please review this PR. thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures
yanghua commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures URL: https://github.com/apache/flink/pull/6567#issuecomment-413486884 cc @tillrohrmann This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services