[GitHub] [flink] rkhachatryan commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats
rkhachatryan commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-766707720

Thanks for the review, @pnowojski. I've added the space and created a ticket to translate the docs. I've also squashed the commits.

> for example AsyncCheckpointRunnable fails (throws an exception), I can not see any stats for any subtasks that have finished after the failure

As discussed offline, this happens because the failed upstream task doesn't send the barrier downstream.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] rkhachatryan commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats
rkhachatryan commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-765417912

I've updated the PR (adding 4 new commits):

1. Tasks that report upon the abort RPC are marked as `aborted` in the e2e duration column
2. Only tasks that actually ACKed the checkpoint are counted for ackCount and lastAckTime
3. `-1B` is shown as `-` (the same way as durations)
4. Fix the docs

![image](https://user-images.githubusercontent.com/3939322/105499876-669e6700-5cc2-11eb-8d99-b301a83a548c.png)

cc: @NicoK
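The first three rules in the comment above can be illustrated with a minimal Java sketch. This is not the code from the PR: `FailedCheckpointStatsSketch`, `SubtaskReport`, and all method and field names are hypothetical, and the use of `-1` as the "not collected" marker is an assumption taken only from the `-1B` mentioned in the comment.

```java
import java.util.List;

// Hypothetical illustration only; none of these names are Flink classes.
final class FailedCheckpointStatsSketch {

    /** One subtask's report. A size of -1 is assumed to mean "not collected". */
    static final class SubtaskReport {
        final boolean acknowledged; // false if the report arrived via the abort path
        final long reportTimestamp; // milliseconds
        final long stateSizeBytes;  // -1 when unknown

        SubtaskReport(boolean acknowledged, long reportTimestamp, long stateSizeBytes) {
            this.acknowledged = acknowledged;
            this.reportTimestamp = reportTimestamp;
            this.stateSizeBytes = stateSizeBytes;
        }
    }

    /** Rule 2: only subtasks that actually ACKed are counted. */
    static int ackCount(List<SubtaskReport> reports) {
        int count = 0;
        for (SubtaskReport r : reports) {
            if (r.acknowledged) {
                count++;
            }
        }
        return count;
    }

    /** Rule 2: the latest ACK time ignores aborted reports; -1 if nothing was ACKed. */
    static long lastAckTime(List<SubtaskReport> reports) {
        long latest = -1L;
        for (SubtaskReport r : reports) {
            if (r.acknowledged) {
                latest = Math.max(latest, r.reportTimestamp);
            }
        }
        return latest;
    }

    /** Rule 1: aborted subtasks show "aborted" instead of an end-to-end duration. */
    static String formatEndToEnd(SubtaskReport r, long triggerTimestamp) {
        return r.acknowledged ? (r.reportTimestamp - triggerTimestamp) + " ms" : "aborted";
    }

    /** Rule 3: an unknown (negative) size is rendered as "-". */
    static String formatSize(long bytes) {
        return bytes < 0 ? "-" : bytes + " B";
    }
}
```

With one ACKed report and one report received through the abort path, only the first contributes to the ACK count and the last ACK time, and the second is displayed as `aborted` with a size of `-`.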
[GitHub] [flink] rkhachatryan commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats
rkhachatryan commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-763414913

Thanks a lot for trying it out.

> I think it's strictly necessary to:
> clearly mark which checkpoint for which subtask has failed

It is not always the task that fails a checkpoint. The timeout decision is made by the `CheckpointCoordinator`. Multiple tasks can fail independently as well. I agree that marking "failed" tasks would be useful, but I don't think it's directly related to this feature, or at least to this PR.

> if we were not able to collect/calculate a metric, it must be N/A - not just 0ms

I don't see `0ms` in your screenshots, nor when running locally. Do you mean `0 B` per operator? If so, why is it incorrect? (I do see a non-zero size when running on a cluster.)

> correctly calculate the durations (end to end, sync, async, etc...) also for failed checkpoints, not just N/A

A checkpoint can be cancelled before it has even started on some subtasks.
[GitHub] [flink] rkhachatryan commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats
rkhachatryan commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-760328916

Thanks for reviewing, @pnowojski. I've addressed your feedback, PTAL.