[GitHub] TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM URL: https://github.com/apache/flink/pull/6680#issuecomment-430886993 As for "deploying tasks in topological order", I agree that it could help. It is an orthogonal improvement, though. Regarding your hesitancy, I'd like to learn in which situations a downstream operator would not be failed by an upstream failure. To keep the state clean, either the upstream failure also fails the downstream and both restore from the latest checkpoint, or we need to implement a failover strategy that takes responsibility for reconciling the state. The latter sounds quite costly. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM URL: https://github.com/apache/flink/pull/6680#issuecomment-427716261 @tillrohrmann it is better to say that `JobMaster` will be overwhelmed by too many RPC requests. This issue was filed during a benchmark of job scheduling performance with a 2000x2000 ALL-to-ALL streaming (EAGER) job. The input data is empty, so the tasks finish soon after they start. In this case the JM shows slow RPC responses, and TM/RM heartbeats to the JM eventually time out. The reason is that ~2,000,000 `requestPartitionState` messages are triggered by `triggerPartitionProducerStateCheck` in a short time, which overwhelms the JM RPC main thread. This happens because downstream tasks can be started earlier than upstream tasks in EAGER scheduling. For your second question, the task can just keep waiting for a while and retrying if the partition does not exist. There are two cases in which the partition does not exist: 1. the partition has not been started yet; 2. the partition has failed. In case 1, retrying works. In case 2, a task failover will soon happen and cancel the downstream tasks as well.
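The "keep waiting for a while and retrying" behaviour described in the comment above could look roughly like the following. This is a minimal sketch, not the code from the pull request: the class `PartitionRequestRetrySketch`, both method names, the `partitionExists` probe and the doubling backoff are all hypothetical and chosen only for illustration.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names are Flink APIs. */
final class PartitionRequestRetrySketch {

    /** Assumed policy: double the wait after each miss, capped at maxBackoffMs. */
    static long nextBackoffMs(long currentMs, long maxBackoffMs) {
        return Math.min(currentMs * 2, maxBackoffMs);
    }

    /**
     * Polls the (hypothetical) partitionExists probe locally instead of sending an RPC to the JM.
     * Returns the 1-based attempt on which the partition was found, or -1 if it was
     * not found within maxAttempts or the thread was interrupted while waiting.
     */
    static int requestWithRetry(BooleanSupplier partitionExists, int maxAttempts,
                                long initialBackoffMs, long maxBackoffMs) {
        long backoff = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (partitionExists.getAsBoolean()) {
                return attempt; // case 1: the producer has started in the meantime
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException e) {
                    // case 2 would surface here: a failover cancels the waiting task
                    Thread.currentThread().interrupt();
                    return -1;
                }
                backoff = nextBackoffMs(backoff, maxBackoffMs);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated producer that becomes available on the third probe.
        int attempts = requestWithRetry(() -> ++calls[0] >= 3, 10, 1L, 8L);
        System.out.println("partition found after " + attempts + " attempts");
    }
}
```

The point of the sketch is that no message reaches the JM RPC main thread while the consumer waits; whether a bounded attempt count or a pure time limit is preferable is left open here.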
[GitHub] TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM URL: https://github.com/apache/flink/pull/6680#issuecomment-422276930 cc @tillrohrmann @GJL @StefanRRichter
[GitHub] TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM URL: https://github.com/apache/flink/pull/6680#issuecomment-421882421 @Clark Thanks for your review!
[GitHub] TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM URL: https://github.com/apache/flink/pull/6680#issuecomment-421882434 cc @StephanEwen @twalthr
[GitHub] TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM URL: https://github.com/apache/flink/pull/6680#issuecomment-421222747 cc @tillrohrmann
[GitHub] TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM URL: https://github.com/apache/flink/pull/6680#issuecomment-420871756 @Clark thanks for your reply! Sorry for the late response. `triggerPartitionProducerStateCheck` is called if there is a `PartitionNotFoundException`, that is, the producer was not found. Please note that formerly we asked the JM to check the producer state, and if that request ended in a timeout exception, the task assumed the producer was still running and tried again. Now, however, we ALWAYS assume the producer is still running and try again. So with the changes we use a more lenient failure strategy. For the single-thread pool, could we just reuse `JobMaster#scheduledExecutorService`?
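The difference between the former and the changed handling of a `PartitionNotFoundException` described above can be put side by side as two decision functions. This is a simplified sketch under stated assumptions: `ProducerCheckSketch`, the two-value `ProducerState` enum and the `Action` enum are hypothetical and much coarser than Flink's real execution states; an empty `Optional` stands in for a timed-out request to the JM.

```java
import java.util.Optional;

/** Hypothetical sketch; these types are illustrative and not Flink classes. */
final class ProducerCheckSketch {

    /** Deliberately simplified producer states (assumption). */
    enum ProducerState { RUNNING, FAILED }

    /** What the consumer task does after a PartitionNotFoundException. */
    enum Action { RETRY, FAIL_CONSUMER }

    /**
     * Former behaviour: ask the JM for the producer state.
     * An empty answer models a timeout, which is treated as "still running".
     */
    static Action formerStrategy(Optional<ProducerState> jmAnswer) {
        if (!jmAnswer.isPresent()) {
            return Action.RETRY; // timeout: assume the producer is still running
        }
        return jmAnswer.get() == ProducerState.FAILED ? Action.FAIL_CONSUMER : Action.RETRY;
    }

    /** Changed behaviour: no RPC to the JM, always assume running and retry. */
    static Action proposedStrategy() {
        return Action.RETRY;
    }

    public static void main(String[] args) {
        System.out.println(formerStrategy(Optional.of(ProducerState.FAILED)));
        System.out.println(formerStrategy(Optional.empty()));
        System.out.println(proposedStrategy());
    }
}
```

Under this reading, the only case in which the two strategies differ is a producer that the JM already knows has failed; the changed strategy leaves that case to the failover that cancels the downstream tasks.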
[GitHub] TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM URL: https://github.com/apache/flink/pull/6680#issuecomment-420577451 @Clark why do you think it fails the execution eagerly? Formerly, the Task would ask the JM to check the producer state and decide whether to fail the execution or not; now the Task always retries.