[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655010#comment-16655010 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Many requestPartitionState messages overwhelm JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-430956239

I have learned that some machine learning use cases might not be concerned with accurate computation. Maybe we can defer this change.

> Too many requestPartitionState would crash JM
> ---------------------------------------------
>
>                 Key: FLINK-10319
>                 URL: https://issues.apache.org/jira/browse/FLINK-10319
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.7.0
>            Reporter: TisonKun
>            Assignee: TisonKun
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
> Do not call requestPartitionState on the JM when a partition request fails; this may generate too many RPC requests and block the JM. We gain little benefit from checking which state the producer is in, while the flood of RPC requests can crash the JM. A Task can always retriggerPartitionRequest from its InputGate: the retry fails if the producer is gone and succeeds if the producer is alive. Either way, there is no need to ask the JM for help.
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654688#comment-16654688 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-430886993

As for "deploying tasks in topological order", I agree that it could help. It is an orthogonal improvement, though.

Regarding your hesitancy, I'd like to learn in which situation a downstream operator would not be failed by an upstream failure. To keep the state clean, either the upstream failure fails the downstream and both restore from the latest checkpoint, or we need to implement a failover strategy that takes responsibility for reconciling the state. The latter sounds quite costly.
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641625#comment-16641625 ] ASF GitHub Bot commented on FLINK-10319:

tillrohrmann commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-427783408

I see the problem with very large jobs. Maybe we could solve it a bit differently, by deploying tasks in topological order when using `EAGER` scheduling.

Concerning your answer to my second question: what if the producer partition got disposed (e.g. due to a failover which does not necessarily restart the downstream operators)? At the moment an upstream task failure will always fail the downstream consumers. However, this can change in the future, and the more assumptions (e.g. downstream operators will be failed if upstream operators fail) we bake in, the harder it gets to change this behaviour. Moreover, I think it is always a good idea to make the components as self-contained as possible. This also entails that the failover behaviour should ideally not depend on other things happening. Therefore, I'm a bit hesitant to change the existing behaviour.
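To illustrate the "deploying tasks in topological order" idea, here is a minimal sketch using Kahn's algorithm over a toy DAG of vertex ids. It is purely illustrative: the class and method names are hypothetical and do not correspond to Flink's scheduler APIs; the point is only that producers come out of the ordering before their consumers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: order vertices so that producers are deployed before consumers. */
class TopologicalDeploymentOrder {

    /** edges maps each vertex id to the ids of its downstream consumers. */
    static List<String> order(Map<String, List<String>> edges) {
        Map<String, Integer> inDegree = new HashMap<>();
        edges.forEach((vertex, consumers) -> {
            inDegree.putIfAbsent(vertex, 0);
            for (String consumer : consumers) {
                inDegree.merge(consumer, 1, Integer::sum);
            }
        });

        // Kahn's algorithm: repeatedly "deploy" vertices whose producers are all deployed.
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((vertex, degree) -> {
            if (degree == 0) {
                ready.add(vertex);
            }
        });

        List<String> deploymentOrder = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.poll();
            deploymentOrder.add(vertex);
            for (String consumer : edges.getOrDefault(vertex, Collections.emptyList())) {
                if (inDegree.merge(consumer, -1, Integer::sum) == 0) {
                    ready.add(consumer);
                }
            }
        }
        return deploymentOrder; // e.g. [source, map, sink] for a simple chain
    }
}
```

With such an ordering, a downstream subtask would only be deployed once its producers have been deployed, which reduces (though does not fully eliminate) the window in which partition requests can arrive before the producer partition exists.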
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641328#comment-16641328 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-427716261

@tillrohrmann it is more accurate to say that the `JobMaster` will be overwhelmed by too many RPC requests. This issue was filed during a benchmark of job scheduling performance with a 2000x2000 ALL-to-ALL streaming (EAGER) job. The input data is empty, so the tasks finish soon after they start. In this case the JM shows slow RPC responses, and TM/RM heartbeats to the JM eventually time out. Digging into the cause, there were ~2,000,000 `requestPartitionState` messages triggered by `triggerPartitionProducerStateCheck` in a short time, which overwhelmed the JM RPC main thread. This happens because downstream tasks can be started earlier than upstream tasks under EAGER scheduling.

For your second question, the task can simply keep waiting for a while and retrying if the partition does not exist. There are two cases in which the partition does not exist: 1. the partition has not started yet; 2. the partition has failed. In case 1, retrying works. In case 2, a task failover will soon happen and cancel the downstream tasks as well.
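To make the benchmark scenario concrete, the following is a minimal sketch of the kind of job described above. It is an assumption about what the benchmark looked like, not the actual benchmark code; the parallelism, class names, and exact topology are illustrative.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

public class AllToAllSchedulingBenchmark {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2000);

        env.fromElements(0L)                 // effectively empty input: tasks finish right after starting
                .rebalance()
                .map(new Identity())         // upstream stage, 2000 subtasks
                .rebalance()                 // ALL-to-ALL edge: 2000 x 2000 channels between the two stages
                .map(new Identity())         // downstream stage, 2000 subtasks
                .addSink(new DiscardingSink<>());

        // Streaming jobs are scheduled eagerly, so downstream subtasks may be deployed and
        // request partitions before the upstream producers have registered them, which is
        // what triggers the flood of requestPartitionState calls described above.
        env.execute("2000x2000 all-to-all scheduling benchmark");
    }

    private static final class Identity implements MapFunction<Long, Long> {
        @Override
        public Long map(Long value) {
            return value;
        }
    }
}
```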
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637987#comment-16637987 ] ASF GitHub Bot commented on FLINK-10319:

tillrohrmann edited a comment on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-426950862

Thanks for opening this PR @TisonKun. Before diving into the details of this PR, I'd like to know whether you have observed the JM crashing, or whether this is more of a theoretical concern. If it does crash, then I would be interested to learn why, because the `requestPartitionState` method should not be blocking at all. How many `requestPartitionState` messages are generated in the crash case?

Another question concerns your assumptions: you said that `retriggerPartitionRequest` would fail if the producer is gone. By producer, do you mean the producing `Task` or the `TaskManager`? In the former case, I think the remote `TaskManager` would simply respond with a `PartitionNotFoundException`, which retriggers the same partition request method again. Thus, I'm not quite sure whether the consumer task would actually fail or simply retry infinitely. The latter outcome is, in my opinion, what we try to prevent by asking the JM about the state of the result partition.

I would like to hear @uce's opinion on this as well, because he used to work on this part of the code in the past.
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637986#comment-16637986 ] ASF GitHub Bot commented on FLINK-10319:

tillrohrmann commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-426950862

Thanks for opening this PR @TisonKun. Before diving into the details of this PR, I'd like to know whether you have observed the JM crashing, or whether this is more of a theoretical concern. If it does crash, then I would be interested to learn why, because the `requestPartitionState` method should not be blocking at all. How many `requestPartitionState` messages are generated in the crash case?

Another question concerns your assumptions: you said that `retriggerPartitionRequest` would fail if the producer is gone. By producer, do you mean the producing `Task` or the `TaskManager`? In the former case, I think the remote `TaskManager` would simply respond with a `PartitionNotFoundException`, which retriggers the same partition request method again. Thus, I'm not quite sure whether the consumer task would actually fail or simply retry infinitely. The latter outcome is, in my opinion, what we try to prevent by asking the JM about the state of the result partition.
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626745#comment-16626745 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun removed a comment on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-421222747

cc @tillrohrmann @GJL
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626724#comment-16626724 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun removed a comment on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-422276930

cc @tillrohrmann @GJL @StefanRRichter
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626721#comment-16626721 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun removed a comment on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-421882434

cc @StephanEwen @twalthr
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618564#comment-16618564 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-422276930

cc @tillrohrmann @GJL @StefanRRichter
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617034#comment-16617034 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-421882421

@Clark Thanks for your review!
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617035#comment-16617035 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-421882434

cc @StephanEwen @twalthr
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616210#comment-16616210 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun edited a comment on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-421222747

cc @tillrohrmann @GJL
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614325#comment-16614325 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-421222747

cc @tillrohrmann
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613058#comment-16613058 ] ASF GitHub Bot commented on FLINK-10319:

Clark commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-420893761

@TisonKun Sorry, I had some misunderstanding here, and now it looks good to me. I also think there is no need for a single-thread pool any more, since we do not need to ask the JM.
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612969#comment-16612969 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-420871756

@Clark thanks for your reply! Sorry for the late response.

`triggerPartitionProducerStateCheck` is called when there is a `PartitionNotFoundException`, that is, when the producer is not found. Please note that formerly we asked the JM to check the producer state, and if the check timed out, the task would try again and assume the producer was still running; now, however, we ALWAYS assume the producer is still running and try again. So with this change we use a more relaxed failure strategy.

For the single-thread pool, could we just reuse `JobMaster#scheduledExecutorService`?
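A simplified sketch of the two strategies being compared: the old "ask the JM first" path and the new "just retry locally" path. The class wiring below is hypothetical; only the `JobMasterGateway#requestPartitionState` and `SingleInputGate#retriggerPartitionRequest` calls correspond to actual Flink runtime APIs, so treat this as an illustration rather than the real `Task` code.

```java
import java.util.concurrent.CompletableFuture;

import org.apache.flink.runtime.execution.ExecutionState;
import org.apache.flink.runtime.io.network.partition.ResultPartitionID;
import org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate;
import org.apache.flink.runtime.jobgraph.IntermediateDataSetID;
import org.apache.flink.runtime.jobmaster.JobMasterGateway;

/** Illustrative wrapper; in Flink the corresponding logic lives in Task and SingleInputGate. */
class PartitionNotFoundHandlingSketch {

    private final JobMasterGateway jobMasterGateway;
    private final SingleInputGate inputGate;
    private final IntermediateDataSetID dataSetId;

    PartitionNotFoundHandlingSketch(
            JobMasterGateway jobMasterGateway, SingleInputGate inputGate, IntermediateDataSetID dataSetId) {
        this.jobMasterGateway = jobMasterGateway;
        this.inputGate = inputGate;
        this.dataSetId = dataSetId;
    }

    /** Old behaviour: every PartitionNotFoundException triggers one requestPartitionState RPC to the JM. */
    void onPartitionNotFoundAskingJm(ResultPartitionID partitionId) {
        CompletableFuture<ExecutionState> producerState =
                jobMasterGateway.requestPartitionState(dataSetId, partitionId);
        producerState.whenComplete((state, failure) -> {
            if (failure != null || state == ExecutionState.RUNNING || state == ExecutionState.FINISHED) {
                // Producer is (assumed to be) there, or the check timed out: retry the request.
                onPartitionNotFoundRetryingLocally(partitionId);
            } else {
                // Producer is gone: fail the consumer eagerly (placeholder for Task#failExternally).
                throw new IllegalStateException("Producer of " + partitionId + " is in state " + state);
            }
        });
    }

    /** New behaviour: no JM involvement, the consumer just re-triggers the partition request. */
    void onPartitionNotFoundRetryingLocally(ResultPartitionID partitionId) {
        try {
            inputGate.retriggerPartitionRequest(partitionId.getPartitionId());
        } catch (Exception e) {
            throw new RuntimeException("Could not re-trigger partition request for " + partitionId, e);
        }
    }
}
```

The difference is that the old path costs one JM RPC per missing partition (millions of them in the 2000x2000 benchmark), while the new path keeps the decision local and relies on the normal failover to cancel the consumer if the producer has really failed.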
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612224#comment-16612224 ] ASF GitHub Bot commented on FLINK-10319:

Clark commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-420670340

@TisonKun Currently, the task will ask the JM to check the producer state. If the check times out, it will try again and assume the producer is still running. I am not sure when `triggerPartitionProducerStateCheck` gets called; is it possible that the producer state is still running? If so, then we might restart the Execution unnecessarily and go through the whole task cancellation logic (it might restart the whole job in streaming mode). By using a single-thread thread pool, we would not put too much pressure on the JM and would avoid unnecessary task cancellation.
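A minimal sketch of the "single-thread pool" idea mentioned above: funnel all producer-state checks through one executor so that at most one such RPC is in flight at a time, instead of flooding the JM. The class and method names are hypothetical (not Flink APIs), and it assumes the supplied call blocks until the JM responds.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical throttle: serializes producer-state checks on a single thread. */
class ThrottledProducerStateChecker {

    // One thread means the checks are issued sequentially instead of all at once.
    private final ExecutorService checkExecutor = Executors.newSingleThreadExecutor();

    /**
     * Queues a producer-state check; rpcCall is assumed to perform the (blocking) request to
     * the JM and return whether the producer is still alive.
     */
    CompletableFuture<Boolean> checkProducerState(Supplier<Boolean> rpcCall) {
        return CompletableFuture.supplyAsync(rpcCall, checkExecutor);
    }

    void shutdown() {
        checkExecutor.shutdown();
    }
}
```

This would cap the request rate seen by the JM, at the cost of queueing delay on the consumer side; the PR's approach sidesteps the issue entirely by not asking the JM at all.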
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611824#comment-16611824 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun edited a comment on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-420577451

@Clark why do you think it fails the execution eagerly? Formerly, the Task would ask the JM to check the producer state and decide whether or not to fail the execution; now the Task always retries.

You are right that introducing a single-thread thread pool would ease the problem, but even without asking the JM, the Task still becomes aware of a producer failure, just later. There is little benefit to gain from asking the JM to check the state, and it burdens the JM.
[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM
[ https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611816#comment-16611816 ] ASF GitHub Bot commented on FLINK-10319:

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-420577451

@Clark why do you think it fails the execution eagerly? Formerly, the Task would ask the JM to check the producer state and decide whether or not to fail the execution; now the Task always retries.