[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.
[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134792#comment-17134792 ] Xintong Song commented on FLINK-15687: -- Fixed via * master: a3b51f805f247e6aeb681bb36b38da777ba25d79 * 1.11.0: 15e7c26ab6b9c9b2b1faa81495dc941e2f10bac6 > Potential test instabilities due to concurrent access to TaskSlotTable. > --- > > Key: FLINK-15687 > URL: https://issues.apache.org/jira/browse/FLINK-15687 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.10.0 >Reporter: Kostas Kloudas >Assignee: Xintong Song >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.11.0 > > > Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] > revealed that the problem with that test instability was the modification of > the {{taskSlotTable}} of the {{TaskManager}} under test from multiple > threads, namely the test thread and the main thread of the {{rpcEnpoint}}. > This data-structure is not thread-safe and this should not happen. > This anti-pattern seems to be repeated in multiple tests like most of the > tests in the {{TaskExecutorSubmissionTest}} (look for the call to the > {{TaskSlotTable.allocateSlot()}}). There we seem to call > {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which > is essentially accessing the slot table from within the main rpc-endpoint > thread. > This JIRA is just to investigate if this is also a problem in those tests or > not. > cc [~trohrmann], [~chesnay] , [~yangwang166] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.
[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133216#comment-17133216 ] Till Rohrmann commented on FLINK-15687: --- Thanks for fixing this problem [~xintongsong]. Let me know once you have the PR ready. > Potential test instabilities due to concurrent access to TaskSlotTable. > --- > > Key: FLINK-15687 > URL: https://issues.apache.org/jira/browse/FLINK-15687 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.10.0 >Reporter: Kostas Kloudas >Assignee: Xintong Song >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.11.0 > > > Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] > revealed that the problem with that test instability was the modification of > the {{taskSlotTable}} of the {{TaskManager}} under test from multiple > threads, namely the test thread and the main thread of the {{rpcEnpoint}}. > This data-structure is not thread-safe and this should not happen. > This anti-pattern seems to be repeated in multiple tests like most of the > tests in the {{TaskExecutorSubmissionTest}} (look for the call to the > {{TaskSlotTable.allocateSlot()}}). There we seem to call > {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which > is essentially accessing the slot table from within the main rpc-endpoint > thread. > This JIRA is just to investigate if this is also a problem in those tests or > not. > cc [~trohrmann], [~chesnay] , [~yangwang166] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.
[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133053#comment-17133053 ] Xintong Song commented on FLINK-15687: -- [~trohrmann], Indeed, {{JobTable}} should also be accessed on the RPC main thread. To guarantee that, we can add the {{MainThreadExecutable}} of the {{TaskExecutor}} as an argument of {{TaskSubmissionTestEnvironment#registerJobMasterConnection}}. The method is called in: * the constructor of {{TaskSubmissionTestEnvironment}} * {{TaskExecutorPartitionLifecycleTest#testJobMasterConnectionTerminationAfterExternalReleaseOrPromotion}} For both above, the {{MainThreadExecutable}} can be retrieved via {{TestingTaskExecutor#getMainThreadExecutableForTesting}}. I'll try to provide a fix. BTW, I think the concurrent access to {{JobTable}} exist before the changes of this ticket. > Potential test instabilities due to concurrent access to TaskSlotTable. > --- > > Key: FLINK-15687 > URL: https://issues.apache.org/jira/browse/FLINK-15687 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.10.0 >Reporter: Kostas Kloudas >Assignee: Xintong Song >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.11.0 > > > Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] > revealed that the problem with that test instability was the modification of > the {{taskSlotTable}} of the {{TaskManager}} under test from multiple > threads, namely the test thread and the main thread of the {{rpcEnpoint}}. > This data-structure is not thread-safe and this should not happen. > This anti-pattern seems to be repeated in multiple tests like most of the > tests in the {{TaskExecutorSubmissionTest}} (look for the call to the > {{TaskSlotTable.allocateSlot()}}). There we seem to call > {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which > is essentially accessing the slot table from within the main rpc-endpoint > thread. > This JIRA is just to investigate if this is also a problem in those tests or > not. > cc [~trohrmann], [~chesnay] , [~yangwang166] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.
[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125596#comment-17125596 ] Till Rohrmann commented on FLINK-15687: --- Sounds good to me [~xintongsong]. Go ahead. > Potential test instabilities due to concurrent access to TaskSlotTable. > --- > > Key: FLINK-15687 > URL: https://issues.apache.org/jira/browse/FLINK-15687 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.10.0 >Reporter: Kostas Kloudas >Assignee: Xintong Song >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.11.0 > > > Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] > revealed that the problem with that test instability was the modification of > the {{taskSlotTable}} of the {{TaskManager}} under test from multiple > threads, namely the test thread and the main thread of the {{rpcEnpoint}}. > This data-structure is not thread-safe and this should not happen. > This anti-pattern seems to be repeated in multiple tests like most of the > tests in the {{TaskExecutorSubmissionTest}} (look for the call to the > {{TaskSlotTable.allocateSlot()}}). There we seem to call > {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which > is essentially accessing the slot table from within the main rpc-endpoint > thread. > This JIRA is just to investigate if this is also a problem in those tests or > not. > cc [~trohrmann], [~chesnay] , [~yangwang166] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.
[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118361#comment-17118361 ] Xintong Song commented on FLINK-15687: -- For all the rest of test cases listed above, the concurrency is cause by directly access to {{TaskSlotTable}} via {{TaskSubmissionTestEnvironment#getTaskSlotTable}}. The approach [~trohrmann] suggested, getting rid of {{TaskSubmissionTestEnvironment#getTaskSlotTable}} and rework all the affected cases, should solve the problem. However, reworking all the test cases is a non-trivial effort. I think what we truly need is not to prevent direct accesses to {{TaskSlotTable}} from the test cases, but to make sure such accesses are always in the rpc main thread. Therefore, I propose to make {{TaskSubmissionTestEnvironment#getTaskSlotTable}} return a wrapped {{ThreadSafeTaskSlotTable}}, which calls the underlying {{TaskSlotTable}} in the rpc main thread. In this way, only accesses through {{TaskSubmissionTestEnvironment#getTaskSlotTable}} will be gated to the rpc main thread, and accesses from other to-be-tested code paths will not be affected. Ideally, we do not need to rework any of the test cases. WDYT? [~trohrmann] > Potential test instabilities due to concurrent access to TaskSlotTable. > --- > > Key: FLINK-15687 > URL: https://issues.apache.org/jira/browse/FLINK-15687 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.10.0 >Reporter: Kostas Kloudas >Assignee: Xintong Song >Priority: Critical > Labels: test-stability > Fix For: 1.11.0 > > > Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] > revealed that the problem with that test instability was the modification of > the {{taskSlotTable}} of the {{TaskManager}} under test from multiple > threads, namely the test thread and the main thread of the {{rpcEnpoint}}. > This data-structure is not thread-safe and this should not happen. > This anti-pattern seems to be repeated in multiple tests like most of the > tests in the {{TaskExecutorSubmissionTest}} (look for the call to the > {{TaskSlotTable.allocateSlot()}}). There we seem to call > {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which > is essentially accessing the slot table from within the main rpc-endpoint > thread. > This JIRA is just to investigate if this is also a problem in those tests or > not. > cc [~trohrmann], [~chesnay] , [~yangwang166] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.
[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118281#comment-17118281 ] Xintong Song commented on FLINK-15687: -- I've printed the thread names for all testing scope accesses to {{TaskSlotTable}}. The following cases are discovered with the reported problem. I'll try to provide fixes asap. * TaskExecutorTest ** testDynamicSlotAllocation ** testOfferSlotToJobMasterAfterTimeout ** testShouldShutDownTaskManagerServicesInPostStop * TaskExecutorSubmissionTest ** testFailingScheduleOrUpdateConsumers ** testUpdateTaskInputPartitionsFailure ** testRunJobWithForwardChannel ** testTaskSubmissionAndCancelling ** testCancellingDependentAndStateUpdateFails ** testLocalPartitionNotFound ** testRemotePartitionNotFound ** testTaskSubmission ** testGateChannelEdgeMismatch ** testRequestTaskBackPressure * TaskExecutorOperatorEventHandlingTest ** eventHandlingInTaskFailureFailsTask ** eventToCoordinatorDeliveryFailureFailsTask > Potential test instabilities due to concurrent access to TaskSlotTable. > --- > > Key: FLINK-15687 > URL: https://issues.apache.org/jira/browse/FLINK-15687 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.10.0 >Reporter: Kostas Kloudas >Assignee: Xintong Song >Priority: Critical > Labels: test-stability > Fix For: 1.11.0 > > > Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] > revealed that the problem with that test instability was the modification of > the {{taskSlotTable}} of the {{TaskManager}} under test from multiple > threads, namely the test thread and the main thread of the {{rpcEnpoint}}. > This data-structure is not thread-safe and this should not happen. > This anti-pattern seems to be repeated in multiple tests like most of the > tests in the {{TaskExecutorSubmissionTest}} (look for the call to the > {{TaskSlotTable.allocateSlot()}}). There we seem to call > {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which > is essentially accessing the slot table from within the main rpc-endpoint > thread. > This JIRA is just to investigate if this is also a problem in those tests or > not. > cc [~trohrmann], [~chesnay] , [~yangwang166] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.
[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023116#comment-17023116 ] Till Rohrmann commented on FLINK-15687: --- Sorry [~fly_in_gis], I linked you by accident and I actually meant to ask [~dangdangdang]. > Potential test instabilities due to concurrent access to TaskSlotTable. > --- > > Key: FLINK-15687 > URL: https://issues.apache.org/jira/browse/FLINK-15687 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.10.0 >Reporter: Kostas Kloudas >Priority: Critical > Labels: test-stability > Fix For: 1.11.0 > > > Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] > revealed that the problem with that test instability was the modification of > the {{taskSlotTable}} of the {{TaskManager}} under test from multiple > threads, namely the test thread and the main thread of the {{rpcEnpoint}}. > This data-structure is not thread-safe and this should not happen. > This anti-pattern seems to be repeated in multiple tests like most of the > tests in the {{TaskExecutorSubmissionTest}} (look for the call to the > {{TaskSlotTable.allocateSlot()}}). There we seem to call > {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which > is essentially accessing the slot table from within the main rpc-endpoint > thread. > This JIRA is just to investigate if this is also a problem in those tests or > not. > cc [~trohrmann], [~chesnay] , [~yangwang166] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.
[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019860#comment-17019860 ] Yang Wang commented on FLINK-15687: --- Hi [~trohrmann], i agree with you that we should avoid accessing {{TaskSlotTable}} in multiple threads. Removing the {{TaskSubmissionTestEnvironment#getTaskSlotTable}} could work for that purpose. nit: I guess maybe you want to cc [~dangdangdang] since he is the assignee of FLINK-11593. > Potential test instabilities due to concurrent access to TaskSlotTable. > --- > > Key: FLINK-15687 > URL: https://issues.apache.org/jira/browse/FLINK-15687 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.10.0 >Reporter: Kostas Kloudas >Priority: Critical > Labels: test-stability > Fix For: 1.10.0, 1.11.0 > > > Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] > revealed that the problem with that test instability was the modification of > the {{taskSlotTable}} of the {{TaskManager}} under test from multiple > threads, namely the test thread and the main thread of the {{rpcEnpoint}}. > This data-structure is not thread-safe and this should not happen. > This anti-pattern seems to be repeated in multiple tests like most of the > tests in the {{TaskExecutorSubmissionTest}} (look for the call to the > {{TaskSlotTable.allocateSlot()}}). There we seem to call > {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which > is essentially accessing the slot table from within the main rpc-endpoint > thread. > This JIRA is just to investigate if this is also a problem in those tests or > not. > cc [~trohrmann], [~chesnay] , [~yangwang166] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.
[ https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019533#comment-17019533 ] Till Rohrmann commented on FLINK-15687: --- [~yangwang166] do you know whether there is any safeguard in the {{TaskSubmissionTestEnvironment}} which allows to mutate the {{TaskSlotTable}} outside of the {{TaskExecutor's}} main thread? If not, then I would suggest to remove {{TaskSubmissionTestEnvironment#getTaskSlotTable}} altogether and to rework the affected tests. > Potential test instabilities due to concurrent access to TaskSlotTable. > --- > > Key: FLINK-15687 > URL: https://issues.apache.org/jira/browse/FLINK-15687 > Project: Flink > Issue Type: Task > Components: Runtime / Coordination, Tests >Affects Versions: 1.10.0 >Reporter: Kostas Kloudas >Priority: Critical > Labels: test-stability > > Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] > revealed that the problem with that test instability was the modification of > the {{taskSlotTable}} of the {{TaskManager}} under test from multiple > threads, namely the test thread and the main thread of the {{rpcEnpoint}}. > This data-structure is not thread-safe and this should not happen. > This anti-pattern seems to be repeated in multiple tests like most of the > tests in the {{TaskExecutorSubmissionTest}} (look for the call to the > {{TaskSlotTable.allocateSlot()}}). There we seem to call > {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which > is essentially accessing the slot table from within the main rpc-endpoint > thread. > This JIRA is just to investigate if this is also a problem in those tests or > not. > cc [~trohrmann], [~chesnay] , [~yangwang166] -- This message was sent by Atlassian Jira (v8.3.4#803005)