[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.

2020-06-13 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134792#comment-17134792
 ] 

Xintong Song commented on FLINK-15687:
--

Fixed via
 * master: a3b51f805f247e6aeb681bb36b38da777ba25d79
 * 1.11.0: 15e7c26ab6b9c9b2b1faa81495dc941e2f10bac6

> Potential test instabilities due to concurrent access to TaskSlotTable.
> ---
>
> Key: FLINK-15687
> URL: https://issues.apache.org/jira/browse/FLINK-15687
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Kostas Kloudas
>Assignee: Xintong Song
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] 
> revealed that the problem with that test instability was the modification of 
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple 
> threads, namely the test thread and the main thread of the {{rpcEnpoint}}. 
> This data-structure is not thread-safe and this should not happen.
> This anti-pattern seems to be repeated in multiple tests like most of the 
> tests in the {{TaskExecutorSubmissionTest}} (look for the call to the 
> {{TaskSlotTable.allocateSlot()}}). There we seem to call 
> {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which 
> is essentially accessing the slot table from within the main rpc-endpoint 
> thread.
> This JIRA is just to investigate if this is also a problem in those tests or 
> not.
> cc [~trohrmann], [~chesnay] , [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.

2020-06-11 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133216#comment-17133216
 ] 

Till Rohrmann commented on FLINK-15687:
---

Thanks for fixing this problem [~xintongsong]. Let me know once you have the PR 
ready.

> Potential test instabilities due to concurrent access to TaskSlotTable.
> ---
>
> Key: FLINK-15687
> URL: https://issues.apache.org/jira/browse/FLINK-15687
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Kostas Kloudas
>Assignee: Xintong Song
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] 
> revealed that the problem with that test instability was the modification of 
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple 
> threads, namely the test thread and the main thread of the {{rpcEnpoint}}. 
> This data-structure is not thread-safe and this should not happen.
> This anti-pattern seems to be repeated in multiple tests like most of the 
> tests in the {{TaskExecutorSubmissionTest}} (look for the call to the 
> {{TaskSlotTable.allocateSlot()}}). There we seem to call 
> {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which 
> is essentially accessing the slot table from within the main rpc-endpoint 
> thread.
> This JIRA is just to investigate if this is also a problem in those tests or 
> not.
> cc [~trohrmann], [~chesnay] , [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.

2020-06-11 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133053#comment-17133053
 ] 

Xintong Song commented on FLINK-15687:
--

[~trohrmann],

Indeed, {{JobTable}} should also be accessed on the RPC main thread.

To guarantee that, we can add the {{MainThreadExecutable}} of the 
{{TaskExecutor}} as an argument of 
{{TaskSubmissionTestEnvironment#registerJobMasterConnection}}.

The method is called in:
* the constructor of {{TaskSubmissionTestEnvironment}}
* 
{{TaskExecutorPartitionLifecycleTest#testJobMasterConnectionTerminationAfterExternalReleaseOrPromotion}}

For both above, the {{MainThreadExecutable}} can be retrieved via 
{{TestingTaskExecutor#getMainThreadExecutableForTesting}}.

I'll try to provide a fix.

BTW, I think the concurrent access to {{JobTable}} exist before the changes of 
this ticket.

> Potential test instabilities due to concurrent access to TaskSlotTable.
> ---
>
> Key: FLINK-15687
> URL: https://issues.apache.org/jira/browse/FLINK-15687
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Kostas Kloudas
>Assignee: Xintong Song
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] 
> revealed that the problem with that test instability was the modification of 
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple 
> threads, namely the test thread and the main thread of the {{rpcEnpoint}}. 
> This data-structure is not thread-safe and this should not happen.
> This anti-pattern seems to be repeated in multiple tests like most of the 
> tests in the {{TaskExecutorSubmissionTest}} (look for the call to the 
> {{TaskSlotTable.allocateSlot()}}). There we seem to call 
> {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which 
> is essentially accessing the slot table from within the main rpc-endpoint 
> thread.
> This JIRA is just to investigate if this is also a problem in those tests or 
> not.
> cc [~trohrmann], [~chesnay] , [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.

2020-06-04 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125596#comment-17125596
 ] 

Till Rohrmann commented on FLINK-15687:
---

Sounds good to me [~xintongsong]. Go ahead.

> Potential test instabilities due to concurrent access to TaskSlotTable.
> ---
>
> Key: FLINK-15687
> URL: https://issues.apache.org/jira/browse/FLINK-15687
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Kostas Kloudas
>Assignee: Xintong Song
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] 
> revealed that the problem with that test instability was the modification of 
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple 
> threads, namely the test thread and the main thread of the {{rpcEnpoint}}. 
> This data-structure is not thread-safe and this should not happen.
> This anti-pattern seems to be repeated in multiple tests like most of the 
> tests in the {{TaskExecutorSubmissionTest}} (look for the call to the 
> {{TaskSlotTable.allocateSlot()}}). There we seem to call 
> {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which 
> is essentially accessing the slot table from within the main rpc-endpoint 
> thread.
> This JIRA is just to investigate if this is also a problem in those tests or 
> not.
> cc [~trohrmann], [~chesnay] , [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.

2020-05-28 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118361#comment-17118361
 ] 

Xintong Song commented on FLINK-15687:
--

For all the rest of test cases listed above, the concurrency is cause by 
directly access to {{TaskSlotTable}} via 
{{TaskSubmissionTestEnvironment#getTaskSlotTable}}.

The approach [~trohrmann] suggested, getting rid of 
{{TaskSubmissionTestEnvironment#getTaskSlotTable}} and rework all the affected 
cases, should solve the problem. However, reworking all the test cases is a 
non-trivial effort.

I think what we truly need is not to prevent direct accesses to 
{{TaskSlotTable}} from the test cases, but to make sure such accesses are 
always in the rpc main thread. Therefore, I propose to make 
{{TaskSubmissionTestEnvironment#getTaskSlotTable}} return a wrapped 
{{ThreadSafeTaskSlotTable}}, which calls the underlying {{TaskSlotTable}} in 
the rpc main thread. In this way, only accesses through 
{{TaskSubmissionTestEnvironment#getTaskSlotTable}} will be gated to the rpc 
main thread, and accesses from other to-be-tested code paths will not be 
affected. Ideally, we do not need to rework any of the test cases.

WDYT? [~trohrmann]

> Potential test instabilities due to concurrent access to TaskSlotTable.
> ---
>
> Key: FLINK-15687
> URL: https://issues.apache.org/jira/browse/FLINK-15687
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Kostas Kloudas
>Assignee: Xintong Song
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] 
> revealed that the problem with that test instability was the modification of 
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple 
> threads, namely the test thread and the main thread of the {{rpcEnpoint}}. 
> This data-structure is not thread-safe and this should not happen.
> This anti-pattern seems to be repeated in multiple tests like most of the 
> tests in the {{TaskExecutorSubmissionTest}} (look for the call to the 
> {{TaskSlotTable.allocateSlot()}}). There we seem to call 
> {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which 
> is essentially accessing the slot table from within the main rpc-endpoint 
> thread.
> This JIRA is just to investigate if this is also a problem in those tests or 
> not.
> cc [~trohrmann], [~chesnay] , [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.

2020-05-27 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118281#comment-17118281
 ] 

Xintong Song commented on FLINK-15687:
--

I've printed the thread names for all testing scope accesses to 
{{TaskSlotTable}}. The following cases are discovered with the reported 
problem. I'll try to provide fixes asap.

* TaskExecutorTest
** testDynamicSlotAllocation
** testOfferSlotToJobMasterAfterTimeout
** testShouldShutDownTaskManagerServicesInPostStop
* TaskExecutorSubmissionTest
** testFailingScheduleOrUpdateConsumers
** testUpdateTaskInputPartitionsFailure
** testRunJobWithForwardChannel
** testTaskSubmissionAndCancelling
** testCancellingDependentAndStateUpdateFails
** testLocalPartitionNotFound
** testRemotePartitionNotFound
** testTaskSubmission
** testGateChannelEdgeMismatch
** testRequestTaskBackPressure
* TaskExecutorOperatorEventHandlingTest
** eventHandlingInTaskFailureFailsTask
** eventToCoordinatorDeliveryFailureFailsTask

> Potential test instabilities due to concurrent access to TaskSlotTable.
> ---
>
> Key: FLINK-15687
> URL: https://issues.apache.org/jira/browse/FLINK-15687
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Kostas Kloudas
>Assignee: Xintong Song
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] 
> revealed that the problem with that test instability was the modification of 
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple 
> threads, namely the test thread and the main thread of the {{rpcEnpoint}}. 
> This data-structure is not thread-safe and this should not happen.
> This anti-pattern seems to be repeated in multiple tests like most of the 
> tests in the {{TaskExecutorSubmissionTest}} (look for the call to the 
> {{TaskSlotTable.allocateSlot()}}). There we seem to call 
> {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which 
> is essentially accessing the slot table from within the main rpc-endpoint 
> thread.
> This JIRA is just to investigate if this is also a problem in those tests or 
> not.
> cc [~trohrmann], [~chesnay] , [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.

2020-01-24 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023116#comment-17023116
 ] 

Till Rohrmann commented on FLINK-15687:
---

Sorry [~fly_in_gis], I linked you by accident and I actually meant to ask 
[~dangdangdang].

> Potential test instabilities due to concurrent access to TaskSlotTable.
> ---
>
> Key: FLINK-15687
> URL: https://issues.apache.org/jira/browse/FLINK-15687
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] 
> revealed that the problem with that test instability was the modification of 
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple 
> threads, namely the test thread and the main thread of the {{rpcEnpoint}}. 
> This data-structure is not thread-safe and this should not happen.
> This anti-pattern seems to be repeated in multiple tests like most of the 
> tests in the {{TaskExecutorSubmissionTest}} (look for the call to the 
> {{TaskSlotTable.allocateSlot()}}). There we seem to call 
> {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which 
> is essentially accessing the slot table from within the main rpc-endpoint 
> thread.
> This JIRA is just to investigate if this is also a problem in those tests or 
> not.
> cc [~trohrmann], [~chesnay] , [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.

2020-01-20 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019860#comment-17019860
 ] 

Yang Wang commented on FLINK-15687:
---

Hi [~trohrmann], i agree with you that we should avoid accessing 
{{TaskSlotTable}} in multiple threads. Removing the 
{{TaskSubmissionTestEnvironment#getTaskSlotTable}} could work for that purpose.

nit: I guess maybe you want to cc [~dangdangdang] since he is the assignee of 
FLINK-11593.

 

> Potential test instabilities due to concurrent access to TaskSlotTable.
> ---
>
> Key: FLINK-15687
> URL: https://issues.apache.org/jira/browse/FLINK-15687
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0, 1.11.0
>
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] 
> revealed that the problem with that test instability was the modification of 
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple 
> threads, namely the test thread and the main thread of the {{rpcEnpoint}}. 
> This data-structure is not thread-safe and this should not happen.
> This anti-pattern seems to be repeated in multiple tests like most of the 
> tests in the {{TaskExecutorSubmissionTest}} (look for the call to the 
> {{TaskSlotTable.allocateSlot()}}). There we seem to call 
> {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which 
> is essentially accessing the slot table from within the main rpc-endpoint 
> thread.
> This JIRA is just to investigate if this is also a problem in those tests or 
> not.
> cc [~trohrmann], [~chesnay] , [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15687) Potential test instabilities due to concurrent access to TaskSlotTable.

2020-01-20 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019533#comment-17019533
 ] 

Till Rohrmann commented on FLINK-15687:
---

[~yangwang166] do you know whether there is any safeguard in the 
{{TaskSubmissionTestEnvironment}} which allows to mutate the {{TaskSlotTable}} 
outside of the {{TaskExecutor's}} main thread? If not, then I would suggest to 
remove {{TaskSubmissionTestEnvironment#getTaskSlotTable}} altogether and to 
rework the affected tests.

> Potential test instabilities due to concurrent access to TaskSlotTable.
> ---
>
> Key: FLINK-15687
> URL: https://issues.apache.org/jira/browse/FLINK-15687
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination, Tests
>Affects Versions: 1.10.0
>Reporter: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> Working on [FLINK-14742|https://issues.apache.org/jira/browse/FLINK-14742] 
> revealed that the problem with that test instability was the modification of 
> the {{taskSlotTable}} of the {{TaskManager}} under test from multiple 
> threads, namely the test thread and the main thread of the {{rpcEnpoint}}. 
> This data-structure is not thread-safe and this should not happen.
> This anti-pattern seems to be repeated in multiple tests like most of the 
> tests in the {{TaskExecutorSubmissionTest}} (look for the call to the 
> {{TaskSlotTable.allocateSlot()}}). There we seem to call 
> {{taskSlotTable.allocateSlot()}} and then \{{tmGateway.submitTask()}} which 
> is essentially accessing the slot table from within the main rpc-endpoint 
> thread.
> This JIRA is just to investigate if this is also a problem in those tests or 
> not.
> cc [~trohrmann], [~chesnay] , [~yangwang166]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)