[
https://issues.apache.org/jira/browse/FLINK-24800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441866#comment-17441866
]
Anton Kalashnikov commented on FLINK-24800:
-------------------------------------------
I have fixed the test, but the real problem is not the test itself: it is a
race condition in `PipelinedSubpartition#add`:
# Suppose we add an event to the empty queue (an event is always a finished
buffer).
# `notifyDataAvailable` is set to true because the queue holds only one
buffer, and it is finished.
# We call the `notifyDataAvailable` method.
# We add one more buffer, unfinished this time, which becomes the second in
the queue.
# `notifyDataAvailable` is set to true because it is the second buffer (in
fact, we should set this flag to false because we already notified about the
first buffer, but unfortunately we don't have that information, so we notify
again, just in case).
# The reader thread polls (`PipelinedSubpartition#pollBuffer`) the first
buffer (the finished one) from the queue.
# Only now do we call the `notifyDataAvailable` method, but by this point the
queue holds only one unfinished buffer, so this notification doesn't make
sense.
# The reader thread polls the unfinished buffer from the queue.
So, as you can see, it is possible to notify about the same buffer twice, which
can lead to the unfinished buffer being read. This can happen because the
`notifyDataAvailable` flag is calculated inside the `synchronized` block, while
the `notifyDataAvailable` method is called outside of this block, which leads
to a race condition.
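
To make the interleaving above easier to follow, here is a minimal sketch that
replays it deterministically on a single thread. It is not the Flink code:
`NotifyRaceSketch`, `add`, `notifyReader`, `poll`, and `replay` are hypothetical
names, buffers are reduced to a finished/unfinished boolean, and the flag rule
is a simplification taken only from the steps listed above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

final class NotifyRaceSketch {
    // Hypothetical simplified queue of buffers: true = finished, false = unfinished.
    private final ArrayDeque<Boolean> buffers = new ArrayDeque<>();
    private final List<String> log = new ArrayList<>();

    /** Enqueues a buffer and computes the flag under the lock; the caller notifies later. */
    boolean add(boolean finished) {
        synchronized (buffers) {
            buffers.add(finished);
            // Simplified rule from the steps above: a lone finished buffer, or a second buffer.
            return (buffers.size() == 1 && finished) || buffers.size() == 2;
        }
    }

    /** Stand-in for the notification call, which happens outside the lock. */
    void notifyReader(boolean flag) {
        if (flag) {
            log.add("notify");
        }
    }

    /** Stand-in for the reader polling one buffer. */
    void poll() {
        synchronized (buffers) {
            Boolean finished = buffers.poll();
            log.add("poll finished=" + finished);
        }
    }

    /** Replays the problematic interleaving step by step on a single thread. */
    static List<String> replay() {
        NotifyRaceSketch s = new NotifyRaceSketch();
        boolean first = s.add(true);    // steps 1-2: event added, flag is true
        s.notifyReader(first);          // step 3: first notification
        boolean second = s.add(false);  // steps 4-5: unfinished buffer added, flag is true again
        s.poll();                       // step 6: reader takes the finished buffer
        s.notifyReader(second);         // step 7: late notification, only an unfinished buffer is left
        s.poll();                       // step 8: reader takes the unfinished buffer
        return s.log;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

In this sketch the second `notify` is recorded after the finished buffer has
already been polled, so the only thing left for the reader is the unfinished
buffer. Moving the notification call inside the lock would remove this window
in the sketch, at the cost of notifying while holding the lock.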
At least for now, I personally don't think that it is a big problem, and I
don't think it makes sense to complicate our implementation of the data
availability notification in order to fix it. That is why I have just fixed the
test. But I am still thinking about it and may change my mind.
[~pnowojski], anyway, I am not really confident in my conclusion, so let's
discuss it, especially if you think that it is a more serious problem than I
do.
> BufferTimeoutITCase.testDisablingBufferTimeout failed on Azure
> --------------------------------------------------------------
>
> Key: FLINK-24800
> URL: https://issues.apache.org/jira/browse/FLINK-24800
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.15.0
> Reporter: Yun Gao
> Assignee: Anton Kalashnikov
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.15.0
>
>
> {code:java}
> 2021-11-05T12:18:50.5272055Z Nov 05 12:18:50 [INFO] Results:
> 2021-11-05T12:18:50.5273369Z Nov 05 12:18:50 [INFO]
> 2021-11-05T12:18:50.5274011Z Nov 05 12:18:50 [ERROR] Failures:
> 2021-11-05T12:18:50.5274518Z Nov 05 12:18:50 [ERROR]
> BufferTimeoutITCase.testDisablingBufferTimeout:85
> 2021-11-05T12:18:50.5274871Z Nov 05 12:18:50 Expected: <0>
> 2021-11-05T12:18:50.5275150Z Nov 05 12:18:50 but: was <1>
> 2021-11-05T12:18:50.5276136Z Nov 05 12:18:50 [INFO]
> 2021-11-05T12:18:50.5276667Z Nov 05 12:18:50 [ERROR] Tests run: 1849,
> Failures: 1, Errors: 0, Skipped: 58
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26018&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=10850