[
https://issues.apache.org/jira/browse/FLINK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285219#comment-17285219
]
Piotr Nowojski edited comment on FLINK-21133 at 2/16/21, 2:47 PM:
------------------------------------------------------------------
{quote}
Given all above, I would say "stop-with-savepoint" as an utility method to
"take a savepoint and then exit" does not deserve any extra semantics
guarantees from streaming data flow comparing to "cancel --savepoint".
If we could reach above agreement, then I think the refactor is pretty simple,
just "take a savepoint, cleanup resources, then exit" in single task. No data
flow events will involving in. More importantly, we could clearly document that
"close" will only be "called after all records have been added to the
operators" to fix api semantics exception.
{quote}
[~kezhuw] as I tried to mention a couple of time in the FLINK-21132. Simply
"take a savepoint, cleanup resources, then exit" in a single task, disregarding
what the upstream task is doing, can easily lead to a deadlock, if upstream
task thread is blocked/fully backpressured. We could probably ignore this
problem in network tasks and FLIP-27 sources, by just assuming that if
downstream task received an aligned savepoint barrier, upstream task has
completely paused production of records, so it's not backpressured. But this
assumption doesn't hold with the legacy sources. Maybe it would also cause
problems with iterations/cyclic graphs in the future.
And I would be afraid that relaying on such kind of assumption might be
fragile. Orderly shutdown from sources to the sinks is from this perspective
safer.
However I admit, that such approach has good advantages that you mentioned (not
having to flush buffered records, faster closing etc).
was (Author: pnowojski):
{quote}
Given all above, I would say "stop-with-savepoint" as an utility method to
"take a savepoint and then exit" does not deserve any extra semantics
guarantees from streaming data flow comparing to "cancel --savepoint".
If we could reach above agreement, then I think the refactor is pretty simple,
just "take a savepoint, cleanup resources, then exit" in single task. No data
flow events will involving in. More importantly, we could clearly document that
"close" will only be "called after all records have been added to the
operators" to fix api semantics exception.
{quote}
[~kezhuw] as I tried to mention a couple of time in the FLINK-21132. Simply
"take a savepoint, cleanup resources, then exit" in a single task, disregarding
what the upstream task is doing, can easily lead to a deadlock, if upstream
task thread is blocked/fully backpressured. We could probably ignore this
problem in network tasks and FLIP-27 sources, by just assuming that if
downstream task received an aligned savepoint barrier, upstream task has
completely paused production of records, so it's not backpressured. But this
assumption doesn't hold with the legacy sources. Maybe it would also cause
problems with iterations/cyclic graphs in the future.
And I would be afraid that relaying on such kind of assumption might be
fragile. Orderly shutdown from sources to the sinks is from this perspective
safer.
> FLIP-27 Source does not work with synchronous savepoint
> -------------------------------------------------------
>
> Key: FLINK-21133
> URL: https://issues.apache.org/jira/browse/FLINK-21133
> Project: Flink
> Issue Type: Bug
> Components: API / Core, API / DataStream, Runtime / Checkpointing
> Affects Versions: 1.11.3, 1.12.1
> Reporter: Kezhu Wang
> Priority: Critical
> Fix For: 1.11.4, 1.13.0, 1.12.3
>
>
> I have pushed branch
> [synchronous-savepoint-conflict-with-bounded-end-input-case|https://github.com/kezhuw/flink/commits/synchronous-savepoint-conflict-with-bounded-end-input-case]
> in my repository. {{SavepointITCase.testStopSavepointWithFlip27Source}}
> failed due to timeout.
> See also FLINK-21132 and
> [apache/iceberg#2033|https://github.com/apache/iceberg/issues/2033]..
--
This message was sent by Atlassian Jira
(v8.3.4#803005)