[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI
[ https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805471#comment-17805471 ]

Zakelly Lan commented on FLINK-33897:
-------------------------------------

{quote}By real world motivation, I meant if that really is an issue that someone complained about?{quote}

The reason behind this is that some of our customers started their jobs with the default configuration and then saw back-pressure and checkpoint failures that lasted for a while. They reached out to me to ask whether there is some way to eliminate the back-pressure without introducing much more delay or pouring a lot of duplicated data into the sink (not exactly-once). Their complaint is that they must suffer first and then restart the job.

> Allow triggering unaligned checkpoint via CLI
> ---------------------------------------------
>
> Key: FLINK-33897
> URL: https://issues.apache.org/jira/browse/FLINK-33897
> Project: Flink
> Issue Type: New Feature
> Components: Command Line Client, Runtime / Checkpointing
> Reporter: Zakelly Lan
> Assignee: Zakelly Lan
> Priority: Major
>
> After FLINK-6755, users can trigger a checkpoint through the CLI. However, I
> noticed there would be value in supporting triggering it in an unaligned way,
> since the job may encounter high back-pressure and an aligned checkpoint would
> fail.
>
> I suggest we provide an option '-unaligned' in the CLI to support that.
>
> A similar option would also be useful for the REST API.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI
[ https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803535#comment-17803535 ]

Piotr Nowojski commented on FLINK-33897:
----------------------------------------

By real world motivation, I meant if that really is an issue that someone complained about? If not, and this is just a theoretical possibility that comes from your observation when implementing FLINK-6755 ("it could be implemented, someone might find it useful"), I would put it aside for the time being.

Honestly, I doubt many users would use this feature. In most cases, just cancelling the job and restarting it with a new configuration would be faster than first trying to find out from the docs/user mailing list/Stack Overflow that one can actually trigger an unaligned checkpoint from the CLI. This would only be useful to a handful of power users, but those should already know that it's better to use unaligned checkpoints from the get-go.

{quote}
I'm not very familiar with this part so if you think this is a big change, I won't insist on doing it.
{quote}
Adding a new BarrierHandlerState is maybe not a very big change per se, but it will visibly increase the complexity of the code when someone needs to read/understand it.

{quote}
I do agree we could enable timeout for aligned cp by default, which would greatly reduce such cases.
{quote}
Let me start the dev mailing list discussion about that.
[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI
[ https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803407#comment-17803407 ]

Zakelly Lan commented on FLINK-33897:
-------------------------------------

[~pnowojski] Actually there is real world motivation. When a job encounters high back-pressure and aligned checkpointing has gone on for dozens of minutes without success, the user finds that they need to switch to unaligned cp or increase the parallelism. Such a change requires a job restart, which puts users in a dilemma because it involves replaying a lot of data and a longer delay. This feature allows users to take an unaligned cp temporarily and restart from it, avoiding the large data replay.

I do agree we could enable timeout for aligned cp by default, which would greatly reduce such cases. I also think there would be value in giving users a chance to change the configuration and restart the job with less pain when they have misconfigured their jobs, by supporting triggering a swift and promising checkpoint or savepoint.

As for the complexity of supporting this feature, IIUC, some changes would apply to the handler states (this may introduce a new {{BarrierHandlerState}}) and fewer changes would be made to the {{SingleCheckpointBarrierHandler}} itself. I'm not very familiar with this part so if you think this is a big change, I won't insist on doing it.
[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI
[ https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803185#comment-17803185 ]

Piotr Nowojski commented on FLINK-33897:
----------------------------------------

I have mixed feelings. Shouldn't the solution be to just use/enable unaligned checkpoints? If one sets the alignment timeout to some reasonable value, I don't see a reason for someone to use aligned checkpoints anymore. Maybe instead let's consider deprecating aligned checkpoints without timeout?

Is there some real world motivation behind this feature? I would be -1 for this feature if it requires complicating/making changes to the actual barrier handling (apart from replacing the {{SingleCheckpointBarrierHandler#aligned}} call with a {{SingleCheckpointBarrierHandler#alternating}} call). This code is complicated, and in the past we had a lot of deadlocks, data corruptions and other critical bugs around those areas, so keeping it as simple as possible and minimising the number of supported features is quite important.

> Allow triggering unaligned checkpoint via CLI
> ---------------------------------------------
>
> Key: FLINK-33897
> URL: https://issues.apache.org/jira/browse/FLINK-33897
> Project: Flink
> Issue Type: Improvement
> Components: Command Line Client, Runtime / Checkpointing
> Reporter: Zakelly Lan
> Assignee: Zakelly Lan
> Priority: Major
>
> After FLINK-6755, users can trigger a checkpoint through the CLI. However, I
> noticed there would be value in supporting triggering it in an unaligned way,
> since the job may encounter high back-pressure and an aligned checkpoint would
> fail.
>
> I suggest we provide an option '-unaligned' in the CLI to support that.
>
> A similar option would also be useful for the REST API.
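For reference, the alternative raised in the comment above (unaligned checkpoints combined with an alignment timeout) is a job configuration change rather than a CLI action. A minimal sketch of such a configuration follows. It assumes the option names used by Flink releases around 1.18; later releases may rename or deprecate them, and the 30-second value is only an illustrative choice, not a recommendation from the thread.

```yaml
# Assumed option names (roughly Flink 1.14 to 1.18); verify against the
# configuration reference of the release you run.

# Allow checkpoint barriers to overtake buffered in-flight data.
execution.checkpointing.unaligned: true

# Start each checkpoint aligned and switch to unaligned only if the
# alignment takes longer than this timeout (illustrative value).
execution.checkpointing.aligned-checkpoint-timeout: 30s
```

As the thread notes, changing these settings on a running job requires a restart, which is the situation the proposed '-unaligned' option is meant to ease.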
[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI
[ https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799233#comment-17799233 ]

Zakelly Lan commented on FLINK-33897:
-------------------------------------

This also requires the {{SingleCheckpointBarrierHandler}} to change from the aligned to the unaligned state when it receives an unaligned barrier.

I would like to hear your thoughts, [~pnowojski] [~dwysakowicz].