[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429781#comment-17429781 ]

Feifan Wang commented on FLINK-9465:

Hi [~trohrmann], the [PR|https://github.com/apache/flink/pull/17443] is ready. Can you help review it?

> Specify a separate savepoint timeout option via CLI
> ---
>
> Key: FLINK-9465
> URL: https://issues.apache.org/jira/browse/FLINK-9465
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.5.0
> Reporter: Truong Duc Kien
> Assignee: Feifan Wang
> Priority: Minor
> Labels: auto-deprioritized-major, pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Savepoints can take much longer to complete than checkpoints, especially
> with incremental checkpoints enabled. This causes a couple of problems:
> * For our job, we currently have to set the checkpoint timeout much larger
> than necessary; otherwise we would be unable to take a savepoint.
> * During rush hour, our cluster sees a high rate of checkpoint timeouts
> due to backpressure, but we are unable to migrate to a larger
> configuration because the savepoint also times out.
> In my opinion, the savepoint timeout should be configurable separately,
> both in the config file and as a parameter to the savepoint command.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429326#comment-17429326 ]

Feifan Wang commented on FLINK-9465:

I am still working on this, [~flink-jira-bot].
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17426559#comment-17426559 ]

Feifan Wang commented on FLINK-9465:

Hi [~trohrmann], I have opened a pull request to resolve this, but I think some unit tests still need to be completed. Can you take a look at the PR and give me some guidance on the unit tests?
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17426306#comment-17426306 ]

Feifan Wang commented on FLINK-9465:

Hi [~trohrmann], thanks for pointing me to FLINK-15787; I fully agree after reading it. I will name the parameter "savepointTimeout" in all four places mentioned above.
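For illustration, the naming settled on in this comment could look roughly like the sketch below when a client builds the POST body. This is only a sketch under assumptions: `with_savepoint_timeout` is a hypothetical helper, the thread fixes neither the unit nor the type of the value, and nothing here is confirmed Flink API.

```python
from typing import Any, Dict, Optional


def with_savepoint_timeout(
    body: Dict[str, Any],
    savepoint_timeout: Optional[int] = None,
) -> Dict[str, Any]:
    """Return a copy of a request body with the proposed field added.

    Hypothetical helper: "savepointTimeout" is the name proposed in this
    thread, and the unit of the value is an assumption left to the caller.
    """
    result = dict(body)  # copy, so the caller's dict is not mutated
    if savepoint_timeout is not None:
        # The parameter is optional; it is only sent when explicitly given.
        result["savepointTimeout"] = savepoint_timeout
    return result
```

Calling `with_savepoint_timeout({"drain": False}, 1800000)` adds the field, while omitting the second argument leaves the body unchanged; the `drain` entry is just a placeholder for whatever the request already carries.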
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17426276#comment-17426276 ]

Till Rohrmann commented on FLINK-9465:

Hi [~Feifan Wang], sorry for my late reply. I think we should use camel case for the parameter. I think we have agreed on this in FLINK-15787. For the CLI we should also name it {{savepointTimeout}}.
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17424536#comment-17424536 ]

Feifan Wang commented on FLINK-9465:

Hi [~trohrmann],

Since the two REST APIs mentioned above use the POST method, I intend to add the parameter to the body of the HTTP request, just like the other parameters. I would name the parameter simply "savepoint-timeout" or "savepointTimeout":
* "savepoint-timeout" for [REST API : /jobs/:jobid/savepoints|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/rest_api/#jobs-jobid-savepoints] and [CLI : Creating a Savepoint|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/cli/#creating-a-savepoint]
* "savepointTimeout" for [REST API : /jobs/:jobid/stop|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/rest_api/#jobs-jobid-stop] and [CLI : Stopping a Job Gracefully Creating a Final Savepoint|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/cli/#stopping-a-job-gracefully-creating-a-final-savepoint]

The parameter should be optional in all four places; if it is absent, the checkpoint timeout takes effect. What do you think?
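The fallback proposed in this comment (use the savepoint timeout if given, otherwise the checkpoint timeout) can be sketched as below. The function name, the millisecond unit, and the rejection of non-positive values are assumptions made for the example, not something the thread or Flink specifies.

```python
from typing import Optional


def effective_savepoint_timeout_ms(
    savepoint_timeout_ms: Optional[int],
    checkpoint_timeout_ms: int,
) -> int:
    """Pick the timeout to apply to a savepoint (hypothetical helper)."""
    if savepoint_timeout_ms is None:
        # Parameter absent: the configured checkpoint timeout takes effect.
        return checkpoint_timeout_ms
    if savepoint_timeout_ms <= 0:
        # Assumption for this sketch: a non-positive timeout is an error.
        raise ValueError("savepoint timeout must be positive")
    return savepoint_timeout_ms
```

With this shape, existing callers that never pass the new parameter keep today's behaviour, which is presumably why the proposal makes it optional.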
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423992#comment-17423992 ]

Till Rohrmann commented on FLINK-9465:

Do you want to add the parameter as a query parameter or make it part of the body of the HTTP request? For the changes in the REST API I would stick to how it's done there. Same for the CLI parameter formatting. How would you name the parameter in the CLI and REST API? The places you suggested make sense for the introduction of the timeout parameter.
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423539#comment-17423539 ]

Feifan Wang commented on FLINK-9465:

Hi [~trohrmann], thanks for the reply. I think we can add the "savepoint-timeout" parameter in the following four places:

REST API :
* [/jobs/:jobid/savepoints|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/rest_api/#jobs-jobid-savepoints]
* [/jobs/:jobid/stop|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/rest_api/#jobs-jobid-stop]

Command-Line Interface :
* [Creating a Savepoint|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/cli/#creating-a-savepoint]
* [Stopping a Job Gracefully Creating a Final Savepoint|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/cli/#stopping-a-job-gracefully-creating-a-final-savepoint]

BTW, I noticed that the REST API and CLI use different parameter naming styles: some parameters are in camel case and others are in kebab case. Should we use a uniform format?
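To make the two naming styles mentioned in this comment concrete, here is a small sketch that maps a kebab-case name to its camel-case form. `kebab_to_camel` is a hypothetical helper written for this example and is not part of Flink.

```python
def kebab_to_camel(name: str) -> str:
    """Convert a kebab-case parameter name to camel case."""
    head, *tail = name.split("-")
    # Keep the first word as-is and capitalize each following word.
    return head + "".join(part.capitalize() for part in tail)
```

For instance, `kebab_to_camel("savepoint-timeout")` yields `savepointTimeout`, the two spellings discussed in this thread.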
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423266#comment-17423266 ]

Till Rohrmann commented on FLINK-9465:

Hi [~Feifan Wang], I've assigned the ticket to you. Before you start coding could you quickly explain how you intend to solve the problem (e.g. which REST parameters to add, whether to add a CLI option and if yes what's its name?).
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420579#comment-17420579 ]

Feifan Wang commented on FLINK-9465:

Hi [~trohrmann] [~twalthr], this problem also affects us. I strongly agree with being able to specify a value different from the configured checkpoint timeout via the CLI or REST API. I would be glad to work on it; can you assign this issue to me?

> Specify a separate savepoint timeout option via CLI
> ---
>
> Key: FLINK-9465
> URL: https://issues.apache.org/jira/browse/FLINK-9465
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.5.0
> Reporter: Truong Duc Kien
> Priority: Minor
> Labels: auto-deprioritized-major, pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336627#comment-17336627 ]

Flink Jira Bot commented on FLINK-9465:

This issue was labeled "stale-major" 7 days ago and has not received any updates so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

> Specify a separate savepoint timeout option via CLI
> ---
>
> Key: FLINK-9465
> URL: https://issues.apache.org/jira/browse/FLINK-9465
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.5.0
> Reporter: Truong Duc Kien
> Priority: Major
> Labels: pull-request-available, stale-major
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328626#comment-17328626 ]

Flink Jira Bot commented on FLINK-9465:

This major issue is unassigned and itself and all of its Sub-Tasks have not been updated for 30 days. So, it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update. Afterwards, please remove the label. In 7 days the issue will be deprioritized.
[jira] [Commented] (FLINK-9465) Specify a separate savepoint timeout option via CLI
[ https://issues.apache.org/jira/browse/FLINK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267135#comment-17267135 ]

Timo Walther commented on FLINK-9465:

Seems nobody is working on this issue anymore. I marked it as unassigned. There was also a thread on the user@ ML on this topic recently: https://lists.apache.org/thread.html/rac24855efe372b09b025a1eeb1c8111c9bc8c216265ce94cbf0d3880%40%3Cuser.flink.apache.org%3E

> Specify a separate savepoint timeout option via CLI
> ---
>
> Key: FLINK-9465
> URL: https://issues.apache.org/jira/browse/FLINK-9465
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.5.0
> Reporter: Truong Duc Kien
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h