[jira] [Updated] (FLINK-13856) Reduce the delete file api when the checkpoint is completed
[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew.D.lin updated FLINK-13856:
---------------------------------
    Fix Version/s: 1.13.1

> Reduce the delete file api when the checkpoint is completed
> -----------------------------------------------------------
>
>                 Key: FLINK-13856
>                 URL: https://issues.apache.org/jira/browse/FLINK-13856
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.8.1, 1.9.0
>            Reporter: Andrew.D.lin
>            Assignee: Andrew.D.lin
>            Priority: Major
>              Labels: pull-request-available, stale-assigned
>             Fix For: 1.13.1
>
>         Attachments: after.png, before.png, f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png
>
>   Original Estimate: 48h
>          Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When a new checkpoint is completed, an old checkpoint is deleted by calling CompletedCheckpoint.discardOnSubsume().
> Deleting an old checkpoint follows these steps:
> 1. drop the metadata
> 2. discard private state objects
> 3. discard the location as a whole
> In some cases, would it be possible to delete the checkpoint folder recursively with a single call?
> As far as I know, for full checkpoints it should be possible to delete the folder directly.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
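The three-step discard described above ends up issuing one delete call per state file. A minimal sketch of the proposed alternative, removing the whole checkpoint directory with one recursive traversal, using only java.nio.file (the class and method names here are illustrative, not Flink's; in Flink this would correspond to a single recursive FileSystem#delete on the checkpoint path):

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Illustrative sketch of deleting a checkpoint directory as a whole. */
public class CheckpointDirCleanup {

    /** Remove the checkpoint directory and everything inside it in one traversal. */
    public static void deleteRecursively(Path checkpointDir) throws IOException {
        Files.walkFileTree(checkpointDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file); // a state file or the _metadata file
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                Files.delete(dir); // delete each directory after its contents
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

On a local filesystem the traversal still touches every file, but on a DFS such as HDFS the same intent collapses into a single RPC (`fs.delete(path, true)`), which is the API-call reduction the issue is after.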
[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed
[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325549#comment-17325549 ]

Andrew.D.lin commented on FLINK-13856:
--------------------------------------

What I said above is wrong. The [getSharedState|https://github.com/apache/flink/blob/24031e55e4cf35a5818db2e927e65b290a9b2aed/flink-runtime/src/main/java/org/apache/flink/runtime/state/IncrementalRemoteKeyedStateHandle.java#L123] method returns FileStateHandle, not IncrementalRemoteKeyedStateHandle (the shared state handle). The method is easy to misunderstand and misuse, so for now we can only collect IncrementalRemoteKeyedStateHandle (the only type of incremental state handle) for separate processing.
[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed
[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325541#comment-17325541 ]

Andrew.D.lin commented on FLINK-13856:
--------------------------------------

[~sewen] Thanks for your careful reply. Do we plan to do this in the next minor version, 1.13 or 1.14? Knowing that would let me set the expected fix version on this issue.

In addition, I noticed that we can get all the shared state through the [getSharedState|https://github.com/apache/flink/blob/24031e55e4cf35a5818db2e927e65b290a9b2aed/flink-runtime/src/main/java/org/apache/flink/runtime/state/IncrementalRemoteKeyedStateHandle.java#L123] method. This way we can handle all the shared state separately (discarding the shared state handles), and then directly and safely delete the exclusive folder.
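The processing order sketched in this comment, first discard shared state handles one by one, then drop everything exclusive to the checkpoint with a single call, can be illustrated with placeholder types (SharedStateHandle, PrivateStateHandle, and discardPlan are hypothetical names for this sketch, not Flink's API):

```java
import java.util.List;

/** Illustrative sketch of splitting checkpoint state into shared vs. exclusive parts. */
public class StateDiscardSketch {

    interface StateHandle {}

    // Placeholder for state that other checkpoints may still reference
    // (e.g. files referenced via an IncrementalRemoteKeyedStateHandle).
    static final class SharedStateHandle implements StateHandle {}

    // Placeholder for state exclusive to this checkpoint.
    static final class PrivateStateHandle implements StateHandle {}

    /**
     * Shared state must be released handle by handle, because other checkpoints
     * may still reference those files. Everything exclusive to this checkpoint
     * lives under one folder, so a single recursive delete can cover it.
     */
    static String discardPlan(List<StateHandle> handles) {
        int shared = 0;
        int exclusive = 0;
        for (StateHandle h : handles) {
            if (h instanceof SharedStateHandle) {
                shared++;
            } else {
                exclusive++;
            }
        }
        return shared + " per-handle deletes + 1 recursive delete covering "
                + exclusive + " exclusive files";
    }
}
```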
[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed
[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324869#comment-17324869 ]

Andrew.D.lin commented on FLINK-13856:
--------------------------------------

[~fanrui] Thanks. Whether this feature is turned on is configurable; in our scenario it has been working properly for two years.

[~sewen] If S3 and similar stores handle a recursive delete worse than deleting the keys directly, we can recommend turning this optimization off for S3. I think this feature is confirmed to bring a good improvement in some scenarios. Do you think it is worth continuing?

https://github.com/apache/flink/pull/15667
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311547#comment-17311547 ]

Andrew.D.lin edited comment on FLINK-10052 at 3/30/21, 2:21 PM:
----------------------------------------------------------------

Note the change in [FLINK-18677|https://issues.apache.org/jira/browse/FLINK-18677]. I think no extra logic should be added in [handleStateChange|https://github.com/apache/flink/blob/940dfb0deccb31e0ca576b4c044cbf588e0765dd/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalDriver.java#L152]; that change only adds logging at the beginning. Connection state handling should be managed by [ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27] (which supports the Session and Standard policies).

> Tolerate temporarily suspended ZooKeeper connections
> ----------------------------------------------------
>
>                 Key: FLINK-10052
>                 URL: https://issues.apache.org/jira/browse/FLINK-10052
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>            Reporter: Till Rohrmann
>            Assignee: Zili Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> This issue results from FLINK-10011, which uncovered a problem with Flink's HA recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator recipe for leader election. The leader latch revokes leadership in case of a suspended ZooKeeper connection. This can be premature if the system can reconnect to ZooKeeper before its session expires. The effect of the lost leadership is that all jobs will be canceled and directly restarted after regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper connection, it would be better to wait until the ZooKeeper connection is LOST. That way we would allow the system to reconnect and not lose the leadership. This could be achieved by using Curator's {{LeaderSelector}} instead of the {{LeaderLatch}}.
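The proposal above, keep leadership through a transient SUSPENDED and revoke only when the connection is LOST, can be sketched as a tiny state machine (LeadershipGuard and its method names are hypothetical; the real fix goes through Curator's LeaderLatch/LeaderSelector machinery):

```java
/** Illustrative sketch: revoke leadership only when the ZooKeeper session is really gone. */
public class LeadershipGuard {

    enum ConnState { CONNECTED, SUSPENDED, LOST }

    private boolean leader = true; // assume we currently hold leadership

    /** Tolerate SUSPENDED (the session may still recover); revoke only on LOST. */
    void onStateChange(ConnState state) {
        if (state == ConnState.LOST) {
            // Session expired: leadership is really gone and jobs must fail over.
            leader = false;
        }
        // SUSPENDED: transient connection problems; the session may outlive them,
        // so we keep leadership instead of cancelling and restarting all jobs.
        // CONNECTED: nothing to do, leadership was never revoked.
    }

    boolean isLeader() {
        return leader;
    }
}
```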
[jira] [Commented] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.
[ https://issues.apache.org/jira/browse/FLINK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304192#comment-17304192 ]

Andrew.D.lin commented on FLINK-21858:
--------------------------------------

Thanks for your reply. (y) [~chesnay] OK, I will keep an eye on [FLINK-20388|https://issues.apache.org/jira/browse/FLINK-20388].

> TaskMetricGroup taskName is too long, especially in sql tasks.
> --------------------------------------------------------------
>
>                 Key: FLINK-21858
>                 URL: https://issues.apache.org/jira/browse/FLINK-21858
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics
>    Affects Versions: 1.12.0, 1.12.1, 1.12.2
>            Reporter: Andrew.D.lin
>            Assignee: Andrew.D.lin
>            Priority: Major
>
> Currently operatorName is limited to 80 characters by org.apache.flink.runtime.metrics.groups.TaskMetricGroup#METRICS_OPERATOR_NAME_MAX_LENGTH.
> So I propose to make the maximum length of the metric name configurable.
>
> Here is an example:
>
> "taskName":"GlobalGroupAggregate(groupBy=[dt, src, src1, src2, src3, ct1, ct2], select=[dt, src, src1, src2, src3, ct1, ct2, SUM_RETRACT((sum$0, count$1)) AS sx_pv, SUM_RETRACT((sum$2, count$3)) AS sx_uv, MAX_RETRACT(max$4) AS updt_time, MAX_RETRACT(max$5) AS time_id]) -> Calc(select=[((MD5((dt CONCAT _UTF-16LE'|' CONCAT src CONCAT _UTF-16LE'|' CONCAT src1 CONCAT _UTF-16LE'|' CONCAT src2 CONCAT _UTF-16LE'|' CONCAT src3 CONCAT _UTF-16LE'|' CONCAT ct1 CONCAT _UTF-16LE'|' CONCAT ct2 CONCAT _UTF-16LE'|' CONCAT time_id)) SUBSTR 1 SUBSTR 2) CONCAT _UTF-16LE'_' CONCAT (dt CONCAT _UTF-16LE'|' CONCAT src CONCAT _UTF-16LE'|' CONCAT src1 CONCAT _UTF-16LE'|' CONCAT src2 CONCAT _UTF-16LE'|' CONCAT src3 CONCAT _UTF-16LE'|' CONCAT ct1 CONCAT _UTF-16LE'|' CONCAT ct2 CONCAT _UTF-16LE'|' CONCAT time_id)) AS rowkey, sx_pv, sx_uv, updt_time]) -> LocalGroupAggregate(groupBy=[rowkey], select=[rowkey, MAX_RETRACT(sx_pv) AS max$0, MAX_RETRACT(sx_uv) AS max$1, MAX_RETRACT(updt_time) AS max$2, COUNT_RETRACT(*) AS count1$3])"
> "operatorName":"GlobalGroupAggregate(groupBy=[dt, src, src1, src2, src3, ct1, ct2], selec=[dt, s"
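The truncation the quoted issue describes, an 80-character cap from TaskMetricGroup#METRICS_OPERATOR_NAME_MAX_LENGTH, behaves roughly like the sketch below; the helper is illustrative, not the actual Flink code (the issue proposes making the cap configurable instead of hard-coded):

```java
/** Illustrative sketch of the fixed-length operator-name truncation. */
public class MetricNameTruncation {

    // The hard-coded cap mentioned in the issue
    // (TaskMetricGroup#METRICS_OPERATOR_NAME_MAX_LENGTH).
    static final int METRICS_OPERATOR_NAME_MAX_LENGTH = 80;

    /** Cut the operator name to the cap; names at or under the cap pass through. */
    static String truncateOperatorName(String name) {
        if (name.length() <= METRICS_OPERATOR_NAME_MAX_LENGTH) {
            return name;
        }
        return name.substring(0, METRICS_OPERATOR_NAME_MAX_LENGTH);
    }
}
```

With a generated SQL operator name like the `GlobalGroupAggregate(...)` example above, everything after the first 80 characters is simply dropped, which is why the reported `operatorName` ends mid-token.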
[jira] [Commented] (FLINK-20388) Supports users setting operators' metrics name
[ https://issues.apache.org/jira/browse/FLINK-20388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304188#comment-17304188 ]

Andrew.D.lin commented on FLINK-20388:
--------------------------------------

org.apache.flink.runtime.metrics.groups.TaskMetricGroup#taskName also has this problem, especially in SQL jobs. I hope it can be solved together with https://issues.apache.org/jira/browse/FLINK-21858.

> Supports users setting operators' metrics name
> ----------------------------------------------
>
>                 Key: FLINK-20388
>                 URL: https://issues.apache.org/jira/browse/FLINK-20388
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / DataStream, Runtime / Metrics
>            Reporter: hailong wang
>            Priority: Major
>
> Currently, we only support users setting operator names. We use those in the topology to distinguish operators and, at the same time, as the operator metric names.
> If the operator name length is larger than 80, we truncate it simply.
> I think we can allow users to set operator metric names just like operator names. If the user does not set one, the current behavior is used.
[jira] [Commented] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.
[ https://issues.apache.org/jira/browse/FLINK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304037#comment-17304037 ]

Andrew.D.lin commented on FLINK-21858:
--------------------------------------

Hi [~chesnay], is it reasonable to restrict the metric name length via configuration, or is there a better suggestion? :)
[jira] [Commented] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.
[ https://issues.apache.org/jira/browse/FLINK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304033#comment-17304033 ]

Andrew.D.lin commented on FLINK-21858:
--------------------------------------

OK, thanks for your reply. [~zjwang]
[jira] [Commented] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.
[ https://issues.apache.org/jira/browse/FLINK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304000#comment-17304000 ]

Andrew.D.lin commented on FLINK-21858:
--------------------------------------

I am willing to do it!
[jira] [Created] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.
Andrew.D.lin created FLINK-21858: Summary: TaskMetricGroup taskName is too long, especially in sql tasks. Key: FLINK-21858 URL: https://issues.apache.org/jira/browse/FLINK-21858 Project: Flink Issue Type: Improvement Components: Runtime / Metrics Affects Versions: 1.12.2, 1.12.1, 1.12.0 Reporter: Andrew.D.lin Now operatorName is limited to 80 by org.apache.flink.runtime.metrics.groups.TaskMetricGroup#METRICS_OPERATOR_NAME_MAX_LENGTH. So propose to limit the maximum length of metric name by configuration. Here is an example: "taskName":"GlobalGroupAggregate(groupBy=[dt, src, src1, src2, src3, ct1, ct2], select=[dt, src, src1, src2, src3, ct1, ct2, SUM_RETRACT((sum$0, count$1)) AS sx_pv, SUM_RETRACT((sum$2, count$3)) AS sx_uv, MAX_RETRACT(max$4) AS updt_time, MAX_RETRACT(max$5) AS time_id]) -> Calc(select=[((MD5((dt CONCAT _UTF-16LE'|' CONCAT src CONCAT _UTF-16LE'|' CONCAT src1 CONCAT _UTF-16LE'|' CONCAT src2 CONCAT _UTF-16LE'|' CONCAT src3 CONCAT _UTF-16LE'|' CONCAT ct1 CONCAT _UTF-16LE'|' CONCAT ct2 CONCAT _UTF-16LE'|' CONCAT time_id)) SUBSTR 1 SUBSTR 2) CONCAT _UTF-16LE'_' CONCAT (dt CONCAT _UTF-16LE'|' CONCAT src CONCAT _UTF-16LE'|' CONCAT src1 CONCAT _UTF-16LE'|' CONCAT src2 CONCAT _UTF-16LE'|' CONCAT src3 CONCAT _UTF-16LE'|' CONCAT ct1 CONCAT _UTF-16LE'|' CONCAT ct2 CONCAT _UTF-16LE'|' CONCAT time_id)) AS rowkey, sx_pv, sx_uv, updt_time]) -> LocalGroupAggregate(groupBy=[rowkey], select=[rowkey, MAX_RETRACT(sx_pv) AS max$0, MAX_RETRACT(sx_uv) AS max$1, MAX_RETRACT(updt_time) AS max$2, COUNT_RETRACT(*) AS count1$3])" "operatorName":"GlobalGroupAggregate(groupBy=[dt, src, src1, src2, src3, ct1, ct2], selec=[dt, s" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20424) The percent of acknowledged checkpoint seems incorrect
[ https://issues.apache.org/jira/browse/FLINK-20424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242891#comment-17242891 ]

Andrew.D.lin commented on FLINK-20424:
--------------------------------------

I discovered this problem before, and I think it would be more appropriate to keep the percentage to two decimal places. Can I take it?

> The percent of acknowledged checkpoint seems incorrect
> ------------------------------------------------------
>
>                 Key: FLINK-20424
>                 URL: https://issues.apache.org/jira/browse/FLINK-20424
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Web Frontend
>            Reporter: zlzhang0122
>            Priority: Minor
>         Attachments: 2020-11-30 14-18-34 的屏幕截图.png
>
> As the picture below shows, the percent of acknowledged checkpoints seems incorrect. I think the number should not be 100%, because one of the checkpoint acknowledgements failed.
> !2020-11-30 14-18-34 的屏幕截图.png!
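Keeping two decimal places, as suggested in the comment, avoids a nearly-complete checkpoint rounding up to a misleading 100%. A hedged sketch (ackPercent is a hypothetical helper, not the web frontend's actual code):

```java
import java.util.Locale;

/** Illustrative sketch of formatting the acknowledged-subtask percentage. */
public class AckPercent {

    /**
     * Format acknowledged/total with two decimal places instead of rounding to a
     * whole percent, so e.g. 7 of 8 acknowledged does not display as 100%.
     * Locale.ROOT keeps the decimal separator a '.' regardless of system locale.
     */
    static String ackPercent(int acknowledged, int total) {
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * acknowledged / total);
    }
}
```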
[jira] [Commented] (FLINK-14012) Failed to start job for consuming Secure Kafka after the job cancel
[ https://issues.apache.org/jira/browse/FLINK-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199296#comment-17199296 ] Andrew.D.lin commented on FLINK-14012: -- [~aljoscha] [~Daebeom] , Hi may I ask which pr or commit fixed this? Can I cherry-pick to lower version? > Failed to start job for consuming Secure Kafka after the job cancel > --- > > Key: FLINK-14012 > URL: https://issues.apache.org/jira/browse/FLINK-14012 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.9.0 > Environment: * Kubernetes 1.13.2 > * Flink 1.9.0 > * Kafka client libary 2.2.0 >Reporter: Daebeom Lee >Priority: Minor > > Hello, this is Daebeom Lee. > h2. Background > I installed Flink 1.9.0 at this our Kubernetes cluster. > We use Flink session cluster. - build fatJar file and upload it at the UI, > run serval jobs. > At first, our jobs are good to start. > But, when we cancel some jobs, the job failed > This is the error code. > {code:java} > // code placeholder > java.lang.NoClassDefFoundError: > org/apache/kafka/common/security/scram/internals/ScramSaslClient > at > org.apache.kafka.common.security.scram.internals.ScramSaslClient$ScramSaslClientFactory.createSaslClient(ScramSaslClient.java:235) > at javax.security.sasl.Sasl.createSaslClient(Sasl.java:384) > at > org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.lambda$createSaslClient$0(SaslClientAuthenticator.java:180) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.createSaslClient(SaslClientAuthenticator.java:176) > at > org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.(SaslClientAuthenticator.java:168) > at > org.apache.kafka.common.network.SaslChannelBuilder.buildClientAuthenticator(SaslChannelBuilder.java:254) > at > 
org.apache.kafka.common.network.SaslChannelBuilder.lambda$buildChannel$1(SaslChannelBuilder.java:202) > at > org.apache.kafka.common.network.KafkaChannel.(KafkaChannel.java:140) > at > org.apache.kafka.common.network.SaslChannelBuilder.buildChannel(SaslChannelBuilder.java:210) > at > org.apache.kafka.common.network.Selector.buildAndAttachKafkaChannel(Selector.java:334) > at > org.apache.kafka.common.network.Selector.registerChannel(Selector.java:325) > at org.apache.kafka.common.network.Selector.connect(Selector.java:257) > at > org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:920) > at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:287) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.trySend(ConsumerNetworkClient.java:474) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:255) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215) > at > org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:292) > at > org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1803) > at > org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1771) > at > org.apache.flink.streaming.connectors.kafka.internal.KafkaPartitionDiscoverer.getAllPartitionsForTopics(KafkaPartitionDiscoverer.java:77) > at > org.apache.flink.streaming.connectors.kafka.internals.AbstractPartitionDiscoverer.discoverPartitions(AbstractPartitionDiscoverer.java:131) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.open(FlinkKafkaConsumerBase.java:508) > at > org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36) > at > 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:529) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:393) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530) > at java.lang.Thread.run(Thread.java:748) > {code} > h2. Our workaround > * I think that this is Flink JVM classloader issue. > * Classloader unloads when job cancels by the way kafka client library is > included fatJar. > * So, I located Kafka client library to /opt/flink/lib > ** /opt/flink/lib/kafka-clients-2.2.0.jar > * And
[jira] [Commented] (FLINK-16099) Translate "HiveCatalog" page of "Hive Integration" into Chinese
[ https://issues.apache.org/jira/browse/FLINK-16099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101527#comment-17101527 ] Andrew.D.lin commented on FLINK-16099: -- ok thanks > Translate "HiveCatalog" page of "Hive Integration" into Chinese > > > Key: FLINK-16099 > URL: https://issues.apache.org/jira/browse/FLINK-16099 > Project: Flink > Issue Type: Sub-task > Components: chinese-translation, Documentation >Reporter: Jark Wu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The page url is > https://ci.apache.org/projects/flink/flink-docs-master/zh/dev/table/hive/hive_catalog.html > The markdown file is located in > {{flink/docs/dev/table/hive/hive_catalog.zh.md}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16099) Translate "HiveCatalog" page of "Hive Integration" into Chinese
[ https://issues.apache.org/jira/browse/FLINK-16099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101333#comment-17101333 ] Andrew.D.lin commented on FLINK-16099: -- Hi [~jark], I am willing to do it! > Translate "HiveCatalog" page of "Hive Integration" into Chinese > > > Key: FLINK-16099 > URL: https://issues.apache.org/jira/browse/FLINK-16099 > Project: Flink > Issue Type: Sub-task > Components: chinese-translation, Documentation >Reporter: Jark Wu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The page url is > https://ci.apache.org/projects/flink/flink-docs-master/zh/dev/table/hive/hive_catalog.html > The markdown file is located in > {{flink/docs/dev/table/hive/hive_catalog.zh.md}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy
[ https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014810#comment-17014810 ] Andrew.D.lin edited comment on FLINK-15307 at 1/14/20 3:01 AM: --- Hi [~zhuzh], I modified it according to your suggestion (the name patterns discussed above), and only modified the strategies under flip1. I hope you can help review it and give me some suggestions. Thank you very much! :D was (Author: andrew_lin): Hi [~zhuzh], I modified it according to your suggestion, and only modified the strategies under flip1. I hope you can help review it and give me some suggestions. Thank you very much! :D > Subclasses of FailoverStrategy are easily confused with implementation > classes of RestartStrategy > - > > Key: FLINK-15307 > URL: https://issues.apache.org/jira/browse/FLINK-15307 > Project: Flink > Issue Type: Improvement > Components: Runtime / Configuration >Reporter: Andrew.D.lin >Assignee: Andrew.D.lin >Priority: Minor > Labels: pull-request-available > Fix For: 1.11.0 > > Attachments: image-2019-12-18-14-59-03-181.png > > Time Spent: 10m > Remaining Estimate: 24h > > Subclasses of RestartStrategy > * FailingRestartStrategy > * FailureRateRestartStrategy > * FixedDelayRestartStrategy > * InfiniteDelayRestartStrategy > Implementation class of FailoverStrategy > * AdaptedRestartPipelinedRegionStrategyNG > * RestartAllStrategy > * RestartIndividualStrategy > * RestartPipelinedRegionStrategy > > FailoverStrategy describes how the job computation recovers from task > failures. > I think the following names may be easier to understand and easier to > distinguish: > Implementation class of FailoverStrategy > * AdaptedPipelinedRegionFailoverStrategyNG > * FailoverAllStrategy > * FailoverIndividualStrategy > * FailoverPipelinedRegionStrategy > FailoverStrategy is currently generated by configuration. If we change the > name of the implementation class, it will not affect compatibility. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy
[ https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014810#comment-17014810 ] Andrew.D.lin edited comment on FLINK-15307 at 1/14/20 2:56 AM: --- Hi [~zhuzh], I modified it according to your suggestion, and only modified the strategies under flip1. I hope you can help review it and give me some suggestions. Thank you very much! :D was (Author: andrew_lin): [~zhuzh] Sorry, I missed it. Let me modify it. > Subclasses of FailoverStrategy are easily confused with implementation > classes of RestartStrategy > - > > Key: FLINK-15307 > URL: https://issues.apache.org/jira/browse/FLINK-15307 > Project: Flink > Issue Type: Improvement > Components: Runtime / Configuration >Reporter: Andrew.D.lin >Assignee: Andrew.D.lin >Priority: Minor > Labels: pull-request-available > Fix For: 1.11.0 > > Attachments: image-2019-12-18-14-59-03-181.png > > Time Spent: 10m > Remaining Estimate: 24h > > Subclasses of RestartStrategy > * FailingRestartStrategy > * FailureRateRestartStrategy > * FixedDelayRestartStrategy > * InfiniteDelayRestartStrategy > Implementation class of FailoverStrategy > * AdaptedRestartPipelinedRegionStrategyNG > * RestartAllStrategy > * RestartIndividualStrategy > * RestartPipelinedRegionStrategy > > FailoverStrategy describes how the job computation recovers from task > failures. > I think the following names may be easier to understand and easier to > distinguish: > Implementation class of FailoverStrategy > * AdaptedPipelinedRegionFailoverStrategyNG > * FailoverAllStrategy > * FailoverIndividualStrategy > * FailoverPipelinedRegionStrategy > FailoverStrategy is currently generated by configuration. If we change the > name of the implementation class, it will not affect compatibility. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy
[ https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014810#comment-17014810 ] Andrew.D.lin commented on FLINK-15307: -- [~zhuzh] sorry, I missed it, Let me modify it > Subclasses of FailoverStrategy are easily confused with implementation > classes of RestartStrategy > - > > Key: FLINK-15307 > URL: https://issues.apache.org/jira/browse/FLINK-15307 > Project: Flink > Issue Type: Improvement > Components: Runtime / Configuration >Reporter: Andrew.D.lin >Assignee: Andrew.D.lin >Priority: Minor > Labels: pull-request-available > Fix For: 1.11.0 > > Attachments: image-2019-12-18-14-59-03-181.png > > Time Spent: 10m > Remaining Estimate: 24h > > Subclasses of RestartStrategy > * FailingRestartStrategy > * FailureRateRestartStrategy > * FixedDelayRestartStrategy > * InfiniteDelayRestartStrategy > Implementation class of FailoverStrategy > * AdaptedRestartPipelinedRegionStrategyNG > * RestartAllStrategy > * RestartIndividualStrategy > * RestartPipelinedRegionStrategy > > FailoverStrategy describes how the job computation recovers from task > failures. > I think the following names may be easier to understand and easier to > distinguish: > Implementation class of FailoverStrategy > * AdaptedPipelinedRegionFailoverStrategyNG > * FailoverAllStrategy > * FailoverIndividualStrategy > * FailoverPipelinedRegionStrategy > FailoverStrategy is currently generated by configuration. If we change the > name of the implementation class, it will not affect compatibility. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy
[ https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014771#comment-17014771 ] Andrew.D.lin commented on FLINK-15307: -- Hi [~zhuzh], I opened the PR: [https://github.com/apache/flink/pull/10848] :D > Subclasses of FailoverStrategy are easily confused with implementation > classes of RestartStrategy > - > > Key: FLINK-15307 > URL: https://issues.apache.org/jira/browse/FLINK-15307 > Project: Flink > Issue Type: Improvement > Components: Runtime / Configuration >Reporter: Andrew.D.lin >Assignee: Andrew.D.lin >Priority: Minor > Labels: pull-request-available > Fix For: 1.11.0 > > Attachments: image-2019-12-18-14-59-03-181.png > > Time Spent: 10m > Remaining Estimate: 24h > > Subclasses of RestartStrategy > * FailingRestartStrategy > * FailureRateRestartStrategy > * FixedDelayRestartStrategy > * InfiniteDelayRestartStrategy > Implementation class of FailoverStrategy > * AdaptedRestartPipelinedRegionStrategyNG > * RestartAllStrategy > * RestartIndividualStrategy > * RestartPipelinedRegionStrategy > > FailoverStrategy describes how the job computation recovers from task > failures. > I think the following names may be easier to understand and easier to > distinguish: > Implementation class of FailoverStrategy > * AdaptedPipelinedRegionFailoverStrategyNG > * FailoverAllStrategy > * FailoverIndividualStrategy > * FailoverPipelinedRegionStrategy > FailoverStrategy is currently generated by configuration. If we change the > name of the implementation class, it will not affect compatibility. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy
[ https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew.D.lin updated FLINK-15307: - External issue URL: https://github.com/apache/flink/pull/10848 Remaining Estimate: 24h (was: 0h) > Subclasses of FailoverStrategy are easily confused with implementation > classes of RestartStrategy > - > > Key: FLINK-15307 > URL: https://issues.apache.org/jira/browse/FLINK-15307 > Project: Flink > Issue Type: Improvement > Components: Runtime / Configuration >Reporter: Andrew.D.lin >Assignee: Andrew.D.lin >Priority: Minor > Labels: pull-request-available > Fix For: 1.11.0 > > Attachments: image-2019-12-18-14-59-03-181.png > > Time Spent: 10m > Remaining Estimate: 24h > > Subclasses of RestartStrategy > * FailingRestartStrategy > * FailureRateRestartStrategy > * FixedDelayRestartStrategy > * InfiniteDelayRestartStrategy > Implementation class of FailoverStrategy > * AdaptedRestartPipelinedRegionStrategyNG > * RestartAllStrategy > * RestartIndividualStrategy > * RestartPipelinedRegionStrategy > > FailoverStrategy describes how the job computation recovers from task > failures. > I think the following names may be easier to understand and easier to > distinguish: > Implementation class of FailoverStrategy > * AdaptedPipelinedRegionFailoverStrategyNG > * FailoverAllStrategy > * FailoverIndividualStrategy > * FailoverPipelinedRegionStrategy > FailoverStrategy is currently generated by configuration. If we change the > name of the implementation class, it will not affect compatibility. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15568) RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics, in 1.8 versions
[ https://issues.apache.org/jira/browse/FLINK-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014738#comment-17014738 ] Andrew.D.lin commented on FLINK-15568: -- Thanks for your detailed reply, [~zhuzh]. Could you mark this issue as "Won't Fix" or close it in this case? > RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics, in 1.8 > versions > -- > > Key: FLINK-15568 > URL: https://issues.apache.org/jira/browse/FLINK-15568 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Affects Versions: 1.8.0, 1.8.1, 1.8.3 >Reporter: Andrew.D.lin >Priority: Minor > Attachments: image-2020-01-13-16-40-47-888.png > > > In 1.8* versions, FailoverRegion.java restart method not restore from latest > checkpoint and marked TODO. > Should we support this feature (region restart) in flink 1.8? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-15568) RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics, in 1.8 versions
[ https://issues.apache.org/jira/browse/FLINK-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew.D.lin updated FLINK-15568: - Summary: RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics, in 1.8 versions (was: RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics) > RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics, in 1.8 > versions > -- > > Key: FLINK-15568 > URL: https://issues.apache.org/jira/browse/FLINK-15568 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Affects Versions: 1.8.0, 1.8.1, 1.8.3 >Reporter: Andrew.D.lin >Priority: Minor > Attachments: image-2020-01-13-16-40-47-888.png > > > In 1.8* versions, FailoverRegion.java restart method not restore from latest > checkpoint and marked TODO. > Should we support this feature (region restart) in flink 1.8? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-15568) RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics
[ https://issues.apache.org/jira/browse/FLINK-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew.D.lin updated FLINK-15568: - Description: In 1.8* versions, FailoverRegion.java restart method not restore from latest checkpoint and marked TODO. Should we support this feature (region restart) in flink 1.8? was: !image-2020-01-13-16-40-47-888.png! In 1.8* versions, FailoverRegion.java restart method not restore from latest checkpoint and marked TODO. Should we support this feature (region restart) in flink 1.8? > RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics > - > > Key: FLINK-15568 > URL: https://issues.apache.org/jira/browse/FLINK-15568 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Affects Versions: 1.8.0, 1.8.1, 1.8.3 >Reporter: Andrew.D.lin >Priority: Minor > Attachments: image-2020-01-13-16-40-47-888.png > > > In 1.8* versions, FailoverRegion.java restart method not restore from latest > checkpoint and marked TODO. > Should we support this feature (region restart) in flink 1.8? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15568) RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics
Andrew.D.lin created FLINK-15568: Summary: RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics Key: FLINK-15568 URL: https://issues.apache.org/jira/browse/FLINK-15568 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Affects Versions: 1.8.3, 1.8.1, 1.8.0 Reporter: Andrew.D.lin Attachments: image-2020-01-13-16-40-47-888.png !image-2020-01-13-16-40-47-888.png! In the 1.8.x versions, the restart method in FailoverRegion.java does not restore from the latest checkpoint and is marked TODO. Should we support this feature (region restart with restore) in Flink 1.8? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy
[ https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000673#comment-17000673 ] andrew.D.lin commented on FLINK-15307: -- Hi [~zhuzh], [~tison], OK, I am willing to do this, following Zhu Zhu's suggestion. Could you assign this ticket to me? > Subclasses of FailoverStrategy are easily confused with implementation > classes of RestartStrategy > - > > Key: FLINK-15307 > URL: https://issues.apache.org/jira/browse/FLINK-15307 > Project: Flink > Issue Type: Improvement > Components: Runtime / Configuration >Reporter: andrew.D.lin >Priority: Minor > Fix For: 1.11.0 > > Attachments: image-2019-12-18-14-59-03-181.png > > > Subclasses of RestartStrategy > * FailingRestartStrategy > * FailureRateRestartStrategy > * FixedDelayRestartStrategy > * InfiniteDelayRestartStrategy > Implementation class of FailoverStrategy > * AdaptedRestartPipelinedRegionStrategyNG > * RestartAllStrategy > * RestartIndividualStrategy > * RestartPipelinedRegionStrategy > > FailoverStrategy describes how the job computation recovers from task > failures. > I think the following names may be easier to understand and easier to > distinguish: > Implementation class of FailoverStrategy > * AdaptedPipelinedRegionFailoverStrategyNG > * FailoverAllStrategy > * FailoverIndividualStrategy > * FailoverPipelinedRegionStrategy > FailoverStrategy is currently generated by configuration. If we change the > name of the implementation class, it will not affect compatibility. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy
[ https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrew.D.lin updated FLINK-15307: - Comment: was deleted (was: [~zhuzh] [~tison] ) > Subclasses of FailoverStrategy are easily confused with implementation > classes of RestartStrategy > - > > Key: FLINK-15307 > URL: https://issues.apache.org/jira/browse/FLINK-15307 > Project: Flink > Issue Type: Improvement > Components: Runtime / Configuration >Reporter: andrew.D.lin >Priority: Minor > Fix For: 1.11.0 > > Attachments: image-2019-12-18-14-59-03-181.png > > > Subclasses of RestartStrategy > * FailingRestartStrategy > * FailureRateRestartStrategy > * FixedDelayRestartStrategy > * InfiniteDelayRestartStrategy > Implementation class of FailoverStrategy > * AdaptedRestartPipelinedRegionStrategyNG > * RestartAllStrategy > * RestartIndividualStrategy > * RestartPipelinedRegionStrategy > > FailoverStrategy describes how the job computation recovers from task > failures. > I think the following names may be easier to understand and easier to > distinguish: > Implementation class of FailoverStrategy > * AdaptedPipelinedRegionFailoverStrategyNG > * FailoverAllStrategy > * FailoverIndividualStrategy > * FailoverPipelinedRegionStrategy > FailoverStrategy is currently generated by configuration. If we change the > name of the implementation class, it will not affect compatibility. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy
[ https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000666#comment-17000666 ] andrew.D.lin commented on FLINK-15307: -- [~zhuzh] [~tison] > Subclasses of FailoverStrategy are easily confused with implementation > classes of RestartStrategy > - > > Key: FLINK-15307 > URL: https://issues.apache.org/jira/browse/FLINK-15307 > Project: Flink > Issue Type: Improvement > Components: Runtime / Configuration >Reporter: andrew.D.lin >Priority: Minor > Fix For: 1.11.0 > > Attachments: image-2019-12-18-14-59-03-181.png > > > Subclasses of RestartStrategy > * FailingRestartStrategy > * FailureRateRestartStrategy > * FixedDelayRestartStrategy > * InfiniteDelayRestartStrategy > Implementation class of FailoverStrategy > * AdaptedRestartPipelinedRegionStrategyNG > * RestartAllStrategy > * RestartIndividualStrategy > * RestartPipelinedRegionStrategy > > FailoverStrategy describes how the job computation recovers from task > failures. > I think the following names may be easier to understand and easier to > distinguish: > Implementation class of FailoverStrategy > * AdaptedPipelinedRegionFailoverStrategyNG > * FailoverAllStrategy > * FailoverIndividualStrategy > * FailoverPipelinedRegionStrategy > FailoverStrategy is currently generated by configuration. If we change the > name of the implementation class, it will not affect compatibility. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy
andrew.D.lin created FLINK-15307: Summary: Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy Key: FLINK-15307 URL: https://issues.apache.org/jira/browse/FLINK-15307 Project: Flink Issue Type: Improvement Components: Runtime / Configuration Affects Versions: 1.9.1, 1.9.0, 1.10.0 Reporter: andrew.D.lin Attachments: image-2019-12-18-14-59-03-181.png Subclasses of RestartStrategy * FailingRestartStrategy * FailureRateRestartStrategy * FixedDelayRestartStrategy * InfiniteDelayRestartStrategy Implementation class of FailoverStrategy * AdaptedRestartPipelinedRegionStrategyNG * RestartAllStrategy * RestartIndividualStrategy * RestartPipelinedRegionStrategy FailoverStrategy describes how the job computation recovers from task failures. I think the following names may be easier to understand and easier to distinguish: Implementation class of FailoverStrategy * AdaptedPipelinedRegionFailoverStrategyNG * FailoverAllStrategy * FailoverIndividualStrategy * FailoverPipelinedRegionStrategy FailoverStrategy is currently generated by configuration. If we change the name of the implementation class, it will not affect compatibility. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed
[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989520#comment-16989520 ] andrew.D.lin commented on FLINK-13856: -- [~klion26] Thanks Congxian, I think you have a better design! Looking forward to using your new features! > Reduce the delete file api when the checkpoint is completed > --- > > Key: FLINK-13856 > URL: https://issues.apache.org/jira/browse/FLINK-13856 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / State Backends >Affects Versions: 1.8.1, 1.9.0 >Reporter: andrew.D.lin >Assignee: andrew.D.lin >Priority: Major > Labels: pull-request-available > Attachments: after.png, before.png, > f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png > > Original Estimate: 48h > Time Spent: 10m > Remaining Estimate: 47h 50m > > When the new checkpoint is completed, an old checkpoint will be deleted by > calling CompletedCheckpoint.discardOnSubsume(). > When deleting old checkpoints, follow these steps: > 1, drop the metadata > 2, discard private state objects > 3, discard location as a whole > In some cases, is it possible to delete the checkpoint folder recursively by > one call? > As far as I know the full amount of checkpoint, it should be possible to > delete the folder directly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed
[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951763#comment-16951763 ] andrew.D.lin commented on FLINK-13856: -- [~sewen] Hi Stephan! We would delete the shared state first, and then drop the exclusive location as a whole. I think this is OK and won't cause any trouble (it was supposed to be deleted anyway). What do you think about it? Furthermore, most file systems, including S3, support recursive deletion. > Reduce the delete file api when the checkpoint is completed > --- > > Key: FLINK-13856 > URL: https://issues.apache.org/jira/browse/FLINK-13856 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / State Backends >Affects Versions: 1.8.1, 1.9.0 >Reporter: andrew.D.lin >Assignee: andrew.D.lin >Priority: Major > Labels: pull-request-available > Attachments: after.png, before.png, > f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png > > Original Estimate: 48h > Time Spent: 10m > Remaining Estimate: 47h 50m > > When the new checkpoint is completed, an old checkpoint will be deleted by > calling CompletedCheckpoint.discardOnSubsume(). > When deleting old checkpoints, follow these steps: > 1, drop the metadata > 2, discard private state objects > 3, discard location as a whole > In some cases, is it possible to delete the checkpoint folder recursively by > one call? > As far as I know the full amount of checkpoint, it should be possible to > delete the folder directly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-13856) Reduce the delete file api when the checkpoint is completed
[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrew.D.lin updated FLINK-13856: - Component/s: Runtime / Checkpointing > Reduce the delete file api when the checkpoint is completed > --- > > Key: FLINK-13856 > URL: https://issues.apache.org/jira/browse/FLINK-13856 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / State Backends >Affects Versions: 1.8.1, 1.9.0 >Reporter: andrew.D.lin >Priority: Major > Attachments: f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png > > Original Estimate: 48h > Remaining Estimate: 48h > > When the new checkpoint is completed, an old checkpoint will be deleted by > calling CompletedCheckpoint.discardOnSubsume(). > When deleting old checkpoints, follow these steps: > 1, drop the metadata > 2, discard private state objects > 3, discard location as a whole > In some cases, is it possible to delete the checkpoint folder recursively by > one call? > As far as I know the full amount of checkpoint, it should be possible to > delete the folder directly. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-13856) Reduce the delete file api when the checkpoint is completed
andrew.D.lin created FLINK-13856: Summary: Reduce the delete file api when the checkpoint is completed Key: FLINK-13856 URL: https://issues.apache.org/jira/browse/FLINK-13856 Project: Flink Issue Type: Improvement Components: Runtime / State Backends Affects Versions: 1.9.0, 1.8.1 Reporter: andrew.D.lin Attachments: f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png When a new checkpoint is completed, an old checkpoint is deleted by calling CompletedCheckpoint.discardOnSubsume(). Deleting an old checkpoint follows these steps: 1) drop the metadata; 2) discard the private state objects; 3) discard the location as a whole. In some cases, would it be possible to delete the checkpoint folder recursively with one call? As far as I know, for a full (non-incremental) checkpoint it should be possible to delete the folder directly. -- This message was sent by Atlassian Jira (v8.3.2#803003)
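The recursive deletion the issue proposes can be sketched with plain java.nio.file; this is a hypothetical illustration of the idea, not Flink's actual CompletedCheckpoint code: a single recursive walk over the checkpoint's exclusive location replaces the per-file delete calls for the metadata and each state object.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CheckpointCleanup {

    // Discard a checkpoint directory with one recursive walk instead of
    // issuing a separate delete request per file (metadata, state, ...).
    public static void discardRecursively(Path checkpointDir) throws IOException {
        try (Stream<Path> paths = Files.walk(checkpointDir)) {
            // Reverse order deletes children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a completed checkpoint's exclusive location.
        Path dir = Files.createTempDirectory("chk-42");
        Files.createFile(dir.resolve("_metadata"));
        Files.createFile(dir.resolve("operator-state-1"));

        discardRecursively(dir);
        System.out.println("deleted: " + !Files.exists(dir));
    }
}
```

On a local file system this still issues one syscall per file, but on object stores and DFSs that expose a recursive/bulk delete, collapsing the per-file calls into one directory delete is where the API-call savings come from, which is the point of the issue.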