[jira] [Updated] (FLINK-13856) Reduce the delete file api when the checkpoint is completed

2021-04-26 Thread Andrew.D.lin (Jira)


 [ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew.D.lin updated FLINK-13856:
-
Fix Version/s: 1.13.1

> Reduce the delete file api when the checkpoint is completed
> ---
>
> Key: FLINK-13856
> URL: https://issues.apache.org/jira/browse/FLINK-13856
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Andrew.D.lin
>Assignee: Andrew.D.lin
>Priority: Major
>  Labels: pull-request-available, stale-assigned
> Fix For: 1.13.1
>
> Attachments: after.png, before.png, 
> f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When a new checkpoint is completed, an old checkpoint is deleted by 
> calling CompletedCheckpoint.discardOnSubsume().
> When deleting old checkpoints, the following steps are taken:
> 1. drop the metadata
> 2. discard private state objects
> 3. discard the location as a whole
> In some cases, is it possible to delete the checkpoint folder recursively 
> with one call?
> As far as I know, for a full checkpoint it should be possible to delete 
> the folder directly.
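The trade-off described above can be sketched as follows. This is a hypothetical, language-agnostic illustration (not Flink's actual FileSystem code): it contrasts issuing one delete call per state object against a single recursive delete of the checkpoint folder.

```python
import shutil
import tempfile
from pathlib import Path

def discard_per_object(checkpoint_dir: Path) -> int:
    """Current behavior (sketch): issue one delete call per state object,
    then remove the now-empty checkpoint directory.  Returns the number of
    delete calls made against the file system."""
    calls = 0
    # delete children before parents (deepest paths first)
    for p in sorted(checkpoint_dir.rglob("*"), reverse=True):
        if p.is_file():
            p.unlink()
        else:
            p.rmdir()
        calls += 1
    checkpoint_dir.rmdir()  # finally drop the location itself
    return calls + 1

def discard_recursively(checkpoint_dir: Path) -> int:
    """Proposed behavior (sketch): drop the whole folder in one call."""
    shutil.rmtree(checkpoint_dir)
    return 1

def make_checkpoint(num_files: int) -> Path:
    """Create a throwaway directory with a few fake state files."""
    d = Path(tempfile.mkdtemp(prefix="chk-"))
    for i in range(num_files):
        (d / f"state-{i}").write_text("...")
    return d
```

For a checkpoint directory holding 10 state files, the per-object path issues 11 delete calls while the recursive path issues one; how much that saves depends on the file system behind the checkpoint storage.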



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed

2021-04-20 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325549#comment-17325549 ]

Andrew.D.lin commented on FLINK-13856:
--

What I said above is wrong. The 
[getSharedState|https://github.com/apache/flink/blob/24031e55e4cf35a5818db2e927e65b290a9b2aed/flink-runtime/src/main/java/org/apache/flink/runtime/state/IncrementalRemoteKeyedStateHandle.java#L123]
 method returns FileStateHandle, not IncrementalRemoteKeyedStateHandle (the 
shared state handle). The method is easy to misunderstand and misuse.

So currently we can only collect IncrementalRemoteKeyedStateHandle (the only 
type of incremental state handle) for separate processing.



[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed

2021-04-20 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325541#comment-17325541 ]

Andrew.D.lin commented on FLINK-13856:
--

[~sewen] Thanks for your careful reply.

Do we plan to do this in the next minor release of 1.13, or in 1.14? Knowing 
that would make it easy for me to set the expected fix version on the issue.

In addition, I noticed that we can get all the shared state through the 
[getSharedState|https://github.com/apache/flink/blob/24031e55e4cf35a5818db2e927e65b290a9b2aed/flink-runtime/src/main/java/org/apache/flink/runtime/state/IncrementalRemoteKeyedStateHandle.java#L123]
 method. This way we can collect all the shared state for separate 
processing (discarding the shared state handles), and then directly and safely 
delete the exclusive folder.
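The separation described here can be sketched as below. `StateHandle` and `split_for_discard` are hypothetical names for illustration, not Flink's actual classes: shared handles must be discarded one by one (later incremental checkpoints may still reference their files), while private handles live in the checkpoint's exclusive folder, which can then be removed with one recursive delete.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class StateHandle:
    """Minimal stand-in for a Flink state handle; 'shared' marks files that
    other incremental checkpoints may still reference."""
    path: str
    shared: bool

def split_for_discard(
    handles: List[StateHandle],
) -> Tuple[List[StateHandle], List[StateHandle]]:
    """Partition handles into (shared, private).  Shared ones are discarded
    individually; private ones are dropped together with a single recursive
    delete of the checkpoint's exclusive folder."""
    shared = [h for h in handles if h.shared]
    private = [h for h in handles if not h.shared]
    return shared, private
```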



[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed

2021-04-19 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324869#comment-17324869 ]

Andrew.D.lin commented on FLINK-13856:
--

[~fanrui] Thanks. Whether this feature is turned on is configurable, and in 
our scenario it has been working properly for 2 years.

[~sewen] If S3 etc. handle a recursive delete worse than deleting the keys 
directly, we can recommend turning this optimization off in the S3 scenario. 
I think this feature is confirmed to bring a good improvement in some 
scenarios. Do you think it is worth continuing?

https://github.com/apache/flink/pull/15667
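The configurable escape hatch discussed above could look roughly like this. The function and scheme list are a sketch, not Flink's actual code: object stores such as S3 have no real directories, so a "recursive" delete is emulated by listing and deleting each key, and the flag lets users fall back to per-file deletion there.

```python
def choose_discard_strategy(scheme: str, recursive_delete_enabled: bool) -> str:
    """Pick how to discard a completed checkpoint.  On object stores
    (e.g. s3/s3a/s3p schemes) a recursive directory delete buys nothing,
    so the optimization can be disabled, by config or by scheme."""
    if recursive_delete_enabled and scheme not in ("s3", "s3a", "s3p"):
        return "recursive"
    return "per-file"
```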



[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2021-03-30 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311547#comment-17311547 ]

Andrew.D.lin edited comment on FLINK-10052 at 3/30/21, 2:21 PM:


Note the change here: [FLINK-18677|https://issues.apache.org/jira/browse/FLINK-18677]

I think we should not add extra logic in 
[handleStateChange|https://github.com/apache/flink/blob/940dfb0deccb31e0ca576b4c044cbf588e0765dd/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalDriver.java#L152]; 
here it only adds logs at the beginning.

Connection state handling should be managed by 
[ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27]
 (which supports the SESSION and STANDARD policies)


was (Author: andrew_lin):
Note change here!FLINK-18677

I think it should not be increased logic here,[handleStateChange|#L152]],Here 
is just to add logs at the beginning.

Connection State processing should be managed by 
[ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27]
 (support session and Standard)

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Zili Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.
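The proposed policy can be sketched as a small decision function. This is a hypothetical illustration in terms of Curator's connection state names, not the actual `LeaderLatch`/`LeaderSelector` code: tolerate SUSPENDED (the session may still recover before it expires) and revoke leadership only once the connection is LOST.

```python
def on_connection_state_change(state: str) -> str:
    """Sketch of the proposed leadership policy on ZooKeeper connection
    state changes: keep leadership through SUSPENDED/RECONNECTED, revoke
    only on LOST (session expired)."""
    if state == "LOST":
        return "revoke-leadership"
    if state in ("SUSPENDED", "RECONNECTED"):
        return "keep-leadership"  # tolerate; Curator may still reconnect
    return "no-op"
```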







[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2021-03-30 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311547#comment-17311547 ]

Andrew.D.lin commented on FLINK-10052:
--

Note the change here: [FLINK-18677|https://issues.apache.org/jira/browse/FLINK-18677]

I think we should not add extra logic in 
[handleStateChange|https://github.com/apache/flink/blob/940dfb0deccb31e0ca576b4c044cbf588e0765dd/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalDriver.java#L152]; 
here it only adds logs at the beginning.

Connection state handling should be managed by 
[ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27]
 (which supports the SESSION and STANDARD policies)



[jira] [Commented] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.

2021-03-18 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304192#comment-17304192 ]

Andrew.D.lin commented on FLINK-21858:
--

Thanks for your reply. (y)  [~chesnay]

Ok, let me pay attention to 
[FLINK-20388|https://issues.apache.org/jira/browse/FLINK-20388]

> TaskMetricGroup taskName is too long, especially in sql tasks.
> --
>
> Key: FLINK-21858
> URL: https://issues.apache.org/jira/browse/FLINK-21858
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.12.0, 1.12.1, 1.12.2
>Reporter: Andrew.D.lin
>Assignee: Andrew.D.lin
>Priority: Major
>
> Currently, operatorName is limited to 80 characters by 
> org.apache.flink.runtime.metrics.groups.TaskMetricGroup#METRICS_OPERATOR_NAME_MAX_LENGTH.
> So I propose to make the maximum metric name length configurable.
>  
> Here is an example:
>  
> "taskName":"GlobalGroupAggregate(groupBy=[dt, src, src1, src2, src3, ct1, 
> ct2], select=[dt, src, src1, src2, src3, ct1, ct2, SUM_RETRACT((sum$0, 
> count$1)) AS sx_pv, SUM_RETRACT((sum$2, count$3)) AS sx_uv, 
> MAX_RETRACT(max$4) AS updt_time, MAX_RETRACT(max$5) AS time_id]) -> 
> Calc(select=[((MD5((dt CONCAT _UTF-16LE'|' CONCAT src CONCAT _UTF-16LE'|' 
> CONCAT src1 CONCAT _UTF-16LE'|' CONCAT src2 CONCAT _UTF-16LE'|' CONCAT src3 
> CONCAT _UTF-16LE'|' CONCAT ct1 CONCAT _UTF-16LE'|' CONCAT ct2 CONCAT 
> _UTF-16LE'|' CONCAT time_id)) SUBSTR 1 SUBSTR 2) CONCAT _UTF-16LE'_' CONCAT 
> (dt CONCAT _UTF-16LE'|' CONCAT src CONCAT _UTF-16LE'|' CONCAT src1 CONCAT 
> _UTF-16LE'|' CONCAT src2 CONCAT _UTF-16LE'|' CONCAT src3 CONCAT _UTF-16LE'|' 
> CONCAT ct1 CONCAT _UTF-16LE'|' CONCAT ct2 CONCAT _UTF-16LE'|' CONCAT 
> time_id)) AS rowkey, sx_pv, sx_uv, updt_time]) -> 
> LocalGroupAggregate(groupBy=[rowkey], select=[rowkey, MAX_RETRACT(sx_pv) AS 
> max$0, MAX_RETRACT(sx_uv) AS max$1, MAX_RETRACT(updt_time) AS max$2, 
> COUNT_RETRACT(*) AS count1$3])"
> "operatorName":"GlobalGroupAggregate(groupBy=[dt, src, src1, src2, src3, ct1, 
> ct2], selec=[dt, s"
>  





[jira] [Commented] (FLINK-20388) Supports users setting operators' metrics name

2021-03-18 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-20388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304188#comment-17304188 ]

Andrew.D.lin commented on FLINK-20388:
--

org.apache.flink.runtime.metrics.groups.TaskMetricGroup#taskName also has this 
problem, especially in SQL tasks.

I hope it can be solved together with

https://issues.apache.org/jira/browse/FLINK-21858

> Supports users setting operators' metrics name
> --
>
> Key: FLINK-20388
> URL: https://issues.apache.org/jira/browse/FLINK-20388
> Project: Flink
>  Issue Type: Improvement
>  Components: API / DataStream, Runtime / Metrics
>Reporter: hailong wang
>Priority: Major
>
> Currently, we only support users setting operator names.
> We use those names in the topology to distinguish operators and, at the same 
> time, as the operator metric names.
> If an operator name is longer than 80 characters, we simply truncate it.
> I think we can allow users to set operator metric names just like operator 
> names. If the user does not set one, fall back to the current behavior.
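The proposed fallback can be sketched as follows. The function name and signature are hypothetical, not Flink's API: use the user-provided metric name when one is set, otherwise truncate the topology operator name to the existing 80-character limit.

```python
MAX_OPERATOR_NAME_LENGTH = 80  # current hard-coded limit in TaskMetricGroup

def operator_metric_name(operator_name: str, user_metric_name=None) -> str:
    """Use the user-provided metric name when set; otherwise fall back to
    the topology operator name truncated to 80 characters, as today."""
    if user_metric_name:
        return user_metric_name
    return operator_name[:MAX_OPERATOR_NAME_LENGTH]
```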





[jira] [Commented] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.

2021-03-18 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304037#comment-17304037 ]

Andrew.D.lin commented on FLINK-21858:
--

Hi [~chesnay],

Is it reasonable to restrict the metric name length by adding a configuration 
option? Or is there a better suggestion? :)



[jira] [Commented] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.

2021-03-18 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304033#comment-17304033 ]

Andrew.D.lin commented on FLINK-21858:
--

OK, thanks for your reply. [~zjwang]



[jira] [Commented] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.

2021-03-18 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304000#comment-17304000 ]

Andrew.D.lin commented on FLINK-21858:
--

I am willing to do it!



[jira] [Created] (FLINK-21858) TaskMetricGroup taskName is too long, especially in sql tasks.

2021-03-18 Thread Andrew.D.lin (Jira)
Andrew.D.lin created FLINK-21858:


 Summary: TaskMetricGroup taskName is too long, especially in sql 
tasks.
 Key: FLINK-21858
 URL: https://issues.apache.org/jira/browse/FLINK-21858
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Metrics
Affects Versions: 1.12.2, 1.12.1, 1.12.0
Reporter: Andrew.D.lin


Currently, operatorName is limited to 80 characters by 
org.apache.flink.runtime.metrics.groups.TaskMetricGroup#METRICS_OPERATOR_NAME_MAX_LENGTH.

So I propose to make the maximum metric name length configurable.

 

Here is an example:

 

"taskName":"GlobalGroupAggregate(groupBy=[dt, src, src1, src2, src3, ct1, ct2], 
select=[dt, src, src1, src2, src3, ct1, ct2, SUM_RETRACT((sum$0, count$1)) AS 
sx_pv, SUM_RETRACT((sum$2, count$3)) AS sx_uv, MAX_RETRACT(max$4) AS updt_time, 
MAX_RETRACT(max$5) AS time_id]) -> Calc(select=[((MD5((dt CONCAT _UTF-16LE'|' 
CONCAT src CONCAT _UTF-16LE'|' CONCAT src1 CONCAT _UTF-16LE'|' CONCAT src2 
CONCAT _UTF-16LE'|' CONCAT src3 CONCAT _UTF-16LE'|' CONCAT ct1 CONCAT 
_UTF-16LE'|' CONCAT ct2 CONCAT _UTF-16LE'|' CONCAT time_id)) SUBSTR 1 SUBSTR 2) 
CONCAT _UTF-16LE'_' CONCAT (dt CONCAT _UTF-16LE'|' CONCAT src CONCAT 
_UTF-16LE'|' CONCAT src1 CONCAT _UTF-16LE'|' CONCAT src2 CONCAT _UTF-16LE'|' 
CONCAT src3 CONCAT _UTF-16LE'|' CONCAT ct1 CONCAT _UTF-16LE'|' CONCAT ct2 
CONCAT _UTF-16LE'|' CONCAT time_id)) AS rowkey, sx_pv, sx_uv, updt_time]) -> 
LocalGroupAggregate(groupBy=[rowkey], select=[rowkey, MAX_RETRACT(sx_pv) AS 
max$0, MAX_RETRACT(sx_uv) AS max$1, MAX_RETRACT(updt_time) AS max$2, 
COUNT_RETRACT(*) AS count1$3])"


"operatorName":"GlobalGroupAggregate(groupBy=[dt, src, src1, src2, src3, ct1, 
ct2], selec=[dt, s"
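The configurable cap proposed above can be sketched like this. The function and default are illustrative (the configuration option itself is hypothetical; today the 80-character limit is hard-coded):

```python
DEFAULT_MAX_METRIC_NAME_LENGTH = 80  # today's hard-coded limit

def truncate_metric_name(
    name: str, max_length: int = DEFAULT_MAX_METRIC_NAME_LENGTH
) -> str:
    """Cap a metric name at a configurable length; names at or under the
    cap pass through unchanged."""
    return name if len(name) <= max_length else name[:max_length]
```

Applied to a long SQL operator name like the one above, the default cap keeps only the first 80 characters, while a larger configured value would preserve more of the name.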
 





[jira] [Commented] (FLINK-20424) The percent of acknowledged checkpoint seems incorrect

2020-12-02 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-20424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242891#comment-17242891 ]

Andrew.D.lin commented on FLINK-20424:
--

I discovered this problem before, and I think it is more appropriate to keep 
the percentage to two decimal places. Can I take it?
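The two-decimal formatting suggested here can be sketched as follows; the function name is hypothetical, not the web frontend's actual code. Keeping two decimal places avoids a not-fully-acknowledged checkpoint being rounded up to 100%.

```python
def ack_percentage(acknowledged: int, total: int) -> str:
    """Format the acknowledged-subtask ratio with two decimal places, so a
    checkpoint with one missing acknowledgement out of 10000 shows 99.99%
    rather than being rounded up to 100%."""
    return f"{100.0 * acknowledged / total:.2f}%"
```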

> The percent of acknowledged checkpoint seems incorrect
> --
>
> Key: FLINK-20424
> URL: https://issues.apache.org/jira/browse/FLINK-20424
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Reporter: zlzhang0122
>Priority: Minor
> Attachments: 2020-11-30 14-18-34 的屏幕截图.png
>
>
> As the picture below shows, the percentage of acknowledged checkpoints seems 
> incorrect. The number should not be 100%, because one of the checkpoint 
> acknowledgements failed.
> !2020-11-30 14-18-34 的屏幕截图.png!





[jira] [Commented] (FLINK-14012) Failed to start job for consuming Secure Kafka after the job cancel

2020-09-21 Thread Andrew.D.lin (Jira)


[ https://issues.apache.org/jira/browse/FLINK-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199296#comment-17199296 ]

Andrew.D.lin commented on FLINK-14012:
--

[~aljoscha] [~Daebeom] Hi, may I ask which PR or commit fixed this? Can I 
cherry-pick it to a lower version?

> Failed to start job for consuming Secure Kafka after the job cancel
> ---
>
> Key: FLINK-14012
> URL: https://issues.apache.org/jira/browse/FLINK-14012
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.9.0
> Environment: * Kubernetes 1.13.2
>  * Flink 1.9.0
>  * Kafka client libary 2.2.0
>Reporter: Daebeom Lee
>Priority: Minor
>
> Hello, this is Daebeom Lee.
> h2. Background
> I installed Flink 1.9.0 on our Kubernetes cluster.
> We use a Flink session cluster: we build a fat JAR, upload it in the UI, and 
> run several jobs.
> At first, our jobs start fine.
> But after we cancel some jobs, starting them again fails.
> This is the error:
> {code:java}
> // code placeholder
> java.lang.NoClassDefFoundError: 
> org/apache/kafka/common/security/scram/internals/ScramSaslClient
> at 
> org.apache.kafka.common.security.scram.internals.ScramSaslClient$ScramSaslClientFactory.createSaslClient(ScramSaslClient.java:235)
> at javax.security.sasl.Sasl.createSaslClient(Sasl.java:384)
> at 
> org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.lambda$createSaslClient$0(SaslClientAuthenticator.java:180)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.createSaslClient(SaslClientAuthenticator.java:176)
> at 
> org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.(SaslClientAuthenticator.java:168)
> at 
> org.apache.kafka.common.network.SaslChannelBuilder.buildClientAuthenticator(SaslChannelBuilder.java:254)
> at 
> org.apache.kafka.common.network.SaslChannelBuilder.lambda$buildChannel$1(SaslChannelBuilder.java:202)
> at 
> org.apache.kafka.common.network.KafkaChannel.(KafkaChannel.java:140)
> at 
> org.apache.kafka.common.network.SaslChannelBuilder.buildChannel(SaslChannelBuilder.java:210)
> at 
> org.apache.kafka.common.network.Selector.buildAndAttachKafkaChannel(Selector.java:334)
> at 
> org.apache.kafka.common.network.Selector.registerChannel(Selector.java:325)
> at org.apache.kafka.common.network.Selector.connect(Selector.java:257)
> at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:920)
> at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:287)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.trySend(ConsumerNetworkClient.java:474)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:255)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
> at 
> org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:292)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1803)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1771)
> at 
> org.apache.flink.streaming.connectors.kafka.internal.KafkaPartitionDiscoverer.getAllPartitionsForTopics(KafkaPartitionDiscoverer.java:77)
> at 
> org.apache.flink.streaming.connectors.kafka.internals.AbstractPartitionDiscoverer.discoverPartitions(AbstractPartitionDiscoverer.java:131)
> at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.open(FlinkKafkaConsumerBase.java:508)
> at 
> org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
> at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:529)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:393)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> h2. Our workaround
>  * I think that this is a Flink JVM classloader issue.
>  * The classloader unloads when the job is canceled, because the Kafka client 
> library is included in the fat JAR.
>  * So I placed the Kafka client library in /opt/flink/lib 
>  ** /opt/flink/lib/kafka-clients-2.2.0.jar
>  * And 

[jira] [Commented] (FLINK-16099) Translate "HiveCatalog" page of "Hive Integration" into Chinese

2020-05-07 Thread Andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101527#comment-17101527
 ] 

Andrew.D.lin commented on FLINK-16099:
--

ok thanks

> Translate "HiveCatalog" page of "Hive Integration" into Chinese 
> 
>
> Key: FLINK-16099
> URL: https://issues.apache.org/jira/browse/FLINK-16099
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Jark Wu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The page url is 
> https://ci.apache.org/projects/flink/flink-docs-master/zh/dev/table/hive/hive_catalog.html
> The markdown file is located in 
> {{flink/docs/dev/table/hive/hive_catalog.zh.md}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16099) Translate "HiveCatalog" page of "Hive Integration" into Chinese

2020-05-06 Thread Andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101333#comment-17101333
 ] 

Andrew.D.lin commented on FLINK-16099:
--

Hi [~jark], I am willing to do it!

> Translate "HiveCatalog" page of "Hive Integration" into Chinese 
> 
>
> Key: FLINK-16099
> URL: https://issues.apache.org/jira/browse/FLINK-16099
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Jark Wu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The page url is 
> https://ci.apache.org/projects/flink/flink-docs-master/zh/dev/table/hive/hive_catalog.html
> The markdown file is located in 
> {{flink/docs/dev/table/hive/hive_catalog.zh.md}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy

2020-01-13 Thread Andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014810#comment-17014810
 ] 

Andrew.D.lin edited comment on FLINK-15307 at 1/14/20 3:01 AM:
---

Hi [~zhuzh], I modified it according to your suggestion (the name patterns 
discussed above), and only modified the strategies under flip1.

Hope you can help review and give me some suggestions. Thank you very much! :D


was (Author: andrew_lin):
Hi [~zhuzh], I modified it according to your suggestion, and only modified the 
strategies under flip1.

Hope you can help review and give me some suggestions. Thank you very much! :D

> Subclasses of FailoverStrategy are easily confused with implementation 
> classes of RestartStrategy
> -
>
> Key: FLINK-15307
> URL: https://issues.apache.org/jira/browse/FLINK-15307
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: Andrew.D.lin
>Assignee: Andrew.D.lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2019-12-18-14-59-03-181.png
>
>  Time Spent: 10m
>  Remaining Estimate: 24h
>
> Subclasses of RestartStrategy
>  * FailingRestartStrategy
>  * FailureRateRestartStrategy
>  * FixedDelayRestartStrategy
>  * InfiniteDelayRestartStrategy
> Implementation class of FailoverStrategy
>  * AdaptedRestartPipelinedRegionStrategyNG
>  * RestartAllStrategy
>  * RestartIndividualStrategy
>  * RestartPipelinedRegionStrategy
>  
> FailoverStrategy describes how the job computation recovers from task 
> failures.
> I think the following names may be easier to understand and easier to 
> distinguish:
> Implementation class of FailoverStrategy
>  * AdaptedPipelinedRegionFailoverStrategyNG
>  * FailoverAllStrategy
>  * FailoverIndividualStrategy
>  * FailoverPipelinedRegionStrategy
> FailoverStrategy is currently generated by configuration. If we change the 
> name of the implementation class, it will not affect compatibility.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy

2020-01-13 Thread Andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014810#comment-17014810
 ] 

Andrew.D.lin edited comment on FLINK-15307 at 1/14/20 2:56 AM:
---

Hi [~zhuzh], I modified it according to your suggestion, and only modified the 
strategies under flip1.

Hope you can help review and give me some suggestions. Thank you very much! :D


was (Author: andrew_lin):
[~zhuzh] Sorry, I missed it. Let me modify it.

> Subclasses of FailoverStrategy are easily confused with implementation 
> classes of RestartStrategy
> -
>
> Key: FLINK-15307
> URL: https://issues.apache.org/jira/browse/FLINK-15307
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: Andrew.D.lin
>Assignee: Andrew.D.lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2019-12-18-14-59-03-181.png
>
>  Time Spent: 10m
>  Remaining Estimate: 24h
>
> Subclasses of RestartStrategy
>  * FailingRestartStrategy
>  * FailureRateRestartStrategy
>  * FixedDelayRestartStrategy
>  * InfiniteDelayRestartStrategy
> Implementation class of FailoverStrategy
>  * AdaptedRestartPipelinedRegionStrategyNG
>  * RestartAllStrategy
>  * RestartIndividualStrategy
>  * RestartPipelinedRegionStrategy
>  
> FailoverStrategy describes how the job computation recovers from task 
> failures.
> I think the following names may be easier to understand and easier to 
> distinguish:
> Implementation class of FailoverStrategy
>  * AdaptedPipelinedRegionFailoverStrategyNG
>  * FailoverAllStrategy
>  * FailoverIndividualStrategy
>  * FailoverPipelinedRegionStrategy
> FailoverStrategy is currently generated by configuration. If we change the 
> name of the implementation class, it will not affect compatibility.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy

2020-01-13 Thread Andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014810#comment-17014810
 ] 

Andrew.D.lin commented on FLINK-15307:
--

[~zhuzh] Sorry, I missed it. Let me modify it.

> Subclasses of FailoverStrategy are easily confused with implementation 
> classes of RestartStrategy
> -
>
> Key: FLINK-15307
> URL: https://issues.apache.org/jira/browse/FLINK-15307
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: Andrew.D.lin
>Assignee: Andrew.D.lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2019-12-18-14-59-03-181.png
>
>  Time Spent: 10m
>  Remaining Estimate: 24h
>
> Subclasses of RestartStrategy
>  * FailingRestartStrategy
>  * FailureRateRestartStrategy
>  * FixedDelayRestartStrategy
>  * InfiniteDelayRestartStrategy
> Implementation class of FailoverStrategy
>  * AdaptedRestartPipelinedRegionStrategyNG
>  * RestartAllStrategy
>  * RestartIndividualStrategy
>  * RestartPipelinedRegionStrategy
>  
> FailoverStrategy describes how the job computation recovers from task 
> failures.
> I think the following names may be easier to understand and easier to 
> distinguish:
> Implementation class of FailoverStrategy
>  * AdaptedPipelinedRegionFailoverStrategyNG
>  * FailoverAllStrategy
>  * FailoverIndividualStrategy
>  * FailoverPipelinedRegionStrategy
> FailoverStrategy is currently generated by configuration. If we change the 
> name of the implementation class, it will not affect compatibility.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy

2020-01-13 Thread Andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014771#comment-17014771
 ] 

Andrew.D.lin commented on FLINK-15307:
--

Hi [~zhuzh] 

I opened the PR: [https://github.com/apache/flink/pull/10848] :D

 

> Subclasses of FailoverStrategy are easily confused with implementation 
> classes of RestartStrategy
> -
>
> Key: FLINK-15307
> URL: https://issues.apache.org/jira/browse/FLINK-15307
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: Andrew.D.lin
>Assignee: Andrew.D.lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2019-12-18-14-59-03-181.png
>
>  Time Spent: 10m
>  Remaining Estimate: 24h
>
> Subclasses of RestartStrategy
>  * FailingRestartStrategy
>  * FailureRateRestartStrategy
>  * FixedDelayRestartStrategy
>  * InfiniteDelayRestartStrategy
> Implementation class of FailoverStrategy
>  * AdaptedRestartPipelinedRegionStrategyNG
>  * RestartAllStrategy
>  * RestartIndividualStrategy
>  * RestartPipelinedRegionStrategy
>  
> FailoverStrategy describes how the job computation recovers from task 
> failures.
> I think the following names may be easier to understand and easier to 
> distinguish:
> Implementation class of FailoverStrategy
>  * AdaptedPipelinedRegionFailoverStrategyNG
>  * FailoverAllStrategy
>  * FailoverIndividualStrategy
>  * FailoverPipelinedRegionStrategy
> FailoverStrategy is currently generated by configuration. If we change the 
> name of the implementation class, it will not affect compatibility.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy

2020-01-13 Thread Andrew.D.lin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew.D.lin updated FLINK-15307:
-
External issue URL: https://github.com/apache/flink/pull/10848
Remaining Estimate: 24h  (was: 0h)

> Subclasses of FailoverStrategy are easily confused with implementation 
> classes of RestartStrategy
> -
>
> Key: FLINK-15307
> URL: https://issues.apache.org/jira/browse/FLINK-15307
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: Andrew.D.lin
>Assignee: Andrew.D.lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2019-12-18-14-59-03-181.png
>
>  Time Spent: 10m
>  Remaining Estimate: 24h
>
> Subclasses of RestartStrategy
>  * FailingRestartStrategy
>  * FailureRateRestartStrategy
>  * FixedDelayRestartStrategy
>  * InfiniteDelayRestartStrategy
> Implementation class of FailoverStrategy
>  * AdaptedRestartPipelinedRegionStrategyNG
>  * RestartAllStrategy
>  * RestartIndividualStrategy
>  * RestartPipelinedRegionStrategy
>  
> FailoverStrategy describes how the job computation recovers from task 
> failures.
> I think the following names may be easier to understand and easier to 
> distinguish:
> Implementation class of FailoverStrategy
>  * AdaptedPipelinedRegionFailoverStrategyNG
>  * FailoverAllStrategy
>  * FailoverIndividualStrategy
>  * FailoverPipelinedRegionStrategy
> FailoverStrategy is currently generated by configuration. If we change the 
> name of the implementation class, it will not affect compatibility.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15568) RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics, in 1.8 versions

2020-01-13 Thread Andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014738#comment-17014738
 ] 

Andrew.D.lin commented on FLINK-15568:
--

Thanks for your detailed reply, [~zhuzh]. 

Can you help me mark this issue as "Can't fix" or close it in this case?

 

> RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics, in 1.8 
> versions
> --
>
> Key: FLINK-15568
> URL: https://issues.apache.org/jira/browse/FLINK-15568
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.8.0, 1.8.1, 1.8.3
>Reporter: Andrew.D.lin
>Priority: Minor
> Attachments: image-2020-01-13-16-40-47-888.png
>
>
> In 1.8.x versions, the FailoverRegion.java restart method does not restore 
> from the latest checkpoint and is marked TODO.
> Should we support this feature (region restart) in Flink 1.8?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15568) RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics, in 1.8 versions

2020-01-13 Thread Andrew.D.lin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew.D.lin updated FLINK-15568:
-
Summary: RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE 
semantics, in 1.8 versions  (was: RestartPipelinedRegionStrategy: not ensure 
the EXACTLY_ONCE semantics)

> RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics, in 1.8 
> versions
> --
>
> Key: FLINK-15568
> URL: https://issues.apache.org/jira/browse/FLINK-15568
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.8.0, 1.8.1, 1.8.3
>Reporter: Andrew.D.lin
>Priority: Minor
> Attachments: image-2020-01-13-16-40-47-888.png
>
>
> In 1.8.x versions, the FailoverRegion.java restart method does not restore 
> from the latest checkpoint and is marked TODO.
> Should we support this feature (region restart) in Flink 1.8?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15568) RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics

2020-01-13 Thread Andrew.D.lin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew.D.lin updated FLINK-15568:
-
Description: 
In 1.8.x versions, the FailoverRegion.java restart method does not restore from 
the latest checkpoint and is marked TODO.

Should we support this feature (region restart) in Flink 1.8?

  was:
!image-2020-01-13-16-40-47-888.png!

 

In 1.8.x versions, the FailoverRegion.java restart method does not restore from 
the latest checkpoint and is marked TODO.

Should we support this feature (region restart) in Flink 1.8?


> RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics
> -
>
> Key: FLINK-15568
> URL: https://issues.apache.org/jira/browse/FLINK-15568
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.8.0, 1.8.1, 1.8.3
>Reporter: Andrew.D.lin
>Priority: Minor
> Attachments: image-2020-01-13-16-40-47-888.png
>
>
> In 1.8.x versions, the FailoverRegion.java restart method does not restore 
> from the latest checkpoint and is marked TODO.
> Should we support this feature (region restart) in Flink 1.8?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15568) RestartPipelinedRegionStrategy: not ensure the EXACTLY_ONCE semantics

2020-01-13 Thread Andrew.D.lin (Jira)
Andrew.D.lin created FLINK-15568:


 Summary: RestartPipelinedRegionStrategy: not ensure the 
EXACTLY_ONCE semantics
 Key: FLINK-15568
 URL: https://issues.apache.org/jira/browse/FLINK-15568
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.8.3, 1.8.1, 1.8.0
Reporter: Andrew.D.lin
 Attachments: image-2020-01-13-16-40-47-888.png

!image-2020-01-13-16-40-47-888.png!

 

In 1.8.x versions, the FailoverRegion.java restart method does not restore from 
the latest checkpoint and is marked TODO.

Should we support this feature (region restart) in Flink 1.8?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy

2019-12-19 Thread andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000673#comment-17000673
 ] 

andrew.D.lin commented on FLINK-15307:
--

Hi [~zhuzh], [~tison] 

OK, I am willing to do this, following Zhuzhu's suggestion.

Could you assign this ticket to me?

> Subclasses of FailoverStrategy are easily confused with implementation 
> classes of RestartStrategy
> -
>
> Key: FLINK-15307
> URL: https://issues.apache.org/jira/browse/FLINK-15307
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: andrew.D.lin
>Priority: Minor
> Fix For: 1.11.0
>
> Attachments: image-2019-12-18-14-59-03-181.png
>
>
> Subclasses of RestartStrategy
>  * FailingRestartStrategy
>  * FailureRateRestartStrategy
>  * FixedDelayRestartStrategy
>  * InfiniteDelayRestartStrategy
> Implementation class of FailoverStrategy
>  * AdaptedRestartPipelinedRegionStrategyNG
>  * RestartAllStrategy
>  * RestartIndividualStrategy
>  * RestartPipelinedRegionStrategy
>  
> FailoverStrategy describes how the job computation recovers from task 
> failures.
> I think the following names may be easier to understand and easier to 
> distinguish:
> Implementation class of FailoverStrategy
>  * AdaptedPipelinedRegionFailoverStrategyNG
>  * FailoverAllStrategy
>  * FailoverIndividualStrategy
>  * FailoverPipelinedRegionStrategy
> FailoverStrategy is currently generated by configuration. If we change the 
> name of the implementation class, it will not affect compatibility.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy

2019-12-19 Thread andrew.D.lin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrew.D.lin updated FLINK-15307:
-
Comment: was deleted

(was: [~zhuzh]  [~tison]  )

> Subclasses of FailoverStrategy are easily confused with implementation 
> classes of RestartStrategy
> -
>
> Key: FLINK-15307
> URL: https://issues.apache.org/jira/browse/FLINK-15307
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: andrew.D.lin
>Priority: Minor
> Fix For: 1.11.0
>
> Attachments: image-2019-12-18-14-59-03-181.png
>
>
> Subclasses of RestartStrategy
>  * FailingRestartStrategy
>  * FailureRateRestartStrategy
>  * FixedDelayRestartStrategy
>  * InfiniteDelayRestartStrategy
> Implementation class of FailoverStrategy
>  * AdaptedRestartPipelinedRegionStrategyNG
>  * RestartAllStrategy
>  * RestartIndividualStrategy
>  * RestartPipelinedRegionStrategy
>  
> FailoverStrategy describes how the job computation recovers from task 
> failures.
> I think the following names may be easier to understand and easier to 
> distinguish:
> Implementation class of FailoverStrategy
>  * AdaptedPipelinedRegionFailoverStrategyNG
>  * FailoverAllStrategy
>  * FailoverIndividualStrategy
>  * FailoverPipelinedRegionStrategy
> FailoverStrategy is currently generated by configuration. If we change the 
> name of the implementation class, it will not affect compatibility.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy

2019-12-19 Thread andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000666#comment-17000666
 ] 

andrew.D.lin commented on FLINK-15307:
--

[~zhuzh]  [~tison]  

> Subclasses of FailoverStrategy are easily confused with implementation 
> classes of RestartStrategy
> -
>
> Key: FLINK-15307
> URL: https://issues.apache.org/jira/browse/FLINK-15307
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: andrew.D.lin
>Priority: Minor
> Fix For: 1.11.0
>
> Attachments: image-2019-12-18-14-59-03-181.png
>
>
> Subclasses of RestartStrategy
>  * FailingRestartStrategy
>  * FailureRateRestartStrategy
>  * FixedDelayRestartStrategy
>  * InfiniteDelayRestartStrategy
> Implementation class of FailoverStrategy
>  * AdaptedRestartPipelinedRegionStrategyNG
>  * RestartAllStrategy
>  * RestartIndividualStrategy
>  * RestartPipelinedRegionStrategy
>  
> FailoverStrategy describes how the job computation recovers from task 
> failures.
> I think the following names may be easier to understand and easier to 
> distinguish:
> Implementation class of FailoverStrategy
>  * AdaptedPipelinedRegionFailoverStrategyNG
>  * FailoverAllStrategy
>  * FailoverIndividualStrategy
>  * FailoverPipelinedRegionStrategy
> FailoverStrategy is currently generated by configuration. If we change the 
> name of the implementation class, it will not affect compatibility.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15307) Subclasses of FailoverStrategy are easily confused with implementation classes of RestartStrategy

2019-12-17 Thread andrew.D.lin (Jira)
andrew.D.lin created FLINK-15307:


 Summary: Subclasses of FailoverStrategy are easily confused with 
implementation classes of RestartStrategy
 Key: FLINK-15307
 URL: https://issues.apache.org/jira/browse/FLINK-15307
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Affects Versions: 1.9.1, 1.9.0, 1.10.0
Reporter: andrew.D.lin
 Attachments: image-2019-12-18-14-59-03-181.png

Subclasses of RestartStrategy
 * FailingRestartStrategy
 * FailureRateRestartStrategy
 * FixedDelayRestartStrategy
 * InfiniteDelayRestartStrategy

Implementation class of FailoverStrategy
 * AdaptedRestartPipelinedRegionStrategyNG
 * RestartAllStrategy
 * RestartIndividualStrategy
 * RestartPipelinedRegionStrategy

 

FailoverStrategy describes how the job computation recovers from task failures.

I think the following names may be easier to understand and easier to 
distinguish:

Implementation class of FailoverStrategy
 * AdaptedPipelinedRegionFailoverStrategyNG
 * FailoverAllStrategy
 * FailoverIndividualStrategy
 * FailoverPipelinedRegionStrategy

FailoverStrategy is currently generated by configuration. If we change the name 
of the implementation class, it will not affect compatibility.
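
The compatibility claim above can be illustrated with a minimal sketch. All class
names and config keys below are invented for illustration, not Flink's real ones:
when strategies are looked up through a config-keyed factory, users configure a
stable string key, so renaming the implementation classes does not break existing
configuration.

```java
import java.util.Map;
import java.util.function.Supplier;

/**
 * Illustrative sketch only -- shows why renaming strategy implementation
 * classes does not affect compatibility: the configuration surface is the
 * string key, and only this factory maps keys to classes.
 */
public class FailoverStrategyFactorySketch {

    interface FailoverStrategy {
        String describe();
    }

    /** Stands in for RestartAllStrategy after a hypothetical rename. */
    static class FailoverAllStrategy implements FailoverStrategy {
        public String describe() { return "restart all tasks"; }
    }

    /** Stands in for RestartPipelinedRegionStrategy after a rename. */
    static class FailoverPipelinedRegionStrategy implements FailoverStrategy {
        public String describe() { return "restart the failed region"; }
    }

    // The keys ("full", "region") stay stable; the class names on the
    // right can change freely without touching any user configuration.
    private static final Map<String, Supplier<FailoverStrategy>> REGISTRY = Map.of(
            "full", FailoverAllStrategy::new,
            "region", FailoverPipelinedRegionStrategy::new);

    public static FailoverStrategy forConfig(String key) {
        Supplier<FailoverStrategy> supplier = REGISTRY.get(key);
        if (supplier == null) {
            throw new IllegalArgumentException("Unknown failover strategy: " + key);
        }
        return supplier.get();
    }
}
```

Under this pattern, a rename is purely internal: only the right-hand side of the
registry changes.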

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed

2019-12-06 Thread andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989520#comment-16989520
 ] 

andrew.D.lin commented on FLINK-13856:
--

[~klion26] Thanks, Congxian. I think you have a better design!

Looking forward to using your new features! 

> Reduce the delete file api when the checkpoint is completed
> ---
>
> Key: FLINK-13856
> URL: https://issues.apache.org/jira/browse/FLINK-13856
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.8.1, 1.9.0
>Reporter: andrew.D.lin
>Assignee: andrew.D.lin
>Priority: Major
>  Labels: pull-request-available
> Attachments: after.png, before.png, 
> f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When the new checkpoint is completed, an old checkpoint will be deleted by 
> calling CompletedCheckpoint.discardOnSubsume().
> When deleting old checkpoints, follow these steps:
> 1, drop the metadata
> 2, discard private state objects
> 3, discard location as a whole
> In some cases, is it possible to delete the checkpoint folder recursively by 
> one call?
> As far as I know, for a full checkpoint it should be possible to 
> delete the folder directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed

2019-10-15 Thread andrew.D.lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951763#comment-16951763
 ] 

andrew.D.lin commented on FLINK-13856:
--

[~sewen] Hi Stephan!

Deleting the shared state first, and then dropping the exclusive location as a 
whole, I think is OK and won't cause any trouble (it was supposed to be 
deleted). Can you talk about it?

Furthermore, most file systems support recursive deletion, including S3.

> Reduce the delete file api when the checkpoint is completed
> ---
>
> Key: FLINK-13856
> URL: https://issues.apache.org/jira/browse/FLINK-13856
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.8.1, 1.9.0
>Reporter: andrew.D.lin
>Assignee: andrew.D.lin
>Priority: Major
>  Labels: pull-request-available
> Attachments: after.png, before.png, 
> f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When the new checkpoint is completed, an old checkpoint will be deleted by 
> calling CompletedCheckpoint.discardOnSubsume().
> When deleting old checkpoints, follow these steps:
> 1, drop the metadata
> 2, discard private state objects
> 3, discard location as a whole
> In some cases, is it possible to delete the checkpoint folder recursively by 
> one call?
> As far as I know, for a full checkpoint it should be possible to 
> delete the folder directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-13856) Reduce the delete file api when the checkpoint is completed

2019-08-26 Thread andrew.D.lin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrew.D.lin updated FLINK-13856:
-
Component/s: Runtime / Checkpointing

> Reduce the delete file api when the checkpoint is completed
> ---
>
> Key: FLINK-13856
> URL: https://issues.apache.org/jira/browse/FLINK-13856
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.8.1, 1.9.0
>Reporter: andrew.D.lin
>Priority: Major
> Attachments: f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When the new checkpoint is completed, an old checkpoint will be deleted by 
> calling CompletedCheckpoint.discardOnSubsume().
> When deleting old checkpoints, follow these steps:
> 1, drop the metadata
> 2, discard private state objects
> 3, discard location as a whole
> In some cases, is it possible to delete the checkpoint folder recursively by 
> one call?
> As far as I know, for a full checkpoint it should be possible to 
> delete the folder directly.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13856) Reduce the delete file api when the checkpoint is completed

2019-08-26 Thread andrew.D.lin (Jira)
andrew.D.lin created FLINK-13856:


 Summary: Reduce the delete file api when the checkpoint is 
completed
 Key: FLINK-13856
 URL: https://issues.apache.org/jira/browse/FLINK-13856
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Affects Versions: 1.9.0, 1.8.1
Reporter: andrew.D.lin
 Attachments: f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png

When the new checkpoint is completed, an old checkpoint will be deleted by 
calling CompletedCheckpoint.discardOnSubsume().

When deleting old checkpoints, follow these steps:
1. Drop the metadata
2. Discard private state objects
3. Discard the location as a whole

In some cases, is it possible to delete the checkpoint folder recursively by 
one call?

As far as I know, for a full checkpoint it should be possible to delete 
the folder directly.
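
The proposal above can be sketched as follows. This is illustrative only, not
Flink's actual CompletedCheckpoint code: instead of issuing a separate delete
call for the metadata file and each private state file, remove the whole
checkpoint directory with one recursive delete. Here java.nio.file stands in for
Flink's FileSystem.delete(path, true), and the method name is invented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/**
 * Hypothetical sketch of discarding a completed checkpoint's directory
 * with one recursive call instead of many per-file delete calls.
 */
public class CheckpointDiscardSketch {

    /** Recursively delete the checkpoint directory, files before parents. */
    public static void discardRecursively(Path checkpointDir) throws IOException {
        if (!Files.exists(checkpointDir)) {
            return; // already gone; nothing to clean up
        }
        try (Stream<Path> paths = Files.walk(checkpointDir)) {
            // reverse depth order so children are deleted before their directories
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

On a real distributed file system the cost saving comes from replacing N delete
RPCs with a single recursive one, which most file systems (including S3-backed
ones) support.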



--
This message was sent by Atlassian Jira
(v8.3.2#803003)