[jira] [Commented] (FLINK-33187) Don't record duplicate event if no change

2023-10-19 Thread Clara Xiong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1327#comment-1327
 ] 

Clara Xiong commented on FLINK-33187:
-

Anyone have opinions on the default interval? Is 30 min a reasonable duration?

> Don't record duplicate event if no change
> -
>
> Key: FLINK-33187
> URL: https://issues.apache.org/jira/browse/FLINK-33187
> Project: Flink
>  Issue Type: Improvement
>  Components: Autoscaler
>Affects Versions: 1.17.1
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
>  Labels: pull-request-available
> Fix For: kubernetes-operator-1.7.0
>
>
> Problem:
> Some events, such as ScalingReport, are recorded repeatedly when autoscaling is 
> not enabled; these make up 99% of all events in our prod env. This wastes 
> resources and hurts downstream performance.
> Proposal:
> Suppress duplicate events within an interval defined by a new operator config 
> "scaling.report.interval", in seconds, defaulting to 1800.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33187) Don't record duplicate event if no change

2023-10-05 Thread Clara Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong updated FLINK-33187:

Description: 
Problem:
Some events, such as ScalingReport, are recorded repeatedly when autoscaling is 
not enabled; these make up 99% of all events in our prod env. This wastes 
resources and hurts downstream performance.

Proposal:
Suppress duplicate events within an interval defined by a new operator config 
"scaling.report.interval", in seconds, defaulting to 1800.

  was:
Problem:
Some events, such as ScalingReport, are sent to Kafka repeatedly when autoscaling 
is not enabled; these make up 99% of all Kafka events in our prod env. This 
wastes Kafka resources and hurts downstream performance.

Proposal:
Suppress duplicate events within an interval defined by a new operator config 
"scaling.report.interval", in seconds, defaulting to 1800.


> Don't record duplicate event if no change
> -
>
> Key: FLINK-33187
> URL: https://issues.apache.org/jira/browse/FLINK-33187
> Project: Flink
>  Issue Type: Improvement
>  Components: Autoscaler
>Affects Versions: 1.17.1
>Reporter: Clara Xiong
>Priority: Major
>
> Problem:
> Some events, such as ScalingReport, are recorded repeatedly when autoscaling is 
> not enabled; these make up 99% of all events in our prod env. This wastes 
> resources and hurts downstream performance.
> Proposal:
> Suppress duplicate events within an interval defined by a new operator config 
> "scaling.report.interval", in seconds, defaulting to 1800.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33187) Don't record duplicate event if no change

2023-10-05 Thread Clara Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong updated FLINK-33187:

Summary: Don't record duplicate event if no change  (was: Don't send 
duplicate event to Kafka if no change)

> Don't record duplicate event if no change
> -
>
> Key: FLINK-33187
> URL: https://issues.apache.org/jira/browse/FLINK-33187
> Project: Flink
>  Issue Type: Improvement
>  Components: Autoscaler
>Affects Versions: 1.17.1
>Reporter: Clara Xiong
>Priority: Major
>
> Problem:
> Some events, such as ScalingReport, are sent to Kafka repeatedly when 
> autoscaling is not enabled; these make up 99% of all Kafka events in our 
> prod env. This wastes Kafka resources and hurts downstream performance.
> Proposal:
> Suppress duplicate events within an interval defined by a new operator config 
> "scaling.report.interval", in seconds, defaulting to 1800.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33187) Don't send duplicate event to Kafka if no change

2023-10-04 Thread Clara Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong updated FLINK-33187:

Description: 
Problem:
Some events, such as ScalingReport, are sent to Kafka repeatedly when autoscaling 
is not enabled; these make up 99% of all Kafka events in our prod env. This 
wastes Kafka resources and hurts downstream performance.

Proposal:
Suppress duplicate events within an interval defined by a new operator dynamic 
config "scaling.report.interval", in seconds, defaulting to 1800.

  was:
Problem:
Some events, such as ScalingReport, are sent to Kafka repeatedly when autoscaling 
is not enabled; these make up 99% of all Kafka events in our prod env. This 
wastes Kafka resources and hurts downstream performance.

Proposal:
Suppress duplicate events within an interval defined by a new operator dynamic 
config "suppress-event.interval", in seconds, defaulting to 0.


> Don't send duplicate event to Kafka if no change
> 
>
> Key: FLINK-33187
> URL: https://issues.apache.org/jira/browse/FLINK-33187
> Project: Flink
>  Issue Type: Improvement
>  Components: Autoscaler
>Affects Versions: 1.17.1
>Reporter: Clara Xiong
>Priority: Major
>
> Problem:
> Some events, such as ScalingReport, are sent to Kafka repeatedly when 
> autoscaling is not enabled; these make up 99% of all Kafka events in our 
> prod env. This wastes Kafka resources and hurts downstream performance.
> Proposal:
> Suppress duplicate events within an interval defined by a new operator dynamic 
> config "scaling.report.interval", in seconds, defaulting to 1800.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33187) Don't send duplicate event to Kafka if no change

2023-10-04 Thread Clara Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong updated FLINK-33187:

Description: 
Problem:
Some events, such as ScalingReport, are sent to Kafka repeatedly when autoscaling 
is not enabled; these make up 99% of all Kafka events in our prod env. This 
wastes Kafka resources and hurts downstream performance.

Proposal:
Suppress duplicate events within an interval defined by a new operator config 
"scaling.report.interval", in seconds, defaulting to 1800.

  was:
Problem:
Some events, such as ScalingReport, are sent to Kafka repeatedly when autoscaling 
is not enabled; these make up 99% of all Kafka events in our prod env. This 
wastes Kafka resources and hurts downstream performance.

Proposal:
Suppress duplicate events within an interval defined by a new operator dynamic 
config "scaling.report.interval", in seconds, defaulting to 1800.


> Don't send duplicate event to Kafka if no change
> 
>
> Key: FLINK-33187
> URL: https://issues.apache.org/jira/browse/FLINK-33187
> Project: Flink
>  Issue Type: Improvement
>  Components: Autoscaler
>Affects Versions: 1.17.1
>Reporter: Clara Xiong
>Priority: Major
>
> Problem:
> Some events, such as ScalingReport, are sent to Kafka repeatedly when 
> autoscaling is not enabled; these make up 99% of all Kafka events in our 
> prod env. This wastes Kafka resources and hurts downstream performance.
> Proposal:
> Suppress duplicate events within an interval defined by a new operator config 
> "scaling.report.interval", in seconds, defaulting to 1800.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33187) Don't send duplicate event to Kafka if no change

2023-10-04 Thread Clara Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong updated FLINK-33187:

Description: 
Problem:
Some events, such as ScalingReport, are sent to Kafka repeatedly when autoscaling 
is not enabled; these make up 99% of all Kafka events in our prod env. This 
wastes Kafka resources and hurts downstream performance.

Proposal:
Suppress duplicate events within an interval defined by a new operator dynamic 
config "suppress-event.interval", in seconds, defaulting to 0.

  was:
Problem:
Some events, such as ScalingReport, are sent to Kafka repeatedly when autoscaling 
is not enabled; these make up 99% of all Kafka events in our prod env. This 
wastes Kafka resources and hurts downstream performance.

Proposal:
Suppress duplicate events within an interval defined by a new operator dynamic 
config "suppress-event.interval", defaulting to 30 minutes.


> Don't send duplicate event to Kafka if no change
> 
>
> Key: FLINK-33187
> URL: https://issues.apache.org/jira/browse/FLINK-33187
> Project: Flink
>  Issue Type: Improvement
>  Components: Autoscaler
>Affects Versions: 1.17.1
>Reporter: Clara Xiong
>Priority: Major
>
> Problem:
> Some events, such as ScalingReport, are sent to Kafka repeatedly when 
> autoscaling is not enabled; these make up 99% of all Kafka events in our 
> prod env. This wastes Kafka resources and hurts downstream performance.
> Proposal:
> Suppress duplicate events within an interval defined by a new operator dynamic 
> config "suppress-event.interval", in seconds, defaulting to 0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33187) Don't send duplicate event to Kafka if no change

2023-10-04 Thread Clara Xiong (Jira)
Clara Xiong created FLINK-33187:
---

 Summary: Don't send duplicate event to Kafka if no change
 Key: FLINK-33187
 URL: https://issues.apache.org/jira/browse/FLINK-33187
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler
Affects Versions: 1.17.1
Reporter: Clara Xiong


Problem:
Some events, such as ScalingReport, are sent to Kafka repeatedly when autoscaling 
is not enabled; these make up 99% of all Kafka events in our prod env. This 
wastes Kafka resources and hurts downstream performance.

Proposal:
Suppress duplicate events within an interval defined by a new operator dynamic 
config "suppress-event.interval", defaulting to 30 minutes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-30119) Breaking change: Flink Kubernetes Operator should store last savepoint in the SavepointInfo.lastSavepoint field whether it is completed or pending

2023-08-31 Thread Clara Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong closed FLINK-30119.
---
Release Note: Given the impact vs benefit of the change, we can revisit 
next time we get a chance to make major design changes.
  Resolution: Won't Fix

> Breaking change: Flink Kubernetes Operator should store last savepoint in the 
> SavepointInfo.lastSavepoint field whether it is completed or pending
> --
>
> Key: FLINK-30119
> URL: https://issues.apache.org/jira/browse/FLINK-30119
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
>  Labels: pull-request-available, stale-assigned
>
> End user experience proposal:
> Users can see the properties of the last savepoint, pending or completed, and 
> get its status in one of three states: PENDING, SUCCEEDED, or FAILED. If no 
> savepoint has ever been taken or attempted, it is empty. Completed savepoints 
> (manual, periodic, and upgrade) are included in savepoint history, merged with 
> savepoints from the Flink job.
> Users can see this savepoint with PENDING status once one is triggered. Once 
> completed, the last savepoint status changes to SUCCEEDED and the savepoint is 
> included in savepoint history, or it changes to FAILED and the savepoint is not 
> in the history. If another savepoint is triggered after completion but before 
> the user checks, the user cannot see the status of the one they triggered, but 
> they can check whether that savepoint is in the history.
> Currently lastSavepoint stores only the last completed one, duplicating the 
> savepoint history. To expose the properties of the currently pending savepoint 
> or the last savepoint that failed, we would need to expose that info in 
> separate fields in SavepointInfo. The internal logic of the Operator uses those 
> fields for triggering and retries, which creates compatibility issues with 
> clients. It also uses more space against the etcd size limit.
> Code change proposal:
> Use lastSavepoint to store the last completed/attempted one and deprecate 
> SavepointInfo.triggerTimestamp, SavepointInfo.triggerType, and 
> SavepointInfo.formatType. This will simplify the CRD and logic.
> Add a SavepointInfo::retrieveLastSavepoint method to return the last succeeded 
> one.
> Update getLastSavepointStatus to simplify the logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30119) Breaking change: Flink Kubernetes Operator should store last savepoint in the SavepointInfo.lastSavepoint field whether it is completed or pending

2023-08-31 Thread Clara Xiong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760994#comment-17760994
 ] 

Clara Xiong commented on FLINK-30119:
-

Given the impact and benefit of the change, we can revisit next time we get a 
chance to make major design changes.

> Breaking change: Flink Kubernetes Operator should store last savepoint in the 
> SavepointInfo.lastSavepoint field whether it is completed or pending
> --
>
> Key: FLINK-30119
> URL: https://issues.apache.org/jira/browse/FLINK-30119
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
>  Labels: pull-request-available, stale-assigned
>
> End user experience proposal:
> Users can see the properties of the last savepoint, pending or completed, and 
> get its status in one of three states: PENDING, SUCCEEDED, or FAILED. If no 
> savepoint has ever been taken or attempted, it is empty. Completed savepoints 
> (manual, periodic, and upgrade) are included in savepoint history, merged with 
> savepoints from the Flink job.
> Users can see this savepoint with PENDING status once one is triggered. Once 
> completed, the last savepoint status changes to SUCCEEDED and the savepoint is 
> included in savepoint history, or it changes to FAILED and the savepoint is not 
> in the history. If another savepoint is triggered after completion but before 
> the user checks, the user cannot see the status of the one they triggered, but 
> they can check whether that savepoint is in the history.
> Currently lastSavepoint stores only the last completed one, duplicating the 
> savepoint history. To expose the properties of the currently pending savepoint 
> or the last savepoint that failed, we would need to expose that info in 
> separate fields in SavepointInfo. The internal logic of the Operator uses those 
> fields for triggering and retries, which creates compatibility issues with 
> clients. It also uses more space against the etcd size limit.
> Code change proposal:
> Use lastSavepoint to store the last completed/attempted one and deprecate 
> SavepointInfo.triggerTimestamp, SavepointInfo.triggerType, and 
> SavepointInfo.formatType. This will simplify the CRD and logic.
> Add a SavepointInfo::retrieveLastSavepoint method to return the last succeeded 
> one.
> Update getLastSavepointStatus to simplify the logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30119) Breaking change: Flink Kubernetes Operator should store last savepoint in the SavepointInfo.lastSavepoint field whether it is completed or pending

2022-12-01 Thread Clara Xiong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642168#comment-17642168
 ] 

Clara Xiong commented on FLINK-30119:
-

This is a breaking change:
 * Existing applications using lastSavepoint or trigger-related fields in 
SavepointInfo (except triggerId) may be broken.
 * For triggerId dependence, it is better to use 
SavepointUtils.savepointInProgress, which is guaranteed to work.
 * savepointHistory is not changed and still keeps only the completed savepoints. 
Unit tests for savepoint history also cover a few scenarios.

> Breaking change: Flink Kubernetes Operator should store last savepoint in the 
> SavepointInfo.lastSavepoint field whether it is completed or pending
> --
>
> Key: FLINK-30119
> URL: https://issues.apache.org/jira/browse/FLINK-30119
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
>  Labels: pull-request-available
>
> End user experience proposal:
> Users can see the properties of the last savepoint, pending or completed, and 
> get its status in one of three states: PENDING, SUCCEEDED, or FAILED. If no 
> savepoint has ever been taken or attempted, it is empty. Completed savepoints 
> (manual, periodic, and upgrade) are included in savepoint history, merged with 
> savepoints from the Flink job.
> Users can see this savepoint with PENDING status once one is triggered. Once 
> completed, the last savepoint status changes to SUCCEEDED and the savepoint is 
> included in savepoint history, or it changes to FAILED and the savepoint is not 
> in the history. If another savepoint is triggered after completion but before 
> the user checks, the user cannot see the status of the one they triggered, but 
> they can check whether that savepoint is in the history.
> Currently lastSavepoint stores only the last completed one, duplicating the 
> savepoint history. To expose the properties of the currently pending savepoint 
> or the last savepoint that failed, we would need to expose that info in 
> separate fields in SavepointInfo. The internal logic of the Operator uses those 
> fields for triggering and retries, which creates compatibility issues with 
> clients. It also uses more space against the etcd size limit.
> Code change proposal:
> Use lastSavepoint to store the last completed/attempted one and deprecate 
> SavepointInfo.triggerTimestamp, SavepointInfo.triggerType, and 
> SavepointInfo.formatType. This will simplify the CRD and logic.
> Add a SavepointInfo::retrieveLastSavepoint method to return the last succeeded 
> one.
> Update getLastSavepointStatus to simplify the logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30119) Breaking change: Flink Kubernetes Operator should store last savepoint in the SavepointInfo.lastSavepoint field whether it is completed or pending

2022-12-01 Thread Clara Xiong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642162#comment-17642162
 ] 

Clara Xiong commented on FLINK-30119:
-

Updating the title to reflect the fact that this could break existing applications 
relying on some fields of SavepointInfo, although there are potential benefits in 
simplifying the logic and reducing the size of the etcd object, which is hitting 
its upper limit.

> Breaking change: Flink Kubernetes Operator should store last savepoint in the 
> SavepointInfo.lastSavepoint field whether it is completed or pending
> --
>
> Key: FLINK-30119
> URL: https://issues.apache.org/jira/browse/FLINK-30119
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
>
> End user experience proposal:
> Users can see the properties of the last savepoint, pending or completed, and 
> get its status in one of three states: PENDING, SUCCEEDED, or FAILED. If no 
> savepoint has ever been taken or attempted, it is empty. Completed savepoints 
> (manual, periodic, and upgrade) are included in savepoint history, merged with 
> savepoints from the Flink job.
> Users can see this savepoint with PENDING status once one is triggered. Once 
> completed, the last savepoint status changes to SUCCEEDED and the savepoint is 
> included in savepoint history, or it changes to FAILED and the savepoint is not 
> in the history. If another savepoint is triggered after completion but before 
> the user checks, the user cannot see the status of the one they triggered, but 
> they can check whether that savepoint is in the history.
> Currently lastSavepoint stores only the last completed one, duplicating the 
> savepoint history. To expose the properties of the currently pending savepoint 
> or the last savepoint that failed, we would need to expose that info in 
> separate fields in SavepointInfo. The internal logic of the Operator uses those 
> fields for triggering and retries, which creates compatibility issues with 
> clients. It also uses more space against the etcd size limit.
> Code change proposal:
> Use lastSavepoint to store the last completed/attempted one and deprecate 
> SavepointInfo.triggerTimestamp, SavepointInfo.triggerType, and 
> SavepointInfo.formatType. This will simplify the CRD and logic.
> Add a SavepointInfo::retrieveLastSavepoint method to return the last succeeded 
> one.
> Update getLastSavepointStatus to simplify the logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30119) Breaking change: Flink Kubernetes Operator should store last savepoint in the SavepointInfo.lastSavepoint field whether it is completed or pending

2022-12-01 Thread Clara Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong updated FLINK-30119:

Summary: Breaking change: Flink Kubernetes Operator should store last 
savepoint in the SavepointInfo.lastSavepoint field whether it is completed or 
pending  (was: Flink Kubernetes Operator should store last savepoint in the 
SavepointInfo.lastSavepoint field whether it is completed or pending)

> Breaking change: Flink Kubernetes Operator should store last savepoint in the 
> SavepointInfo.lastSavepoint field whether it is completed or pending
> --
>
> Key: FLINK-30119
> URL: https://issues.apache.org/jira/browse/FLINK-30119
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
>
> End user experience proposal:
> Users can see the properties of the last savepoint, pending or completed, and 
> get its status in one of three states: PENDING, SUCCEEDED, or FAILED. If no 
> savepoint has ever been taken or attempted, it is empty. Completed savepoints 
> (manual, periodic, and upgrade) are included in savepoint history, merged with 
> savepoints from the Flink job.
> Users can see this savepoint with PENDING status once one is triggered. Once 
> completed, the last savepoint status changes to SUCCEEDED and the savepoint is 
> included in savepoint history, or it changes to FAILED and the savepoint is not 
> in the history. If another savepoint is triggered after completion but before 
> the user checks, the user cannot see the status of the one they triggered, but 
> they can check whether that savepoint is in the history.
> Currently lastSavepoint stores only the last completed one, duplicating the 
> savepoint history. To expose the properties of the currently pending savepoint 
> or the last savepoint that failed, we would need to expose that info in 
> separate fields in SavepointInfo. The internal logic of the Operator uses those 
> fields for triggering and retries, which creates compatibility issues with 
> clients. It also uses more space against the etcd size limit.
> Code change proposal:
> Use lastSavepoint to store the last completed/attempted one and deprecate 
> SavepointInfo.triggerTimestamp, SavepointInfo.triggerType, and 
> SavepointInfo.formatType. This will simplify the CRD and logic.
> Add a SavepointInfo::retrieveLastSavepoint method to return the last succeeded 
> one.
> Update getLastSavepointStatus to simplify the logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30119) Flink Kubernetes Operator should store last savepoint in the SavepointInfo.lastSavepoint field whether it is completed or pending

2022-11-28 Thread Clara Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong updated FLINK-30119:

Summary: Flink Kubernetes Operator should store last savepoint in the 
SavepointInfo.lastSavepoint field whether it is completed or pending  (was: 
(Flink Kubernetes Operator should store last savepoint in the 
SavepointInfo.lastSavepoint field whether it is completed or pending)

> Flink Kubernetes Operator should store last savepoint in the 
> SavepointInfo.lastSavepoint field whether it is completed or pending
> -
>
> Key: FLINK-30119
> URL: https://issues.apache.org/jira/browse/FLINK-30119
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
>
> End user experience proposal:
> Users can see the properties of the last savepoint, pending or completed, and 
> get its status in one of three states: PENDING, SUCCEEDED, or FAILED. If no 
> savepoint has ever been taken or attempted, it is empty. Completed savepoints 
> (manual, periodic, and upgrade) are included in savepoint history, merged with 
> savepoints from the Flink job.
> Users can see this savepoint with PENDING status once one is triggered. Once 
> completed, the last savepoint status changes to SUCCEEDED and the savepoint is 
> included in savepoint history, or it changes to FAILED and the savepoint is not 
> in the history. If another savepoint is triggered after completion but before 
> the user checks, the user cannot see the status of the one they triggered, but 
> they can check whether that savepoint is in the history.
> Currently lastSavepoint stores only the last completed one, duplicating the 
> savepoint history. To expose the properties of the currently pending savepoint 
> or the last savepoint that failed, we would need to expose that info in 
> separate fields in SavepointInfo. The internal logic of the Operator uses those 
> fields for triggering and retries, which creates compatibility issues with 
> clients. It also uses more space against the etcd size limit.
> Code change proposal:
> Use lastSavepoint to store the last completed/attempted one and deprecate 
> SavepointInfo.triggerTimestamp, SavepointInfo.triggerType, and 
> SavepointInfo.formatType. This will simplify the CRD and logic.
> Add a SavepointInfo::retrieveLastSavepoint method to return the last succeeded 
> one.
> Update getLastSavepointStatus to simplify the logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30119) (Flink Kubernetes Operator should store last savepoint in the SavepointInfo.lastSavepoint field whether it is completed or pending

2022-11-21 Thread Clara Xiong (Jira)
Clara Xiong created FLINK-30119:
---

 Summary: (Flink Kubernetes Operator should store last savepoint in 
the SavepointInfo.lastSavepoint field whether it is completed or pending
 Key: FLINK-30119
 URL: https://issues.apache.org/jira/browse/FLINK-30119
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Clara Xiong


End user experience proposal:

Users can see the properties of the last savepoint, pending or completed, and get 
its status in one of three states: PENDING, SUCCEEDED, or FAILED. If no savepoint 
has ever been taken or attempted, it is empty. Completed savepoints (manual, 
periodic, and upgrade) are included in savepoint history, merged with savepoints 
from the Flink job.

Users can see this savepoint with PENDING status once one is triggered. Once 
completed, the last savepoint status changes to SUCCEEDED and the savepoint is 
included in savepoint history, or it changes to FAILED and the savepoint is not in 
the history. If another savepoint is triggered after completion but before the 
user checks, the user cannot see the status of the one they triggered, but they 
can check whether that savepoint is in the history.

Currently lastSavepoint stores only the last completed one, duplicating the 
savepoint history. To expose the properties of the currently pending savepoint or 
the last savepoint that failed, we would need to expose that info in separate 
fields in SavepointInfo. The internal logic of the Operator uses those fields for 
triggering and retries, which creates compatibility issues with clients. It also 
uses more space against the etcd size limit.

Code change proposal:

Use lastSavepoint to store the last completed/attempted one and deprecate 
SavepointInfo.triggerTimestamp, SavepointInfo.triggerType, and 
SavepointInfo.formatType. This will simplify the CRD and logic.

Add a SavepointInfo::retrieveLastSavepoint method to return the last succeeded 
one.

Update getLastSavepointStatus to simplify the logic.
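As a rough sketch of the proposed shape: the enum, field layout, and method signatures below are assumptions based on this description, not the operator's actual CRD classes.

{code:java}
// Rough illustrative sketch of the proposal; not the real operator CRD.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

enum SavepointStatus { PENDING, SUCCEEDED, FAILED }

class Savepoint {
    String location;
    long triggerTimestamp;
    SavepointStatus status;

    Savepoint(String location, long triggerTimestamp, SavepointStatus status) {
        this.location = location;
        this.triggerTimestamp = triggerTimestamp;
        this.status = status;
    }
}

class SavepointInfo {
    // Per the proposal: the last completed *or attempted* savepoint lives here,
    // whether it is still pending, succeeded, or failed.
    private Savepoint lastSavepoint;
    // History keeps only completed savepoints, as today.
    private final Deque<Savepoint> savepointHistory = new ArrayDeque<>();

    void updateLastSavepoint(Savepoint sp) {
        lastSavepoint = sp;
        if (sp.status == SavepointStatus.SUCCEEDED) {
            savepointHistory.addLast(sp);
        }
    }

    /** Proposed helper: return the last *succeeded* savepoint, if any. */
    Optional<Savepoint> retrieveLastSavepoint() {
        return Optional.ofNullable(savepointHistory.peekLast());
    }

    /** Simplified status lookup: read the status straight off lastSavepoint. */
    Optional<SavepointStatus> getLastSavepointStatus() {
        return Optional.ofNullable(lastSavepoint).map(sp -> sp.status);
    }
}
{code}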



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30047) getLastSavepointStatus should return null when there is never savepoint completed or pending

2022-11-16 Thread Clara Xiong (Jira)
Clara Xiong created FLINK-30047:
---

 Summary: getLastSavepointStatus should return null when there is 
never savepoint completed or pending
 Key: FLINK-30047
 URL: https://issues.apache.org/jira/browse/FLINK-30047
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Clara Xiong


Currently SUCCEEDED is returned in this case, but null should be returned instead 
to distinguish it from actual success.
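A minimal sketch of the intended behaviour; the enum and parameters here are illustrative stand-ins, not the operator's real API:

{code:java}
// Illustrative sketch only; the types and parameters are stand-ins.
enum SavepointStatus { PENDING, SUCCEEDED, FAILED }

class SavepointStatusUtil {
    static SavepointStatus getLastSavepointStatus(
            Object lastCompletedSavepoint, boolean savepointPending) {
        // Nothing ever completed and nothing in flight: report null rather than
        // SUCCEEDED, so callers can tell "no savepoint yet" apart from real success.
        if (lastCompletedSavepoint == null && !savepointPending) {
            return null;
        }
        return savepointPending ? SavepointStatus.PENDING : SavepointStatus.SUCCEEDED;
    }
}
{code}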



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29819) Record an error event when savepoint fails within grace period

2022-10-31 Thread Clara Xiong (Jira)
Clara Xiong created FLINK-29819:
---

 Summary: Record an error event when savepoint fails within grace 
period
 Key: FLINK-29819
 URL: https://issues.apache.org/jira/browse/FLINK-29819
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Clara Xiong


As of now, SavepointObserver retries if a savepoint fails within the grace period, 
until either success or a failure after the grace period. The grace period applies 
to each retry. If the underlying problem behind the quick failures is not 
transient, such as a mis-configured path or a persistent storage failure, retries 
keep going without recording any error event.

We should first add logic to record an error event per failed attempt. We can 
consider capping the retries if they become a pain point for users.
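A small sketch of that first step (recording an error event per failed attempt); the recorder interface and method names are assumptions, not the actual SavepointObserver/EventRecorder API:

{code:java}
// Illustrative sketch only; the sink interface and names are assumptions.
import java.util.ArrayList;
import java.util.List;

class SavepointFailureReporter {

    /** Stand-in for the operator's Kubernetes event recorder. */
    interface EventSink {
        void recordError(String job, String reason, String message);
    }

    private final EventSink sink;
    private final List<String> pendingRetries = new ArrayList<>();

    SavepointFailureReporter(EventSink sink) {
        this.sink = sink;
    }

    void onSavepointAttemptFailed(String jobName, Exception error, boolean withinGracePeriod) {
        // Record an error event for every failed attempt, even when we will retry
        // inside the grace period, so non-transient problems (e.g. a mis-configured
        // path or a persistent storage failure) become visible instead of being
        // silently retried.
        sink.recordError(jobName, "SavepointError", error.getMessage());
        if (withinGracePeriod) {
            pendingRetries.add(jobName); // retry is still scheduled as before
        }
    }
}
{code}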

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29695) Create a utility to report the status of the last savepoint

2022-10-19 Thread Clara Xiong (Jira)
Clara Xiong created FLINK-29695:
---

 Summary: Create a utility to report the status of the last 
savepoint
 Key: FLINK-29695
 URL: https://issues.apache.org/jira/browse/FLINK-29695
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Clara Xiong


Users want to know the status of the last savepoint, especially for manually 
triggered ones, in order to manage savepoints.

Currently, users can infer the status of the last savepoint (PENDING, SUCCEEDED, 
or ABANDONED) from jobStatus.triggerId, lastSavepoint.triggerNonce, 
spec.job.savepointTriggerNonce, and the savepointTriggerNonce from the last 
reconciliation. If the last savepoint was not manually triggered, there is no 
ABANDONED status, only PENDING or SUCCEEDED.

Creating a utility will encapsulate the internal logic of the Flink operator and 
guard against regressions from future version changes.
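A sketch of what such a utility could look like, inferring the status from the fields named above; the parameter types and the exact abandonment rule are assumptions for illustration, not the operator's real logic:

{code:java}
// Illustrative sketch; parameter names mirror the fields above, but the exact
// types and the abandonment rule are assumptions.
enum LastSavepointStatus { PENDING, SUCCEEDED, ABANDONED }

class LastSavepointStatusUtil {
    static LastSavepointStatus infer(
            String jobStatusTriggerId,              // non-empty while a savepoint is in flight
            Long lastSavepointTriggerNonce,         // nonce on the last completed savepoint
            Long specSavepointTriggerNonce,         // nonce currently requested in the spec
            Long reconciledSavepointTriggerNonce) { // nonce from the last reconciliation
        if (jobStatusTriggerId != null && !jobStatusTriggerId.isEmpty()) {
            return LastSavepointStatus.PENDING;
        }
        // A manually requested nonce that was already reconciled but never shows up
        // on the last completed savepoint means the savepoint was abandoned.
        if (specSavepointTriggerNonce != null
                && specSavepointTriggerNonce.equals(reconciledSavepointTriggerNonce)
                && !specSavepointTriggerNonce.equals(lastSavepointTriggerNonce)) {
            return LastSavepointStatus.ABANDONED;
        }
        return LastSavepointStatus.SUCCEEDED;
    }
}
{code}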

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29588) Add Flink Version and Application Version to Savepoint properties

2022-10-11 Thread Clara Xiong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17616041#comment-17616041
 ] 

Clara Xiong commented on FLINK-29588:
-

appVersion is the application version of the FlinkDeployment.

If a user doesn't find a problem until after a couple of rounds of upgrades and 
decides to go back to re-process or troubleshoot, many savepoints have already 
been taken. The timestamp is one way to look for the right one, but they probably 
need a way to keep a chronological history of versions.

> Add Flink Version and Application Version to Savepoint properties
> -
>
> Key: FLINK-29588
> URL: https://issues.apache.org/jira/browse/FLINK-29588
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Clara Xiong
>Priority: Major
>
> It is common that users need to upgrade long-running FlinkDeployments. An 
> application upgrade or a major Flink upgrade might break the application due 
> to schema incompatibilities, especially the latter 
> ([link|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#compatibility-table]).
> In this case, users need to manually restore from a savepoint that is 
> compatible with the version the user wants to try or re-process. Currently 
> the Flink Operator returns a list of completed Savepoints for a FlinkDeployment. 
> It would be helpful if the Savepoints in SavepointHistory had properties for 
> the Flink version and application version so users can easily determine which 
> savepoint to use.
>  
>  * {{String flinkVersion}}
>  * {{String appVersion}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29588) Add Flink Version and Application Version to Savepoint properties

2022-10-11 Thread Clara Xiong (Jira)
Clara Xiong created FLINK-29588:
---

 Summary: Add Flink Version and Application Version to Savepoint 
properties
 Key: FLINK-29588
 URL: https://issues.apache.org/jira/browse/FLINK-29588
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.2.0
Reporter: Clara Xiong


It is common that users need to upgrade long-running FlinkDeployments. An 
application upgrade or a major Flink upgrade might break the application due to 
schema incompatibilities, especially the latter 
([link|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#compatibility-table]).
In this case, users need to manually restore from a savepoint that is compatible 
with the version the user wants to try or re-process. Currently the Flink Operator 
returns a list of completed Savepoints for a FlinkDeployment. It would be helpful 
if the Savepoints in SavepointHistory had properties for the Flink version and 
application version so users can easily determine which savepoint to use.

 * {{String flinkVersion}}
 * {{String appVersion}}
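A sketch of what the added properties could look like on the savepoint status object; the two field names follow the ticket, while the surrounding class shape is an assumption:

{code:java}
// Sketch only: the two new fields follow the ticket; the rest of the class
// shape is an illustrative assumption, not the operator's actual CRD class.
public class Savepoint {
    long timeStamp;
    String location;

    // Proposed additions, captured when the savepoint is taken, so users can
    // match a savepoint to the Flink/application version that produced it when
    // restoring after an incompatible upgrade.
    String flinkVersion;   // e.g. "1.15.2"
    String appVersion;     // application version of the FlinkDeployment
}
{code}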



--
This message was sent by Atlassian Jira
(v8.20.10#820010)