[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #489: [FLINK-30406] Detect when jobmanager never started

2022-12-28 Thread GitBox


gyfora commented on code in PR #489:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/489#discussion_r1058387166


##
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java:
##
@@ -112,18 +116,23 @@ protected Optional getAvailableUpgradeMode(
 && !flinkVersionChanged(
 ReconciliationUtils.getDeployedSpec(deployment), 
deployment.getSpec())) {
 
-if (!flinkService.isHaMetadataAvailable(deployConfig)) {
-if 
(deployment.getStatus().getReconciliationStatus().getLastStableSpec() == null) {
-// initial deployment failure, reset to allow for spec 
change to proceed
-return resetOnMissingStableSpec(deployment, deployConfig);
-}
-} else {
+if (flinkService.isHaMetadataAvailable(deployConfig)) {
 LOG.info(
 "Job is not running but HA metadata is available for 
last state restore, ready for upgrade");
 return Optional.of(UpgradeMode.LAST_STATE);
 }
 }
 
+if (status.getReconciliationStatus()
+.deserializeLastReconciledSpec()
+.getJob()
+.getUpgradeMode()
+!= UpgradeMode.LAST_STATE
+&& FlinkUtils.jmPodNeverStarted(ctx)) {
+deleteJmThatNeverStarted(deployment, deployConfig);
+return getAvailableUpgradeMode(deployment, ctx, deployConfig, 
observeConfig);

Review Comment:
   I imporved the check to make this more clear, but essentially after we 
delete the JobManager deployment, the job will be put in a terminal state and 
the JmDeployment in the MISSING state. Therefore this branch would not be hit 
again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #489: [FLINK-30406] Detect when jobmanager never started

2022-12-22 Thread GitBox


gyfora commented on code in PR #489:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/489#discussion_r1055881354


##
flink-kubernetes-operator/src/test/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconcilerUpgradeModeTest.java:
##
@@ -260,11 +265,96 @@ private FlinkDeployment cloneDeploymentWithUpgradeMode(
 }
 
 @ParameterizedTest
-@EnumSource(UpgradeMode.class)
-public void testUpgradeBeforeReachingStableSpec(UpgradeMode upgradeMode) 
throws Exception {
+@MethodSource("testUpgradeJmDeployCannotStartParams")
+public void testUpgradeJmDeployCannotStart(UpgradeMode fromMode, 
UpgradeMode toMode)
+throws Exception {

Review Comment:
   I have added some more comments to the tests @morhidi . I could not find a 
good way to break it up to methods, I feel that would just add more complexity.
   
   These tests are pretty generic in the sense that they test different 
upgradeMode/savepoint setting combinations. While that helps us to cover all 
cases in a robust way it makes it a little difficult to simplify beyond a point.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #489: [FLINK-30406] Detect when jobmanager never started

2022-12-20 Thread GitBox


gyfora commented on code in PR #489:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/489#discussion_r1053989089


##
flink-kubernetes-operator/src/test/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconcilerUpgradeModeTest.java:
##
@@ -260,11 +265,96 @@ private FlinkDeployment cloneDeploymentWithUpgradeMode(
 }
 
 @ParameterizedTest
-@EnumSource(UpgradeMode.class)
-public void testUpgradeBeforeReachingStableSpec(UpgradeMode upgradeMode) 
throws Exception {
+@MethodSource("testUpgradeJmDeployCannotStartParams")
+public void testUpgradeJmDeployCannotStart(UpgradeMode fromMode, 
UpgradeMode toMode)
+throws Exception {

Review Comment:
   This test is indeed a bit too complex , I will try to make some extra 
methods comments to make it clearer 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #489: [FLINK-30406] Detect when jobmanager never started

2022-12-20 Thread GitBox


gyfora commented on code in PR #489:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/489#discussion_r1053364035


##
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java:
##
@@ -116,9 +119,23 @@ public final void reconcile(CR cr, Context ctx) throws 
Exception {
 if (reconciliationStatus.isBeforeFirstDeployment()) {
 LOG.info("Deploying for the first time");
 
+if (spec.getJob() != null) {
+var initialUpgradeMode = UpgradeMode.STATELESS;
+var initialSp = spec.getJob().getInitialSavepointPath();
+
+if (initialSp != null) {
+status.getJobStatus()
+.getSavepointInfo()
+.setLastSavepoint(
+Savepoint.of(initialSp, 
SavepointTriggerType.UNKNOWN));
+initialUpgradeMode = UpgradeMode.SAVEPOINT;
+}
+
+spec.getJob().setUpgradeMode(initialUpgradeMode);

Review Comment:
   Also I would note that this only happens if what the user requested in the 
spec cannot be applied directly but we apply an allowed but different upgrade 
mode instead . It would be strange if this was not reflected in the status and 
would make it difficult to build some feature that require knowing what 
actually was done during last upgrade 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #489: [FLINK-30406] Detect when jobmanager never started

2022-12-20 Thread GitBox


gyfora commented on code in PR #489:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/489#discussion_r1053360675


##
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java:
##
@@ -116,9 +119,23 @@ public final void reconcile(CR cr, Context ctx) throws 
Exception {
 if (reconciliationStatus.isBeforeFirstDeployment()) {
 LOG.info("Deploying for the first time");
 
+if (spec.getJob() != null) {
+var initialUpgradeMode = UpgradeMode.STATELESS;
+var initialSp = spec.getJob().getInitialSavepointPath();
+
+if (initialSp != null) {
+status.getJobStatus()
+.getSavepointInfo()
+.setLastSavepoint(
+Savepoint.of(initialSp, 
SavepointTriggerType.UNKNOWN));
+initialUpgradeMode = UpgradeMode.SAVEPOINT;
+}
+
+spec.getJob().setUpgradeMode(initialUpgradeMode);

Review Comment:
   We already store modified versions of the spec as the operator reconciles 
the resource. We had similar logic already during suspend/ upgrade/savepoints 
etc.
   
   this is not a problem because we do not use a naive equality check to test 
whether user requested changes have been applied. But instead we determine the 
diff and ignore changes that do not result in an upgrade but only affect 
reconciliation (such as upgrade mode and operator configs)
   
   Since this is already done and works without any issues and is actually a 
very clean solution I would suggest we not change it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org