gerkElznik commented on PR #1126:
URL:
https://github.com/apache/flink-kubernetes-operator/pull/1126#issuecomment-4771813362
> @gerkElznik the e2es seem to be failing
## Heads-up before going further: this regresses Flink 1.x application
deployments
While tracking down the failing `test_application_operations.sh` smoke test,
I found that switching the operator to mount `config.yaml` breaks **Flink 1.x
application-mode** deployments. I think it reshapes this PR, so I wanted to lay
it out.
### Symptom
A `v1_xx` application deployment's JobManager crashloops at startup:
```
ERROR ClusterEntrypoint - Could not create application program.
java.net.URISyntaxException: Illegal character in scheme name at index 0:
['local:///opt/flink/usrlib/myjob.jar']
at KubernetesApplicationClusterEntrypoint.fetchArtifacts(…)
[flink-dist-1.20.4.jar]
```
### Isolated with an A/B
Same operator image (built from this branch), same kind cluster, same `v1_20`
`StateMachineExample` — the only difference is the chart's mounted config
file:
| | mounts `config.yaml` (this PR) | mounts `flink-conf.yaml` (main) |
|---|---|---|
| operator config mode | standard-YAML | legacy |
| operator writes to the JM ConfigMap | `pipeline.jars:
['local:///…/myjob.jar']` | `pipeline.jars: local:///…/myjob.jar` |
| `v1_20` JobManager | URISyntaxException → CrashLoopBackOff | READY /
RUNNING |
### Root cause
`FlinkConfMountDecorator#getClusterSideConfData` (the operator's copy, from
FLINK-37236):
```java
// … otherwise it would simply inherit it from the base config (which would
always be false currently).
Configuration clusterSideConfig = new
Configuration(useStandardYamlConfig()); // false for v1_x → flink-conf.yaml ✓
clusterSideConfig.addAll(flinkConfig);
// copies the operator's global-config values
return ConfigurationUtils.convertConfigToWritableLines(clusterSideConfig,
false);
```
useStandardYamlConfig() correctly keys off the deployment's FlinkVersion,
but addAll(flinkConfig) copies values from the operator's global config — and
the comment assumes that global config is "always false" (legacy). Once the
operator reads config.yaml, its global config is standardYaml=true, so list
options (pipeline.jars, pipeline.classpaths, …) carry standard-YAML form and
get written into the v1_x flink-conf.yaml as ['local://…'], which the legacy
parser rejects. So it's a pre-existing operator assumption that this PR is
simply the first to expose — and it affects any list-valued option on any
Flink-1.x deployment, not only pipeline.jars.
How would you like to sequence this @gyfora? Tagging @Dennis-Mircea as well
since he's mentioned planning for the overall 2.x upgrade.
1. Fix FlinkConfMountDecorator to serialize per the target deployment's YAML
mode (downgrade list values when useStandardYamlConfig() is false) as a
prerequisite, then land this chart change; or
2. Hold the config.yaml mount (this change) until the operator no longer
manages Flink-1.x deployments.
Happy to help with either, including the operator-side code change.
_Two smaller notes, same root (single config.yaml key):_
_- Re your inline question on the removed lines: the
watchNamespaces/operatorHealth injections aren't dropped, they were emitted
once per config-file key (so twice), and the dedup keeps one copy (still just
below the resolver; helm template shows them rendering once). The hasKey → with
swap is the actual fix, and the flink-conf.yaml key is removed to emit a single
file._
_- e2e-tests/test_dynamic_flink_conf.sh patches the flink-conf.yaml
ConfigMap key, which is no longer mounted; its twin test_dynamic_config.sh
already covers the same path via config.yaml, so it'd need updating/removing —
but that's moot until the above is settled._
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]