[jira] [Created] (FLINK-34991) Flink Operator ClassPath Race Condition Bug

2024-04-02 Thread Ryan van Huuksloot (Jira)
Ryan van Huuksloot created FLINK-34991:
--

 Summary: Flink Operator ClassPath Race Condition Bug
 Key: FLINK-34991
 URL: https://issues.apache.org/jira/browse/FLINK-34991
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: 1.7.2
 Environment: Standard Flink Operator with Flink Deployment.

To reproduce, remove a required SQL connector library (e.g. the Kafka connector jar) from the bundled jar.
Reporter: Ryan van Huuksloot


Hello,

I believe we've found a bug in how the Kubernetes Operator reports Job Manager failures. There appears to be a race condition or an incorrect conditional: the operator reports a High Availability error instead of checking whether the Job Manager actually failed due to a class-loading problem.


*Example:*
When deploying a SQL Flink job, the Job Managers start in a failed state.
Describing the FlinkDeployment returns the error message:
{code:java}
RestoreFailed ... HA metadata not available to restore from last state. It is 
possible that the job has finished or terminally failed, or the configmaps have 
been deleted.{code}
But upon further investigation, the actual error was a class-loading failure in 
the Job Manager. The Job Manager log showed:
{code:java}
org.apache.flink.table.api.ValidationException: Could not find any factory
for identifier 'kafka' that implements
'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.

Available factory identifiers are:

blackhole
datagen
filesystem
print
{code}

The operator log also contains the root cause:
{code:java}
... Cannot discover a connector using option: 'connector'='kafka'
	at org.apache.flink.table.factories.FactoryUtil.enrichNoMatchingConnectorError(FactoryUtil.java:798)
	at org.apache.flink.table.factories.FactoryUtil.discoverTableFactory(FactoryUtil.java:772)
	at org.apache.flink.table.factories.FactoryUtil.createDynamicTableSource(FactoryUtil.java:215)
	... 52 more
Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kafka' that implements 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath
{code}
I think the operator should surface this error in the CRD status, since the HA 
error is not the root cause.
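For anyone reproducing this, here is a minimal Flink SQL source definition of the kind that hits this code path when the Kafka connector jar is missing from the bundled jar (the table and field names are hypothetical; the WITH options are standard Kafka connector options):
{code:sql}
-- Fails at planning time with the ValidationException above when
-- the Kafka SQL connector is absent from the classpath.
CREATE TABLE orders (
    order_id STRING,
    amount   DOUBLE
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'json'
);
{code}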



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33803) Add observedGeneration to Operator's status spec

2023-12-12 Thread Ryan van Huuksloot (Jira)
Ryan van Huuksloot created FLINK-33803:
--

 Summary: Add observedGeneration to Operator's status spec
 Key: FLINK-33803
 URL: https://issues.apache.org/jira/browse/FLINK-33803
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Ryan van Huuksloot
 Fix For: kubernetes-operator-1.8.0


Thoughts on adding the observedGeneration status field to the Kubernetes 
Operator?
 
I saw FLINK-30858 (https://issues.apache.org/jira/browse/FLINK-30858), which is 
useful, but the conventional approach is simply to update observedGeneration 
when reconciliation completes. There is tooling 
(https://github.com/Shopify/krane/blob/main/README.md#specifying-passfail-conditions) 
that requires this field in order to determine whether reconciliation passed 
for CRDs. It is also the expected flow for operators: 
https://alenkacz.medium.com/kubernetes-operator-best-practices-implementing-observedgeneration-250728868792
 
I suggest we keep the current implementation of the reconciliation spec for now 
and, in parallel, update `status.observedGeneration`.
 
Mailing List Discussion: 
https://lists.apache.org/thread/1dv33dyqqh18qncot1c6lxgn02do6xnw
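To illustrate the expected pattern, a sketch of how the status could look once reconciliation has caught up with the spec (the field placement is an assumption for illustration, not the operator's current schema):
{code:yaml}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  generation: 4          # bumped by the API server on each spec change
status:
  observedGeneration: 4  # set by the operator after reconciling generation 4
{code}
Tooling can then treat `status.observedGeneration == metadata.generation` (together with a healthy condition) as "reconciliation has caught up with the latest spec".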





[jira] [Created] (FLINK-32508) Flink-Metrics Prometheus - Native Histograms / Native Counters

2023-06-30 Thread Ryan van Huuksloot (Jira)
Ryan van Huuksloot created FLINK-32508:
--

 Summary: Flink-Metrics Prometheus - Native Histograms / Native 
Counters
 Key: FLINK-32508
 URL: https://issues.apache.org/jira/browse/FLINK-32508
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Metrics
Reporter: Ryan van Huuksloot
 Fix For: 1.18.0, 1.19.0


Prometheus has new metric types that would allow the exporter to write Counters 
and Histograms as native Prometheus metrics (instead of writing them as 
Gauges). This requires an update to the Prometheus client, whose API has 
changed.

To accommodate the new metric types while retaining the existing Prometheus 
metrics behavior, the recommendation is to *add a new package, e.g. 
`flink-metrics-prometheus-native`, and eventually deprecate the original.*

Discussed more on the mailing list: 
https://lists.apache.org/thread/kbo3973whb8nj5xvkpvhxrmgtmnbkhlv
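For context on the "native" part: native histograms replace fixed, per-metric bucket boundaries with exponential buckets whose resolution is controlled by a schema parameter, with a per-bucket growth factor of 2^(2^-schema). A small sketch of that boundary math (pure illustration of the format, not exporter code):

```python
def native_bucket_upper_bound(schema: int, index: int) -> float:
    """Upper bound of native-histogram bucket `index` at the given schema.

    Buckets grow by a factor of 2 ** (2 ** -schema), so the upper bound
    of bucket i is that factor raised to the i-th power.
    """
    return 2.0 ** (index * 2.0 ** -schema)

# At schema 0 the growth factor is 2, so bucket 3 ends at 8.
print(native_bucket_upper_bound(0, 3))  # 8.0
# At schema 3 the factor is 2 ** (1/8); 8 buckets span one doubling.
print(native_bucket_upper_bound(3, 8))  # 2.0
```

Higher schemas give finer resolution without the exporter having to pick bucket boundaries up front, which is what makes this attractive compared to exporting pre-bucketed values as Gauges.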


