[
https://issues.apache.org/jira/browse/FLINK-34991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17910366#comment-17910366
]
david radley commented on FLINK-34991:
--------------------------------------
Hi,
I thought this might relate to your issue. We hit a classloading issue around
UDFs that was fixed in master by a large refactor of job scheduling. We have a
PR [https://github.com/apache/flink/pull/25656] that we are looking to use to
fix this in 1.20, provided we can be convinced there are no side effects. I
suggest checking whether this still fails for you on master / v2.
> Flink Operator ClassPath Race Condition Bug
> -------------------------------------------
>
> Key: FLINK-34991
> URL: https://issues.apache.org/jira/browse/FLINK-34991
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.7.2
> Environment: Standard Flink Operator with Flink Deployment.
> To recreate, just remove a critical SQL connector library from the bundled jar
> Reporter: Ryan van Huuksloot
> Priority: Minor
>
> Hello,
> I believe we've found a bug in the Job Managers of the Kubernetes Operator.
> I think there is a race condition or an incorrect conditional: the operator
> reports a High Availability error instead of surfacing the underlying class
> loading failure in the Job Manager.
> *Example:*
> When deploying a Flink SQL job, the job managers start in a failed state.
> Describing the FlinkDeployment returns the error message
> {code:java}
> RestoreFailed ... HA metadata not available to restore from last state. It is
> possible that the job has finished or terminally failed, or the configmaps
> have been deleted.{code}
> But upon further investigation, the actual error was a class loading failure
> in the Job Manager. This was logged in the Job Manager:
> {code:java}
> org.apache.flink.table.api.ValidationException: Could not find any factory
> for identifier 'kafka' that implements
> 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.
>
> Available factory identifiers are:
> blackhole
> datagen
> filesystem
> print{code}
> There is also corresponding logging in the operator:
> {code:java}
> ... Cannot discover a connector using option: 'connector'='kafka'
> 	at org.apache.flink.table.factories.FactoryUtil.enrichNoMatchingConnectorError(FactoryUtil.java:798)
> 	at org.apache.flink.table.factories.FactoryUtil.discoverTableFactory(FactoryUtil.java:772)
> 	at org.apache.flink.table.factories.FactoryUtil.createDynamicTableSource(FactoryUtil.java:215)
> 	... 52 more
> Caused by: org.apache.flink.table.api.ValidationException: Could not find
> any factory for identifier 'kafka' that implements
> 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath
> ....{code}
> I think the operator should surface this error in the CRD status, since the
> HA error is not the root cause.
>
>
> To recreate:
> All I did was remove
> `"org.apache.flink:flink-connector-kafka:$flinkConnectorKafkaVersion"` from
> my bundled jar, so the connector was missing from the classpath. The job was
> a Flink SQL job, which means the job manager starts before the classpath
> error is thrown; that appears to be the issue.
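> For illustration, a minimal table definition of the kind that triggers this
> factory lookup (table name and options are hypothetical; any source declared
> with 'connector' = 'kafka' behaves the same):
> {code:sql}
> -- Hypothetical table: resolving 'connector' = 'kafka' requires
> -- flink-connector-kafka on the classpath. Without it, planning fails
> -- with the ValidationException quoted earlier in this report.
> CREATE TABLE orders (
>   order_id STRING,
>   amount DOUBLE
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'orders',
>   'properties.bootstrap.servers' = 'localhost:9092',
>   'format' = 'json'
> );{code}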
--
This message was sent by Atlassian Jira
(v8.20.10#820010)