[ 
https://issues.apache.org/jira/browse/FLINK-34991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17910370#comment-17910370
 ] 

Ryan van Huuksloot commented on FLINK-34991:
--------------------------------------------

Hi David,

I think that may be a different issue. Startup errors are not exposed by the 
Flink UI (since the JobManager dies), so they are not available to the Operator. 
The plugin example in the issue is only one illustration.

I haven't tested with 2.0, but this issue persists in 1.20. I chatted with Gyula 
about this and he agreed that work likely needs to be done in Flink Core to 
expose these errors.

I won't have time to test with 2.0 until later in January. I haven't seen any 
other commits that refer to this type of error propagation.

> Flink Operator ClassPath Race Condition Bug
> -------------------------------------------
>
>                 Key: FLINK-34991
>                 URL: https://issues.apache.org/jira/browse/FLINK-34991
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.7.2
>         Environment: Standard Flink Operator with Flink Deployment.
> To recreate, just remove a critical SQL connector library from the bundled jar
>            Reporter: Ryan van Huuksloot
>            Priority: Minor
>
> Hello,
> I believe we've found a bug in how the Kubernetes Operator handles Job Manager 
> startup failures. There appears to be a race condition or an incorrect 
> conditional: the operator checks for High Availability metadata instead of 
> detecting that class loading failed in the Job Manager.
> *Example:*
> When deploying a Flink SQL job, the Job Managers start in a failed state.
> Describing the FlinkDeployment returns the error message:
> {code:java}
> RestoreFailed ... HA metadata not available to restore from last state. It is 
> possible that the job has finished or terminally failed, or the configmaps 
> have been deleted.{code}
> Upon further investigation, the actual error was a class-loading problem in 
> the Job Manager. This log appeared in the Job Manager:
> {code:java}
> org.apache.flink.table.api.ValidationException: Could not find any factory for 
> identifier 'kafka' that implements 
> 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.
> 
> Available factory identifiers are:
> 
> blackhole
> datagen
> filesystem
> print{code}
> There is also corresponding logging in the operator:
> {code:java}
> ... Cannot discover a connector using option: 'connector'='kafka'
>     at org.apache.flink.table.factories.FactoryUtil.enrichNoMatchingConnectorError(FactoryUtil.java:798)
>     at org.apache.flink.table.factories.FactoryUtil.discoverTableFactory(FactoryUtil.java:772)
>     at org.apache.flink.table.factories.FactoryUtil.createDynamicTableSource(FactoryUtil.java:215)
>     ... 52 more
> Caused by: org.apache.flink.table.api.ValidationException: Could not find any 
> factory for identifier 'kafka' that implements 
> 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath ...{code}
> I think the operator should surface this error in the CRD status, since the HA 
> error is not the root cause.
>  
>  
> To recreate:
> I removed `"org.apache.flink:flink-connector-kafka:$flinkConnectorKafkaVersion"` 
> from my bundled jar so the connector was missing from the classpath, then ran a 
> Flink SQL job. Because it is a SQL job, the Job Manager starts before the 
> classpath error is thrown, which seems to be the issue. A minimal sketch of 
> such a job is shown below.
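> For illustration, here is a minimal sketch of such a SQL job (the table name, 
> topic, and broker address are hypothetical placeholders):
> {code:java}
> import org.apache.flink.table.api.EnvironmentSettings;
> import org.apache.flink.table.api.TableEnvironment;
> 
> // Sketch only: declares a table with 'connector' = 'kafka'. If
> // flink-connector-kafka is not in the bundled jar, planning the query fails
> // with the ValidationException quoted above, after the Job Manager has
> // already started.
> public class KafkaClasspathRepro {
>     public static void main(String[] args) {
>         TableEnvironment tEnv =
>                 TableEnvironment.create(EnvironmentSettings.inStreamingMode());
> 
>         tEnv.executeSql(
>                 "CREATE TABLE orders (" +
>                 "  id STRING" +
>                 ") WITH (" +
>                 "  'connector' = 'kafka'," +   // needs flink-connector-kafka on the classpath
>                 "  'topic' = 'orders'," +
>                 "  'properties.bootstrap.servers' = 'kafka:9092'," +
>                 "  'scan.startup.mode' = 'earliest-offset'," +
>                 "  'format' = 'json'" +
>                 ")");
> 
>         // The connector factory lookup happens when this query is planned.
>         tEnv.executeSql("SELECT * FROM orders").print();
>     }
> }{code}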



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
