Re: flink-stop command fails with ` Operation not found under key`

2022-04-25 Thread Harsh Shah
Hello Huweihua (sorry for dup email, ended up clicking reply instead of
reply-all),

Thanks for reaching out and having a look at it. The job managers have the
same logs I shared before, another example below.

{"instant":{"epochSecond":1650920964,"nanoOfSecond":64800},"thread":"flink-akka.actor.default-dispatcher-15","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Triggering
stop-with-savepoint for job
.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":124,"threadPriority":5}
{"instant":{"epochSecond":1650920964,"nanoOfSecond":77700},"thread":"Checkpoint
Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","message":"Triggering
checkpoint 724 (type=SAVEPOINT_SUSPEND) @ 1650920964650 for job
.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":120,"threadPriority":5}

*{"instant":{"epochSecond":1650920964,"nanoOfSecond":84600},"thread":"flink-rest-server-netty-worker-thread-1","level":"ERROR","loggerName":"org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler","message":"Exception
occurred in REST handler: Operation not found under key:
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@382cf973","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":203,"threadPriority":5}*{"instant":{"epochSecond":1650920966,"nanoOfSecond":41700},"thread":"jobmanager-io-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","message":"Completed
checkpoint 724 for job  (3268755 bytes,
checkpointDuration=1443 ms, finalizationTime=324
ms).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":105,"threadPriority":5}
{"instant":{"epochSecond":1650920966,"nanoOfSecond":58100},"thread":"flink-akka.actor.default-dispatcher-16","level":"INFO","loggerName":"org.apache.flink.runtime.executiongraph.ExecutionGraph","message":"Source:
line-items -> line-items-filter-valid-records -> Filter -> Map -> Filter ->
Filter -> Filter -> Timestamps/Watermarks (5/8)
(5ffc8a66e16b10c66a9eaef73bff538e) switched from RUNNING to
FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":136,"threadPriority":5}
. // Bunch of 'RUNNING to FINISHED.' messages
{"instant":{"epochSecond":1650920967,"nanoOfSecond":95100},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.executiongraph.ExecutionGraph","message":"Job
JOB_NAME () switched from state RUNNING to
FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
{"instant":{"epochSecond":1650920967,"nanoOfSecond":95100},"thread":"flink-akka.actor.default-dispatcher-19","level":"INFO","loggerName":"org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager","message":"Clearing
resource requirements of job
","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":139,"threadPriority":5}
{"instant":{"epochSecond":1650920967,"nanoOfSecond":95100},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","message":"Stopping
checkpoint coordinator for job
.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
{"instant":{"epochSecond":1650920967,"nanoOfSecond":97400},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.dispatcher.StandaloneDispatcher","message":"Job
 reached terminal state
FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
{"instant":{"epochSecond":1650920968,"nanoOfSecond":3200},"thread":"cluster-io-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.jobmanager.DefaultJobGraphStore","message":"Removed
job graph  from
KubernetesStateHandleStore{configMapName='CONFIGMAP_NAME'}.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":82,"threadPriority":5}
{"instant":{"epochSecond":1650920968,"nanoOfSecond":3500},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Stopping
the JobMaster for job
'JOB_NAME' 

Re: flink-stop command fails with ` Operation not found under key`

2022-04-20 Thread huweihua
Stop-with-savepoint is an async operation, it will trigger savepoint at first, 
and trigger this operation in a cache. 

These error logs indicate that the operation is in the cache which can no 
longer be found. Could you provide more jobmanager.log to find when this 
operation is evicted from cache.




> 2022年4月20日 上午12:21,Harsh Shah  写道:
> 
> Hello Community,
> 
> We are trying to adopt flink-stop instead of orchestrating the stop manually. 
> While using the command, it failed with error "Operation not found under key: 
> org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a".
>  the job eventually stopped but I wanted to see how I could execute the 
> command successfully ! Any help or direction to debug is appreciated. 
> 
> Information:
> * Flink version: 1.13.1 
> * Flink on kubernetes with HA enabled. Standalone cluster.
> 
> command: "/bin/flink stop $JOBID"
> Stack trace of the command:
> WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method 
> sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Suspending job "" with a savepoint.
> 
> 
>  The program finished with the following exception:
> 
> org.apache.flink.util.FlinkException: Could not stop with a savepoint job 
> "".
> at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:581)
> at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
> at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:569)
> at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1069)
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
> at 
> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.rest.util.RestClientException: 
> [org.apache.flink.runtime.rest.NotFoundException: Operation not found under 
> key: 
> org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a
> at 
> org.apache.flink.runtime.rest.handler.async.AbstractAsynchronousOperationHandlers$StatusHandler.handleRequest(AbstractAsynchronousOperationHandlers.java:182)
> at 
> org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler.handleRequest(SavepointHandlers.java:219)
> at 
> org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:83)
> at 
> org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:195)
> at 
> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.lambda$channelRead0$0(LeaderRetrievalHandler.java:83)
> at java.base/java.util.Optional.ifPresent(Unknown Source)
> at org.apache.flink.util.OptionalConsumer.ifPresent(OptionalConsumer.java:45)
> at 
> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:80)
> at 
> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:49)
> at 
> org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
> 
> 
> Jobs manager logs:
> {"instant":{"epochSecond":1650384228,"nanoOfSecond":866181000},"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Triggering
>  stop-with-savepoint for job 
> .","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":28,"threadPriority":5}
> {"instant":{"epochSecond":1650384229,"nanoOfSecond":30779000},"thread":"flink-rest-server-netty-worker-thread-1","level":"ERROR","loggerName":"org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler","message":"Exception
>  occurred in REST handler: Operation not found under key: 
> 

flink-stop command fails with ` Operation not found under key`

2022-04-19 Thread Harsh Shah
Hello Community,

We are trying to adopt flink-stop instead of orchestrating the stop
manually. While using the command, it failed with error "Operation not
found under key:
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a".
the job eventually stopped but I wanted to see how I could execute the
command successfully ! Any help or direction to debug is appreciated.

Information:
* Flink version: 1.13.1
* Flink on kubernetes with HA enabled. Standalone cluster.

*command*: "/bin/flink stop $JOBID"

*Stack trace of the command:*






























*WARNING: Illegal reflective access by
org.apache.hadoop.security.authentication.util.KerberosUtil
(file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
sun.security.krb5.Config.getInstance()WARNING: Please consider reporting
this to the maintainers of
org.apache.hadoop.security.authentication.util.KerberosUtilWARNING: Use
--illegal-access=warn to enable warnings of further illegal reflective
access operationsWARNING: All illegal access operations will be denied in a
future releaseSuspending job "" with a
savepoint. The
program finished with the following
exception:org.apache.flink.util.FlinkException: Could not stop with a
savepoint job "". at
org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:581)
at
org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:569) at
org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1069)
at
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
at java.base/java.security.AccessController.doPrivileged(Native Method) at
java.base/javax.security.auth.Subject.doAs(Unknown Source) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at
org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at
org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)Caused
by: java.util.concurrent.ExecutionException:
org.apache.flink.runtime.rest.util.RestClientException:
[org.apache.flink.runtime.rest.NotFoundException: Operation not found under
key:
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a
at
org.apache.flink.runtime.rest.handler.async.AbstractAsynchronousOperationHandlers$StatusHandler.handleRequest(AbstractAsynchronousOperationHandlers.java:182)
at
org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler.handleRequest(SavepointHandlers.java:219)
at
org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:83)
at
org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:195)
at
org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.lambda$channelRead0$0(LeaderRetrievalHandler.java:83)
at java.base/java.util.Optional.ifPresent(Unknown Source) at
org.apache.flink.util.OptionalConsumer.ifPresent(OptionalConsumer.java:45)
at
org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:80)
at
org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:49)
at
org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)*


*Jobs manager logs:*

*{"instant":{"epochSecond":1650384228,"nanoOfSecond":866181000},"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Triggering
stop-with-savepoint for job
.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":28,"threadPriority":5}{"instant":{"epochSecond":1650384229,"nanoOfSecond":30779000},"thread":"flink-rest-server-netty-worker-thread-1","level":"ERROR","loggerName":"org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler","message":"Exception
occurred in REST handler: Operation not found under key:
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":95,"threadPriority":5}*
*... # rest of things are success*