Hello Huweihua (sorry for dup email, ended up clicking reply instead of reply-all),
Thanks for reaching out and having a look at it. The job managers have the same logs I shared before, another example below. {"instant":{"epochSecond":1650920964,"nanoOfSecond":648000000},"thread":"flink-akka.actor.default-dispatcher-15","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Triggering stop-with-savepoint for job 00000000000000000000000000000000.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":124,"threadPriority":5} {"instant":{"epochSecond":1650920964,"nanoOfSecond":777000000},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","message":"Triggering checkpoint 724 (type=SAVEPOINT_SUSPEND) @ 1650920964650 for job 00000000000000000000000000000000.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":120,"threadPriority":5} *{"instant":{"epochSecond":1650920964,"nanoOfSecond":846000000},"thread":"flink-rest-server-netty-worker-thread-1","level":"ERROR","loggerName":"org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler","message":"Exception occurred in REST handler: Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@382cf973","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":203,"threadPriority":5}*{"instant":{"epochSecond":1650920966,"nanoOfSecond":417000000},"thread":"jobmanager-io-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","message":"Completed checkpoint 724 for job 00000000000000000000000000000000 (3268755 bytes, checkpointDuration=1443 ms, finalizationTime=324 ms).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":105,"threadPriority":5} {"instant":{"epochSecond":1650920966,"nanoOfSecond":581000000},"thread":"flink-akka.actor.default-dispatcher-16","level":"INFO","loggerName":"org.apache.flink.runtime.executiongraph.ExecutionGraph","message":"Source: line-items -> line-items-filter-valid-records -> Filter -> Map -> Filter -> Filter -> Filter -> Timestamps/Watermarks (5/8) (5ffc8a66e16b10c66a9eaef73bff538e) switched from RUNNING to FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":136,"threadPriority":5} ..... // Bunch of 'RUNNING to FINISHED.' messages {"instant":{"epochSecond":1650920967,"nanoOfSecond":951000000},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.executiongraph.ExecutionGraph","message":"Job JOB_NAME (00000000000000000000000000000000) switched from state RUNNING to FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5} {"instant":{"epochSecond":1650920967,"nanoOfSecond":951000000},"thread":"flink-akka.actor.default-dispatcher-19","level":"INFO","loggerName":"org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager","message":"Clearing resource requirements of job 00000000000000000000000000000000","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":139,"threadPriority":5} {"instant":{"epochSecond":1650920967,"nanoOfSecond":951000000},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","message":"Stopping checkpoint coordinator for job 00000000000000000000000000000000.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5} {"instant":{"epochSecond":1650920967,"nanoOfSecond":974000000},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.dispatcher.StandaloneDispatcher","message":"Job 00000000000000000000000000000000 reached terminal state FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5} {"instant":{"epochSecond":1650920968,"nanoOfSecond":32000000},"thread":"cluster-io-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.jobmanager.DefaultJobGraphStore","message":"Removed job graph 00000000000000000000000000000000 from KubernetesStateHandleStore{configMapName='CONFIGMAP_NAME'}.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":82,"threadPriority":5} {"instant":{"epochSecond":1650920968,"nanoOfSecond":35000000},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Stopping the JobMaster for job 'JOB_NAME' (00000000000000000000000000000000).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5} ... // tons of success shutdown messages Nothing shows up in the HA jobmanager as well. grepping "cache" only this line shows up (which looks unrelated) {"instant":{"epochSecond":1650920969,"nanoOfSecond":628000000},"thread":"Thread-33","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint","message":"Removing cache directory /tmp/flink-web-fd0f19f3-af71-4ee7-b697-853a9c80b8e1/flink-web-ui","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":9797,"threadPriority":5} This happens for all async commands, like `./bin/flink savepoint $JOB)ID` and consistently. Thanks, Harsh On Wed, Apr 20, 2022 at 5:40 AM huweihua <huweihua....@gmail.com> wrote: > Stop-with-savepoint is an async operation, it will trigger savepoint at > first, and trigger this operation in a cache. > > These error logs indicate that the operation is in the cache which can no > longer be found. Could you provide more jobmanager.log to find when this > operation is evicted from cache. > > > > > 2022年4月20日 上午12:21,Harsh Shah <harsh.a.s...@shopify.com> 写道: > > Hello Community, > > We are trying to adopt flink-stop instead of orchestrating the stop > manually. While using the command, it failed with error "Operation not > found under key: > org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a". > the job eventually stopped but I wanted to see how I could execute the > command successfully ! Any help or direction to debug is appreciated. > > Information: > * Flink version: 1.13.1 > * Flink on kubernetes with HA enabled. Standalone cluster. > > *command*: "/bin/flink stop $JOBID" > > *Stack trace of the command:* > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *WARNING: Illegal reflective access by > org.apache.hadoop.security.authentication.util.KerberosUtil > (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method > sun.security.krb5.Config.getInstance()WARNING: Please consider reporting > this to the maintainers of > org.apache.hadoop.security.authentication.util.KerberosUtilWARNING: Use > --illegal-access=warn to enable warnings of further illegal reflective > access operationsWARNING: All illegal access operations will be denied in a > future releaseSuspending job "00000000000000000000000000000000" with a > savepoint.------------------------------------------------------------ The > program finished with the following > exception:org.apache.flink.util.FlinkException: Could not stop with a > savepoint job "00000000000000000000000000000000". at > org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:581) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002) > at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:569) at > org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1069) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132) > at java.base/java.security.AccessController.doPrivileged(Native Method) at > java.base/javax.security.auth.Subject.doAs(Unknown Source) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) > at > org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at > org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)Caused > by: java.util.concurrent.ExecutionException: > org.apache.flink.runtime.rest.util.RestClientException: > [org.apache.flink.runtime.rest.NotFoundException: Operation not found under > key: > org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a > at > org.apache.flink.runtime.rest.handler.async.AbstractAsynchronousOperationHandlers$StatusHandler.handleRequest(AbstractAsynchronousOperationHandlers.java:182) > at > org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler.handleRequest(SavepointHandlers.java:219) > at > org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:83) > at > org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:195) > at > org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.lambda$channelRead0$0(LeaderRetrievalHandler.java:83) > at java.base/java.util.Optional.ifPresent(Unknown Source) at > org.apache.flink.util.OptionalConsumer.ifPresent(OptionalConsumer.java:45) > at > org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:80) > at > org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:49) > at > org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)* > > > *Jobs manager logs:* > > *{"instant":{"epochSecond":1650384228,"nanoOfSecond":866181000},"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Triggering > stop-with-savepoint for job > 00000000000000000000000000000000.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":28,"threadPriority":5}{"instant":{"epochSecond":1650384229,"nanoOfSecond":30779000},"thread":"flink-rest-server-netty-worker-thread-1","level":"ERROR","loggerName":"org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler","message":"Exception > occurred in REST handler: Operation not found under key: > org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":95,"threadPriority":5}* > *... # rest of things are success* > > >