Hello Huweihua (sorry for dup email, ended up clicking reply instead of
reply-all),

Thanks for reaching out and having a look at it. The job managers have the
same logs I shared before, another example below.

{"instant":{"epochSecond":1650920964,"nanoOfSecond":648000000},"thread":"flink-akka.actor.default-dispatcher-15","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Triggering
stop-with-savepoint for job
00000000000000000000000000000000.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":124,"threadPriority":5}
{"instant":{"epochSecond":1650920964,"nanoOfSecond":777000000},"thread":"Checkpoint
Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","message":"Triggering
checkpoint 724 (type=SAVEPOINT_SUSPEND) @ 1650920964650 for job
00000000000000000000000000000000.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":120,"threadPriority":5}

*{"instant":{"epochSecond":1650920964,"nanoOfSecond":846000000},"thread":"flink-rest-server-netty-worker-thread-1","level":"ERROR","loggerName":"org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler","message":"Exception
occurred in REST handler: Operation not found under key:
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@382cf973","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":203,"threadPriority":5}*{"instant":{"epochSecond":1650920966,"nanoOfSecond":417000000},"thread":"jobmanager-io-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","message":"Completed
checkpoint 724 for job 00000000000000000000000000000000 (3268755 bytes,
checkpointDuration=1443 ms, finalizationTime=324
ms).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":105,"threadPriority":5}
{"instant":{"epochSecond":1650920966,"nanoOfSecond":581000000},"thread":"flink-akka.actor.default-dispatcher-16","level":"INFO","loggerName":"org.apache.flink.runtime.executiongraph.ExecutionGraph","message":"Source:
line-items -> line-items-filter-valid-records -> Filter -> Map -> Filter ->
Filter -> Filter -> Timestamps/Watermarks (5/8)
(5ffc8a66e16b10c66a9eaef73bff538e) switched from RUNNING to
FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":136,"threadPriority":5}
..... // Bunch of 'RUNNING to FINISHED.' messages
{"instant":{"epochSecond":1650920967,"nanoOfSecond":951000000},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.executiongraph.ExecutionGraph","message":"Job
JOB_NAME (00000000000000000000000000000000) switched from state RUNNING to
FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
{"instant":{"epochSecond":1650920967,"nanoOfSecond":951000000},"thread":"flink-akka.actor.default-dispatcher-19","level":"INFO","loggerName":"org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager","message":"Clearing
resource requirements of job
00000000000000000000000000000000","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":139,"threadPriority":5}
{"instant":{"epochSecond":1650920967,"nanoOfSecond":951000000},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","message":"Stopping
checkpoint coordinator for job
00000000000000000000000000000000.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
{"instant":{"epochSecond":1650920967,"nanoOfSecond":974000000},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.dispatcher.StandaloneDispatcher","message":"Job
00000000000000000000000000000000 reached terminal state
FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
{"instant":{"epochSecond":1650920968,"nanoOfSecond":32000000},"thread":"cluster-io-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.jobmanager.DefaultJobGraphStore","message":"Removed
job graph 00000000000000000000000000000000 from
KubernetesStateHandleStore{configMapName='CONFIGMAP_NAME'}.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":82,"threadPriority":5}
{"instant":{"epochSecond":1650920968,"nanoOfSecond":35000000},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Stopping
the JobMaster for job
'JOB_NAME' 
(00000000000000000000000000000000).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
...
// tons of success shutdown messages

Nothing shows up in the HA jobmanager as well. grepping "cache" only this
line shows up (which looks unrelated)
{"instant":{"epochSecond":1650920969,"nanoOfSecond":628000000},"thread":"Thread-33","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint","message":"Removing
cache directory
/tmp/flink-web-fd0f19f3-af71-4ee7-b697-853a9c80b8e1/flink-web-ui","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":9797,"threadPriority":5}


This happens for all async commands, like `./bin/flink savepoint $JOB)ID`
and consistently.

Thanks,
Harsh

On Wed, Apr 20, 2022 at 5:40 AM huweihua <huweihua....@gmail.com> wrote:

> Stop-with-savepoint is an async operation, it will trigger savepoint at
> first, and trigger this operation in a cache.
>
> These error logs indicate that the operation is in the cache which can no
> longer be found. Could you provide more jobmanager.log to find when this
> operation is evicted from cache.
>
>
>
>
> 2022年4月20日 上午12:21,Harsh Shah <harsh.a.s...@shopify.com> 写道:
>
> Hello Community,
>
> We are trying to adopt flink-stop instead of orchestrating the stop
> manually. While using the command, it failed with error "Operation not
> found under key:
> org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a".
> the job eventually stopped but I wanted to see how I could execute the
> command successfully ! Any help or direction to debug is appreciated.
>
> Information:
> * Flink version: 1.13.1
> * Flink on kubernetes with HA enabled. Standalone cluster.
>
> *command*: "/bin/flink stop $JOBID"
>
> *Stack trace of the command:*
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *WARNING: Illegal reflective access by
> org.apache.hadoop.security.authentication.util.KerberosUtil
> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
> sun.security.krb5.Config.getInstance()WARNING: Please consider reporting
> this to the maintainers of
> org.apache.hadoop.security.authentication.util.KerberosUtilWARNING: Use
> --illegal-access=warn to enable warnings of further illegal reflective
> access operationsWARNING: All illegal access operations will be denied in a
> future releaseSuspending job "00000000000000000000000000000000" with a
> savepoint.------------------------------------------------------------ The
> program finished with the following
> exception:org.apache.flink.util.FlinkException: Could not stop with a
> savepoint job "00000000000000000000000000000000". at
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:581)
> at
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
> at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:569) at
> org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1069)
> at
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
> at java.base/java.security.AccessController.doPrivileged(Native Method) at
> java.base/javax.security.auth.Subject.doAs(Unknown Source) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
> at
> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at
> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)Caused
> by: java.util.concurrent.ExecutionException:
> org.apache.flink.runtime.rest.util.RestClientException:
> [org.apache.flink.runtime.rest.NotFoundException: Operation not found under
> key:
> org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a
> at
> org.apache.flink.runtime.rest.handler.async.AbstractAsynchronousOperationHandlers$StatusHandler.handleRequest(AbstractAsynchronousOperationHandlers.java:182)
> at
> org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler.handleRequest(SavepointHandlers.java:219)
> at
> org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:83)
> at
> org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:195)
> at
> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.lambda$channelRead0$0(LeaderRetrievalHandler.java:83)
> at java.base/java.util.Optional.ifPresent(Unknown Source) at
> org.apache.flink.util.OptionalConsumer.ifPresent(OptionalConsumer.java:45)
> at
> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:80)
> at
> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:49)
> at
> org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)*
>
>
> *Jobs manager logs:*
>
> *{"instant":{"epochSecond":1650384228,"nanoOfSecond":866181000},"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Triggering
> stop-with-savepoint for job
> 00000000000000000000000000000000.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":28,"threadPriority":5}{"instant":{"epochSecond":1650384229,"nanoOfSecond":30779000},"thread":"flink-rest-server-netty-worker-thread-1","level":"ERROR","loggerName":"org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler","message":"Exception
> occurred in REST handler: Operation not found under key:
> org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":95,"threadPriority":5}*
> *... # rest of things are success*
>
>
>

Reply via email to