Re: flink-stop command fails with `Operation not found under key`

2022-04-25 Thread Harsh Shah
coordinator for job .","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
{"instant":{"epochSecond":1650920967,"nanoOfSecond":97400},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.dispatcher.StandaloneDispatcher","message":"Job reached terminal state FINISHED.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
{"instant":{"epochSecond":1650920968,"nanoOfSecond":3200},"thread":"cluster-io-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.jobmanager.DefaultJobGraphStore","message":"Removed job graph 0000 from KubernetesStateHandleStore{configMapName='CONFIGMAP_NAME'}.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":82,"threadPriority":5}
{"instant":{"epochSecond":1650920968,"nanoOfSecond":3500},"thread":"flink-akka.actor.default-dispatcher-13","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Stopping the JobMaster for job 'JOB_NAME' ().","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":97,"threadPriority":5}
...
// tons of success shutdown messages

Nothing shows up in the HA jobmanager logs either. Grepping for "cache", only
this line shows up (which looks unrelated):
{"instant":{"epochSecond":1650920969,"nanoOfSecond":62800},"thread":"Thread-33","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint","message":"Removing cache directory /tmp/flink-web-fd0f19f3-af71-4ee7-b697-853a9c80b8e1/flink-web-ui","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":9797,"threadPriority":5}


This happens consistently, for all async commands, such as `./bin/flink
savepoint $JOBID`.

Thanks,
Harsh

On Wed, Apr 20, 2022 at 5:40 AM huweihua  wrote:

> Stop-with-savepoint is an asynchronous operation: it first triggers a
> savepoint, and the operation is tracked in a cache.
>
> These error logs indicate that the operation can no longer be found in that
> cache. Could you provide more of the jobmanager.log so we can see when this
> operation was evicted from the cache?
>
> On Apr 20, 2022, at 12:21 AM, Harsh Shah  wrote:
>
> Hello Community,
>
> We are trying to adopt flink-stop instead of orchestrating the stop
> manually. While using the command, it failed with the error "Operation not
> found under key:
> org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a".
> The job eventually stopped, but I wanted to see how I could execute the
> command successfully! Any help or direction to debug is appreciated.
>
> Information:
> * Flink version: 1.13.1
> * Flink on kubernetes with HA enabled. Standalone cluster.
>
> *command*: "/bin/flink stop $JOBID"
>
> *Stack trace of the command:*
>
> WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Suspending job "" with a savepoint.
>
> The program finished with the following exception:
> org.apache.flink.util.FlinkException: Could not stop with a savepoint job "".
> at org.apache.flink.client.cli.CliFrontend.lambda$sto
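The asynchronous flow huweihua describes (trigger the operation, then poll a
cached status entry) maps onto two Flink REST calls: a POST to
`/jobs/<jobid>/stop` that returns a trigger id, and repeated GETs to
`/jobs/<jobid>/savepoints/<triggerid>`. A minimal sketch of that flow follows;
the base address and ids are placeholders, and the polling parameters are
arbitrary, but the paths are the standard Flink REST endpoints:

```python
# Sketch of the two REST calls behind `flink stop` (stop-with-savepoint).
import json
import time
import urllib.request

def stop_url(base, job_id):
    # POST here triggers stop-with-savepoint; the response carries a
    # "request-id" (the trigger id of the async operation).
    return f"{base}/jobs/{job_id}/stop"

def savepoint_status_url(base, job_id, trigger_id):
    # GET here polls the async operation -- this is the lookup that fails
    # with "Operation not found under key" once the cache entry is gone.
    return f"{base}/jobs/{job_id}/savepoints/{trigger_id}"

def poll_savepoint(base, job_id, trigger_id, attempts=30, delay=2.0):
    """Poll until the operation reports COMPLETED or attempts run out."""
    for _ in range(attempts):
        with urllib.request.urlopen(
                savepoint_status_url(base, job_id, trigger_id)) as resp:
            body = json.load(resp)
        if body["status"]["id"] == "COMPLETED":
            return body
        time.sleep(delay)
    raise TimeoutError(f"operation {trigger_id} still in progress")
```

The status entry lives on the jobmanager side, so if the JobMaster shuts down
between the trigger and the poll, the poll can 404 even though the stop itself
succeeded, which matches the behavior reported above.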

flink-stop command fails with `Operation not found under key`

2022-04-19 Thread Harsh Shah
Hello Community,

We are trying to adopt flink-stop instead of orchestrating the stop
manually. While using the command, it failed with the error "Operation not
found under key:
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a".
The job eventually stopped, but I wanted to see how I could execute the
command successfully! Any help or direction to debug is appreciated.

Information:
* Flink version: 1.13.1
* Flink on kubernetes with HA enabled. Standalone cluster.

*command*: "/bin/flink stop $JOBID"

*Stack trace of the command:*

WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Suspending job "" with a savepoint.

The program finished with the following exception:
org.apache.flink.util.FlinkException: Could not stop with a savepoint job "".
    at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:581)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
    at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:569)
    at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1069)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
    at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.rest.util.RestClientException: [org.apache.flink.runtime.rest.NotFoundException: Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a
    at org.apache.flink.runtime.rest.handler.async.AbstractAsynchronousOperationHandlers$StatusHandler.handleRequest(AbstractAsynchronousOperationHandlers.java:182)
    at org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler.handleRequest(SavepointHandlers.java:219)
    at org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:83)
    at org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:195)
    at org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.lambda$channelRead0$0(LeaderRetrievalHandler.java:83)
    at java.base/java.util.Optional.ifPresent(Unknown Source)
    at org.apache.flink.util.OptionalConsumer.ifPresent(OptionalConsumer.java:45)
    at org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:80)
    at org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:49)
    at org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)


*Jobs manager logs:*

{"instant":{"epochSecond":1650384228,"nanoOfSecond":866181000},"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobMaster","message":"Triggering stop-with-savepoint for job .","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":28,"threadPriority":5}
{"instant":{"epochSecond":1650384229,"nanoOfSecond":30779000},"thread":"flink-rest-server-netty-worker-thread-1","level":"ERROR","loggerName":"org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler","message":"Exception occurred in REST handler: Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@9fbfc15a","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":95,"threadPriority":5}
... # rest of the messages indicate success
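Since the jobmanager logs show the job did reach FINISHED before the failing
status poll, one pragmatic workaround is to fall back to checking the job
state itself when the status poll returns "Operation not found under key".
This is a sketch, not what the CLI does; the helper names are made up, but
`/jobs/<jobid>` is the standard Flink REST endpoint for job details:

```python
# Hypothetical fallback: if the savepoint-status poll 404s, check whether
# the job itself already reached a terminal state.
import json
import urllib.request

TERMINAL_STATES = {"FINISHED", "CANCELED", "FAILED"}

def job_details_url(base, job_id):
    # Standard Flink REST path for job details, including "state".
    return f"{base}/jobs/{job_id}"

def is_terminal(job_details):
    """True if a /jobs/<jobid> payload reports a terminal state."""
    return job_details.get("state") in TERMINAL_STATES

def stopped_despite_404(base, job_id):
    # Treat the stop as successful if the job is terminal, even though
    # the async operation's cache entry is gone.
    with urllib.request.urlopen(job_details_url(base, job_id)) as resp:
        return is_terminal(json.load(resp))
```

Whether a savepoint was actually written still has to be confirmed separately
(e.g. by checking the savepoint target directory), since a FINISHED state
alone does not prove the savepoint completed.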


Flink k8 HA mode + checkpoint management

2021-08-03 Thread Harsh Shah
Hello,

I am trying to use Flink HA mode inside Kubernetes in standalone mode.
The Job ID is always constant, "". In
situation where we restart the job (Not from a check-point or savepoint),
we see errors like
"""

Caused by: org.apache.hadoop.fs.FileAlreadyExistsException:
'/flink-checkpoints//chk-1/_metadata'
already exists

"""
where no checkpoints have been created since the restart of the job.

My question:
* Is the recommended way to set a new unique "checkpoint path" every time
we update the job and restart the necessary k8s resources (when not restarting
from a checkpoint/savepoint)? Or to GC checkpoints during deletion and reload
from a savepoint if required? Looking for a standard recommendation.
* Is there a way I can override the JobID to be unique, to indicate that it is
a complete restart in HA mode?
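On the first question, one common approach is to derive a checkpoint directory
that is unique per deployment, so a fresh (non-restored) start can never
collide with a `_metadata` file left by a previous run under the constant HA
job id. A minimal sketch; the helper name and deploy-id scheme are made up:

```python
# Hypothetical helper: build a per-deployment value for
# state.checkpoints.dir so restarts with the constant HA job id do not
# collide with checkpoints from an earlier deployment.
from datetime import datetime, timezone

def checkpoint_dir(base, job_name, deploy_id=None):
    """Return a checkpoint path unique to this deployment."""
    if deploy_id is None:
        # Default to a UTC timestamp; a k8s deployment revision or git
        # SHA works just as well, as long as it changes per rollout.
        deploy_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{base}/{job_name}/{deploy_id}"
```

The result would be injected into flink-conf.yaml at deploy time, e.g.
`state.checkpoints.dir: s3://my-bucket/flink-checkpoints/my-job/r42` for a
rollout tagged `r42` (bucket and names are placeholders).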


Thanks,
Harsh