[jira] [Commented] (FLINK-34696) GSRecoverableWriterCommitter is generating excessive data blobs
[ https://issues.apache.org/jira/browse/FLINK-34696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827557#comment-17827557 ] Tobias Hofer commented on FLINK-34696: -- Will the temporary bucket prevent the use of object compose, so that GSRecoverableWriterCommitter no longer composes objects recursively? To ask more precisely: will using a temporary bucket simply double the amount of data written (once to the temporary bucket and once to the final target)? > GSRecoverableWriterCommitter is generating excessive data blobs > --- > > Key: FLINK-34696 > URL: https://issues.apache.org/jira/browse/FLINK-34696 > Project: Flink > Issue Type: Improvement > Components: Connectors / FileSystem >Reporter: Simon-Shlomo Poil >Priority: Major > > The `composeBlobs` method in > `org.apache.flink.fs.gs.writer.GSRecoverableWriterCommitter` is designed to > merge multiple small blobs into a single large blob using Google Cloud > Storage's compose method. This process is iterative, combining the result > from the previous iteration with 31 new blobs until all blobs are merged. > Upon completion of the composition, the method proceeds to remove the > temporary blobs. > *Issue:* > This methodology results in significant, unnecessary data storage consumption > during the blob composition process, incurring considerable costs due to > Google Cloud Storage pricing models. > *Example to Illustrate the Problem:* > - Initial state: 64 blobs, each 1 GB in size (totaling 64 GB). > - After 1st step: 32 blobs are merged into a single blob, increasing total > storage to 96 GB (64 original + 32 GB new). > - After 2nd step: The newly created 32 GB blob is merged with 31 more blobs, > raising the total to 159 GB. > - After 3rd step: The final blob is merged, culminating in a total of 223 GB > to combine the original 64 GB of data. This results in an overhead of 159 GB.
> *Impact:* > This inefficiency has a profound impact, especially at scale, where terabytes > of data can incur overheads in the petabyte range, leading to unexpectedly > high costs. Additionally, we have observed an increase in storage exceptions > thrown by the Google Storage library, potentially linked to this issue. > *Suggested Solution:* > To mitigate this problem, we propose modifying the `composeBlobs` method to > immediately delete source blobs once they have been successfully combined. > This change could significantly reduce data duplication and associated costs. > However, the implications for data recovery and integrity need careful > consideration to ensure that this optimization does not compromise the > ability to recover data in case of a failure during the composition process. > *Steps to Reproduce:* > 1. Initiate the blob composition process in an environment with a significant > number of blobs (e.g., 64 blobs of 1 GB each). > 2. Observe the temporary increase in data storage as blobs are iteratively > combined. > 3. Note the final amount of data storage used compared to the initial total > size of the blobs. > *Expected Behavior:* > The blob composition process should minimize unnecessary data storage use, > efficiently managing resources to combine blobs without generating excessive > temporary data overhead. > *Actual Behavior:* > The current implementation results in significant temporary increases in data > storage, leading to high costs and potential system instability due to > frequent storage exceptions. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
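The storage arithmetic in the issue's 64-blob example can be reproduced with a small simulation. This is only a sketch under the assumptions stated in the issue (the first step composes up to 32 blobs, each later step composes the previous result with up to 31 new blobs); the class and method names are hypothetical and are not taken from the Flink code base. The second method models the suggested fix of deleting sources right after each successful step.

```java
final class ComposeOverhead {

    /** Storage (GB) held when the last step finishes if nothing is deleted before then. */
    public static long storageWithoutCleanup(int blobCount, long blobSizeGb, int maxSources) {
        if (maxSources < 2) {
            throw new IllegalArgumentException("maxSources must be at least 2");
        }
        long total = blobCount * blobSizeGb; // the original blobs stay in place
        long composed = 0;                   // size of the running composed blob
        int remaining = blobCount;
        boolean firstStep = true;
        while (remaining > 0) {
            // Later steps reserve one source slot for the previous result.
            int take = Math.min(remaining, firstStep ? maxSources : maxSources - 1);
            composed += take * blobSizeGb;
            remaining -= take;
            total += composed;               // every step writes a new, larger blob
            firstStep = false;
        }
        return total;
    }

    /** Peak storage (GB) if the sources of each step are deleted as soon as it succeeds. */
    public static long peakWithImmediateDelete(int blobCount, long blobSizeGb, int maxSources) {
        if (maxSources < 2) {
            throw new IllegalArgumentException("maxSources must be at least 2");
        }
        long live = blobCount * blobSizeGb;  // unchanged after each delete
        long composed = 0;
        long peak = live;
        int remaining = blobCount;
        boolean firstStep = true;
        while (remaining > 0) {
            int take = Math.min(remaining, firstStep ? maxSources : maxSources - 1);
            long next = composed + take * blobSizeGb;
            peak = Math.max(peak, live + next); // result and its sources coexist briefly
            composed = next;
            remaining -= take;
            firstStep = false;
        }
        return peak;
    }

    public static void main(String[] args) {
        long total = storageWithoutCleanup(64, 1, 32);
        System.out.println("without cleanup: " + total + " GB, overhead " + (total - 64) + " GB");
        System.out.println("immediate delete, peak: " + peakWithImmediateDelete(64, 1, 32) + " GB");
    }
}
```

For the issue's numbers this yields 223 GB (159 GB overhead) without intermediate cleanup, and a peak of 128 GB when sources are deleted after each step. The recovery concerns raised in the issue are not modelled here.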
[jira] [Commented] (FLINK-26808) [flink v1.14.2] Submit jobs via REST API not working after set web.submit.enable: false
[ https://issues.apache.org/jira/browse/FLINK-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696167#comment-17696167 ] Tobias Hofer commented on FLINK-26808: -- The expectation is that Flink behaves as documented. In [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/] I can read the following: {quote}{{web.submit.enable}}: Enables uploading and starting jobs through the Flink UI _(true by default)_. Please note that even when this is disabled, session clusters still accept jobs through REST requests (HTTP calls). This flag only guards the feature to upload jobs in the UI. {quote} I would like to be able to disable submission via the UI but still have job submission via REST work. This behavior impacts the Flink Kubernetes Operator. It fails with {color:#174ea6}Warning | SESSIONJOBEXCEPTION | org.apache.flink.runtime.rest.util.RestClientException: [Not found: /v1/jars/upload]"{color} > [flink v1.14.2] Submit jobs via REST API not working after set > web.submit.enable: false > --- > > Key: FLINK-26808 > URL: https://issues.apache.org/jira/browse/FLINK-26808 > Project: Flink > Issue Type: Bug > Components: Client / Job Submission >Affects Versions: 1.14.2 >Reporter: Luís Costa >Priority: Minor > > Greetings, > I am using flink version 1.14.2 and after changing web.submit.enable to > false, job submission via REST API is no longer working.
> The app that uses flink receives a 404 with "Not found: /jars/upload" > Looking into the > [documentation|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/], > I saw that web.upload.dir is only used if {{web.submit.enable}} is true; > otherwise JOB_MANAGER_WEB_TMPDIR_KEY is used > A curl to /jars returns: > {code:java} > curl -X GET http://localhost:8081/jars > HTTP/1.1 404 Not Found > {"errors":["Unable to load requested file /jars."]} {code} > Found this issue related to option web.submit.enable > https://issues.apache.org/jira/browse/FLINK-13799 > Could you please let me know whether this is an issue that you are already aware of? > Thanks in advance > Best regards, > Luís Costa
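The curl check from the issue can also be run programmatically. The following is a minimal sketch using the JDK's `java.net.http` client; the class name, the `diagnose` helper and its messages are hypothetical, and the base URL is whatever address the session cluster's REST endpoint listens on (the issue uses `http://localhost:8081`). It only issues `GET /jars` and interprets the status code the way the reporter observed it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class JarEndpointProbe {

    /** Maps the HTTP status of GET /jars to a short diagnosis (wording is illustrative). */
    public static String diagnose(int status) {
        if (status == 200) {
            return "jar endpoints are registered";
        }
        if (status == 404) {
            // Behaviour reported in this issue with web.submit.enable: false
            return "jar endpoints are not registered";
        }
        return "unexpected status " + status;
    }

    /** Sends GET <baseUrl>/jars and returns the HTTP status code. */
    public static int probe(String baseUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/jars")).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: JarEndpointProbe <base-url>, e.g. http://localhost:8081");
            return;
        }
        System.out.println(diagnose(probe(args[0])));
    }
}
```

Running it against a session cluster before and after changing `web.submit.enable` would show whether the jar endpoints disappear together with the UI upload feature, which is the behavior described above.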
[jira] [Commented] (FLINK-31064) Error while retrieving the leader gateway
[ https://issues.apache.org/jira/browse/FLINK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689279#comment-17689279 ] Tobias Hofer commented on FLINK-31064: -- Hi [~gyfora], I am no longer sure that the exception in the description really represents the problem, because I have since seen the very same exception in the logs of a properly running session. The only other log entry that may be a hint can be found in the operator log. {color:#174ea6}ERROR | ReconcilerExecutor-flinksessionjobcontroller-31 | org.apache.flink.kubernetes.operator.observer.sessionjob.FlinkSessionJobObserver - Missing Session Job{color} But the next output says: {color:#174ea6}nothing to do...{color} even though no job is running. Please find my session and job config attached. > Error while retrieving the leader gateway > - > > Key: FLINK-31064 > URL: https://issues.apache.org/jira/browse/FLINK-31064 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.3.1 > Environment: > >Reporter: Tobias Hofer >Priority: Major > Attachments: jobmanager_log.txt, my-job.yaml, my-session.yaml > > > JobManager enters corrupt state after restart (e.g. increasing restartNonce). > {code:java} > WARN | flink-akka.actor.default-dispatcher-5 | RpcGatewayRetriever > | Error while retrieving the leader gateway. Retrying to connect to > akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*. > org.apache.flink.util.concurrent.FutureUtils$RetryException: Could not > complete the operation. Number of retries has been exhausted.
> at > org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$6(FutureUtils.java:293) > at > java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown > Source) > at > java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown > Source) > at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown > Source) > at > java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown > Source) > at > org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1275) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92) > ... > Caused by: java.util.concurrent.CompletionException: > org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not > connect to rpc endpoint under address > akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*. > at > org.apache.flink.runtime.rpc.akka.AkkaRpcService.lambda$resolveActorAddress$11(AkkaRpcService.java:604) > at > scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:59) > ... 5 more > Caused by: org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: > Could not connect to rpc endpoint under address > akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*. > ... 
7 more > Caused by: akka.actor.ActorNotFound: Actor not found for: > ActorSelection[Anchor(akka://flink/), Path(/user/rpc/resourcemanager_*)] > at > akka.actor.ActorSelection.$anonfun$resolveOne$1(ActorSelection.scala:74) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > at > akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63) > at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:81) > at > akka.dispatch.internal.SameThreadExecutionContext$$anon$1.unbatchedExecute(SameThreadExecutionContext.scala:21) > at akka.dispatch.BatchingExecutor.execute(BatchingExecutor.scala:130) > at akka.dispatch.BatchingExecutor.execute$(BatchingExecutor.scala:124) > at > akka.dispatch.internal.SameThreadExecutionContext$$anon$1.execute(SameThreadExecutionContext.scala:20) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) > at > scala.concurrent.impl.Promise$DefaultPromise.dispatchOrAddCallback(Promise.scala:312) > at > scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:303) > at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:72) > at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:89) > at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:130) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcService.resolveActorAddress(AkkaRpcService.java:598) > at > org.apache.flin
[jira] [Updated] (FLINK-31064) Error while retrieving the leader gateway
[ https://issues.apache.org/jira/browse/FLINK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tobias Hofer updated FLINK-31064: - Attachment: my-job.yaml > Error while retrieving the leader gateway > - > > Key: FLINK-31064 > URL: https://issues.apache.org/jira/browse/FLINK-31064 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.3.1 > Environment: > >Reporter: Tobias Hofer >Priority: Major > Attachments: jobmanager_log.txt, my-job.yaml, my-session.yaml > > > JobManager enters corrupt state after restart (e.g. increasing restartNonce).
[jira] [Updated] (FLINK-31064) Error while retrieving the leader gateway
[ https://issues.apache.org/jira/browse/FLINK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tobias Hofer updated FLINK-31064: - Attachment: my-session.yaml > Error while retrieving the leader gateway > - > > Key: FLINK-31064 > URL: https://issues.apache.org/jira/browse/FLINK-31064 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.3.1 > Environment: > >Reporter: Tobias Hofer >Priority: Major > Attachments: jobmanager_log.txt, my-session.yaml > > > JobManager enters corrupt state after restart (e.g. increasing restartNonce).
[jira] [Updated] (FLINK-31064) Error while retrieving the leader gateway
[ https://issues.apache.org/jira/browse/FLINK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tobias Hofer updated FLINK-31064: - Description: JobManager enters corrupt state after restart (e.g. increasing restartNonce). {code:java} WARN | flink-akka.actor.default-dispatcher-5 | RpcGatewayRetriever | Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*. org.apache.flink.util.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted. at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$6(FutureUtils.java:293) at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1275) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92) ... Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not connect to rpc endpoint under address akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*. at org.apache.flink.runtime.rpc.akka.AkkaRpcService.lambda$resolveActorAddress$11(AkkaRpcService.java:604) at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:59) ... 
5 more Caused by: org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not connect to rpc endpoint under address akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*. ... 7 more Caused by: akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka://flink/), Path(/user/rpc/resourcemanager_*)] at akka.actor.ActorSelection.$anonfun$resolveOne$1(ActorSelection.scala:74) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:81) at akka.dispatch.internal.SameThreadExecutionContext$$anon$1.unbatchedExecute(SameThreadExecutionContext.scala:21) at akka.dispatch.BatchingExecutor.execute(BatchingExecutor.scala:130) at akka.dispatch.BatchingExecutor.execute$(BatchingExecutor.scala:124) at akka.dispatch.internal.SameThreadExecutionContext$$anon$1.execute(SameThreadExecutionContext.scala:20) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) at scala.concurrent.impl.Promise$DefaultPromise.dispatchOrAddCallback(Promise.scala:312) at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:303) at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:72) at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:89) at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:130) at org.apache.flink.runtime.rpc.akka.AkkaRpcService.resolveActorAddress(AkkaRpcService.java:598) at org.apache.flink.runtime.rpc.akka.AkkaRpcService.connectInternal(AkkaRpcService.java:549) at org.apache.flink.runtime.rpc.akka.AkkaRpcService.connect(AkkaRpcService.java:232) at org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever.lambda$null$0(RpcGatewayRetriever.java:66) at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(Unknown Source) at java.base/java.util.concurrent.CompletableFuture.thenCompose(Unknown Source) at 
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever.lambda$createGateway$1(RpcGatewayRetriever.java:64) at org.apache.flink.util.concurrent.FutureUtils.retryOperationWithDelay(FutureUtils.java:259) at org.apache.flink.util.concurrent.FutureUtils.lambda$null$4(FutureUtils.java:279) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.base/java.util.concurrent.FutureTask.run(Unknown Source) at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:171) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$withContextClassLoader$0(Cl
[jira] [Commented] (FLINK-31064) Error while retrieving the leader gateway
[ https://issues.apache.org/jira/browse/FLINK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688451#comment-17688451 ] Tobias Hofer commented on FLINK-31064: -- In the meantime I found the following mailing list entry: https://www.mail-archive.com/user@flink.apache.org/msg37424.html It suggests the following update strategy:
{code:java}
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
{code}
But the operator CRD does not allow me to change the update strategy (or at least I don't know how). > Error while retrieving the leader gateway > - > > Key: FLINK-31064 > URL: https://issues.apache.org/jira/browse/FLINK-31064 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.3.1 > Environment: > >Reporter: Tobias Hofer >Priority: Major > Attachments: jobmanager_log.txt > > > JobManager enters corrupt state after restart (e.g. increasing restartNonce). > {code:java} > WARN | flink-akka.actor.default-dispatcher-5 | RpcGatewayRetriever > | Error while retrieving the leader gateway. Retrying to connect to > akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*. org.apache.flink.util.concurrent.FutureUtils$RetryException: > Could not complete the operation. Number of retries has been exhausted.
> at > org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$6(FutureUtils.java:293) > at > java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown > Source) at > java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) > at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown > Source) at > java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown > Source) at > org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1275) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92) > at > java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown > Source) at > java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) > at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown > Source) at > java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown > Source) at > scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:61) > at > scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:53) > at > java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown > Source) at > java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) > at > java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown > Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not > connect to rpc endpoint under address >
akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*. at > org.apache.flink.runtime.rpc.akka.AkkaRpcService.lambda$resolveActorAddress$11(AkkaRpcService.java:604) > at > scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:59) > ... 5 more Caused by: > org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not > connect to rpc endpoint under address > akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*. ... 7 > more Caused by: akka.actor.ActorNotFound: Actor not found for: > ActorSelection[Anchor(akka://flink/), Path(/user/rpc/resourcemanager_*)] > at akka.actor.ActorSelection.$anonfun$resolveOne$1(ActorSelection.scala:74) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) at > akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63) > at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:81) > at > akka.dispatch.internal.SameThreadExecutionContext$$anon$1.unbatchedExecute(SameThreadExecutionContext.scala:21) > at akka.dispatch.BatchingExecutor.execute(BatchingExecutor.scala:130) > at akka.dispatch.BatchingExecutor.execute$(BatchingExecutor.scala:124) at > akka.dispatch.internal.SameThreadExecutionContext$$anon$1.execute(SameThreadExecu
[jira] [Commented] (FLINK-31064) Error while retrieving the leader gateway
[ https://issues.apache.org/jira/browse/FLINK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688449#comment-17688449 ] Tobias Hofer commented on FLINK-31064: -- I first tried to get help on the user mailing list, without success so far: https://lists.apache.org/thread/bb4vnw8c7l4725tlokz9xp2bjjdkgmo1 > Error while retrieving the leader gateway > - > > Key: FLINK-31064 > URL: https://issues.apache.org/jira/browse/FLINK-31064 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.3.1 > Environment: > >Reporter: Tobias Hofer >Priority: Major > Attachments: jobmanager_log.txt > > > JobManager enters corrupt state after restart (e.g. increasing restartNonce).
[jira] [Created] (FLINK-31064) Error while retrieving the leader gateway
Tobias Hofer created FLINK-31064:
---------------------------------

             Summary: Error while retrieving the leader gateway
                 Key: FLINK-31064
                 URL: https://issues.apache.org/jira/browse/FLINK-31064
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.3.1
         Environment:
            Reporter: Tobias Hofer
         Attachments: jobmanager_log.txt


JobManager enters corrupt state after restart (e.g. increasing restartNonce).
{code:java}
WARN | flink-akka.actor.default-dispatcher-5 | RpcGatewayRetriever | Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*.
org.apache.flink.util.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
    at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$6(FutureUtils.java:293)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
    at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1275)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
    at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:61)
    at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:53)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not connect to rpc endpoint under address akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*.
    at org.apache.flink.runtime.rpc.akka.AkkaRpcService.lambda$resolveActorAddress$11(AkkaRpcService.java:604)
    at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:59)
    ... 5 more
Caused by: org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not connect to rpc endpoint under address akka.tcp://flink@my-session.flink:6123/user/rpc/resourcemanager_*.
    ... 7 more
Caused by: akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka://flink/), Path(/user/rpc/resourcemanager_*)]
    at akka.actor.ActorSelection.$anonfun$resolveOne$1(ActorSelection.scala:74)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63)
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:81)
    at akka.dispatch.internal.SameThreadExecutionContext$$anon$1.unbatchedExecute(SameThreadExecutionContext.scala:21)
    at akka.dispatch.BatchingExecutor.execute(BatchingExecutor.scala:130)
    at akka.dispatch.BatchingExecutor.execute$(BatchingExecutor.scala:124)
    at akka.dispatch.internal.SameThreadExecutionContext$$anon$1.execute(SameThreadExecutionContext.scala:20)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
    at scala.concurrent.impl.Promise$DefaultPromise.dispatchOrAddCallback(Promise.scala:312)
    at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:303)
    at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:72)
    at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:89)
    at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:130)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcService.resolveActorAddress(AkkaRpcService.java:598)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcService.connectInternal(AkkaRpcService.java:549)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcService.connect(AkkaRpcService.java:232)
[jira] [Commented] (FLINK-30046) Cannot use digest in Helm chart to reference image
[ https://issues.apache.org/jira/browse/FLINK-30046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641239#comment-17641239 ]

Tobias Hofer commented on FLINK-30046:
--------------------------------------

Hi [~mbalassi],

I'm sorry for the situation. I am trying to contact the webmaster to get a copy.

The guide mainly proposed using a single value named {{version}} to handle tags and digests transparently. It also contained a proposal for the changes to the Helm chart templates. Nothing earth-shattering if you're used to creating Helm charts (but I'm not).

> Cannot use digest in Helm chart to reference image
> --------------------------------------------------
>
>                 Key: FLINK-30046
>                 URL: https://issues.apache.org/jira/browse/FLINK-30046
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.2.0, kubernetes-operator-1.3.0, kubernetes-operator-1.2.1
>            Reporter: Tobias Hofer
>            Assignee: Márton Balassi
>            Priority: Major
>             Fix For: kubernetes-operator-1.3.0
>
>
> Images can be referenced by tag only.
> Referencing images by digest has a number of advantages:
> # Avoid unexpected or undesirable image changes.
> # Increase security and awareness by knowing the specific image running in your environment.
> The following document describes a template to handle tags and digests:
> [Adding Image Digest References to Your Helm Charts|https://blog.andyserver.com/2021/09/adding-image-digest-references-to-your-helm-charts/]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-30046) Cannot use digest in Helm chart to reference image
Tobias Hofer created FLINK-30046:
---------------------------------

             Summary: Cannot use digest in Helm chart to reference image
                 Key: FLINK-30046
                 URL: https://issues.apache.org/jira/browse/FLINK-30046
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.2.0, kubernetes-operator-1.3.0, kubernetes-operator-1.2.1
            Reporter: Tobias Hofer


Images can be referenced by tag only.
Referencing images by digest has a number of advantages:
# Avoid unexpected or undesirable image changes.
# Increase security and awareness by knowing the specific image running in your environment.

The following document describes a template to handle tags and digests:
[Adding Image Digest References to Your Helm Charts|https://blog.andyserver.com/2021/09/adding-image-digest-references-to-your-helm-charts/]
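
As a rough illustration of the idea behind this issue (one value that may hold either a tag or a digest), here is a minimal Python sketch. It is not the Helm template from the linked guide and not the operator chart's actual implementation. The function name `image_reference` and the example repository are hypothetical, and the sketch assumes digests always have the form `sha256:` followed by 64 hexadecimal characters. The only rule it encodes is the standard one for image references: a digest is appended with `@`, a tag with `:`.

```python
import re

# Assumption for this sketch: only sha256 digests are recognised,
# written as "sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def image_reference(repository: str, version: str) -> str:
    """Build an image reference from a repository and a tag or digest.

    Hypothetical helper for illustration; not part of Flink or Helm.
    """
    if not repository or not version:
        raise ValueError("repository and version must be non-empty")
    # Digests are joined with '@', tags with ':'.
    separator = "@" if _DIGEST_RE.match(version) else ":"
    return f"{repository}{separator}{version}"


if __name__ == "__main__":
    repo = "example.org/flink-kubernetes-operator"  # hypothetical repository
    print(image_reference(repo, "1.3.1"))
    print(image_reference(repo, "sha256:" + "0" * 64))
```

A chart template would make the same decision on the configured value, so users could switch between a tag and a digest without changing any other setting.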