[jira] [Assigned] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiangjie Qin reassigned FLINK-18641: Assignee: Jiangjie Qin > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Assignee: Jiangjie Qin >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to ensure > data recovery. The checkpoint recovery tests run fine in > Flink 1.10, but they hit the issues below in Flink 1.11, causing the tests to time out. > We suspect this is related to the checkpoint coordinator threading model changes in > Flink 1.11. > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. 
> at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
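The `Preconditions.checkState` failure in the trace above fires when finalization starts before every acknowledgement has arrived. A minimal, self-contained model of that precondition (hypothetical class and method names, not Flink's real `PendingCheckpoint`) where finalization requires both all task acks and all master-hook (e.g. Pravega reader hook) acks:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the "Pending checkpoint has not been fully acknowledged yet"
// precondition. Names are hypothetical; Flink's real class is PendingCheckpoint.
public class PendingCheckpointSketch {
    private final Set<String> pendingTasks;
    private final Set<String> pendingMasterHooks;

    public PendingCheckpointSketch(Set<String> tasks, Set<String> masterHooks) {
        this.pendingTasks = new HashSet<>(tasks);
        this.pendingMasterHooks = new HashSet<>(masterHooks);
    }

    public void acknowledgeTask(String task) {
        pendingTasks.remove(task);
    }

    public void acknowledgeMasterHook(String hook) {
        pendingMasterHooks.remove(hook);
    }

    public boolean isFullyAcknowledged() {
        return pendingTasks.isEmpty() && pendingMasterHooks.isEmpty();
    }

    public void finalizeCheckpoint() {
        // Mirrors the checkState(...) that produces the error above: if the
        // coordinator races ahead of a MasterTriggerRestoreHook ack, this throws.
        if (!isFullyAcknowledged()) {
            throw new IllegalStateException(
                    "Pending checkpoint has not been fully acknowledged yet");
        }
        // ... write checkpoint metadata, register the completed checkpoint, etc.
    }
}
```

Under this model, the 1.11 symptom corresponds to `finalizeCheckpoint()` being invoked while the hook's acknowledgement is still outstanding.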
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166925#comment-17166925 ] Jiangjie Qin commented on FLINK-18641: -- [~SleePy] Sorry for the late reply. I have a fix ready and am working on the tests. I will submit a patch this week. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18695) Allow NettyBufferPool to allocate heap buffers
[ https://issues.apache.org/jira/browse/FLINK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166914#comment-17166914 ] Zhijiang commented on FLINK-18695: -- I agree with [~sewen]'s above analysis. Regarding the two options proposed by [~chesnay], I prefer the first one, allowing heap buffer allocation in NettyBufferPool, for the following reasons: * The current lack of heap buffer support indeed limits upgrading the Netty version. Beyond the current SSL concern, other improvements requiring heap memory may come up in the future. * The main direct memory overhead was already resolved in Flink 1.11. I guess the heap memory overhead should be equivalent to the direct memory one; if so, the impact should be tiny or even negligible. * If we are not quite sure about the heap memory overhead, we can also monitor and collect statistics on heap usage inside NettyBufferPool, as was done for direct memory before. > Allow NettyBufferPool to allocate heap buffers > -- > > Key: FLINK-18695 > URL: https://issues.apache.org/jira/browse/FLINK-18695 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network >Reporter: Chesnay Schepler >Priority: Major > Fix For: 1.12.0 > > > In 4.1.43, Netty changed their SslHandler to always use heap buffers > for JDK SSLEngine implementations, to avoid an additional memory copy. > However, our {{NettyBufferPool}} forbids heap buffer allocations. > We will either have to allow heap buffer allocations, or create a custom > SslHandler implementation that does not use heap buffers (although this seems > ill-advised?). > /cc [~sewen] [~uce] [~NicoK] [~zjwang] [~pnowojski] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-15156) Warn user if System.exit() is called in user code
[ https://issues.apache.org/jira/browse/FLINK-15156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger updated FLINK-15156: --- Labels: starter (was: ) > Warn user if System.exit() is called in user code > - > > Key: FLINK-15156 > URL: https://issues.apache.org/jira/browse/FLINK-15156 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Reporter: Robert Metzger >Priority: Minor > Labels: starter > > It would make debugging Flink errors easier if we intercepted and logged > calls to System.exit() through the SecurityManager. > A user recently had an error where the JobManager was shutting down because > of a System.exit() in the user code: > https://lists.apache.org/thread.html/b28dabcf3068d489f38399c456c80d48569fcdf74b15f8bb95d532d0%40%3Cuser.flink.apache.org%3E > If I remember correctly, we have had such issues before. > I put this ticket into the "Runtime / Coordination" component, as it is > mostly about improving the usability / debuggability in that area. -- This message was sent by Atlassian Jira (v8.3.4#803005)
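The interception the ticket proposes could look roughly like this: a sketch (class name is hypothetical, not Flink code) of a SecurityManager whose `checkExit` logs a warning with the offending call site and then lets the exit proceed, matching the "warn, don't block" intent of the ticket. Note that `SecurityManager` is the JDK-8-era mechanism this 2020 ticket refers to; it is deprecated in newer JDKs.

```java
import java.security.Permission;

// Sketch of the proposed warning: checkExit logs the user-code call site of
// System.exit() before the JVM exit proceeds. Class name is hypothetical.
public class ExitLoggingSecurityManager extends SecurityManager {

    @Override
    public void checkPermission(Permission perm) {
        // Allow everything else; we only want to observe exit calls.
    }

    @Override
    public void checkExit(int status) {
        StringBuilder sb = new StringBuilder(
                "WARN: System.exit(" + status + ") was called in user code:");
        for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
            sb.append("\n\tat ").append(frame);
        }
        System.err.println(sb);
        // Not throwing a SecurityException keeps this a warning only;
        // throwing one here instead would veto the exit entirely.
    }
}
```

Installing it via `System.setSecurityManager(new ExitLoggingSecurityManager())` before invoking user code would then surface the stack trace of the `System.exit()` call in the JobManager log.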
[jira] [Commented] (FLINK-18695) Allow NettyBufferPool to allocate heap buffers
[ https://issues.apache.org/jira/browse/FLINK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166902#comment-17166902 ] Ufuk Celebi commented on FLINK-18695: - I agree with Stephan that it should not have much of an impact. In particular, for the SSL handler, the comment in the [respective Netty PR|https://github.com/netty/netty/commit/39cc7a673939dec96258ff27f5b1874671838af0#diff-2fe7b22a8d650f1ea0bf56a809c061f9R303-R310] says that the direct buffer was copied back to a heap byte[] by the SSL engine anyway. Therefore, allowing heap buffers should be a net positive in the context of the SSL handler. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18742) Some configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166901#comment-17166901 ] Yang Wang commented on FLINK-18742: --- I think this is a valid fix. We should apply the dynamic config options first and then build the packaged program. cc [~kkl0u] > Some configuration args do not take effect at client > > > Key: FLINK-18742 > URL: https://issues.apache.org/jira/browse/FLINK-18742 > Project: Flink > Issue Type: Improvement > Components: Command Line Client >Affects Versions: 1.11.1 >Reporter: Matt Wang >Priority: Major > > Some configuration args from the command line do not take effect at the client. For > example, if a job sets _{color:#505f79}classloader.resolve-order{color}_ > to _{color:#505f79}parent-first{color}_, it takes effect on the TaskManager, but > not on the client. > The *FlinkUserCodeClassLoaders* is created before > _{color:#505f79}getEffectiveConfiguration(){color}_ is called in > {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the > _{color:#505f79}Configuration{color}_ used by > _{color:#505f79}PackagedProgram{color}_ does not include the configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
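The fix described above is an ordering change. A minimal, Flink-free model (all names hypothetical, configuration reduced to a string map) of computing the effective configuration from dynamic `-D` overrides first, so that class-loading settings are read from the merged result before the user-code class loader is built:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (hypothetical names) of the proposed ordering: merge command-line
// overrides into the base configuration FIRST, then read class-loading
// settings from the merged result when constructing the user class loader.
public class EffectiveConfigSketch {

    static Map<String, String> effectiveConfiguration(
            Map<String, String> flinkConfYaml, Map<String, String> dynamicArgs) {
        Map<String, String> merged = new HashMap<>(flinkConfYaml);
        merged.putAll(dynamicArgs); // -D options win over flink-conf.yaml
        return merged;
    }

    static String resolveOrder(Map<String, String> effectiveConf) {
        // The bug: reading this from the un-merged config ignores -D overrides.
        return effectiveConf.getOrDefault("classloader.resolve-order", "child-first");
    }
}
```

The reported symptom corresponds to calling `resolveOrder` on the raw `flinkConfYaml` map before the merge has happened.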
[jira] [Created] (FLINK-18749) Correct dependencies in Kubernetes pom and notice file
Yang Wang created FLINK-18749: - Summary: Correct dependencies in Kubernetes pom and notice file Key: FLINK-18749 URL: https://issues.apache.org/jira/browse/FLINK-18749 Project: Flink Issue Type: Bug Components: Deployment / Kubernetes Affects Versions: 1.11.1 Reporter: Yang Wang Fix For: 1.11.2 While developing this PR[1], I found some incorrect dependency versions in the NOTICE file. Also, {{com.mifmif}} should be removed from flink-kubernetes/pom.xml since we never use it. [1]. [https://github.com/apache/flink/pull/12995/commits/8519f65321ba24c5164196a67a05d98fb268f490] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-18748: -- Description: Inspired by a [user-zh email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html] After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. was: Inspired by an [user-zh email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]] After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
> Savepoint would be queued unexpected
> --
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0, 1.11.1
> Reporter: Congxian Qiu(klion26)
> Priority: Major
>
> Inspired by a [user-zh email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1.
> if (isTriggering || queuedRequests.isEmpty()) {
>     return Optional.empty();
> }
> // 2. too many ongoing checkpoints/savepoints
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>     return Optional.of(queuedRequests.first())
>         .filter(CheckpointTriggerRequest::isForce)
>         .map(unused -> queuedRequests.pollFirst());
> }
> // 3. check the timestamp of the last completed checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>     return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3.
> I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
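The change proposed in the description can be sketched as follows: a stripped-down model (not the real `CheckpointRequestDecider`; names and structure simplified) in which a savepoint request skips the min-pause check of step 3 whenever step 2's concurrency limit is not hit, while periodic checkpoints still honour the pause:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Simplified decider model with the proposed fix: savepoints bypass the
// min-pause delay (step 3) as long as step 2's concurrency limit allows.
public class RequestDeciderSketch {

    static final class Request {
        final boolean isSavepoint;
        Request(boolean isSavepoint) { this.isSavepoint = isSavepoint; }
    }

    private final Deque<Request> queuedRequests = new ArrayDeque<>();
    private final int maxConcurrentCheckpointAttempts;
    private final long minPauseMillis;

    RequestDeciderSketch(int maxConcurrent, long minPauseMillis) {
        this.maxConcurrentCheckpointAttempts = maxConcurrent;
        this.minPauseMillis = minPauseMillis;
    }

    void enqueue(Request r) { queuedRequests.add(r); }

    Optional<Request> chooseRequestToExecute(
            int pendingCheckpoints, long nowMillis, long lastCompletionMillis) {
        if (queuedRequests.isEmpty()) {
            return Optional.empty();
        }
        // step 2: the concurrency limit still applies to every request kind
        if (pendingCheckpoints >= maxConcurrentCheckpointAttempts) {
            return Optional.empty();
        }
        // step 3 with the fix: only non-savepoint requests honour the min pause
        boolean tooEarly = nowMillis - lastCompletionMillis < minPauseMillis;
        if (tooEarly && !queuedRequests.peekFirst().isSavepoint) {
            return Optional.empty(); // re-scheduled later in the real code
        }
        return Optional.of(queuedRequests.pollFirst());
    }
}
```

With this ordering, a savepoint requested right after a completed checkpoint triggers immediately instead of being queued by the pause logic.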
[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-18748: -- Description: Inspired by an [user-zh email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]] After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. was: After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
> Savepoint would be queued unexpected
> --
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0, 1.11.1
> Reporter: Congxian Qiu(klion26)
> Priority: Major
>
> Inspired by an [user-zh email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1.
> if (isTriggering || queuedRequests.isEmpty()) {
>     return Optional.empty();
> }
> // 2. too many ongoing checkpoints/savepoints
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>     return Optional.of(queuedRequests.first())
>         .filter(CheckpointTriggerRequest::isForce)
>         .map(unused -> queuedRequests.pollFirst());
> }
> // 3. check the timestamp of the last completed checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>     return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3.
> I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18681) The jar package version conflict causes the task to continue to increase and grab resources
[ https://issues.apache.org/jira/browse/FLINK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166878#comment-17166878 ] Xintong Song commented on FLINK-18681: -- [~apach...@163.com], thanks for providing the screenshot and logs. I found the following warnings in the Yarn RM log.
{code:java}
2020-07-22 17:54:57,155 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wangty IP=x.x.x.61 OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id.PERMISSIONS=Unauthorized access or invalid container APPID=application_1590424616102_556340 CONTAINERID=container_1590424616102_556340_01_02
2020-07-22 17:54:58,157 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wangty IP=x.x.x.61 OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id.PERMISSIONS=Unauthorized access or invalid container APPID=application_1590424616102_556340 CONTAINERID=container_1590424616102_556340_01_03
2020-07-22 17:54:59,160 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wangty IP=x.x.x.61 OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id.PERMISSIONS=Unauthorized access or invalid container APPID=application_1590424616102_556340 CONTAINERID=container_1590424616102_556340_01_04
{code}
It shows that Flink did release the containers, but the operations were rejected by the Yarn RM. The API Flink uses for releasing containers is {{AMRMClientAsync#releaseAssignedContainer}}, via the same client that successfully allocated containers from Yarn.
{code:java}
/**
 * Release containers assigned by the Resource Manager. If the app cannot use
 * the container or wants to give up the container then it can release them.
 * The app needs to make new requests for the released resource capability if
 * it still needs it. eg. it released non-local resources
 * @param containerId
 */
public abstract void releaseAssignedContainer(ContainerId containerId);
{code}
It seems to me that the Hadoop API did not work as expected. I would suggest getting some help from the Apache Hadoop community. Pulling in [~Tao Yang], who is an Apache Hadoop committer and an expert in Yarn.
> The jar package version conflict causes the task to continue to increase and grab resources
> ---
>
> Key: FLINK-18681
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.11.0
> Reporter: wangtaiyang
> Priority: Major
> Attachments: appId.log, dependency.log, image-2020-07-28-15-32-51-851.png, yarn-hadoop-resourcemanager-x.x.x.15.log.2020-07-22-17.log
>
> When I submit a Flink task to Yarn, the default resource configuration is 1G & 1 core, but in fact this task keeps increasing its resources and grabbing more: 2 cores, 3 cores, and so on, up to 200 cores. Then I looked at the JM log and found the following error:
> {code:java}
> // code placeholder
> java.lang.NoSuchMethodError: org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder;
> at org.apache.flink.runtime.entrypoint.parser.CommandLineOptions.(CommandLineOptions.java:28) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> at org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) ~[?:1.8.0_191]
> ...
> java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.runtime.entrypoint.parser.CommandLineOptions
> at org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) ~[?:1.8.0_191]
> at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553) ~[?:1.8.0_191]
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_191]{code}
> Finally, it is confirmed that it is caused by the commons-cli version conflict, but the task reporting the error has not stopped and will continue to
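For `NoSuchMethodError` conflicts like the one above, it can help to ask the running JVM where a class was actually loaded from and whether the expected method is present on the classpath. A small diagnostic sketch (helper names are hypothetical, not part of Flink or Hadoop):

```java
import java.security.CodeSource;
import java.util.Arrays;

// Diagnostic sketch: report which jar a class came from and whether it exposes
// a given method, to pin down version conflicts such as the commons-cli one.
public class ClassOriginProbe {

    static String origin(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            CodeSource src = clazz.getProtectionDomain().getCodeSource();
            // Bootstrap/platform classes have no CodeSource.
            return src == null ? "bootstrap classpath" : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not found";
        }
    }

    static boolean hasMethod(String className, String methodName) {
        try {
            return Arrays.stream(Class.forName(className).getMethods())
                    .anyMatch(m -> m.getName().equals(methodName));
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

Running `origin("org.apache.commons.cli.Option")` and `hasMethod("org.apache.commons.cli.Option", "builder")` on the JobManager classpath would show whether an old commons-cli (pre-1.3, which lacks `Option.builder`) is shadowing the version Flink needs.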
[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166877#comment-17166877 ] Congxian Qiu(klion26) commented on FLINK-18748: --- [~pnowojski] [~roman_khachatryan] What do you think about this problem? If it is valid, I can help to fix it.
> Savepoint would be queued unexpected
> --
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0, 1.11.1
> Reporter: Congxian Qiu(klion26)
> Priority: Major
>
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1.
> if (isTriggering || queuedRequests.isEmpty()) {
>     return Optional.empty();
> }
> // 2. too many ongoing checkpoints/savepoints
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>     return Optional.of(queuedRequests.first())
>         .filter(CheckpointTriggerRequest::isForce)
>         .map(unused -> queuedRequests.pollFirst());
> }
> // 3. check the timestamp of the last completed checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>     return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3.
> I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18748) Savepoint would be queued unexpected
Congxian Qiu(klion26) created FLINK-18748: - Summary: Savepoint would be queued unexpected Key: FLINK-18748 URL: https://issues.apache.org/jira/browse/FLINK-18748 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.11.1, 1.11.0 Reporter: Congxian Qiu(klion26) After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-18713) Allow default ms unit for table.exec.mini-batch.allow-latency etc.
[ https://issues.apache.org/jira/browse/FLINK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jark Wu reassigned FLINK-18713: --- Assignee: hailong wang > Allow default ms unit for table.exec.mini-batch.allow-latency etc. > -- > > Key: FLINK-18713 > URL: https://issues.apache.org/jira/browse/FLINK-18713 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.11.0 >Reporter: hailong wang >Assignee: hailong wang >Priority: Major > Fix For: 1.12.0 > > > We use `scala.concurrent.duration.Duration.create` to parse timeStr in > `TableConfigUtils#getMillisecondFromConfigDuration` for the following properties: > {code:java} > table.exec.async-lookup.timeout > table.exec.source.idle-timeout > table.exec.mini-batch.allow-latency > table.exec.emit.early-fire.delay > table.exec.emit.late-fire.delay{code} > And it must have the unit. > I think we can replace it with `TimeUtils.parseDuration(timeStr)`, just like > `DescriptorProperties#getOptionalDuration`, to have a default ms unit and be consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18713) Allow default ms unit for table.exec.mini-batch.allow-latency etc.
[ https://issues.apache.org/jira/browse/FLINK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166855#comment-17166855 ] hailong wang commented on FLINK-18713: -- I would like to take it. Thanks for assigning it to me. > Allow default ms unit for table.exec.mini-batch.allow-latency etc. > -- > > Key: FLINK-18713 > URL: https://issues.apache.org/jira/browse/FLINK-18713 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.11.0 >Reporter: hailong wang >Priority: Major > Fix For: 1.12.0 > > > We use `scala.concurrent.duration.Duration.create` to parse timeStr in > `TableConfigUtils#getMillisecondFromConfigDuration` for the following properties: > {code:java} > table.exec.async-lookup.timeout > table.exec.source.idle-timeout > table.exec.mini-batch.allow-latency > table.exec.emit.early-fire.delay > table.exec.emit.late-fire.delay{code} > And it must have the unit. > I think we can replace it with `TimeUtils.parseDuration(timeStr)`, just like > `DescriptorProperties#getOptionalDuration`, to have a default ms unit and be consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005)
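The behaviour the ticket asks for, where a bare number defaults to milliseconds and an explicit unit is honoured, can be sketched independently of Flink's `TimeUtils` (class name hypothetical, unit set simplified to ms/s/min for illustration):

```java
import java.time.Duration;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of duration parsing with a default "ms" unit, approximating what
// TimeUtils.parseDuration would give these table.exec.* options.
public class DurationParseSketch {

    private static final Pattern FORMAT = Pattern.compile("(\\d+)\\s*([a-z]*)");

    static Duration parse(String text) {
        Matcher m = FORMAT.matcher(text.trim().toLowerCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid duration: " + text);
        }
        long value = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "":   // no unit given -> default to milliseconds
            case "ms":
                return Duration.ofMillis(value);
            case "s":
                return Duration.ofSeconds(value);
            case "min":
                return Duration.ofMinutes(value);
            default:
                throw new IllegalArgumentException("Unknown unit in: " + text);
        }
    }
}
```

With this, `table.exec.mini-batch.allow-latency = 500` would be read as 500 ms instead of failing for a missing unit, while `3 s` keeps its explicit unit.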
[jira] [Closed] (FLINK-18747) Make Debezium Json format support timestamp with timezone
[ https://issues.apache.org/jira/browse/FLINK-18747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leon Hao closed FLINK-18747. Resolution: Fixed [~fsk119] Thank you for the reply. > Make Debezium Json format support timestamp with timezone > - > > Key: FLINK-18747 > URL: https://issues.apache.org/jira/browse/FLINK-18747 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Reporter: Leon Hao >Priority: Major > > Debezium Connector for MySQL maps the timestamp datatype to an ISO 8601 with time > zone format like '2020-07-29T01:09:52.534173Z'. The current code doesn't support > timestamp with time zone even if we set > 'debezium-json.timestamp-format.standard' = 'ISO-8601'. I think we should > support it because: > 1. ISO-8601 itself supports timestamp with timezone. > 2. Sometimes we need to CDC many tables with many timestamp fields from > MySQL. It would be a waste of time to convert these fields from string to > timestamp manually as proposed in FLINK-17752. > 3. It is very confusing to users when errors occur although the documentation says > Flink supports the Debezium JSON format and ISO-8601. > I think we can create a new debezium-json.timestamp-format option to support > timestamp with time zone or have better documentation letting users know > Flink does not support it. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18747) Make Debezium Json format support timestamp with timezone
[ https://issues.apache.org/jira/browse/FLINK-18747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166852#comment-17166852 ] Shengkai Fang commented on FLINK-18747: --- [~lhao] Yes. Debezium uses the same JSON format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18747) Make Debezium Json format support timestamp with timezone
[ https://issues.apache.org/jira/browse/FLINK-18747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166850#comment-17166850 ] Leon Hao commented on FLINK-18747: -- [~fsk119] Will this work with the Debezium JSON format? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18742) Some configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Wang updated FLINK-18742: -- Summary: Some configuration args do not take effect at client (was: Configuration args do not take effect at client) > Some configuration args do not take effect at client > > > Key: FLINK-18742 > URL: https://issues.apache.org/jira/browse/FLINK-18742 > Project: Flink > Issue Type: Improvement > Components: Command Line Client >Affects Versions: 1.11.1 >Reporter: Matt Wang >Priority: Major > > The configuration args from the command line will not work at the client. For > example, if the job sets {color:#505f79}_classloader.resolve-order_{color} > to _{color:#505f79}parent-first{color}_, it works at the TaskManager, but > not at the Client. > The *FlinkUserCodeClassLoaders* will be created before > _{color:#505f79}getEffectiveConfiguration(){color}_ is called in > {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the > _{color:#505f79}Configuration{color}_ used by > _{color:#505f79}PackagedProgram{color}_ does not include the configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18742) Some configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Wang updated FLINK-18742: -- Description: Some configuration args from the command line will not work at the client. For example, if the job sets {color:#505f79}_classloader.resolve-order_{color} to _{color:#505f79}parent-first{color}_, it works at the TaskManager, but not at the Client. The *FlinkUserCodeClassLoaders* will be created before _{color:#505f79}getEffectiveConfiguration(){color}_ is called in {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the _{color:#505f79}Configuration{color}_ used by _{color:#505f79}PackagedProgram{color}_ does not include the configuration args. was: The configuration args from the command line will not work at the client. For example, if the job sets {color:#505f79}_classloader.resolve-order_{color} to _{color:#505f79}parent-first{color}_, it works at the TaskManager, but not at the Client. The *FlinkUserCodeClassLoaders* will be created before _{color:#505f79}getEffectiveConfiguration(){color}_ is called in {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the _{color:#505f79}Configuration{color}_ used by _{color:#505f79}PackagedProgram{color}_ does not include the configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18747) Make Debezium Json format support timestamp with timezone
[ https://issues.apache.org/jira/browse/FLINK-18747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166839#comment-17166839 ] Shengkai Fang commented on FLINK-18747: --- Currently, we have implemented the TIMESTAMP_WITH_LOCAL_ZONE datatype to support timestamps with a zone in FLINK-18296. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18747) Make Debezium Json format support timestamp with timezone
Leon Hao created FLINK-18747: Summary: Make Debezium Json format support timestamp with timezone Key: FLINK-18747 URL: https://issues.apache.org/jira/browse/FLINK-18747 Project: Flink Issue Type: Improvement Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Reporter: Leon Hao Debezium Connector for MySQL maps the timestamp datatype to an ISO 8601 with time zone format like '2020-07-29T01:09:52.534173Z'. The current code doesn't support timestamp with time zone even if we set 'debezium-json.timestamp-format.standard' = 'ISO-8601'. I think we should support it because: 1. ISO-8601 itself supports timestamp with timezone. 2. Sometimes we need to CDC many tables with many timestamp fields from MySQL. It would be a waste of time to convert these fields from string to timestamp manually as proposed in FLINK-17752. 3. It is very confusing to users when errors occur although the documentation says Flink supports the Debezium JSON format and ISO-8601. I think we can create a new debezium-json.timestamp-format option to support timestamp with time zone or have better documentation letting users know Flink does not support it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
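The value quoted in the issue illustrates the problem: `'2020-07-29T01:09:52.534173Z'` carries a zone offset, so it cannot be parsed as a plain `LocalDateTime`. A sketch of the required parsing (this is not Flink's actual Debezium deserializer, just the `java.time` calls that handle such a value):

```java
import java.time.Instant;
import java.time.OffsetDateTime;

// Sketch: an ISO-8601 value with a zone offset must go through
// OffsetDateTime (or Instant), not LocalDateTime, which rejects the 'Z'.
public class Iso8601Parse {
    public static Instant parseWithZone(String value) {
        // ISO_OFFSET_DATE_TIME accepts 'Z' and numeric offsets alike.
        return OffsetDateTime.parse(value).toInstant();
    }
}
```

A `TIMESTAMP WITH LOCAL TIME ZONE` column could then be populated from the resulting `Instant`, which is what the FLINK-18296 work referenced below enables.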
[jira] [Commented] (FLINK-18625) Maintain redundant taskmanagers to speed up failover
[ https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166829#comment-17166829 ] Xintong Song commented on FLINK-18625: -- [~trohrmann], regarding your questions. bq. How would this feature work if the job requests heterogeneous slots which might result in differently sized TMs? I guess we will allocate default sized TMs. But what if this will prevent us from allocating fewer larger sized TMs which are required for fulfilling the heterogeneous slot requests? I see your point. One optimization could be to release the redundant task managers if there are heterogeneous pending worker requests. The problem is that a redundant task manager may not be releasable if any of its slots are allocated (e.g., slots are evenly spread out), and even if it is releasable, it would cost more time to obtain a new task manager. I guess that's the price we need to pay if this feature is enabled. WDYT? bq. How does this feature relate to FLINK-16605 and FLINK-15959? I believe that the lower and upper bounds should also limit the number of redundant slots, right? According to [~Jiangang]'s PR, the upper bound also limits the number of redundant slots. I believe it should be the same for the lower bound. We should make sure of that when working on FLINK-15959. cc [~karmagyz] > Maintain redundant taskmanagers to speed up failover > > > Key: FLINK-18625 > URL: https://issues.apache.org/jira/browse/FLINK-18625 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Reporter: Liu >Assignee: Liu >Priority: Major > Labels: pull-request-available > > When a Flink job fails because of killed TaskManagers, it requests new > containers when restarting. Requesting new containers can be very slow; > sometimes it takes dozens of seconds or even more. The reasons vary: for > example, YARN and HDFS may be slow, or machine performance may be poor. 
In some > production scenarios, the SLA is high and failover should complete in seconds. > > To speed up the recovery process, we can maintain redundant slots in advance. > When the job restarts, it can use the redundant slots at once instead of > requesting new taskmanagers. > > The implementation can be done in SlotManagerImpl. Below is a brief description: > # In the constructor, init redundantTaskmanagerNum from the config. > # In method start(), allocate redundant taskmanagers. > # In method start(), change taskManagerTimeoutCheck() to > checkValidTaskManagers(). > # In method checkValidTaskManagers(), manage redundant taskmanagers and > timed-out taskmanagers. The idle taskmanager number must not be less than > redundantTaskmanagerNum. > * If less, allocate from the resourceManager until equal. > * If more, release timed-out taskmanagers but keep at least > redundantTaskmanagerNum idle taskmanagers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
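The bookkeeping described in the last step can be sketched as two pure functions (a hypothetical illustration of the invariant, not SlotManagerImpl code; the names `toAllocate`/`releasable` are assumptions):

```java
// Sketch of the checkValidTaskManagers() invariant: keep the number of idle
// TaskManagers at or above the configured redundancy.
public class RedundancyCheck {

    /** How many additional TaskManagers must be requested to restore redundancy. */
    public static int toAllocate(int idle, int redundant) {
        return Math.max(0, redundant - idle);
    }

    /**
     * How many timed-out idle TaskManagers may be released without
     * dropping the idle count below the redundancy.
     */
    public static int releasable(int idle, int timedOut, int redundant) {
        return Math.max(0, Math.min(timedOut, idle - redundant));
    }
}
```

For example, with 5 idle TaskManagers, 4 of them timed out, and a redundancy of 2, only 3 may be released, so 2 idle TaskManagers always remain available for fast failover.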
[jira] [Updated] (FLINK-18681) The jar package version conflict causes the task to continue to increase and grab resources
[ https://issues.apache.org/jira/browse/FLINK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangtaiyang updated FLINK-18681: Attachment: yarn-hadoop-resourcemanager-x.x.x.15.log.2020-07-22-17.log > The jar package version conflict causes the task to continue to increase and > grab resources > --- > > Key: FLINK-18681 > URL: https://issues.apache.org/jira/browse/FLINK-18681 > Project: Flink > Issue Type: Bug >Affects Versions: 1.11.0 >Reporter: wangtaiyang >Priority: Major > Attachments: appId.log, dependency.log, > image-2020-07-28-15-32-51-851.png, > yarn-hadoop-resourcemanager-x.x.x.15.log.2020-07-22-17.log > > > When I submit a Flink task to YARN, the default resource configuration is > 1G & 1 core, but in fact this task keeps increasing its resources: 2 cores, > 3 cores, and so on, up to 200 cores. Then I looked at the JM log and found the > following error: > {code:java} > // code placeholder > java.lang.NoSuchMethodError: > org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder;java.lang.NoSuchMethodError: > > org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder; > at > org.apache.flink.runtime.entrypoint.parser.CommandLineOptions.<clinit>(CommandLineOptions.java:28) > ~[flink-dist_2.11-1.11.1.jar:1.11.1] at > org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) > ~[flink-dist_2.11-1.11.1.jar:1.11.1] at > java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) > ~[?:1.8.0_191] > ... 
> java.lang.NoClassDefFoundError: Could not initialize class > org.apache.flink.runtime.entrypoint.parser.CommandLineOptionsjava.lang.NoClassDefFoundError: > Could not initialize class > org.apache.flink.runtime.entrypoint.parser.CommandLineOptions at > org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) > ~[flink-dist_2.11-1.11.1.jar:1.11.1] at > java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) > ~[?:1.8.0_191] at > java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553) > ~[?:1.8.0_191] at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > ~[?:1.8.0_191]{code} > Finally, it is confirmed that it is caused by the commons-cli version > conflict, but the failing task has not stopped and continues to grab more and > more resources. Is this a bug? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18681) The jar package version conflict causes the task to continue to increase and grab resources
[ https://issues.apache.org/jira/browse/FLINK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166824#comment-17166824 ] wangtaiyang commented on FLINK-18681: - It's here !image-2020-07-28-15-32-51-851.png! RM LOG: [^yarn-hadoop-resourcemanager-x.x.x.15.log.2020-07-22-17.log] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166822#comment-17166822 ] Jark Wu commented on FLINK-18496: - I noticed you are using Jekyll 4.1.1 instead of 4.0.1 > Anchors are not generated based on ZH characters > > > Key: FLINK-18496 > URL: https://issues.apache.org/jira/browse/FLINK-18496 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Zhu Zhu >Assignee: Zhilong Hong >Priority: Major > Labels: starter > > In the ZH version pages of flink-web, the anchors are not generated based on ZH > characters. The anchor name would be like 'section-1', 'section-2' if there > are no EN characters. An example can be the links in the navigator of > https://flink.apache.org/zh/contributing/contribute-code.html > This makes it impossible to reference an anchor from the content because the anchor > name might change unexpectedly if a new section is added. > Note that it is a problem for flink-web only. The docs generated from the > flink repo can properly generate ZH anchors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166819#comment-17166819 ] Jark Wu commented on FLINK-18496: - My point is: if it conflicts with the excerpt plugin, why does it work in the Flink main repo? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166815#comment-17166815 ] Zhilong Hong edited comment on FLINK-18496 at 7/29/20, 2:36 AM: I tried to use the Gemfile in flink/docs, but it doesn't work. In my opinion, the main blocker is that "jekyll-multiple-language" conflicts with the *excerpt* feature in Jekyll 4. I'm still working on this, but it seems difficult to solve this issue. was (Author: thesharing): I tried to use the Gemfile in flink/docs, but it doesn't work. In my opinion, the main blocker is that "jekyll-multiple-language" conflicts with the *excerpt* feature in Jekyll 4. I'm still working on this, but it seems hard to solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166815#comment-17166815 ] Zhilong Hong commented on FLINK-18496: -- I tried to use the Gemfile in flink/docs, but it doesn't work. In my opinion, the main blocker is that "jekyll-multiple-language" conflicts with the *excerpt* feature in Jekyll 4. I'm still working on this, but it seems hard to solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
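The 'section-N' symptom discussed in this thread can be illustrated with a small Java sketch (assumed slugification rules, not Jekyll's or kramdown's actual implementation): a slug generator that keeps only ASCII word characters produces an empty slug for a purely Chinese heading, forcing a positional fallback, while a generator that also keeps CJK characters yields a stable anchor.

```java
// Sketch of the anchor problem: ASCII-only slugification empties a ZH
// heading; keeping Han characters (\p{IsHan}) preserves a usable anchor.
public class AnchorSlug {

    public static String asciiOnly(String heading) {
        return heading.toLowerCase()
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-|-$", "");
    }

    public static String keepCjk(String heading) {
        return heading.toLowerCase()
                .replaceAll("[^a-z0-9\\p{IsHan}]+", "-")
                .replaceAll("^-|-$", "");
    }
}
```

When `asciiOnly` returns an empty string, a generator has no choice but to emit a positional name like 'section-1', which changes whenever a section is inserted; `keepCjk`-style behavior is what the flink main repo's setup evidently achieves.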
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166813#comment-17166813 ] tartarus commented on FLINK-18663: -- [~chesnay] [~trohrmann] How should we fix this problem? What do you think about passing {{maxContentLength}} as a constructor parameter to AbstractHandler and then modifying all handlers? Do you have any good suggestions? Thanks. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE because FlinkHttpObjectAggregator is null. > When a REST handler throws an exception, it executes this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > But flinkHttpObjectAggregator is null in some cases, so this throws an NPE. This > method is called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync() never completes. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
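Besides the constructor-parameter proposal above, the failing lookup could also be guarded defensively. The sketch below models that idea; the names and the default value are assumptions for illustration, not Flink's actual fix (`Integer` stands in for the possibly-null result of `ctx.pipeline().get(FlinkHttpObjectAggregator.class)`).

```java
// Sketch of a defensive variant of the failing lookup: fall back to an
// assumed default when FlinkHttpObjectAggregator is absent from the
// pipeline, instead of dereferencing a null handler.
public class MaxLengthLookup {
    static final int DEFAULT_MAX_CONTENT_LENGTH = 104_857_600; // assumed default
    static final int OTHER_RESP_PAYLOAD_OVERHEAD = 1024;       // assumed overhead

    /** A null argument models the aggregator missing from the channel pipeline. */
    public static int effectiveMaxLength(Integer aggregatorMaxContentLength) {
        int max = aggregatorMaxContentLength != null
                ? aggregatorMaxContentLength
                : DEFAULT_MAX_CONTENT_LENGTH;
        return max - OTHER_RESP_PAYLOAD_OVERHEAD;
    }
}
```

Either way, `handleException` would then always reach `sendErrorResponse`, so `finalizeRequestProcessing` runs, the InFlightRequestTracker is cleared, and `closeAsync()` can complete.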
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166812#comment-17166812 ] Jark Wu commented on FLINK-18496: - Could you use the versions used in flink main repo [1]? [1]: https://github.com/apache/flink/blob/master/docs/Gemfile > Anchors are not generated based on ZH characters > > > Key: FLINK-18496 > URL: https://issues.apache.org/jira/browse/FLINK-18496 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Zhu Zhu >Assignee: Zhilong Hong >Priority: Major > Labels: starter > > In ZH version pages of flink-web, the anchors are not generated based on ZH > characters. The anchor name would be like 'section-1', 'section-2' if there > is no EN characters. An example can be the links in the navigator of > https://flink.apache.org/zh/contributing/contribute-code.html > This makes it impossible to ref an anchor from the content because the anchor > name might change unexpectedly if a new section is added. > Note that it is a problem for flink-web only. The docs generated from the > flink repo can properly generate ZH anchors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18744) resume from modified savepoint directory: No such file or directory
[ https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166808#comment-17166808 ] Congxian Qiu(klion26) commented on FLINK-18744: --- [~wayland] Thanks for reporting this issue, I'll take a look at it. > resume from modified savepoint directory: No such file or directory > - > > Key: FLINK-18744 > URL: https://issues.apache.org/jira/browse/FLINK-18744 > Project: Flink > Issue Type: Bug > Components: API / State Processor >Affects Versions: 1.11.0 >Reporter: tao wang >Priority: Major > > If I resume a job from a savepoint which was modified by the State Processor API, > such as loading from /savepoint-path-old and writing to /savepoint-path-new, > the job resumed with savepoint path = /savepoint-path-new while throwing an > Exception : > _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. > I think it's an issue because Flink 1.11 uses absolute paths in savepoints > and checkpoints, but the State Processor API missed this. > The job will work well with the new savepoint (whose path is /savepoint-path-new) > if I copy all directories except `_metadata` from /savepoint-path-old to > /savepoint-path-new. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16947) ArtifactResolutionException: Could not transfer artifact. Entry [...] has not been leased from this pool
[ https://issues.apache.org/jira/browse/FLINK-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166805#comment-17166805 ] Dian Fu commented on FLINK-16947: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4975&view=logs&j=8fd975ef-f478-511d-4997-6f15fe8a1fd3&t=ac0fa443-5d45-5a6b-3597-0310ecc1d2ab > ArtifactResolutionException: Could not transfer artifact. Entry [...] has > not been leased from this pool > - > > Key: FLINK-16947 > URL: https://issues.apache.org/jira/browse/FLINK-16947 > Project: Flink > Issue Type: Bug > Components: Build System / Azure Pipelines >Reporter: Piotr Nowojski >Assignee: Robert Metzger >Priority: Blocker > Labels: test-stability > Fix For: 1.12.0 > > > https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6982&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5 > Build of flink-metrics-availability-test failed with: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test (end-to-end-tests) > on project flink-metrics-availability-test: Unable to generate classpath: > org.apache.maven.artifact.resolver.ArtifactResolutionException: Could not > transfer artifact org.apache.maven.surefire:surefire-grouper:jar:2.22.1 > from/to google-maven-central > (https://maven-central-eu.storage-download.googleapis.com/maven2/): Entry > [id:13][route:{s}->https://maven-central-eu.storage-download.googleapis.com:443][state:null] > has not been leased from this pool > [ERROR] org.apache.maven.surefire:surefire-grouper:jar:2.22.1 > [ERROR] > [ERROR] from the specified remote repositories: > [ERROR] google-maven-central > (https://maven-central-eu.storage-download.googleapis.com/maven2/, > releases=true, snapshots=false), > [ERROR] apache.snapshots (https://repository.apache.org/snapshots, > releases=false, snapshots=true) > [ERROR] Path to dependency: > [ERROR] 1) dummy:dummy:jar:1.0 > [ERROR] 2) 
org.apache.maven.surefire:surefire-junit47:jar:2.22.1 > [ERROR] 3) org.apache.maven.surefire:common-junit48:jar:2.22.1 > [ERROR] 4) org.apache.maven.surefire:surefire-grouper:jar:2.22.1 > [ERROR] -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException > [ERROR] > [ERROR] After correcting the problems, you can resume the build with the > command > [ERROR] mvn -rf :flink-metrics-availability-test > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17274) Maven: Premature end of Content-Length delimited message body
[ https://issues.apache.org/jira/browse/FLINK-17274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166804#comment-17166804 ] Dian Fu commented on FLINK-17274: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4975&view=logs&j=3d12d40f-c62d-5ec4-6acc-0efe94cc3e89&t=5d6e4255-0ea8-5e2a-f52c-c881b7872361 > Maven: Premature end of Content-Length delimited message body > - > > Key: FLINK-17274 > URL: https://issues.apache.org/jira/browse/FLINK-17274 > Project: Flink > Issue Type: Bug > Components: Build System / Azure Pipelines >Reporter: Robert Metzger >Assignee: Robert Metzger >Priority: Critical > Fix For: 1.12.0 > > > CI: > https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7786&view=logs&j=52b61abe-a3cc-5bde-cc35-1bbe89bb7df5&t=54421a62-0c80-5aad-3319-094ff69180bb > {code} > [ERROR] Failed to execute goal on project > flink-connector-elasticsearch7_2.11: Could not resolve dependencies for > project > org.apache.flink:flink-connector-elasticsearch7_2.11:jar:1.11-SNAPSHOT: Could > not transfer artifact org.apache.lucene:lucene-sandbox:jar:8.3.0 from/to > alicloud-mvn-mirror > (http://mavenmirror.alicloud.dak8s.net:/repository/maven-central/): GET > request of: org/apache/lucene/lucene-sandbox/8.3.0/lucene-sandbox-8.3.0.jar > from alicloud-mvn-mirror failed: Premature end of Content-Length delimited > message body (expected: 289920; received: 239832 -> [Help 1] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18746) WindowStaggerTest.testWindowStagger failed
Dian Fu created FLINK-18746: --- Summary: WindowStaggerTest.testWindowStagger failed Key: FLINK-18746 URL: https://issues.apache.org/jira/browse/FLINK-18746 Project: Flink Issue Type: Bug Components: API / DataStream Affects Versions: 1.12.0 Reporter: Dian Fu https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4975&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=7c61167f-30b3-5893-cc38-a9e3d057e392 {code}
2020-07-28T21:16:30.1350624Z [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.145 s <<< FAILURE! - in org.apache.flink.streaming.runtime.operators.windowing.WindowStaggerTest
2020-07-28T21:16:30.1352065Z [ERROR] testWindowStagger(org.apache.flink.streaming.runtime.operators.windowing.WindowStaggerTest) Time elapsed: 0.012 s <<< FAILURE!
2020-07-28T21:16:30.1352701Z java.lang.AssertionError
2020-07-28T21:16:30.1353104Z at org.junit.Assert.fail(Assert.java:86)
2020-07-28T21:16:30.1353810Z at org.junit.Assert.assertTrue(Assert.java:41)
2020-07-28T21:16:30.1354289Z at org.junit.Assert.assertTrue(Assert.java:52)
2020-07-28T21:16:30.1354914Z at org.apache.flink.streaming.runtime.operators.windowing.WindowStaggerTest.testWindowStagger(WindowStaggerTest.java:38)
2020-07-28T21:16:30.1355520Z at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2020-07-28T21:16:30.1356060Z at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2020-07-28T21:16:30.1356663Z at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-07-28T21:16:30.1357220Z at java.lang.reflect.Method.invoke(Method.java:498)
2020-07-28T21:16:30.1357775Z at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
2020-07-28T21:16:30.1358383Z at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2020-07-28T21:16:30.1358986Z at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
2020-07-28T21:16:30.1359623Z at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2020-07-28T21:16:30.1360187Z at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
2020-07-28T21:16:30.1360740Z at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
2020-07-28T21:16:30.1361364Z at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
2020-07-28T21:16:30.1361916Z at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
2020-07-28T21:16:30.1362432Z at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
2020-07-28T21:16:30.1362976Z at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
2020-07-28T21:16:30.1363516Z at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
2020-07-28T21:16:30.1364041Z at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
2020-07-28T21:16:30.1364568Z at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
2020-07-28T21:16:30.1365139Z at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
2020-07-28T21:16:30.1365764Z at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
2020-07-28T21:16:30.1366413Z at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
2020-07-28T21:16:30.1367036Z at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
2020-07-28T21:16:30.1367671Z at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
2020-07-28T21:16:30.1368337Z at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
2020-07-28T21:16:30.1368956Z at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
2020-07-28T21:16:30.1369530Z at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18745) 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch
[ https://issues.apache.org/jira/browse/FLINK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu updated FLINK-18745: Labels: test-stability (was: ) > 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to > download Elasticsearch > -- > > Key: FLINK-18745 > URL: https://issues.apache.org/jira/browse/FLINK-18745 > Project: Flink > Issue Type: Test > Components: Connectors / ElasticSearch, Table SQL / Client, Tests >Affects Versions: 1.12.0, 1.11.1 >Reporter: Dian Fu >Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4976&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=03dbd840-5430-533d-d1a7-05d0ebe03873 > {code} > Downloading Elasticsearch from > https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.5.1-linux-x86_64.tar.gz > ... > 2020-07-28T22:23:48.2016019Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:48.2017880Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:48.2018245Z > 2020-07-28T22:23:48.4204474Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:49.4207369Z 0 276M0 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:50.4205512Z 2 276M2 6459k0 0 5291k 0 > 0:00:53 0:00:01 0:00:52 5290k > 2020-07-28T22:23:51.4205838Z 7 276M7 20.2M0 0 9343k 0 > 0:00:30 0:00:02 0:00:28 9341k > 2020-07-28T22:23:51.5660725Z 14 276M 14 39.9M0 0 12.3M 0 > 0:00:22 0:00:03 0:00:19 12.3M > 2020-07-28T22:23:51.5661374Z 15 276M 15 43.2M0 0 12.8M 0 > 0:00:21 0:00:03 0:00:18 12.8M > 2020-07-28T22:23:51.5735405Z curl: (18) transfer closed with 244702844 bytes > remaining to read > 2020-07-28T22:23:51.9894747Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:51.9895725Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:51.9896068Z > 2020-07-28T22:23:51.9940775Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port > 
9200: Connection refused > 2020-07-28T22:23:51.9951219Z [FAIL] Test script contains errors. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18745) 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch
[ https://issues.apache.org/jira/browse/FLINK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu updated FLINK-18745: Affects Version/s: 1.12.0 > 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to > download Elasticsearch > -- > > Key: FLINK-18745 > URL: https://issues.apache.org/jira/browse/FLINK-18745 > Project: Flink > Issue Type: Test > Components: Connectors / ElasticSearch, Table SQL / Client, Tests >Affects Versions: 1.12.0, 1.11.1 >Reporter: Dian Fu >Priority: Major > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4976&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=03dbd840-5430-533d-d1a7-05d0ebe03873 > {code} > Downloading Elasticsearch from > https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.5.1-linux-x86_64.tar.gz > ... > 2020-07-28T22:23:48.2016019Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:48.2017880Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:48.2018245Z > 2020-07-28T22:23:48.4204474Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:49.4207369Z 0 276M0 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:50.4205512Z 2 276M2 6459k0 0 5291k 0 > 0:00:53 0:00:01 0:00:52 5290k > 2020-07-28T22:23:51.4205838Z 7 276M7 20.2M0 0 9343k 0 > 0:00:30 0:00:02 0:00:28 9341k > 2020-07-28T22:23:51.5660725Z 14 276M 14 39.9M0 0 12.3M 0 > 0:00:22 0:00:03 0:00:19 12.3M > 2020-07-28T22:23:51.5661374Z 15 276M 15 43.2M0 0 12.8M 0 > 0:00:21 0:00:03 0:00:18 12.8M > 2020-07-28T22:23:51.5735405Z curl: (18) transfer closed with 244702844 bytes > remaining to read > 2020-07-28T22:23:51.9894747Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:51.9895725Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:51.9896068Z > 2020-07-28T22:23:51.9940775Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port > 9200: Connection refused > 
2020-07-28T22:23:51.9951219Z [FAIL] Test script contains errors. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18745) 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch
[ https://issues.apache.org/jira/browse/FLINK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166803#comment-17166803 ] Dian Fu commented on FLINK-18745: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4942&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 > 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to > download Elasticsearch > -- > > Key: FLINK-18745 > URL: https://issues.apache.org/jira/browse/FLINK-18745 > Project: Flink > Issue Type: Test > Components: Connectors / ElasticSearch, Table SQL / Client, Tests >Affects Versions: 1.11.1 >Reporter: Dian Fu >Priority: Major > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4976&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=03dbd840-5430-533d-d1a7-05d0ebe03873 > {code} > Downloading Elasticsearch from > https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.5.1-linux-x86_64.tar.gz > ... 
> 2020-07-28T22:23:48.2016019Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:48.2017880Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:48.2018245Z > 2020-07-28T22:23:48.4204474Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:49.4207369Z 0 276M0 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:50.4205512Z 2 276M2 6459k0 0 5291k 0 > 0:00:53 0:00:01 0:00:52 5290k > 2020-07-28T22:23:51.4205838Z 7 276M7 20.2M0 0 9343k 0 > 0:00:30 0:00:02 0:00:28 9341k > 2020-07-28T22:23:51.5660725Z 14 276M 14 39.9M0 0 12.3M 0 > 0:00:22 0:00:03 0:00:19 12.3M > 2020-07-28T22:23:51.5661374Z 15 276M 15 43.2M0 0 12.8M 0 > 0:00:21 0:00:03 0:00:18 12.8M > 2020-07-28T22:23:51.5735405Z curl: (18) transfer closed with 244702844 bytes > remaining to read > 2020-07-28T22:23:51.9894747Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:51.9895725Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:51.9896068Z > 2020-07-28T22:23:51.9940775Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port > 9200: Connection refused > 2020-07-28T22:23:51.9951219Z [FAIL] Test script contains errors. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18745) 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch
Dian Fu created FLINK-18745: --- Summary: 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch Key: FLINK-18745 URL: https://issues.apache.org/jira/browse/FLINK-18745 Project: Flink Issue Type: Test Components: Connectors / ElasticSearch, Table SQL / Client, Tests Affects Versions: 1.11.1 Reporter: Dian Fu https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4976&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=03dbd840-5430-533d-d1a7-05d0ebe03873 {code} Downloading Elasticsearch from https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.5.1-linux-x86_64.tar.gz ... 2020-07-28T22:23:48.2016019Z % Total% Received % Xferd Average Speed TimeTime Time Current 2020-07-28T22:23:48.2017880Z Dload Upload Total SpentLeft Speed 2020-07-28T22:23:48.2018245Z 2020-07-28T22:23:48.4204474Z 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0 2020-07-28T22:23:49.4207369Z 0 276M0 00 0 0 0 --:--:-- --:--:-- --:--:-- 0 2020-07-28T22:23:50.4205512Z 2 276M2 6459k0 0 5291k 0 0:00:53 0:00:01 0:00:52 5290k 2020-07-28T22:23:51.4205838Z 7 276M7 20.2M0 0 9343k 0 0:00:30 0:00:02 0:00:28 9341k 2020-07-28T22:23:51.5660725Z 14 276M 14 39.9M0 0 12.3M 0 0:00:22 0:00:03 0:00:19 12.3M 2020-07-28T22:23:51.5661374Z 15 276M 15 43.2M0 0 12.8M 0 0:00:21 0:00:03 0:00:18 12.8M 2020-07-28T22:23:51.5735405Z curl: (18) transfer closed with 244702844 bytes remaining to read 2020-07-28T22:23:51.9894747Z % Total% Received % Xferd Average Speed TimeTime Time Current 2020-07-28T22:23:51.9895725Z Dload Upload Total SpentLeft Speed 2020-07-28T22:23:51.9896068Z 2020-07-28T22:23:51.9940775Z 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 9200: Connection refused 2020-07-28T22:23:51.9951219Z [FAIL] Test script contains errors. {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
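[Editor's note] The failure above is a truncated download (curl exit code 18), after which the script proceeds and fails to reach localhost:9200. One mitigation is to wrap the flaky transfer in retries with backoff. The sketch below illustrates that idea in generic Java; it is not the actual e2e test fix (those scripts are bash), and all names and limits here are invented for illustration:

```java
import java.util.concurrent.Callable;

// Generic retry helper: re-run 'action' up to maxAttempts times,
// sleeping with a linearly growing backoff between failed attempts.
// The last failure is rethrown if every attempt fails.
public class RetryingDownload {
    public static <T> T withRetries(Callable<T> action, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // linear backoff
                }
            }
        }
        throw last;
    }
}
```

A caller would wrap the download step, e.g. `withRetries(() -> download(url), 5, 1000L)`, so a single truncated transfer no longer fails the whole test run.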
[jira] [Updated] (FLINK-18690) Implement LocalInputPreferredSlotSharingStrategy
[ https://issues.apache.org/jira/browse/FLINK-18690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhu Zhu updated FLINK-18690: Component/s: Runtime / Coordination > Implement LocalInputPreferredSlotSharingStrategy > > > Key: FLINK-18690 > URL: https://issues.apache.org/jira/browse/FLINK-18690 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Andrey Zagrebin >Assignee: Zhu Zhu >Priority: Major > > Implement ExecutionSlotSharingGroup, SlotSharingStrategy and default > LocalInputPreferredSlotSharingStrategy. > The default strategy would be LocalInputPreferredSlotSharingStrategy. It will > try to reduce remote data exchanges. Subtasks, which are connected and belong > to the same SlotSharingGroup, tend to be put in the same > ExecutionSlotSharingGroup. > See [design > doc|https://docs.google.com/document/d/10CbCVDBJWafaFOovIXrR8nAr2BnZ_RAGFtUeTHiJplw/edit#heading=h.t4vfmm4atqoy] -- This message was sent by Atlassian Jira (v8.3.4#803005)
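[Editor's note] The strategy described above can be caricatured in a few lines: when assigning a consumer subtask to an ExecutionSlotSharingGroup, prefer the group of one of its input producers, so that edge becomes a local data exchange. The toy sketch below is not Flink's actual interfaces (it ignores SlotSharingGroup boundaries and co-location constraints); all names are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the "local input preferred" idea: a consumer joins the
// group of one of its producers when that producer already has a group,
// so the connecting edge becomes a local exchange; otherwise it opens
// a fresh group.
public class LocalInputPreferredGrouping {

    /** inputs.get(v) lists the producers of consumer v; topoOrder is producers-first. */
    static Map<String, Integer> assignGroups(List<String> topoOrder,
                                             Map<String, List<String>> inputs) {
        Map<String, Integer> group = new HashMap<>();
        int nextGroup = 0;
        for (String v : topoOrder) {
            Integer preferred = null;
            for (String producer : inputs.getOrDefault(v, new ArrayList<>())) {
                if (group.containsKey(producer)) {
                    preferred = group.get(producer); // reuse a producer's group
                    break;
                }
            }
            group.put(v, preferred != null ? preferred : nextGroup++);
        }
        return group;
    }
}
```

On a chain source -> map -> sink, all three subtasks end up in the same group, i.e. no remote data exchange.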
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166755#comment-17166755 ] 许 容浩 commented on FLINK-18663: -- Sent from my Huawei phone. Original message -- From: "Chesnay Schepler (Jira)" Date: Wednesday, July 29, 2020, 6:06 AM To: issues@flink.apache.org Subject: [jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit [ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166717#comment-17166717 ] Chesnay Schepler commented on FLINK-18663: -- Well there we go then, I was able to reproduce the issue. It is indeed due to the client closing the connection. -- This message was sent by Atlassian Jira (v8.3.4#803005) > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE because FlinkHttpObjectAggregator is null. > When a REST handler throws an exception, the following code runs: > {code:java} > private CompletableFuture handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > However, flinkHttpObjectAggregator is null in some cases, so this throws an NPE. > This method is called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > As a result, the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync never completes. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166717#comment-17166717 ] Chesnay Schepler commented on FLINK-18663: -- Well there we go then, I was able to reproduce the issue. It is indeed due to the client closing the connection. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE because FlinkHttpObjectAggregator is null. > When a REST handler throws an exception, the following code runs: > {code:java} > private CompletableFuture handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > However, flinkHttpObjectAggregator is null in some cases, so this throws an NPE. > This method is called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > As a result, the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync never completes. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
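[Editor's note] A defensive fix along the lines the report implies is to tolerate the aggregator being absent from the pipeline and fall back to a default limit. The sketch below only illustrates that idea: the constants and the Supplier stand-in for `ctx.pipeline().get(FlinkHttpObjectAggregator.class)` are invented here, and this is not necessarily the patch that was merged for FLINK-18663:

```java
import java.util.function.Supplier;

// Null-safe variant of the lookup that NPEs in handleException above.
// The Supplier stands in for ctx.pipeline().get(FlinkHttpObjectAggregator.class),
// which can legitimately return null; both constants are hypothetical.
public class NullSafeMaxLength {
    static final int DEFAULT_MAX_CONTENT_LENGTH = 104_857_600; // hypothetical default
    static final int OTHER_RESP_PAYLOAD_OVERHEAD = 256;        // hypothetical overhead

    static int safeMaxLength(Supplier<Integer> aggregatorMaxContentLength) {
        Integer fromPipeline = aggregatorMaxContentLength.get();
        // Fall back to a default instead of dereferencing null.
        int max = (fromPipeline != null) ? fromPipeline : DEFAULT_MAX_CONTENT_LENGTH;
        return max - OTHER_RESP_PAYLOAD_OVERHEAD;
    }
}
```

With this shape, the error-response path still produces a truncated stack trace even when the request died before the aggregator was installed in the pipeline.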
[jira] [Commented] (FLINK-18695) Allow NettyBufferPool to allocate heap buffers
[ https://issues.apache.org/jira/browse/FLINK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166697#comment-17166697 ] Stephan Ewen commented on FLINK-18695: -- My feeling is that this should be okay in Flink 1.12 (and 1.11) for the following reason: - Netty memory USED TO BE a significant chunk (especially on the receiver side), which made it non-trivial to reason about in the memory configuration - In the latest versions we directly write from (and read into) Flink memory buffers, so the memory that Netty itself allocates is minimal (headers, frame length decoders, encryption). These should not be very big (except possibly the encryption case), so there is not too much impact whether they are on heap or off heap. [~zjwang] What is your opinion? [~NicoK] Related, is there any way we can set up OpenSSL by default? It looks like the binaries are anyway in the Flink binary release. > Allow NettyBufferPool to allocate heap buffers > -- > > Key: FLINK-18695 > URL: https://issues.apache.org/jira/browse/FLINK-18695 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network >Reporter: Chesnay Schepler >Priority: Major > Fix For: 1.12.0 > > > In 4.1.43 Netty made a change to its SslHandler to always use heap buffers > for JDK SSLEngine implementations, to avoid an additional memory copy. > However, our {{NettyBufferPool}} forbids heap buffer allocations. > We will either have to allow heap buffer allocations, or create a custom > SslHandler implementation that does not use heap buffers (although this seems > ill-advised?). > /cc [~sewen] [~uce] [~NicoK] [~zjwang] [~pnowojski] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1719#comment-1719 ] Maximilian Michels commented on FLINK-16510: Here is the requested information (without changing the JVM arguments): {{jstack -F -m 1}}: [^stack3-mixed.txt] (same error without -F) {{jstack -F 1}}: [^stack3.txt] Command: [^command.txt] Interestingly, the originally reported behavior of hanging in the shutdown hooks is not visible in the stack trace. Still, the problem is not reproducible if {{halt}} is called immediately on fatal errors, without running shutdown hooks. > Task manager safeguard shutdown may not be reliable > --- > > Key: FLINK-16510 > URL: https://issues.apache.org/jira/browse/FLINK-16510 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Reporter: Maximilian Michels >Assignee: Maximilian Michels >Priority: Major > Attachments: command.txt, stack2-1.txt, stack3-mixed.txt, stack3.txt > > > The {{JvmShutdownSafeguard}} does not always succeed but can hang when > multiple threads attempt to shut down the JVM.
Apparently mixing > {{System.exit()}} with ShutdownHooks and forcefully terminating the JVM via > {{Runtime.halt()}} does not play together well: > {noformat} > "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x7fb8e82f2800 > nid=0x5a96 runnable [0x7fb35cffb000] >java.lang.Thread.State: RUNNABLE > at java.lang.Shutdown.$$YJP$$halt0(Native Method) > at java.lang.Shutdown.halt0(Shutdown.java) > at java.lang.Shutdown.halt(Shutdown.java:139) > - locked <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Runtime.halt(Runtime.java:276) > at > org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 > os_prio=0 tid=0x7fb708a7d000 nid=0x5a8a waiting for monitor entry > [0x7fb289d49000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.lang.Shutdown.halt(Shutdown.java:139) > - waiting to lock <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Shutdown.exit(Shutdown.java:213) > - locked <0x00047edb7348> (a java.lang.Class for java.lang.Shutdown) > at java.lang.Runtime.exit(Runtime.java:110) > at java.lang.System.exit(System.java:973) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > 
org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x0006d5e56bd0> (a > java.util.concurrent.ThreadPoolExecutor$Worker) > {noformat} > Note that under this condition the JVM should terminate but it still hangs. > Sometimes it quits after several minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
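[Editor's note] The observation in the comment above, that the hang disappears when {{halt}} is called immediately without running shutdown hooks, can be sketched as a fatal-error handler that bypasses {{System.exit()}} entirely: {{Runtime.halt()}} skips shutdown hooks and so never contends on the internal Shutdown lock seen in the stack trace. The injectable terminator and the exit code below are invented seams for illustration and testability, not Flink's actual fix:

```java
import java.util.function.IntConsumer;

// Sketch of the mitigation discussed above: terminate via Runtime.halt()
// directly on fatal errors instead of System.exit(), so shutdown hooks are
// skipped and cannot deadlock against a concurrent halt. The IntConsumer
// seam exists only so the behavior can be exercised without killing the JVM.
public class FatalErrorTerminator {
    static final int FATAL_EXIT_CODE = -17; // hypothetical exit code

    private final IntConsumer terminator;

    FatalErrorTerminator(IntConsumer terminator) {
        this.terminator = terminator;
    }

    /** Production wiring: halt() runs no shutdown hooks and no finalizers. */
    static FatalErrorTerminator halting() {
        return new FatalErrorTerminator(code -> Runtime.getRuntime().halt(code));
    }

    void onFatalError(Throwable t) {
        System.err.println("Fatal error, halting the JVM: " + t.getMessage());
        terminator.accept(FATAL_EXIT_CODE);
    }
}
```

The trade-off is that halting forfeits any cleanup the hooks would have done (temp files, external deregistration), which is usually acceptable on a fatal error.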
[jira] [Updated] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-16510: --- Attachment: stack3.txt > Task manager safeguard shutdown may not be reliable > --- > > Key: FLINK-16510 > URL: https://issues.apache.org/jira/browse/FLINK-16510 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Reporter: Maximilian Michels >Assignee: Maximilian Michels >Priority: Major > Attachments: command.txt, stack2-1.txt, stack3-mixed.txt, stack3.txt > > > The {{JvmShutdownSafeguard}} does not always succeed but can hang when > multiple threads attempt to shutdown the JVM. Apparently mixing > {{System.exit()}} with ShutdownHooks and forcefully terminating the JVM via > {{Runtime.halt()}} does not play together well: > {noformat} > "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x7fb8e82f2800 > nid=0x5a96 runnable [0x7fb35cffb000] >java.lang.Thread.State: RUNNABLE > at java.lang.Shutdown.$$YJP$$halt0(Native Method) > at java.lang.Shutdown.halt0(Shutdown.java) > at java.lang.Shutdown.halt(Shutdown.java:139) > - locked <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Runtime.halt(Runtime.java:276) > at > org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 > os_prio=0 tid=0x7fb708a7d000 nid=0x5a8a waiting for monitor entry > [0x7fb289d49000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.lang.Shutdown.halt(Shutdown.java:139) > - waiting to lock <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Shutdown.exit(Shutdown.java:213) > - locked <0x00047edb7348> (a java.lang.Class for java.lang.Shutdown) > at java.lang.Runtime.exit(Runtime.java:110) > at java.lang.System.exit(System.java:973) > at > 
org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x0006d5e56bd0> (a > java.util.concurrent.ThreadPoolExecutor$Worker) > {noformat} > Note that under this condition the JVM should terminate but it still hangs. > Sometimes it quits after several minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-16510: --- Attachment: command.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-16510: --- Attachment: stack3-mixed.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166543#comment-17166543 ] Leonid Ilyevsky commented on FLINK-18733: - Problem solved. I noticed that your distribution includes opt/flink-shaded-zookeeper-3.5.6.jar, so I replaced lib/flink-shaded-zookeeper-3.4.14.jar with the 3.5.6 version, and now everything works. I guess you may want to make 3.5.6 (or newer) the default in your future releases. > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After a few attempts at connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. > org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166480#comment-17166480 ] Etienne Chauchot edited comment on FLINK-17073 at 7/28/20, 4:09 PM: [~roman_khachatryan] When [~SleePy] and I discussed in [the design doc|https://docs.google.com/document/d/1q0y0aWlJMoUWNW7jjsM8uWfHsy2dM6YmmcmhpQzgLMA/edit?usp=sharing], the idea was to wait until the last checkpoint was cleaned before accepting another (that is what we called making cleaning part of checkpoint processing). Thus, checking only the existing number of pending checkpoints was enough (no need for a new queue) to foresee a flood of checkpoints to clean. But the solution you propose (managing the queue of the checkpoints to clean and monitoring its size) seems even simpler to me: it avoids having to sync normal checkpointing and checkpoint cleaning. As you said, when we choose a checkpoint trigger request to execute (*CheckpointRequestDecider.chooseRequestToExecute*), we can drop new checkpoint requests when there are too many checkpoints to clean and thus regulate the whole checkpointing system. Syncing cleaning and checkpointing might not be necessary for this regulation, you're right. If you don't mind, I'll go for this implementation proposal in the design doc. [~roman_khachatryan] thanks anyway for the suggestions, and please take a look at the design doc where we will have the implementation discussions. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool, which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
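The regulation discussed in the comment above — dropping new checkpoint trigger requests while too many completed checkpoints still await cleanup — can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the class and method names are invented here and do not match Flink's actual CheckpointRequestDecider API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of back-pressuring checkpoint triggers on pending cleanups.
// All names are illustrative, not Flink's real API.
class CleanupAwareDecider {
    private final int maxPendingCleanups;
    private final AtomicInteger pendingCleanups = new AtomicInteger();

    CleanupAwareDecider(int maxPendingCleanups) {
        this.maxPendingCleanups = maxPendingCleanups;
    }

    // Called when a completed checkpoint is handed to the ioExecutor for discarding.
    void cleanupQueued() {
        pendingCleanups.incrementAndGet();
    }

    // Called when the ioExecutor finishes discarding one checkpoint.
    void cleanupFinished() {
        pendingCleanups.decrementAndGet();
    }

    // Decide whether a new trigger request may execute; dropping requests
    // while the cleanup queue is long regulates the whole checkpointing system.
    boolean shouldTriggerCheckpoint() {
        return pendingCleanups.get() < maxPendingCleanups;
    }
}
```

With maxPendingCleanups = 2, for example, a second still-queued cleanup would cause subsequent trigger requests to be dropped until one cleanup finishes, without any synchronization between cleaning and checkpointing.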
[jira] [Comment Edited] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166507#comment-17166507 ] Leonid Ilyevsky edited comment on FLINK-18733 at 7/28/20, 3:44 PM: --- Till, Just now I found this: https://issues.apache.org/jira/browse/ZOOKEEPER-3590 . So it does look like some problem was introduced in 3.4.14. I guess if you upgrade your Zookeeper client, it will be OK, or at least it will be possible to turn off that "canonicalize" thing. FYI: I use 3.6.1 for my Zookeeper cluster; it is a stable release now. My puzzle now is: how does it work for you? Maybe your network, DNS, etc. is set up differently compared to what I have at work. Question: would you know how I can independently test whether the host is resolvable in that sense? I mean, to test whether it can be "canonicalized"? Or should I configure the hosts differently? There are some other alias names that resolve to the same hosts; maybe I should use them? 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166507#comment-17166507 ] Leonid Ilyevsky commented on FLINK-18733: - Till, Just now I found this: https://issues.apache.org/jira/browse/ZOOKEEPER-3590 . So it does look like some problem was introduced in 3.4.14. I guess if you upgrade your Zookeeper client, it will be OK, or at least it will be possible to turn off that "canonicalize" thing. My puzzle now is: how does it work for you? Maybe your network, DNS, etc. is set up differently compared to what I have at work. Question: would you know how I can independently test whether the host is resolvable in that sense? I mean, to test whether it can be "canonicalized"? Or should I configure the hosts differently? There are some other alias names that resolve to the same hosts; maybe I should use them? -- This message was sent by Atlassian Jira (v8.3.4#803005)
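The question above — how to test independently whether a host can be "canonicalized" — can be probed with plain JDK calls: ZooKeeper 3.4.14's SaslServerPrincipal essentially resolves the address and asks for its canonical host name. The sketch below only mirrors that logic with java.net.InetAddress; it is not ZooKeeper's actual code, and the method name is invented.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Standalone probe for the forward/reverse resolution that ZooKeeper's
// SaslServerPrincipal relies on; mirrors the logic, not the actual code.
class CanonicalizeProbe {
    static String canonicalName(String host) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            // getCanonicalHostName() falls back to the literal IP address
            // when the reverse DNS lookup fails, so compare the result
            // against what you expect for your host.
            return addr.getCanonicalHostName();
        } catch (UnknownHostException e) {
            return null; // forward resolution failed entirely
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " -> " + canonicalName(host));
    }
}
```

Running this with each alias that resolves to the same machine would show which names survive the round trip and could be used in the ZooKeeper connect string.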
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166501#comment-17166501 ] Roman Khachatryan commented on FLINK-17073: --- Thanks for your analysis [~echauchot]. Sure, go ahead! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16866) Make job submission non-blocking
[ https://issues.apache.org/jira/browse/FLINK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166490#comment-17166490 ] Robert Metzger commented on FLINK-16866: After an offline discussion with [~trohrmann], we are proposing to address this issue as follows: Changes to the Dispatcher - As part of the Dispatcher, we'll introduce a Job abstraction that tracks the final step of the job submission: the creation of the job manager. - The job submission REST handler will return immediately after triggering the creation of the job manager. - In this phase, the job will be in an INITIALIZING state. Once the job manager is started, the job is INITIALIZED. - Errors during the creation of the job manager are stored in the new job abstraction, probably in an ArchivedExecutionGraph. Failed submissions need to get evicted eventually. - Calls to get the job status or cancel the job will have to be adapted to this change (so that they return the INITIALIZING state, properly cancel the job manager creation, or fail the request, for example when triggering a savepoint on an initializing job). As a follow-up idea, we could improve the cancellation of the initialization by executing it in a thread controlled by the "Job" abstraction, so that we can interrupt the thread (cooperation is not guaranteed). Changes to other components: - The web UI will need to handle jobs in the INITIALIZING state differently: Initially the job shall be listed in the "Running" section, but it won't be clickable OR it will show an almost-empty page, explaining that it is still pending submission. Submission errors should be accessible in the UI. - The CliFrontend will keep its current semantics: After the submission has succeeded, it will periodically query the REST endpoint until the initialization is finished (or failed). - The ExecutionEnvironment.executeAsync() call will only return a JobClient once the job manager has been initialized. 
> Make job submission non-blocking > > > Key: FLINK-16866 > URL: https://issues.apache.org/jira/browse/FLINK-16866 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.2, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Critical > Fix For: 1.12.0 > > > Currently, Flink waits to acknowledge a job submission until the > corresponding {{JobManager}} has been created. Since its creation also > involves the creation of the {{ExecutionGraph}} and potential FS operations, > it can take a bit of time. If the user has configured a too low > {{web.timeout}}, the submission can time out only reporting a > {{TimeoutException}} to the user. > I propose to change the notion of job submission slightly. Instead of waiting > until the {{JobManager}} has been created, a job submission is complete once > all job relevant files have been uploaded to the {{Dispatcher}} and the > {{Dispatcher}} has been told about it. Creating the {{JobManager}} will then > belong to the actual job execution. Consequently, if problems occur while > creating the {{JobManager}} it will result into a job failure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
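The proposed CliFrontend behavior above — submit, then poll the REST endpoint until initialization finishes or fails — could be sketched roughly as below. The JobStatus values and the status fetcher are stand-ins for illustration; beyond INITIALIZING, the state names and polling API are assumptions, not the implemented interface.

```java
import java.util.function.Supplier;

// Sketch of the proposed client-side wait: poll the job status until it
// leaves INITIALIZING. Status values and the fetcher are illustrative.
class SubmissionWaiter {
    enum JobStatus { INITIALIZING, RUNNING, FAILED }

    // statusFetcher stands in for a call to the REST endpoint.
    static JobStatus awaitInitialized(Supplier<JobStatus> statusFetcher,
                                      long pollIntervalMillis) throws InterruptedException {
        JobStatus status = statusFetcher.get();
        while (status == JobStatus.INITIALIZING) {
            Thread.sleep(pollIntervalMillis); // back off between REST calls
            status = statusFetcher.get();
        }
        return status; // RUNNING on success, FAILED if JobManager creation failed
    }
}
```

This keeps the submission call itself non-blocking while preserving the CLI's current "wait until the job is really up" semantics.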
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166485#comment-17166485 ] tartarus commented on FLINK-18663: -- [~chesnay] There is a separate monitoring service that requests /jobs/overview every 5 seconds, and the timeout is 5 seconds too. Then it will close the client. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE caused by FlinkHttpObjectAggregator being null. > When REST throws an exception, it runs this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > But flinkHttpObjectAggregator is null in some cases, so this will throw an NPE. This method is called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync does not complete. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
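A defensive fix for the NPE described above would guard the pipeline lookup and fall back to a default when the aggregator has already been removed from the pipeline (e.g. during handler shutdown). The sketch below shows only the guard; both constant values are invented for the sketch, and this is not the actual patch.

```java
// Sketch of guarding the lookup that NPEs when FlinkHttpObjectAggregator has
// already been removed from the channel pipeline. Both constant values are
// invented here; they do not reflect Flink's real configuration.
class MaxLengthGuard {
    static final int OTHER_RESP_PAYLOAD_OVERHEAD = 256;        // value invented
    static final int DEFAULT_MAX_CONTENT_LENGTH = 104_857_600; // value invented

    // Pass null when ctx.pipeline().get(FlinkHttpObjectAggregator.class)
    // returns nothing; previously that result was dereferenced unguarded.
    static int safeMaxErrorLength(Integer aggregatorMaxContentLength) {
        int maxContentLength = aggregatorMaxContentLength != null
                ? aggregatorMaxContentLength
                : DEFAULT_MAX_CONTENT_LENGTH;
        return maxContentLength - OTHER_RESP_PAYLOAD_OVERHEAD;
    }
}
```

With such a guard, handleException can still build and send a truncated error response even while the pipeline is being torn down, so the returned future completes and the InFlightRequestTracker can be cleared.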
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166483#comment-17166483 ] Till Rohrmann commented on FLINK-18663: --- [~tartarus] what I meant is to check whether {{AbstractHandler.terminationFuture}} is not {{null}}. If this is the case, then the handler is being shut down. I agree with Chesnay that we might have to look into other explanations for the described problem. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE caused by FlinkHttpObjectAggregator being null. > When the REST handler throws an exception, it runs this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > 
log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > but flinkHttpObjectAggregator is null in some cases, so this throws an NPE. This method is > called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync does not complete. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
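The NPE above comes from {{ctx.pipeline().get(FlinkHttpObjectAggregator.class)}} returning null once the channel has been torn down. The following is only a sketch of the defensive direction discussed in this thread, not the actual Flink fix; the names {{maxErrorPayloadLength}} and {{DEFAULT_MAX_CONTENT_LENGTH}} are illustrative assumptions, and Netty types are deliberately left out so the snippet is self-contained:

```java
// Hypothetical sketch (not the actual Flink fix): tolerate the aggregator
// having already been removed from the Netty pipeline instead of
// dereferencing it unconditionally.
public class ErrorResponseLength {
    static final int OTHER_RESP_PAYLOAD_OVERHEAD = 256;
    // Illustrative fallback limit used when the pipeline is already torn down.
    static final int DEFAULT_MAX_CONTENT_LENGTH = 104857600;

    /**
     * aggregatorMaxContentLength is null in the situation that caused the NPE:
     * the FlinkHttpObjectAggregator is no longer present in the pipeline.
     */
    static int maxErrorPayloadLength(Integer aggregatorMaxContentLength) {
        int max = (aggregatorMaxContentLength != null)
                ? aggregatorMaxContentLength
                : DEFAULT_MAX_CONTENT_LENGTH;
        return max - OTHER_RESP_PAYLOAD_OVERHEAD;
    }

    public static void main(String[] args) {
        // Normal case: aggregator present in the pipeline.
        System.out.println(maxErrorPayloadLength(1024));
        // Pipeline already cleaned up: no NPE, fall back to the default.
        System.out.println(maxErrorPayloadLength(null));
    }
}
```

With a guard like this, a late client disconnect would still produce an error response and, crucially, still complete the future that the InFlightRequestTracker waits on.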
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166480#comment-17166480 ] Etienne Chauchot commented on FLINK-17073: -- [~roman_khachatryan] When [~SleePy] and I discussed in [the design doc|https://docs.google.com/document/d/1q0y0aWlJMoUWNW7jjsM8uWfHsy2dM6YmmcmhpQzgLMA/edit?usp=sharing], the idea was to wait until the last checkpoint was cleaned before accepting another (that is what we called making cleaning part of checkpoint processing). Thus, checking only the existing number of pending checkpoints was enough (no need for a new queue) to foresee a flood of checkpoints to clean. But the solution you propose (managing the queue of checkpoints to clean and monitoring its size) seems even simpler to me: it avoids having to sync normal checkpointing and checkpoint cleaning. As you said, when we choose a checkpoint trigger request to execute (*CheckpointRequestDecider.chooseRequestToExecute*), we can drop new checkpoint requests when there are too many checkpoints to clean and thus regulate the whole checkpointing system. Syncing cleaning and checkpointing might not be necessary for this regulation, you're right. If you don't mind, I'll go for this implementation proposal in the design doc. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. 
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool, which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
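The pool-sizing difference described above can be illustrated with a self-contained sketch (plain {{java.util.concurrent}}, not Flink code; the class and method names are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

public class CleanupExecutorComparison {
    /** Pre-FLINK-11851 setup: a ForkJoinPool with a max parallelism of 64,
     *  so up to 64 checkpoint-discard tasks could run concurrently. */
    static int forkJoinParallelism() {
        ForkJoinPool pool = new ForkJoinPool(64);
        int p = pool.getParallelism();
        pool.shutdown();
        return p;
    }

    /** Post-FLINK-11851 ioExecutor: a fixed pool with one thread per CPU
     *  core. On a typical JobManager host this is far fewer than 64, so
     *  discard tasks queue up in memory instead of running, which is
     *  consistent with the reported OOM. */
    static int ioExecutorThreads() {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        pool.shutdown();
        return cores;
    }

    public static void main(String[] args) {
        System.out.println("old ForkJoinPool parallelism: " + forkJoinParallelism());
        System.out.println("new ioExecutor threads: " + ioExecutorThreads());
    }
}
```

The sketch only shows the concurrency bound; whether the smaller pool is actually the cause is the suspicion the ticket says still needs validating.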
[jira] [Comment Edited] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166478#comment-17166478 ] Till Rohrmann edited comment on FLINK-18733 at 7/28/20, 2:56 PM: - I tested the scenario with running a ZooKeeper cluster on the same machine where I also started Flink (but not the same process). I did not configure SASL and it worked when using a resolvable name. When configuring an unresolvable name, I saw the exact same stack traces as you did (including the SASL part). Hence, I would assume that ZooKeeper/Curator also used this code path when running the test with a resolvable name. Have you checked whether there is a ZooKeeper issue for a similar problem? was (Author: till.rohrmann): I tested the scenario with running a ZooKeeper cluster on the same machine where I also started Flink (but not the same process). I did not configure SASL and it worked when using a resolvable name. When configuring an unresolvable name, I saw the exact same stack traces as you did (including the SASL part). Hence, I would assume that ZooKeeper/Curator also used this code path when running the test with a resolvable name. > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. 
First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After few attempts connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. > org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166478#comment-17166478 ] Till Rohrmann commented on FLINK-18733: --- I tested the scenario with running a ZooKeeper cluster on the same machine where I also started Flink (but not the same process). I did not configure SASL and it worked when using a resolvable name. When configuring an unresolvable name, I saw the exact same stack traces as you did (including the SASL part). Hence, I would assume that ZooKeeper/Curator also used this code path when running the test with a resolvable name. > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. 
First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After few attempts connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. > org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166467#comment-17166467 ] Chesnay Schepler commented on FLINK-18663: -- So far the only explanation that I could find for a pipeline returning null is that the channel was already closed. Our current assumption was that this happened because the RestServerEndpoint is shutting down. From the logs you gave us it appears that the shutdown is initiated 2 minutes after the NPE; this doesn't seem to match our assumption. I'm wondering what would happen if the netty connection were to be closed by the client. We know that the request processing is delayed by 10 seconds; if the client aborts the connection in between, maybe netty starts cleaning up the pipeline. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE caused by FlinkHttpObjectAggregator being null. > When the REST handler throws an exception, it runs this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if 
(log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > but flinkHttpObjectAggregator is null in some cases, so this throws an NPE. This method is > called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync does not complete. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-16866) Make job submission non-blocking
[ https://issues.apache.org/jira/browse/FLINK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-16866: - Assignee: Till Rohrmann (was: Robert Metzger) > Make job submission non-blocking > > > Key: FLINK-16866 > URL: https://issues.apache.org/jira/browse/FLINK-16866 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.2, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Critical > Fix For: 1.12.0 > > > Currently, Flink waits to acknowledge a job submission until the > corresponding {{JobManager}} has been created. Since its creation also > involves the creation of the {{ExecutionGraph}} and potential FS operations, > it can take a bit of time. If the user has configured a too low > {{web.timeout}}, the submission can time out, only reporting a > {{TimeoutException}} to the user. > I propose to change the notion of job submission slightly. Instead of waiting > until the {{JobManager}} has been created, a job submission is complete once > all job-relevant files have been uploaded to the {{Dispatcher}} and the > {{Dispatcher}} has been told about it. Creating the {{JobManager}} will then > belong to the actual job execution. Consequently, if problems occur while > creating the {{JobManager}} it will result in a job failure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
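The proposed submission flow above can be sketched in a few lines of plain Java (illustrative names, not Flink's API): acknowledge as soon as the job files are persisted, and run the potentially slow JobManager creation afterwards, so a failure there surfaces as a job failure rather than a submission timeout.

```java
import java.util.concurrent.CompletableFuture;

public class NonBlockingSubmission {
    enum Ack { SUBMITTED }

    /** Sketch: the acknowledgement no longer waits on JobManager creation. */
    static CompletableFuture<Ack> submit(Runnable persistJobFiles,
                                         Runnable createJobManager) {
        // Cheap, synchronous part: upload/persist the job-relevant files.
        persistJobFiles.run();
        // Slow part (ExecutionGraph creation, FS operations) runs
        // asynchronously; its outcome no longer delays the acknowledgement,
        // and a failure here would be reported as a job failure.
        CompletableFuture.runAsync(createJobManager);
        return CompletableFuture.completedFuture(Ack.SUBMITTED);
    }

    public static void main(String[] args) {
        Ack ack = submit(() -> {}, () -> {}).join();
        System.out.println("submission acknowledged: " + ack);
    }
}
```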
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166458#comment-17166458 ] Leonid Ilyevsky commented on FLINK-18733: - Hi [~trohrmann] The addresses I am using are perfectly resolvable, if you are talking about IP address resolution on the Linux level. In fact, it is the same set of machines. I am running the Flink cluster on the same three machines where I am running my Zookeeper cluster. The critical point is that when I reverted back to version 1.10.0, the problem disappeared. I don't think this problem has anything to do with the Linux hosts being resolved or not. As you can see in the error message, it happens inside a routine related to SASL, which I am not using and don't need. You said it works when you use Flink's ZooKeeper support locally. What exactly is it? A Zookeeper running inside Flink? Then it fails when you configured it with an "unresolvable {{high-availability.zookeeper.quorum}} address". Did you actually use unresolvable hosts, so you could not even ping them? Obviously such a test would fail, no doubt. Could you please perform the test closer to what I am doing? Run a simple Zookeeper cluster on the same machines where you run Flink. I actually found the code where the exception is thrown: [https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/SaslServerPrincipal.java] . I guess this is not the exact version that you are using, so the line numbers might differ. The first thing I noticed: the comment says "Get the name of the server principal for a SASL client. This is visible for *testing purposes*". So this is supposed to be called only during tests? Not sure what that means. Then, inside the getServerPrincipal method, it retrieves the "canonicalize" flag, and apparently it got the value "true". Maybe this is the source of the issue? Maybe in Flink 1.10.0 it was "false" and there was no problem? 
I hope there is some workaround, like setting a system property to make that flag false. Thanks, Leonid > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After a few attempts connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. 
> org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18627) Get unmatched filter method records to side output
[ https://issues.apache.org/jira/browse/FLINK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166445#comment-17166445 ] Aljoscha Krettek commented on FLINK-18627: -- That will not produce the desired output; {{stream.getSideOutput()}} will only get the side output from the last operation. That's what I was hinting at above: I don't know how we can provide an ergonomic API for the pattern of chaining multiple filters. > Get unmatched filter method records to side output > > > Key: FLINK-18627 > URL: https://issues.apache.org/jira/browse/FLINK-18627 > Project: Flink > Issue Type: Improvement > Components: API / DataStream >Reporter: Roey Shem Tov >Priority: Major > Fix For: 1.12.0 > > > Unmatched records from filter functions should somehow be sent to a side output. > Example: > > {code:java} > datastream > .filter(i->i%2==0) > .sideOutput(oddNumbersSideOutput); > {code} > > > That way we can filter multiple times and send the filtered-out records to our > side output instead of dropping them immediately; it can be useful in many ways. > > What do you think? -- This message was sent by Atlassian Jira (v8.3.4#803005)
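In Flink itself this pattern would be a {{ProcessFunction}} with an {{OutputTag}}. As a library-free sketch of the semantics being requested (class and method names are illustrative), a filter step can return both the matched stream and the "side output" of unmatched records, so each stage keeps its own rejects:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterWithSideOutput {
    /** Partition the input once: elements matching the predicate continue
     *  downstream (key true); unmatched ones land in the side output
     *  (key false) instead of being dropped. */
    static <T> Map<Boolean, List<T>> filterWithSide(List<T> input,
                                                    Predicate<T> keep) {
        return input.stream().collect(Collectors.partitioningBy(keep));
    }

    public static void main(String[] args) {
        Map<Boolean, List<Integer>> parts =
                filterWithSide(List.of(1, 2, 3, 4, 5), i -> i % 2 == 0);
        System.out.println("downstream: " + parts.get(true));   // even numbers
        System.out.println("side output: " + parts.get(false)); // odd numbers
    }
}
```

Aljoscha's point still stands for the Flink API: with chained operators, {{getSideOutput()}} on the final stream only exposes the last operator's tag, so each filter stage would need its own handle, which the explicit return value above makes visible.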
[jira] [Updated] (FLINK-18744) resume from modified savepoint directory: No such file or directory
[ https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao wang updated FLINK-18744: - Description: If I resume a job from a savepoint which was modified by the state processor API, such as loading from /savepoint-path-old and writing to /savepoint-path-new, the job resumes with savepointpath = /savepoint-path-new while throwing an Exception: _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. I think it's an issue because Flink 1.11 uses absolute paths in savepoints and checkpoints, but the state processor API missed this. The job will work well with the new savepoint (whose path is /savepoint-path-new) if I copy all directories except `_metadata` from /savepoint-path-old to /savepoint-path-new. was: If I resume a job from a savepoint which was modified by the state processor API, such as loading from /savepoint-path-old and writing to /savepoint-path-new, the job resumes with savepointpath = /savepoint-path-new while throwing an Exception: _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. I think it's an issue because Flink 1.11 uses absolute paths in savepoints and checkpoints, but the state processor API missed this. > resume from modified savepoint directory: No such file or directory > - > > Key: FLINK-18744 > URL: https://issues.apache.org/jira/browse/FLINK-18744 > Project: Flink > Issue Type: Bug > Components: API / State Processor >Affects Versions: 1.11.0 >Reporter: tao wang >Priority: Major > > If I resume a job from a savepoint which was modified by the state processor API, > such as loading from /savepoint-path-old and writing to /savepoint-path-new, > the job resumes with savepointpath = /savepoint-path-new while throwing an > Exception: > _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. > I think it's an issue because Flink 1.11 uses absolute paths in savepoints > and checkpoints, but the state processor API missed this. 
> The job will work well with the new savepoint (whose path is /savepoint-path-new) > if I copy all directories except `_metadata` from /savepoint-path-old to > /savepoint-path-new. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18744) resume from modified savepoint directory: No such file or directory
[ https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao wang updated FLINK-18744: - Affects Version/s: (was: 1.11.1) 1.11.0 > resume from modified savepoint directory: No such file or directory > - > > Key: FLINK-18744 > URL: https://issues.apache.org/jira/browse/FLINK-18744 > Project: Flink > Issue Type: Bug > Components: API / State Processor >Affects Versions: 1.11.0 >Reporter: tao wang >Priority: Major > > If I resume a job from a savepoint which was modified by the state processor API, > such as loading from /savepoint-path-old and writing to /savepoint-path-new, > the job resumes with savepointpath = /savepoint-path-new while throwing an > Exception: > _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. > I think it's an issue because Flink 1.11 uses absolute paths in savepoints > and checkpoints, but the state processor API missed this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18744) resume from modified savepoint directory: No such file or directory
tao wang created FLINK-18744: Summary: resume from modified savepoint directory: No such file or directory Key: FLINK-18744 URL: https://issues.apache.org/jira/browse/FLINK-18744 Project: Flink Issue Type: Bug Components: API / State Processor Affects Versions: 1.11.1 Reporter: tao wang If I resume a job from a savepoint which was modified by the state processor API, such as loading from /savepoint-path-old and writing to /savepoint-path-new, the job resumes with savepointpath = /savepoint-path-new while throwing an Exception: _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. I think it's an issue because Flink 1.11 uses absolute paths in savepoints and checkpoints, but the state processor API missed this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-18281) Add WindowStagger into all Tumbling and Sliding Windows
[ https://issues.apache.org/jira/browse/FLINK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aljoscha Krettek closed FLINK-18281. Fix Version/s: 1.12.0 Assignee: Teng Hu Resolution: Fixed master: 335c47e11478358e8514e63ca807ea765ed9dd9a > Add WindowStagger into all Tumbling and Sliding Windows > --- > > Key: FLINK-18281 > URL: https://issues.apache.org/jira/browse/FLINK-18281 > Project: Flink > Issue Type: New Feature > Components: API / DataStream >Reporter: Teng Hu >Assignee: Teng Hu >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > Adding the window staggering functionality into *TumblingEventTimeWindows*, > *SlidingProcessingTimeWindows* and *SlidingEventTimeWindows*. > This is a follow-up issue of > [FLINK-12855|https://issues.apache.org/jira/browse/FLINK-12855] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166433#comment-17166433 ] tartarus commented on FLINK-18663: -- [~trohrmann] I may not have fully understood you before, but I'm sure {{AbstractHandler.terminationFuture}} has not completed, because I took a JVM dump. I think [~chesnay] is right; this job has 15000 tasks and GCs frequently, so the TimeoutException is possible. It is still not clear to me why {{FlinkHttpObjectAggregator}} was null. [~trohrmann] [~chesnay] Do you have any suggestions on this issue? > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE caused by FlinkHttpObjectAggregator being null. > When the REST handler throws an exception, it runs this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > 
ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > but flinkHttpObjectAggregator is null in some cases, so this throws an NPE. This method is > called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync does not complete. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166429#comment-17166429 ] Till Rohrmann commented on FLINK-18733: --- Hi [~lilyevsky], I tried Flink's ZooKeeper support locally. At first it worked w/o problems. However, once I've configured the system with an unresolvable {{high-availability.zookeeper.quorum}} address, I could reproduce what you are describing. Hence, could you verify that the node on which you start the Flink processes can actually resolve any of {{nj1dvloglab01.liquidnet.biz:2181,nj1dvloglab02.liquidnet.biz:2181,nj1dvloglab03.liquidnet.biz:2181}}? If the system cannot resolve the address name, then the ZooKeeper client is not able to connect to the ZooKeeper quorum. > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. 
First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After few attempts connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. > org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (FLINK-18743) Modify English link to Chinese link
[ https://issues.apache.org/jira/browse/FLINK-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizheng reopened FLINK-18743: -- > Modify English link to Chinese link > --- > > Key: FLINK-18743 > URL: https://issues.apache.org/jira/browse/FLINK-18743 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: weizheng >Priority: Major > > In > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/concepts/stateful-stream-processing.html,] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/event_timestamp_extractors.html] > Modify English link to Chinese link > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-18743) Modify English link to Chinese link
[ https://issues.apache.org/jira/browse/FLINK-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizheng resolved FLINK-18743. -- Resolution: Fixed https://github.com/apache/flink/pull/13003 > Modify English link to Chinese link > --- > > Key: FLINK-18743 > URL: https://issues.apache.org/jira/browse/FLINK-18743 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: weizheng >Priority: Major > > In > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/concepts/stateful-stream-processing.html,] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/event_timestamp_extractors.html] > Modify English link to Chinese link > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (FLINK-18743) Modify English link to Chinese link
[ https://issues.apache.org/jira/browse/FLINK-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizheng updated FLINK-18743: - Comment: was deleted (was: https://github.com/apache/flink/pull/13003) > Modify English link to Chinese link > --- > > Key: FLINK-18743 > URL: https://issues.apache.org/jira/browse/FLINK-18743 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: weizheng >Priority: Major > > In > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/concepts/stateful-stream-processing.html,] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/event_timestamp_extractors.html] > Modify English link to Chinese link > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166396#comment-17166396 ] wgcn commented on FLINK-18715: -- [~trohrmann] it's not suitable for deployment scenarios where we don't have CPU isolation > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
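The approach sketched in the issue description — sampling {{getProcessCpuTime}} against the JVM uptime over an interval — can be expressed as follows. This is a minimal illustration, not Flink's metric implementation; it relies on the {{com.sun.management.OperatingSystemMXBean}} cast (available on HotSpot JVMs, where {{getProcessCpuTime}} may return -1 if unsupported):

```java
import java.lang.management.ManagementFactory;

public class ProcessCpuUsage {

    private static final com.sun.management.OperatingSystemMXBean OS =
            (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    private long lastCpuTimeNanos = OS.getProcessCpuTime();
    private long lastUptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();

    // CPU usage of this JVM since the previous call, as a fraction of ONE core.
    // As the ticket suggests, divide by the container's core count to get a 0..1 ratio.
    public double sample() {
        long cpuNanos = OS.getProcessCpuTime();
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        long cpuDelta = cpuNanos - lastCpuTimeNanos;                    // CPU time burned, ns
        long wallDelta = (uptimeMillis - lastUptimeMillis) * 1_000_000L; // wall time, ns
        lastCpuTimeNanos = cpuNanos;
        lastUptimeMillis = uptimeMillis;
        if (cpuDelta < 0 || wallDelta <= 0) {
            return 0.0; // unsupported platform or zero elapsed time
        }
        return (double) cpuDelta / wallDelta;
    }
}
```

Calling {{sample()}} periodically (e.g. from a metric gauge) yields values such as 0.8 on a single core, or up to N on an N-core host.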
[jira] [Commented] (FLINK-18742) Configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166391#comment-17166391 ] Matt Wang commented on FLINK-18742: --- I have fixed this in our company's internal version; could you assign this issue to me? > Configuration args do not take effect at client > --- > > Key: FLINK-18742 > URL: https://issues.apache.org/jira/browse/FLINK-18742 > Project: Flink > Issue Type: Improvement > Components: Command Line Client >Affects Versions: 1.11.1 >Reporter: Matt Wang >Priority: Major > > The configuration args from command line will not work at client, for > example, the job sets the {color:#505f79}_classloader.resolve-order_{color} > to _{color:#505f79}parent-first,{color}_ it can work at TaskManager, but > Client doesn't. The *FlinkUserCodeClassLoaders* will be created before calling the method of > _{color:#505f79}getEffectiveConfiguration(){color}_ at > {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the > _{color:#505f79}Configuration{color}_ used by > _{color:#505f79}PackagedProgram{color}_ does not include Configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18743) Modify English link to Chinese link
weizheng created FLINK-18743: Summary: Modify English link to Chinese link Key: FLINK-18743 URL: https://issues.apache.org/jira/browse/FLINK-18743 Project: Flink Issue Type: Improvement Components: Documentation Reporter: weizheng In [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/concepts/stateful-stream-processing.html,] [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html] [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html] [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/event_timestamp_extractors.html] Modify English link to Chinese link -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18742) Configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Wang updated FLINK-18742: -- Description: The configuration args from command line will not work at client, for example, the job sets the {color:#505f79}_classloader.resolve-order_{color} to _{color:#505f79}parent-first,{color}_ it can work at TaskManager, but Client doesn't. The *FlinkUserCodeClassLoaders* will be created before calling the method of _{color:#505f79}getEffectiveConfiguration(){color}_ at {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the _{color:#505f79}Configuration{color}_ used by _{color:#505f79}PackagedProgram{color}_ does not include Configuration args. was: The configuration args from command line will not work at client, for example, the job sets the `classloader.resolve-order` to `parent-first`, it can work at TaskManager, but Client doesn't. The `FlinkUserCodeClassLoaders` will be created before calling the method of `getEffectiveConfiguration()` at `org.apache.flink.client.cli.CliFrontend`, so the `Configuration` used by `PackagedProgram` does not include Configuration args. > Configuration args do not take effect at client > --- > > Key: FLINK-18742 > URL: https://issues.apache.org/jira/browse/FLINK-18742 > Project: Flink > Issue Type: Improvement > Components: Command Line Client >Affects Versions: 1.11.1 >Reporter: Matt Wang >Priority: Major > > The configuration args from command line will not work at client, for > example, the job sets the {color:#505f79}_classloader.resolve-order_{color} > to _{color:#505f79}parent-first,{color}_ it can work at TaskManager, but > Client doesn't. 
> The *FlinkUserCodeClassLoaders* will be created before calling the method of > _{color:#505f79}getEffectiveConfiguration(){color}_ at > {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the > _{color:#505f79}Configuration{color}_ used by > _{color:#505f79}PackagedProgram{color}_ does not include Configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18742) Configuration args do not take effect at client
Matt Wang created FLINK-18742: - Summary: Configuration args do not take effect at client Key: FLINK-18742 URL: https://issues.apache.org/jira/browse/FLINK-18742 Project: Flink Issue Type: Improvement Components: Command Line Client Affects Versions: 1.11.1 Reporter: Matt Wang The configuration args from command line will not work at client, for example, the job sets the `classloader.resolve-order` to `parent-first`, it can work at TaskManager, but Client doesn't. The `FlinkUserCodeClassLoaders` will be created before calling the method of `getEffectiveConfiguration()` at `org.apache.flink.client.cli.CliFrontend`, so the `Configuration` used by `PackagedProgram` does not include Configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
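The ordering problem described in FLINK-18742 (the user-code classloader is built before the command-line overrides are merged in) can be illustrated with a simplified sketch. All names below are hypothetical stand-ins, not the actual {{CliFrontend}} code; the point is only that the effective configuration must be computed before anything that consumes it is constructed:

```java
import java.util.HashMap;
import java.util.Map;

public class EffectiveConfigOrdering {

    // Hypothetical stand-in for Flink's Configuration: base flink-conf.yaml values
    // merged with command-line overrides. The fix implied by the ticket is to call
    // this BEFORE constructing FlinkUserCodeClassLoaders / PackagedProgram.
    static Map<String, String> effectiveConfiguration(Map<String, String> base,
                                                      Map<String, String> cliOverrides) {
        Map<String, String> merged = new HashMap<>(base);
        merged.putAll(cliOverrides); // command line wins over flink-conf.yaml
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("classloader.resolve-order", "child-first"); // default in flink-conf.yaml
        Map<String, String> cli = new HashMap<>();
        cli.put("classloader.resolve-order", "parent-first"); // override from the command line
        Map<String, String> effective = effectiveConfiguration(base, cli);
        // Only now should the user-code classloader be created from `effective`;
        // creating it earlier sees "child-first", which is the reported bug.
        System.out.println(effective.get("classloader.resolve-order"));
    }
}
```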
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166374#comment-17166374 ] Zhilong Hong commented on FLINK-18496: -- I think in fact the plugin is conflict with the excerpt rather than the {{_post}} folder. 1. After upgrading to Jekyll 4.0.1, I get errors: [https://gist.github.com/Thesharing/a4681dcef6229903088fb9eb49eaea6f] . The build stuck in endless loop when it comes to: {code:bash} /srv/flink-web/.rubydeps/ruby/2.5.0/gems/jekyll-multiple-languages-2.0.3/lib/jekyll-multiple-languages/multilang.rb:55:in `append_data_for_liquid' /srv/flink-web/.rubydeps/ruby/2.5.0/gems/jekyll-multiple-languages-2.0.3/lib/jekyll-multiple-languages/document.rb:71:in `to_liquid' {code} The function {{append_data_for_liquid}} calls {{deep_merge_hashes!}} in jekyll/utils.rb, and the implementation of {{deep_merge_hashes!}} is different between Jekyll 3 and 4. In Jekyll 3: {code:ruby} def deep_merge_hashes!(target, overwrite) overwrite.each_key do |key| if overwrite[key].is_a? Hash and target[key].is_a? Hash target[key] = Utils.deep_merge_hashes(target[key], overwrite[key]) next end target[key] = overwrite[key] end if target.default_proc.nil? target.default_proc = overwrite.default_proc end target end {code} In Jekyll 4: {code:ruby} def deep_merge_hashes!(target, overwrite) merge_values(target, overwrite) merge_default_proc(target, overwrite) duplicate_frozen_values(target) target end {code} The loop gets stuck at {{duplicate_frozen_values}}. I try to remove {{duplicate_frozen_values}} and build the doc. The build succeeds. However, the blogs without excerpts in markdown have no automatically generated excerpts in the web page. 2. I renamed the folder {{_posts}} to other names like {{articles}}. The build succeeds. But the reference of {{_posts}} in {{index.md}} is broken. Then I try to upgrade Jekyll to 4.1.1 and setting {{page_excerpts}} to {{true}}, which enables excerpts for every page. 
The build fails like above. So in my opinion, the issue is related to the process of generating excerpts with "jekyll-multiple-language". I'm not familiar with Ruby, and I have no idea how to debug Jekyll with IDE. I think I'm stuck here right now, and I'd really appreciate it if anyone could help me with this issue. > Anchors are not generated based on ZH characters > > > Key: FLINK-18496 > URL: https://issues.apache.org/jira/browse/FLINK-18496 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Zhu Zhu >Assignee: Zhilong Hong >Priority: Major > Labels: starter > > In ZH version pages of flink-web, the anchors are not generated based on ZH > characters. The anchor name would be like 'section-1', 'section-2' if there > is no EN characters. An example can be the links in the navigator of > https://flink.apache.org/zh/contributing/contribute-code.html > This makes it impossible to ref an anchor from the content because the anchor > name might change unexpectedly if a new section is added. > Note that it is a problem for flink-web only. The docs generated from the > flink repo can properly generate ZH anchors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166353#comment-17166353 ] wgcn edited comment on FLINK-18715 at 7/28/20, 11:30 AM: - [~chesnay] it indeed has a lot of system resources metric , we talked about cpu occupation in single flink process was (Author: 1026688210): [~chesnay] it indeed has a lot of system resources , we talked about cpu occupation in single flink process > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166346#comment-17166346 ] Andrey Zagrebin edited comment on FLINK-18646 at 7/28/20, 11:27 AM: merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.11 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout (given there is no real leak), we can discuss it in another issue. was (Author: azagrebin): merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout (given there is no real leak), we can discuss it in another issue. > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
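The failure mode described in FLINK-18646 — a verification step that polls until GC has reclaimed all released memory segments, blocking whichever thread runs it — can be modeled with a small sketch. This is a hypothetical simplification (an {{AtomicLong}} stands in for the memory budget), not Flink's actual {{UnsafeMemoryBudget}} code:

```java
import java.util.concurrent.atomic.AtomicLong;

public class GcWaitSketch {

    // Waits until the reserved budget drops to zero (i.e. GC has run the cleaners
    // of all released segments) or the timeout expires. If slot freeing invokes
    // this on the RPC thread, the sleep/poll loop is what blocks heartbeats.
    static boolean verifyEmpty(AtomicLong reservedBytes, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (reservedBytes.get() > 0) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // segments still not reclaimed
            }
            System.gc(); // nudge the collector so cleaners release the budget
            try {
                Thread.sleep(10); // this wait is what stalls the calling thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

The fix referenced in the comment moves this verification off the RPC thread so a slow GC cannot cause the TaskManager to miss its heartbeat.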
[jira] [Closed] (FLINK-18581) Cannot find GC cleaner with java version previous jdk8u72(-b01)
[ https://issues.apache.org/jira/browse/FLINK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zagrebin closed FLINK-18581. --- Resolution: Fixed merged into master by 2f03841d5414f9d4a4b810810317c0250065264e merged into 1.11 by fe95187edfe742b64a1f7147e57856c931ef05c3 > Cannot find GC cleaner with java version previous jdk8u72(-b01) > --- > > Key: FLINK-18581 > URL: https://issues.apache.org/jira/browse/FLINK-18581 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.0 >Reporter: Xintong Song >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > > {{JavaGcCleanerWrapper}} is looking for the package-private method > {{Reference.tryHandlePending}} using reflection. However, the method is first > introduced in the version jdk8u72(-b01). Therefore, if an older version JDK > is used, the method cannot be found and Flink will fail. > See also this [ML > thread|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-GC-Cleaner-Provider-Flink-1-11-0-td36565.html]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166353#comment-17166353 ] wgcn edited comment on FLINK-18715 at 7/28/20, 11:27 AM: - [~chesnay] it indeed has a lot of system resources , we talked about cpu occupation in single flink process was (Author: 1026688210): [~chesnay] it indeed has a lot of system resources , we talked able cpu occupation in single flink process > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166353#comment-17166353 ] wgcn commented on FLINK-18715: -- [~chesnay] it indeed has a lot of system resources , we talked able cpu occupation in single flink process > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zagrebin updated FLINK-18646: Fix Version/s: 1.12.0 > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166346#comment-17166346 ] Andrey Zagrebin edited comment on FLINK-18646 at 7/28/20, 11:20 AM: merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout (given there is no real leak), we can discuss it in another issue. was (Author: azagrebin): merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout, we can discuss it in another issue. > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166346#comment-17166346 ] Andrey Zagrebin commented on FLINK-18646: - merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout, we can discuss it in another issue. > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zagrebin updated FLINK-18646: Release Note: (was: merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3) > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18734) Add documentation for DynamoStreams Consumer CDC
[ https://issues.apache.org/jira/browse/FLINK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166345#comment-17166345 ] Jark Wu commented on FLINK-18734: - Sounds good to me. > Add documentation for DynamoStreams Consumer CDC > > > Key: FLINK-18734 > URL: https://issues.apache.org/jira/browse/FLINK-18734 > Project: Flink > Issue Type: Improvement > Components: Connectors / Kinesis, Documentation >Affects Versions: 1.11.1 >Reporter: Vinay >Priority: Minor > Labels: CDC, documentation > Fix For: 1.12.0, 1.11.2 > > > Flink already supports CDC for DynamoDb - > https://issues.apache.org/jira/browse/FLINK-4582 by reading the data from > DynamoStreams but there is no documentation for the same. Given that Flink > now supports CDC for Debezium as well , we should add the documentation for > Dynamo CDC so that more users can use this feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zagrebin closed FLINK-18646. --- Release Note: merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 Resolution: Fixed > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18627) Get unmatch filter method records to side output
[ https://issues.apache.org/jira/browse/FLINK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166290#comment-17166290 ] Roey Shem Tov edited comment on FLINK-18627 at 7/28/20, 11:16 AM: -- [~aljoscha], The semantic improvment is making it easier to get FilteredRecord into side output. in this example (I changed it a little bit for the semantic of the PR): {code:java} final OutputTag curruptedData = new OutputTag<>("side-output"){}; SingleOutputStreamOperator stream = datastream .filter(i->i%2==0,curruptedData) .filter(i->i%3==0,curruptedData) .filter(i->i%4==0,curruptedData) .filter(i->i%5==0,curruptedData); DataStream curruptedDataStream = stream.getSideOutput(curruptedData); // All data that doesn't divide at (2,3,4,5) together.{code} And in the above case i have a new stream with all the curruptedData. Offcourse the currupted data is only one example, there is more examples i can share. I agree that filter should be filtering data, but it is NiceToHave feature that all the Filtered data will go to given outputTag instead just drop it. Offcourse you can implement it by your self (extending RichFilterFunction and send all the Filtered Data into given output), but I think it is a nice wrapper that will be useful. was (Author: roeyshemtov): [~aljoscha], The semantic improvment is making it easier to get FilteredRecord into side output. in this example (I changed it a little bit for the semantic of the PR): {code:java} final OutputTag curruptedData = new OutputTag("side-output"){}; SingleOutputStreamOperator stream = datastream .filter(i->i%2==0,curruptedData) .filter(i->i%3==0,curruptedData) .filter(i->i%4==0,curruptedData) .filter(i->i%5==0,curruptedData); DataStream curruptedDataStream = stream.getSideOutput(curruptedData); // All data that doesn't divide at (2,3,4,5) together.{code} And in the above case i have a new stream with all the curruptedData. 
Offcourse the currupted data is only one example, there is more examples i can share. I agree that filter should be filtering data, but it is NiceToHave feature that all the Filtered data will go to given outputTag instead just drop it. Offcourse you can implement it by your self (extending RichFilterFunction and send all the Filtered Data into given output), but I think it is a nice wrapper that will be useful. > Get unmatch filter method records to side output > > > Key: FLINK-18627 > URL: https://issues.apache.org/jira/browse/FLINK-18627 > Project: Flink > Issue Type: Improvement > Components: API / DataStream >Reporter: Roey Shem Tov >Priority: Major > Fix For: 1.12.0 > > > Unmatch records to filter functions should send somehow to side output. > Example: > > {code:java} > datastream > .filter(i->i%2==0) > .sideOutput(oddNumbersSideOutput); > {code} > > > That's way we can filter multiple times and send the filtered records to our > side output instead of dropping it immediatly, it can be useful in many ways. > > What do you think? -- This message was sent by Atlassian Jira (v8.3.4#803005)
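As the comment notes, the proposed {{filter(predicate, outputTag)}} behavior can already be implemented by hand. A stand-alone sketch of the routing semantics, using a plain {{Predicate}} and lists in place of Flink's {{ProcessFunction}} and {{OutputTag}} (which are not modeled here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FilterWithSideOutput {

    // Splits the input into matches (the main stream) and rejects (the "side output").
    // In a real Flink job this would be a ProcessFunction that emits rejects via
    // ctx.output(tag, value) instead of dropping them.
    static <T> List<T> filterTo(List<T> input, Predicate<T> keep, List<T> sideOutput) {
        List<T> main = new ArrayList<>();
        for (T value : input) {
            if (keep.test(value)) {
                main.add(value);
            } else {
                sideOutput.add(value); // would otherwise be silently discarded
            }
        }
        return main;
    }

    public static void main(String[] args) {
        List<Integer> rejected = new ArrayList<>();
        List<Integer> evens = filterTo(List.of(1, 2, 3, 4), i -> i % 2 == 0, rejected);
        System.out.println(evens);    // [2, 4]
        System.out.println(rejected); // [1, 3]
    }
}
```

Chaining several such filters against the same reject sink reproduces the "collect all corrupted data" pattern from the comment's example.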
[jira] [Commented] (FLINK-18734) Add documentation for DynamoStreams Consumer CDC
[ https://issues.apache.org/jira/browse/FLINK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166340#comment-17166340 ] Vinay commented on FLINK-18734: --- [~jark] yes, you are right, my suggestion is to only add it to the documentation, it could be under Kinesis Connector - [https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/kinesis.html] or even better we can have a CDC section in the document in which we can add supported tools by Flink > Add documentation for DynamoStreams Consumer CDC > > > Key: FLINK-18734 > URL: https://issues.apache.org/jira/browse/FLINK-18734 > Project: Flink > Issue Type: Improvement > Components: Connectors / Kinesis, Documentation >Affects Versions: 1.11.1 >Reporter: Vinay >Priority: Minor > Labels: CDC, documentation > Fix For: 1.12.0, 1.11.2 > > > Flink already supports CDC for DynamoDb - > https://issues.apache.org/jira/browse/FLINK-4582 by reading the data from > DynamoStreams but there is no documentation for the same. Given that Flink > now supports CDC for Debezium as well , we should add the documentation for > Dynamo CDC so that more users can use this feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-16866) Make job submission non-blocking
[ https://issues.apache.org/jira/browse/FLINK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger reassigned FLINK-16866: -- Assignee: Robert Metzger > Make job submission non-blocking > > > Key: FLINK-16866 > URL: https://issues.apache.org/jira/browse/FLINK-16866 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.2, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Robert Metzger >Priority: Critical > Fix For: 1.12.0 > > > Currently, Flink waits to acknowledge a job submission until the > corresponding {{JobManager}} has been created. Since its creation also > involves the creation of the {{ExecutionGraph}} and potential FS operations, > it can take a bit of time. If the user has configured too low a > {{web.timeout}}, the submission can time out, reporting only a > {{TimeoutException}} to the user. > I propose to change the notion of job submission slightly. Instead of waiting > until the {{JobManager}} has been created, a job submission is complete once > all job relevant files have been uploaded to the {{Dispatcher}} and the > {{Dispatcher}} has been told about it. Creating the {{JobManager}} will then > belong to the actual job execution. Consequently, if problems occur while > creating the {{JobManager}} it will result in a job failure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
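The proposed split can be sketched in plain Java (names like `uploadJobFiles` and `createJobManager` are hypothetical placeholders, not Flink API): the submission is acknowledged once the files reach the Dispatcher, while the expensive JobManager creation runs asynchronously, so its failures surface as job failures rather than submission timeouts.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: acknowledge submission after upload; detach the
// slow JobManager/ExecutionGraph creation from the submission path.
class NonBlockingSubmission {
    static CompletableFuture<String> submitJob(String jobGraph) {
        uploadJobFiles(jobGraph);                                      // fast: files reach the Dispatcher
        CompletableFuture.runAsync(() -> createJobManager(jobGraph));  // slow part runs asynchronously
        return CompletableFuture.completedFuture("SUBMITTED");         // ack immediately
    }

    static void uploadJobFiles(String jobGraph) {
        // hand job-relevant files to the Dispatcher (blob store upload)
    }

    static void createJobManager(String jobGraph) {
        // ExecutionGraph construction, FS operations, etc.; a failure here
        // would be reported as a job failure, not a submission timeout.
    }
}
```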
[jira] [Commented] (FLINK-17260) StreamingKafkaITCase failure on Azure
[ https://issues.apache.org/jira/browse/FLINK-17260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166330#comment-17166330 ] Jiangjie Qin commented on FLINK-17260: -- I don't have a clue either. From the phenomenon itself, a wild guess is that the aggregation value 14 is somehow doubled. * The expected final result 27 comes from the aggregation of three events: 5, 9, 13. * The error always has a 41, which is 14 greater than the expected result. * There are two possibilities for this to happen: ** There are duplicate messages in Kafka. ** The state got corrupted. * If there are duplicate messages, then 5 and 9 must both be duplicated. In that case, the output sequence should be 14, 19, 28, 41. So there should also be a 19 and a 28 before 41 is emitted. * So mathematically speaking, it seems that the aggregated value of 14 is somehow doubled. I'll try to reproduce this locally with some logging added. > StreamingKafkaITCase failure on Azure > - > > Key: FLINK-17260 > URL: https://issues.apache.org/jira/browse/FLINK-17260 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Tests >Affects Versions: 1.11.0, 1.12.0 >Reporter: Roman Khachatryan >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.12.0 > > > [https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_apis/build/builds/7544/logs/165] > > {code:java} > 2020-04-16T00:12:32.2848429Z [INFO] Running > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-04-16T00:14:47.9100927Z [ERROR] Tests run: 3, Failures: 1, Errors: 0, > Skipped: 0, Time elapsed: 135.621 s <<< FAILURE! - in > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-04-16T00:14:47.9103036Z [ERROR] testKafka[0: > kafka-version:0.10.2.0](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) > Time elapsed: 46.222 s <<< FAILURE! 
> 2020-04-16T00:14:47.9104033Z java.lang.AssertionError: > expected:<[elephant,27,64213]> but was:<[]> > 2020-04-16T00:14:47.9104638Zat org.junit.Assert.fail(Assert.java:88) > 2020-04-16T00:14:47.9105148Zat > org.junit.Assert.failNotEquals(Assert.java:834) > 2020-04-16T00:14:47.9105701Zat > org.junit.Assert.assertEquals(Assert.java:118) > 2020-04-16T00:14:47.9106239Zat > org.junit.Assert.assertEquals(Assert.java:144) > 2020-04-16T00:14:47.9107177Zat > org.apache.flink.tests.util.kafka.StreamingKafkaITCase.testKafka(StreamingKafkaITCase.java:162) > 2020-04-16T00:14:47.9107845Zat > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-04-16T00:14:47.9108434Zat > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-04-16T00:14:47.9109318Zat > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-04-16T00:14:47.9109914Zat > java.lang.reflect.Method.invoke(Method.java:498) > 2020-04-16T00:14:47.9110434Zat > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-04-16T00:14:47.9110985Zat > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-04-16T00:14:47.9111548Zat > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-04-16T00:14:47.9112083Zat > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-04-16T00:14:47.9112629Zat > org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-04-16T00:14:47.9113145Zat > org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-04-16T00:14:47.9113637Zat > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-04-16T00:14:47.9114072Zat > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-04-16T00:14:47.9114490Zat > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-04-16T00:14:47.9115256Zat > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-04-16T00:14:47.9115791Zat > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-04-16T00:14:47.9116292Zat > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-04-16T00:14:47.9116736Zat > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-04-16T00:14:47.9117779Zat > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-04-16T00:14:47.9118274Zat > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-04-16T00:14:47.9118766Zat > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-04-16T00:14:47.9119204Zat > org.junit.runners.ParentR
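The arithmetic behind that reasoning can be checked directly (a quick sanity sketch, not Flink code): the running totals of the clean event sequence end at 27, while duplicating 5 and 9 would produce the intermediate values 19 and 28 before 41 ever appears.

```java
// Running totals as each event is folded into the aggregate, mirroring the
// per-event output of the aggregation discussed in the comment.
class AggregationCheck {
    static int[] prefixSums(int[] events) {
        int[] sums = new int[events.length];
        int total = 0;
        for (int i = 0; i < events.length; i++) {
            total += events[i];
            sums[i] = total;
        }
        return sums;
    }
}
```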
[jira] [Commented] (FLINK-18712) Flink RocksDB statebackend memory leak issue
[ https://issues.apache.org/jira/browse/FLINK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166322#comment-17166322 ] Farnight commented on FLINK-18712: -- [~yunta], below is some testing information based on a simple job. Please help check, thanks a lot! Flink configs: for the Flink cluster, we use session-cluster mode. version: 1.10 TM configs: state.backend.rocksdb.memory.managed set to `true`. Our k8s pod has 31G memory; managed memory is set to 10G, heap size to 15G; other settings keep the defaults. Job: # write a dummy source function to emit events in a for/while loop # use the default SessionWindow with a gap of 30 minutes # run the job a few times # monitor the k8s pod memory working set usage with cadvisor Case 1: when running the job on k8s (jm/tm inside a pod container), the memory working set keeps rising; although the job is stopped, the working set does not decrease. Eventually the tm process is killed by the oom-killer and restarted (pid changed), after which the memory working set is reset. Case 2: when running the job on my local machine (MacBook Pro) without a k8s env, it does not have this issue. > Flink RocksDB statebackend memory leak issue > - > > Key: FLINK-18712 > URL: https://issues.apache.org/jira/browse/FLINK-18712 > Project: Flink > Issue Type: Bug > Components: Runtime / State Backends >Affects Versions: 1.10.0 >Reporter: Farnight >Priority: Critical > > When using RocksDB as our statebackend, we found it leads to a memory leak > when restarting a job (manually or in the recovery case). > > How to reproduce: > # increase the RocksDB blockcache size (e.g. 1G); it is easier to monitor and > reproduce. > # start a job using the RocksDB statebackend. > # when the RocksDB blockcache reaches its maximum size, restart the job, and > monitor the memory usage (k8s pod working set) of the TM. > # go through steps 2-3 a few more times; the memory will keep rising. > > Any solution or suggestion for this? Thanks!
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18740) Support rate limiter in the universal kafka connector
Truong Duc Kien created FLINK-18740: --- Summary: Support rate limiter in the universal kafka connector Key: FLINK-18740 URL: https://issues.apache.org/jira/browse/FLINK-18740 Project: Flink Issue Type: Improvement Components: Connectors / Kafka Affects Versions: 1.11.1 Reporter: Truong Duc Kien Currently, a rate limiter is only available for the Kafka 0.10 connector, but not for the universal connector. -- This message was sent by Atlassian Jira (v8.3.4#803005)
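For context, the 0.10 connector's rate limiting is a Guava-style token bucket behind a rate-limiter interface. A minimal self-contained sketch of that idea in plain Java (an illustration only, not the connector's actual implementation):

```java
// Minimal token-bucket sketch: permits refill continuously at
// `permitsPerSecond`, and acquire() blocks until a permit is available.
class SimpleRateLimiter {
    private final double permitsPerSecond;
    private double availablePermits;
    private long lastRefillNanos;

    SimpleRateLimiter(double permitsPerSecond) {
        this.permitsPerSecond = permitsPerSecond;
        this.availablePermits = permitsPerSecond; // start with one second's budget
        this.lastRefillNanos = System.nanoTime();
    }

    synchronized void acquire(int permits) throws InterruptedException {
        refill();
        while (availablePermits < permits) {
            Thread.sleep(1); // wait for permits to accumulate
            refill();
        }
        availablePermits -= permits;
    }

    private void refill() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // cap the burst at one second's worth of permits
        availablePermits = Math.min(permitsPerSecond, availablePermits + elapsedSeconds * permitsPerSecond);
        lastRefillNanos = now;
    }
}
```

In a Kafka consumer loop, `acquire(bytesRead)` before handing records downstream is the usual place such a limiter is invoked.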
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166303#comment-17166303 ] Chesnay Schepler commented on FLINK-18715: -- [~1026688210] Flink has various opt-in metrics for system resources; have you looked at the [documentation|https://ci.apache.org/projects/flink/flink-docs-release-1.11/monitoring/metrics.html#system-resources]? > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > Add a CPU usage metric to the Flink process so that users can determine > whether their job is IO bound or CPU bound and increase/decrease the CPU > cores in the container (k8s, yarn) accordingly. If necessary, you can assign > it to me. My idea is to calculate the CPU usage ratio using > ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime() over a > period of time. This yields a value for a single-CPU-core environment, and > users can derive the usage ratio by dividing by the number of the > container's CPU cores. -- This message was sent by Atlassian Jira (v8.3.4#803005)
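A minimal sketch of the calculation proposed in the ticket (pure JDK; it relies on the HotSpot-specific com.sun.management.OperatingSystemMXBean, so it returns -1 where that bean is unavailable):

```java
import java.lang.management.ManagementFactory;

// Sketch of the proposed metric: process CPU time consumed per unit of
// wall-clock time, i.e. the average number of cores the process kept busy
// between two samples.
class ProcessCpuUsage {
    private static long lastCpuTimeNanos = -1;
    private static long lastWallTimeMillis = -1;

    // Returns average cores busy since the previous call, or -1 if
    // unsupported or if there is no previous sample yet.
    static synchronized double sample() {
        java.lang.management.OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof com.sun.management.OperatingSystemMXBean)) {
            return -1;
        }
        long cpuTimeNanos = ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuTime();
        long wallTimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        if (cpuTimeNanos < 0) {
            return -1; // JVM does not support process CPU time
        }
        double usage = -1;
        if (lastCpuTimeNanos >= 0 && wallTimeMillis > lastWallTimeMillis) {
            double cpuDeltaMillis = (cpuTimeNanos - lastCpuTimeNanos) / 1_000_000.0;
            usage = cpuDeltaMillis / (wallTimeMillis - lastWallTimeMillis);
        }
        lastCpuTimeNanos = cpuTimeNanos;
        lastWallTimeMillis = wallTimeMillis;
        return usage;
    }
}
```

Dividing the sampled value by the container's core count then gives the 0-1 utilization ratio described in the ticket.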
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166310#comment-17166310 ] Etienne Chauchot commented on FLINK-17073: -- [~SleePy] sure, I'll update the google doc to add the impl plan. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue, occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool, which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
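The suspected mechanism can be illustrated with a toy model in plain Java (hypothetical numbers, not Flink code): when slow cleanup tasks arrive faster than a small fixed pool can drain them, they accumulate in the executor's unbounded work queue, which is exactly the memory pressure described in the ticket.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Toy model: a burst of slow "discard" tasks submitted to a small fixed
// pool piles up in the executor's unbounded queue instead of completing.
class CleanupBacklog {
    static int queuedAfterBurst(int poolSize, int tasks) throws InterruptedException {
        ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            executor.execute(() -> {
                try {
                    Thread.sleep(50); // stands in for a slow checkpoint discard (FS delete)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int queued = executor.getQueue().size(); // the backlog occupying memory
        executor.shutdownNow();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return queued;
    }
}
```

With a ForkJoinPool-style parallelism of 64 the same burst drains far faster, which is why the switch to a cores-sized FixedThreadPool is a plausible culprit.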
[jira] [Closed] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chesnay Schepler closed FLINK-11127. Resolution: Won't Fix > Make metrics query service establish connection to JobManager > - > > Key: FLINK-11127 > URL: https://issues.apache.org/jira/browse/FLINK-11127 > Project: Flink > Issue Type: Improvement > Components: Deployment / Kubernetes, Runtime / Coordination, Runtime > / Metrics >Affects Versions: 1.7.0, 1.9.2, 1.10.0 >Reporter: Ufuk Celebi >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > As part of FLINK-10247, the internal metrics query service has been separated > into its own actor system. Before this change, the JobManager (JM) queried > TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a > separate connection to the TM metrics query service actor. > In the context of Kubernetes, this is problematic as the JM will typically > *not* be able to resolve the TMs by name, resulting in warnings as follows: > {code} > 2018-12-11 08:32:33,962 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system > [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has > failed, address is now gated for [50] ms. Reason: [Association failed with > [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused > by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve] > {code} > In order to expose the TMs by name in Kubernetes, users require a service > *for each* TM instance, which is not practical. > This currently results in the web UI not being able to display some basic > metrics, such as the number of sent records. You can reproduce this by > following the READMEs in {{flink-container/kubernetes}}. > This worked before, because the JM is typically exposed via a service with a > known name and the TMs establish the connection to it, which the metrics query > service piggybacked on. 
> A potential solution to this might be to let the query service connect to the > JM similar to how the TMs register. > I tagged this ticket as an improvement, but in the context of Kubernetes I > would consider this to be a bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18741) ProcessWindowFunction's process function exception
mzz created FLINK-18741: --- Summary: ProcessWindowFunction's process function exception Key: FLINK-18741 URL: https://issues.apache.org/jira/browse/FLINK-18741 Project: Flink Issue Type: Bug Components: API / DataStream Affects Versions: 1.10.0 Reporter: mzz I use ProcessWindowFunction to calculate PV, but when overriding process, the user-defined state value cannot be read back. Code:
{code:java}
tem.keyBy(x => (x._1, x._2, x._4, x._5, x._6, x._7, x._8))
  .timeWindow(Time.seconds(15 * 60)) // 15 min window
  .process(new ProcessWindowFunction[(String, String, String, String, String, String, String, String, String), CkResult, (String, String, String, String, String, String, String), TimeWindow] {
    var clickCount: ValueState[Long] = _
    var requestCount: ValueState[Long] = _
    var returnCount: ValueState[Long] = _
    var videoCount: ValueState[Long] = _
    var noVideoCount: ValueState[Long] = _

    override def open(parameters: Configuration): Unit = {
      clickCount = getRuntimeContext.getState(new ValueStateDescriptor("clickCount", classOf[Long]))
      requestCount = getRuntimeContext.getState(new ValueStateDescriptor("requestCount", classOf[Long]))
      returnCount = getRuntimeContext.getState(new ValueStateDescriptor("returnCount", classOf[Long]))
      videoCount = getRuntimeContext.getState(new ValueStateDescriptor("videoCount", classOf[Long]))
      noVideoCount = getRuntimeContext.getState(new ValueStateDescriptor("noVideoCount", classOf[Long]))
    }

    override def process(key: (String, String, String, String, String, String, String), context: Context, elements: Iterable[(String, String, String, String, String, String, String, String, String)], out: Collector[CkResult]) = {
      try {
        var clickNum: Long = clickCount.value
        val dateNow = LocalDateTime.now().format(DateTimeFormatter.ofPattern("MMdd")).toLong
        var requestNum: Long = requestCount.value
        var returnNum: Long = returnCount.value
        var videoNum: Long = videoCount.value
        var noVideoNum: Long = noVideoCount.value
        if (requestNum == null) {
          requestNum = 0
        }
        val ecpm = key._7.toDouble.formatted("%.2f").toFloat
        val created_at = getSecondTimestampTwo(new Date)
        elements.foreach(e => {
          if ("adreq".equals(e._3)) {
            requestNum += 1
            println(key._1, requestNum)
          }
        })
        requestCount.update(requestNum)
        println(requestNum, key._1)
        out.collect(CkResult(dateNow, (created_at - getZero_time) / (60 * 15), key._2, key._3, key._4, key._5, key._3 + "_" + key._4 + "_" + key._5, key._6, key._1, requestCount.value, returnCount.value, fill_rate, noVideoCount.value + videoCount.value, expose_rate, clickCount.value, click_rate, ecpm, (noVideoCount.value * ecpm + videoCount.value * ecpm / 1000.toFloat).formatted("%.2f").toFloat, created_at))
      } catch {
        case e: Exception => println(key, e)
      }
    }
  })
{code}
{code:java}
elements.foreach(e => {
  if ("adreq".equals(e._3)) {
    requestNum += 1
    println(key._1, requestNum)
    // The values printed here look like:
    // (key,1)
    // (key,2)
    // (key,3)
  }
})
// But printing outside the loop always shows:
// (key,0)
println(requestNum, key._1)
{code}
Who can help me? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166302#comment-17166302 ] Till Rohrmann commented on FLINK-16510: --- The JavaDoc says "If the shutdown sequence has already been initiated then this method does not wait for any running shutdown hooks or finalizers to finish their work.". > Task manager safeguard shutdown may not be reliable > --- > > Key: FLINK-16510 > URL: https://issues.apache.org/jira/browse/FLINK-16510 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Reporter: Maximilian Michels >Assignee: Maximilian Michels >Priority: Major > Attachments: stack2-1.txt > > > The {{JvmShutdownSafeguard}} does not always succeed but can hang when > multiple threads attempt to shutdown the JVM. Apparently mixing > {{System.exit()}} with ShutdownHooks and forcefully terminating the JVM via > {{Runtime.halt()}} does not play together well: > {noformat} > "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x7fb8e82f2800 > nid=0x5a96 runnable [0x7fb35cffb000] >java.lang.Thread.State: RUNNABLE > at java.lang.Shutdown.$$YJP$$halt0(Native Method) > at java.lang.Shutdown.halt0(Shutdown.java) > at java.lang.Shutdown.halt(Shutdown.java:139) > - locked <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Runtime.halt(Runtime.java:276) > at > org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 > os_prio=0 tid=0x7fb708a7d000 nid=0x5a8a waiting for monitor entry > [0x7fb289d49000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.lang.Shutdown.halt(Shutdown.java:139) > - waiting to lock <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Shutdown.exit(Shutdown.java:213) > - locked <0x00047edb7348> (a java.lang.Class for java.lang.Shutdown) > at 
java.lang.Runtime.exit(Runtime.java:110) > at java.lang.System.exit(System.java:973) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x0006d5e56bd0> (a > java.util.concurrent.ThreadPoolExecutor$Worker) > {noformat} > Note that under this condition the JVM should terminate but it still hangs. 
> Sometimes it quits after several minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166299#comment-17166299 ] Till Rohrmann commented on FLINK-18715: --- How would this feature work in different deployment scenarios (e.g. in standalone mode where we don't have CPU isolation)? > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > Add a CPU usage metric to the Flink process so that users can determine > whether their job is IO bound or CPU bound and increase/decrease the CPU > cores in the container (k8s, yarn) accordingly. If necessary, you can assign > it to me. My idea is to calculate the CPU usage ratio using > ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime() over a > period of time. This yields a value for a single-CPU-core environment, and > users can derive the usage ratio by dividing by the number of the > container's CPU cores. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-18689) Deterministic Slot Sharing
[ https://issues.apache.org/jira/browse/FLINK-18689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-18689: - Assignee: Andrey Zagrebin > Deterministic Slot Sharing > -- > > Key: FLINK-18689 > URL: https://issues.apache.org/jira/browse/FLINK-18689 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Major > > [Design > doc|https://docs.google.com/document/d/10CbCVDBJWafaFOovIXrR8nAr2BnZ_RAGFtUeTHiJplw] -- This message was sent by Atlassian Jira (v8.3.4#803005)