[jira] [Closed] (FLINK-32461) Managed union operator state grows very large in the JobManager
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn closed FLINK-32461. Release Note: It's a usage problem. Resolution: Fixed
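For context on the "usage problem": operator state registered with getUnionListState is union-redistributed, so on restore every parallel subtask receives the complete union of all subtasks' entries, and the state handles the JobManager tracks grow with parallelism. A minimal Java sketch of such an operator follows; the class and its one-offset-per-subtask scheme are hypothetical illustrations, not the reporter's actual job.

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Hypothetical source that keeps one offset per subtask in union list state.
public class OffsetTrackingSource extends RichParallelSourceFunction<Long>
        implements CheckpointedFunction {

    private transient ListState<Long> unionOffsets; // union-redistributed state
    private volatile boolean running = true;
    private long currentOffset;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        unionOffsets = context.getOperatorStateStore().getUnionListState(
                new ListStateDescriptor<>("offsets", Types.LONG));
        // On restore this iterates over the entries of *all* subtasks,
        // not just this one; the function must pick out what it owns.
        for (Long offset : unionOffsets.get()) {
            currentOffset = Math.max(currentOffset, offset);
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        unionOffsets.clear();
        unionOffsets.add(currentOffset); // each subtask contributes one entry
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (running) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(currentOffset++);
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```

With parallelism 128, snapshotState() above produces 128 entries whose union is handed back to every subtask on restore, which would match the 128 union-state objects seen in the heap dump; an operator that only needs its own entries should normally use plain getListState instead.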
[jira] (FLINK-32461) Managed union operator state grows very large in the JobManager
[ https://issues.apache.org/jira/browse/FLINK-32461 ] wgcn deleted comment on FLINK-32461: -- was (Author: 1026688210): This issue is related to https://issues.apache.org/jira/browse/FLINK-21436 . Can we reopen it and continue the discussion?
[jira] [Commented] (FLINK-32461) Managed union operator state grows very large in the JobManager
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738928#comment-17738928 ] wgcn commented on FLINK-32461: -- This issue is related to https://issues.apache.org/jira/browse/FLINK-21436 . Can we reopen it and continue the discussion?
[jira] [Updated] (FLINK-32461) Managed union operator state grows very large in the JobManager
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32461: - Summary: Managed union operator state grows very large in the JobManager (was: Managed operator state grows very large)
[jira] (FLINK-32461) Managed operator state grows very large
[ https://issues.apache.org/jira/browse/FLINK-32461 ] wgcn deleted comment on FLINK-32461: -- was (Author: 1026688210): >> This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets. The task only has 2 topics, and I haven't figured out why the state is so large.
[jira] [Updated] (FLINK-32461) Managed operator state grows very large
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32461: - Description: This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I found that the number of union operator state objects is 128, the same as the parallelism. Shouldn't the union state only need to be loaded once? !screenshot-1.png! !image-2023-06-28-16-24-11-538.png! was: This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I found that the number of union operator state objects is 128, the same as the parallelism. Shouldn't the union state only need to be loaded once? !screenshot-1.png! !image-2023-06-28-16-24-11-538.png!
[jira] [Updated] (FLINK-32461) Managed operator state grows very large
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32461: - Description: This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I found that the number of union operator state objects is 128, the same as the parallelism. Shouldn't the union state only need to be loaded once? !screenshot-1.png! !image-2023-06-28-16-24-11-538.png! was: This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets. !screenshot-1.png! !image-2023-06-28-16-24-11-538.png!
[jira] [Updated] (FLINK-32461) Managed operator state grows very large
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32461: - Description: This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets. !screenshot-1.png! !image-2023-06-28-16-24-11-538.png! was: This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets. !image-2023-06-28-16-24-11-538.png! !screenshot-1.png!
[jira] [Updated] (FLINK-32461) Managed operator state grows very large
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32461: - Attachment: image-2023-06-28-16-24-11-538.png Description: This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets. !image-2023-06-28-16-24-11-538.png! !screenshot-1.png! was: !screenshot-1.png! This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets.
[jira] [Updated] (FLINK-32461) Managed operator state grows very large
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32461: - Attachment: (was: image-2023-06-28-15-57-52-615.png)
[jira] [Commented] (FLINK-32461) Managed operator state grows very large
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738010#comment-17738010 ] wgcn commented on FLINK-32461: -- >> This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets. The task only has 2 topics, and I haven't figured out why the state is so large.
[jira] [Updated] (FLINK-32461) Managed operator state grows very large
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32461: - Attachment: screenshot-1.png
[jira] [Updated] (FLINK-32461) Managed operator state grows very large
[ https://issues.apache.org/jira/browse/FLINK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32461: - Description: !screenshot-1.png! This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets. was: !image-2023-06-28-15-57-39-557.png! This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets.
[jira] [Created] (FLINK-32461) Managed operator state grows very large
wgcn created FLINK-32461: Summary: Managed operator state grows very large Key: FLINK-32461 URL: https://issues.apache.org/jira/browse/FLINK-32461 Project: Flink Issue Type: Bug Affects Versions: 1.17.1 Reporter: wgcn Attachments: image-2023-06-28-15-57-52-615.png !image-2023-06-28-15-57-39-557.png! This issue doesn't usually occur, but it happens during busy nights when the machines are more active. The managed operator state grows significantly, and I can see that it mostly stores Kafka offsets.
[jira] [Commented] (FLINK-32319) Flink can't request the network partition after restart
[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733968#comment-17733968 ] wgcn commented on FLINK-32319: -- It works after I increased "taskmanager.memory.network", but I'm not sure why this is happening, as the Flink task was functioning normally when it was initially started. After some time, there is a chance that this issue occurs upon restart, which we never encountered in Flink 1.12. I have calculated the number of floating buffers, the buffer size, and the number of buffers for each channel; 600 MB should be enough. Is this issue due to a new mechanism causing a usage problem, or is it an unexpected issue?
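As a back-of-the-envelope check of that calculation, using the values from the configuration quoted later in this thread (128 KB segments, 8 buffers per channel, 800 floating buffers per gate): the sketch below follows the commonly cited estimate channels × buffers-per-channel + gates × floating-buffers-per-gate, not Flink's exact internal accounting, and the channel and gate counts are illustrative assumptions rather than values measured from the job.

```java
// Rough per-TaskManager network buffer demand; all counts are assumptions.
public class NetworkBufferEstimate {
    public static void main(String[] args) {
        long segmentSize = 128 * 1024;    // taskmanager.memory.segment-size = 128kb
        int buffersPerChannel = 8;        // taskmanager.network.memory.buffers-per-channel
        int floatingPerGate = 800;        // taskmanager.network.memory.floating-buffers-per-gate

        int slotsPerTm = 2;               // taskmanager.numberOfTaskSlots
        int parallelism = 128;            // parallelism.default
        int shufflesPerSlot = 1;          // assumption: one all-to-all exchange per slot

        // An all-to-all exchange gives each subtask `parallelism` input channels.
        long channels = (long) slotsPerTm * shufflesPerSlot * parallelism;
        long gates = (long) slotsPerTm * shufflesPerSlot;

        long buffers = channels * buffersPerChannel + gates * floatingPerGate;
        System.out.printf("~%d buffers  =>  ~%.0f MB%n",
                buffers, buffers * segmentSize / 1024.0 / 1024.0);
        // (256 * 8 + 2 * 800) * 128 KB = 456 MB on the input side alone,
        // already close to the 600 MB cap before counting result partitions.
    }
}
```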
[jira] [Comment Edited] (FLINK-32319) Flink can't request the network partition after restart
[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732799#comment-17732799 ] wgcn edited comment on FLINK-32319 at 6/14/23 11:26 PM: Hi [~Wencong Liu], I increased taskmanager.network.request-backoff.max from 1 to 2 and 3, but this problem keeps occurring. Is it related to the config "-Dtaskmanager.memory.network.max=600 MB"? I calculated that the network buffer required by the TaskManager is not very large, so I decreased it. was (Author: 1026688210): Hi [~Wencong Liu], I increased taskmanager.network.request-backoff.max to 2 and 3, but this problem keeps occurring. Is it related to the config "-Dtaskmanager.memory.network.max=600 MB"? I calculated that the network buffer required by the TaskManager is not very large, so I decreased it.
[jira] [Commented] (FLINK-32319) Flink can't request the network partition after restart
[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732799#comment-17732799 ] wgcn commented on FLINK-32319: -- Hi [~Wencong Liu], I increased taskmanager.network.request-backoff.max to 2 and 3, but this problem keeps occurring. Is it related to the config "-Dtaskmanager.memory.network.max=600 MB"? I calculated that the network buffer required by the TaskManager is not very large, so I decreased it.
[jira] [Commented] (FLINK-32319) Flink can't request the network partition after restart
[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731827#comment-17731827 ] wgcn commented on FLINK-32319: -- Hi [~Wencong Liu], thanks for your response; I will try the config. I have a question: I only roughly looked at the meaning of this config. Why does it need to be set so large? Is it related to parallel reading? We are using Flink 1.12 in our production environment, and I have never paid attention to this config before. Is this because some mechanisms were added after 1.12?
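For reference, the backoff can be raised either via -D options or programmatically; a sketch using string-keyed configuration follows. The option keys are real Flink configuration keys and their values are in milliseconds, but the concrete numbers here are assumptions, since the values quoted in this thread ("1", "2", "3") look truncated.

```java
import org.apache.flink.configuration.Configuration;

public class BackoffConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How long a restoring task keeps retrying a partition request before
        // failing; the values below are illustrative, not recommendations.
        conf.setInteger("taskmanager.network.request-backoff.initial", 100);
        conf.setInteger("taskmanager.network.request-backoff.max", 30000);
        System.out.println(conf);
    }
}
```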
[jira] [Updated] (FLINK-32319) Flink can't request the network partition after restart
[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32319: - Environment: centos 7. jdk 8. flink 1.17.1 application mode on YARN. Flink configuration:
```
$internal.application.program-args sql2
$internal.deployment.config-dir /data/home/flink/wgcn/flink-1.17.1/conf
$internal.yarn.log-config-file /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
akka.ask.timeout 100s
blob.server.port 15402
classloader.check-leaked-classloader false
classloader.resolve-order parent-first
env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000
execution.attached true
execution.checkpointing.aligned-checkpoint-timeout 10 min
execution.checkpointing.externalized-checkpoint-retention RETAIN_ON_CANCELLATION
execution.checkpointing.interval 10 min
execution.checkpointing.min-pause 10 min
execution.savepoint-restore-mode NO_CLAIM
execution.savepoint.ignore-unclaimed-state false
execution.shutdown-on-attached-exit false
execution.target embedded
high-availability zookeeper
high-availability.cluster-id application_1684133071014_7202676
high-availability.storageDir hdfs:///user/flink/recovery
high-availability.zookeeper.path.root /flink
high-availability.zookeeper.quorum x
internal.cluster.execution-mode NORMAL
internal.io.tmpdirs.use-local-default true
io.tmp.dirs /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
jobmanager.execution.failover-strategy region
jobmanager.memory.heap.size 9261023232b
jobmanager.memory.jvm-metaspace.size 268435456b
jobmanager.memory.jvm-overhead.max 1073741824b
jobmanager.memory.jvm-overhead.min 1073741824b
jobmanager.memory.off-heap.size 134217728b
jobmanager.memory.process.size 10240m
jobmanager.rpc.address
jobmanager.rpc.port 31332
metrics.reporter.promgateway.deleteOnShutdown true
metrics.reporter.promgateway.factory.class org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
metrics.reporter.promgateway.hostUrl :9091
metrics.reporter.promgateway.interval 60 SECONDS
metrics.reporter.promgateway.jobName join_phase3_v7
metrics.reporter.promgateway.randomJobNameSuffix false
parallelism.default 128
pipeline.classpaths
pipeline.jars file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_02/frauddetection-0.1.jar
rest.address
rest.bind-address x
rest.bind-port 5-50500
rest.flamegraph.enabled true
restart-strategy.failure-rate.delay 10 s
restart-strategy.failure-rate.failure-rate-interval 1 min
restart-strategy.failure-rate.max-failures-per-interval 6
restart-strategy.type exponential-delay
state.backend.type filesystem
state.checkpoints.dir hdfs://xx/user/flink/checkpoints-data/wgcn
state.checkpoints.num-retained 3
taskmanager.memory.managed.fraction 0
taskmanager.memory.network.max 600mb
taskmanager.memory.process.size 10240m
taskmanager.memory.segment-size 128kb
taskmanager.network.memory.buffers-per-channel 8
taskmanager.network.memory.floating-buffers-per-gate 800
taskmanager.numberOfTaskSlots 2
web.port 0
web.tmpdir /tmp/flink-web-1b87445e-2761-4f16-97a1-8d4fc6fa8534
yarn.application-attempt-failures-validity-interval 6
yarn.application-attempts 3
yarn.application.name join_phase3_v7
yarn.heartbeat.container-request-interval 700
```
was: centos 7. jdk 8. flink1.17.1 application mode on yarn
[jira] [Updated] (FLINK-32319) Flink can't request the network partition after restart
[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-32319: - Attachment: image-2023-06-13-07-14-48-958.png Component/s: Runtime / Network Affects Version/s: 1.17.1 Description: Flink can't request the network partition after restart, so the job cannot restore. !image-2023-06-13-07-14-48-958.png! Environment: centos 7. jdk 8. flink 1.17.1 application mode on YARN
[jira] [Created] (FLINK-32319) Flink can't request the network partition after restart
wgcn created FLINK-32319: Summary: Flink can't request the network partition after restart Key: FLINK-32319 URL: https://issues.apache.org/jira/browse/FLINK-32319 Project: Flink Issue Type: Bug Reporter: wgcn
[jira] [Commented] (FLINK-30720) KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource failed due to the topic already existing when creating it
[ https://issues.apache.org/jira/browse/FLINK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693657#comment-17693657 ] wgcn commented on FLINK-30720: -- Hi [~mapohl], does this exception occur in every test run? Could you tell me the steps that cause the exception? I want to reproduce it in my local environment. > KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource failed due to the topic already existing when creating it > --- > > Key: FLINK-30720 > URL: https://issues.apache.org/jira/browse/FLINK-30720 > Project: Flink > Issue Type: Sub-task > Components: Connectors / Kafka > Affects Versions: 1.17.0 > Reporter: Matthias Pohl > Priority: Critical > Labels: test-stability > > We experienced a build failure in {{KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource}} due to an already existing topic: > {code} > Jan 17 14:15:33 [ERROR] > org.apache.flink.streaming.connectors.kafka.table.KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource > Time elapsed: 14.771 s <<< ERROR! > Jan 17 14:15:33 java.lang.IllegalStateException: Fail to create topic > [changelog_topic partitions: 1 replication factor: 1]. > Jan 17 14:15:33 at > org.apache.flink.streaming.connectors.kafka.table.KafkaTableTestBase.createTestTopic(KafkaTableTestBase.java:143) > Jan 17 14:15:33 at > org.apache.flink.streaming.connectors.kafka.table.KafkaChangelogTableITCase.testKafkaDebeziumChangelogSource(KafkaChangelogTableITCase.java:60) > Jan 17 14:15:33 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > Jan 17 14:15:33 at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > Jan 17 14:15:33 at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > Jan 17 14:15:33 at java.lang.reflect.Method.invoke(Method.java:498) > [...] > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=44972&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=502fb6c0-30a2-5e49-c5c2-a00fa3acb203&l=38188
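One likely way to hit this locally is a topic leaked by a previous run. As an illustration only (this is not the actual KafkaTableTestBase code, and the broker address is an assumption), a test helper could make topic creation idempotent with the Kafka AdminClient:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

// Sketch of an idempotent createTestTopic, assuming a broker at localhost:9092.
public class TopicUtil {
    public static void createTopicIdempotently(String topic, int partitions, short rf)
            throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            if (admin.listTopics().names().get().contains(topic)) {
                // Leftover from an earlier run: drop it before recreating.
                admin.deleteTopics(Collections.singleton(topic)).all().get();
            }
            admin.createTopics(Collections.singleton(new NewTopic(topic, partitions, rf)))
                    .all().get();
        }
    }
}
```

Note that deleting and immediately recreating can still race with Kafka's asynchronous deletion, so a more robust fix would likely use a unique topic name per test run instead.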
[jira] [Comment Edited] (FLINK-20138) Flink job cannot recover due to a timeout acquiring slots when the JobManager restarts
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232000#comment-17232000 ] wgcn edited comment on FLINK-20138 at 11/14/20, 12:17 PM: -- [~trohrmann], the log message 'Connecting to ResourceManager x' does not appear in the JobManager log ([Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). We guess the change event of the ResourceManager latch node in ZooKeeper did not inform the JobMaster in a bad network environment. I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] makes an improvement on Curator; will it be finished in the next version? was (Author: 1026688210): [~trohrmann], the log message 'Connecting to ResourceManager x' does not appear in the JobManager log ([Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). We guess the change event of the ResourceManager latch node in ZooKeeper did not inform the JobMaster in a bad network environment. I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] makes an improvement on ZooKeeper; will it be finished in the next version?
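To make the guess concrete: Flink's ZooKeeper leader retrieval amounts to a Curator watch on the leader path, and if the change callback is never delivered, the JobMaster never learns the new ResourceManager address. A minimal sketch follows; the path, connect string, and payload handling are assumptions, and Flink's real retrieval service additionally deserializes the leader session ID and handles reconnects.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Minimal sketch of watching a leader-address node, roughly in the spirit
// of Flink's ZooKeeper leader retrieval. Path and connect string are assumptions.
public class LeaderAddressWatcher {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        NodeCache cache = new NodeCache(client, "/flink/resource_manager_lock");
        cache.getListenable().addListener(() -> {
            byte[] data = cache.getCurrentData() == null
                    ? null : cache.getCurrentData().getData();
            // If this callback is never fired (e.g. a missed watch on a flaky
            // network), the JobMaster never learns the new leader address.
            System.out.println("leader node changed: "
                    + (data == null ? "<gone>" : new String(data)));
        });
        cache.start();
        Thread.sleep(Long.MAX_VALUE); // keep watching
    }
}
```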
[jira] [Comment Edited] (FLINK-20138) Flink job cannot recover due to a timeout acquiring slots when the JobManager restarts
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232000#comment-17232000 ] wgcn edited comment on FLINK-20138 at 11/14/20, 12:13 PM: -- [~trohrmann], the log message 'Connecting to ResourceManager x' does not appear in the JobManager log ([Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). We guess the change event of the ResourceManager latch node in ZooKeeper did not inform the JobMaster in a bad network environment. I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] makes an improvement on ZooKeeper; will it be finished in the next version? was (Author: 1026688210): [~trohrmann], the log message 'Connecting to ResourceManager x' does not appear in the JobManager log ([Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). We guess the change event of the ResourceManager latch node in ZooKeeper did not inform the JobMaster in a bad network environment. I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] makes an improvement on ZooKeeper; will it be finished in the latest version?
[jira] [Comment Edited] (FLINK-20138) Flink job cannot recover due to a timeout acquiring slots when the JobManager restarts
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232000#comment-17232000 ] wgcn edited comment on FLINK-20138 at 11/14/20, 12:11 PM: -- [~trohrmann], the log message 'Connecting to ResourceManager x' does not appear in the JobManager log ([Jobmaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). We guess the change event of the ResourceManager latch node in ZooKeeper did not inform the JobMaster in a bad network environment. I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] makes an improvement on ZooKeeper; will it be finished in the latest version? was (Author: 1026688210): [~trohrmann], the log message 'Connecting to ResourceManager x' does not appear in the JobManager log ([link title|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). We guess the change event of the ResourceManager latch node in ZooKeeper did not inform the JobMaster in a bad network environment. I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] makes an improvement on ZooKeeper; will it be finished in the latest version?
[jira] [Commented] (FLINK-20138) Flink job cannot recover due to a timeout acquiring slots when the JobManager restarts
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232000#comment-17232000 ] wgcn commented on FLINK-20138: -- [~trohrmann], the log message 'Connecting to ResourceManager x' does not appear in the JobManager log ([link title|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). We guess the change event of the ResourceManager latch node in ZooKeeper did not inform the JobMaster in a bad network environment. I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] makes an improvement on ZooKeeper; will it be finished in the latest version?
[jira] [Commented] (FLINK-20138) Flink job cannot recover due to a timeout acquiring slots when the JobManager restarts
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231966#comment-17231966 ] wgcn commented on FLINK-20138: -- [~trohrmann] I uploaded the resource_manager_lock node in the attachment; the address on the resource_manager_lock node is correct and is occupied by the current JobManager. The problem is difficult to reproduce: I did not hit it again when I tried killing other AMs on YARN.
[jira] [Updated] (FLINK-20138) Flink job cannot recover due to a timeout acquiring slots when the JobManager restarts
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-20138: - Attachment: zk_resource_address_info.png
[jira] [Closed] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn closed FLINK-18715. Resolution: Won't Do > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.3 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231130#comment-17231130 ] wgcn commented on FLINK-20138: -- Hi [~trohrmann], please have a look at this issue. > Flink Job can not recover due to timeout of requiring slots when flink > jobmanager restarted > > > Key: FLINK-20138 > URL: https://issues.apache.org/jira/browse/FLINK-20138 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN, Table SQL / Runtime > Environment: flink : 1.9.2 > hadoop :2.7.2 > jdk:1.8 >Reporter: wgcn >Priority: Major > Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log > > > our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines > ,and AMs of the machines restarted at other nodemanager. We found some > jobs can not recover due to timeout of requiring slots. > *SlotPoolImp always did not connect ResourceManager * > ``` > 2020-11-09 16:31:31,794 INFO > flink-akka.actor.default-dispatcher-16 > (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) > - Cannot serve slot request, no ResourceManager connected. Adding as pending > request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] > ``` > *1.We did not find the log of YarnResourceManager requesting container at > the jobmanager log of attachment. > 2.The node of Zookeeper is also showed at attachment .* -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-20138: - Description: our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines ,and AMs of the machines restarted at other nodemanager. We found some jobs can not recover due to timeout of requiring slots. *SlotPoolImp always did not connect ResourceManager * ``` +_ 2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] _+ ``` *1.We did not find the log of YarnResourceManager requesting container at the jobmanager log of attachment. 2.The node of Zookeeper is also showed at attachment .* was: our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines ,and AMs of the machines restarted at other nodemanager. We found some jobs can not recover due to timeout of requiring slots. SlotPoolImp always did not connect ResourceManager ``` 2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] ``` 1.We did not find the log of YarnResourceManager requesting container at the jobmanager log of attachment. 2.The node of Zookeeper is also showed at attachment . > Flink Job can not recover due to timeout of requiring slots when flink > jobmanager restarted > > > Key: FLINK-20138 > URL: https://issues.apache.org/jira/browse/FLINK-20138 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN, Table SQL / Runtime > Environment: flink : 1.9.2 > hadoop :2.7.2 > jdk:1.8 >Reporter: wgcn >Priority: Major > Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log > > > our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines > ,and AMs of the machines restarted at other nodemanager. We found some > jobs can not recover due to timeout of requiring slots. > *SlotPoolImp always did not connect ResourceManager * > ``` > +_ > 2020-11-09 16:31:31,794 INFO > flink-akka.actor.default-dispatcher-16 > (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) > - Cannot serve slot request, no ResourceManager connected. Adding as pending > request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] > _+ > ``` > *1.We did not find the log of YarnResourceManager requesting container at > the jobmanager log of attachment. > 2.The node of Zookeeper is also showed at attachment .* -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-20138: - Description: our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines ,and AMs of the machines restarted at other nodemanager. We found some jobs can not recover due to timeout of requiring slots. *SlotPoolImp always did not connect ResourceManager * ``` 2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] ``` *1.We did not find the log of YarnResourceManager requesting container at the jobmanager log of attachment. 2.The node of Zookeeper is also showed at attachment .* was: our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines ,and AMs of the machines restarted at other nodemanager. We found some jobs can not recover due to timeout of requiring slots. *SlotPoolImp always did not connect ResourceManager * ``` +_ 2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] _+ ``` *1.We did not find the log of YarnResourceManager requesting container at the jobmanager log of attachment. 2.The node of Zookeeper is also showed at attachment .* > Flink Job can not recover due to timeout of requiring slots when flink > jobmanager restarted > > > Key: FLINK-20138 > URL: https://issues.apache.org/jira/browse/FLINK-20138 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN, Table SQL / Runtime > Environment: flink : 1.9.2 > hadoop :2.7.2 > jdk:1.8 >Reporter: wgcn >Priority: Major > Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log > > > our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines > ,and AMs of the machines restarted at other nodemanager. We found some > jobs can not recover due to timeout of requiring slots. > *SlotPoolImp always did not connect ResourceManager * > ``` > 2020-11-09 16:31:31,794 INFO > flink-akka.actor.default-dispatcher-16 > (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) > - Cannot serve slot request, no ResourceManager connected. Adding as pending > request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] > ``` > *1.We did not find the log of YarnResourceManager requesting container at > the jobmanager log of attachment. > 2.The node of Zookeeper is also showed at attachment .* -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-20138: - Description: our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines ,and AMs of the machines restarted at other nodemanager. We found some jobs can not recover due to timeout of requiring slots. SlotPoolImp always did not connect ResourceManager ``` 2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] ``` 1.We did not find the log of YarnResourceManager requesting container at the jobmanager log of attachment. 2.The node of Zookeeper is also showed at attachment . was: our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines ,and AMs of the machines restarted at other nodemanager. We found some jobs can not recover due to timeout of requiring slots. SlotPoolImp always did not connect ResourceManager ``` 2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] ``` 1.We did not find the log of YarnResourceManager requesting container at the jobmanager log of attachment. 2.The node of Zookeeper is also showed at attachment . > Flink Job can not recover due to timeout of requiring slots when flink > jobmanager restarted > > > Key: FLINK-20138 > URL: https://issues.apache.org/jira/browse/FLINK-20138 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN, Table SQL / Runtime > Environment: flink : 1.9.2 > hadoop :2.7.2 > jdk:1.8 >Reporter: wgcn >Priority: Major > Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log > > > our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines > ,and AMs of the machines restarted at other nodemanager. We found some > jobs can not recover due to timeout of requiring slots. > SlotPoolImp always did not connect ResourceManager > ``` > 2020-11-09 16:31:31,794 INFO > flink-akka.actor.default-dispatcher-16 > (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) > - Cannot serve slot request, no ResourceManager connected. Adding as pending > request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] > ``` > 1.We did not find the log of YarnResourceManager requesting container at > the jobmanager log of attachment. > 2.The node of Zookeeper is also showed at attachment . -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-20138: - Attachment: jobmanager.log > Flink Job can not recover due to timeout of requiring slots when flink > jobmanager restarted > > > Key: FLINK-20138 > URL: https://issues.apache.org/jira/browse/FLINK-20138 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN, Table SQL / Runtime > Environment: flink : 1.9.2 > hadoop :2.7.2 > jdk:1.8 >Reporter: wgcn >Priority: Major > Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log > > > our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines > ,and AMs of the machines restarted at other nodemanager. We found some > jobs can not recover due to timeout of requiring slots. > SlotPoolImp always did not connect ResourceManager > ``` > 2020-11-09 16:31:31,794 INFO > flink-akka.actor.default-dispatcher-16 > (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) > - Cannot serve slot request, no ResourceManager connected. Adding as pending > request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] > ``` > 1.We did not find the log of YarnResourceManager requesting container at > the jobmanager log of attachment. > 2.The node of Zookeeper is also showed at attachment . -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-20138: - Attachment: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png > Flink Job can not recover due to timeout of requiring slots when flink > jobmanager restarted > > > Key: FLINK-20138 > URL: https://issues.apache.org/jira/browse/FLINK-20138 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN, Table SQL / Runtime > Environment: flink : 1.9.2 > hadoop :2.7.2 > jdk:1.8 >Reporter: wgcn >Priority: Major > Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png > > > our flink jobs run on Yarn Perjob Mode. We stoped some nodemanger machines > ,and AMs of the machines restarted at other nodemanager. We found some > jobs can not recover due to timeout of requiring slots. > SlotPoolImp always did not connect ResourceManager > ``` > 2020-11-09 16:31:31,794 INFO > flink-akka.actor.default-dispatcher-16 > (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) > - Cannot serve slot request, no ResourceManager connected. Adding as pending > request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] > ``` > 1.We did not find the log of YarnResourceManager requesting container at > the jobmanager log of attachment. > 2.The node of Zookeeper is also showed at attachment . -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
wgcn created FLINK-20138: Summary: Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted Key: FLINK-20138 URL: https://issues.apache.org/jira/browse/FLINK-20138 Project: Flink Issue Type: Bug Components: Deployment / YARN, Table SQL / Runtime Environment: flink : 1.9.2 hadoop :2.7.2 jdk:1.8 Reporter: wgcn Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png Our Flink jobs run in YARN per-job mode. We stopped some NodeManager machines, and the AMs on those machines restarted on other NodeManagers. We found that some jobs could not recover due to a timeout while acquiring slots. SlotPoolImpl never connected to the ResourceManager: ``` 2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}] ``` 1. We did not find a log line of the YarnResourceManager requesting containers in the attached jobmanager log. 2. The ZooKeeper node is also shown in the attachment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-19943) Support sink parallelism configuration to ElasticSearch connector
[ https://issues.apache.org/jira/browse/FLINK-19943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225333#comment-17225333 ] wgcn commented on FLINK-19943: -- I would like to have a try. Please assign it to me, [~lzljs3620320]. > Support sink parallelism configuration to ElasticSearch connector > - > > Key: FLINK-19943 > URL: https://issues.apache.org/jira/browse/FLINK-19943 > Project: Flink > Issue Type: Sub-task > Components: Connectors / ElasticSearch >Reporter: CloseRiver >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166396#comment-17166396 ] wgcn commented on FLINK-18715: -- [~trohrmann] it's not suitable for deployment scenarios where we don't have CPU isolation. > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166353#comment-17166353 ] wgcn edited comment on FLINK-18715 at 7/28/20, 11:30 AM: - [~chesnay] it indeed has a lot of system resources metric , we talked about cpu occupation in single flink process was (Author: 1026688210): [~chesnay] it indeed has a lot of system resources , we talked about cpu occupation in single flink process > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166353#comment-17166353 ] wgcn edited comment on FLINK-18715 at 7/28/20, 11:27 AM: - [~chesnay] it indeed has a lot of system resources , we talked about cpu occupation in single flink process was (Author: 1026688210): [~chesnay] it indeed has a lot of system resources , we talked able cpu occupation in single flink process > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166353#comment-17166353 ] wgcn commented on FLINK-18715: -- [~chesnay] it indeed has a lot of system resources , we talked able cpu occupation in single flink process > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166213#comment-17166213 ] wgcn commented on FLINK-18715: -- Thanks for your concern, [~trohrmann]. We use YARN as our resource management system, which provides CPU isolation via cgroups, but we cannot get the CPU ratio metric of the containers. So we add a metric in the Flink process, and we can adjust the vcores according to it. Maybe the improvement is suitable for YARN :) > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
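A minimal sketch of the sampling idea described in this issue: take the deltas of ManagementFactory.getRuntimeMXBean().getUptime() and OperatingSystemMXBean.getProcessCpuTime() over a window, then divide by the container's core count. The cast to com.sun.management.OperatingSystemMXBean is HotSpot-specific, and the core count here is a hypothetical placeholder for the container's vcore setting.

```
import java.lang.management.ManagementFactory;

public class ProcessCpuUsage {
    public static void main(String[] args) throws InterruptedException {
        // HotSpot-specific cast; getProcessCpuTime() returns nanoseconds.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();

        long cpuStartNs = os.getProcessCpuTime();
        long upStartMs  = ManagementFactory.getRuntimeMXBean().getUptime();

        Thread.sleep(5_000); // sampling window

        double cpuDeltaMs = (os.getProcessCpuTime() - cpuStartNs) / 1_000_000.0;
        long   upDeltaMs  = ManagementFactory.getRuntimeMXBean().getUptime() - upStartMs;

        int containerCores = 4; // hypothetical: the container's vcore count
        double usage = cpuDeltaMs / upDeltaMs / containerCores;
        System.out.printf("process cpu usage: %.1f%%%n", usage * 100);
    }
}
```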
[jira] [Updated] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-18715: - Description: flink process add cpu usage metric, user can determine that their job is io bound /cpu bound ,so that they can increase/decrese cpu core in the container (k8s,yarn). If it's nessary . you can assign it to me ,I come up with a idea calculating cpu usage ratio using ManagementFactory.getRuntimeMXBean().getUptime() and ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period of time . it can get a value in single cpu core environment. and user can use the value to calculate cpu usage ratio by dividing num of container's cpu core. was: flink process add cpu usage metric, user can determine that their job is io bound /cpu bound ,so that they can increase/decrese cpu core in the container (k8s,yarn). If it's necessary . you can assign it to me > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-18715: - Description: flink process add cpu usage metric, user can determine that their job is io bound /cpu bound ,so that they can increase/decrese cpu core in the container (k8s,yarn). If it's necessary . you can assign it to me was: flink process add cpu usage metric, user can determine that their job is io bound /cpu bound ,so that they can increase/decrese cpu core in the container (k8s,yarn). If it's nessary . you can assign it to me > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's necessary > . you can assign it to me -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
wgcn created FLINK-18715: Summary: add cpu usage metric of jobmanager/taskmanager Key: FLINK-18715 URL: https://issues.apache.org/jira/browse/FLINK-18715 Project: Flink Issue Type: Improvement Components: Runtime / Metrics Affects Versions: 1.11.1 Reporter: wgcn Fix For: 1.12.0, 1.11.2 Add a CPU usage metric to the Flink process so that users can determine whether their job is IO-bound or CPU-bound and increase/decrease the CPU cores of the container (k8s, YARN) accordingly. If it's necessary, you can assign it to me. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16360) connector on hive 2.0.1 don't support type conversion from STRING to VARCHAR
wgcn created FLINK-16360: Summary: connector on hive 2.0.1 don't support type conversion from STRING to VARCHAR Key: FLINK-16360 URL: https://issues.apache.org/jira/browse/FLINK-16360 Project: Flink Issue Type: Bug Components: Connectors / Hive Affects Versions: 1.10.0 Environment: os: centos java: 1.8.0_92 flink: 1.10.0 hadoop: 2.7.2 hive: 2.0.1 Reporter: wgcn Attachments: exceptionstack It threw an exception when we queried Hive 2.0.1 with Flink 1.10.0. Exception stack:
org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=50, backoffTimeMS=1)
 at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
 at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
 at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
 at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186)
 at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180)
 at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:484)
 at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:380)
 at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
 at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
 at org.apache.flink.orc.shim.OrcShimV200.createRecordReader(OrcShimV200.java:76)
 at org.apache.flink.orc.shim.OrcShimV200.createRecordReader(OrcShimV200.java:123)
 at org.apache.flink.orc.OrcSplitReader.<init>(OrcSplitReader.java:73)
 at org.apache.flink.orc.OrcColumnarRowSplitReader.<init>(OrcColumnarRowSplitReader.java:55)
 at org.apache.flink.orc.OrcSplitReaderUtil.genPartColumnarRowReader(OrcSplitReaderUtil.java:96)
 at org.apache.flink.connectors.hive.read.HiveVectorizedOrcSplitReader.<init>(HiveVectorizedOrcSplitReader.java:65)
 at org.apache.flink.connectors.hive.read.HiveTableInputFormat.open(HiveTableInputFormat.java:117)
 at org.apache.flink.connectors.hive.read.HiveTableInputFormat.open(HiveTableInputFormat.java:56)
 at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:85)
 at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
 at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
 at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:196)
Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.commons.lang3.reflect.MethodUtils.invokeExactMethod(MethodUtils.java:204)
 at org.apache.commons.lang3.reflect.MethodUtils.invokeExactMe
[jira] [Commented] (FLINK-15350) develop JDBC catalogs to connect to relational databases
[ https://issues.apache.org/jira/browse/FLINK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018800#comment-17018800 ] wgcn commented on FLINK-15350: -- Hi [~phoenixjiangnan], I would like to know in which scenarios the JDBC catalog will be applied, in stream/batch mode. > develop JDBC catalogs to connect to relational databases > > > Key: FLINK-15350 > URL: https://issues.apache.org/jira/browse/FLINK-15350 > Project: Flink > Issue Type: New Feature > Components: Connectors / JDBC >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Major > > introduce AbastractJDBCCatalog and a set of JDBC catalog implementations to > connect Flink to all relational databases. > Class hierarchy: > {code:java} > Catalog API > | > AbstractJDBCCatalog > | > PostgresJDBCCatalog, MySqlJDBCCatalog, OracleJDBCCatalog, ... > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15625) flink sql multiple statements syntatic validation supports
[ https://issues.apache.org/jira/browse/FLINK-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018773#comment-17018773 ] wgcn commented on FLINK-15625: -- [~jark] Thanks for your reply. I will keep an eye on [FLINK-12828|https://issues.apache.org/jira/browse/FLINK-12828] and [FLINK-12845|https://issues.apache.org/jira/browse/FLINK-12845]. > flink sql multiple statements syntatic validation supports > -- > > Key: FLINK-15625 > URL: https://issues.apache.org/jira/browse/FLINK-15625 > Project: Flink > Issue Type: Improvement > Components: Table SQL / API, Table SQL / Planner >Reporter: jackylau >Priority: Major > Fix For: 1.11.0 > > > we konw that blink(blink first commits ) parser and calcite parser all > support multiple statements now and supports multiple statement syntatic > validation by calcite, which validates sql statements one by one, and it will > not validate the previous tablenames and others. and we only know the sql > syntatic error when we submit the flink applications. > I think it is eagerly need for users. we hope the flink community to support > it -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15625) flink sql multiple statements syntatic validation supports
[ https://issues.apache.org/jira/browse/FLINK-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017909#comment-17017909 ] wgcn commented on FLINK-15625: -- Hi [~lzljs3620320], we could support an SQL text that contains multiple statements (such as 'insert into', 'select', DDL, SQL comments, etc.) in streaming mode, and then execute the statements in the order in which they appear. > flink sql multiple statements syntatic validation supports > -- > > Key: FLINK-15625 > URL: https://issues.apache.org/jira/browse/FLINK-15625 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Legacy Planner >Reporter: jackylau >Priority: Major > Fix For: 1.10.0 > > > we konw that blink(blink first commits ) parser and calcite parser all > support multiple statements now and supports multiple statement syntatic > validation by calcite, which validates sql statements one by one, and it will > not validate the previous tablenames and others. and we only know the sql > syntatic error when we submit the flink applications. > I think it is eagerly need for users. we hope the flink community to support > it -- This message was sent by Atlassian Jira (v8.3.4#803005)
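To make the proposal concrete, here is a naive sketch of splitting an SQL text into statements and executing them in order. This only illustrates the ordering idea; it is not the Blink/Calcite parser discussed in this issue, and a real implementation would need proper lexing (escaped quotes, block comments, dialect rules).

```
import java.util.ArrayList;
import java.util.List;

public class NaiveSqlSplitter {
    /** Splits an SQL text on ';', ignoring ';' inside single-quoted
     *  strings and '--' line comments. Naive sketch only. */
    public static List<String> split(String sql) {
        List<String> stmts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inString = false, inLineComment = false;
        for (int i = 0; i < sql.length(); i++) {
            char c = sql.charAt(i);
            if (inLineComment) {
                if (c == '\n') inLineComment = false;
                cur.append(c);
                continue;
            }
            if (c == '\'') inString = !inString;
            if (!inString && c == '-' && i + 1 < sql.length() && sql.charAt(i + 1) == '-') {
                inLineComment = true;
            }
            if (c == ';' && !inString) {
                if (!cur.toString().trim().isEmpty()) stmts.add(cur.toString().trim());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        if (!cur.toString().trim().isEmpty()) stmts.add(cur.toString().trim());
        return stmts;
    }

    public static void main(String[] args) {
        String text = "-- a comment\nCREATE TABLE t (a INT);\n"
                + "INSERT INTO t SELECT 1;\nSELECT * FROM t;";
        for (String stmt : split(text)) {
            // A real runner would hand each statement to the planner here,
            // preserving the order in which they appear in the text.
            System.out.println("would execute: " + stmt);
        }
    }
}
```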
[jira] [Commented] (FLINK-15625) flink sql multiple statements syntatic validation supports
[ https://issues.apache.org/jira/browse/FLINK-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017738#comment-17017738 ] wgcn commented on FLINK-15625: -- It's indeed a good improvement, and it seems not too complex to implement. I would like to try it. > flink sql multiple statements syntatic validation supports > -- > > Key: FLINK-15625 > URL: https://issues.apache.org/jira/browse/FLINK-15625 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Legacy Planner >Reporter: jackylau >Priority: Major > Fix For: 1.10.0 > > > we konw that blink(blink first commits ) parser and calcite parser all > support multiple statements now and supports multiple statement syntatic > validation by calcite, which validates sql statements one by one, and it will > not validate the previous tablenames and others. and we only know the sql > syntatic error when we submit the flink applications. > I think it is eagerly need for users. we hope the flink community to support > it -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos
[ https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn closed FLINK-12728. Resolution: Resolved
In core-site.xml:
hadoop.proxyuser.yarn.hosts = *
hadoop.proxyuser.yarn.groups = *
In yarn-site.xml:
yarn.resourcemanager.proxy-user-privileges.enabled = true
> taskmanager container can't launch on nodemanager machine because of > kerberos > --- > > Key: FLINK-12728 > URL: https://issues.apache.org/jira/browse/FLINK-12728 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.7.2 > Environment: linux > jdk8 > hadoop 2.7.2 > flink 1.7.2 >Reporter: wgcn >Priority: Major > Attachments: AM.log, NM.log > > > job can't restart when flink job has been running for a long time and > then taskmanager restarting ,i find log in AM that AM request > containers taskmanager all the time . the log in NodeManager show > that the new requested containers can't downloading file from hdfs because > of kerberos . I configed the keytab config that > security.kerberos.login.use-ticket-cache: false > security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab > security.kerberos.login.principal: > at flink-client machine and keytab is exist. > I showed the logs at AM and NodeManager below. > > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos
[ https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888705#comment-16888705 ] wgcn commented on FLINK-12728: -- Hi Tao Yang, Andrey Zagrebin. It does work after setting yarn.resourcemanager.proxy-user-privileges.enabled=true, hadoop.proxyuser.yarn.hosts=*, and hadoop.proxyuser.yarn.groups=*. It may be useful for other users if this is recorded in the Flink documentation. > taskmanager container can't launch on nodemanager machine because of > kerberos > --- > > Key: FLINK-12728 > URL: https://issues.apache.org/jira/browse/FLINK-12728 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.7.2 > Environment: linux > jdk8 > hadoop 2.7.2 > flink 1.7.2 >Reporter: wgcn >Priority: Major > Attachments: AM.log, NM.log > > > job can't restart when flink job has been running for a long time and > then taskmanager restarting ,i find log in AM that AM request > containers taskmanager all the time . the log in NodeManager show > that the new requested containers can't downloading file from hdfs because > of kerberos . I configed the keytab config that > security.kerberos.login.use-ticket-cache: false > security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab > security.kerberos.login.principal: > at flink-client machine and keytab is exist. > I showed the logs at AM and NodeManager below. > > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos
[ https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881895#comment-16881895 ] wgcn commented on FLINK-12728: -- Hi Andrey Zagrebin. The issue did happen again after YARN restarted; we are looking for the reason and will report back once we have resolved it. Thanks for your concern. > taskmanager container can't launch on nodemanager machine because of > kerberos > --- > > Key: FLINK-12728 > URL: https://issues.apache.org/jira/browse/FLINK-12728 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.7.2 > Environment: linux > jdk8 > hadoop 2.7.2 > flink 1.7.2 >Reporter: wgcn >Priority: Major > Attachments: AM.log, NM.log > > > job can't restart when flink job has been running for a long time and > then taskmanager restarting ,i find log in AM that AM request > containers taskmanager all the time . the log in NodeManager show > that the new requested containers can't downloading file from hdfs because > of kerberos . I configed the keytab config that > security.kerberos.login.use-ticket-cache: false > security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab > security.kerberos.login.principal: > at flink-client machine and keytab is exist. > I showed the logs at AM and NodeManager below. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos
[ https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-12728: - Description: job can't restart when flink job has been running for a long time and then taskmanager restarting ,i find log in AM that AM request containers taskmanager all the time . the log in NodeManager show that the new requested containers can't downloading file from hdfs because of kerberos . I configed the keytab config that security.kerberos.login.use-ticket-cache: false security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab security.kerberos.login.principal: at flink-client machine and keytab is exist. I showed the logs at AM and NodeManager below. was: job can't restart when flink job has been running for a long time and then taskmanager restarting ,i find log in AM that AM request containers taskmanager all the time . the log in NodeManager show that the new requested containers can't downloading file from hdfs because of kerberos . I configed the keytab config that security.kerberos.login.use-ticket-cache: false security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab security.kerberos.login.principal: [flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. |mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.] at flink-client machine and keytab is exist. I showed the logs at AM and NodeManager below. > taskmanager container can't launch on nodemanager machine because of > kerberos > --- > > Key: FLINK-12728 > URL: https://issues.apache.org/jira/browse/FLINK-12728 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.7.2 > Environment: linux > jdk8 > hadoop 2.7.2 > flink 1.7.2 >Reporter: wgcn >Priority: Major > Attachments: AM.log, NM.log > > > job can't restart when flink job has been running for a long time and > then taskmanager restarting ,i find log in AM that AM request > containers taskmanager all the time . the log in NodeManager show > that the new requested containers can't downloading file from hdfs because > of kerberos . I configed the keytab config that > security.kerberos.login.use-ticket-cache: false > security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab > security.kerberos.login.principal: > at flink-client machine and keytab is exist. > I showed the logs at AM and NodeManager below. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos
[ https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856676#comment-16856676 ] wgcn commented on FLINK-12728: -- Thanks, Tao Yang. It is indeed set to false. We will try it. > taskmanager container can't launch on nodemanager machine because of > kerberos > --- > > Key: FLINK-12728 > URL: https://issues.apache.org/jira/browse/FLINK-12728 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.7.2 > Environment: linux > jdk8 > hadoop 2.7.2 > flink 1.7.2 >Reporter: wgcn >Priority: Major > Attachments: AM.log, NM.log > > > job can't restart when flink job has been running for a long time and > then taskmanager restarting ,i find log in AM that AM request > containers taskmanager all the time . the log in NodeManager show > that the new requested containers can't downloading file from hdfs because > of kerberos . I configed the keytab config that > security.kerberos.login.use-ticket-cache: false > security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab > security.kerberos.login.principal: > [flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. > |mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.] > at flink-client machine and keytab is exist. > I showed the logs at AM and NodeManager below. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos
[ https://issues.apache.org/jira/browse/FLINK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-12728: - Description: job can't restart when flink job has been running for a long time and then taskmanager restarting ,i find log in AM that AM request containers taskmanager all the time . the log in NodeManager show that the new requested containers can't downloading file from hdfs because of kerberos . I configed the keytab config that security.kerberos.login.use-ticket-cache: false security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab security.kerberos.login.principal: [flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. |mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.] at flink-client machine and keytab is exist. I showed the logs at AM and NodeManager below. was: job can't restart when flink job has been running for a long time and then taskmanager restarting ,i find log in AM that AM request containers taskmanager all the time . log in NodeManager show that the new requested containers can't downloading file from hdfs because of kerberos . I configed the keytab config that security.kerberos.login.use-ticket-cache: false security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab security.kerberos.login.principal: [flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. |mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.] at flink-client machine and keytab is exist. I showed the logs at AM and NodeManager below. > taskmanager container can't launch on nodemanager machine because of > kerberos > --- > > Key: FLINK-12728 > URL: https://issues.apache.org/jira/browse/FLINK-12728 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.7.2 > Environment: linux > jdk8 > hadoop 2.7.2 > flink 1.7.2 >Reporter: wgcn >Priority: Major > Attachments: AM.log, NM.log > > > job can't restart when flink job has been running for a long time and > then taskmanager restarting ,i find log in AM that AM request > containers taskmanager all the time . the log in NodeManager show > that the new requested containers can't downloading file from hdfs because > of kerberos . I configed the keytab config that > security.kerberos.login.use-ticket-cache: false > security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab > security.kerberos.login.principal: > [flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. > |mailto:flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2.] > at flink-client machine and keytab is exist. > I showed the logs at AM and NodeManager below. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12728) taskmanager container can't launch on nodemanager machine because of kerberos
wgcn created FLINK-12728: Summary: taskmanager container can't launch on nodemanager machine because of kerberos Key: FLINK-12728 URL: https://issues.apache.org/jira/browse/FLINK-12728 Project: Flink Issue Type: Bug Components: Deployment / YARN Affects Versions: 1.7.2 Environment: linux jdk8 hadoop 2.7.2 flink 1.7.2 Reporter: wgcn Attachments: AM.log, NM.log The job can't restart after a Flink job has been running for a long time and a taskmanager then restarts. In the AM log I found that the AM requests taskmanager containers all the time, and the NodeManager log shows that the newly requested containers can't download files from HDFS because of Kerberos. I configured the keytab settings security.kerberos.login.use-ticket-cache: false, security.kerberos.login.keytab: /data/sysdir/knit/user/.flink.keytab, and security.kerberos.login.principal: flink/client-docker-201-53.hadoop.lq@HADOOP.LQ2. on the flink-client machine, and the keytab exists. The AM and NodeManager logs are attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701574#comment-16701574 ] wgcn commented on FLINK-10884: -- https://github.com/apache/flink/pull/7185 > Flink on yarn TM container will be killed by nodemanager because of the > exceeded physical memory. > > > Key: FLINK-10884 > URL: https://issues.apache.org/jira/browse/FLINK-10884 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Core >Affects Versions: 1.5.5, 1.6.2, 1.7.0 > Environment: version : 1.6.2 > module : flink on yarn > centos jdk1.8 > hadoop 2.7 >Reporter: wgcn >Assignee: wgcn >Priority: Major > Labels: pull-request-available, yarn > > TM container will be killed by nodemanager because of the exceeded > physical memory. I found the lanuch context lanuching TM container that > "container memory = heap memory+ offHeapSizeMB" at the class > org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters > from line 160 to 166 I set a safety margin for the whole memory container > using. For example if the container limit 3g memory, the sum memory that > "heap memory+ offHeapSizeMB" is equal to 2.4g to prevent the container > being killed.Do we have the > ready-made > solution or I can commit my solution -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698537#comment-16698537 ] wgcn commented on FLINK-10884: -- thanks for your suggestion :D > Flink on yarn TM container will be killed by nodemanager because of the > exceeded physical memory. > > > Key: FLINK-10884 > URL: https://issues.apache.org/jira/browse/FLINK-10884 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Core >Affects Versions: 1.5.5, 1.6.2, 1.7.0 > Environment: version : 1.6.2 > module : flink on yarn > centos jdk1.8 > hadoop 2.7 >Reporter: wgcn >Assignee: wgcn >Priority: Major > Labels: yarn > > TM container will be killed by nodemanager because of the exceeded > physical memory. I found the lanuch context lanuching TM container that > "container memory = heap memory+ offHeapSizeMB" at the class > org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters > from line 160 to 166 I set a safety margin for the whole memory container > using. For example if the container limit 3g memory, the sum memory that > "heap memory+ offHeapSizeMB" is equal to 2.4g to prevent the container > being killed.Do we have the > ready-made > solution or I can commit my solution -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698043#comment-16698043 ] wgcn commented on FLINK-10884: -- I think the memory assigned to the JVM contains more than just heap memory and off-heap memory in the container. Do you mean that offHeapSizeMB should be assigned as (containerMemoryMB - heapSizeMB) * (1 - cutoffFactor)? > Flink on yarn TM container will be killed by nodemanager because of the > exceeded physical memory. > > > Key: FLINK-10884 > URL: https://issues.apache.org/jira/browse/FLINK-10884 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Core >Affects Versions: 1.5.5, 1.6.2, 1.7.0 > Environment: version : 1.6.2 > module : flink on yarn > centos jdk1.8 > hadoop 2.7 >Reporter: wgcn >Assignee: wgcn >Priority: Major > Labels: yarn > > TM container will be killed by nodemanager because of the exceeded > physical memory. I found the lanuch context lanuching TM container that > "container memory = heap memory+ offHeapSizeMB" at the class > org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters > from line 160 to 166 I set a safety margin for the whole memory container > using. For example if the container limit 3g memory, the sum memory that > "heap memory+ offHeapSizeMB" is equal to 2.4g to prevent the container > being killed.Do we have the > ready-made > solution or I can commit my solution -- This message was sent by Atlassian JIRA (v7.6.3#76005)
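A small worked example of the two formulas being compared, with hypothetical numbers; the real values come from Flink's ContaineredTaskManagerParameters and the configured cutoff.

```
public class ContainerMemorySplit {
    public static void main(String[] args) {
        // Hypothetical figures, for illustration only.
        long containerMemoryMB = 3072;
        double cutoffFactor = 0.2; // safety-margin fraction
        long heapSizeMB = 2048;

        // Behavior described in the comment above: the JVM is allowed
        // to use the entire container budget.
        long offHeapAll = containerMemoryMB - heapSizeMB; // 1024

        // Proposed variant: leave a cutoff so heap + direct memory
        // stays below the container limit.
        long offHeapWithCutoff =
                (long) ((containerMemoryMB - heapSizeMB) * (1 - cutoffFactor)); // 819

        System.out.println("JVM cap without cutoff: " + (heapSizeMB + offHeapAll) + " MB");
        System.out.println("JVM cap with cutoff:    " + (heapSizeMB + offHeapWithCutoff) + " MB");
    }
}
```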
[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693129#comment-16693129 ] wgcn commented on FLINK-10884: -- I found a unit test showing that, by default, the off-heap memory is set to what the network buffers require: [https://github.com/apache/flink/blob/f629b05a2cdc2c07f4a19456cf5b3e5fdd6ff607/flink-runtime/src/test/java/org/apache/flink/runtime/clusterframework/ContaineredTaskManagerParametersTest.java#L37] I guess the author of the code did that deliberately; I don't know why. > Flink on yarn TM container will be killed by nodemanager because of the > exceeded physical memory. > > > Key: FLINK-10884 > URL: https://issues.apache.org/jira/browse/FLINK-10884 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Core >Affects Versions: 1.5.5, 1.6.2, 1.7.0 > Environment: version : 1.6.2 > module : flink on yarn > centos jdk1.8 > hadoop 2.7 >Reporter: wgcn >Assignee: wgcn >Priority: Major > Labels: yarn > > TM container will be killed by nodemanager because of the exceeded > physical memory. I found the lanuch context lanuching TM container that > "container memory = heap memory+ offHeapSizeMB" at the class > org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters > from line 160 to 166 I set a safety margin for the whole memory container > using. For example if the container limit 3g memory, the sum memory that > "heap memory+ offHeapSizeMB" is equal to 2.4g to prevent the container > being killed.Do we have the > ready-made > solution or I can commit my solution -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690502#comment-16690502 ] wgcn commented on FLINK-10884: -- Cool. It's my first experience in the community. I will fix it. Thank you! > Flink on yarn TM container will be killed by nodemanager because of the > exceeded physical memory. > > > Key: FLINK-10884 > URL: https://issues.apache.org/jira/browse/FLINK-10884 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Core >Affects Versions: 1.5.5, 1.6.2, 1.7.0 > Environment: version : 1.6.2 > module : flink on yarn > centos jdk1.8 > hadoop 2.7 >Reporter: wgcn >Priority: Major > Labels: yarn > > TM container will be killed by the nodemanager because of exceeded physical memory. I found that the launch context launching the TM container sets "container memory = heap memory + offHeapSizeMB" in the class org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters, lines 160 to 166. I set a safety margin for the whole memory the container uses. For example, if the container limit is 3g of memory, the sum "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from being killed. Do we have a ready-made solution, or should I commit mine? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689067#comment-16689067 ] wgcn commented on FLINK-10884: -- Thanks for taking a look. I found the configuration option, but because the code computes "offHeapSizeMB = containerMemoryMB - heapSizeMB", the option doesn't help with the physical memory excess: "heap memory" + "MaxDirectMemorySize" still adds up to the container memory when the ApplicationMaster creates the TM's launch script. I don't know why the code is designed like this. > Flink on yarn TM container will be killed by nodemanager because of the > exceeded physical memory. > > > Key: FLINK-10884 > URL: https://issues.apache.org/jira/browse/FLINK-10884 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Core >Affects Versions: 1.6.2 > Environment: version : 1.6.2 > module : flink on yarn > centos jdk1.8 > hadoop 2.7 >Reporter: wgcn >Priority: Major > Labels: yarn > > TM container will be killed by the nodemanager because of exceeded physical memory. I found that the launch context launching the TM container sets "container memory = heap memory + offHeapSizeMB" in the class org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters, lines 160 to 166. I set a safety margin for the whole memory the container uses. For example, if the container limit is 3g of memory, the sum "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from being killed. Do we have a ready-made solution, or should I commit mine? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
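The complaint in this comment is easiest to see in the launch arguments themselves. Below is a sketch, under the assumptions stated in the comment (heap plus MaxDirectMemorySize adding up to the full container), of why such a split leaves no headroom; the class and method names are invented for illustration.
{code:java}
// Illustration of the design questioned above: if the launch script sets
// -Xmx to the heap size and -XX:MaxDirectMemorySize to everything else,
// heap + direct memory alone already equal the container limit, leaving
// no room for metaspace, thread stacks, or other native allocations.
public class LaunchArgsSketch {

    static String jvmArgs(long containerMemoryMB, long heapSizeMB) {
        long offHeapSizeMB = containerMemoryMB - heapSizeMB;
        return String.format("-Xmx%dm -XX:MaxDirectMemorySize=%dm",
                heapSizeMB, offHeapSizeMB);
    }

    public static void main(String[] args) {
        // For a 3072 MB container with a 2048 MB heap:
        System.out.println(jvmArgs(3072, 2048));
        // -> -Xmx2048m -XX:MaxDirectMemorySize=1024m  (sums to the full container)
    }
}
{code}
Any native allocation beyond these two budgets pushes the process past the container's physical limit, which is exactly when the nodemanager kills it.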
[jira] [Updated] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wgcn updated FLINK-10884: - Labels: yarn (was: ) > Flink on yarn TM container will be killed by nodemanager because of the > exceeded physical memory. > > > Key: FLINK-10884 > URL: https://issues.apache.org/jira/browse/FLINK-10884 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Core >Affects Versions: 1.6.2 > Environment: version : 1.6.2 > module : flink on yarn > centos jdk1.8 > hadoop 2.7 >Reporter: wgcn >Priority: Major > Labels: yarn > > TM container will be killed by the nodemanager because of exceeded physical memory. I found that the launch context launching the TM container sets "container memory = heap memory + offHeapSizeMB" in the class org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters, lines 160 to 166. I set a safety margin for the whole memory the container uses. For example, if the container limit is 3g of memory, the sum "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from being killed. Do we have a ready-made solution, or should I commit mine? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10884) Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
wgcn created FLINK-10884: Summary: Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory. Key: FLINK-10884 URL: https://issues.apache.org/jira/browse/FLINK-10884 Project: Flink Issue Type: Bug Components: Cluster Management, Core Affects Versions: 1.6.2 Environment: version : 1.6.2 module : flink on yarn centos jdk1.8 hadoop 2.7 Reporter: wgcn TM container will be killed by the nodemanager because of exceeded physical memory. I found that the launch context launching the TM container sets "container memory = heap memory + offHeapSizeMB" in the class org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters, lines 160 to 166. I set a safety margin for the whole memory the container uses. For example, if the container limit is 3g of memory, the sum "heap memory + offHeapSizeMB" is set to 2.4g to prevent the container from being killed. Do we have a ready-made solution, or should I commit mine? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
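As a worked example of the safety margin proposed in the description: the 0.8 factor below is inferred from the 3g-container / 2.4g-budget numbers in the report, not a Flink configuration value.
{code:java}
// Worked example of the proposed safety margin. The 0.8 factor is
// inferred from the 3g-container / 2.4g-budget numbers above; it is not
// a Flink configuration value.
public class SafetyMarginExample {
    public static void main(String[] args) {
        long containerMemoryMB = 3 * 1024;    // 3g container limit
        double safetyMargin = 0.8;            // inferred: 2.4g / 3g
        long budgetMB = (long) (containerMemoryMB * safetyMargin);
        System.out.println(budgetMB);         // 2457 MB, roughly the 2.4g in the report
    }
}
{code}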
[jira] [Commented] (FLINK-5770) Flink yarn session stop in non-detached model
[ https://issues.apache.org/jira/browse/FLINK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668448#comment-16668448 ] wgcn commented on FLINK-5770: - Your client may have shut down; you can use the -d argument to run the YARN session in detached mode, so the session does not depend on the client staying alive. > Flink yarn session stop in non-detached model > - > > Key: FLINK-5770 > URL: https://issues.apache.org/jira/browse/FLINK-5770 > Project: Flink > Issue Type: Bug > Components: Client >Affects Versions: 1.2.0 > Environment: 1、the cluster contains 4 nodes; > 2、every node has 380GB memory, and the CPU has 40 cores; > 3、the OS is CentOS 7.2; >Reporter: zhangrucong1982 >Priority: Major > > 1、I use a recent version of Flink, in security mode without HA. The > configurations in flink-conf.yaml are: > security.kerberos.login.keytab: > /home/demo/flink/release/flink-1.2.2/keytab/huawei1.keytab > security.kerberos.login.principal: huawei1 > security.kerberos.login.contexts: Client,KafkaClient > 2、Then I use the command ./yarn-session.sh -n 2 to start the cluster with > two taskmanagers. > 3、But about 4 hours later, the session shuts down by itself. The > error stack is the following: > 2017-02-07 19:27:30,841 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system > [akka.tcp://flink@9-96-101-251:38650] has failed, address is now gated for > [5000] ms. Reason: [Disassociated] > 2017-02-07 19:27:42,804 WARN org.apache.flink.yarn.cli.FlinkYarnSessionCli > - Exception while running the interactive command line interface > java.lang.RuntimeException: Unable to get ClusterClient status from > Application Client > at > org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:248) > at > org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:410) > at > org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:663) > at > org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:476) > at > org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:473) > at > org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40) > at > org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:473) > Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: > Could not retrieve the leader gateway > at > org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:142) > at > org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:691) > at > org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243) > ... 10 more > Caused by: java.util.concurrent.TimeoutException: Futures timed out after > [1 milliseconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at scala.concurrent.Await.result(package.scala) > at > org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:140) > ... 12 more > 4、The detailed log can be seen at the following link: > https://docs.google.com/document/d/1mbxrCy6mHHFxcxPv8f7CCA3BI1QVGPeNiHxUQhuZP0o/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005)