Sorry, I overlooked the logs for detection-engine-dev-taskmanager-1-1. Could you start a busybox pod to check connectivity to the K8s service "detection-engine-dev"? It seems that the TaskManager tries to connect and gets the response "Connection reset by peer".
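A minimal sketch of such a check, to be pasted into a shell inside the busybox pod. The helper name `probe_tcp` is hypothetical, and it assumes the image's `nc` supports `-z` and `-w` (not every busybox build does; `telnet` is an alternative there). The service name, namespace, and port 6123 are taken from the TaskManager log quoted below.

```shell
# Hypothetical helper: report whether a TCP port accepts connections.
# Assumes `nc` understands -z (connect without sending data) and -w (timeout in seconds).
probe_tcp() {
  host="$1"
  port="$2"
  if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
    echo "open: $host:$port"
  else
    # Also reached when nc is missing or does not support the flags above.
    echo "unreachable: $host:$port"
    return 1
  fi
}

# Example call from inside the pod (same address the TaskManager uses):
#   probe_tcp detection-engine-dev.team-anti-cheat 6123
```

One way to get such a shell is `kubectl run busybox-probe -n team-anti-cheat --rm -it --restart=Never --image=busybox -- sh`, where the pod name `busybox-probe` is arbitrary. Note that an "open" result only shows that the TCP connection can be established; it does not rule out the connection being reset later.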
Best,
Yang

Yang Wang <danrtsey...@gmail.com> wrote on Mon, Nov 2, 2020 at 5:41 PM:

> Hi Liangde Chen,
>
> Thanks for providing the logs. After checking them, I am afraid that
> there is something wrong with your K8s cluster, since
> detection-engine-dev-taskmanager-1-2 has been started and registered
> with the JobManager successfully.
>
> I suggest finding which K8s node detection-engine-dev-taskmanager-1-1 is
> running on and disabling scheduling on it. Then restart the Flink K8s
> session and try again.
>
> Best,
> Yang
>
> Chen Liangde <lian...@gmail.com> wrote on Mon, Nov 2, 2020 at 3:55 PM:
>
>> Please find attached logs.
>>
>> The Kubernetes cluster is an AWS EKS cluster, but it is managed by our
>> infra team.
>> I created a service account "flink" for it, and it has permission to
>> create, list, and delete pods, along with some other types of resources,
>> in the "team-anti-cheat" namespace.
>>
>> The command below was used to create the Flink cluster:
>> ./bin/kubernetes-session.sh \
>>   -Dexecution.attached=true \
>>   -Dkubernetes.cluster-id=detection-engine-dev \
>>   -Dkubernetes.namespace=team-anti-cheat \
>>   -Dkubernetes.container-start-command-template="%java% %classpath% %jvmmem% %jvmopts% %logging% %class% %args%" \
>>   -Dkubernetes.jobmanager.service-account=flink
>>
>> Thanks
>> Liangde Chen
>>
>>
>> On Mon, 2 Nov 2020 at 08:20, Yang Wang <danrtsey...@gmail.com> wrote:
>>
>>> Could you share the JobManager logs so that we can check whether it
>>> received the registration from the TaskManager?
>>>
>>> In a non-HA Flink cluster, the TaskManager uses the service to talk
>>> to the JobManager. Currently, Flink creates a headless service for the
>>> JobManager. You could use `kubectl get svc` to find it, and then start
>>> a busybox to check the network connectivity.
>>>
>>> Maybe you could also share more information about the environment. I
>>> could not reproduce your issue in a typical K8s cluster.
>>>
>>> Best,
>>> Yang
>>>
>>> Yun Gao <yungao...@aliyun.com> wrote on Fri, Oct 30, 2020 at 11:53 AM:
>>>
>>>> Hi Liangde,
>>>>
>>>> I am pulling in Yang Wang, who is the expert on Flink on K8s.
>>>>
>>>> Best,
>>>> Yun
>>>>
>>>> ------------------Original Mail ------------------
>>>> *Sender:* Chen Liangde <lian...@gmail.com>
>>>> *Send Date:* Fri Oct 30 05:30:40 2020
>>>> *Recipients:* Flink ML <user@flink.apache.org>
>>>> *Subject:* Native kubernetes setup failed to start job
>>>>
>>>>> I created a Flink cluster in Kubernetes following this guide:
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
>>>>>
>>>>> The job manager was running. When a job was submitted to the job
>>>>> manager, it spawned a task manager pod, but the task manager failed to
>>>>> connect to the job manager, and I can't find the task manager in the
>>>>> job manager web UI.
>>>>>
>>>>> This error is suspicious:
>>>>> org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
>>>>>
>>>>> 2020-10-29 13:22:51,069 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
>>>>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
>>>>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
>>>>> 2020-10-29 13:22:51,180 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>>>>> 2020-10-29 13:22:51,183 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*.
>>>>> 2020-10-29 13:23:01,203 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer