[jira] [Commented] (FLINK-29572) Flink Task Manager skip loopback interface for resource manager registration
[ https://issues.apache.org/jira/browse/FLINK-29572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625932#comment-17625932 ]

Kevin Li commented on FLINK-29572:
----------------------------------

Hi Xintong, first of all, thanks for your help.

However, this is not some obscure piece of proxy software. It is part of a service mesh implementation, which has become very popular, especially in the Kubernetes world.
https://medium.com/microservices-in-practice/service-mesh-for-microservices-2953109a3c9a

Keep in mind that the change from FLINK-24474 is not present before 1.15. Its original purpose is to make a Flink cluster more secure when the JM and TMs run on the same node/computer, which is not really the case for production deployments. Also, the way it probes the location of the Job Manager is wrong if such a proxy exists. That is why I recommended adding an option to disable/skip the loopback check, since we know the JM is not running on the same node as the TM.

So in my opinion, it is a bug.

> Flink Task Manager skip loopback interface for resource manager registration
> ----------------------------------------------------------------------------
>
> Key: FLINK-29572
> URL: https://issues.apache.org/jira/browse/FLINK-29572
> Project: Flink
> Issue Type: Bug
> Components: API / Core
> Affects Versions: 1.15.2
> Environment: Flink 1.15.2
>              Kubernetes with Istio Proxy
> Reporter: Kevin Li
> Priority: Major
>
> Currently the Flink Task Manager tries different local interfaces to bind
> to when connecting to the Resource Manager. The first one is the loopback
> interface. Normally, if the Job Manager is running on a remote
> host/container, connecting via the loopback interface will fail and the
> Task Manager will pick up the correct IP address.
> However, if the Task Manager is running with some proxy, the loopback
> interface can connect to the remote host as well. This results in
> 127.0.0.1 being reported to the Resource Manager during registration, even
> though the Job Manager/Resource Manager runs on a remote host, and
> problems follow. For us, only one Task Manager can register in this case.
> I suggest adding a configuration option to skip the loopback interface
> check when we know the Job/Resource Manager is running on a remote
> host/container.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-29572) Flink Task Manager skip loopback interface for resource manager registration
[ https://issues.apache.org/jira/browse/FLINK-29572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625277#comment-17625277 ]

Kevin Li commented on FLINK-29572:
----------------------------------

It will work if we configure a different port for each task manager, but that will be cumbersome. If you have 10 task managers, you need to create 10 separate deployments, one for each of them, rather than one deployment with 10 replicas that can scale up and down. Autoscaling could be an issue too.

I downgraded my Flink to 1.14.6 and it works fine. It looks like the problem was introduced by FLINK-24474.
[jira] [Commented] (FLINK-29572) Flink Task Manager skip loopback interface for resource manager registration
[ https://issues.apache.org/jira/browse/FLINK-29572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624793#comment-17624793 ]

Kevin Li commented on FLINK-29572:
----------------------------------

The sidecar proxy allows an application bound to 127.0.0.1 to connect to a remote IP address (where the Job Manager runs), which should not be possible under normal circumstances. This makes the Task Manager report its IP as 127.0.0.1 to the Job Manager instead of its real IP, such as 1.2.3.4. It has nothing to do with the port. In this situation, all TMs report their IP as 127.0.0.1, which confuses the Job Manager, and eventually no TM can communicate with the JM.
[jira] [Commented] (FLINK-29572) Flink Task Manager skip loopback interface for resource manager registration
[ https://issues.apache.org/jira/browse/FLINK-29572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618407#comment-17618407 ]

Kevin Li commented on FLINK-29572:
----------------------------------

1. It is called a service mesh. Basically, all ingress/egress traffic is captured by a proxy, and the proxies are connected as a service mesh, so that service discovery and much more are transparent to the apps. https://istio.io/latest/docs/ops/deployment/architecture/
2. With the service mesh proxy deployed, a TM can connect to the JM using the loopback address. If this works, the TM will report its address as 127.0.0.1:6223. The JM can send RPCs to this address as well. But as soon as you have multiple TMs, all of them will report their address as 127.0.0.1:6223. Obviously only one will succeed. As a result, the JM can only connect to one TM, the one that succeeded.
3. Capturing loopback traffic and forwarding it to the remote side is how the proxy works. Disabling this would make the proxy useless. Please check the link in No. 1.
[jira] [Commented] (FLINK-29572) Flink Task Manager skip loopback interface for resource manager registration
[ https://issues.apache.org/jira/browse/FLINK-29572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617811#comment-17617811 ]

Kevin Li commented on FLINK-29572:
----------------------------------

No, it wouldn't. This problem happens in K8s deployments. On K8s, all task managers share the same configuration, which is generated from a ConfigMap. I think we just need a configuration flag to skip the loopback check, since we know the Job Manager is not running on localhost.

As the documentation indicates:
{code:java}
The external address of the network interface where the TaskManager is exposed. Because different TaskManagers need different values for this option, usually it is specified in an additional non-shared TaskManager-specific config file.
{code}
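The effect of the configuration flag proposed above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Flink code: the class AddressSelectionSketch, the skipLoopback parameter, the canReachJobManager predicate and the address 10.0.0.5 are all hypothetical names and values chosen for the example. The predicate stands in for the connection probe; behind a sidecar proxy it may return true for every local address, including loopback.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical sketch of address selection with an optional "skip loopback" flag. Not Flink code. */
final class AddressSelectionSketch {

    /** Builds an IPv4 address from literal octets without any DNS lookup. */
    static InetAddress ipv4(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            // Only thrown for byte arrays of illegal length, which cannot happen here.
            throw new IllegalArgumentException(e);
        }
    }

    /**
     * Returns the first candidate that passes the probe.
     * With skipLoopback set, loopback candidates are never considered,
     * even if the probe would succeed for them.
     */
    static Optional<InetAddress> select(
            List<InetAddress> candidates,
            Predicate<InetAddress> canReachJobManager,
            boolean skipLoopback) {
        return candidates.stream()
                .filter(addr -> !(skipLoopback && addr.isLoopbackAddress()))
                .filter(canReachJobManager)
                .findFirst();
    }

    public static void main(String[] args) {
        InetAddress loopback = ipv4(127, 0, 0, 1);
        InetAddress podIp = ipv4(10, 0, 0, 5); // made-up pod IP
        List<InetAddress> candidates = List.of(loopback, podIp); // loopback is tried first

        // Assumption: behind a transparent proxy the probe succeeds from any local address.
        Predicate<InetAddress> behindProxy = addr -> true;

        System.out.println("flag off: " + select(candidates, behindProxy, false));
        System.out.println("flag on:  " + select(candidates, behindProxy, true));
    }
}
```

With the flag off, the loopback candidate wins whenever the probe succeeds for it, which is the situation described in this issue. With the flag on, the first non-loopback candidate that passes the probe is chosen, so one shared configuration could serve all replicas.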
[jira] [Updated] (FLINK-29572) Flink Task Manager skip loopback interface for resource manager registration
[ https://issues.apache.org/jira/browse/FLINK-29572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Li updated FLINK-29572:
-----------------------------
    Description:
Currently the Flink Task Manager tries different local interfaces to bind to when connecting to the Resource Manager. The first one is the loopback interface. Normally, if the Job Manager is running on a remote host/container, connecting via the loopback interface will fail and the Task Manager will pick up the correct IP address.

However, if the Task Manager is running with some proxy, the loopback interface can connect to the remote host as well. This results in 127.0.0.1 being reported to the Resource Manager during registration, even though the Job Manager/Resource Manager runs on a remote host, and problems follow. For us, only one Task Manager can register in this case.

I suggest adding a configuration option to skip the loopback interface check when we know the Job/Resource Manager is running on a remote host/container.

  was:
Currently the Flink Task Manager tries different local interfaces to bind to when connecting to the Resource Manager. The first one is the loopback interface. Normally, if the Job Manager is running on a remote host/container, connecting via the loopback interface will fail and the Task Manager will pick up the correct IP address.

However, if the Task Manager is running with some proxy, the loopback interface can connect to the remote host as well. This results in 127.0.0.1 being reported to the Resource Manager during registration, even though the Job Manager/Resource Manager runs on a remote host, and problems follow. For us, only one Task Manager can register in this case.

I suggest adding a configuration option to skip the loopback interface check when we know the Job/Resource Manager is running on a remote host/container.
[jira] [Commented] (FLINK-29572) Flink Task Manager skip loopback interface for resource manager registration
[ https://issues.apache.org/jira/browse/FLINK-29572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615214#comment-17615214 ]

Kevin Li commented on FLINK-29572:
----------------------------------

{quote}Task Manager Log:
2022-10-08 17:22:32,983 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils [] - Trying to select the network interface and address to use by connecting to the leading JobManager.
2022-10-08 17:22:32,984 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils [] - TaskManager will try to connect for PT10S before falling back to heuristics
2022-10-08 17:22:33,356 DEBUG org.apache.flink.runtime.net.ConnectionUtils [] - Retrieved new target address flink-jobmanager/172.20.133.241:6123 for akka URL [akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*] .
2022-10-08 17:22:33,357 DEBUG org.apache.flink.runtime.net.ConnectionUtils [] - Trying to connect to [flink-jobmanager/172.20.133.241:6123] from local address [localhost/127.0.0.1] with timeout [100]
2022-10-08 17:22:33,361 DEBUG org.apache.flink.runtime.net.ConnectionUtils [] - Using InetAddress.getLoopbackAddress() immediately for connecting address
2022-10-08 17:22:33,361 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - TaskManager will use hostname/address 'localhost' (127.0.0.1) for communication.
2022-10-08 17:22:33,416 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils[] - Trying to start actor system, external address 127.0.0.1:6122, bind address 0.0.0.0:6122.{quote}
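The probe that the log above describes ("Trying to connect to [...] from local address [localhost/127.0.0.1] with timeout [100]") can be approximated with plain java.net sockets. The following is a minimal sketch, not the code in Flink's ConnectionUtils: the class LoopbackProbeSketch and its method names are hypothetical. It binds the client socket to a chosen local address before connecting, which is the step that would normally fail for a remote target when the local address is loopback, and which can succeed when a sidecar proxy intercepts outbound traffic.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical sketch of a "connect from this local address" probe. Not Flink code. */
final class LoopbackProbeSketch {

    /** Returns true if a TCP connection to target can be opened from the given local address. */
    static boolean canConnectFrom(InetAddress local, InetSocketAddress target, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.bind(new InetSocketAddress(local, 0)); // port 0: let the OS pick a free port
            socket.connect(target, timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // unreachable, refused, or timed out
        }
    }

    /**
     * Self-contained demonstration: opens a listener on the loopback interface
     * and probes it from the loopback address, which succeeds on a normal host.
     */
    static boolean probeLocalListener() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket listener = new ServerSocket(0, 1, loopback)) {
            InetSocketAddress target = new InetSocketAddress(loopback, listener.getLocalPort());
            return canConnectFrom(loopback, target, 100);
        } catch (IOException e) {
            throw new IllegalStateException("could not open a local listener", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("loopback probe succeeded: " + probeLocalListener());
    }
}
```

A successful probe only shows that the connection could be opened from that local address. It does not show that the address is reachable from the other side, which is why a proxy that makes the loopback probe succeed leads to 127.0.0.1 being used as the external address in the last log line.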
[jira] [Created] (FLINK-29572) Flink Task Manager skip loopback interface for resource manager registration
Kevin Li created FLINK-29572:
--------------------------------

Summary: Flink Task Manager skip loopback interface for resource manager registration
Key: FLINK-29572
URL: https://issues.apache.org/jira/browse/FLINK-29572
Project: Flink
Issue Type: Improvement
Components: API / Core
Affects Versions: 1.15.2
Environment: Flink 1.15.2
             Kubernetes with Istio Proxy
Reporter: Kevin Li

Currently the Flink Task Manager tries different local interfaces to bind to when connecting to the Resource Manager. The first one is the loopback interface. Normally, if the Job Manager is running on a remote host/container, connecting via the loopback interface will fail and the Task Manager will pick up the correct IP address.

However, if the Task Manager is running with some proxy, the loopback interface can connect to the remote host as well. This results in 127.0.0.1 being reported to the Resource Manager during registration, even though the Job Manager/Resource Manager runs on a remote host, and problems follow. For us, only one Task Manager can register in this case.

I suggest adding a configuration option to skip the loopback interface check when we know the Job/Resource Manager is running on a remote host/container.