jotasixto commented on issue #2708: URL: https://github.com/apache/apisix-ingress-controller/issues/2708#issuecomment-4218338000
We experienced a very similar issue in our environment and wanted to share our findings, as the root cause in our case turned out to be related to **EKS security group rules blocking low-port traffic between worker nodes**, rather than a bug in APISIX or the ingress controller itself.

## Our Environment

- **EKS cluster** deployed via the official [terraform-aws-eks module](https://github.com/terraform-aws-modules/terraform-aws-eks)
- **APISIX standalone** deployed with Helm chart **v2.13.0**
- Backend services exposing **port 80** on their pods (front-end Nginx containers with `containerPort: 80`)

## Observed Behavior

We saw the same symptoms described in this issue: after pod rescheduling or scaling events, APISIX gateways would intermittently fail to reach backend pods with `(111: Connection refused)` errors, particularly when pods were placed on different nodes than the APISIX gateway pods.

When we attempted a workaround of ensuring at least one replica of each backend service ran on every node, we discovered the actual underlying problem: **traffic on port 80 was being blocked between worker nodes by the EKS node security group rules**.

## Root Cause: EKS Security Group Default Rules and Low Ports

The [official terraform-aws-eks documentation on network connectivity](https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/network_connectivity.md) explains that the default node security group rules only allow traffic on **ephemeral ports (1025-65535)** between the cluster control plane and worker nodes, and between nodes themselves. This is by design: AWS considers it a best practice because **non-privileged pods should not bind to ports below 1024**.

Looking at the [security group diagram](https://raw.githubusercontent.com/terraform-aws-modules/terraform-aws-eks/master/.github/images/security_groups.svg) from the module documentation, port 80 is simply not in the allowed range for node-to-node or cluster-to-node ingress traffic.

This means:

- When an APISIX gateway pod on **Node A** tries to reach a backend pod on **Node B** using port 80, the traffic is **silently dropped** by the node security group.
- When both APISIX and the backend pod happen to be on the **same node**, traffic works fine (it stays within the node and doesn't cross the security group boundary).
- This creates the **intermittent** behavior: it works or fails depending on pod placement, which changes with scaling events, node rotation, etc.

## Our Fix

We added custom security group rules to explicitly allow port 80 traffic between worker nodes ([Automya/claims#185](https://github.com/Automya/claims/pull/185)):

```yaml
node_security_group_additional_rules:
  ingress_node_ports_fronts:
    description: "Allow port 80 from cluster to worker nodes"
    protocol: "tcp"
    from_port: 80
    to_port: 80
    type: "ingress"
    source_cluster_security_group: true
  ingress_node_ports_fronts_self:
    description: "Allow port 80 between worker nodes"
    protocol: "tcp"
    from_port: 80
    to_port: 80
    type: "ingress"
    self: true
```

After applying these rules, the issue was **fully resolved**: APISIX gateways could reach backend pods on any node in the cluster without `Connection refused` errors, regardless of pod placement.

## Recommendation

If you're running on EKS (especially with the terraform-aws-eks module) and your backend services expose pods on **port 80 or any port below 1024**, check your node security group rules. The default rules only allow ephemeral ports (1025-65535), and traffic to low ports between nodes will be silently dropped.

The proper long-term fix is to **migrate backend services to listen on high ports within the allowed range (1025-65535)**, which aligns with AWS best practices for non-privileged pods. The custom security group rules above are a valid workaround if migrating ports immediately is not feasible.

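
To make the "check your node security group rules" step concrete, here is a minimal Python sketch of the port-range check. It assumes the rules are given as dicts shaped like the `IpPermissions` entries that EC2's `DescribeSecurityGroups` returns (`IpProtocol`, `FromPort`, `ToPort`). The function name `port_allowed` and both sample rule lists are hypothetical, and the check deliberately ignores the rule's source (self, cluster security group, CIDR), so treat it as a rough first pass rather than a full evaluation of the security group.

```python
from typing import Iterable, Mapping


def port_allowed(port: int, permissions: Iterable[Mapping], protocol: str = "tcp") -> bool:
    """Return True if any ingress rule covers `port` for `protocol`.

    Hypothetical helper: it only looks at protocol and port range,
    not at where the traffic is allowed to come from.
    """
    for rule in permissions:
        proto = rule.get("IpProtocol")
        if proto == "-1":
            # "-1" means all protocols, which implies all ports.
            return True
        if proto != protocol:
            continue
        low, high = rule.get("FromPort"), rule.get("ToPort")
        if low is not None and high is not None and low <= port <= high:
            return True
    return False


# Assumed rule set mirroring the defaults described above (ephemeral ports only).
default_rules = [{"IpProtocol": "tcp", "FromPort": 1025, "ToPort": 65535}]

# The same rules plus an explicit port 80 rule, as in our fix.
patched_rules = default_rules + [{"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80}]

print(port_allowed(80, default_rules))    # False
print(port_allowed(8080, default_rules))  # True
print(port_allowed(80, patched_rules))    # True
```

Running this against the real ingress rules of your node security group (for example, fetched with the AWS CLI or an SDK) should show quickly whether the backend `containerPort` falls outside the allowed range.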
**TL;DR**: In our case, APISIX was working correctly; it was the EKS node security groups blocking cross-node traffic on port 80 that caused the intermittent `Connection refused` errors after pod rescheduling.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
