vdombrovski opened a new issue, #8967: URL: https://github.com/apache/cloudstack/issues/8967
##### ISSUE TYPE
 * Bug Report

##### COMPONENT NAME
~~~
VR
~~~

##### CLOUDSTACK VERSION
~~~
4.17.2
~~~
Note: I believe this also affects releases 4.18 and 4.19, as no relevant changes have been made to the VR code.

##### CONFIGURATION
- Advanced zone
- Standard networking using virtual routers only
- /24 pool of "public" IPs available for the virtual routers

##### OS / ENVIRONMENT
Ubuntu Focal 20.04

##### SUMMARY
In our workflow, public IPs are assigned, unassigned, and reassigned to different networks, possibly multiple times per day (mostly, but not only, via https://github.com/apache/cloudstack-kubernetes-provider).

After a while, we noticed that many IP addresses that are unused according to the ACS database were actually answering L2 ARP pings. In addition, some of our IPs return two different MAC addresses to arping:

```
ARPING XX.YY.25.156
60 bytes from 1e:00:fe:00:04:ac (XX.YY.25.156): index=0 time=306.714 usec
60 bytes from 1e:00:63:00:04:b5 (XX.YY.25.156): index=1 time=342.809 usec
```

Our investigation has shown these conflicts on around 10 IPs in two of our /24 subnets, which seems like too many to be a coincidence.
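The arping check above can be repeated across a whole subnet and the replies scanned for duplicates. The following is only a minimal sketch: it assumes the reply-line format shown in the output above, and the names `macs_per_ip` and `conflicting_ips` are hypothetical helpers, not part of CloudStack or arping.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Assumption: reply lines look like the ones in this report, e.g.
# "60 bytes from 1e:00:fe:00:04:ac (XX.YY.25.156): index=0 time=306.714 usec"
REPLY_RE = re.compile(
    r"from\s+(?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\s+\((?P<ip>[^)]+)\)"
)


def macs_per_ip(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Group the MAC addresses seen in arping reply lines by answering IP."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        match = REPLY_RE.search(line)
        if match:  # header lines such as "ARPING ..." do not match
            seen[match.group("ip")].add(match.group("mac").lower())
    return dict(seen)


def conflicting_ips(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Return only the IPs that were answered by more than one MAC."""
    return {ip: macs for ip, macs in macs_per_ip(lines).items() if len(macs) > 1}
```

Feeding the captured arping output of every address in the pool through `conflicting_ips` would list each IP that is claimed by more than one router interface.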
We have confirmed that both MAC addresses belong to running, healthy virtual routers:

```
[cloud]> select instance_id from nics where mac_address="1e:00:63:00:04:b5";
+-------------+
| instance_id |
+-------------+
|        5654 |
+-------------+
1 row in set (0.010 sec)

[cloud]> select instance_id from nics where mac_address="1e:00:fe:00:04:ac";
+-------------+
| instance_id |
+-------------+
|        5624 |
+-------------+
1 row in set (0.007 sec)
```

Both routers have the IP address configured on an interface:

```
ip a | grep XX.YY.25.156
    inet XX.YY.25.156/24 brd XX.YY.25.255 scope global secondary eth2
```

In the "databag" of the "illegitimate" router causing the conflict, the IP is still declared (sometimes with the flag add: true, sometimes with add: false):

```
# /etc/cloudstack/ips.json
{
  "add": true,
  "broadcast": "XX.YY.25.255",
  "cidr": "XX.YY.25.156/24",
  "device": "eth2",
  "first_i_p": false,
  "gateway": "XX.YY.25.1",
  "is_private_gateway": false,
  "netmask": "255.255.255.0",
  "network": "XX.YY.25.0/24",
  "new_nic": false,
  "nic_dev_id": 2,
  "nw_type": "public",
  "one_to_one_nat": false,
  "public_ip": "XX.YY.25.156",
  "size": "24",
  "source_nat": false,
  "vif_mac_address": "1e:00:63:00:04:b5"
}
```

To my understanding, this file is read and merged whenever an IP state change is requested; however, it is not merged properly when an IP is unassigned.

Deleting and recreating all the virtual routers would obviously solve the issue, since this cache would be cleared. However, that causes massive service disruption, so we would like to make sure the root cause is fixed first.

##### STEPS TO REPRODUCE
No specific command line here, but I think bouncing a public IP between 2 guest routers several times (inside separate VPCs, if these are VPC networks) would result in this issue.

1. Create 2 guest networks A and B
2. Create 2 VMs, one inside network A, the other inside network B
3. Associate an IP address with network A, and create a load balancer rule for some port of VM A (should also work with port forwarding and static NAT)
4. Release the IP, associate it with network B, and create a load balancer rule for some port of VM B (should also work with port forwarding and static NAT)
5. Repeat steps 3-4 until you get an IP address conflict (or until you see invalid entries in the databag)

##### EXPECTED RESULTS
The IP is properly garbage collected on the router that no longer holds it.

##### ACTUAL RESULTS
IP address conflict

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cloudstack.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
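The databag inspection from the summary can be sketched as a small audit that compares a router's `ips.json` with the public IPs the management database assigns to that router. This is only an illustration under stated assumptions: the report shows a single entry and not the full file layout, so the sketch walks the whole JSON document and picks up every object that has a `public_ip` field. The helper names `iter_ip_entries` and `stale_entries` are hypothetical and not part of the VR code.

```python
from typing import Any, Iterable, Iterator, List


def iter_ip_entries(node: Any) -> Iterator[dict]:
    """Yield every JSON object that carries a "public_ip" field.

    Assumption: the overall layout of ips.json is unknown here, so the
    whole document is walked recursively instead of relying on fixed keys.
    """
    if isinstance(node, dict):
        if "public_ip" in node:
            yield node
        for value in node.values():
            yield from iter_ip_entries(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_ip_entries(item)


def stale_entries(databag: Any, expected_ips: Iterable[str]) -> List[dict]:
    """Return public-IP entries in the databag that the router should not hold.

    expected_ips is the set of public IPs the management database
    currently assigns to this router (obtained separately).
    """
    expected = set(expected_ips)
    return [
        entry
        for entry in iter_ip_entries(databag)
        if entry.get("nw_type") == "public" and entry["public_ip"] not in expected
    ]
```

On a router, the databag could be loaded with `json.load(open("/etc/cloudstack/ips.json"))` and passed in together with the IPs taken from the database. Any entry returned is a candidate for the stale state described above, whatever its `add` flag says.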