vdombrovski opened a new issue, #8967:
URL: https://github.com/apache/cloudstack/issues/8967

   
   ##### ISSUE TYPE
    * Bug Report
   
   ##### COMPONENT NAME
   ~~~
   VR
   ~~~
   
   ##### CLOUDSTACK VERSION
   
   ~~~
   4.17.2
   ~~~
   
   Note: I believe this also affects releases 4.18 and 4.19, as no relevant 
changes have been made to the VR code.
   
   ##### CONFIGURATION
   - Advanced zone
   - Standard networking using virtual routers only
   - /24 pool of "public" IPs available for the virtual routers
   
   
   ##### OS / ENVIRONMENT
   Ubuntu Focal 20.04
   
   
   ##### SUMMARY
   
   In our workflow, public IPs are assigned, unassigned and reassigned to 
different networks, possibly multiple times per day (mostly, but not only, via 
https://github.com/apache/cloudstack-kubernetes-provider).
   
   After a while, we noticed that many IP addresses that are unused according 
to the ACS database were actually answering L2 arping requests. In addition, 
some of our IPs get arping replies from 2 different MACs:
   
   ```
   ARPING XX.YY.25.156
   60 bytes from 1e:00:fe:00:04:ac (XX.YY.25.156): index=0 time=306.714 usec
   60 bytes from 1e:00:63:00:04:b5 (XX.YY.25.156): index=1 time=342.809 usec
   ```
   Our investigation has shown such conflicts on around 10 IPs in two of our 
/24 subnets, which seems like too many to be a coincidence.
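   
For anyone wanting to look for the same symptom across a whole subnet, here is 
a minimal Python sketch that scans captured arping output for IPs answered by 
more than one MAC. It assumes reply lines in the format shown above; the 
function name `find_arp_conflicts` is made up for this illustration, and other 
arping implementations print replies differently.

```python
import re
from collections import defaultdict

# Assumption: reply lines look like the ones quoted in this issue, e.g.
#   60 bytes from 1e:00:fe:00:04:ac (XX.YY.25.156): index=0 time=306.714 usec
REPLY_RE = re.compile(
    r"bytes from (?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}) \((?P<ip>[^)]+)\)"
)


def find_arp_conflicts(arping_output):
    """Return {ip: set of MACs} for every IP answered by more than one MAC."""
    macs_by_ip = defaultdict(set)
    for line in arping_output.splitlines():
        match = REPLY_RE.search(line)
        if match:
            macs_by_ip[match.group("ip")].add(match.group("mac").lower())
    # Keep only the IPs that got replies from at least two distinct MACs.
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}
```

Feeding it the concatenated output of one arping run per address in the /24 
should list only the conflicting addresses.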
   
   We have confirmed that both MAC addresses belong to alive and healthy 
virtual routers:
   
   ```
   [cloud]> select instance_id from nics where mac_address="1e:00:63:00:04:b5";
   +-------------+
   | instance_id |
   +-------------+
   |        5654 |
   +-------------+
   1 row in set (0.010 sec)
   
   [cloud]> select instance_id from nics where mac_address="1e:00:fe:00:04:ac";
   +-------------+
   | instance_id |
   +-------------+
   |        5624 |
   +-------------+
   1 row in set (0.007 sec)
   ```
   
   Both routers have the IP address configured:
   
   ```
   ip a | grep XX.YY.25.156
       inet XX.YY.25.156/24 brd XX.YY.25.255 scope global secondary eth2
   ```
   
   Inside the "databag" of the so-called "illegitimate" router causing the 
conflict, we can see the IP declared (sometimes with the flag add: true, 
sometimes with add: false):
   ```
   # /etc/cloudstack/ips.json
       {
         "add": true,
         "broadcast": "XX.YY.25.255",
         "cidr": "XX.YY.25.156/24",
         "device": "eth2",
         "first_i_p": false,
         "gateway": "XX.YY.25.1",
         "is_private_gateway": false,
         "netmask": "255.255.255.0",
         "network": "XX.YY.25.0/24",
         "new_nic": false,
         "nic_dev_id": 2,
         "nw_type": "public",
         "one_to_one_nat": false,
         "public_ip": "XX.YY.25.156",
         "size": "24",
         "source_nat": false,
         "vif_mac_address": "1e:00:63:00:04:b5"
       }
   ```
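   
To check other routers for the same kind of entry, a small Python sketch along 
these lines could be run against a copy of the databag. It assumes the 
top-level JSON object maps device names to lists of entries shaped like the 
one above, and it skips any other top-level values; `load_databag` and 
`find_ip_entries` are names invented for this example.

```python
import json


def load_databag(path):
    """Load a databag file such as /etc/cloudstack/ips.json."""
    with open(path) as handle:
        return json.load(handle)


def find_ip_entries(databag, public_ip):
    """Return every databag entry whose "public_ip" equals public_ip."""
    matches = []
    for value in databag.values():
        # Assumption: per-device entries are lists of dicts. Anything else
        # at the top level (for example a plain string) is ignored.
        if not isinstance(value, list):
            continue
        for entry in value:
            if isinstance(entry, dict) and entry.get("public_ip") == public_ip:
                matches.append(entry)
    return matches
```

Calling `find_ip_entries(load_databag("/etc/cloudstack/ips.json"), 
"XX.YY.25.156")` on a router and printing the `add`, `device` and 
`vif_mac_address` fields of each result would show whether the router still 
carries the address.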
   
   To my understanding, this file is read and merged whenever an IP state 
change is requested; however, it is not merged properly on unassignment.
   
   Obviously, deleting and recreating all the virtual routers would solve the 
issue, as this cache would be cleared; however, because this would cause 
massive service disruption, we would like to make sure the root cause is fixed 
first.
   
   ##### STEPS TO REPRODUCE
   
   No specific command line here, but I think bouncing a public IP between 2 
guest routers (inside separate VPCs, of course, if these are VPC networks) 
several times would result in this issue.
   
   1. Create 2 guest networks A and B
   2. Create 2 VMs, one inside network A, the other inside network B
   3. Associate an IP address to network A, and create a loadbalancer rule for 
some port of VM A (should also work with port-forwarding and static NAT)
   4. Release the IP, associate it to network B, and create a loadbalancer rule 
for some port of VM B (should also work with port-forwarding and static NAT)
   5. Repeat steps 3-4 until you get an IP address conflict (or until you see 
invalid entries in the databag)
   
   ##### EXPECTED RESULTS
   
   The IP is properly garbage collected on the router that no longer holds it.
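   
This expectation can be written as a check. The Python sketch below parses 
`ip a` output in the format shown earlier and reports addresses on a device 
that are not in the set the router is supposed to hold. How that set is 
obtained (for example from the ACS database) is left open here, and the 
function names are hypothetical.

```python
import re

# Matches IPv4 lines of `ip a` output, e.g.
#   inet XX.YY.25.156/24 brd XX.YY.25.255 scope global secondary eth2
# The last token on the line is taken as the device (or label) name.
INET_RE = re.compile(r"^\s*inet (?P<ip>[^/\s]+)/\d+\b.*\s(?P<dev>\S+)\s*$")


def configured_ips(ip_addr_output, device):
    """Return the set of IPv4 addresses configured on the given device."""
    ips = set()
    for line in ip_addr_output.splitlines():
        match = INET_RE.match(line)
        if match and match.group("dev") == device:
            ips.add(match.group("ip"))
    return ips


def stale_ips(ip_addr_output, device, assigned_ips):
    """Return addresses present on the device but absent from assigned_ips."""
    return sorted(configured_ips(ip_addr_output, device) - set(assigned_ips))
```

With correct garbage collection, `stale_ips` should return an empty list on 
every router once a released IP has been associated elsewhere.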
   
   ##### ACTUAL RESULTS
   IP address conflict
   

