So Suresh's advice has pushed me in the right direction. The VM was up but the 
agent state was down. I was able to connect to the VM to continue 
investigating, and it turns out the VM is having network issues connecting to 
both my load balancer and my secondary storage server. I don't think I'm 
understanding how the public network portion is supposed to work in my zone 
and could use some clarification. First let me explain my network setup. On my 
compute nodes, ideally, I want to use 3 NICs:

1. A management NIC for management traffic. I was using cloudbr0 for this. 
cloudbr0 is a bridge I created that is connected to an access port on my 
switch. No VLAN tagging is required to use this network (it uses VLAN 20).
2. A cloud NIC for both public and guest traffic. I was using cloudbr1 for 
this. cloudbr1 is a bridge I created that is connected to a trunk port on my 
switch. Public traffic uses VLAN 48 and guest traffic should use VLANs 400 - 
656. As the port is trunked I have to use VLAN tagging for any traffic over 
this NIC.
3. A storage NIC for storage traffic. I use a bond called "bond-storage" for 
this. bond-storage is connected to an access port on my switch. No VLAN 
tagging is required to use this network (it uses VLAN 96). There's a rough 
sketch of this wiring right below the list.
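
In iproute2 terms the intended wiring looks roughly like this (the bond mode 
is just an example; my real config lives in the distro's network files):

# Rough sketch of the wiring described above (bond mode is an assumption;
# links would need to be down before being enslaved):
ip link add cloudbr0 type bridge
ip link set enp3s0f0 master cloudbr0           # access port, VLAN 20 untagged

ip link add bond-services type bond mode 802.3ad
ip link set enp65s0f0 master bond-services
ip link set enp65s0f1 master bond-services
ip link add cloudbr1 type bridge
ip link set bond-services master cloudbr1      # trunk port, VLAN 48 + 400-656 tagged

ip link add bond-storage type bond mode 802.3ad
ip link set enp71s0 master bond-storage
ip link set enp72s0 master bond-storage        # access port, VLAN 96 untagged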

For now I've removed the storage NIC from the configuration to simplify my 
troubleshooting, so I should only be working with cloudbr0 and cloudbr1. To me 
the public network is a *non-RFC 1918* address that should be assigned to 
tenant VMs for external internet access. Why do system VMs need/get a public 
IP address? Can't they access all the internal CloudStack servers using the 
pod's management network?

So the first problem I'm seeing is that whenever I tell CloudStack to tag VLAN 
48 for public traffic, it uses the underlying bond under cloudbr1 and not the 
bridge. I don't know where it is even getting this name, as I never provided 
it to CloudStack.

Here is how I have it configured: 
https://drive.google.com/file/d/10PxLdp6e46_GW7oPFJwB3sQQxnvwUhvH/view?usp=sharing

Here is the message in the management logs:

2021-06-16 16:00:40,454 INFO  [c.c.v.VirtualMachineManagerImpl] 
(Work-Job-Executor-13:ctx-0f39d8e2 job-4/job-68 ctx-a4f832c5) (logid:eb82035c) 
Unable to start VM on Host[-2-Routing] due to Failed to create vnet 48: Error: 
argument "bond-services.48" is wrong: "name" not a valid ifnameCannot find 
device "bond-services.48"Failed to create vlan 48 on pif: bond-services.

This ultimately results in an error and the system VM never even starts.
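
The error text looks like it comes straight from iproute2, so my guess (just 
an assumption, I haven't confirmed what the agent actually runs) is that it is 
shelling out to something like this on the host:

# My reconstruction of the command behind the log message above. Note that
# "bond-services.48" is 16 characters, one over the kernel's 15-character
# interface name limit, which I suspect is what "not a valid ifname" is
# complaining about:
ip link add link bond-services name bond-services.48 type vlan id 48
Error: argument "bond-services.48" is wrong: "name" not a valid ifname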

If I remove the VLAN tag from the configuration 
(https://drive.google.com/file/d/11tF6YIHm9xDZvQkvphi1xvHCX_X9rDz1/view?usp=sharing)
then the VM starts and gets a public IP, but without a tagged NIC it can't 
actually connect to the network. This is from inside the system VM:

root@s-9-VM:~# ip --brief addr
lo               UNKNOWN        127.0.0.1/8
eth0             UP             169.254.91.216/16
eth1             UP             10.2.21.72/22
eth2             UP             192.41.41.162/25
eth3             UP             10.2.99.15/22
root@s-9-VM:~# ping 192.41.41.129
PING 192.41.41.129 (192.41.41.129): 56 data bytes
92 bytes from s-9-VM (192.41.41.162): Destination Host Unreachable
92 bytes from s-9-VM (192.41.41.162): Destination Host Unreachable
92 bytes from s-9-VM (192.41.41.162): Destination Host Unreachable
92 bytes from s-9-VM (192.41.41.162): Destination Host Unreachable
^C--- 192.41.41.129 ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss

Obviously if the network isn't functioning then it can't connect to my storage 
server and the agent never starts. How do I set up my public network so that 
it tags the packets going over cloudbr1? Also, can I not have a public IP 
address for system VMs, or is this required?
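
In case it helps with diagnosis, this is the kind of manual check I was 
planning to run on the hypervisor to confirm the switch actually passes VLAN 
48 through to cloudbr1 (the address is a placeholder for a spare IP on the 
public subnet):

# Temporary VLAN 48 interface on top of cloudbr1 for testing:
ip link add link cloudbr1 name cloudbr1.48 type vlan id 48
ip addr add <spare-public-ip>/25 dev cloudbr1.48
ip link set cloudbr1.48 up
ping -c 3 192.41.41.129        # the address the system VM couldn't reach above
# Clean up afterwards:
ip link del cloudbr1.48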

I have some other issues as well, like the fact that it is creating a storage 
NIC on the system VMs even though I deleted my storage network from the zone, 
but I can tackle one problem at a time. If anyone is curious or it helps 
visualize my network, here is a little ASCII diagram of how I have the compute 
node's networking set up. Hopefully it comes across the mailing list correctly 
and not all mangled:

+===============================================================================================================
|
|    enp3s0f0 (eth)     enp3s0f1 (eth)     enp65s0f0 (eth)    enp65s0f1 (eth)    enp71s0 (eth)     enp72s0 (eth)
|       |                  |                   |                  |                 |                  |
|       |                  |                   +--------+---------+                 +--------+---------+
|       |                  |                            |                                    |
|       |                  |                   bond-services (bond)                          |
|       |                  |                            |                                    |
|       |                  |                            |                                    |
|       |                  |                            |                                    |
|    cloudbr0 (bridge)    N/A                     cloudbr1 (bridge)                 bond-storage (bond)
|    VLAN 20 (access)                        VLAN 48, 400 - 656 (trunk)               VLAN 96 (access)

On 6/16/21 9:38 AM, Andrija Panic wrote:
> " There is no secondary storage VM for downloading template to image store
> LXC_SEC_STOR1 "
>
> So next step to investigate why there is no SSVM (can hosts access the
> secondary storage NFS, can they access the Primary Storage, etc - those
> tests you can do manually) - and as Suresh advised - one it's up, is it all
> green (COnnected / Up state).
>
> Best,
>

I appreciate everyone's help.

-- 
Thanks,
Joshua Schaeffer
