[
https://issues.apache.org/jira/browse/CLOUDSTACK-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Barys Dubauski updated CLOUDSTACK-10400:
----------------------------------------
Attachment: (was: testCloudStack.jar)
> VPC Router Corruption when working with large number of networks containing
> instances with public IP addresses
> ---------------------------------------------------------------------------------------------------------------
>
> Key: CLOUDSTACK-10400
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10400
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: API
> Affects Versions: 4.11.1.0
> Reporter: Barys Dubauski
> Priority: Critical
> Attachments: testCloudStack.jar
>
>
> We are using CloudStack 4.11.1 with KVM hosts. To simulate our
> use case, we wrote a small program that calls the CloudStack API to
> 1) create a VPC with 20 guest networks, each containing one virtual
> machine with a public IP address allocated, and
> 2) delete the machines and networks one by one.
>
> However, we frequently get a timeout error: sometimes during VM deletion,
> sometimes during guest network deletion, and sometimes even during the
> static NAT disable step. Once the timeout occurs, the VPC network / virtual
> router appears to be in an *unstable/corrupted* state. We then need to
> restart the virtual router with the clean-up option (sometimes several
> times, because the router VM itself fails to deploy). After that, we can
> continue deleting the remaining environments. Here are the high-level steps
> that we performed:
> # Create VPC Network
> # For each of the 20 "environments"
> ## Create Guest Network
> ## Add a VM to the network
> ## Acquire Public IP
> ## Associate the Public IP with VM
> # For each of the 20 "environments"
> ## Disassociate the Public IP
> ## Delete VM
> ## Delete Guest network
> # Delete VPC
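For readers who want the steps above expressed as API calls, here is a minimal sketch of the call order. The command names are regular CloudStack API commands, but the mapping from each step to a command is an assumption (in particular, that "Associate the Public IP with VM" means static NAT, which the description hints at). The attached program may do this differently, and the class name `VpcChurnPlan` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the API commands in the order the report describes.
// It does not talk to CloudStack; it only makes the sequence explicit.
class VpcChurnPlan {

    static List<String> plan(int environments) {
        List<String> calls = new ArrayList<>();
        calls.add("createVPC");

        // Build phase: one guest network, one VM and one public IP per environment.
        for (int i = 1; i <= environments; i++) {
            calls.add("createNetwork env=" + i);
            calls.add("deployVirtualMachine env=" + i);
            calls.add("associateIpAddress env=" + i);   // acquire public IP
            calls.add("enableStaticNat env=" + i);      // assumed: bind IP to the VM
        }

        // Teardown phase: the report sees timeouts somewhere in this loop.
        for (int i = 1; i <= environments; i++) {
            calls.add("disableStaticNat env=" + i);
            calls.add("disassociateIpAddress env=" + i);
            calls.add("destroyVirtualMachine env=" + i);
            calls.add("deleteNetwork env=" + i);
        }

        calls.add("deleteVPC");
        return calls;
    }

    public static void main(String[] args) {
        plan(20).forEach(System.out::println);
    }
}
```

With 20 environments this comes to 162 calls against a single VPC router, most of them asynchronous jobs, which gives a sense of how much churn the router sees in one run.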
>
> The hang / timeout can occur at any point during environment
> deletion. The first few deletions may go through successfully, and then
> a later one fails. The failure can be at any stage, that is, disassociating
> the public IP, deleting the VM or deleting the guest network. We looked at
> cloud.log, the agent log and the management server log but could not find
> any obvious errors. It seems that the management server sends the deletion
> request, but the VR does not respond and the system/network becomes stuck in
> an invalid state. As a result, the network often gets stuck in the
> “Shutdown” state.
>
> Here are some errors from the management server log:
> ============================================
> 2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965) (logid:dbe80d4f) Complete async
> job-29965, jobStatus: FAILED, resultCode: 530, result:
> org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed
> to delete network"}
> 2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request]
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f)
> Seq 4-667095694804259240: Received:
> { Ans: , MgmtId: 7474664765770, via:
> 4(cehv02.core.jazz.net), Ver: v1, Flags: 110,
> { GroupAnswer }
> }
> 2018-11-01 01:15:29,245 WARN
> [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl]
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f)
> *Unable to destroy guest network on router VM*[DomainRouter|r-3388-VM]
> 2018-11-01 01:15:29,247 WARN
> [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl]
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f)
> *Failed to destroy guest network config Ntwk*[1122|Guest|12] on router
> VM[DomainRouter|r-3388-VM]
> 2018-11-01 01:15:29,247 WARN [c.c.n.e.VpcVirtualRouterElement]
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f)
> *Failed to unplug nic in network Ntwk*[1122|Guest|12] for virtual router
> VM[DomainRouter|r-3388-VM]
> 2018-11-01 01:15:29,247 WARN [o.a.c.e.o.NetworkOrchestrator]
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f)
> *Unable to complete shutdown of the network elements due to element:
> VpcVirtualRouter*
> 2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator]
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f)
> Lock is released for network Ntwk[1122|Guest|12] as a part of network shutdown
> 2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator]
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f)
> *Network is not not in the correct state to be destroyed: Shutdown*
> ============================================
>
> I'm attaching a simple Java program that performs all of the steps
> described above and lets us reproduce the bug consistently.
>
> To use the application:
>
> java -jar testCloudStack.jar <CloudStack API URL, e.g.
> http://foo:8080/client/api> <apiKey> <secretKey> <zoneName>
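The apiKey and secretKey arguments are needed because CloudStack API requests are signed. As a rough sketch of what a client like the attached program has to do with them, the code below follows the signing scheme as I understand it from the CloudStack developer documentation: URL-encode the values, sort the parameters by name, lower-case the resulting string, compute HMAC-SHA1 with the secret key, then Base64- and URL-encode the result. The class and method names are hypothetical, the key values are placeholders, and the attached jar may be implemented differently.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper illustrating CloudStack-style request signing.
class CloudStackSigner {

    // URL-encode one value. URLEncoder writes spaces as '+'; use "%20" instead.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // name=value pairs joined with '&', sorted by parameter name (case-insensitive).
    static String query(Map<String, String> params) {
        Map<String, String> sorted = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        sorted.putAll(params);
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    // HMAC-SHA1 over the lower-cased sorted query, Base64-encoded.
    static String signature(String sortedQuery, String secretKey) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            byte[] digest = mac.doFinal(
                    sortedQuery.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Full request URL including the apiKey and the signature parameter.
    static String signedUrl(String apiUrl, Map<String, String> params,
                            String apiKey, String secretKey) {
        Map<String, String> all = new TreeMap<>(params);
        all.put("apiKey", apiKey);
        all.put("response", "json");
        String q = query(all);
        return apiUrl + "?" + q + "&signature=" + enc(signature(q, secretKey));
    }

    public static void main(String[] args) {
        Map<String, String> p = new TreeMap<>();
        p.put("command", "listZones");
        // Placeholder endpoint and keys, not real credentials.
        System.out.println(signedUrl("http://foo:8080/client/api", p, "myApiKey", "mySecretKey"));
    }
}
```

Most of the commands in the scenario are asynchronous, so a client would also poll the job status after each signed call; that part is left out here.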
>
> Note that the test application works successfully against CloudStack server
> 4.9.2 but consistently reproduces the bug against CloudStack server 4.11.1.