Public bug reported:

Ubuntu 14.04 + OpenSwan 1:2.6.38-1

Environment: 3 controllers and 2 compute nodes

Steps to reproduce:
1. Create a VPN connection between tenant1 and tenant2 and check that it is
active.
2. Find the controller where one of the routers participating in the VPN
connection is scheduled (tenant1's router, for example).
3. Shut down this controller, wait some time, and check that tenant1's router
is rescheduled successfully and the VPN connection is restored.
4. Start the controller that was shut down and wait until it has completely
booted.
5. Reschedule tenant1's router back to its original controller (the one that
was shut down and restarted), wait some time, and check that tenant1's router
is rescheduled successfully and the VPN connection is restored (see the
rescheduling sketch after this list).
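
For steps 2 and 5, the router can be inspected and moved via the L3 agent
scheduler API. A minimal sketch using python-neutronclient; the credentials
are examples and the router/agent IDs are placeholders:

    from neutronclient.v2_0 import client

    neutron = client.Client(username='admin', password='secret',
                            tenant_name='admin',
                            auth_url='http://controller:5000/v2.0')

    router_id = 'TENANT1-ROUTER-ID'  # placeholder

    # Step 2: find which L3 agent (and thus which controller) hosts the router.
    hosting = neutron.list_l3_agent_hosting_routers(router_id)['agents']
    print(hosting[0]['host'])

    # Step 5: move the router back to the agent on the restarted controller.
    neutron.remove_router_from_l3_agent(hosting[0]['id'], router_id)
    neutron.add_router_to_l3_agent('ORIGINAL-AGENT-ID',  # placeholder
                                   {'router_id': router_id})

    # Check the VPN connection state afterwards.
    for conn in neutron.list_ipsec_site_connections()['ipsec_site_connections']:
        print(conn['name'], conn['status'])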

Actual result: tenant1's router is rescheduled and VMs can ping external
hosts, but the VPN connection goes to DOWN state on tenant1's side, with the
following error in vpn-agent.log on the controller where tenant1's router
was rescheduled back in step 5: http://paste.openstack.org/show/476459/

Analysis:
Pluto processes run inside the qrouter namespace (or the snat namespace in
the DVR case). When a controller is shut down, all network namespaces are
deleted (they live in tmpfs), but pluto's .pid and .ctl files survive the
reboot because they are stored on disk under
/opt/stack/data/neutron/ipsec/<router-id>/var/run/.
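
After a reboot the stale state is easy to detect: the namespace is gone while
the control files persist. A small sketch, assuming only the path layout
quoted above (the helper name is hypothetical):

    import os

    IPSEC_BASE = '/opt/stack/data/neutron/ipsec'  # path quoted in this report

    def stale_pluto_files(router_id):
        """Return any pluto .pid/.ctl files that survived a reboot."""
        run_dir = os.path.join(IPSEC_BASE, router_id, 'var', 'run')
        if not os.path.isdir(run_dir):
            return []
        return [os.path.join(run_dir, name)
                for name in os.listdir(run_dir)
                if name.endswith(('.pid', '.ctl'))]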

Then, when the router is rescheduled back to its original controller, the VPN
agent attempts to start a pluto process, and pluto fails when it finds that a
.pid file already exists. This behavior is determined by the flags pluto uses
to open the file [1],[2]; it is most probably a defense against accidentally
overwriting the .pid file of a running pluto instance.
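
The lines referenced in [1] and [2] open the lock file with O_CREAT|O_EXCL
semantics, so the open() call itself fails if the file exists. The same
failure mode can be reproduced in a few lines of Python (the file name here
is just an illustration):

    import errno
    import os

    lock_path = '/tmp/pluto.pid'   # stand-in for the real lock file
    open(lock_path, 'w').close()   # simulate the stale .pid file

    try:
        # O_CREAT | O_EXCL makes open() fail when the file already exists.
        os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except OSError as e:
        assert e.errno == errno.EEXIST
        print('lock file already exists -- pluto refuses to start')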

As this is not a pluto bug, the solution might be to add a workaround to
VPNaaS that cleans up the .ctl and .pid files on start-up.
Essentially the same approach is already used by the LibreSwan driver [3], so
we just need some refactoring to share it between the OpenSwan and LibreSwan
drivers. A possible shape of that clean-up is sketched below.
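
A minimal sketch of what the shared clean-up could look like. The class and
method names are illustrative (they are not the actual neutron-vpnaas
classes); only the file layout comes from this report, and the LibreSwan
driver's existing clean-up [3] is the model:

    import os

    class SwanProcess(object):
        """Hypothetical shared base for the OpenSwan/LibreSwan wrappers."""

        def __init__(self, config_dir):
            # e.g. /opt/stack/data/neutron/ipsec/<router-id>
            self.config_dir = config_dir

        def _cleanup_control_files(self):
            """Remove .pid/.ctl files left behind by an unclean shutdown."""
            run_dir = os.path.join(self.config_dir, 'var', 'run')
            if not os.path.isdir(run_dir):
                return
            for name in os.listdir(run_dir):
                if name.endswith(('.pid', '.ctl')):
                    os.remove(os.path.join(run_dir, name))

        def start(self):
            # Clean up before spawning pluto so a stale lock file cannot
            # keep the new instance from starting.
            self._cleanup_control_files()
            # ... spawn pluto as before ...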

[1] https://github.com/xelerance/Openswan/blob/master/programs/pluto/plutomain.c#L258-L259
[2] https://github.com/libreswan/libreswan/blob/master/programs/pluto/plutomain.c#L231-L232
[3] https://github.com/openstack/neutron-vpnaas/commit/00b633d284f0f21aa380fa47a270c612ebef0795

P.S.
Another way to reproduce this failure is to replace steps 3-5 with:
3. Send kill -9 to the pluto process on that controller.
4. Remove tenant1's router from the agent running on that controller and then
schedule it back (see the sketch below).
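
A sketch of this shortened reproduction, with the same kind of placeholder
IDs as above and assuming the conventional pluto.pid file name under
var/run:

    import os
    import signal

    from neutronclient.v2_0 import client

    neutron = client.Client(username='admin', password='secret',
                            tenant_name='admin',
                            auth_url='http://controller:5000/v2.0')

    router_id = 'TENANT1-ROUTER-ID'              # placeholder
    agent_id = 'L3-AGENT-ID-ON-THAT-CONTROLLER'  # placeholder

    # Step 3: kill -9 the pluto process named in its own .pid file.
    pid_file = '/opt/stack/data/neutron/ipsec/%s/var/run/pluto.pid' % router_id
    with open(pid_file) as f:
        os.kill(int(f.read().split()[0]), signal.SIGKILL)

    # Step 4: remove the router from that agent and schedule it back.
    neutron.remove_router_from_l3_agent(agent_id, router_id)
    neutron.add_router_to_l3_agent(agent_id, {'router_id': router_id})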

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: vpnaas

https://bugs.launchpad.net/bugs/1506794

Title:
  VPNaaS: Active VPN connection goes down after controller
  shutdown/start
