RE: How are router checks scheduled?
The change to "/opt/cloud/bin/checkbatchs2svpn.sh" fixes the issue where not all of the VPN checks are returned. I'll create an issue and a PR.

Sean
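To see why a truncated response matters, here is a minimal Java sketch (not CloudStack code) that compares a batch reply against the peers that were requested. The class and method names are hypothetical, and it assumes the "ip:ret:info&" format emitted by checkbatchs2svpn.sh with IPv4 peer addresses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MissingVpnStatusSketch {
    // Returns the requested peer IPs that have no "ip:ret:info" entry in the details string.
    // Assumption: entries are separated by '&' and the IP is everything before the first ':'.
    static List<String> missingPeers(List<String> requested, String details) {
        Set<String> reported = new HashSet<>();
        for (String entry : details.split("&")) {
            int colon = entry.indexOf(':');
            if (colon > 0) {
                reported.add(entry.substring(0, colon));
            }
        }
        List<String> missing = new ArrayList<>();
        for (String ip : requested) {
            if (!reported.contains(ip)) {
                missing.add(ip);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Peers taken from the log excerpt in this thread.
        List<String> requested = Arrays.asList(
                "67.41.109.167", "65.100.18.183", "67.41.109.165", "67.41.109.166");
        String details = "67.41.109.167:0:connected&65.100.18.183:0:connected&67.41.109.165:0:connected&";
        System.out.println(missingPeers(requested, details));
    }
}
```

A peer that shows up in this list was never reported at all, which is different from a peer that was reported as down. Treating the two cases the same is what turns a cut-off response into a false "disconnected" alert.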
RE: How are router checks scheduled?
Found and fixed at least one issue (4.9.2.0). I had to update this file:

    /server/src/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java

"VpcVirtualNetworkApplianceManagerImpl" extends "VirtualNetworkApplianceManagerImpl", so when VpcVirtualNetworkApplianceManagerImpl was created it re-ran VirtualNetworkApplianceManagerImpl.start(). That rescheduled all of the various health and stats checks, so everything was running twice.

I added this to the above file:

    @Override
    public boolean start() {
        return true;
    }

    @Override
    public boolean stop() {
        return true;
    }

Now when we double-check our work by running this command:

    cat /var/log/cloudstack/management/management-server.log | grep "routers to update status"

we only see that job (for example) kicking off once every 30 seconds instead of twice every 30 seconds. Not sure yet whether this solved the CPU issue. Coincidentally, the above code is already in master as part of PR #866.

The issue with all the VPN alerts was exacerbated by this bug, but it does not look like the root cause. We have another fix in place for "/opt/cloud/bin/checkbatchs2svpn.sh". When a tenant has a lot of S2S VPN connections, not all of the statuses are returned when the S2S VPN checks occur. It seems the SSHExecutor doesn't get the entire output of the script if there is any delay during execution. The Check S2S VPN code assumes "disconnected" if a S2S status isn't included in the response (or, in our case, when the response is occasionally cut off and missing a S2S VPN).

Here is an example:

    2017-04-11 17:05:40,444 DEBUG [c.c.h.x.r.CitrixResourceBase] (DirectAgent-190:ctx-e894af45) (logid:cbbccfaa) Executing command in VR: /opt/cloud/bin/router_proxy.sh checkbatchs2svpn.sh 169.254.2.130 67.41.109.167 65.100.18.183 67.41.109.165 67.41.109.166
    2017-04-11 17:05:41,836 DEBUG [c.c.a.t.Request] (DirectAgent-190:ctx-e894af45) (logid:cbbccfaa) Seq 51-772085861117329631: Processing: { Ans: , MgmtId: 345050927939, via: 51(cloudxen01.dsm1.ippathways.net), Ver: v1, Flags: 110, [{"com.cloud.agent.api.CheckS2SVpnConnectionsAnswer":{"ipToConnected":{"65.100.18.183":true,"67.41.109.167":true,"67.41.109.165":true},"ipToDetail":{"65.100.18.183":"ISAKMP SA found;IPsec SA found;Site-to-site VPN have connected","67.41.109.167":"ISAKMP SA found;IPsec SA found;Site-to-site VPN have connected","67.41.109.165":"ISAKMP SA found;IPsec SA found;Site-to-site VPN have connected"},"details":"67.41.109.167:0:ISAKMP SA found;IPsec SA found;Site-to-site VPN have connected&65.100.18.183:0:ISAKMP SA found;IPsec SA found;Site-to-site VPN have connected&67.41.109.165:0:ISAKMP SA found;IPsec SA found;Site-to-site VPN have connected&","result":true,"wait":0}}] }

A check was requested for 4 S2S VPNs, but the result only returned 3 S2S VPN statuses (the entry for 67.41.109.166 is absent)!

To fix this we changed "/opt/cloud/bin/checkbatchs2svpn.sh" on the vRouter as follows. So far so good, but we won't know until we run for a while longer whether that was definitely the issue.

ORIGINALLY:

    for i in $*
    do
        info=`/opt/cloud/bin/checks2svpn.sh $i`
        ret=$?
        echo -n "$i:$ret:$info&"
    done

NEW:

    for i in $*
    do
        info=`/opt/cloud/bin/checks2svpn.sh $i`
        ret=$?
        batchInfo+="$i:$ret:$info&"
    done
    echo -n $batchInfo

Hopefully that makes sense and helps someone else. PR #1966 has also been very important in our environment.
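The double scheduling described in this message can be illustrated with a small, self-contained Java sketch. This is not the CloudStack source: BaseManager and the two Vpc classes are hypothetical stand-ins, and it assumes the container calls start() once on each manager instance it creates.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DoubleStartSketch {
    // Counts how many times the periodic checks were scheduled.
    static final AtomicInteger scheduledCount = new AtomicInteger();

    // Hypothetical stand-in for the base manager whose start() schedules the checks.
    static class BaseManager {
        public boolean start() {
            scheduledCount.incrementAndGet(); // stands in for scheduling the periodic tasks
            return true;
        }
    }

    // Subclass without an override: it inherits start() and schedules everything again.
    static class VpcManagerInherited extends BaseManager { }

    // Subclass with the no-op override described in the thread.
    static class VpcManagerFixed extends BaseManager {
        @Override
        public boolean start() {
            return true;
        }
    }

    // Simulates the container starting both components and reports how many schedules resulted.
    static int schedulesAfterStartup(BaseManager base, BaseManager vpc) {
        scheduledCount.set(0);
        base.start();
        vpc.start();
        return scheduledCount.get();
    }

    public static void main(String[] args) {
        System.out.println(schedulesAfterStartup(new BaseManager(), new VpcManagerInherited())); // prints 2
        System.out.println(schedulesAfterStartup(new BaseManager(), new VpcManagerFixed()));     // prints 1
    }
}
```

The no-op override works here only because the base instance still does the scheduling. If the subclass were the only instance started, the same override would leave the checks unscheduled.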
RE: How are router checks scheduled?
Yep! Exactly, we have that issue too. I am testing a possible fix right now; I'll let you know how it goes!
Re: How are router checks scheduled?
We've seen something very similar. By any chance, are you seeing any strange CPU load issues that grow over time as well? Our team has been chasing down an issue that appears to be related to S2S VPN checks, where a race condition seems to occur that threads out the CPU over time.
RE: How are router checks scheduled?
I do have two mgmt servers, but I have one powered off. The log excerpt is from one management server. This can be checked in the environment by running:

    cat /var/log/cloudstack/management/management-server.log | grep "routers to update status"

This is happening both in prod and in our dev environment. I've been digging through the code and have some ideas, and will post back later if I succeed in correcting the issue.

The biggest problem is the race condition between the two simultaneous S2S VPN checks. They step on each other and spam the heck out of us with the email alerting.
RE: How are router checks scheduled?
Do you have 2 management servers?

Simon Weller/615-312-6068

-----Original Message-----
From: Sean Lair [sl...@ippathways.com]
Received: Monday, 10 Apr 2017, 2:54PM
To: dev@cloudstack.apache.org [dev@cloudstack.apache.org]
Subject: How are router checks scheduled?

According to my management server logs, some of the periodic checks are getting kicked off twice at the same time. The CheckRouterTask is kicked off every 30 seconds, but each time it runs, it runs twice within the same second.

See logs below for example:

    2017-04-10 21:48:12,879 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-5f7bc584) (logid:4d5b1031) Found 10 routers to update status.
    2017-04-10 21:48:12,932 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-d027ab6f) (logid:1bc50629) Found 10 routers to update status.
    2017-04-10 21:48:42,877 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-2c8f4d18) (logid:e9111785) Found 10 routers to update status.
    2017-04-10 21:48:42,927 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-1bfd5351) (logid:ad0f95ef) Found 10 routers to update status.
    2017-04-10 21:49:12,874 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-ede0d2bb) (logid:6f244423) Found 10 routers to update status.
    2017-04-10 21:49:12,928 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-d58842d5) (logid:8442d73c) Found 10 routers to update status.

How is this scheduled/kicked off? I am debugging some site-to-site VPN alert problems, and they seem to be related to a race condition due to the "CheckRouterTask" being kicked off two at a time.

Thanks
Sean
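For anyone who wants to check their own logs for the same symptom, here is a rough Java sketch that counts matching log lines per second. The class and method names are made up for illustration, and it assumes each line starts with a "yyyy-MM-dd HH:mm:ss" timestamp as in the excerpt above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateRunSketch {
    // Counts lines containing the marker, keyed by the first 19 characters
    // of the line (the timestamp truncated to whole seconds).
    static Map<String, Integer> runsPerSecond(List<String> lines, String marker) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.length() >= 19 && line.contains(marker)) {
                counts.merge(line.substring(0, 19), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2017-04-10 21:48:12,879 DEBUG Found 10 routers to update status.",
                "2017-04-10 21:48:12,932 DEBUG Found 10 routers to update status.",
                "2017-04-10 21:48:42,877 DEBUG Found 10 routers to update status.");
        // Any second with a count above 1 suggests the task fired more than once.
        runsPerSecond(lines, "routers to update status")
                .forEach((second, count) -> System.out.println(second + " -> " + count));
    }
}
```

Grouping by whole seconds is a crude heuristic: two runs that straddle a second boundary would be missed, so a count of 1 everywhere is not proof that the duplication is gone.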