Re: com.cloud.agent.api.CheckRouterCommand timeout
On 21.06.2018 at 17:08, Daan Hoogland wrote:
> Makes sense. Well, let's hope it all breaks soon ;)

I am sure it will break! :D And then I will get back to you with more
questions!

Thanks a lot for taking the time!
Re: com.cloud.agent.api.CheckRouterCommand timeout
Makes sense. Well, let's hope it all breaks soon ;)

On Thu, Jun 21, 2018 at 2:15 PM, Melanie Desaive <m.desa...@heinlein-support.de> wrote:

> I called both stages of the script manually, but at a time when
> everything was working as expected and the routers were back to MASTER
> and BACKUP.
Re: com.cloud.agent.api.CheckRouterCommand timeout
Hi Daan,

On 21.06.2018 at 15:29, Daan Hoogland wrote:
> Melanie, attachments get deleted for this list. Your assumption about the
> communication path is right for Xen. Did you try to execute the script as
> it is called by the proxy script from the host, and capture what it
> returns? We had a bad problem with getting the template version on Xen in
> the past; this might be similar. That was due to the processing of the
> returned string in the script.

I called both stages of the script manually, but at a time when everything
was working as expected and the routers were back to MASTER and BACKUP.

It looked like this:

[root@acs-compute-5 ~]# /opt/cloud/bin/router_proxy.sh checkrouter.sh 169.254.1.178
Status: BACKUP

root@r-2595-VM:~# /opt/cloud/bin/checkrouter.sh
Status: BACKUP
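Daan's remark about a past problem with "processing of the returned string" suggests looking at how the script output is interpreted. The following is a minimal sketch, assuming the output has the "Status: <STATE>" shape seen in the console transcript in this thread. The helper name parse_router_status and the fallback to UNKNOWN are assumptions made for illustration; this is not how CloudStack itself parses the result.

```python
import re

# Hypothetical helper, not CloudStack code: it only mirrors the
# "Status: BACKUP" output quoted in the thread.
_STATUS_RE = re.compile(r"^\s*Status:\s*(\S+)\s*$", re.MULTILINE)

# The only two healthy states mentioned in the thread.
KNOWN_STATES = {"MASTER", "BACKUP"}


def parse_router_status(output: str) -> str:
    """Return MASTER or BACKUP, or UNKNOWN if the text cannot be parsed."""
    match = _STATUS_RE.search(output)
    if match is None:
        # Empty output, an error message, or an unexpected format.
        return "UNKNOWN"
    state = match.group(1).upper()
    return state if state in KNOWN_STATES else "UNKNOWN"
```

Feeding the captured output of both stages (proxy script on the host, checkrouter.sh on the VR) through the same parser would show whether stray whitespace, carriage returns, or extra lines change the result between the two hops.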
Re: com.cloud.agent.api.CheckRouterCommand timeout
Hi Daan,

thanks for your reply.

The latest occurrence of our VRs going to UNKNOWN resolved itself 24 hours
after it had occurred. Nevertheless, I would appreciate some insight into
how the checkRouter command is handled, as I expect the problem to come
back.

On 21.06.2018 at 10:39, Daan Hoogland wrote:
> Melanie, this depends a bit on the type of hypervisor. The command executes
> the checkrouter.sh script on the virtual router if it reaches it, but it
> seems your problem is before that. I would look at the network first and
> follow the path that the execution takes for your hypervisor type.

With Stephan's help I worked out the following guess for the path of
connections for the checkrouter command. Could someone please correct me
if my guess is wrong. ;)

x The management node connects to the XenServer hypervisor host via the
  management network on port 22 by SSH
x On the hypervisor host, the wrapper script
  "/opt/cloud/bin/router_proxy.sh" is used to call scripts on system VMs
  via link-local IP and port 3922
x On the VR, the script "/opt/cloud/bin/checkrouter.sh" does the actual
  check.

In our case the API call times out with these log messages:

x Operation timed out: Commands 1063975411966525473 to Host 29 timed
  out after 60
x Unable to update router r-2595-VM's status
x Redundant virtual router (name: r-2595-VM, id: 2595) just switch
  from BACKUP to UNKNOWN

To me it seems that this is a timeout that occurs while ACS management is
waiting for the API call to return. At which stage the answer is delayed,
(management host <-> virtualization host) or (virtualization host <-> VR),
is unclear to me. (SSH login from the virtualization host to the VR via
link-local works all the time.)

It is also unclear to me why both VRs of the respective network stay in
UNKNOWN for 24 hours while remaining accessible via link-local, but come
back immediately after a reboot.

I am happy for any suggestions or explanations on this topic and will
investigate further as soon as the problem comes back.

A portion of our management log for the latest occurrence of the problem
is attached to this email.

Greetings,

Melanie

--
Heinlein Support GmbH
Linux: Akademie - Support - Hosting

http://www.heinlein-support.de
Tel: 030 / 40 50 51 - 0
Fax: 030 / 40 50 51 - 19

Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing director: Peer Heinlein -- Registered office: Berlin
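One way to find out which hop is slow (management host <-> virtualization host, or virtualization host <-> VR) is to time each stage separately while the problem is present. The sketch below is only an illustration under stated assumptions: time_stage and StageResult are hypothetical names, and the default limit of 60 assumes that the "timed out after 60" in the log means seconds, which the log itself does not state.

```python
import subprocess
import time
from typing import List, NamedTuple, Optional


class StageResult(NamedTuple):
    name: str
    seconds: float
    timed_out: bool
    output: Optional[str]


def time_stage(name: str, argv: List[str], timeout_s: float = 60.0) -> StageResult:
    """Run one stage as a child process and measure how long it takes.

    timeout_s defaults to 60 on the assumption that the management log's
    "timed out after 60" is in seconds.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception is raised.
        return StageResult(name, time.monotonic() - start, True, None)
    return StageResult(name, time.monotonic() - start, False, proc.stdout.strip())
```

On the hypervisor host, the host-to-VR hop could be timed with the command quoted in this thread, for example time_stage("host->VR", ["/opt/cloud/bin/router_proxy.sh", "checkrouter.sh", "169.254.1.178"]). Timing the same call wrapped in an SSH login from the management node would cover the first hop; how the management server actually issues the call is not shown in the thread, so that wrapper is a guess.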
Re: com.cloud.agent.api.CheckRouterCommand timeout
Melanie, this depends a bit on the type of hypervisor. The command executes
the checkrouter.sh script on the virtual router if it reaches it, but it
seems your problem is before that. I would look at the network first and
follow the path that the execution takes for your hypervisor type.

On Wed, Jun 20, 2018 at 1:53 PM, Melanie Desaive <m.desa...@heinlein-support.de> wrote:

> Hi all,
>
> we have a recurring problem with our virtual routers. From the log
> messages it seems that com.cloud.agent.api.CheckRouterCommand runs into
> a timeout and the router state therefore switches to UNKNOWN.
>
> All network traffic through the routers is still working. They can be
> accessed by their link-local IP addresses, and the configuration looks
> good at first sight. But configuration changes through the CloudStack API
> no longer reach the routers. A reboot fixes the problem.
>
> I would like to investigate a little further but lack understanding
> of how the checkRouter command tries to access the virtual router.
>
> Could someone point me to some relevant documentation or give a short
> overview of how the connection from CS-Management is made and where such
> a timeout could occur?
>
> As background information, the sequence from the management log looks
> roughly like this:
>
> ---
>
> x Every few seconds the com.cloud.agent.api.CheckRouterCommand returns
>   a state BACKUP or MASTER correctly
> x When the problem occurs the log messages change. Some snippets below
>
>   x ... Waiting some more time because this is the current command
>   x ... Waiting some more time because this is the current command
>   x Could not find exception:
>     com.cloud.exception.OperationTimedoutException in error code list for
>     exceptions
>   x Timed out on Seq 28-2352567855348137104
>   x Seq 28-2352567855348137104: Cancelling.
>   x Operation timed out: Commands 2352567855348137104 to Host 28 timed
>     out after 60
>   x Unable to update router r-2594-VM's status
>   x Redundant virtual router (name: r-2594-VM, id: 2594) just switch
>     from MASTER to UNKNOWN
>
> x Those error messages are now repeated for each following
>   CheckRouterCommand until the virtual router is rebooted
>
> Greetings,
>
> Melanie
>
> --
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
>
> http://www.heinlein-support.de
> Tel: 030 / 40 50 51 - 0
> Fax: 030 / 40 50 51 - 19
>
> Mandatory disclosures per §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Managing director: Peer Heinlein -- Registered office: Berlin

--
Daan
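To see when each router left MASTER or BACKUP and when it recovered, the state changes can be pulled out of the management log. This is a rough sketch that matches only the message text quoted in this thread ("Redundant virtual router (name: ..., id: ...) just switch from ... to ..."); the full log line layout is not shown here, so timestamps and other fields are ignored. find_transitions and Transition are hypothetical names.

```python
import re
from typing import Iterable, List, NamedTuple


class Transition(NamedTuple):
    router: str
    router_id: int
    old_state: str
    new_state: str


# Matches only the message body quoted in the thread; whatever precedes it
# on a real log line (timestamp, logger, thread) is not assumed.
_SWITCH_RE = re.compile(
    r"Redundant virtual router \(name: (?P<name>[^,]+), id: (?P<id>\d+)\) "
    r"just switch from (?P<old>\w+) to (?P<new>\w+)"
)


def find_transitions(lines: Iterable[str]) -> List[Transition]:
    """Collect every router state change found in the given log lines."""
    found = []
    for line in lines:
        match = _SWITCH_RE.search(line)
        if match:
            found.append(
                Transition(
                    match["name"], int(match["id"]), match["old"], match["new"]
                )
            )
    return found
```

Calling find_transitions on an open log file (a file object iterates line by line) gives the ordered list of changes per router, which makes it easier to check whether both VRs of a network went to UNKNOWN at the same moment.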