Re: com.cloud.agent.api.CheckRouterCommand timeout

2018-06-21 Thread Melanie Desaive



On 21.06.2018 at 17:08, Daan Hoogland wrote:
> makes sense, well let's hope all breaks soon ;)

I am sure it will break! :D

And then I will get back to you with more questions!

Thanks a lot for taking the time!


Re: com.cloud.agent.api.CheckRouterCommand timeout

2018-06-21 Thread Daan Hoogland
makes sense, well let's hope all breaks soon ;)


Re: com.cloud.agent.api.CheckRouterCommand timeout

2018-06-21 Thread Melanie Desaive
Hi Daan,

On 21.06.2018 at 15:29, Daan Hoogland wrote:
> Melanie, attachments get deleted for this list. Your assumption for the
> comm path is right for Xen. Did you try to execute the script as it is
> called by the proxy script from the host, and capture the return value? We
> had a bad problem with getting the template version in the past on Xen;
> this might be similar. That was due to processing of the returned string
> in the script.

I called both stages of the script manually, but at a time when everything
was working as expected and the routers were back to MASTER and BACKUP.

It looked like this:

[root@acs-compute-5 ~]# /opt/cloud/bin/router_proxy.sh checkrouter.sh 169.254.1.178
Status: BACKUP

root@r-2595-VM:~# /opt/cloud/bin/checkrouter.sh
Status: BACKUP
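
Since the earlier Xen problem was traced to how the returned string was processed, it may help to check the captured output with a strict parser. The following is only a minimal Python sketch, assuming the script prints a single "Status: <STATE>" line as in the transcript above. The function name and the set of accepted states are illustrative assumptions taken from this thread, not CloudStack's own parsing code.

```python
import re

# Assumption: the only states considered are the ones mentioned in this thread.
KNOWN_STATES = {"MASTER", "BACKUP", "UNKNOWN"}

# Matches a line such as "Status: BACKUP", tolerating surrounding whitespace.
STATUS_RE = re.compile(r"^\s*Status:\s*(\S+)\s*$", re.MULTILINE)


def parse_router_status(output: str) -> str:
    """Return the state found in `output`, or "UNKNOWN" if nothing usable is found."""
    match = STATUS_RE.search(output)
    if match is None:
        return "UNKNOWN"
    state = match.group(1).upper()
    return state if state in KNOWN_STATES else "UNKNOWN"
```

Feeding it the output captured from both stages shows whether extra whitespace, a missing line, or an unexpected token would turn a healthy router into UNKNOWN.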



Re: com.cloud.agent.api.CheckRouterCommand timeout

2018-06-21 Thread Melanie Desaive
Hi Daan,

thanks for your reply.

The latest occurrence of our VRs going to UNKNOWN resolved itself 24 hours
after it had occurred. Nevertheless, I would appreciate some insight into
how the checkRouter command is handled, as I expect the problem to come
back.

On 21.06.2018 at 10:39, Daan Hoogland wrote:
> Melanie, this depends a bit on the type of hypervisor. The command executes
> the checkrouter.sh script on the virtual router if it reaches it, but it
> seems your problem is before that. I would look at the network first and
> follow the path that the execution takes for your hypervisor type.

With Stephan's help I worked out the following guess for the connection
path of the checkrouter command. Could someone please correct me if my
guess is wrong? ;)

 x The management node connects to the XenServer hypervisor host via the
management network by SSH on port 22.
 x On the hypervisor host, the wrapper script
"/opt/cloud/bin/router_proxy.sh" is used to call scripts on system VMs
via their link-local IP on port 3922.
 x On the VR, the script "/opt/cloud/bin/checkrouter.sh" does the actual
check.

In our case the API call times out with these log messages:
 x Operation timed out: Commands 1063975411966525473 to Host 29 timed
out after 60
 x Unable to update router r-2595-VM's status
 x Redundant virtual router (name: r-2595-VM, id: 2595)  just switch
from BACKUP to UNKNOWN
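
To see how often and for which routers this happens, the two message types above can be pulled out of the management log. This is a rough Python sketch, assuming each message appears on a single log line with exactly the wording quoted above. The function name and dictionary keys are made up for illustration.

```python
import re

# Patterns follow the message texts quoted in this mail; any prefix such as a
# timestamp is ignored because search() is used instead of match().
TIMEOUT_RE = re.compile(r"Commands (\d+) to Host (\d+) timed out after (\d+)")
SWITCH_RE = re.compile(
    r"\(name: (\S+), id: (\d+)\)\s+just switch from (\w+) to (\w+)"
)


def summarize(lines):
    """Collect command timeouts and router state switches from log lines."""
    timeouts, switches = [], []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts.append(
                {"seq": m.group(1), "host": int(m.group(2)), "after": int(m.group(3))}
            )
        m = SWITCH_RE.search(line)
        if m:
            switches.append(
                {"router": m.group(1), "id": int(m.group(2)),
                 "old": m.group(3), "new": m.group(4)}
            )
    return timeouts, switches
```

Passing an open log file to `summarize` yields two lists, which can then be grouped by host or router to see whether the timeouts cluster on one hypervisor.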

To me it seems that this timeout occurs while ACS management is waiting
for the API call to return. It is unclear to me at which stage the answer
is delayed: (management host <-> virtualization host) or
(virtualization host <-> VR). (SSH login from the virtualization host to
the VR via link-local works the whole time.)
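
One way to narrow down the slow stage would be to run each hop separately with its own timeout and compare the timings. This is a small Python sketch of that idea, assuming key-based root SSH access from the management host to the hypervisor. The helper names `run_stage` and `build_stages` are hypothetical, and the commands only mirror the path guessed above.

```python
import subprocess
import time


def run_stage(name, argv, timeout_s):
    """Run one command; return (name, outcome, elapsed seconds, stdout)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
        outcome = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
        stdout = proc.stdout.strip()
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception is raised.
        outcome, stdout = "timeout", ""
    return name, outcome, round(time.monotonic() - start, 2), stdout


def build_stages(hypervisor, vr_link_local_ip):
    """Return (name, argv) pairs; nothing is executed here."""
    ssh = ["ssh", "-o", "BatchMode=yes", f"root@{hypervisor}"]
    return [
        ("management -> hypervisor", ssh + ["true"]),
        ("hypervisor -> VR via router_proxy.sh",
         ssh + ["/opt/cloud/bin/router_proxy.sh", "checkrouter.sh",
                vr_link_local_ip]),
    ]
```

Looping over `build_stages("acs-compute-5", "169.254.1.178")` and calling `run_stage(name, argv, 60)` for each entry would show whether the first hop alone is already slow or only the call through the proxy script.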

It is also unclear to me why both VRs of the respective network stay in
UNKNOWN for 24 hours and remain accessible via link-local, but come back
immediately after a reboot.

I am grateful for any suggestions or explanations on this topic and will
investigate further as soon as the problem comes back.

A portion of our management log for the latest occurrence of the problem
is attached to this email.

Greetings,

Melanie


-- 
--

Heinlein Support GmbH
Linux: Akademie - Support - Hosting

http://www.heinlein-support.de
Tel: 030 / 40 50 51 - 0
Fax: 030 / 40 50 51 - 19

Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing Director: Peer Heinlein  -- Registered office: Berlin


Re: com.cloud.agent.api.CheckRouterCommand timeout

2018-06-21 Thread Daan Hoogland
Melanie, this depends a bit on the type of hypervisor. The command executes
the checkrouter.sh script on the virtual router if it reaches it, but it
seems your problem is before that. I would look at the network first and
follow the path that the execution takes for your hypervisor type.

On Wed, Jun 20, 2018 at 1:53 PM, Melanie Desaive <
m.desa...@heinlein-support.de> wrote:

> Hi all,
>
> we have a recurring problem with our virtual routers. Judging by the log
> messages, com.cloud.agent.api.CheckRouterCommand runs into a timeout
> and the router state therefore switches to UNKNOWN.
>
> All network traffic through the routers is still working. They can be
> accessed by their link-local IP addresses, and the configuration looks
> good at first sight. But configuration changes made through the
> CloudStack API no longer reach the routers. A reboot fixes the problem.
>
> I would like to investigate a little further, but I lack an understanding
> of how the checkRouter command tries to access the virtual router.
>
> Could someone point me to some relevant documentation or give a short
> overview of how the connection from CS-Management is made and where such
> a timeout could occur?
>
> As background information, the sequence from the management log looks
> roughly like this:
>
> ---
>
>  x Every few seconds, com.cloud.agent.api.CheckRouterCommand correctly
> returns the state BACKUP or MASTER
>  x When the problem occurs, the log messages change. Some snippets follow:
>
>  x ... Waiting some more time because this is the current command
>  x ... Waiting some more time because this is the current command
>  x Could not find exception:
> com.cloud.exception.OperationTimedoutException in error code list for
> exceptions
>  x Timed out on Seq 28-2352567855348137104
>  x Seq 28-2352567855348137104: Cancelling.
>  x Operation timed out: Commands 2352567855348137104 to Host 28 timed
> out after 60
>  x Unable to update router r-2594-VM's status
>  x Redundant virtual router (name: r-2594-VM, id: 2594)  just switch
> from MASTER to UNKNOWN
>
>  x These error messages are then repeated for each subsequent
> CheckRouterCommand until the virtual router is rebooted
>
>
> Greetings,
>
> Melanie



-- 
Daan