Colmux issues a collectl command over SSH. Once collectl has been started on
the remote server, nothing more is sent over that SSH session; all further
communication between colmux and the servers happens over the collectl port.
The SSH sessions are therefore effectively idle. If a server is configured
to disconnect idle SSH sessions after a pre-defined period, it drops
colmux's SSH session, and colmux then removes that server from the output.
Please note that this disconnection is not caused by collectl dying, or by
the server or the colmux client disappearing altogether because of a network
glitch or a reboot/crash. It is purely the result of an idle SSH session. We
can avoid this SSH connection timeout by changing either ClientAliveInterval
in the SSH daemon configuration on the server or ServerAliveInterval on the
SSH client. Of course, one may not want to change the SSH daemon setting on
every server being monitored, and doing so may even be impractical.

On the SSH client side (the colmux side), this setting can be changed in
any of the following places:
1. /etc/ssh/ssh_config  (please note: the ssh file, not the sshd file)
2. ~/.ssh/config
3. A command-line parameter
Again, we may not want to change this setting for every SSH connection
originating from the client on which colmux is running. So it might be
better to pass it as a command-line parameter and make it configurable,
either in a configuration file or via a colmux switch.
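
As a rough illustration of the command-line route, the sketch below
assembles an ssh command string with the keepalive option, in the same
spirit as the $Ssh variable in colmux. The helper name build_ssh_cmd and its
two arguments are hypothetical and not part of colmux; the 300-second
default is simply the value used elsewhere in this thread.

```shell
#!/bin/sh
# Sketch only: build_ssh_cmd is a hypothetical helper, not part of colmux.
# It assembles an ssh command line that carries a client-side keepalive.
build_ssh_cmd() {
    interval="${1:-300}"   # seconds between keepalive messages (assumed default)
    debug="${2:-0}"        # non-zero leaves ssh diagnostics visible
    cmd="/usr/bin/ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=${interval}"
    # Add -q unless debugging, as the colmux snippet in this thread does
    [ "$debug" -ne 0 ] || cmd="$cmd -q"
    printf '%s\n' "$cmd"
}

# Print the command that would be used with a 300-second keepalive
build_ssh_cmd 300
```

Passing the interval per invocation leaves /etc/ssh/ssh_config and
~/.ssh/config untouched, so other ssh connections from the same client are
unaffected.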

Regards,
Vishal Gupta
http://blog.vishalgupta.com


From:  Mark Seger <[email protected]>
Date:  Sunday, 3 March 2013 16:45
To:  Vishal Gupta <[email protected]>
Cc:  Collectl Interest <[email protected]>
Subject:  Re: [Collectl-interest] colmux duplicating nodes

interesting.  I wasn't aware of this switch.  But from the description
it sounds like this would take care of the situation where a remote
collectl goes away for over 5 minutes and I wasn't aware that can even
happen.  Are you saying it can and does?  Does this mean collectl
could go away for 4 minutes, time out and disconnect and this wouldn't
help that case?  OR is the network timeout value 5 minutes?  Just
trying to understand the exact mechanics of what is happening
-mark

On Sun, Mar 3, 2013 at 9:44 AM, Vishal Gupta <[email protected]> wrote:
>  Mark,
> 
>  Servers disappearing from colmux output on an Exadata cluster can be solved
>  by adding "-o ServerAliveInterval=300" to the colmux ssh command. This
>  ensures that a message is sent from the client (colmux) to the server (the
>  machine being monitored) every 300 seconds over the secure encrypted channel
>  (hence not spoofable), so that the ssh connection doesn't time out.
> 
>  I have tested the above by adding it to the ssh command variable. You may
>  want to include that in the colmux source code itself.
> 
>  my $Ssh='/usr/bin/ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=300 ';
>  $Ssh.=" -q"    unless $debug;
> 
> 
>  Vishal
> 
>  From: Vishal Gupta <[email protected]>
>  Date: Monday, 25 February 2013 11:49
>  To: Vishal Gupta <[email protected]>, Mark Seger <[email protected]>
>  Cc: Collectl Interest <[email protected]>
>  Subject: Re: [Collectl-interest] colmux duplicating nodes
> 
>  Mark,
> 
>  I think my servers disappearing might be due to SSH timeout.
> 
>  From: Vishal Gupta <[email protected]>
>  Date: Wednesday, 24 October 2012 21:42
>  To: Mark Seger <[email protected]>
>  Cc: Collectl Interest <[email protected]>
>  Subject: Re: [Collectl-interest] colmux duplicating nodes
> 
>  Mark,
> 
>  I don't think my servers disappearing from colmux is due to a network
>  glitch. On an Exadata, all the servers are connected via an internal Cisco
>  IP switch. There are also 3 dedicated InfiniBand switches. I have tried over
>  both the Cisco IP switch and the InfiniBand switch with --age=5 as well as
>  10, but my servers still disappear from the output after a few hours. Is
>  there anything I can do to debug this? What level of debug do you recommend?
> 
>  Regards,
>  Vishal Gupta
>  Blog |  LinkedIn | Twitter
> 
>  -----Original Message-----
>  From: Mark Seger <[email protected]>
>  Subject: Re: [Collectl-interest] colmux duplicating nodes
>  Date: 20 October 2012 12:19:08 BST
>  To: Vishal Gupta <[email protected]>
>  Cc: [email protected]
> 
>  On Fri, Oct 19, 2012 at 4:16 PM, Vishal Gupta <[email protected]>
>  wrote:
>> 
>>  I am using colmux on an Oracle Exadata Machine full rack with Linux hosts
>>  (OEL 5.7). If colmux is left running for a few hours, it starts showing
>>  duplicate lines for servers in the output.
> 
> 
>  are you using the latest version [3.2.0]?  I do remember seeing that in an
>  earlier version and I thought I fixed it.  I'm really hoping it's not still
>  there because it can be pretty painful to track down or even reproduce.  The
>  way colmux works is it asynchronously receives/stores data from each remote
>  host and at the same time fires a timer every monitoring interval.  Colmux
>  then displays the latest value it's seen for each entry.   Sounds simple
>  enough but it turned out the incoming data was occasionally overwriting the
>  data from the previous samples.  My solution was to double-buffer the data,
>  reading from one dataset while writing to a new one.  I'm just hoping I
>  don't need to dig back into it.
> 
>> 
>>  Also, I noticed that some of the hosts are automatically removed completely
>>  from the output. Is there some kind of timeout configured in colmux or
>>  collectl which might remove the server entries from the output over time?
> 
> 
>  unfortunately the way colmux works is if it doesn't hear from a remote
>  server in x-seconds (which you can set via --age) it drops it from the list
>  and doesn't try to reconnect.  as for the age, you don't want to make it too
>  long or else a server could disconnect and you'd never know it and keep
>  displaying stale data.  I suppose on a glitchy network you could end up
>  having to wait a little longer.  Maybe you could try upping it to 5 or 10
>  and see if that helps OR if the remote machine really did drop the link.
> 
>  you're not the first to ask about reconnecting when a host drops...
> 
>  -mark
> 
>> 
>>  Regards,
>>  Vishal Gupta
>>  Blog |  LinkedIn | Twitter
>> 
> 
> 



_______________________________________________
Collectl-interest mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/collectl-interest