If we set the default behaviour of colmux to keep the output running
forever, it may result in colmux running forever in a
leftover/forgotten ssh session (possibly even in a screen session). So
defaulting to letting the ssh server decide what to do with these sessions
is good. But if someone wants to deliberately keep colmux running forever,
like I do for watching the performance of my Oracle Exadata cluster
(14 storage cells, 8 compute nodes), then the ability to override the
default behaviour with --keepalive <sec> would be a good option. Collectl
and colmux come in very handy for watching the performance of many of our
Exadata full rack machines on a single vertical screen. It's a life saver,
along with colplot.
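
To illustrate the idea, here is a minimal Python sketch (not colmux's
actual Perl code) of how an optional keepalive value could be mapped onto
the ssh command line. The function name build_ssh_command, the keepalive
parameter, the host name cell01 and the remote command are all assumptions
made for illustration. Only the ssh options themselves come from this
thread.

```python
import shlex


def build_ssh_command(host, remote_cmd, keepalive=None, debug=False):
    """Return the ssh argument list for one monitored host.

    keepalive is a hypothetical stand-in for the proposed
    --keepalive <sec> switch. None (the default) adds nothing, so the
    ssh server's own idle-timeout policy decides when to disconnect.
    """
    argv = ["/usr/bin/ssh", "-o", "StrictHostKeyChecking=no"]
    if keepalive is not None:
        if keepalive <= 0:
            raise ValueError("keepalive must be a positive number of seconds")
        # Ask the ssh client to probe the server after this many idle seconds.
        argv += ["-o", f"ServerAliveInterval={int(keepalive)}"]
    if not debug:
        argv.append("-q")  # quiet unless debugging, as in the colmux snippet
    argv += [host, remote_cmd]
    return argv


if __name__ == "__main__":
    # cell01 and the remote command are placeholders, not real colmux values.
    print(shlex.join(build_ssh_command("cell01", "collectl", keepalive=300)))
```

With keepalive left unset, the command is unchanged from today's behaviour,
so a forgotten session would still be cleaned up by the server.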

Regards,
Vishal Gupta
http://blog.vishalgupta.com

From:  Mark Seger <[email protected]>
Date:  Sunday, 3 March 2013 17:55
To:  Vishal Gupta <[email protected]>
Cc:  Collectl Interest <[email protected]>
Subject:  Re: [Collectl-interest] colmux duplicating nodes

ahh, makes sense now.  I typically don't run colmux for long periods
and so that must be why I haven't seen that behavior before

I'm now wondering what the negatives of setting this as the default
behavior might be, as it seems like it'd be a good thing.  If it does
make more sense to not always set it I could always add something like
--keepalive

-mark

On Sun, Mar 3, 2013 at 12:25 PM, Vishal Gupta <[email protected]>
wrote:
>  Colmux issues a collectl command over SSH. After collectl is invoked on
>  the server/machine, there is no more communication over the SSH session. So
>  effectively these ssh sessions are idle, as there is no data/message/command
>  interchange between colmux and the server over the SSH channel. All the
>  communication happens over the collectl port between colmux and the servers.
>  So if your server is configured to disconnect idle SSH sessions after a
>  pre-defined idle duration, it disconnects colmux's ssh session to it, which
>  results in colmux removing those servers from the output. Please note the
>  disconnection was not due to collectl dying, or the server or colmux client
>  disappearing altogether through a network glitch or a reboot/crash. This
>  disconnection is purely because of the idle ssh session. We can avoid this
>  ssh connection timeout by changing either ClientAliveInterval on the ssh
>  daemon on the server or ServerAliveInterval on the ssh client. Of course
>  one may not want to change the ssh daemon setting on all the servers we
>  are trying to connect to. It would even be impractical to change this
>  setting on all the servers.
> 
>  On the SSH client side (colmux side) this setting can also be changed in
>  any of the following locations:
> 
>  /etc/ssh/ssh_config  (please note this is the ssh file, not the sshd one)
>  ~/.ssh/config
>  Command line parameter
> 
>  Again, we may not want to change this setting for all the ssh connections
>  originating from the client on which colmux is running. So it might be better
>  to pass this as a command line parameter and make it configurable via a
>  configuration file or a colmux switch.
> 
>  Regards,
>  Vishal Gupta
>  http://blog.vishalgupta.com
> 
> 
>  From: Mark Seger <[email protected]>
>  Date: Sunday, 3 March 2013 16:45
>  To: Vishal Gupta <[email protected]>
> 
>  Cc: Collectl Interest <[email protected]>
>  Subject: Re: [Collectl-interest] colmux duplicating nodes
> 
>  interesting.  I wasn't aware of this switch.  But from the description
>  it sounds like this would take care of the situation where a remote
>  collectl goes away for over 5 minutes and I wasn't aware that can even
>  happen.  Are you saying it can and does?  Does this mean collectl
>  could go away for 4 minutes, time out and disconnect and this wouldn't
>  help that case?  OR is the network timeout value 5 minutes?  Just
>  trying to understand the exact mechanics of what is happening
>  -mark
> 
>  On Sun, Mar 3, 2013 at 9:44 AM, Vishal Gupta <[email protected]> wrote:
> 
>  Mark,
> 
>  Servers disappearing from colmux output on the Exadata cluster can be solved
>  by adding "-o ServerAliveInterval=300" to the colmux ssh command. This ensures
>  that a message is sent from the client (colmux) to the server (machines being
>  monitored) after every 300 sec of inactivity, over the secure encrypted
>  channel (hence not spoofable), so that the ssh connection doesn't time out.
> 
>  I have tested the above by adding it to the ssh command variable. You may
>  want to include that in the colmux source code itself.
> 
>  my $Ssh='/usr/bin/ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=300 ';
>  $Ssh.=" -q"    unless $debug;
> 
> 
>  Vishal
> 
>  From: Vishal Gupta <[email protected]>
>  Date: Monday, 25 February 2013 11:49
>  To: Vishal Gupta <[email protected]>, Mark Seger <[email protected]>
>  Cc: Collectl Interest <[email protected]>
>  Subject: Re: [Collectl-interest] colmux duplicating nodes
> 
>  Mark,
> 
>  I think my servers disappearing might be due to SSH timeout.
> 
>  From: Vishal Gupta <[email protected]>
>  Date: Wednesday, 24 October 2012 21:42
>  To: Mark Seger <[email protected]>
>  Cc: Collectl Interest <[email protected]>
>  Subject: Re: [Collectl-interest] colmux duplicating nodes
> 
>  Mark,
> 
>  I don't think my servers disappearing from colmux is due to a network
>  glitch. On an Exadata, all the servers are connected via an internal Cisco IP
>  switch. There are also 3 dedicated InfiniBand switches. I have tried over
>  both the Cisco IP switch and the InfiniBand switches with --age=5 as well as
>  10. But my servers still disappear from the output after a few hours. Is
>  there anything I can do to debug this? What level of debug do you recommend?
> 
>  Regards,
>  Vishal Gupta
>  Blog |  LinkedIn | Twitter
> 
>  -----Original Message-----
>  From: Mark Seger <[email protected]>
>  Subject: Re: [Collectl-interest] colmux duplicating nodes
>  Date: 20 October 2012 12:19:08 BST
>  To: Vishal Gupta <[email protected]>
>  Cc: [email protected]
> 
>  On Fri, Oct 19, 2012 at 4:16 PM, Vishal Gupta <[email protected]>
>  wrote:
> 
> 
>  I am using colmux on an Oracle Exadata Machine full rack with Linux hosts
>  (OEL 5.7). If colmux is left running for a few hours it starts showing
>  duplicate lines for servers in the output.
> 
> 
> 
>  are you using the latest version [3.2.0]?  I do remember seeing that in an
>  earlier version and I thought I fixed it.  I'm really hoping it's not still
>  there because it can be pretty painful to track down or even reproduce.  The
>  way colmux works is it asynchronously receives/stores data from each remote
>  host and at the same time fires a timer every monitoring interval.  Colmux
>  then displays the latest value it's seen for each entry.   Sounds simple
>  enough but it turned out the incoming data was occasionally overwriting the
>  data from the previous samples.  My solution was to double-buffer the data,
>  reading from one dataset while writing to a new one.  I'm just hoping I
>  don't need to dig back into it.
> 
> 
>  Also I noticed that some of the hosts are automatically removed completely
>  from the output. Is there some kind of timeout configured in colmux or
>  collectl which might remove the server entries from the output over time?
> 
> 
> 
>  unfortunately the way colmux works is if it doesn't hear from a remote
>  server in x-seconds (which you can set via --age) it drops it from the list
>  and doesn't try to reconnect.  as for the age, you don't want to make it too
>  long or else a server could disconnect and you'd never know it and keep
>  displaying stale data.  I suppose on a glitchy network you could end up
>  having to wait a little longer.  Maybe you could try upping it to 5 or 10
>  and see if that helps OR if the remote machine really did drop the link.
> 
>  you're not the first to ask about reconnecting when a host drops...
> 
>  -mark
> 
> 
>  Regards,
>  Vishal Gupta
>  Blog |  LinkedIn | Twitter
> 
> 
> 
> 



_______________________________________________
Collectl-interest mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/collectl-interest