If we set the default behaviour of colmux to keep the output running
forever, colmux may end up running forever in a leftover or forgotten ssh
session (possibly even inside a screen session). So leaving it to the ssh
server to decide what to do with these sessions is a good default.
But some people deliberately want to keep colmux running forever, as I do
to watch the performance of my Oracle Exadata cluster (14 storage cells,
8 compute nodes). For them, the ability to override the default behaviour
with --keepalive <sec> would be a good option. Collectl and colmux come in
very handy for watching the performance of many of our Exadata full-rack
machines on a single vertical screen. Together with colplot, they are a
life saver.
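As a rough sketch only of how such a switch could be wired in: the
--keepalive option does not exist in colmux today, and the helper names
build_ssh_command and parse_keepalive are made up for illustration. Only
the ssh options themselves are real. With no --keepalive given, the ssh
command is left alone and the server's idle-timeout policy applies.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Build the ssh command prefix. $keepalive is the value of the hypothetical
# --keepalive switch in seconds; 0 or undef adds nothing, so the ssh client
# and server defaults decide what happens to idle sessions.
sub build_ssh_command {
    my ($keepalive, $debug) = @_;
    my $ssh = '/usr/bin/ssh -o StrictHostKeyChecking=no';
    $ssh .= " -o ServerAliveInterval=$keepalive" if $keepalive;
    $ssh .= ' -q' unless $debug;
    return $ssh;
}

# Parse the hypothetical switch from an argument list.
# This is an illustration, not colmux's real option parser.
sub parse_keepalive {
    my @args      = @_;
    my $keepalive = 0;
    GetOptionsFromArray(\@args, 'keepalive=i' => \$keepalive)
        or die "usage: --keepalive <sec>\n";
    die "--keepalive must be >= 0\n" if $keepalive < 0;
    return $keepalive;
}

print build_ssh_command(parse_keepalive(@ARGV), 0), "\n";
```

Run with "--keepalive 300", this prints
"/usr/bin/ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=300 -q".
Run with no arguments, it prints the command without the
ServerAliveInterval option.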
Regards,
Vishal Gupta
http://blog.vishalgupta.com
From: Mark Seger <[email protected]>
Date: Sunday, 3 March 2013 17:55
To: Vishal Gupta <[email protected]>
Cc: Collectl Interest <[email protected]>
Subject: Re: [Collectl-interest] colmux duplicating nodes
ahh, makes sense now. I typically don't run colmux for long periods
and so that must be why I haven't seen that behavior before
I'm now wondering what the negatives of setting this as the default
behavior might be as it seems like it'd be a good thing. If it does
make more sense to not always set it I could always add something like
--keepalive
-mark
On Sun, Mar 3, 2013 at 12:25 PM, Vishal Gupta <[email protected]>
wrote:
> Colmux issues a collectl command over SSH. After collectl is invoked on
> the server/machine, there is no more communication over the SSH session. So
> effectively these ssh sessions are idle, as there is no data/message/command
> interchange between colmux and the server over the SSH channel. All the
> communication happens over the collectl port between colmux and the servers.
> So if your server is configured to disconnect idle SSH sessions after a
> pre-defined idle duration, the server disconnects colmux's ssh session to
> it, and as a result colmux removes that server from the output. Please note
> that the disconnection was not due to collectl dying, or to the server or
> the colmux client disappearing altogether because of a network glitch or a
> reboot/crash. The disconnection is purely because of the idle ssh session.
> We can avoid this ssh connection timeout by changing either
> ClientAliveInterval on the ssh daemon on the server or ServerAliveInterval
> on the ssh client. Of course, one may not want to change the ssh daemon
> setting on all the servers we are trying to connect to. It would even be
> impractical to change this setting on all the servers.
>
> On the SSH client side (the colmux side), this setting can also be changed
> in any of the following locations:
>
> /etc/ssh/ssh_config (please note it's the ssh file, not the sshd one)
> ~/.ssh/config
> Command line parameter
>
> Again, we may not want to change this setting for all the ssh connections
> originating from the client on which colmux is running. So it might be
> better to pass this as a command line parameter and make it configurable,
> either in some configuration file or via a colmux switch.
>
> Regards,
> Vishal Gupta
> http://blog.vishalgupta.com
>
>
> From: Mark Seger <[email protected]>
> Date: Sunday, 3 March 2013 16:45
> To: Vishal Gupta <[email protected]>
>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> interesting. I wasn't aware of this switch. But from the description
> it sounds like this would take care of the situation where a remote
> collectl goes away for over 5 minutes and I wasn't aware that can even
> happen. Are you saying it can and does? Does this mean collectl
> could go away for 4 minutes, time out and disconnect and this wouldn't
> help that case? OR is the network timeout value 5 minutes? Just
> trying to understand the exact mechanics of what is happening
> -mark
>
> On Sun, Mar 3, 2013 at 9:44 AM, Vishal Gupta <[email protected]> wrote:
>
> Mark,
>
> Servers disappearing from colmux output on an Exadata cluster can be solved
> by adding "-o ServerAliveInterval=300" to the colmux ssh command. This
> ensures that a message is sent from the client (colmux) to the server (the
> machines being monitored) whenever no data has been received from the
> server for 300 sec. The message goes over the secure encrypted channel
> (hence not spoofable) and keeps the ssh connection from timing out.
>
> I have tested the above by adding the option to the ssh command variable.
> You may want to include that in the colmux source code itself.
>
> my $Ssh='/usr/bin/ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=300 ';
> $Ssh.=" -q" unless $debug;
>
>
> Vishal
>
> From: Vishal Gupta <[email protected]>
> Date: Monday, 25 February 2013 11:49
> To: Vishal Gupta <[email protected]>, Mark Seger <[email protected]>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> Mark,
>
> I think my servers disappearing might be due to SSH timeout.
>
> From: Vishal Gupta <[email protected]>
> Date: Wednesday, 24 October 2012 21:42
> To: Mark Seger <[email protected]>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> Mark,
>
> I don't think my servers disappearing from colmux is due to a network
> glitch. On an Exadata, all the servers are connected via an internal Cisco
> IP switch. There are also 3 dedicated InfiniBand switches. I have tried over
> both the Cisco IP switch and the InfiniBand switches with --age=5 as well as
> 10, but my servers still disappear from the output after a few hours. Is
> there anything I can do to debug this? What level of debug do you recommend?
>
> Regards,
> Vishal Gupta
> Blog | LinkedIn | Twitter
>
> -----Original Message-----
> From: Mark Seger <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
> Date: 20 October 2012 12:19:08 BST
> To: Vishal Gupta <[email protected]>
> Cc: [email protected]
>
> On Fri, Oct 19, 2012 at 4:16 PM, Vishal Gupta <[email protected]>
> wrote:
>
>
> I am using colmux on an Oracle Exadata Machine full rack with Linux hosts
> (OEL 5.7). If colmux is left running for a few hours, it starts showing
> duplicate lines for servers in the output.
>
>
>
> are you using the latest version [3.2.0]? I do remember seeing that in an
> earlier version and I thought I fixed it. I'm really hoping it's not still
> there because it can be pretty painful to track down or even reproduce. The
> way colmux works is it asynchronously receives/stores data from each remote
> host and at the same time fires a timer every monitoring interval. Colmux
> then displays the latest value it's seen for each entry. Sounds simple
> enough but it turned out the incoming data was occasionally overwriting the
> data from the previous samples. My solution was to double-buffer the data,
> reading from one dataset while writing to a new one. I'm just hoping I
> don't need to dig back into it.
>
>
> Also, I noticed that some of the hosts are automatically removed completely
> from the output. Is there some kind of timeout configured in colmux or
> collectl which might remove the server entries from the output over time?
>
>
>
> unfortunately the way colmux works is if it doesn't hear from a remote
> server in x-seconds (which you can set via --age) it drops it from the list
> and doesn't try to reconnect. as for the age, you don't want to make it too
> long or else a server could disconnect and you'd never know it and keep
> displaying stale data. I suppose on a glitchy network you could end up
> having to wait a little longer. Maybe you could try upping it to 5 or 10
> and see if that helps OR if the remote machine really did drop the link.
>
> you're not the first to ask about reconnecting when a host drops...
>
> -mark
>
>
> Regards,
> Vishal Gupta
> Blog | LinkedIn | Twitter
>
>
>
>
_______________________________________________
Collectl-interest mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/collectl-interest