Interesting, I wasn't aware of this switch. But from the description it sounds like it handles the situation where a remote collectl goes away for over 5 minutes, and I wasn't aware that can even happen. Are you saying it can and does? Does this mean collectl could go away for 4 minutes, time out and disconnect, and this wouldn't help in that case? Or is the network timeout value 5 minutes? Just trying to understand the exact mechanics of what is happening.

-mark

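A note on the mechanics asked about above. Per the OpenSSH client documentation, ServerAliveInterval is the number of seconds without any data from the server after which ssh sends a keepalive request through the encrypted channel, and ServerAliveCountMax (default 3) is how many such requests may go unanswered before ssh disconnects. The Python sketch below (colmux itself is Perl) only illustrates that arithmetic and how the options could be placed on the command line. The helper names and the host name node01 are hypothetical and are not part of colmux.

```python
def build_ssh_command(host, alive_interval=300, alive_count_max=3, quiet=True):
    """Return an ssh argument list with client-side keepalives enabled.

    Hypothetical helper for illustration; colmux builds its own command string.
    """
    cmd = [
        "/usr/bin/ssh",
        "-o", "StrictHostKeyChecking=no",
        # Send a keepalive after this many seconds without data from the server.
        "-o", f"ServerAliveInterval={alive_interval}",
        # Disconnect after this many consecutive unanswered keepalives.
        "-o", f"ServerAliveCountMax={alive_count_max}",
    ]
    if quiet:
        cmd.append("-q")
    cmd.append(host)
    return cmd


def seconds_until_disconnect(alive_interval, alive_count_max=3):
    """Approximate time before ssh gives up on a server that stopped answering."""
    return alive_interval * alive_count_max


if __name__ == "__main__":
    print(" ".join(build_ssh_command("node01")))
    print(seconds_until_disconnect(300))
```

With an interval of 300 and the default count of 3, an unresponsive server would be dropped after roughly 15 minutes, while a link that is merely idle keeps seeing traffic every 5 minutes. Whether that traffic is what prevents the disappearing hosts depends on what is closing the idle connections, which the thread does not establish.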
On Sun, Mar 3, 2013 at 9:44 AM, Vishal Gupta <[email protected]> wrote:
> Mark,
>
> Servers disappearing from colmux output on an Exadata cluster can be solved by
> adding "-o ServerAliveInterval=300" to the colmux ssh command. This will ensure
> that a message is sent from the client (colmux) to the server (machines being
> monitored) every 300 sec over the secure encrypted channel (hence not spoofable)
> so that the ssh connection doesn't time out.
>
> I have tested the above by adding it to the ssh command variable. You may want
> to include that in the colmux source code itself.
>
> my $Ssh='/usr/bin/ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=300 ';
> $Ssh.=" -q" unless $debug;
>
>
> Vishal
>
> From: Vishal Gupta <[email protected]>
> Date: Monday, 25 February 2013 11:49
> To: Vishal Gupta <[email protected]>, Mark Seger <[email protected]>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> Mark,
>
> I think my servers disappearing might be due to SSH timeout.
>
> From: Vishal Gupta <[email protected]>
> Date: Wednesday, 24 October 2012 21:42
> To: Mark Seger <[email protected]>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> Mark,
>
> I don't think my servers disappearing from colmux is due to a network
> glitch. On an Exadata, all the servers are connected via an internal Cisco IP
> switch. There are also 3 dedicated InfiniBand switches. I have tried over
> both the Cisco IP switch and the InfiniBand switch with --age=5 as well as 10.
> But my servers still disappear from the output after a few hours. Is there
> anything I can do to debug this? What level of debug do you recommend for
> debugging this?
>
> Regards,
> Vishal Gupta
> Blog | LinkedIn | Twitter
>
> -----Original Message-----
> > From: Mark Seger <[email protected]>
> > Subject: Re: [Collectl-interest] colmux duplicating nodes
> > Date: 20 October 2012 12:19:08 BST
> > To: Vishal Gupta <[email protected]>
> > Cc: [email protected]
> >
> >
>
>
> On Fri, Oct 19, 2012 at 4:16 PM, Vishal Gupta <[email protected]>
> wrote:
>>
>> I am using colmux on an Oracle Exadata Machine full rack with Linux hosts
>> (OEL 5.7). If colmux is left running for a few hours it starts showing
>> duplicate lines for servers in the output.
>
>
> are you using the latest version [3.2.0]? I do remember seeing that in an
> earlier version and I thought I fixed it. I'm really hoping it's not still
> there because it can be pretty painful to track down or even reproduce. The
> way colmux works is it asynchronously receives/stores data from each remote
> host and at the same time fires a timer every monitoring interval. Colmux
> then displays the latest value it's seen for each entry. Sounds simple
> enough, but it turned out the incoming data was occasionally overwriting the
> data from the previous samples. My solution was to double-buffer the data,
> reading from one dataset while writing to a new one. I'm just hoping I
> don't need to dig back into it.
>
>>
>> Also I noticed that some of the hosts are automatically and completely removed
>> from the output. Is there some kind of timeout configured in colmux or
>> collectl which might remove the server entries from the output over time?
>
>
> unfortunately the way colmux works is if it doesn't hear from a remote
> server in x seconds (which you can set via --age) it drops it from the list
> and doesn't try to reconnect. As for the age, you don't want to make it too
> long or else a server could disconnect and you'd never know it and keep
> displaying stale data. I suppose on a glitchy network you could end up
> having to wait a little longer.
> Maybe you could try upping it to 5 or 10
> and see if that helps OR if the remote machine really did drop the link.
>
> you're not the first to ask about reconnecting when a host drops...
>
> -mark
>
>>
>> Regards,
>> Vishal Gupta
>> Blog | LinkedIn | Twitter
>>
>
>

_______________________________________________
Collectl-interest mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/collectl-interest
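The display loop described in the thread (data arriving asynchronously, a timer firing every monitoring interval, the latest value shown per host, double-buffering, and hosts dropped after --age seconds of silence) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not colmux's actual Perl implementation: the class and method names are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
import threading


class HostTable:
    """Latest sample per host, double-buffered, with age-based expiry.

    Hypothetical sketch; names do not correspond to colmux internals.
    """

    def __init__(self, max_age):
        self.max_age = max_age      # seconds of silence tolerated (like --age)
        self._lock = threading.Lock()
        self._incoming = {}         # buffer written by the receiver
        self._display = {}          # buffer read by the display timer
        self._last_seen = {}        # host -> time of the most recent sample

    def receive(self, host, sample, now):
        """Called whenever data arrives from a remote host."""
        with self._lock:
            self._incoming[host] = sample
            self._last_seen[host] = now

    def tick(self, now):
        """Called once per monitoring interval; returns the rows to display."""
        with self._lock:
            # Drop hosts that have been silent for longer than max_age.
            stale = [h for h, t in self._last_seen.items()
                     if now - t > self.max_age]
            for host in stale:
                del self._last_seen[host]
                self._incoming.pop(host, None)
                self._display.pop(host, None)
            # Fold new samples into the display buffer, then start a fresh
            # incoming buffer so later writes cannot touch displayed data.
            self._display.update(self._incoming)
            self._incoming = {}
            # Return a copy so rendering can happen outside the lock.
            return dict(self._display)


if __name__ == "__main__":
    table = HostTable(max_age=5)
    table.receive("node01", {"cpu": 10}, now=0)
    table.receive("node02", {"cpu": 20}, now=0)
    print(table.tick(now=1))
    table.receive("node01", {"cpu": 15}, now=4)
    print(table.tick(now=6))
```

In this sketch a dropped host simply reappears if it sends data again. According to the thread, colmux does not try to reconnect, so a host whose ssh connection has closed stays gone.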
