Interesting.  I wasn't aware of this switch.  But from the description
it sounds like this would take care of the situation where a remote
collectl goes away for over 5 minutes, and I wasn't aware that can even
happen.  Are you saying it can and does?  Does this mean collectl
could go away for 4 minutes, time out and disconnect, and this wouldn't
help that case?  Or is the network timeout value 5 minutes?  Just
trying to understand the exact mechanics of what is happening.
-mark
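
For reference, a sketch of the mechanics being asked about, based on the
ssh_config(5) description of the two keepalive options. ServerAliveInterval
makes the client send a message through the encrypted channel when no data
has arrived from the server for that many seconds. ServerAliveCountMax
(default 3) is how many of those messages may go unanswered before ssh
disconnects. The helper names below are hypothetical, "node01" is a made-up
host, and this is a Python illustration rather than colmux's Perl code.

```python
import shlex


def build_ssh_command(host, alive_interval=300, alive_count_max=3, quiet=True):
    """Build an ssh argument list with keepalives (hypothetical helper)."""
    cmd = [
        "/usr/bin/ssh",
        "-o", "StrictHostKeyChecking=no",
        # Probe the server after this many seconds without data from it.
        "-o", f"ServerAliveInterval={alive_interval}",
        # Give up after this many probes go unanswered (ssh's default is 3).
        "-o", f"ServerAliveCountMax={alive_count_max}",
    ]
    if quiet:
        cmd.append("-q")
    cmd.append(host)
    return cmd


def seconds_until_disconnect(alive_interval, alive_count_max=3):
    """Approximate time before ssh drops a server that stops responding."""
    return alive_interval * alive_count_max


if __name__ == "__main__":
    print(shlex.join(build_ssh_command("node01")))
    print(seconds_until_disconnect(300))
```

Under these assumptions, ServerAliveInterval=300 with the default count
means an unresponsive server is dropped after roughly 900 seconds, while a
responsive but otherwise silent connection keeps seeing traffic.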

On Sun, Mar 3, 2013 at 9:44 AM, Vishal Gupta <[email protected]> wrote:
> Mark,
>
> Servers disappearing from colmux output on an Exadata cluster can be solved
> by adding "-o ServerAliveInterval=300" to the colmux ssh command. With this
> option the client (colmux) sends a message to the server (a machine being
> monitored) whenever 300 seconds pass without data from it. The message
> travels over the encrypted channel (hence not spoofable) and keeps the ssh
> connection from timing out.
>
> I have tested the above by adding the option to the ssh command variable.
> You may want to include that in the colmux source code itself.
>
> my $Ssh='/usr/bin/ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=300 ';
> $Ssh.=" -q"    unless $debug;
>
>
> Vishal
>
> From: Vishal Gupta <[email protected]>
> Date: Monday, 25 February 2013 11:49
> To: Vishal Gupta <[email protected]>, Mark Seger <[email protected]>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> Mark,
>
> I think my servers disappearing might be due to SSH timeout.
>
> From: Vishal Gupta <[email protected]>
> Date: Wednesday, 24 October 2012 21:42
> To: Mark Seger <[email protected]>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> Mark,
>
> I don't think my servers disappearing from colmux is due to a network
> glitch. On an Exadata, all the servers are connected via an internal Cisco IP
> switch. There are also 3 dedicated InfiniBand switches. I have tried over
> both the Cisco IP switch and the InfiniBand switch with --age=5 as well as
> 10, but my servers still disappear from the output after a few hours. Is
> there anything I can do to debug this? What level of debug do you recommend?
>
> Regards,
> Vishal Gupta
> Blog |  LinkedIn | Twitter
>
> -----Original Message-----
> From: Mark Seger <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
> Date: 20 October 2012 12:19:08 BST
> To: Vishal Gupta <[email protected]>
> Cc: [email protected]
>
> On Fri, Oct 19, 2012 at 4:16 PM, Vishal Gupta <[email protected]>
> wrote:
>>
>> I am using colmux on an Oracle Exadata Machine full rack with Linux hosts
>> (OEL 5.7). If colmux is left running for a few hours, it starts showing
>> duplicate lines for servers in the output.
>
>
> are you using the latest version [3.2.0]?  I do remember seeing that in an
> earlier version and I thought I fixed it.  I'm really hoping it's not still
> there because it can be pretty painful to track down or even reproduce.  The
> way colmux works is it asynchronously receives/stores data from each remote
> host and at the same time fires a timer every monitoring interval.  Colmux
> then displays the latest value it's seen for each entry.   Sounds simple
> enough, but it turned out the incoming data was occasionally overwriting the
> data from the previous samples.  My solution was to double-buffer the data,
> reading from one dataset while writing to a new one.  I'm just hoping I
> don't need to dig back into it.
>
>>
>> Also, I noticed that some of the hosts are automatically removed completely
>> from the output. Is there some kind of timeout configured in colmux or
>> collectl which might remove the server entries from the output over time?
>
>
> unfortunately the way colmux works is if it doesn't hear from a remote
> server in x-seconds (which you can set via --age) it drops it from the list
> and doesn't try to reconnect.  as for the age, you don't want to make it too
> long or else a server could disconnect and you'd never know it and keep
> displaying stale data.  I suppose on a glitchy network you could end up
> having to wait a little longer.  Maybe you could try upping it to 5 or 10
> and see if that helps OR if the remote machine really did drop the link.
>
> you're not the first to ask about reconnecting when a host drops...
>
> -mark
>
>>
>> Regards,
>> Vishal Gupta
>> Blog |  LinkedIn | Twitter
>>
>
>
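The double-buffering fix described in the thread (read from one dataset
while writing to a new one) can be sketched roughly as follows. This is a
Python illustration under stated assumptions, not colmux's Perl code: the
class and method names are hypothetical, and carrying the last values
forward into the new write buffer is my assumption about how "the latest
value seen for each entry" stays visible.

```python
class DoubleBuffer:
    """Keep a stable snapshot for display while new samples arrive."""

    def __init__(self):
        self._front = {}  # snapshot the display reads from
        self._back = {}   # samples arriving during the current interval

    def store(self, host, sample):
        # Incoming data only ever touches the back buffer.
        self._back[host] = sample

    def swap(self):
        # On each timer tick, publish the back buffer as the new snapshot
        # and start a fresh copy so later writes cannot alter the snapshot.
        self._front = self._back
        self._back = dict(self._front)
        return self._front


if __name__ == "__main__":
    buf = DoubleBuffer()
    buf.store("node01", 10)
    snapshot = buf.swap()
    buf.store("node01", 20)   # arrives while the snapshot is displayed
    print(snapshot)           # still the value published at the tick
    print(buf.swap())         # next tick shows the newer sample
```

The point of the copy in swap() is that a sample arriving mid-display
lands in a different dictionary from the one being printed.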

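The --age behaviour described in the thread (a host not heard from in x
seconds is dropped from the list) can be sketched like this. It is a Python
illustration, not colmux's actual code: HostTable, record and expire are
hypothetical names, the clock is passed in explicitly to keep the example
deterministic, and whether colmux compares with ">" or ">=" is not stated
in the thread.

```python
class HostTable:
    """Track when each host last reported and drop the stale ones."""

    def __init__(self, age):
        self.age = age          # seconds of silence tolerated per host
        self.last_seen = {}     # host -> time of its most recent sample

    def record(self, host, now):
        self.last_seen[host] = now

    def expire(self, now):
        # Collect hosts that have been silent for longer than `age`, then
        # remove them. Nothing here attempts to reconnect.
        stale = [h for h, t in self.last_seen.items() if now - t > self.age]
        for host in stale:
            del self.last_seen[host]
        return stale


if __name__ == "__main__":
    table = HostTable(age=5)
    table.record("node01", now=0)
    table.record("node02", now=0)
    table.record("node01", now=4)   # node01 keeps reporting
    print(table.expire(now=6))      # node02 has been silent for 6 seconds
    print(sorted(table.last_seen))
```

A larger age tolerates longer gaps but, as noted above, also means stale
data stays on screen longer before a dead host is noticed.
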
_______________________________________________
Collectl-interest mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/collectl-interest