Mark, 

I don't think my servers disappearing from colmux is due to a network glitch. 
On an Exadata, all the servers are connected via an internal Cisco IP switch. 
There are also three dedicated InfiniBand switches. I have tried over both the 
Cisco IP switch and the InfiniBand switches, with --age=5 as well as --age=10, 
but my servers still disappear from the output after a few hours. Is there 
anything I can do to debug this? What level of debug do you recommend?

Regards,
Vishal Gupta

-----Original Message-----
From: Mark Seger <[email protected]>
Subject: Re: [Collectl-interest] colmux duplicating nodes
Date: 20 October 2012 12:19:08 BST
To: Vishal Gupta <[email protected]>
Cc: [email protected]




On Fri, Oct 19, 2012 at 4:16 PM, Vishal Gupta <[email protected]> wrote:
I am using colmux on an Oracle Exadata Machine full rack with Linux hosts (OEL 
5.7). If colmux is left running for a few hours, it starts showing duplicate 
lines for servers in the output.

are you using the latest version [3.2.0]?  I do remember seeing that in an 
earlier version and I thought I fixed it.  I'm really hoping it's not still 
there because it can be pretty painful to track down or even reproduce.  The 
way colmux works is it asynchronously receives/stores data from each remote 
host and at the same time fires a timer every monitoring interval.  Colmux then 
displays the latest value it's seen for each entry.  Sounds simple enough but it 
turned out the incoming data was occasionally overwriting the data from the 
previous samples.  My solution was to double-buffer the data, reading from one 
dataset while writing to a new one.  I'm just hoping I don't need to dig back 
into it.
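
A minimal sketch of the double-buffering idea described above, written in 
Python rather than colmux's own Perl. The class and method names (DoubleBuffer, 
store, swap) and the host names are hypothetical, and this is not colmux's 
actual code. It also simplifies one thing: a host that sends nothing during an 
interval is simply absent from the next snapshot.

```python
import threading


class DoubleBuffer:
    """Hypothetical illustration only; not colmux's real data structure."""

    def __init__(self):
        self._lock = threading.Lock()
        self._incoming = {}   # written by receivers as samples arrive
        self._display = {}    # read by the timer when it prints

    def store(self, host, sample):
        # Called whenever a remote host delivers a sample.
        with self._lock:
            self._incoming[host] = sample

    def swap(self):
        # Called once per monitoring interval by the timer. The filled
        # buffer becomes the display snapshot and a fresh dict takes over
        # for writing, so late arrivals cannot overwrite what is printed.
        with self._lock:
            self._display, self._incoming = self._incoming, {}
        return self._display


buf = DoubleBuffer()
buf.store("db01", {"cpu": 12})
buf.store("db02", {"cpu": 40})
snapshot = buf.swap()            # what the timer would display
buf.store("db01", {"cpu": 99})   # arrives while the snapshot is in use
next_snapshot = buf.swap()
```

Here the sample that arrives after the swap lands in the new write buffer and 
only shows up in the following interval's snapshot.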
 
Also, I noticed that some of the hosts are automatically and completely removed 
from the output. Is there some kind of timeout configured in colmux or collectl 
which might remove the server entries from the output over time?

unfortunately the way colmux works is if it doesn't hear from a remote server 
in x-seconds (which you can set via --age) it drops it from the list and 
doesn't try to reconnect.  as for the age, you don't want to make it too long 
or else a server could disconnect and you'd never know it, and colmux would 
keep displaying stale data.  I suppose on a glitchy network you could end up 
having to wait a little longer.  Maybe you could try upping it to 5 or 10 and 
see if that helps OR if the remote machine really did drop the link.
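
The age check described above can be sketched roughly as follows, in Python for 
illustration only. The function name prune_stale, the timestamps, and the host 
names are hypothetical, and whether colmux treats a host exactly at the age 
limit as live or stale is an assumption here.

```python
def prune_stale(last_seen, now, age):
    """Split hosts into live and dropped, given last-heard timestamps.

    last_seen maps host name to the time (in seconds) of its last sample.
    A host is kept if it was heard from within `age` seconds of `now`
    (the inclusive boundary is an assumption, not taken from colmux).
    """
    live = {h: t for h, t in last_seen.items() if now - t <= age}
    dropped = sorted(set(last_seen) - set(live))
    return live, dropped


last_seen = {"db01": 100.0, "db02": 93.0, "db03": 88.0}

live5, dropped5 = prune_stale(last_seen, now=100.0, age=5)
live10, dropped10 = prune_stale(last_seen, now=100.0, age=10)
```

With these made-up timestamps, a larger age keeps a slow host in the list 
longer, at the cost of showing older data for it. A host that has been dropped 
stays dropped, since there is no reconnect.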

you're not the first to ask about reconnecting when a host drops...

-mark
 
Regards,
Vishal Gupta



_______________________________________________
Collectl-interest mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/collectl-interest
