On Fri, Jul 11, 2008 at 01:56:07PM -0600, Brad Nicholes wrote:
> >>> On 7/11/2008 at 1:42 PM, in message <[EMAIL PROTECTED]>, Carlo
> Marcelo Arenas Belon <[EMAIL PROTECTED]> wrote:
> > 
> > there is also the possibility of just adding the "contrib" into this first
> > release and using instead that (which should be safe enough) and has been
> > already voted for backport (but for the next release).
> > 
> > feel free to commit that then and base disabling this metric / documentation
> > on the contrib directory which should satisfy all raised concerns.
>
> I think I would rather just release it as is and fix all of this in
> the next version.  This issue really isn't that big of a deal.

well considering how much debate has been done about this with no other
work committed and proposed for vote this might be the only option left.

another issue with releasing it AS-IS is that the implementation is buggy, it
does the external interaction with netstat in a non standard/problematic way 
(so it will run in python 2.3), but that is not explained/documented anywhere
and since this is the first release will be most likely used as a template
AS-IS for future modules which will just then inherit its bugs.

> Especially since it is Friday and Bernard is ready to roll.

Well, Bernard proposed to fix this on Wednesday in a private list and you
approved even if no patch was ever produced.  So I was really hoping that
wouldn't be the case.

Indeed in those 2 days I'd worked for improving the contrib inclusion patches
just in case they will be needed for whatever solution you were planning to
implement.

Also in those 2 days, I'd worked on debugging and improving this module as
well but I definitely agree that would be too destabilizing this late of the
game.

I don't really use the python modules in my setup (most of them won't run
or even get installed in any of the platforms I use daily and that I was
focusing my testing on), and really didn't knew it was a known issue since
no one had reported it before Ulf did (which was a surprise to me knowing
how much testing and stabilization work has been done for this release)

I was of course waiting for that proposed fix to materialize but will hack
one out if that is really needed.

> >> >> It still works reliably, it just has a wait timeout issue that is really
> >> >> only noticeable when using the -m parameter.
> >> > 
> >> > but that would result in some metric samples failing silently and 
> >> > therefore
> >> > in some wholes in the RRD values that could then result in mysterious 
> >> > drops
> >> > in the graphs or flat lines.
> >> 
> >> No and the reason why is because the actual gathering of the data is
> >> threaded.  tcpconn.py spins up its own gathering thread that periodically
> >> exec's netstat and updates an internal array of metrics.
> >> When the gmond main thread requests the metrics, all it does is read the
> >> internal array and return whatever the last gathered value was.
> > 
> > Ok, but then that spawning netstat thread will randomly fail, an so
> > depending on the frequency it fails compared with the polling gmond does
> > you will get flat lines.
> 
> No, There aren't flat lines.  A value is always being returned

If the same value is always being returned because the other thread failed to
spawn netstat and get fresh values then the line is flat.

> and I have never seen the netstat thread fail in normal use.  The only
> reason why a failure appears with the -m is because of the metric_clean()
> function was called and there was a race condition.  I have been running
> this code for months now and have never seen any kind of failure other
> than the -m parameter case.  

That is encouraging, and hopefully means that whenever the final fix is
committed it will resolve all issues left.

> >> The gmond -m option causes the metric_init(), which starts the gathering
> >> thread and the metric_cleanup() which shuts down the gathering thread,
> >> to happen one immediately after the other.  Gmond has to delay waiting
> >> for the thread cleanup.
> > 
> > And this is IMHO a bug, but a fix for it is not something that will be ready
> > to release anytime soon as spelled in the STATUS file.
> > 
> > It would be better if the metric_init() doesn't initialize the "spawning
> > netstat thread" but leave that to the collection method that is scheduled
> > by gmond and who would just need to do the first sample and initialize
> > that thread the first time it is called.
> > 
> > This way the metric_cleanup() method won't need to wait either for the 
> > `gmond
> > -m` case which shouldn't execute any metric collection code in principle.
> 
> right, I agree this is a bug and the solution you describe is exactly
> what should be done for the next version.  But this bug doesn't prevent
> tcpconn.py module from functioning normally and providing good metric data.
> At the very worst, it is a -m annoyance and occasionally a shutdown delay.
> IMO, the severity of the bug is not enough to pull it from the release.
> Besides the fact that this is a python module.  If a user really required
> a fix before the next release, updating this module is no more than a file
> copy.  No rebuilding of anything is required.

OK, your previous comments seemed to imply otherwise.

I already think I made my points clear. Your call, but the idea of bug
delaying through procrastination doesn't seem that appealing.

Carlo

-------------------------------------------------------------------------
Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW!
Studies have shown that voting for your favorite open source project,
along with a healthy diet, reduces your potential for chronic lameness
and boredom. Vote Now at http://www.sourceforge.net/community/cca08
_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers

Reply via email to