On Mon, Apr 08, 2002 at 05:10:14PM -0700, matt massie wrote:
> asaph-
> 
> > I think the problem still exists even with this fix.
> 
> did you test that this problem still exists and if so can you give me the 
> platform and error?  please make it clearer when you post to the list 
> whether the bug you are reporting is real or theoretical.

At this point the problem is "theoretical" in gmond - I did not
encounter it in practice when using ganglia.

I have encountered instances of this style of problem
in other programs, so I know it is potentially real.

> 
> even if this is a theoretical bug it is important.  to be absolutely sure
> there are no problems in the future, i've declared two separate barrier
> pointers and i don't ever free the data they point to.  this increases the
> memory footprint of gmond by 96 bytes but i think that we can live with
> that.
> 

When you say "two barriers" I take it you are referring to the code
in gmond.c lines 107-121. If so, I think you'd be fine with 
changing those to:

 107       barrier_init(&b1, args_info.mcast_threads_arg );
 108       for ( i = 0 ; i < args_info.mcast_threads_arg; i++ )
 109          {
 110             pthread_create(&tid, &attr, mcast_listen_thread, (void *)b1);
 111          }
 >>>>> REMOVED 112       barrier_destroy(b); 
 113       debug_msg("listening thread(s) have been started");
 114
 115       /* threads to answer requests for XML */
 116       barrier_init(&b2, args_info.xml_threads_arg);
 117       for ( i=0 ; i < args_info.xml_threads_arg; i++ )
 118          {
 119             pthread_create(&tid, &attr, server_thread, (void *)b2);
 120          }
 >>>>> REMOVED 121       barrier_destroy(b);

i.e. using two different barriers which you never free. In this
case the master need not participate in the barrier.
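
In case it helps, here is a minimal self-contained sketch of that pattern.
It uses POSIX pthread barriers rather than gmond's own barrier type, and
the thread counts and function bodies are placeholders, so treat it as an
illustration of the idea rather than a drop-in patch:

   #include <pthread.h>
   #include <unistd.h>

   #define MCAST_THREADS 2
   #define XML_THREADS   2

   /* Static storage, never destroyed, so no thread can ever touch
      freed barrier memory. */
   static pthread_barrier_t mcast_barrier;
   static pthread_barrier_t xml_barrier;

   static void *mcast_listen_thread(void *arg)
   {
      pthread_barrier_t *b = arg;
      /* ... set up the multicast socket here ... */
      pthread_barrier_wait(b);   /* rendezvous with the other mcast threads only */
      /* ... listen loop ... */
      return NULL;
   }

   static void *server_thread(void *arg)
   {
      pthread_barrier_t *b = arg;
      /* ... set up the XML socket here ... */
      pthread_barrier_wait(b);
      /* ... answer requests ... */
      return NULL;
   }

   int main(void)
   {
      pthread_t tid;
      int i;

      /* Only the worker threads participate; the main thread never
         waits on the barriers and never destroys them. */
      pthread_barrier_init(&mcast_barrier, NULL, MCAST_THREADS);
      for (i = 0; i < MCAST_THREADS; i++)
         pthread_create(&tid, NULL, mcast_listen_thread, &mcast_barrier);

      pthread_barrier_init(&xml_barrier, NULL, XML_THREADS);
      for (i = 0; i < XML_THREADS; i++)
         pthread_create(&tid, NULL, server_thread, &xml_barrier);

      pause();   /* the real main thread would do its own work here */
      return 0;
   }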


         Thanks!
         Asaph



On Mon, Apr 08, 2002 at 05:10:14PM -0700, matt massie wrote:
> asaph-
> 
> > I think the problem still exists even with this fix.
> 
> did you test that this problem still exists and if so can you give me the 
> platform and error?  please make it clearer when you post to the list 
> whether the bug you are reporting is real or theoretical.
> thanks for your technical expertise
> -matt
> 
> > You don't know the order by which threads leave the barrier,
> > so you might still be calling barrier_destroy() while there
> > are threads accessing b.
> > 
> > In general this kind of scheme:
> > 
> >    thread1:
> >        b = allocate_barrier();
> >        spawn_threads(b);
> >        wait_barrier(b);
> >        free(b);
> > 
> > 
> >    threads 2..N:
> >        wait_barrier(b);
> > 
> > can't work because you have threads 1..N all accessing the 
> > data structure pointed to by b simultaneously, and you have
> > no control over which one will exit wait_barrier() first.
> > If it happens to be thread1, then it will free() b while
> > other threads are still reading the data pointed to by b.
> > 
> > If you REALLY want to solve this, I think you'd need two
> > barriers:
> > 
> > 
> >    thread1:
> >        b2 = static_barrier;
> >        b1 = allocate_barrier();
> >        spawn_threads(b1,b2);
> >        wait_barrier(b1);
> >        wait_barrier(b2);
> >        free(b1);
> >        // b2 is never freed
> > 
> > 
> >    threads 2..N:
> >        wait_barrier(b1);
> >        wait_barrier(b2);
> >  
> > Of course, this is only interesting if you can't make do with
> > static barriers alone. If you are in a situation where you
> > absolutely must allocate and free the memory held by the barriers,
> > I don't know of another safe way to do this.
> > 
> > 
> > On Mon, Apr 08, 2002 at 03:48:41PM -0700, matt massie wrote:
> > > mike-
> > > 
> > > you can blame me for the problem you were having.  i didn't code the 
> > > barriers correctly in gmond.  the machines i tested gmond on before i 
> > > released it didn't display the problem so i released it with this bug...
> > > 
> > > if you look at line 108 of gmond you'll see i initialize a barrier and 
> > > then pass it to the mcast_threads that i spin off. directly afterwards i 
> > > run a barrier_destroy().  bad.
> > > 
> > > if the main gmond runs the barrier_destroy() BEFORE all the mcast_threads 
> > > can run a barrier_barrier() then you will have a problem.  the mcast 
> > > threads will be operating on freed memory... otherwise... everything is 
> > > peachy.
> > > 
> > > the fix was just to increase the barrier count by one and place a 
> > > barrier_barrier() just before the barrier_destroy() to force the main 
> > > thread to wait until all the mcast threads are started.
> > > 
> > > thanks so much for the feedback.
> > > 
> > > also, i added the --no_setuid and --setuid flags in order to give you 
> > > more 
> > > debugging power.  i know you were having trouble creating a core file 
> > > because gmond sets the uid to the uid of "nobody".  you can prevent gmond 
> > > from starting up as nobody with the "--no_setuid" flag.
> > > 
> > > good luck!  and please let me know if i didn't solve your problem!
> > > -matt
> > > 
> > > Saturday, Mike Snitzer wrote forth saying...
> > > 
> > > > gmond segfaults 50% of the time at startup.  The random nature of it
> > > > suggests to me that there is a race condition when the gmond threads
> > > > start up.  When I tried to strace or run gmond through gdb the problem
> > > > wasn't apparent... which is what led me to believe it's a threading
> > > > problem that strace or gdb masks.
> > > > 
> > > > Any recommendations for accurately debugging gmond would be great, because
> > > > when running it through strace and gdb I can't get it to segfault.
> > > > 
> > > > FYI, I'm running gmond v2.2.2 on 48 nodes; on 16 of those nodes gmond
> > > > segfaulted at startup...
> > > > 
> > > > Mike
> > > > 
> > > > ps.
> > > > here's an example:
> > > > `which gmond` --debug_level=1 -i eth0
> > > > 
> > > > mcast_listen_thread() received metric data cpu_speed
> > > > mcast_value() mcasting cpu_user value
> > > > 2051 pre_process_node() remote_ip=192.168.0.28encoded 8 XDR
> > > > bytespre_process_node() has saved the hostname
> > > > pre_process_node() has set the timestamp
> > > > pre_process_node() received a new node
> > > > 
> > > > 
> > > > XDR data successfully sent
> > > > set_metric_value() got metric key 11
> > > > set_metric_value() exec'd cpu_nice_func (11)
> > > > Segmentation fault
> > > > 
> > > > 
> > 
> 
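
For anyone reading the archives, the pattern discussed in the quoted
messages above boils down to something like the sketch below.  The
barrier here is a toy built from a mutex and condition variable (not
gmond's actual barrier code, and the names are made up); the comments
mark where the "wait on the barrier, then free it" scheme goes wrong:

   #include <pthread.h>
   #include <stdlib.h>
   #include <unistd.h>

   typedef struct {
      pthread_mutex_t lock;
      pthread_cond_t  cv;
      int             needed;    /* threads that must arrive */
      int             arrived;   /* threads that have arrived so far */
   } toy_barrier;

   static toy_barrier *toy_barrier_new(int needed)
   {
      toy_barrier *b = malloc(sizeof *b);
      pthread_mutex_init(&b->lock, NULL);
      pthread_cond_init(&b->cv, NULL);
      b->needed  = needed;
      b->arrived = 0;
      return b;
   }

   static void toy_barrier_wait(toy_barrier *b)
   {
      pthread_mutex_lock(&b->lock);
      if (++b->arrived == b->needed)
         pthread_cond_broadcast(&b->cv);
      else
         while (b->arrived < b->needed)
            pthread_cond_wait(&b->cv, &b->lock);
      pthread_mutex_unlock(&b->lock);   /* still touching *b on the way out */
   }

   static void toy_barrier_destroy(toy_barrier *b)
   {
      pthread_mutex_destroy(&b->lock);
      pthread_cond_destroy(&b->cv);
      free(b);
   }

   #define WORKERS 4

   static void *worker(void *arg)
   {
      toy_barrier_wait(arg);    /* may still be in here when main destroys b */
      /* ... worker loop ... */
      return NULL;
   }

   int main(void)
   {
      /* WORKERS + 1 so the main thread waits too (the "barrier count
         plus one" fix).  That guarantees every worker has ARRIVED at
         the barrier before main continues, but not that every worker
         has LEFT toy_barrier_wait(). */
      toy_barrier *b = toy_barrier_new(WORKERS + 1);
      pthread_t tid;
      int i;

      for (i = 0; i < WORKERS; i++)
         pthread_create(&tid, NULL, worker, b);

      toy_barrier_wait(b);
      toy_barrier_destroy(b);   /* RACE: a worker may still be inside
                                   toy_barrier_wait(), unlocking b->lock,
                                   when this frees the memory */

      pause();
      return 0;
   }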
