On Mon, Apr 08, 2002 at 05:10:14PM -0700, matt massie wrote:
asaph-
I think the problem still exists even with this fix.
did you test that this problem still exists and if so can you give me the
platform and error? please make it clearer when you post to the list
whether the bug you are reporting is real or theoretical.
At this point the problem is theoretical in gmond - I did not
encounter it in practice when using ganglia.
I have encountered instances of this style of problem
in other programs so I know it is potentially real.
even if this is a theoretical bug it is important. to be absolutely sure
there are no problems in the future, i've declared two separate barrier
pointers and i don't ever free the data they point to. this increases the
memory footprint of gmond by 96 bytes but i think that we can live with
that.
When you say two barriers I take it you are referring to the code
in gmond.c lines 107-121. If so, I think you'd be fine with
changing those to:
107 barrier_init(&b1, args_info.mcast_threads_arg);
108 for ( i = 0 ; i < args_info.mcast_threads_arg; i++ )
109 {
110    pthread_create(&tid, &attr, mcast_listen_thread, (void *)b1);
111 }
REMOVED 112 barrier_destroy(b);
113 debug_msg("listening thread(s) have been started");
114
115 /* threads to answer requests for XML */
116 barrier_init(&b2, args_info.xml_threads_arg);
117 for ( i = 0 ; i < args_info.xml_threads_arg; i++ )
118 {
119    pthread_create(&tid, &attr, server_thread, (void *)b2);
120 }
REMOVED 121 barrier_destroy(b);
i.e. using two different barriers which you never free. In this
case the master need not participate in the barrier.
Thanks!
Asaph
On Mon, Apr 08, 2002 at 05:10:14PM -0700, matt massie wrote:
asaph-
I think the problem still exists even with this fix.
did you test that this problem still exists and if so can you give me the
platform and error? please make it clearer when you post to the list
whether the bug you are reporting is real or theoretical.
thanks for your technical expertise
-matt
You don't know the order in which threads leave the barrier,
so you might still be calling barrier_destroy() while there
are threads accessing b.
In general this kind of scheme:
thread1:
    b = allocate_barrier();
    spawn_threads(b);
    wait_barrier(b);
    free(b);
threadN:
    wait_barrier(b);
can't work because you have threads 1..N all accessing the
data structure pointed to by b simultaneously, and you have
no control over which one will exit wait_barrier() first.
If it happens to be thread1, then it will free() b while
other threads are still reading the data pointed to by b.
If you REALLY want to solve this, I think you'd need two
barriers:
thread1:
    b2 = static_barrier;
    b1 = allocate_barrier();
    spawn_threads(b1,b2);
    wait_barrier(b1);
    wait_barrier(b2);
    free(b1);
    // b2 is never freed
threadN:
    wait_barrier(b1);
    wait_barrier(b2);
Of course, this is only interesting if you can't make do with
only static barriers. If you are in a situation where you
absolutely must allocate and free the memory held by the barriers,
I don't know of another safe way to do this.
On Mon, Apr 08, 2002 at 03:48:41PM -0700, matt massie wrote:
mike-
you can blame me for the problem you were having. i didn't code the
barriers correctly in gmond. the machines i tested gmond on before i
released it didn't display the problem so i released it with this bug...
if you look at line 108 of gmond you'll see i initialize a barrier and
then pass it to the mcast_threads that i spin off. directly afterwards i
run a barrier_destroy(). bad.
if the main gmond runs the barrier_destroy() BEFORE all the mcast_threads
can run a barrier_barrier() then you will have a problem. the mcast
threads will be operating on freed memory... otherwise.. everything is
peachy.
the fix was just to increase the barrier count by one and place a
barrier_barrier() just before the barrier_destroy() to force the main
thread to wait until all the mcast threads are started.
thanks so much for the feedback.
also, i added the --no_setuid and --setuid flags in order to give you
more debugging power. i know you were having trouble creating a core file
because gmond sets the uid to the uid of nobody. you can prevent gmond
from starting up as nobody with the --no_setuid flag.
good luck! and please let me know if i didn't solve your problem!
-matt
Saturday, Mike Snitzer wrote forth saying...
gmond segfaults 50% of the time at startup. The random nature of it
suggests to me that there is a