Re: [Ganglia-general] gmond 2.2.2 seg faults.
On Mon, Apr 08, 2002 at 05:10:14PM -0700, matt massie wrote:
> asaph-
>
> > I think the problem still exists even with this fix.
>
> did you test that this problem still exists and if so can you give me the
> platform and error? please make it clearer when you post to the list
> whether the bug you are reporting is real or theoretical.

At this point the problem is "theoretical" in gmond - I did not encounter it
in practice when using ganglia. I have encountered instances of this style
of problem in other programs, so I know it is potentially real.

> even if this is a theoretical bug it is important. to be absolutely sure
> there are no problems in the future, i've declared two separate barrier
> pointers and i don't ever free the data they point to. this increases the
> memory footprint of gmond by 96 bytes but i think that we can live with
> that.

When you say "two barriers" I take it you are referring to the code in
gmond.c lines 107-121. If so, I think you'd be fine with changing those to:

107   barrier_init(&b1, args_info.mcast_threads_arg );
108   for ( i = 0 ; i < args_info.mcast_threads_arg; i++ )
109      {
110         pthread_create(&tid, &attr, mcast_listen_thread, (void *)b1);
111      }
112   [REMOVED: barrier_destroy(b);]
113   debug_msg("listening thread(s) have been started");
114
115   /* threads to answer requests for XML */
116   barrier_init(&b2, args_info.xml_threads_arg);
117   for ( i=0 ; i < args_info.xml_threads_arg; i++ )
118      {
119         pthread_create(&tid, &attr, server_thread, (void *)b2);
120      }
121   [REMOVED: barrier_destroy(b);]

i.e. using two different barriers which you never free. In this case the
master need not participate in the barrier.

Thanks!
Asaph
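The two-barrier handshake discussed in this thread can be modeled with Python's `threading.Barrier` (a sketch only; the real gmond code is C/pthreads, and the names `demo`, `worker`, and the thread count here are illustrative). Each barrier has count workers + 1 because the master participates, matching Asaph's pseudocode:

```python
import threading

NUM_WORKERS = 4

def demo():
    # b1 plays the role of the heap-allocated barrier, b2 the static one
    # that is never freed.
    b1 = threading.Barrier(NUM_WORKERS + 1)
    b2 = threading.Barrier(NUM_WORKERS + 1)
    started = []
    lock = threading.Lock()

    def worker():
        with lock:
            started.append(threading.get_ident())
        b1.wait()  # wait_barrier(b1): tell the master we are running
        b2.wait()  # wait_barrier(b2): no thread touches b1 after this point

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    b1.wait()      # master: every worker has started
    b2.wait()      # master: every worker has fully *exited* b1's wait
    del b1         # analogous to free(b1) -- now provably safe
    for t in threads:
        t.join()
    return len(started)

print(demo())  # 4
```

The reason this is safe where the single-barrier version is not: a worker can only reach `b2.wait()` after completely returning from `b1.wait()`, so when the master's `b2.wait()` returns, no thread can still be touching `b1`.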
Re: [Ganglia-general] gmond 2.2.2 seg faults.
asaph-

> I think the problem still exists even with this fix.

did you test that this problem still exists and if so can you give me the
platform and error? please make it clearer when you post to the list
whether the bug you are reporting is real or theoretical.

even if this is a theoretical bug it is important. to be absolutely sure
there are no problems in the future, i've declared two separate barrier
pointers and i don't ever free the data they point to. this increases the
memory footprint of gmond by 96 bytes but i think that we can live with
that.

thanks for your technical expertise
-matt

> You don't know the order by which threads leave the barrier,
> so you might still be calling barrier_destroy() while there
> are threads accessing b.
>
> In general this kind of scheme:
>
>    thread1:
>       b = allocate_barrier();
>       spawn_threads(b);
>       wait_barrier(b);
>       free(b);
>
>    threads 2..N:
>       wait_barrier(b);
>
> can't work because you have threads 1..N all accessing the
> data structure pointed to by b simultaneously, and you have
> no control over which one will exit wait_barrier() first.

___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general

Sponsored by http://www.ThinkGeek.com/
[Ganglia-general] Re: gmond 2.2.2 seg faults.
matt, asaph.

In testing 2.2.3 it appears that matt's fix did do the trick. If there is
still a potential for prematurely free()ing the barrier, I can't get it to
happen...

great job matt!

Mike

Asaph Zemach ([EMAIL PROTECTED]) said:
> I think the problem still exists even with this fix.
> You don't know the order by which threads leave the barrier,
> so you might still be calling barrier_destroy() while there
> are threads accessing b.
Re: [Ganglia-general] gmond 2.2.2 seg faults.
I think the problem still exists even with this fix.

You don't know the order by which threads leave the barrier, so you might
still be calling barrier_destroy() while there are threads accessing b.

In general this kind of scheme:

   thread1:
      b = allocate_barrier();
      spawn_threads(b);
      wait_barrier(b);
      free(b);

   threads 2..N:
      wait_barrier(b);

can't work because you have threads 1..N all accessing the data structure
pointed to by b simultaneously, and you have no control over which one will
exit wait_barrier() first. If it happens to be thread1, then it will free()
b while other threads are still reading the data pointed to by b.

If you REALLY want to solve this, I think you'd need two barriers:

   thread1:
      b2 = static_barrier;
      b1 = allocate_barrier();
      spawn_threads(b1,b2);
      wait_barrier(b1);
      wait_barrier(b2);
      free(b1);
      // b2 is never freed

   threads 2..N:
      wait_barrier(b1);
      wait_barrier(b2);

Of course, this is only interesting if you can't make do with just having
only static barriers. If you are in a situation where you absolutely must
allocate and free the memory held by the barriers, I don't know of another
safe way to do this.

On Mon, Apr 08, 2002 at 03:48:41PM -0700, matt massie wrote:
> mike-
>
> you can blame me for the problem you were having. i didn't code the
> barriers correctly in gmond.
Re: [Ganglia-general] gmond 2.2.2 seg faults.
mike-

you can blame me for the problem you were having. i didn't code the
barriers correctly in gmond. the machines i tested gmond on before i
released it didn't display the problem so i released it with this bug...

if you look at line 108 of gmond you'll see i initialize a barrier and
then pass it to the mcast_threads that i spin off. directly afterwards i
run a barrier_destroy(). bad.

if the main gmond runs the barrier_destroy() BEFORE all the mcast_threads
can run a barrier_barrier() then you will have a problem. the mcast
threads will be operating on freed memory... otherwise.. everything is
peachy.

the fix was just to increase the barrier count by one and place a
barrier_barrier() just before the barrier_destroy() to force the main
thread to wait until all the mcast threads are started.

thanks so much for the feedback.

also, i added the --no_setuid and --setuid flags in order to give you more
debugging power. i know you were having trouble creating a core file
because gmond sets the uid to the uid of "nobody". you can prevent gmond
from starting up as nobody with the "--no_setuid" flag.

good luck! and please let me know if i didn't solve your problem!
-matt

Saturday, Mike Snitzer wrote forth saying...

> gmond segfaults 50% of the time at startup. The random nature of it
> suggests to me that there is a race condition when the gmond threads
> start up. When I tried to strace or run gmond through gdb the problem
> wasn't apparent, which is what led me to believe it's a threading problem
> that strace or gdb masks.
>
> Any recommendations for accurately debugging gmond would be great, because
> when running through strace and gdb I can't get it to segfault.
>
> FYI, I'm running gmond v2.2.2 on 48 nodes; 16 of the nodes' gmond
> segfaulted at startup...
>
> Mike
>
> ps. here's an example:
>
> `which gmond` --debug_level=1 -i eth0
>
> mcast_listen_thread() received metric data cpu_speed
> mcast_value() mcasting cpu_user value 2051
> pre_process_node() remote_ip=192.168.0.28
> encoded 8 XDR bytes
> pre_process_node() has saved the hostname
> pre_process_node() has set the timestamp
> pre_process_node() received a new node
>
> XDR data successfully sent
> set_metric_value() got metric key 11
> set_metric_value() exec'd cpu_nice_func (11)
> Segmentation fault
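matt's fix (barrier count = mcast threads + 1, with the main thread calling barrier_barrier() before barrier_destroy()) can be modeled with Python's `threading.Barrier`. This is a sketch, not gmond's code, and Python's garbage collection hides the narrow destroy-versus-exit race that Asaph describes elsewhere in this thread:

```python
import threading

def start_workers(num_workers):
    # Barrier count is workers + 1: the main thread participates too,
    # mirroring the increased count in matt's fix.
    barrier = threading.Barrier(num_workers + 1)
    started = []
    lock = threading.Lock()

    def mcast_listen_thread():
        with lock:
            started.append(1)
        barrier.wait()          # like barrier_barrier(b)

    threads = [threading.Thread(target=mcast_listen_thread)
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    barrier.wait()              # main blocks until every worker checked in
    # gmond calls barrier_destroy(b) at this point; in C a worker may
    # still be inside its wait-exit path, which is the residual race
    for t in threads:
        t.join()
    return len(started)

print(start_workers(8))  # 8
```

This guarantees the main thread cannot tear the barrier down before every worker has started, which is exactly why the startup segfault rate dropped to zero in Mike's 2.2.3 test.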
Re: [Ganglia-general] linux monitor implementation
Today, Asaph Zemach wrote forth saying...

> I've been looking over the gmond sources, and I was wondering
> why you saw the need to create the three threads:
>    proc_stat_thr
>    proc_loadavg_thr
>    proc_meminfo_thr
> Was there some problem in having the monitor thread perform
> these actions?
>
> I couldn't find an answer in the documentation. Sorry if
> I missed it.

Asaph-

Ganglia is built to be easily portable to other architectures. If I were to
lock it into being simply a Linux tool, then having the
proc_stat_thr/proc_loadavg_thr/proc_meminfo_thr threads would not be
necessary, as I could merge them into the monitor thread. However, I wanted
the monitor thread to work at a more abstract, machine-independent level.
The monitor thread shouldn't care about the specifics of how the metrics
are collected; it is only responsible for keeping track of value and time
thresholds and making sure new data gets multicast.

When I was building the machine-specific file for Linux, I spun off
proc_stat_thr, proc_loadavg_thr, and proc_meminfo_thr to reduce the number
of times that the /proc/stat, /proc/loadavg and /proc/meminfo files were
opened. Many of the metric functions read the same file, and I didn't want
each one opening and closing the same file. For example, load_one,
load_five, load_fifteen, proc_total and proc_run all come from the
/proc/loadavg file. To prevent those functions from opening/closing the
/proc/loadavg file each time they were called, they read the file from
memory, which is updated by the proc_loadavg_thr as necessary.

I hope this makes sense. If not, feel free to email back.

-matt
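The shared-read design described above — one refresher thread keeps the latest /proc/loadavg contents in memory, and several metric functions parse from that cache — can be sketched like this. The `parse_loadavg` helper is hypothetical (not gmond's actual code), but the /proc/loadavg line format it parses is the real one:

```python
def parse_loadavg(line):
    """Split one /proc/loadavg line into the five metrics derived from it.

    /proc/loadavg format: "load1 load5 load15 running/total last_pid",
    e.g. "0.20 0.18 0.12 1/80 11206".
    """
    fields = line.split()
    load_one, load_five, load_fifteen = (float(f) for f in fields[:3])
    running, total = fields[3].split("/")
    return {
        "load_one": load_one,
        "load_five": load_five,
        "load_fifteen": load_fifteen,
        "proc_run": int(running),    # currently runnable processes
        "proc_total": int(total),    # total processes on the system
    }

# A refresher thread (proc_loadavg_thr's role) would periodically do
#     with open("/proc/loadavg") as f: cache = f.read()
# and load_one(), proc_run(), etc. would call parse_loadavg(cache)
# instead of each reopening the file.
sample = "0.20 0.18 0.12 1/80 11206"
print(parse_loadavg(sample)["proc_total"])  # 80
```

One file read thus serves five metric callbacks, which is the point of the dedicated /proc threads.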
Re: [Ganglia-general] Ganglia Problem
Today, D'Onofrio Florindo wrote forth saying...

> Dear Sir,
>
> I'm Florindo D'Onofrio, a student of Computer Science at the Unisannio
> University of Benevento (Italy). I have found your e-mail in the Ganglia
> Cluster Toolkit v2.1.2, Monitoring Core Documentation. My teacher has
> assigned me the task of installing ganglia on a cluster, which is
> composed of 8 computers connected to a central computer.

I'm curious. Is installing ganglia part of a class assignment? Does
everyone in your class install it or are certain people assigned different
tasks?

> I have tried to install the toolkit following, step by step, the
> documentation, but I ran into this problem:
>
>    Cannot continue because of these errors:
>    Invalid element in GANGLIA_DOC XML
>
> How can I resolve this problem?

You are installing an old version of the client, v1.0.3 (I think). Version
1.0.4 does not have that problem. You can download it from...

http://ganglia.sourceforge.net/

Let me know if the update does not solve your problem (but I'm pretty
certain that it will).

> I have installed gmond on the server and the relative clients, and the
> Ganglia PHP/RRD Web Client only on the server. In regard to this I want
> to ask you 2 questions:
>
> 1. Does gmond have to be installed on a server and on each node in the
>    cluster?

Yes, you need to install gmond on every node of your cluster, including the
web server.

NOTE: If you don't want the web server machine to show up in the list of
machines, then start gmond with the "--mute" flag. This will make the gmond
on the web server listen to and process traffic without multicasting its
own information.

> 2. Does the Ganglia PHP/RRD Web Client have to be installed on a server
>    and on each node in the cluster?

The Ganglia PHP/RRD Web Client only needs to be installed on your web
server and nowhere else.

> Have you got more detailed documentation? Could you send it to me?

All the documentation for ganglia is at

http://ganglia.sourceforge.net/docs/

Good luck!
-matt
[Ganglia-general] linux monitor implementation
Hi,

I've been looking over the gmond sources, and I was wondering why you saw
the need to create the three threads:

   proc_stat_thr
   proc_loadavg_thr
   proc_meminfo_thr

Was there some problem in having the monitor thread perform these actions?

I couldn't find an answer in the documentation. Sorry if I missed it.

Thanks,
Asaph
[Ganglia-general] Ganglia Problem
Dear Sir,

I'm Florindo D'Onofrio, a student of Computer Science at the Unisannio
University of Benevento (Italy). I have found your e-mail in the Ganglia
Cluster Toolkit v2.1.2, Monitoring Core Documentation. My teacher has
assigned me the task of installing ganglia on a cluster, which is composed
of 8 computers connected to a central computer. I have tried to install the
toolkit following, step by step, the documentation, but I ran into this
problem:

   Cannot continue because of these errors:
   Invalid element in GANGLIA_DOC XML

How can I resolve this problem?

I have installed gmond on the server and the relative clients, and the
Ganglia PHP/RRD Web Client only on the server. The installation of the
ganglia toolkit is explained in the second section of the documentation: in
particular, Ganglia Monitoring Daemon (gmond) Installation in section 2.1
and Installation of the Ganglia PHP/RRD Web Client in section 2.2. In
regard to this I want to ask you 2 questions:

1. Does gmond have to be installed on a server and on each node in the
   cluster?

2. Does the Ganglia PHP/RRD Web Client have to be installed on a server and
   on each node in the cluster?

Have you got more detailed documentation? Could you send it to me?

I hope you answer me. Thanks.

Florindo D'Onofrio