[Ganglia-general] Ganglia Problem

2002-04-08 Thread D'Onofrio Florindo
 Dear Sir, 
I'm Florindo D'Onofrio, a Computer Science student at Unisannio University in Benevento (Italy). I found your e-mail address in the Ganglia Cluster Toolkit v2.1.2 Monitoring Core Documentation. My teacher has assigned me the task of installing ganglia on a cluster composed of 8 computers connected to a central computer. I have tried to install the toolkit by following the documentation step by step, but I ran into this problem: 

Cannot continue because of these errors:

Invalid element in GANGLIA_DOC XML   

 How can I resolve this problem?

I have installed gmond on the server and on the client nodes, and the Ganglia PHP/RRD 
Web Client only on the server. The installation of the ganglia toolkit is 
explained in the second section of the documentation: in particular, Ganglia 
Monitoring Daemon (gmond) Installation in section 2.1 and Installation of 
the Ganglia PHP/RRD Web Client in section 2.2. Regarding this, I want to 
ask you two questions: 

1.  Does gmond have to be installed on a server and on each node in the 
cluster?

2.  Does Ganglia PHP/RRD Web Client have to be installed on a server and on 
each node in the cluster?

 Do you have more detailed documentation? Could you send it to me?

I hope to hear from you. Thank you. 



   Florindo D'Onofrio

 

 




[Ganglia-general] linux monitor implementation

2002-04-08 Thread Asaph Zemach
Hi,

  I've been looking over the gmond sources, and I was wondering
why you saw the need to create the three threads:
proc_stat_thr
proc_loadavg_thr
proc_meminfo_thr
Was there some problem in having the monitor thread perform
these actions?

I couldn't find an answer in the documentation. Sorry if
I missed it.

Thanks,
Asaph
 



Re: [Ganglia-general] Ganglia Problem

2002-04-08 Thread matt massie
Today, D'Onofrio Florindo wrote forth saying...

  Dear Sir, 

 I'm Florindo D'Onofrio. I'm a student of Computer Science of the
 Benevento's Unisannio University (Italy). I have found your e-mail in
 the Ganglia Cluster Toolkit v2.1.2, Monitoring Core Documentation. My
 teacher has fixed me the task to install ganglia on a cluster, which
 composed by 8 computers connected to a central computer. 

I'm curious.  Is installing ganglia part of a class assignment?  Does 
everyone in your class install it or are certain people assigned different 
tasks?

 I have tried
 to install the toolkit following, step by step, the documentation, but
 there has been this problem:
 
 Cannot continue because of these errors:
 
 Invalid element in GANGLIA_DOC XML   
 
  How can I resolve this problem?

You are installing an old version of the client v1.0.3 (I think).  Version 
1.0.4 does not have that problem.  You can download it from...

http://ganglia.sourceforge.net/

Let me know if the update does not solve your problem (but I'm pretty 
certain that it will).

 I have installed gmond on the server and relative clients, and Ganglia
 PHP/RRD Web Client only on the server. The installation of the toolkit
 ganglia is explained in the second section of the documentation: in
 particular Ganglia Monitoring Daemon (gmond) Installation in the
 section 2.1 and Installation of the Ganglia PHP/RRD Web Client in the
 section 2.2. In regard to this I want to ask you 2 questions:
 
 1.  Does gmond have to be installed on a server and on each node in the 
 cluster?

Yes, you need to install gmond on every node on your cluster including the 
web server.  

NOTE: If you don't want the web server machine to show up in the 
list of machines, then start gmond with the --mute flag.  This will make 
the gmond on the web server listen to and process traffic without 
multicasting its own information.

 2.  Does Ganglia PHP/RRD Web Client have to be installed on a server
 and on each node in the cluster?

The ganglia PHP/RRD Web Client only needs to be installed on your web 
server and nowhere else.

 Have you got a more detailed documentation? Could you send me it?

All the documentation for ganglia is at
http://ganglia.sourceforge.net/docs/

Good luck!
-matt




Re: [Ganglia-general] gmond 2.2.2 seg faults.

2002-04-08 Thread Asaph Zemach
On Mon, Apr 08, 2002 at 05:10:14PM -0700, matt massie wrote:
 asaph-
 
  I think the problem still exists even with this fix.
 
 did you test that this problem still exists and if so can you give me the 
 platform and error?  please make it clearer when you post to the list 
 whether the bug you are reporting is real or theoretical.

At this point the problem is theoretical in gmond - I did not
encounter it in practice when using ganglia.

I have encountered instances of this style of problem 
in other programs so I know it is potentially real.

 
 even if this is a theoretical bug it is important.  to be absolutely sure
 there are no problems in the future, i've declared two separate barrier
 pointers and i don't ever free the data they point to.  this increases the
 memory footprint of gmond by 96 bytes but i think that we can live with
 that.
 

When you say two barriers I take it you are referring to the code
in gmond.c lines 107-121. If so, I think you'd be fine with 
changing those to:

 107   barrier_init(&b1, args_info.mcast_threads_arg );
 108   for ( i = 0 ; i < args_info.mcast_threads_arg; i++ )
 109      {
 110         pthread_create(&tid, &attr, mcast_listen_thread, (void *)b1);
 111      }
  REMOVED 112   barrier_destroy(b); 
 113   debug_msg("listening thread(s) have been started");
 114
 115   /* threads to answer requests for XML */
 116   barrier_init(&b2, args_info.xml_threads_arg);
 117   for ( i=0 ; i < args_info.xml_threads_arg; i++ )
 118      {
 119         pthread_create(&tid, &attr, server_thread, (void *)b2);
 120      }
  REMOVED 121   barrier_destroy(b);

i.e. using two different barriers which you never free. In this
case the master need not participate in the barrier.


 Thanks!
 Asaph



On Mon, Apr 08, 2002 at 05:10:14PM -0700, matt massie wrote:
 asaph-
 
  I think the problem still exists even with this fix.
 
 did you test that this problem still exists and if so can you give me the 
 platform and error?  please make it clearer when you post to the list 
 whether the bug you are reporting is real or theoretical.
 thanks for your technical expertise
 -matt
 
  You don't know the order by which threads leave the barrier,
  so you might still be calling barrier_destroy() while there
  are threads accessing b.
  
  In general this kind of scheme:
  
 thread1:
 b = allocate_barrier();
 spawn_threads(b);
 wait_barrier(b);
 free(b);
  
  
  threads 2..N:
 wait_barrier(b);
  
  can't work because you have threads 1..N all accessing the 
  data structure pointed to by b simultaneously, and you have
  no control over which one will exit wait_barrier() first.
  If it happens to be thread1, then it will free() b while
  other threads are still reading the data pointed to by b.
  
  If you REALLY want to solve this, I think you'd need two
  barriers:
  
  
 thread1:
 b2 = static_barrier;
 b1 = allocate_barrier();
 spawn_threads(b1,b2);
 wait_barrier(b1);
 wait_barrier(b2);
 free(b1);
 // b2 is never freed
  
  
  threads 2..N:
 wait_barrier(b1);
 wait_barrier(b2);
   
  Of course, this is only interesting if you can't make do with just
  having only static barriers. If you are in a situation that you
  absolutely must allocate and free the memory held by the barriers
  I don't know of another safe way to do this.
  
  
  On Mon, Apr 08, 2002 at 03:48:41PM -0700, matt massie wrote:
   mike-
   
   you can blame me for the problem you were having.  i didn't code the 
   barriers correctly in gmond.  the machines i tested gmond on before i 
   released it didn't display the problem so i released it with this bug...
   
   if you look at line 108 of gmond you'll see i initialize a barrier and 
   then pass it to the mcast_threads that i spin off. directly afterwards i 
   run a barrier_destroy().  bad.
   
   if the main gmond runs the barrier_destroy() BEFORE all the mcast_threads 
   can run a barrier_barrier() then you will have a problem.  the mcast 
    threads will be operating on freed memory... otherwise.. everything is 
   peachy.
   
   the fix was just to increase the barrier count by one and place a 
   barrier_barrier() just before the barrier_destroy() to force the main 
   thread to wait until all the mcast threads are started.
   
   thanks so much for the feedback.
   
   also, i added the --no_setuid and --setuid flags in order to give you 
   more 
   debugging power.  i know you were having trouble creating a core file 
   because gmond sets the uid to the uid of nobody.  you can prevent gmond 
   from starting up as nobody with the --no_setuid flag.
   
   good luck!  and please let me know if i didn't solve your problem!
   -matt
   
   Saturday, Mike Snitzer wrote forth saying...
   
gmond segfaults 50% of the time at startup.  The random nature of it
suggests to me that there is a