On Mon, 2004-06-21 at 12:36, steven wagner wrote:
> A few other things exploded so I've only just had the chance to check this 
> out.  Info:
> 
> I'm not using a config file at this time.  This is a Redhat 7.1 uniprocessor 
> P4 but I get the same results on a dual-proc Opteron running the same 
> bastardized RH7.1-derivative.
> 
> When built from the 2.5.6 tarball, the monitoring core works outside of debug 
> mode.

do you mean that if you turn on debugging, gmond doesn't work
anymore?

> When built from *this* 2.6.0 tarball, not so much ... works in debug though:
> 
>   644383 Jun  3 12:55 ganglia-2.6.0.tar.gz

and when you build the previous snapshot, does gmond ONLY work in debug
mode?

> gdb the happy elf provides this traceback on the gmond threads which *are* 
> created:
> 
> #0  0x40084b85 in __sigsuspend (set=0xbffff200)
>      at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> #1  0x401ad1c9 in __pthread_wait_for_restart_signal (self=0x401b5f40)
>      at pthread.c:969
> #2  0x401ad29c in __pthread_create_2_1 (thread=0x807be94, attr=0x0,
>      start_routine=0x80539f8 <schedule_thread>, arg=0x807be88) at restart.h:34
> #3  0x08053291 in tpool_init (tpoolp=0xbffff3a4, num_worker_threads=1,
>      max_queue_size=128, do_not_block_when_full=1) at tpool.c:100
> #4  0x08053338 in ganglia_thread_pool_create (num_worker_threads=1,
>      max_queue_size=128, do_not_block_when_full=1) at tpool.c:122
> #5  0x0804bc4b in main (argc=1, argv=0xbffff4c4) at gmond.c:254
> 
> Line 254 is:
> 
>          receive_pool = ganglia_thread_pool_create( 
> gmond_config.num_receive_channels, 128, 1 );

what version of glibc are you running on your boxes?
% rpm -qi glibc

i found that the pthread (LinuxThreads) implementation on linux is a
nightmare.  sometimes you'll find the thread stuff in glibc, other times
you'll find it in the kernel.  you can force it to use the older pthread
libraries by doing a...

% setenv LD_ASSUME_KERNEL 2.2.5        (csh/tcsh)
$ export LD_ASSUME_KERNEL=2.2.5        (sh/bash)

before you start gmond.
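
to see which thread library a process actually ends up with (with or
without LD_ASSUME_KERNEL set), something like the following sketch may
help.  it is not part of gmond; the function names are made up for
illustration, and it assumes a glibc that supports the
_CS_GNU_LIBPTHREAD_VERSION confstr() query (older releases may not, in
which case the helper just reports failure).

```c
#include <stdio.h>
#include <unistd.h>
#include <gnu/libc-version.h>

/* Hypothetical helper: copy the name of the thread library that glibc
 * reports into buf.  Returns 0 on success, -1 if the query is not
 * supported or the buffer is too small. */
int thread_library_name(char *buf, size_t len)
{
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);
    return (n == 0 || n > len) ? -1 : 0;
#else
    (void)buf;
    (void)len;
    return -1;
#endif
}

/* Hypothetical helper: print the glibc version and thread library. */
void print_thread_info(void)
{
    char buf[128];

    printf("glibc version:  %s\n", gnu_get_libc_version());
    if (thread_library_name(buf, sizeof buf) == 0)
        printf("thread library: %s\n", buf);
    else
        printf("thread library: unknown (query not supported)\n");
}
```

calling print_thread_info() from a small main() on the affected box,
once normally and once with LD_ASSUME_KERNEL set, should show whether
the setting changes anything.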

also, are you compiling gmond on the host it is being run on?

i think the problem is the way that signals are passed around in
threaded programs... older libraries used USR1 and USR2.. the newer
libraries use "real-time" signals.
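
as a rough way to see how the signal numbers are laid out on a given
box, here is a small sketch (not gmond code, helper names are made up).
it relies on SIGRTMIN being evaluated at run time by glibc, so the
range it reports already excludes any real-time signals the thread
library has reserved for itself.

```c
#include <signal.h>
#include <stdio.h>

/* Hypothetical helper: number of real-time signals left for the
 * application after the C library has reserved its own. */
int app_rt_signal_count(void)
{
    return SIGRTMAX - SIGRTMIN + 1;
}

/* Hypothetical helper: print the classic user signals next to the
 * real-time range visible to the application. */
void print_signal_layout(void)
{
    printf("SIGUSR1=%d SIGUSR2=%d\n", SIGUSR1, SIGUSR2);
    printf("SIGRTMIN=%d SIGRTMAX=%d (%d usable)\n",
           SIGRTMIN, SIGRTMAX, app_rt_signal_count());
}
```

comparing the output on a working and a broken machine might show
whether the two are using different signals for thread restarts.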

we may have to remove the thread pool code altogether and just have a
thread per channel (or put in a ./configure flag to override the pools
on broken machines).
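
a minimal sketch of the thread-per-channel idea, assuming a
hypothetical channel_t and handler (none of these names come from the
gmond source, and a real channel thread would loop on its socket
instead of returning right away):

```c
#include <pthread.h>

/* Hypothetical stand-in for a receive channel. */
typedef struct {
    int id;       /* channel number */
    int handled;  /* set to 1 once the thread has run */
} channel_t;

/* Thread body: one dedicated thread services exactly one channel. */
static void *channel_thread(void *arg)
{
    channel_t *ch = arg;
    ch->handled = 1;   /* placeholder for the real receive loop */
    return NULL;
}

/* Start one thread per channel, with no pool or work queue in between.
 * Returns the number of threads actually started. */
int start_channel_threads(channel_t *channels, pthread_t *threads, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        if (pthread_create(&threads[i], NULL, channel_thread,
                           &channels[i]) != 0)
            return i;  /* stop at the first failure */
    }
    return n;
}

/* Wait for the first n started threads to finish. */
void join_channel_threads(pthread_t *threads, int n)
{
    int i;

    for (i = 0; i < n; i++)
        pthread_join(threads[i], NULL);
}
```

this drops the queue and scheduler thread entirely, at the cost of one
thread per channel even when a channel is idle.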

your message is timely.  i was going to send an email out today and try
to get feedback on 2.6.0.  so .. it looks like we have a trusted_hosts
IPv4 <=> IPv6 problem and a thread pool problem.  any others?  

-matt 

-- 
PGP fingerprint 'A7C2 3C2F 8445 AD3C 135E F40B 242A 5984 ACBC 91D3'

   They that can give up essential liberty to obtain a little 
      temporary safety deserve neither liberty nor safety. 
  --Benjamin Franklin, Historical Review of Pennsylvania, 1759
