Re: [OMPI devel] Uninitialized ORTE epoch values

2011-08-05 Thread Ralph Castain
Thanks for the explanation. It kinda begs a question, though - I've noticed 
that the assignment of epoch seems to circle around in a number of places. We 
call the ess_base function to get_epoch, and then we assign an epoch. But the 
base function doesn't actually seem to do much, if anything.

It's somewhat confusing and difficult to trace. I know Wes and I already 
planned to clean up some of this once we get back to the orte state machine 
work, but I'm hoping we can simplify this code somewhat to make it easier to 
understand and follow.

Meantime, we'll continue to chase down the problems.

On Aug 5, 2011, at 4:17 PM, Thomas Herault wrote:

> 
> The warnings issued through ess_base_select.c:46 are annoying but harmless. 
> Wesley is going to hunt them and remove them, but they are really issued 
> because of the print:
> orte_ess_base_proc_get_epoch (ess_base_select.c:46) calls 
> ORTE_NAME_PRINT(proc), which prints proc->epoch, before proc->epoch is 
> assigned to the local computed value epoch. This assignment is done in the 
> level just above orte_ess_base_proc_get_epoch: 
> orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) 
> says proc->epoch = orte_ess_base_proc_get_epoch(proc);
> 
> Wesley is going to find where this proc was created to ensure that its epoch 
> field is initialized to INVALID_EPOCH, but what this trace says is really 
> that nothing references it before it is initialized to its correct value.
> 
> Thomas
> 
> On Aug 5, 2011, at 4:52 PM, Ralph Castain wrote:
> 
>> Thanks Wes - it isn't the print that's the issue, it's the fact that we have 
>> epochs that aren't being initialized, and whatever other problems that may 
>> be causing.
>> 
>> 
>> On Aug 5, 2011, at 2:45 PM, Wesley Bland wrote:
>> 
>>> I don't think these are anything to worry about since they're all print 
>>> statements, but I will work on these tonight.
>>> 
>>> On Fri, Aug 5, 2011 at 3:03 PM, Jeff Squyres  wrote:
>>> Ralph and I are trying to track down the mysterious ORTE error.
>>> 
>>> In doing so, I have found at least one fairly repeatable error on my 
>>> cluster: when running through SLURM the ibm/dynamic/spawn test, where we 
>>> mpirun 3 procs and then we MPI_COMM_SPAWN 3 more.  Running the orteds 
>>> through valgrind, I see a bunch of uninitialized epoch issues.
>>> 
>>> Attached are the 2 valgrind outputs.
>>> 
>>> Can these be fixed?  I don't know if they're actual problems or not, but 
>>> seeing uninitialized values go by makes me extremely nervous.
>>> 
>>> Thanks!
>>> 
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Uninitialized ORTE epoch values

2011-08-05 Thread Thomas Herault

The warnings issued through ess_base_select.c:46 are annoying but harmless. 
Wesley is going to hunt them and remove them, but they are really issued 
because of the print:
orte_ess_base_proc_get_epoch (ess_base_select.c:46) calls 
ORTE_NAME_PRINT(proc), which prints proc->epoch, before proc->epoch is assigned 
to the local computed value epoch. This assignment is done in the level just 
above orte_ess_base_proc_get_epoch: orte_odls_base_default_construct_child_list 
(odls_base_default_fns.c:737) says proc->epoch = 
orte_ess_base_proc_get_epoch(proc);

Wesley is going to find where this proc was created to ensure that its epoch 
field is initialized to INVALID_EPOCH, but what this trace really says is that 
nothing other than the print references it before it is set to its correct value.
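
To make the pattern concrete, here is a minimal C sketch of the fix described above: give the epoch field a defined sentinel when the name is created, so that a print issued before the real assignment reads a defined value. The names proc_name_t, proc_name_make, proc_name_print and proc_get_epoch, and the value chosen for INVALID_EPOCH, are hypothetical stand-ins for illustration, not the actual ORTE definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; the real ORTE types and constants may differ. */
typedef uint32_t epoch_t;
#define INVALID_EPOCH ((epoch_t)UINT32_MAX) /* assumed sentinel value */

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    epoch_t  epoch;
} proc_name_t;

/* Create a name with every field defined, including the epoch sentinel. */
static proc_name_t proc_name_make(uint32_t jobid, uint32_t vpid)
{
    proc_name_t p = { jobid, vpid, INVALID_EPOCH };
    return p;
}

/* Format a name. This reads p->epoch, so the field must be defined here. */
static int proc_name_print(char *buf, size_t len, const proc_name_t *p)
{
    assert(buf != NULL && p != NULL);
    if (p->epoch == INVALID_EPOCH) {
        return snprintf(buf, len, "[%u,%u,INVALID]",
                        (unsigned)p->jobid, (unsigned)p->vpid);
    }
    return snprintf(buf, len, "[%u,%u,%u]",
                    (unsigned)p->jobid, (unsigned)p->vpid, (unsigned)p->epoch);
}

/* Look up the epoch for a name, logging the name first as in the trace.
 * 'current' stands in for whatever value the real lookup would compute. */
static epoch_t proc_get_epoch(const proc_name_t *p, epoch_t current)
{
    char buf[64];
    proc_name_print(buf, sizeof buf, p); /* defined read: epoch set at creation */
    fprintf(stderr, "looking up epoch for %s\n", buf);
    return current;
}
```

A caller would create the name with proc_name_make and then do p.epoch = proc_get_epoch(&p, current), mirroring the assignment at odls_base_default_fns.c:737. Under these assumptions the print path only ever reads a defined value, so valgrind should have nothing to flag there.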

Thomas

On Aug 5, 2011, at 4:52 PM, Ralph Castain wrote:

> Thanks Wes - it isn't the print that's the issue, it's the fact that we have 
> epochs that aren't being initialized, and whatever other problems that may 
> be causing.
> 
> 
> On Aug 5, 2011, at 2:45 PM, Wesley Bland wrote:
> 
>> I don't think these are anything to worry about since they're all print 
>> statements, but I will work on these tonight.
>> 
>> On Fri, Aug 5, 2011 at 3:03 PM, Jeff Squyres  wrote:
>> Ralph and I are trying to track down the mysterious ORTE error.
>> 
>> In doing so, I have found at least one fairly repeatable error on my 
>> cluster: when running through SLURM the ibm/dynamic/spawn test, where we 
>> mpirun 3 procs and then we MPI_COMM_SPAWN 3 more.  Running the orteds 
>> through valgrind, I see a bunch of uninitialized epoch issues.
>> 
>> Attached are the 2 valgrind outputs.
>> 
>> Can these be fixed?  I don't know if they're actual problems or not, but 
>> seeing uninitialized values go by makes me extremely nervous.
>> 
>> Thanks!
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Uninitialized ORTE epoch values

2011-08-05 Thread Ralph Castain
Thanks Wes - it isn't the print that's the issue, it's the fact that we have 
epochs that aren't being initialized, and whatever other problems that may be 
causing.


On Aug 5, 2011, at 2:45 PM, Wesley Bland wrote:

> I don't think these are anything to worry about since they're all print 
> statements, but I will work on these tonight.
> 
> On Fri, Aug 5, 2011 at 3:03 PM, Jeff Squyres  wrote:
> Ralph and I are trying to track down the mysterious ORTE error.
> 
> In doing so, I have found at least one fairly repeatable error on my cluster: 
> when running through SLURM the ibm/dynamic/spawn test, where we mpirun 3 
> procs and then we MPI_COMM_SPAWN 3 more.  Running the orteds through 
> valgrind, I see a bunch of uninitialized epoch issues.
> 
> Attached are the 2 valgrind outputs.
> 
> Can these be fixed?  I don't know if they're actual problems or not, but 
> seeing uninitialized values go by makes me extremely nervous.
> 
> Thanks!
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Uninitialized ORTE epoch values

2011-08-05 Thread Wesley Bland
I don't think these are anything to worry about since they're all print
statements, but I will work on these tonight.

On Fri, Aug 5, 2011 at 3:03 PM, Jeff Squyres  wrote:

> Ralph and I are trying to track down the mysterious ORTE error.
>
> In doing so, I have found at least one fairly repeatable error on my
> cluster: when running through SLURM the ibm/dynamic/spawn test, where we
> mpirun 3 procs and then we MPI_COMM_SPAWN 3 more.  Running the orteds
> through valgrind, I see a bunch of uninitialized epoch issues.
>
> Attached are the 2 valgrind outputs.
>
> Can these be fixed?  I don't know if they're actual problems or not, but
> seeing uninitialized values go by makes me extremely nervous.
>
> Thanks!
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/


Re: [OMPI devel] Uninitialized ORTE epoch values

2011-08-05 Thread Jeff Squyres
BTW, the -1 file has an invalid free in it that we just fixed.  That's not part 
of the epoch value issue, of course.  :-)

On Aug 5, 2011, at 3:03 PM, Jeff Squyres wrote:

> Ralph and I are trying to track down the mysterious ORTE error.  
> 
> In doing so, I have found at least one fairly repeatable error on my cluster: 
> when running through SLURM the ibm/dynamic/spawn test, where we mpirun 3 
> procs and then we MPI_COMM_SPAWN 3 more.  Running the orteds through 
> valgrind, I see a bunch of uninitialized epoch issues.  
> 
> Attached are the 2 valgrind outputs.
> 
> Can these be fixed?  I don't know if they're actual problems or not, but 
> seeing uninitialized values go by makes me extremely nervous.
> 
> Thanks!
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] Uninitialized ORTE epoch values

2011-08-05 Thread Jeff Squyres
Ralph and I are trying to track down the mysterious ORTE error.  

In doing so, I have found at least one fairly repeatable error on my cluster: 
when running through SLURM the ibm/dynamic/spawn test, where we mpirun 3 procs 
and then we MPI_COMM_SPAWN 3 more.  Running the orteds through valgrind, I see 
a bunch of uninitialized epoch issues.  

Attached are the 2 valgrind outputs.

Can these be fixed?  I don't know if they're actual problems or not, but seeing 
uninitialized values go by makes me extremely nervous.

Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
==4436== Memcheck, a memory error detector
==4436== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==4436== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==4436== Command: /home/jsquyres/bogus/bin/orted -mca ess slurm -mca orte_ess_jobid 2778071040 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2778071040.0.0;tcp://172.29.218.140:40955;tcp://10.10.10.140:40955;tcp://10.10.20.140:40955;tcp://10.10.30.140:40955" --mca orte_startup_timeout 1 --mca mpi_leave_pinned 0 --mca btl tcp,self
==4436== 
==4436== Conditional jump or move depends on uninitialised value(s)
==4436==    at 0x4E6634C: orte_util_print_epoch (name_fns.c:301)
==4436==    by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144)
==4436==    by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46)
==4436==    by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737)
==4436==    by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496)
==4436==    by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508)
==4436==    by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324)
==4436==    by 0x4F3EA39: event_process_active_single_queue (event.c:1303)
==4436==    by 0x4F3EBAF: event_process_active (event.c:1370)
==4436==    by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566)
==4436==    by 0x4E806AE: orte_daemon (orted_main.c:682)
==4436==    by 0x400929: main (orted.c:62)
==4436== 
==4436== Conditional jump or move depends on uninitialised value(s)
==4436==    at 0x4E66392: orte_util_print_epoch (name_fns.c:303)
==4436==    by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144)
==4436==    by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46)
==4436==    by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737)
==4436==    by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496)
==4436==    by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508)
==4436==    by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324)
==4436==    by 0x4F3EA39: event_process_active_single_queue (event.c:1303)
==4436==    by 0x4F3EBAF: event_process_active (event.c:1370)
==4436==    by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566)
==4436==    by 0x4E806AE: orte_daemon (orted_main.c:682)
==4436==    by 0x400929: main (orted.c:62)
==4436== 
==4436== Use of uninitialised value of size 8
==4436==    at 0x64649BD: _itoa_word (in /lib64/libc-2.5.so)
==4436==    by 0x6467E5A: vfprintf (in /lib64/libc-2.5.so)
==4436==    by 0x648C889: vsnprintf (in /lib64/libc-2.5.so)
==4436==    by 0x6470492: snprintf (in /lib64/libc-2.5.so)
==4436==    by 0x4E6640E: orte_util_print_epoch (name_fns.c:306)
==4436==    by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144)
==4436==    by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46)
==4436==    by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737)
==4436==    by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496)
==4436==    by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508)
==4436==    by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324)
==4436==    by 0x4F3EA39: event_process_active_single_queue (event.c:1303)
==4436== 
==4436== Conditional jump or move depends on uninitialised value(s)
==4436==    at 0x64649C7: _itoa_word (in /lib64/libc-2.5.so)
==4436==    by 0x6467E5A: vfprintf (in /lib64/libc-2.5.so)
==4436==    by 0x648C889: vsnprintf (in /lib64/libc-2.5.so)
==4436==    by 0x6470492: snprintf (in /lib64/libc-2.5.so)
==4436==    by 0x4E6640E: orte_util_print_epoch (name_fns.c:306)
==4436==    by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144)
==4436==    by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46)
==4436==    by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737)
==4436==    by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496)
==4436==    by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508)
==4436==    by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324)
==4436==    by 0x4F3EA39: event_process_active_single_queue (event.c:1303)
==4436== 
==4436== Conditional jump or move depen

Re: [OMPI devel] Open MPI + HWLOC + Static build issue

2011-08-05 Thread Jeff Squyres
On Aug 5, 2011, at 5:55 AM, Brice Goglin wrote:

>> Libtool's -all-static flag probably resolves to some gcc flag(s), right?  
>> Can you just pass those in via CFLAGS / LDFLAGS to configure and then not 
>> pass anything in via make?
> 
> I only see an additional -static flag on the final program-link gcc
> command-line when -all-static is given to libtool. But if you pass
> LDFLAGS=-static to configure, it's interpreted by libtool and gcc
> doesn't get a -static when linking programs.

Well, that's certainly yucky.  As a *guess*, this sounds like an 
artifact of Libtool not having a distinct namespace for its CLI arguments 
(i.e., Libtool assuming that "-static" is intended as a Libtool argument, not a 
compiler/linker argument).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Open MPI + HWLOC + Static build issue

2011-08-05 Thread Brice Goglin
On 04/08/2011 02:24, Jeff Squyres wrote:
> Libtool's -all-static flag probably resolves to some gcc flag(s), right?  Can 
> you just pass those in via CFLAGS / LDFLAGS to configure and then not pass 
> anything in via make?

I only see an additional -static flag on the final program-link gcc
command-line when -all-static is given to libtool. But if you pass
LDFLAGS=-static to configure, it's interpreted by libtool and gcc
doesn't get a -static when linking programs.


Pasha, as a workaround, did you try adding LDFLAGS=-static to the OMPI
configure line? This seems to fix hwloc libnuma detection problems. But
I don't know if it will cause other problems for you. Note that you will
still need LDFLAGS=-all-static on the make command line.

Brice