Re: [OMPI devel] Uninitialized ORTE epoch values
Thanks for the explanation. It does raise a question, though: I've noticed that the assignment of epoch seems to circle around in a number of places. We call the ess_base function to get_epoch, and then we assign an epoch. But the base function doesn't actually seem to do much, if anything. It's somewhat confusing and difficult to trace.

I know Wes and I already planned to clean up some of this once we get back to the orte state machine work, but I'm hoping we can simplify this code somewhat to make it easier to understand and follow.

Meantime, we'll continue to chase down the problems.

On Aug 5, 2011, at 4:17 PM, Thomas Herault wrote:

> The warnings issued through ess_base_select.c:46 are annoying but harmless.
> Wesley is going to hunt them down and remove them, but they are really
> issued because of the print:
>
> orte_ess_base_proc_get_epoch (ess_base_select.c:46) calls
> ORTE_NAME_PRINT(proc), which prints proc->epoch before proc->epoch is
> assigned the locally computed epoch value. This assignment is done in the
> level just above orte_ess_base_proc_get_epoch:
>
> orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737)
> says proc->epoch = orte_ess_base_proc_get_epoch(proc);
>
> Wesley is going to find where this proc was created to ensure that its epoch
> field is initialized to INVALID_EPOCH, but what this trace really says is
> that nothing references it before it is initialized to its correct value.
>
> Thomas
Re: [OMPI devel] Uninitialized ORTE epoch values
The warnings issued through ess_base_select.c:46 are annoying but harmless. Wesley is going to hunt them down and remove them, but they are really issued because of the print:

orte_ess_base_proc_get_epoch (ess_base_select.c:46) calls ORTE_NAME_PRINT(proc), which prints proc->epoch before proc->epoch is assigned the locally computed epoch value. This assignment is done in the level just above orte_ess_base_proc_get_epoch:

orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) says
    proc->epoch = orte_ess_base_proc_get_epoch(proc);

Wesley is going to find where this proc was created to ensure that its epoch field is initialized to INVALID_EPOCH, but what this trace really says is that nothing references it before it is initialized to its correct value.

Thomas

On Aug 5, 2011, at 16:52, Ralph Castain wrote:

> Thanks Wes - it isn't the print that's the issue, it's the fact that we have
> epochs that aren't being initialized, and whatever other problems that may
> be causing.
Re: [OMPI devel] Uninitialized ORTE epoch values
Thanks Wes - it isn't the print that's the issue, it's the fact that we have epochs that aren't being initialized, and whatever other problems that may be causing.

On Aug 5, 2011, at 2:45 PM, Wesley Bland wrote:

> I don't think these are anything to worry about since they're all print
> statements, but I will work on these tonight.
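To make the ordering problem from Thomas's explanation concrete, here is a minimal sketch in C. It is not the ORTE code: the demo_* names, the struct layout, the sentinel value DEMO_INVALID_EPOCH, and the "INVALID" print format are all assumptions made up for illustration. It shows a lookup function that prints the name (and so reads the epoch field) before the caller assigns the result, and an init function that sets the field to a sentinel at creation so that the early read is well defined.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; NOT the real ORTE definitions. */
typedef uint32_t demo_epoch_t;
#define DEMO_INVALID_EPOCH UINT32_MAX

typedef struct {
    uint32_t     jobid;
    uint32_t     vpid;
    demo_epoch_t epoch;
} demo_proc_name_t;

/* Set every field at creation time, so that no later reader (including a
 * debug print) ever sees indeterminate memory in the epoch field. */
static void demo_proc_name_init(demo_proc_name_t *name,
                                uint32_t jobid, uint32_t vpid)
{
    assert(name != NULL);
    name->jobid = jobid;
    name->vpid  = vpid;
    name->epoch = DEMO_INVALID_EPOCH;
}

/* Format a name into buf. This reads name->epoch, which is the step that
 * is only safe if the field was initialized beforehand.
 * Returns the value returned by snprintf. */
static int demo_name_print(const demo_proc_name_t *name,
                           char *buf, size_t len)
{
    if (DEMO_INVALID_EPOCH == name->epoch) {
        return snprintf(buf, len, "[%u,%u,INVALID]",
                        (unsigned)name->jobid, (unsigned)name->vpid);
    }
    return snprintf(buf, len, "[%u,%u,%u]",
                    (unsigned)name->jobid, (unsigned)name->vpid,
                    (unsigned)name->epoch);
}

/* Stand-in for an epoch lookup: it logs the name first (reading the epoch
 * field) and only then returns the value the caller will assign. Here the
 * "computed" epoch is simply passed in as `current`. */
static demo_epoch_t demo_get_epoch(const demo_proc_name_t *name,
                                   demo_epoch_t current)
{
    char buf[64];
    demo_name_print(name, buf, sizeof(buf));
    fprintf(stderr, "looking up epoch for %s\n", buf);
    return current;
}
```

With the field set at creation, a caller can write p.epoch = demo_get_epoch(&p, current); and the print inside the lookup reads the sentinel rather than indeterminate memory, which is the kind of read valgrind reports in the traces from the original message.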
Re: [OMPI devel] Uninitialized ORTE epoch values
I don't think these are anything to worry about since they're all print statements, but I will work on these tonight.

On Fri, Aug 5, 2011 at 3:03 PM, Jeff Squyres wrote:

> Ralph and I are trying to track down the mysterious ORTE error.
>
> In doing so, I have found at least one fairly repeatable error on my
> cluster: when running the ibm/dynamic/spawn test through SLURM, where we
> mpirun 3 procs and then MPI_COMM_SPAWN 3 more. Running the orteds
> through valgrind, I see a bunch of uninitialized epoch issues.
>
> Attached are the 2 valgrind outputs.
>
> Can these be fixed? I don't know if they're actual problems or not, but
> seeing uninitialized values go by makes me extremely nervous.
>
> Thanks!
Re: [OMPI devel] Uninitialized ORTE epoch values
BTW, the -1 file has an invalid free in it that we just fixed. That's not part of the epoch value issue, of course. :-)

On Aug 5, 2011, at 3:03 PM, Jeff Squyres wrote:

> Ralph and I are trying to track down the mysterious ORTE error.
>
> In doing so, I have found at least one fairly repeatable error on my
> cluster: when running the ibm/dynamic/spawn test through SLURM, where we
> mpirun 3 procs and then MPI_COMM_SPAWN 3 more. Running the orteds
> through valgrind, I see a bunch of uninitialized epoch issues.
>
> Attached are the 2 valgrind outputs.
>
> Can these be fixed? I don't know if they're actual problems or not, but
> seeing uninitialized values go by makes me extremely nervous.
>
> Thanks!

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] Uninitialized ORTE epoch values
Ralph and I are trying to track down the mysterious ORTE error.

In doing so, I have found at least one fairly repeatable error on my cluster: when running the ibm/dynamic/spawn test through SLURM, where we mpirun 3 procs and then MPI_COMM_SPAWN 3 more. Running the orteds through valgrind, I see a bunch of uninitialized epoch issues.

Attached are the 2 valgrind outputs.

Can these be fixed? I don't know if they're actual problems or not, but seeing uninitialized values go by makes me extremely nervous.

Thanks!

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

==4436== Memcheck, a memory error detector
==4436== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==4436== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==4436== Command: /home/jsquyres/bogus/bin/orted -mca ess slurm -mca orte_ess_jobid 2778071040 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2778071040.0.0;tcp://172.29.218.140:40955;tcp://10.10.10.140:40955;tcp://10.10.20.140:40955;tcp://10.10.30.140:40955" --mca orte_startup_timeout 1 --mca mpi_leave_pinned 0 --mca btl tcp,self
==4436==
==4436== Conditional jump or move depends on uninitialised value(s)
==4436==    at 0x4E6634C: orte_util_print_epoch (name_fns.c:301)
==4436==    by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144)
==4436==    by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46)
==4436==    by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737)
==4436==    by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496)
==4436==    by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508)
==4436==    by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324)
==4436==    by 0x4F3EA39: event_process_active_single_queue (event.c:1303)
==4436==    by 0x4F3EBAF: event_process_active (event.c:1370)
==4436==    by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566)
==4436==    by 0x4E806AE: orte_daemon (orted_main.c:682)
==4436==    by 0x400929: main (orted.c:62)
==4436==
==4436== Conditional jump or move depends on uninitialised value(s)
==4436==    at 0x4E66392: orte_util_print_epoch (name_fns.c:303)
==4436==    by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144)
==4436==    by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46)
==4436==    by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737)
==4436==    by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496)
==4436==    by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508)
==4436==    by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324)
==4436==    by 0x4F3EA39: event_process_active_single_queue (event.c:1303)
==4436==    by 0x4F3EBAF: event_process_active (event.c:1370)
==4436==    by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566)
==4436==    by 0x4E806AE: orte_daemon (orted_main.c:682)
==4436==    by 0x400929: main (orted.c:62)
==4436==
==4436== Use of uninitialised value of size 8
==4436==    at 0x64649BD: _itoa_word (in /lib64/libc-2.5.so)
==4436==    by 0x6467E5A: vfprintf (in /lib64/libc-2.5.so)
==4436==    by 0x648C889: vsnprintf (in /lib64/libc-2.5.so)
==4436==    by 0x6470492: snprintf (in /lib64/libc-2.5.so)
==4436==    by 0x4E6640E: orte_util_print_epoch (name_fns.c:306)
==4436==    by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144)
==4436==    by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46)
==4436==    by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737)
==4436==    by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496)
==4436==    by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508)
==4436==    by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324)
==4436==    by 0x4F3EA39: event_process_active_single_queue (event.c:1303)
==4436==
==4436== Conditional jump or move depends on uninitialised value(s)
==4436==    at 0x64649C7: _itoa_word (in /lib64/libc-2.5.so)
==4436==    by 0x6467E5A: vfprintf (in /lib64/libc-2.5.so)
==4436==    by 0x648C889: vsnprintf (in /lib64/libc-2.5.so)
==4436==    by 0x6470492: snprintf (in /lib64/libc-2.5.so)
==4436==    by 0x4E6640E: orte_util_print_epoch (name_fns.c:306)
==4436==    by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144)
==4436==    by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46)
==4436==    by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737)
==4436==    by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496)
==4436==    by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508)
==4436==    by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324)
==4436==    by 0x4F3EA39: event_process_active_single_queue (event.c:1303)
==4436==
==4436== Conditional jump or move depen
Re: [OMPI devel] Open MPI + HWLOC + Static build issue
On Aug 5, 2011, at 5:55 AM, Brice Goglin wrote:

>> Libtool's -all-static flag probably resolves to some gcc flag(s), right?
>> Can you just pass those in via CFLAGS / LDFLAGS to configure and then not
>> pass anything in via make?
>
> I only see an additional -static flag on the final program-link gcc
> command-line when -all-static is given to libtool. But if you pass
> LDFLAGS=-static to configure, it's interpreted by libtool and gcc
> doesn't get a -static when linking programs.

Well, that's certainly yucky.

As a *guess*, this sounds like an artifact of Libtool not having a distinct namespace for its CLI arguments (i.e., Libtool assuming that "-static" is intended as a Libtool argument, not a compiler/linker argument).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Open MPI + HWLOC + Static build issue
On 04/08/2011 at 02:24, Jeff Squyres wrote:

> Libtool's -all-static flag probably resolves to some gcc flag(s), right? Can
> you just pass those in via CFLAGS / LDFLAGS to configure and then not pass
> anything in via make?

I only see an additional -static flag on the final program-link gcc command-line when -all-static is given to libtool. But if you pass LDFLAGS=-static to configure, it's interpreted by libtool and gcc doesn't get a -static when linking programs.

Pasha, as a workaround, did you try adding LDFLAGS=-static to the OMPI configure line? This seems to fix hwloc libnuma detection problems. But I don't know if it will cause other problems for you. Note that you will still need LDFLAGS=-all-static on the make command line.

Brice