Thanks for the explanation. It does raise a question, though - I've noticed 
that the assignment of epoch seems to circle around in a number of places. We 
call the ess_base function to get_epoch, and then we assign an epoch. But the 
base function doesn't actually seem to do much, if anything.

It's somewhat confusing and difficult to trace. I know Wes and I already 
planned to clean up some of this once we get back to the ORTE state machine 
work, but I'm hoping we can simplify this code somewhat to make it easier to 
understand and follow.

Meantime, we'll continue to chase down the problems.
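
For reference, here is a rough sketch of the pattern described in the quoted 
explanation below. It is not the actual ORTE code: the type proc_name_t, the 
helpers name_print, get_epoch and proc_name_init, the value chosen for 
INVALID_EPOCH, and the placeholder epoch of 1 are all assumptions made for 
illustration. The idea is only that the lookup prints the full name, which 
reads the epoch field, before the caller has stored the result, so the field 
should hold a known sentinel from creation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the real ORTE definitions. */
typedef uint32_t epoch_t;
#define INVALID_EPOCH ((epoch_t)0xFFFFFFFFu) /* assumed sentinel value */

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    epoch_t  epoch;
} proc_name_t;

/* Stand-in for ORTE_NAME_PRINT: formats every field, including epoch. */
static const char *name_print(const proc_name_t *p, char *buf, size_t len)
{
    snprintf(buf, len, "[%u,%u,%u]",
             (unsigned)p->jobid, (unsigned)p->vpid, (unsigned)p->epoch);
    return buf;
}

/* Stand-in for orte_ess_base_proc_get_epoch: it logs the name first, which
 * reads p->epoch, and only then returns the locally computed epoch. */
static epoch_t get_epoch(const proc_name_t *p)
{
    char buf[64];
    assert(p != NULL);
    fprintf(stderr, "get_epoch for %s\n", name_print(p, buf, sizeof buf));
    return 1; /* placeholder for the locally computed epoch */
}

/* Setting the sentinel at creation means the print above reads a defined
 * value even though the caller has not assigned the real epoch yet. */
static void proc_name_init(proc_name_t *p, uint32_t jobid, uint32_t vpid)
{
    p->jobid = jobid;
    p->vpid  = vpid;
    p->epoch = INVALID_EPOCH;
}
```

A caller would then do proc_name_init(&proc, jobid, vpid) followed by 
proc.epoch = get_epoch(&proc), which mirrors the assignment in the level above 
the lookup. Without the init step, the print inside the lookup reads an 
uninitialized field, which is the kind of access valgrind reports.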

On Aug 5, 2011, at 4:17 PM, Thomas Herault wrote:

> 
> The warnings issued through ess_base_select.c:46 are annoying but harmless. 
> Wesley is going to hunt them down and remove them, but they are really 
> caused by the print itself: 
> orte_ess_base_proc_get_epoch (ess_base_select.c:46) calls 
> ORTE_NAME_PRINT(proc), which prints proc->epoch before proc->epoch has been 
> assigned the locally computed epoch value. That assignment is done in the 
> level just above orte_ess_base_proc_get_epoch: 
> orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) 
> does proc->epoch = orte_ess_base_proc_get_epoch(proc);
> 
> Wesley is going to find where this proc was created to ensure that its epoch 
> field is initialized to INVALID_EPOCH, but what this trace really says is 
> that nothing other than the print references it before it is set to its 
> correct value.
> 
> Thomas
> 
> On Aug 5, 2011, at 4:52 PM, Ralph Castain wrote:
> 
>> Thanks Wes - it isn't the print that's the issue, it's the fact that we have 
>> epochs that aren't being initialized, and whatever other problems that may 
>> be causing.
>> 
>> 
>> On Aug 5, 2011, at 2:45 PM, Wesley Bland wrote:
>> 
>>> I don't think these are anything to worry about since they're all print 
>>> statements, but I will work on these tonight.
>>> 
>>> On Fri, Aug 5, 2011 at 3:03 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>> Ralph and I are trying to track down the mysterious ORTE error.
>>> 
>>> In doing so, I have found at least one fairly repeatable error on my 
>>> cluster: when running the ibm/dynamic/spawn test through SLURM, where we 
>>> mpirun 3 procs and then MPI_COMM_SPAWN 3 more.  Running the orteds 
>>> through valgrind, I see a bunch of uninitialized epoch issues.
>>> 
>>> Attached are the 2 valgrind outputs.
>>> 
>>> Can these be fixed?  I don't know if they're actual problems or not, but 
>>> seeing uninitialized values go by makes me extremely nervous.
>>> 
>>> Thanks!
>>> 
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel