Probably should have injected my earlier thoughts here. IMO, Brian is correct - 
we do have a basic design problem, and it does need to be corrected.

It's a bigger discussion than just this commit, though, and we probably should 
add it to the Tues telecon so we at least get some initial ideas on how to 
approach it.


On Aug 8, 2011, at 1:30 PM, Barrett, Brian W wrote:

> 4) Don't design an interface where you're going to have the circular
> dependency.  Perhaps it's a sign that the epoch is in the wrong place, or
> handled in the wrong way, or the whole naming scheme is wrong.  I'm not
> sure.  I've relented that it doesn't need to be fixed now (since it is
> clearly bigger than just the patch introducing the epochs), but I still
> think it's a sign we have a basic design problem with the way we're
> handling names.
> 
> Brian
> 
> On 8/8/11 1:23 PM, "Thomas Herault" <herault.tho...@gmail.com> wrote:
> 
>> proc_get_epoch( proc );
>> calls ORTE_NAME_PRINT(proc)
>> when in debug mode.
>> 
>> When this is the case, ORTE_NAME_PRINT prints all fields of proc,
>> including epoch. So, proc->epoch = proc_get_epoch(proc) triggers the
>> valgrind warning, because of the print, in debug mode.
>> This is the only reason why proc->epoch must be initialized before
>> calling proc->epoch = proc_get_epoch( proc );
>> 
>> I agree with you it's not a good approach, for coding style / avoiding
>> this kind of discussions.
>> 
>> Alternative approaches include:
>> - having proc_get_epoch have a side effect on proc->epoch instead of
>> returning the correct epoch value. Assignment can then be done before the
>> debug message. But this is inconsistent with the rest of the interface
>> used in proc_name_fns;
>> - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print
>> the debug info. But this requires coding an alternative
>> ORTE_SOMETHING_PRINT function, only for this use;
>> - having proc be defined as an opal_object, and set epoch to INVALID (or
>> even UNSET) into the constructor. This could induce changes at many
>> places, and there is always the risk that some changes are left
>> incomplete.
>> 
>> Thomas
>> 
>> Le 8 août 2011 à 12:20, Barrett, Brian W a écrit :
>> 
>>> Ok, that makes some sense.  It's a really weird way of doing
>>> initialization, though.  Some of it is probably how Ralph chose to share
>>> nidpid map code, so perhaps it's unavoidable.  But since the name isn't
>>> use in proc_get_epoch() (or, better not be used in there), I'm not
>>> entirely sure why that made valgrind happy...
>>> 
>>> Brian
>>> 
>>> On 8/8/11 9:53 AM, "Wesley Bland" <wbl...@eecs.utk.edu> wrote:
>>> 
>>>> It's not just copying the value from one to the other.
>>>> The value is initialized on the first line. The proc_name is passed
>>>> into
>>>> the function where the jobid and vpid are used (not the epoch). The
>>>> function returns a new (the correct) value for the epoch based on the
>>>> passed in jobid and vpid which is assigned to the process name.
>>>> 
>>>> This is the same thing as initializing the value to NULL. We do that
>>>> all
>>>> the time to avoid compiler warnings. I just can't do that here because
>>>> this isn't a pointer. If it would make the code look better I can move
>>>> the first assignment to the top of the function where the other
>>>> initializations are.
>>>> 
>>>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W <bwba...@sandia.gov>
>>>> wrote:
>>>> 
>>>> On 8/8/11 9:34 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>> 
>>>>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
>>>>> 
>>>>>> The reason is because valgrind was complaining about uninitialized
>>>>>> values that were passed into proc_get_epoch. I saw the same warnings
>>>>>> from valgrind when I ran it. I added the code to initialize the
>>>>>> values
>>>>>> to what really should be the default value and the warnings went
>>>>>> away.
>>>>>> Since the process_name_t struct isn't an object, it doesn't have an
>>>>>> initialization function like so many of the other objects in the
>>>>>> code.
>>>>>> This is what we have.
>>>>> 
>>>>> Ah, I see -- you are passing peer_name into proc_get_epoch().  I
>>>>> missed
>>>>> that.
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>>>   peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
>>>>>>   peer_name.vpid   = peer_idx;
>>>>>> +    peer_name.epoch  = ORTE_EPOCH_INVALID;
>>>>>>   peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
>>>> 
>>>> 
>>>> I'm still not convinced this is a rational coding style.  It seems to
>>>> me
>>>> that if you're just going to copy the epoch from peer_name to
>>>> peer_name,
>>>> just assign the epoch to INVALID and be done with it.  There's no need
>>>> for
>>>> both the assignment added in this patch and the line after it.
>>>> 
>>>> Now, normally I'd say that this is fixing a bug and having to fix other
>>>> people's bad code shouldn't be your problem.  But since you wrote both
>>>> lines, fixing it to make sense seems reasonable to me ;).
>>>> 
>>>> Brian
>>>> 
>>>> --
>>>> Brian W. Barrett
>>>> Dept. 1423: Scalable System Software
>>>> Sandia National Laboratories
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> -- 
>>> Brian W. Barrett
>>> Dept. 1423: Scalable System Software
>>> Sandia National Laboratories
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
> 
> 
> -- 
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories
> 
> 
> 
> 
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Reply via email to