Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

Ralph Castain Mon, 8 Aug 2011 19:59:03 -0400

On Aug 8, 2011, at 1:23 PM, Thomas Herault wrote:

> proc_get_epoch( proc );
> calls ORTE_NAME_PRINT(proc)
> when in debug mode.
> 
> When this is the case, ORTE_NAME_PRINT prints all fields of proc, including 
> epoch. So, proc->epoch = proc_get_epoch(proc) triggers the valgrind warning, 
> because of the print, in debug mode.
> This is the only reason why proc->epoch must be initialized before calling 
> proc->epoch = proc_get_epoch( proc );
> 
> I agree with you it's not a good approach, for coding style / avoiding this 
> kind of discussions.
> 
> Alternative approaches include:
> - having proc_get_epoch have a side effect on proc->epoch instead of 
> returning the correct epoch value. Assignment can then be done before the 
> debug message. But this is inconsistent with the rest of the interface used 
> in proc_name_fns;
> - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print the 
> debug info. But this requires coding an alternative ORTE_SOMETHING_PRINT 
> function, only for this use;
> - having proc be defined as an opal_object, and set epoch to INVALID (or even 
> UNSET) into the constructor. This could induce changes at many places, and 
> there is always the risk that some changes are left incomplete.


Just an FYI here. We have had the discussion of converting orte_process_name_t 
to an opal_object several times. The biggest reason for not doing so is that 
the memory consumed by the opal_object_t itself is actually greater than the 
memory used by the name fields. Since we store a lot of process names, this 
results in a lot of overhead memory loss.

However, we do have the problem of names not being initialized, and this has 
caused trouble a number of times over the last few years. So maybe it would be 
worth the footprint to resolve the problem once and for all?

Something we should discuss, methinks.


> 
> Thomas
> 
> Le 8 août 2011 à 12:20, Barrett, Brian W a écrit :
> 
>> Ok, that makes some sense.  It's a really weird way of doing
>> initialization, though.  Some of it is probably how Ralph chose to share
>> nidpid map code, so perhaps it's unavoidable.  But since the name isn't
>> use in proc_get_epoch() (or, better not be used in there), I'm not
>> entirely sure why that made valgrind happy...
>> 
>> Brian
>> 
>> On 8/8/11 9:53 AM, "Wesley Bland" <wbl...@eecs.utk.edu> wrote:
>> 
>>> It's not just copying the value from one to the other.
>>> The value is initialized on the first line. The proc_name is passed into
>>> the function where the jobid and vpid are used (not the epoch). The
>>> function returns a new (the correct) value for the epoch based on the
>>> passed in jobid and vpid which is assigned to the process name.
>>> 
>>> This is the same thing as initializing the value to NULL. We do that all
>>> the time to avoid compiler warnings. I just can't do that here because
>>> this isn't a pointer. If it would make the code look better I can move
>>> the first assignment to the top of the function where the other
>>> initializations are.
>>> 
>>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W <bwba...@sandia.gov>
>>> wrote:
>>> 
>>> On 8/8/11 9:34 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>> 
>>>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
>>>> 
>>>>> The reason is because valgrind was complaining about uninitialized
>>>>> values that were passed into proc_get_epoch. I saw the same warnings
>>>>> from valgrind when I ran it. I added the code to initialize the values
>>>>> to what really should be the default value and the warnings went away.
>>>>> Since the process_name_t struct isn't an object, it doesn't have an
>>>>> initialization function like so many of the other objects in the code.
>>>>> This is what we have.
>>>> 
>>>> Ah, I see -- you are passing peer_name into proc_get_epoch().  I missed
>>>> that.
>>>> 
>>>> Thanks!
>>>> 
>>>>>   peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
>>>>>   peer_name.vpid   = peer_idx;
>>>>> +    peer_name.epoch  = ORTE_EPOCH_INVALID;
>>>>>   peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
>>> 
>>> 
>>> I'm still not convinced this is a rational coding style.  It seems to me
>>> that if you're just going to copy the epoch from peer_name to peer_name,
>>> just assign the epoch to INVALID and be done with it.  There's no need for
>>> both the assignment added in this patch and the line after it.
>>> 
>>> Now, normally I'd say that this is fixing a bug and having to fix other
>>> people's bad code shouldn't be your problem.  But since you wrote both
>>> lines, fixing it to make sense seems reasonable to me ;).
>>> 
>>> Brian
>>> 
>>> --
>>> Brian W. Barrett
>>> Dept. 1423: Scalable System Software
>>> Sandia National Laboratories
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

Reply via email to