Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Ralph Castain
Probably should have injected my earlier thoughts here. IMO, Brian is correct - 
we do have a basic design problem, and it does need to be corrected.

It's a bigger discussion than just this commit, though, and we probably should 
add it to the Tues telecon so we at least get some initial ideas on how to 
approach it.


On Aug 8, 2011, at 1:30 PM, Barrett, Brian W wrote:

> 4) Don't design an interface where you're going to have the circular
> dependency.  Perhaps it's a sign that the epoch is in the wrong place, or
> handled in the wrong way, or the whole naming scheme is wrong.  I'm not
> sure.  I've relented that it doesn't need to be fixed now (since it is
> clearly bigger than just the patch introducing the epochs), but I still
> think it's a sign we have a basic design problem with the way we're
> handling names.
> 
> Brian
> 
> On 8/8/11 1:23 PM, "Thomas Herault"  wrote:
> 
>> proc_get_epoch( proc );
>> calls ORTE_NAME_PRINT(proc)
>> when in debug mode.
>> 
>> When this is the case, ORTE_NAME_PRINT prints all fields of proc,
>> including epoch. So, proc->epoch = proc_get_epoch(proc) triggers the
>> valgrind warning, because of the print, in debug mode.
>> This is the only reason why proc->epoch must be initialized before
>> calling proc->epoch = proc_get_epoch( proc );
>> 
>> I agree with you it's not a good approach, for coding style / avoiding
>> this kind of discussions.
>> 
>> Alternative approaches include:
>> - having proc_get_epoch have a side effect on proc->epoch instead of
>> returning the correct epoch value. Assignment can then be done before the
>> debug message. But this is inconsistent with the rest of the interface
>> used in proc_name_fns;
>> - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print
>> the debug info. But this requires coding an alternative
>> ORTE_SOMETHING_PRINT function, only for this use;
>> - having proc be defined as an opal_object, and set epoch to INVALID (or
>> even UNSET) into the constructor. This could induce changes at many
>> places, and there is always the risk that some changes are left
>> incomplete.
>> 
>> Thomas
>> 
>> Le 8 août 2011 à 12:20, Barrett, Brian W a écrit :
>> 
>>> Ok, that makes some sense.  It's a really weird way of doing
>>> initialization, though.  Some of it is probably how Ralph chose to share
>>> nidpid map code, so perhaps it's unavoidable.  But since the name isn't
>>> use in proc_get_epoch() (or, better not be used in there), I'm not
>>> entirely sure why that made valgrind happy...
>>> 
>>> Brian
>>> 
>>> On 8/8/11 9:53 AM, "Wesley Bland"  wrote:
>>> 
 It's not just copying the value from one to the other.
 The value is initialized on the first line. The proc_name is passed
 into
 the function where the jobid and vpid are used (not the epoch). The
 function returns a new (the correct) value for the epoch based on the
 passed in jobid and vpid which is assigned to the process name.
 
 This is the same thing as initializing the value to NULL. We do that
 all
 the time to avoid compiler warnings. I just can't do that here because
 this isn't a pointer. If it would make the code look better I can move
 the first assignment to the top of the function where the other
 initializations are.
 
 On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W 
 wrote:
 
 On 8/8/11 9:34 AM, "Jeff Squyres"  wrote:
 
> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
> 
>> The reason is because valgrind was complaining about uninitialized
>> values that were passed into proc_get_epoch. I saw the same warnings
>> from valgrind when I ran it. I added the code to initialize the
>> values
>> to what really should be the default value and the warnings went
>> away.
>> Since the process_name_t struct isn't an object, it doesn't have an
>> initialization function like so many of the other objects in the
>> code.
>> This is what we have.
> 
> Ah, I see -- you are passing peer_name into proc_get_epoch().  I
> missed
> that.
> 
> Thanks!
> 
>>   peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
>>   peer_name.vpid   = peer_idx;
>> +peer_name.epoch  = ORTE_EPOCH_INVALID;
>>   peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
 
 
 I'm still not convinced this is a rational coding style.  It seems to
 me
 that if you're just going to copy the epoch from peer_name to
 peer_name,
 just assign the epoch to INVALID and be done with it.  There's no need
 for
 both the assignment added in this patch and the line after it.
 
 Now, normally I'd say that this is fixing a bug and having to fix other
 people's bad code shouldn't be your problem.  But since you wrote both
 lines, fixing it to make sense seems reasonable to me ;).
 
 Brian
 
 --
 Brian W. Barrett
 Dept. 1423: Scalable System Software
 Sandia Natio

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Ralph Castain

On Aug 8, 2011, at 1:23 PM, Thomas Herault wrote:

> proc_get_epoch( proc );
> calls ORTE_NAME_PRINT(proc)
> when in debug mode.
> 
> When this is the case, ORTE_NAME_PRINT prints all fields of proc, including 
> epoch. So, proc->epoch = proc_get_epoch(proc) triggers the valgrind warning, 
> because of the print, in debug mode.
> This is the only reason why proc->epoch must be initialized before calling 
> proc->epoch = proc_get_epoch( proc );
> 
> I agree with you it's not a good approach, for coding style / avoiding this 
> kind of discussions.
> 
> Alternative approaches include:
> - having proc_get_epoch have a side effect on proc->epoch instead of 
> returning the correct epoch value. Assignment can then be done before the 
> debug message. But this is inconsistent with the rest of the interface used 
> in proc_name_fns;
> - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print the 
> debug info. But this requires coding an alternative ORTE_SOMETHING_PRINT 
> function, only for this use;
> - having proc be defined as an opal_object, and set epoch to INVALID (or even 
> UNSET) into the constructor. This could induce changes at many places, and 
> there is always the risk that some changes are left incomplete.

Just an FYI here. We have had the discussion of converting orte_process_name_t 
to an opal_object several times. The biggest reason for not doing so is that 
the memory consumed by the opal_object_t itself is actually greater than the 
memory used by the name fields. Since we store a lot of process names, this 
results in a lot of overhead memory loss.

However, we do have the problem of names not being initialized, and this has 
caused trouble a number of times over the last few years. So maybe it would be 
worth the footprint to resolve the problem once and for all?

Something we should discuss, methinks.


> 
> Thomas
> 
> Le 8 août 2011 à 12:20, Barrett, Brian W a écrit :
> 
>> Ok, that makes some sense.  It's a really weird way of doing
>> initialization, though.  Some of it is probably how Ralph chose to share
>> nidpid map code, so perhaps it's unavoidable.  But since the name isn't
>> use in proc_get_epoch() (or, better not be used in there), I'm not
>> entirely sure why that made valgrind happy...
>> 
>> Brian
>> 
>> On 8/8/11 9:53 AM, "Wesley Bland"  wrote:
>> 
>>> It's not just copying the value from one to the other.
>>> The value is initialized on the first line. The proc_name is passed into
>>> the function where the jobid and vpid are used (not the epoch). The
>>> function returns a new (the correct) value for the epoch based on the
>>> passed in jobid and vpid which is assigned to the process name.
>>> 
>>> This is the same thing as initializing the value to NULL. We do that all
>>> the time to avoid compiler warnings. I just can't do that here because
>>> this isn't a pointer. If it would make the code look better I can move
>>> the first assignment to the top of the function where the other
>>> initializations are.
>>> 
>>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W 
>>> wrote:
>>> 
>>> On 8/8/11 9:34 AM, "Jeff Squyres"  wrote:
>>> 
 On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
 
> The reason is because valgrind was complaining about uninitialized
> values that were passed into proc_get_epoch. I saw the same warnings
> from valgrind when I ran it. I added the code to initialize the values
> to what really should be the default value and the warnings went away.
> Since the process_name_t struct isn't an object, it doesn't have an
> initialization function like so many of the other objects in the code.
> This is what we have.
 
 Ah, I see -- you are passing peer_name into proc_get_epoch().  I missed
 that.
 
 Thanks!
 
>   peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
>   peer_name.vpid   = peer_idx;
> +peer_name.epoch  = ORTE_EPOCH_INVALID;
>   peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
>>> 
>>> 
>>> I'm still not convinced this is a rational coding style.  It seems to me
>>> that if you're just going to copy the epoch from peer_name to peer_name,
>>> just assign the epoch to INVALID and be done with it.  There's no need for
>>> both the assignment added in this patch and the line after it.
>>> 
>>> Now, normally I'd say that this is fixing a bug and having to fix other
>>> people's bad code shouldn't be your problem.  But since you wrote both
>>> lines, fixing it to make sense seems reasonable to me ;).
>>> 
>>> Brian
>>> 
>>> --
>>> Brian W. Barrett
>>> Dept. 1423: Scalable System Software
>>> Sandia National Laboratories
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Ralph Castain
I will interject at this point in the thread. IMO, Brian has raised a valid 
point. We are contorting around with the epoch values, as I commented in a 
separate thread.

The problem really stems from something Wes mentioned a while back in a 
different set of communications. When the epoch code was written, it wasn't 
entirely clear to him where the data required for supporting the epoch-related 
functionality should go. It doesn't really fit the ess, it didn't really belong 
in modex, etc. So he wound up sticking things here and there, trying to disturb 
the existing APIs as little as possible.

What we probably should do is (a) try to better understand what Wes is trying 
to accomplish, and then (b) design the code base to support it. The current 
functional APIs and nidmap code weren't designed to support the extended name, 
and so the fit is sometimes wrong. Getting things designed properly would help 
ease the problem.

In fairness, remember that Wes and I had already agreed to review a bunch of 
this (e.g., where epoch info is stored) when we finally get back to dealing 
with the orte state engine. Hopefully, we can address some or all of these 
issues then.

Meantime, what's there is certainly "weird" - but if it works for now, then 
perhaps we can get things correctly integrated in the near future.

HTH
Ralph


On Aug 8, 2011, at 10:20 AM, Barrett, Brian W wrote:

> Ok, that makes some sense.  It's a really weird way of doing
> initialization, though.  Some of it is probably how Ralph chose to share
> nidpid map code, so perhaps it's unavoidable.  But since the name isn't
> use in proc_get_epoch() (or, better not be used in there), I'm not
> entirely sure why that made valgrind happy...
> 
> Brian
> 
> On 8/8/11 9:53 AM, "Wesley Bland"  wrote:
> 
>> It's not just copying the value from one to the other.
>> The value is initialized on the first line. The proc_name is passed into
>> the function where the jobid and vpid are used (not the epoch). The
>> function returns a new (the correct) value for the epoch based on the
>> passed in jobid and vpid which is assigned to the process name.
>> 
>> This is the same thing as initializing the value to NULL. We do that all
>> the time to avoid compiler warnings. I just can't do that here because
>> this isn't a pointer. If it would make the code look better I can move
>> the first assignment to the top of the function where the other
>> initializations are.
>> 
>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W 
>> wrote:
>> 
>> On 8/8/11 9:34 AM, "Jeff Squyres"  wrote:
>> 
>>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
>>> 
 The reason is because valgrind was complaining about uninitialized
 values that were passed into proc_get_epoch. I saw the same warnings
 from valgrind when I ran it. I added the code to initialize the values
 to what really should be the default value and the warnings went away.
 Since the process_name_t struct isn't an object, it doesn't have an
 initialization function like so many of the other objects in the code.
 This is what we have.
>>> 
>>> Ah, I see -- you are passing peer_name into proc_get_epoch().  I missed
>>> that.
>>> 
>>> Thanks!
>>> 
peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
peer_name.vpid   = peer_idx;
 +peer_name.epoch  = ORTE_EPOCH_INVALID;
peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
>> 
>> 
>> I'm still not convinced this is a rational coding style.  It seems to me
>> that if you're just going to copy the epoch from peer_name to peer_name,
>> just assign the epoch to INVALID and be done with it.  There's no need for
>> both the assignment added in this patch and the line after it.
>> 
>> Now, normally I'd say that this is fixing a bug and having to fix other
>> people's bad code shouldn't be your problem.  But since you wrote both
>> lines, fixing it to make sense seems reasonable to me ;).
>> 
>> Brian
>> 
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories
> 
> 
> 
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Aurélien Bouteiller
I see no problem with the latest code from Wesley. The warning is silenced, and 
was not indicative of any buggy behavior. So for me the problem is closed. 


Aurelien 

Le 8 août 2011 à 15:30, Barrett, Brian W a écrit :

> 4) Don't design an interface where you're going to have the circular
> dependency.  Perhaps it's a sign that the epoch is in the wrong place, or
> handled in the wrong way, or the whole naming scheme is wrong.  I'm not
> sure.  I've relented that it doesn't need to be fixed now (since it is
> clearly bigger than just the patch introducing the epochs), but I still
> think it's a sign we have a basic design problem with the way we're
> handling names.
> 
> Brian
> 
> On 8/8/11 1:23 PM, "Thomas Herault"  wrote:
> 
>> proc_get_epoch( proc );
>> calls ORTE_NAME_PRINT(proc)
>> when in debug mode.
>> 
>> When this is the case, ORTE_NAME_PRINT prints all fields of proc,
>> including epoch. So, proc->epoch = proc_get_epoch(proc) triggers the
>> valgrind warning, because of the print, in debug mode.
>> This is the only reason why proc->epoch must be initialized before
>> calling proc->epoch = proc_get_epoch( proc );
>> 
>> I agree with you it's not a good approach, for coding style / avoiding
>> this kind of discussions.
>> 
>> Alternative approaches include:
>> - having proc_get_epoch have a side effect on proc->epoch instead of
>> returning the correct epoch value. Assignment can then be done before the
>> debug message. But this is inconsistent with the rest of the interface
>> used in proc_name_fns;
>> - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print
>> the debug info. But this requires coding an alternative
>> ORTE_SOMETHING_PRINT function, only for this use;
>> - having proc be defined as an opal_object, and set epoch to INVALID (or
>> even UNSET) into the constructor. This could induce changes at many
>> places, and there is always the risk that some changes are left
>> incomplete.
>> 
>> Thomas
>> 
>> Le 8 août 2011 à 12:20, Barrett, Brian W a écrit :
>> 
>>> Ok, that makes some sense.  It's a really weird way of doing
>>> initialization, though.  Some of it is probably how Ralph chose to share
>>> nidpid map code, so perhaps it's unavoidable.  But since the name isn't
>>> use in proc_get_epoch() (or, better not be used in there), I'm not
>>> entirely sure why that made valgrind happy...
>>> 
>>> Brian
>>> 
>>> On 8/8/11 9:53 AM, "Wesley Bland"  wrote:
>>> 
 It's not just copying the value from one to the other.
 The value is initialized on the first line. The proc_name is passed
 into
 the function where the jobid and vpid are used (not the epoch). The
 function returns a new (the correct) value for the epoch based on the
 passed in jobid and vpid which is assigned to the process name.
 
 This is the same thing as initializing the value to NULL. We do that
 all
 the time to avoid compiler warnings. I just can't do that here because
 this isn't a pointer. If it would make the code look better I can move
 the first assignment to the top of the function where the other
 initializations are.
 
 On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W 
 wrote:
 
 On 8/8/11 9:34 AM, "Jeff Squyres"  wrote:
 
> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
> 
>> The reason is because valgrind was complaining about uninitialized
>> values that were passed into proc_get_epoch. I saw the same warnings
>> from valgrind when I ran it. I added the code to initialize the
>> values
>> to what really should be the default value and the warnings went
>> away.
>> Since the process_name_t struct isn't an object, it doesn't have an
>> initialization function like so many of the other objects in the
>> code.
>> This is what we have.
> 
> Ah, I see -- you are passing peer_name into proc_get_epoch().  I
> missed
> that.
> 
> Thanks!
> 
>>   peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
>>   peer_name.vpid   = peer_idx;
>> +peer_name.epoch  = ORTE_EPOCH_INVALID;
>>   peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
 
 
 I'm still not convinced this is a rational coding style.  It seems to
 me
 that if you're just going to copy the epoch from peer_name to
 peer_name,
 just assign the epoch to INVALID and be done with it.  There's no need
 for
 both the assignment added in this patch and the line after it.
 
 Now, normally I'd say that this is fixing a bug and having to fix other
 people's bad code shouldn't be your problem.  But since you wrote both
 lines, fixing it to make sense seems reasonable to me ;).
 
 Brian
 
 --
 Brian W. Barrett
 Dept. 1423: Scalable System Software
 Sandia National Laboratories
 
 
 
 
 
 
 
 
 
 
 
>>> 
>>> 
>>> -- 
>>> Brian W. Barrett
>>> Dept. 1423: Scalable System

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Barrett, Brian W
4) Don't design an interface where you're going to have the circular
dependency.  Perhaps it's a sign that the epoch is in the wrong place, or
handled in the wrong way, or the whole naming scheme is wrong.  I'm not
sure.  I've relented that it doesn't need to be fixed now (since it is
clearly bigger than just the patch introducing the epochs), but I still
think it's a sign we have a basic design problem with the way we're
handling names.

Brian

On 8/8/11 1:23 PM, "Thomas Herault"  wrote:

>proc_get_epoch( proc );
>calls ORTE_NAME_PRINT(proc)
>when in debug mode.
>
>When this is the case, ORTE_NAME_PRINT prints all fields of proc,
>including epoch. So, proc->epoch = proc_get_epoch(proc) triggers the
>valgrind warning, because of the print, in debug mode.
>This is the only reason why proc->epoch must be initialized before
>calling proc->epoch = proc_get_epoch( proc );
>
>I agree with you it's not a good approach, for coding style / avoiding
>this kind of discussions.
>
>Alternative approaches include:
> - having proc_get_epoch have a side effect on proc->epoch instead of
>returning the correct epoch value. Assignment can then be done before the
>debug message. But this is inconsistent with the rest of the interface
>used in proc_name_fns;
> - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print
>the debug info. But this requires coding an alternative
>ORTE_SOMETHING_PRINT function, only for this use;
> - having proc be defined as an opal_object, and set epoch to INVALID (or
>even UNSET) into the constructor. This could induce changes at many
>places, and there is always the risk that some changes are left
>incomplete.
>
>Thomas
>
>Le 8 août 2011 à 12:20, Barrett, Brian W a écrit :
>
>> Ok, that makes some sense.  It's a really weird way of doing
>> initialization, though.  Some of it is probably how Ralph chose to share
>> nidpid map code, so perhaps it's unavoidable.  But since the name isn't
>> use in proc_get_epoch() (or, better not be used in there), I'm not
>> entirely sure why that made valgrind happy...
>> 
>> Brian
>> 
>> On 8/8/11 9:53 AM, "Wesley Bland"  wrote:
>> 
>>> It's not just copying the value from one to the other.
>>> The value is initialized on the first line. The proc_name is passed
>>>into
>>> the function where the jobid and vpid are used (not the epoch). The
>>> function returns a new (the correct) value for the epoch based on the
>>> passed in jobid and vpid which is assigned to the process name.
>>> 
>>> This is the same thing as initializing the value to NULL. We do that
>>>all
>>> the time to avoid compiler warnings. I just can't do that here because
>>> this isn't a pointer. If it would make the code look better I can move
>>> the first assignment to the top of the function where the other
>>> initializations are.
>>> 
>>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W 
>>> wrote:
>>> 
>>> On 8/8/11 9:34 AM, "Jeff Squyres"  wrote:
>>> 
 On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
 
> The reason is because valgrind was complaining about uninitialized
> values that were passed into proc_get_epoch. I saw the same warnings
> from valgrind when I ran it. I added the code to initialize the
>values
> to what really should be the default value and the warnings went
>away.
> Since the process_name_t struct isn't an object, it doesn't have an
> initialization function like so many of the other objects in the
>code.
> This is what we have.
 
 Ah, I see -- you are passing peer_name into proc_get_epoch().  I
missed
 that.
 
 Thanks!
 
>peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
>peer_name.vpid   = peer_idx;
> +peer_name.epoch  = ORTE_EPOCH_INVALID;
>peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
>>> 
>>> 
>>> I'm still not convinced this is a rational coding style.  It seems to
>>>me
>>> that if you're just going to copy the epoch from peer_name to
>>>peer_name,
>>> just assign the epoch to INVALID and be done with it.  There's no need
>>>for
>>> both the assignment added in this patch and the line after it.
>>> 
>>> Now, normally I'd say that this is fixing a bug and having to fix other
>>> people's bad code shouldn't be your problem.  But since you wrote both
>>> lines, fixing it to make sense seems reasonable to me ;).
>>> 
>>> Brian
>>> 
>>> --
>>> Brian W. Barrett
>>> Dept. 1423: Scalable System Software
>>> Sandia National Laboratories
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>>  Brian W. Barrett
>>  Dept. 1423: Scalable System Software
>>  Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>___
>devel mailing list
>de...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>


-- 
  Brian W. Barrett
  Dep

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Thomas Herault
proc_get_epoch( proc );
calls ORTE_NAME_PRINT(proc)
when in debug mode.

When this is the case, ORTE_NAME_PRINT prints all fields of proc, including 
epoch. So, proc->epoch = proc_get_epoch(proc) triggers the valgrind warning, 
because of the print, in debug mode.
This is the only reason why proc->epoch must be initialized before calling 
proc->epoch = proc_get_epoch( proc );

I agree with you it's not a good approach, for coding style / avoiding this 
kind of discussions.

Alternative approaches include:
 - having proc_get_epoch have a side effect on proc->epoch instead of returning 
the correct epoch value. Assignment can then be done before the debug message. 
But this is inconsistent with the rest of the interface used in proc_name_fns;
 - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print the 
debug info. But this requires coding an alternative ORTE_SOMETHING_PRINT 
function, only for this use;
 - having proc be defined as an opal_object, and set epoch to INVALID (or even 
UNSET) into the constructor. This could induce changes at many places, and 
there is always the risk that some changes are left incomplete.

Thomas

Le 8 août 2011 à 12:20, Barrett, Brian W a écrit :

> Ok, that makes some sense.  It's a really weird way of doing
> initialization, though.  Some of it is probably how Ralph chose to share
> nidpid map code, so perhaps it's unavoidable.  But since the name isn't
> use in proc_get_epoch() (or, better not be used in there), I'm not
> entirely sure why that made valgrind happy...
> 
> Brian
> 
> On 8/8/11 9:53 AM, "Wesley Bland"  wrote:
> 
>> It's not just copying the value from one to the other.
>> The value is initialized on the first line. The proc_name is passed into
>> the function where the jobid and vpid are used (not the epoch). The
>> function returns a new (the correct) value for the epoch based on the
>> passed in jobid and vpid which is assigned to the process name.
>> 
>> This is the same thing as initializing the value to NULL. We do that all
>> the time to avoid compiler warnings. I just can't do that here because
>> this isn't a pointer. If it would make the code look better I can move
>> the first assignment to the top of the function where the other
>> initializations are.
>> 
>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W 
>> wrote:
>> 
>> On 8/8/11 9:34 AM, "Jeff Squyres"  wrote:
>> 
>>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
>>> 
 The reason is because valgrind was complaining about uninitialized
 values that were passed into proc_get_epoch. I saw the same warnings
 from valgrind when I ran it. I added the code to initialize the values
 to what really should be the default value and the warnings went away.
 Since the process_name_t struct isn't an object, it doesn't have an
 initialization function like so many of the other objects in the code.
 This is what we have.
>>> 
>>> Ah, I see -- you are passing peer_name into proc_get_epoch().  I missed
>>> that.
>>> 
>>> Thanks!
>>> 
peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
peer_name.vpid   = peer_idx;
 +peer_name.epoch  = ORTE_EPOCH_INVALID;
peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
>> 
>> 
>> I'm still not convinced this is a rational coding style.  It seems to me
>> that if you're just going to copy the epoch from peer_name to peer_name,
>> just assign the epoch to INVALID and be done with it.  There's no need for
>> both the assignment added in this patch and the line after it.
>> 
>> Now, normally I'd say that this is fixing a bug and having to fix other
>> people's bad code shouldn't be your problem.  But since you wrote both
>> lines, fixing it to make sense seems reasonable to me ;).
>> 
>> Brian
>> 
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories
> 
> 
> 
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Barrett, Brian W
Ok, that makes some sense.  It's a really weird way of doing
initialization, though.  Some of it is probably how Ralph chose to share
nidpid map code, so perhaps it's unavoidable.  But since the name isn't
use in proc_get_epoch() (or, better not be used in there), I'm not
entirely sure why that made valgrind happy...

Brian

On 8/8/11 9:53 AM, "Wesley Bland"  wrote:

>It's not just copying the value from one to the other.
>The value is initialized on the first line. The proc_name is passed into
>the function where the jobid and vpid are used (not the epoch). The
>function returns a new (the correct) value for the epoch based on the
>passed in jobid and vpid which is assigned to the process name.
>
>This is the same thing as initializing the value to NULL. We do that all
>the time to avoid compiler warnings. I just can't do that here because
>this isn't a pointer. If it would make the code look better I can move
>the first assignment to the top of the function where the other
>initializations are.
>
>On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W 
>wrote:
>
>On 8/8/11 9:34 AM, "Jeff Squyres"  wrote:
>
>>On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
>>
>>> The reason is because valgrind was complaining about uninitialized
>>>values that were passed into proc_get_epoch. I saw the same warnings
>>>from valgrind when I ran it. I added the code to initialize the values
>>>to what really should be the default value and the warnings went away.
>>>Since the process_name_t struct isn't an object, it doesn't have an
>>>initialization function like so many of the other objects in the code.
>>>This is what we have.
>>
>>Ah, I see -- you are passing peer_name into proc_get_epoch().  I missed
>>that.
>>
>>Thanks!
>>
>>> peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
>>> peer_name.vpid   = peer_idx;
>>> +peer_name.epoch  = ORTE_EPOCH_INVALID;
>>> peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
>
>
>I'm still not convinced this is a rational coding style.  It seems to me
>that if you're just going to copy the epoch from peer_name to peer_name,
>just assign the epoch to INVALID and be done with it.  There's no need for
>both the assignment added in this patch and the line after it.
>
>Now, normally I'd say that this is fixing a bug and having to fix other
>people's bad code shouldn't be your problem.  But since you wrote both
>lines, fixing it to make sense seems reasonable to me ;).
>
>Brian
>
>--
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories
>
>
>
>
>
>
>
>
>
>
>


-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories








Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Wesley Bland
It's not just copying the value from one to the other.

The value is initialized on the first line. The proc_name is passed into the
function where the jobid and vpid are used (not the epoch). The function
returns a new (the correct) value for the epoch based on the passed in jobid
and vpid which is assigned to the process name.

This is the same thing as initializing the value to NULL. We do that all the
time to avoid compiler warnings. I just can't do that here because this
isn't a pointer. If it would make the code look better I can move the first
assignment to the top of the function where the other initializations are.

On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W wrote:

> On 8/8/11 9:34 AM, "Jeff Squyres"  wrote:
>
> >On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
> >
> >> The reason is because valgrind was complaining about uninitialized
> >>values that were passed into proc_get_epoch. I saw the same warnings
> >>from valgrind when I ran it. I added the code to initialize the values
> >>to what really should be the default value and the warnings went away.
> >>Since the process_name_t struct isn't an object, it doesn't have an
> >>initialization function like so many of the other objects in the code.
> >>This is what we have.
> >
> >Ah, I see -- you are passing peer_name into proc_get_epoch().  I missed
> >that.
> >
> >Thanks!
> >
> >> peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
> >> peer_name.vpid   = peer_idx;
> >> +peer_name.epoch  = ORTE_EPOCH_INVALID;
> >> peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
>
> I'm still not convinced this is a rational coding style.  It seems to me
> that if you're just going to copy the epoch from peer_name to peer_name,
> just assign the epoch to INVALID and be done with it.  There's no need for
> both the assignment added in this patch and the line after it.
>
> Now, normally I'd say that this is fixing a bug and having to fix other
> people's bad code shouldn't be your problem.  But since you wrote both
> lines, fixing it to make sense seems reasonable to me ;).
>
> Brian
>
> --
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories
>
>
>
>
>
>


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Barrett, Brian W
On 8/8/11 9:34 AM, "Jeff Squyres"  wrote:

>On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
>
>> The reason is because valgrind was complaining about uninitialized
>>values that were passed into proc_get_epoch. I saw the same warnings
>>from valgrind when I ran it. I added the code to initialize the values
>>to what really should be the default value and the warnings went away.
>>Since the process_name_t struct isn't an object, it doesn't have an
>>initialization function like so many of the other objects in the code.
>>This is what we have.
>
>Ah, I see -- you are passing peer_name into proc_get_epoch().  I missed
>that.
>
>Thanks!
>
>> peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
>> peer_name.vpid   = peer_idx;
>> +peer_name.epoch  = ORTE_EPOCH_INVALID;
>> peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);

I'm still not convinced this is a rational coding style.  It seems to me
that if you're just going to copy the epoch from peer_name to peer_name,
just assign the epoch to INVALID and be done with it.  There's no need for
both the assignment added in this patch and the line after it.

Now, normally I'd say that this is fixing a bug and having to fix other
people's bad code shouldn't be your problem.  But since you wrote both
lines, fixing it to make sense seems reasonable to me ;).

Brian

-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories








Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Jeff Squyres
On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:

> The reason is because valgrind was complaining about uninitialized values 
> that were passed into proc_get_epoch. I saw the same warnings from valgrind 
> when I ran it. I added the code to initialize the values to what really 
> should be the default value and the warnings went away. Since the 
> process_name_t struct isn't an object, it doesn't have an initialization 
> function like so many of the other objects in the code. This is what we have.

Ah, I see -- you are passing peer_name into proc_get_epoch().  I missed that.

Thanks!

> peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
> peer_name.vpid   = peer_idx;
> +peer_name.epoch  = ORTE_EPOCH_INVALID;
> peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Wesley Bland
The reason is because valgrind was complaining about uninitialized values
that were passed into proc_get_epoch. I saw the same warnings from valgrind
when I ran it. I added the code to initialize the values to what really
should be the default value and the warnings went away. Since the
process_name_t struct isn't an object, it doesn't have an initialization
function like so many of the other objects in the code. This is what we
have.

In reality this code is pretty useless and the value does get overridden.
The purpose of passing the process_name into the function is to be able to
get out the jobid and vpid. The epoch value is completely ignored except
when printing out the process_name struct.

Wes

On Mon, Aug 8, 2011 at 11:18 AM, Jeff Squyres  wrote:

> FYI: Ralph's out today.  He'll be back tomorrow.
>
> I'm not really part of this ORTE discussion, but I am curious about a code
> style that I see in this commit: assigning ORTE_EPOCH_INVALID to a field,
> and then immediately overwriting that field with another value.  E.g.:
>
> > peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
> > peer_name.vpid   = peer_idx;
> > +peer_name.epoch  = ORTE_EPOCH_INVALID;
> > peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
>
>
> This technique is used throughout this patch.
>
> What is the purpose for this?  As I understand it, this won't squash any
> valgrind warnings, and may even get eliminated by the compiler as dead code
> because it seems to be useless.
>
>
>
> On Aug 8, 2011, at 11:11 AM, wbl...@osl.iu.edu wrote:
>
> > Author: wbland
> > Date: 2011-08-08 11:11:55 EDT (Mon, 08 Aug 2011)
> > New Revision: 25015
> > URL: https://svn.open-mpi.org/trac/ompi/changeset/25015
> >
> > Log:
> > Make sure that the epoch is initialized everywhere so we don't get weird
> output
> > during valgrind. This shouldn't have caused any problems with any actual
> > execution. Just extra warnings in valgrind.
> >
> >
> > Text files modified:
> >   trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 2 ++
> >   trunk/ompi/proc/proc.c| 2 +-
> >   trunk/orte/mca/ess/alps/ess_alps_module.c | 1 +
> >   trunk/orte/mca/ess/env/ess_env_module.c   | 1 +
> >   trunk/orte/mca/ess/lsf/ess_lsf_module.c   | 1 +
> >   trunk/orte/mca/ess/slave/ess_slave_module.c   | 1 +
> >   trunk/orte/mca/ess/slurm/ess_slurm_module.c   | 1 +
> >   trunk/orte/mca/grpcomm/base/grpcomm_base_coll.c   |12
> +++-
> >   trunk/orte/mca/iof/hnp/iof_hnp.c  | 1 +
> >   trunk/orte/mca/odls/base/odls_base_default_fns.c  | 1 +
> >   trunk/orte/mca/odls/base/odls_base_open.c | 1 +
> >   trunk/orte/mca/plm/base/plm_base_launch_support.c | 1 +
> >   trunk/orte/mca/plm/base/plm_base_orted_cmds.c | 2 ++
> >   trunk/orte/mca/plm/base/plm_base_receive.c| 1 +
> >   trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c| 3 +++
> >   trunk/orte/mca/rmaps/rank_file/rmaps_rank_file.c  | 1 +
> >   trunk/orte/mca/rmaps/seq/rmaps_seq.c  | 1 +
> >   trunk/orte/mca/rml/oob/rml_oob_component.c| 4 
> >   trunk/orte/mca/routed/binomial/routed_binomial.c  | 4 
> >   trunk/orte/mca/routed/cm/routed_cm.c  | 4 
> >   trunk/orte/mca/routed/linear/routed_linear.c  | 2 ++
> >   trunk/orte/mca/routed/radix/routed_radix.c| 3 +++
> >   trunk/orte/mca/routed/slave/routed_slave.c| 1 +
> >   trunk/orte/mca/sstore/central/sstore_central_global.c | 1 +
> >   trunk/orte/mca/sstore/stage/sstore_stage_global.c | 1 +
> >   trunk/orte/orted/orted_comm.c | 1 +
> >   trunk/orte/test/system/oob_stress.c   | 2 +-
> >   trunk/orte/test/system/orte_ring.c| 2 ++
> >   trunk/orte/test/system/orte_spawn.c   | 1 +
> >   29 files changed, 48 insertions(+), 11 deletions(-)
> >
> > Modified: trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> >
> ==
> > --- trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c(original)
> > +++ trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c2011-08-08 11:11:55
> EDT (Mon, 08 Aug 2011)
> > @@ -5284,6 +5284,7 @@
> >  */
> > peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
> > peer_name.vpid   = peer_idx;
> > +peer_name.epoch  = ORTE_EPOCH_INVALID;
> > peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
> >
> > if( NULL == (peer_ref = find_peer(peer_name))) {
> > @@ -5345,6 +5346,7 @@
> >
> > peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
> > peer_name.vpid   = peer_idx;
> > +peer_name.epoch  = ORTE_EPOCH_INVALID;
> > peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
> >
> > if ( 0 > (ret = orte_rml.recv_buffer_nb(&peer_nam

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015

2011-08-08 Thread Jeff Squyres
FYI: Ralph's out today.  He'll be back tomorrow.

I'm not really part of this ORTE discussion, but I am curious about a code 
style that I see in this commit: assigning ORTE_EPOCH_INVALID to a field, and 
then immediately overwriting that field with another value.  E.g.:

> peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
> peer_name.vpid   = peer_idx;
> +peer_name.epoch  = ORTE_EPOCH_INVALID;
> peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);


This technique is used throughout this patch.

What is the purpose for this?  As I understand it, this won't squash any 
valgrind warnings, and may even get eliminated by the compiler as dead code 
because it seems to be useless.



On Aug 8, 2011, at 11:11 AM, wbl...@osl.iu.edu wrote:

> Author: wbland
> Date: 2011-08-08 11:11:55 EDT (Mon, 08 Aug 2011)
> New Revision: 25015
> URL: https://svn.open-mpi.org/trac/ompi/changeset/25015
> 
> Log:
> Make sure that the epoch is initialized everywhere so we don't get weird 
> output
> during valgrind. This shouldn't have caused any problems with any actual
> execution. Just extra warnings in valgrind.
> 
> 
> Text files modified: 
>   trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 2 ++
>   
>   trunk/ompi/proc/proc.c| 2 +-
>   
>   trunk/orte/mca/ess/alps/ess_alps_module.c | 1 + 
>   
>   trunk/orte/mca/ess/env/ess_env_module.c   | 1 + 
>   
>   trunk/orte/mca/ess/lsf/ess_lsf_module.c   | 1 + 
>   
>   trunk/orte/mca/ess/slave/ess_slave_module.c   | 1 + 
>   
>   trunk/orte/mca/ess/slurm/ess_slurm_module.c   | 1 + 
>   
>   trunk/orte/mca/grpcomm/base/grpcomm_base_coll.c   |12 +++-  
>   
>   trunk/orte/mca/iof/hnp/iof_hnp.c  | 1 + 
>   
>   trunk/orte/mca/odls/base/odls_base_default_fns.c  | 1 + 
>   
>   trunk/orte/mca/odls/base/odls_base_open.c | 1 + 
>   
>   trunk/orte/mca/plm/base/plm_base_launch_support.c | 1 + 
>   
>   trunk/orte/mca/plm/base/plm_base_orted_cmds.c | 2 ++
>   
>   trunk/orte/mca/plm/base/plm_base_receive.c| 1 + 
>   
>   trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c| 3 +++   
>   
>   trunk/orte/mca/rmaps/rank_file/rmaps_rank_file.c  | 1 + 
>   
>   trunk/orte/mca/rmaps/seq/rmaps_seq.c  | 1 + 
>   
>   trunk/orte/mca/rml/oob/rml_oob_component.c| 4   
>   
>   trunk/orte/mca/routed/binomial/routed_binomial.c  | 4   
>   
>   trunk/orte/mca/routed/cm/routed_cm.c  | 4   
>   
>   trunk/orte/mca/routed/linear/routed_linear.c  | 2 ++
>   
>   trunk/orte/mca/routed/radix/routed_radix.c| 3 +++   
>   
>   trunk/orte/mca/routed/slave/routed_slave.c| 1 + 
>   
>   trunk/orte/mca/sstore/central/sstore_central_global.c | 1 + 
>   
>   trunk/orte/mca/sstore/stage/sstore_stage_global.c | 1 + 
>   
>   trunk/orte/orted/orted_comm.c | 1 + 
>   
>   trunk/orte/test/system/oob_stress.c   | 2 +-
>   
>   trunk/orte/test/system/orte_ring.c| 2 ++
>   
>   trunk/orte/test/system/orte_spawn.c   | 1 + 
>   
>   29 files changed, 48 insertions(+), 11 deletions(-)
> 
> Modified: trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> ==
> --- trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c(original)
> +++ trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c2011-08-08 11:11:55 EDT 
> (Mon, 08 Aug 2011)
> @@ -5284,6 +5284,7 @@
>  */
> peer_name.jobid  = ORTE_PROC_MY_NAME->jobid;
> peer_name.vpid   = peer_idx;
> +peer_name.epoch  = ORTE_EPOCH_INVALID;
> peer_name.epoch  = orte_ess.proc_get_epoch(&peer_name);
> 
> if( NULL == (peer_ref = find_peer(peer_name))) {
> @@ -53