Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
Probably should have injected my earlier thoughts here. IMO, Brian is correct - we do have a basic design problem, and it does need to be corrected. It's a bigger discussion than just this commit, though, and we probably should add it to the Tues telecon so we at least get some initial ideas on how to approach it. On Aug 8, 2011, at 1:30 PM, Barrett, Brian W wrote: > 4) Don't design an interface where you're going to have the circular > dependency. Perhaps it's a sign that the epoch is in the wrong place, or > handled in the wrong way, or the whole naming scheme is wrong. I'm not > sure. I've relented that it doesn't need to be fixed now (since it is > clearly bigger than just the patch introducing the epochs), but I still > think it's a sign we have a basic design problem with the way we're > handling names. > > Brian > > On 8/8/11 1:23 PM, "Thomas Herault" wrote: > >> proc_get_epoch( proc ); >> calls ORTE_NAME_PRINT(proc) >> when in debug mode. >> >> When this is the case, ORTE_NAME_PRINT prints all fields of proc, >> including epoch. So, proc->epoch = proc_get_epoch(proc) triggers the >> valgrind warning, because of the print, in debug mode. >> This is the only reason why proc->epoch must be initialized before >> calling proc->epoch = proc_get_epoch( proc ); >> >> I agree with you it's not a good approach, for coding style / avoiding >> this kind of discussions. >> >> Alternative approaches include: >> - having proc_get_epoch have a side effect on proc->epoch instead of >> returning the correct epoch value. Assignment can then be done before the >> debug message. But this is inconsistent with the rest of the interface >> used in proc_name_fns; >> - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print >> the debug info. But this requires coding an alternative >> ORTE_SOMETHING_PRINT function, only for this use; >> - having proc be defined as an opal_object, and set epoch to INVALID (or >> even UNSET) into the constructor. This could induce changes at many >> places, and there is always the risk that some changes are left >> incomplete. >> >> Thomas >> >> Le 8 août 2011 à 12:20, Barrett, Brian W a écrit : >> >>> Ok, that makes some sense. It's a really weird way of doing >>> initialization, though. Some of it is probably how Ralph chose to share >>> nidpid map code, so perhaps it's unavoidable. But since the name isn't >>> use in proc_get_epoch() (or, better not be used in there), I'm not >>> entirely sure why that made valgrind happy... >>> >>> Brian >>> >>> On 8/8/11 9:53 AM, "Wesley Bland" wrote: >>> It's not just copying the value from one to the other. The value is initialized on the first line. The proc_name is passed into the function where the jobid and vpid are used (not the epoch). The function returns a new (the correct) value for the epoch based on the passed in jobid and vpid which is assigned to the process name. This is the same thing as initializing the value to NULL. We do that all the time to avoid compiler warnings. I just can't do that here because this isn't a pointer. If it would make the code look better I can move the first assignment to the top of the function where the other initializations are. On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W wrote: On 8/8/11 9:34 AM, "Jeff Squyres" wrote: > On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: > >> The reason is because valgrind was complaining about uninitialized >> values that were passed into proc_get_epoch. I saw the same warnings >> from valgrind when I ran it. I added the code to initialize the >> values >> to what really should be the default value and the warnings went >> away. >> Since the process_name_t struct isn't an object, it doesn't have an >> initialization function like so many of the other objects in the >> code. >> This is what we have. > > Ah, I see -- you are passing peer_name into proc_get_epoch(). I > missed > that. > > Thanks! > >> peer_name.jobid = ORTE_PROC_MY_NAME->jobid; >> peer_name.vpid = peer_idx; >> +peer_name.epoch = ORTE_EPOCH_INVALID; >> peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); I'm still not convinced this is a rational coding style. It seems to me that if you're just going to copy the epoch from peer_name to peer_name, just assign the epoch to INVALID and be done with it. There's no need for both the assignment added in this patch and the line after it. Now, normally I'd say that this is fixing a bug and having to fix other people's bad code shouldn't be your problem. But since you wrote both lines, fixing it to make sense seems reasonable to me ;). Brian -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia Natio
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
On Aug 8, 2011, at 1:23 PM, Thomas Herault wrote: > proc_get_epoch( proc ); > calls ORTE_NAME_PRINT(proc) > when in debug mode. > > When this is the case, ORTE_NAME_PRINT prints all fields of proc, including > epoch. So, proc->epoch = proc_get_epoch(proc) triggers the valgrind warning, > because of the print, in debug mode. > This is the only reason why proc->epoch must be initialized before calling > proc->epoch = proc_get_epoch( proc ); > > I agree with you it's not a good approach, for coding style / avoiding this > kind of discussions. > > Alternative approaches include: > - having proc_get_epoch have a side effect on proc->epoch instead of > returning the correct epoch value. Assignment can then be done before the > debug message. But this is inconsistent with the rest of the interface used > in proc_name_fns; > - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print the > debug info. But this requires coding an alternative ORTE_SOMETHING_PRINT > function, only for this use; > - having proc be defined as an opal_object, and set epoch to INVALID (or even > UNSET) into the constructor. This could induce changes at many places, and > there is always the risk that some changes are left incomplete. Just an FYI here. We have had the discussion of converting orte_process_name_t to an opal_object several times. The biggest reason for not doing so is that the memory consumed by the opal_object_t itself is actually greater than the memory used by the name fields. Since we store a lot of process names, this results in a lot of overhead memory loss. However, we do have the problem of names not being initialized, and this has caused trouble a number of times over the last few years. So maybe it would be worth the footprint to resolve the problem once and for all? Something we should discuss, methinks. > > Thomas > > Le 8 août 2011 à 12:20, Barrett, Brian W a écrit : > >> Ok, that makes some sense. It's a really weird way of doing >> initialization, though. Some of it is probably how Ralph chose to share >> nidpid map code, so perhaps it's unavoidable. But since the name isn't >> use in proc_get_epoch() (or, better not be used in there), I'm not >> entirely sure why that made valgrind happy... >> >> Brian >> >> On 8/8/11 9:53 AM, "Wesley Bland" wrote: >> >>> It's not just copying the value from one to the other. >>> The value is initialized on the first line. The proc_name is passed into >>> the function where the jobid and vpid are used (not the epoch). The >>> function returns a new (the correct) value for the epoch based on the >>> passed in jobid and vpid which is assigned to the process name. >>> >>> This is the same thing as initializing the value to NULL. We do that all >>> the time to avoid compiler warnings. I just can't do that here because >>> this isn't a pointer. If it would make the code look better I can move >>> the first assignment to the top of the function where the other >>> initializations are. >>> >>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W >>> wrote: >>> >>> On 8/8/11 9:34 AM, "Jeff Squyres" wrote: >>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: > The reason is because valgrind was complaining about uninitialized > values that were passed into proc_get_epoch. I saw the same warnings > from valgrind when I ran it. I added the code to initialize the values > to what really should be the default value and the warnings went away. > Since the process_name_t struct isn't an object, it doesn't have an > initialization function like so many of the other objects in the code. > This is what we have. Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed that. Thanks! > peer_name.jobid = ORTE_PROC_MY_NAME->jobid; > peer_name.vpid = peer_idx; > +peer_name.epoch = ORTE_EPOCH_INVALID; > peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); >>> >>> >>> I'm still not convinced this is a rational coding style. It seems to me >>> that if you're just going to copy the epoch from peer_name to peer_name, >>> just assign the epoch to INVALID and be done with it. There's no need for >>> both the assignment added in this patch and the line after it. >>> >>> Now, normally I'd say that this is fixing a bug and having to fix other >>> people's bad code shouldn't be your problem. But since you wrote both >>> lines, fixing it to make sense seems reasonable to me ;). >>> >>> Brian >>> >>> -- >>> Brian W. Barrett >>> Dept. 1423: Scalable System Software >>> Sandia National Laboratories >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >> >> >> -- >> Brian W. Barrett >> Dept. 1423: Scalable System Software >> Sandia National Laboratories >> >> >> >> >> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > >
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
I will interject at this point in the thread. IMO, Brian has raised a valid point. We are contorting around with the epoch values, as I commented in a separate thread. The problem really stems from something Wes mentioned a while back in a different set of communications. When the epoch code was written, it wasn't entirely clear to him where the data required for supporting the epoch-related functionality should go. It doesn't really fit the ess, it didn't really belong in modex, etc. So he wound up sticking things here and there, trying to disturb the existing APIs as little as possible. What we probably should do is (a) try to better understand what Wes is trying to accomplish, and then (b) design the code base to support it. The current functional APIs and nidmap code weren't designed to support the extended name, and so the fit is sometimes wrong. Getting things designed properly would help ease the problem. In fairness, remember that Wes and I had already agreed to review a bunch of this (e.g., where epoch info is stored) when we finally get back to dealing with the orte state engine. Hopefully, we can address some or all of these issues then. Meantime, what's there is certainly "weird" - but if it works for now, then perhaps we can get things correctly integrated in the near future. HTH Ralph On Aug 8, 2011, at 10:20 AM, Barrett, Brian W wrote: > Ok, that makes some sense. It's a really weird way of doing > initialization, though. Some of it is probably how Ralph chose to share > nidpid map code, so perhaps it's unavoidable. But since the name isn't > use in proc_get_epoch() (or, better not be used in there), I'm not > entirely sure why that made valgrind happy... > > Brian > > On 8/8/11 9:53 AM, "Wesley Bland" wrote: > >> It's not just copying the value from one to the other. >> The value is initialized on the first line. The proc_name is passed into >> the function where the jobid and vpid are used (not the epoch). The >> function returns a new (the correct) value for the epoch based on the >> passed in jobid and vpid which is assigned to the process name. >> >> This is the same thing as initializing the value to NULL. We do that all >> the time to avoid compiler warnings. I just can't do that here because >> this isn't a pointer. If it would make the code look better I can move >> the first assignment to the top of the function where the other >> initializations are. >> >> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W >> wrote: >> >> On 8/8/11 9:34 AM, "Jeff Squyres" wrote: >> >>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: >>> The reason is because valgrind was complaining about uninitialized values that were passed into proc_get_epoch. I saw the same warnings from valgrind when I ran it. I added the code to initialize the values to what really should be the default value and the warnings went away. Since the process_name_t struct isn't an object, it doesn't have an initialization function like so many of the other objects in the code. This is what we have. >>> >>> Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed >>> that. >>> >>> Thanks! >>> peer_name.jobid = ORTE_PROC_MY_NAME->jobid; peer_name.vpid = peer_idx; +peer_name.epoch = ORTE_EPOCH_INVALID; peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); >> >> >> I'm still not convinced this is a rational coding style. It seems to me >> that if you're just going to copy the epoch from peer_name to peer_name, >> just assign the epoch to INVALID and be done with it. There's no need for >> both the assignment added in this patch and the line after it. >> >> Now, normally I'd say that this is fixing a bug and having to fix other >> people's bad code shouldn't be your problem. But since you wrote both >> lines, fixing it to make sense seems reasonable to me ;). >> >> Brian >> >> -- >> Brian W. Barrett >> Dept. 1423: Scalable System Software >> Sandia National Laboratories >> >> >> >> >> >> >> >> >> >> >> > > > -- > Brian W. Barrett > Dept. 1423: Scalable System Software > Sandia National Laboratories > > > > > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
I see no problem with the latest code from Wesley. The warning is silenced, and was not indicative of any buggy behavior. So for me the problem is closed. Aurelien Le 8 août 2011 à 15:30, Barrett, Brian W a écrit : > 4) Don't design an interface where you're going to have the circular > dependency. Perhaps it's a sign that the epoch is in the wrong place, or > handled in the wrong way, or the whole naming scheme is wrong. I'm not > sure. I've relented that it doesn't need to be fixed now (since it is > clearly bigger than just the patch introducing the epochs), but I still > think it's a sign we have a basic design problem with the way we're > handling names. > > Brian > > On 8/8/11 1:23 PM, "Thomas Herault" wrote: > >> proc_get_epoch( proc ); >> calls ORTE_NAME_PRINT(proc) >> when in debug mode. >> >> When this is the case, ORTE_NAME_PRINT prints all fields of proc, >> including epoch. So, proc->epoch = proc_get_epoch(proc) triggers the >> valgrind warning, because of the print, in debug mode. >> This is the only reason why proc->epoch must be initialized before >> calling proc->epoch = proc_get_epoch( proc ); >> >> I agree with you it's not a good approach, for coding style / avoiding >> this kind of discussions. >> >> Alternative approaches include: >> - having proc_get_epoch have a side effect on proc->epoch instead of >> returning the correct epoch value. Assignment can then be done before the >> debug message. But this is inconsistent with the rest of the interface >> used in proc_name_fns; >> - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print >> the debug info. But this requires coding an alternative >> ORTE_SOMETHING_PRINT function, only for this use; >> - having proc be defined as an opal_object, and set epoch to INVALID (or >> even UNSET) into the constructor. This could induce changes at many >> places, and there is always the risk that some changes are left >> incomplete. >> >> Thomas >> >> Le 8 août 2011 à 12:20, Barrett, Brian W a écrit : >> >>> Ok, that makes some sense. It's a really weird way of doing >>> initialization, though. Some of it is probably how Ralph chose to share >>> nidpid map code, so perhaps it's unavoidable. But since the name isn't >>> use in proc_get_epoch() (or, better not be used in there), I'm not >>> entirely sure why that made valgrind happy... >>> >>> Brian >>> >>> On 8/8/11 9:53 AM, "Wesley Bland" wrote: >>> It's not just copying the value from one to the other. The value is initialized on the first line. The proc_name is passed into the function where the jobid and vpid are used (not the epoch). The function returns a new (the correct) value for the epoch based on the passed in jobid and vpid which is assigned to the process name. This is the same thing as initializing the value to NULL. We do that all the time to avoid compiler warnings. I just can't do that here because this isn't a pointer. If it would make the code look better I can move the first assignment to the top of the function where the other initializations are. On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W wrote: On 8/8/11 9:34 AM, "Jeff Squyres" wrote: > On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: > >> The reason is because valgrind was complaining about uninitialized >> values that were passed into proc_get_epoch. I saw the same warnings >> from valgrind when I ran it. I added the code to initialize the >> values >> to what really should be the default value and the warnings went >> away. >> Since the process_name_t struct isn't an object, it doesn't have an >> initialization function like so many of the other objects in the >> code. >> This is what we have. > > Ah, I see -- you are passing peer_name into proc_get_epoch(). I > missed > that. > > Thanks! > >> peer_name.jobid = ORTE_PROC_MY_NAME->jobid; >> peer_name.vpid = peer_idx; >> +peer_name.epoch = ORTE_EPOCH_INVALID; >> peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); I'm still not convinced this is a rational coding style. It seems to me that if you're just going to copy the epoch from peer_name to peer_name, just assign the epoch to INVALID and be done with it. There's no need for both the assignment added in this patch and the line after it. Now, normally I'd say that this is fixing a bug and having to fix other people's bad code shouldn't be your problem. But since you wrote both lines, fixing it to make sense seems reasonable to me ;). Brian -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia National Laboratories >>> >>> >>> -- >>> Brian W. Barrett >>> Dept. 1423: Scalable System
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
4) Don't design an interface where you're going to have the circular dependency. Perhaps it's a sign that the epoch is in the wrong place, or handled in the wrong way, or the whole naming scheme is wrong. I'm not sure. I've relented that it doesn't need to be fixed now (since it is clearly bigger than just the patch introducing the epochs), but I still think it's a sign we have a basic design problem with the way we're handling names. Brian On 8/8/11 1:23 PM, "Thomas Herault" wrote: >proc_get_epoch( proc ); >calls ORTE_NAME_PRINT(proc) >when in debug mode. > >When this is the case, ORTE_NAME_PRINT prints all fields of proc, >including epoch. So, proc->epoch = proc_get_epoch(proc) triggers the >valgrind warning, because of the print, in debug mode. >This is the only reason why proc->epoch must be initialized before >calling proc->epoch = proc_get_epoch( proc ); > >I agree with you it's not a good approach, for coding style / avoiding >this kind of discussions. > >Alternative approaches include: > - having proc_get_epoch have a side effect on proc->epoch instead of >returning the correct epoch value. Assignment can then be done before the >debug message. But this is inconsistent with the rest of the interface >used in proc_name_fns; > - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print >the debug info. But this requires coding an alternative >ORTE_SOMETHING_PRINT function, only for this use; > - having proc be defined as an opal_object, and set epoch to INVALID (or >even UNSET) into the constructor. This could induce changes at many >places, and there is always the risk that some changes are left >incomplete. > >Thomas > >Le 8 août 2011 à 12:20, Barrett, Brian W a écrit : > >> Ok, that makes some sense. It's a really weird way of doing >> initialization, though. Some of it is probably how Ralph chose to share >> nidpid map code, so perhaps it's unavoidable. But since the name isn't >> use in proc_get_epoch() (or, better not be used in there), I'm not >> entirely sure why that made valgrind happy... >> >> Brian >> >> On 8/8/11 9:53 AM, "Wesley Bland" wrote: >> >>> It's not just copying the value from one to the other. >>> The value is initialized on the first line. The proc_name is passed >>>into >>> the function where the jobid and vpid are used (not the epoch). The >>> function returns a new (the correct) value for the epoch based on the >>> passed in jobid and vpid which is assigned to the process name. >>> >>> This is the same thing as initializing the value to NULL. We do that >>>all >>> the time to avoid compiler warnings. I just can't do that here because >>> this isn't a pointer. If it would make the code look better I can move >>> the first assignment to the top of the function where the other >>> initializations are. >>> >>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W >>> wrote: >>> >>> On 8/8/11 9:34 AM, "Jeff Squyres" wrote: >>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: > The reason is because valgrind was complaining about uninitialized > values that were passed into proc_get_epoch. I saw the same warnings > from valgrind when I ran it. I added the code to initialize the >values > to what really should be the default value and the warnings went >away. > Since the process_name_t struct isn't an object, it doesn't have an > initialization function like so many of the other objects in the >code. > This is what we have. Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed that. Thanks! >peer_name.jobid = ORTE_PROC_MY_NAME->jobid; >peer_name.vpid = peer_idx; > +peer_name.epoch = ORTE_EPOCH_INVALID; >peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); >>> >>> >>> I'm still not convinced this is a rational coding style. It seems to >>>me >>> that if you're just going to copy the epoch from peer_name to >>>peer_name, >>> just assign the epoch to INVALID and be done with it. There's no need >>>for >>> both the assignment added in this patch and the line after it. >>> >>> Now, normally I'd say that this is fixing a bug and having to fix other >>> people's bad code shouldn't be your problem. But since you wrote both >>> lines, fixing it to make sense seems reasonable to me ;). >>> >>> Brian >>> >>> -- >>> Brian W. Barrett >>> Dept. 1423: Scalable System Software >>> Sandia National Laboratories >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >> >> >> -- >> Brian W. Barrett >> Dept. 1423: Scalable System Software >> Sandia National Laboratories >> >> >> >> >> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > >___ >devel mailing list >de...@open-mpi.org >http://www.open-mpi.org/mailman/listinfo.cgi/devel > > -- Brian W. Barrett Dep
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
proc_get_epoch( proc ); calls ORTE_NAME_PRINT(proc) when in debug mode. When this is the case, ORTE_NAME_PRINT prints all fields of proc, including epoch. So, proc->epoch = proc_get_epoch(proc) triggers the valgrind warning, because of the print, in debug mode. This is the only reason why proc->epoch must be initialized before calling proc->epoch = proc_get_epoch( proc ); I agree with you it's not a good approach, for coding style / avoiding this kind of discussions. Alternative approaches include: - having proc_get_epoch have a side effect on proc->epoch instead of returning the correct epoch value. Assignment can then be done before the debug message. But this is inconsistent with the rest of the interface used in proc_name_fns; - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print the debug info. But this requires coding an alternative ORTE_SOMETHING_PRINT function, only for this use; - having proc be defined as an opal_object, and set epoch to INVALID (or even UNSET) into the constructor. This could induce changes at many places, and there is always the risk that some changes are left incomplete. Thomas Le 8 août 2011 à 12:20, Barrett, Brian W a écrit : > Ok, that makes some sense. It's a really weird way of doing > initialization, though. Some of it is probably how Ralph chose to share > nidpid map code, so perhaps it's unavoidable. But since the name isn't > use in proc_get_epoch() (or, better not be used in there), I'm not > entirely sure why that made valgrind happy... > > Brian > > On 8/8/11 9:53 AM, "Wesley Bland" wrote: > >> It's not just copying the value from one to the other. >> The value is initialized on the first line. The proc_name is passed into >> the function where the jobid and vpid are used (not the epoch). The >> function returns a new (the correct) value for the epoch based on the >> passed in jobid and vpid which is assigned to the process name. >> >> This is the same thing as initializing the value to NULL. We do that all >> the time to avoid compiler warnings. I just can't do that here because >> this isn't a pointer. If it would make the code look better I can move >> the first assignment to the top of the function where the other >> initializations are. >> >> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W >> wrote: >> >> On 8/8/11 9:34 AM, "Jeff Squyres" wrote: >> >>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: >>> The reason is because valgrind was complaining about uninitialized values that were passed into proc_get_epoch. I saw the same warnings from valgrind when I ran it. I added the code to initialize the values to what really should be the default value and the warnings went away. Since the process_name_t struct isn't an object, it doesn't have an initialization function like so many of the other objects in the code. This is what we have. >>> >>> Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed >>> that. >>> >>> Thanks! >>> peer_name.jobid = ORTE_PROC_MY_NAME->jobid; peer_name.vpid = peer_idx; +peer_name.epoch = ORTE_EPOCH_INVALID; peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); >> >> >> I'm still not convinced this is a rational coding style. It seems to me >> that if you're just going to copy the epoch from peer_name to peer_name, >> just assign the epoch to INVALID and be done with it. There's no need for >> both the assignment added in this patch and the line after it. >> >> Now, normally I'd say that this is fixing a bug and having to fix other >> people's bad code shouldn't be your problem. But since you wrote both >> lines, fixing it to make sense seems reasonable to me ;). >> >> Brian >> >> -- >> Brian W. Barrett >> Dept. 1423: Scalable System Software >> Sandia National Laboratories >> >> >> >> >> >> >> >> >> >> >> > > > -- > Brian W. Barrett > Dept. 1423: Scalable System Software > Sandia National Laboratories > > > > > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
Ok, that makes some sense. It's a really weird way of doing initialization, though. Some of it is probably how Ralph chose to share nidpid map code, so perhaps it's unavoidable. But since the name isn't use in proc_get_epoch() (or, better not be used in there), I'm not entirely sure why that made valgrind happy... Brian On 8/8/11 9:53 AM, "Wesley Bland" wrote: >It's not just copying the value from one to the other. >The value is initialized on the first line. The proc_name is passed into >the function where the jobid and vpid are used (not the epoch). The >function returns a new (the correct) value for the epoch based on the >passed in jobid and vpid which is assigned to the process name. > >This is the same thing as initializing the value to NULL. We do that all >the time to avoid compiler warnings. I just can't do that here because >this isn't a pointer. If it would make the code look better I can move >the first assignment to the top of the function where the other >initializations are. > >On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W >wrote: > >On 8/8/11 9:34 AM, "Jeff Squyres" wrote: > >>On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: >> >>> The reason is because valgrind was complaining about uninitialized >>>values that were passed into proc_get_epoch. I saw the same warnings >>>from valgrind when I ran it. I added the code to initialize the values >>>to what really should be the default value and the warnings went away. >>>Since the process_name_t struct isn't an object, it doesn't have an >>>initialization function like so many of the other objects in the code. >>>This is what we have. >> >>Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed >>that. >> >>Thanks! >> >>> peer_name.jobid = ORTE_PROC_MY_NAME->jobid; >>> peer_name.vpid = peer_idx; >>> +peer_name.epoch = ORTE_EPOCH_INVALID; >>> peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); > > >I'm still not convinced this is a rational coding style. It seems to me >that if you're just going to copy the epoch from peer_name to peer_name, >just assign the epoch to INVALID and be done with it. There's no need for >both the assignment added in this patch and the line after it. > >Now, normally I'd say that this is fixing a bug and having to fix other >people's bad code shouldn't be your problem. But since you wrote both >lines, fixing it to make sense seems reasonable to me ;). > >Brian > >-- > Brian W. Barrett > Dept. 1423: Scalable System Software > Sandia National Laboratories > > > > > > > > > > > -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia National Laboratories
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
It's not just copying the value from one to the other. The value is initialized on the first line. The proc_name is passed into the function where the jobid and vpid are used (not the epoch). The function returns a new (the correct) value for the epoch based on the passed in jobid and vpid which is assigned to the process name. This is the same thing as initializing the value to NULL. We do that all the time to avoid compiler warnings. I just can't do that here because this isn't a pointer. If it would make the code look better I can move the first assignment to the top of the function where the other initializations are. On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W wrote: > On 8/8/11 9:34 AM, "Jeff Squyres" wrote: > > >On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: > > > >> The reason is because valgrind was complaining about uninitialized > >>values that were passed into proc_get_epoch. I saw the same warnings > >>from valgrind when I ran it. I added the code to initialize the values > >>to what really should be the default value and the warnings went away. > >>Since the process_name_t struct isn't an object, it doesn't have an > >>initialization function like so many of the other objects in the code. > >>This is what we have. > > > >Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed > >that. > > > >Thanks! > > > >> peer_name.jobid = ORTE_PROC_MY_NAME->jobid; > >> peer_name.vpid = peer_idx; > >> +peer_name.epoch = ORTE_EPOCH_INVALID; > >> peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); > > I'm still not convinced this is a rational coding style. It seems to me > that if you're just going to copy the epoch from peer_name to peer_name, > just assign the epoch to INVALID and be done with it. There's no need for > both the assignment added in this patch and the line after it. > > Now, normally I'd say that this is fixing a bug and having to fix other > people's bad code shouldn't be your problem. But since you wrote both > lines, fixing it to make sense seems reasonable to me ;). > > Brian > > -- > Brian W. Barrett > Dept. 1423: Scalable System Software > Sandia National Laboratories > > > > > >
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
On 8/8/11 9:34 AM, "Jeff Squyres" wrote: >On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: > >> The reason is because valgrind was complaining about uninitialized >>values that were passed into proc_get_epoch. I saw the same warnings >>from valgrind when I ran it. I added the code to initialize the values >>to what really should be the default value and the warnings went away. >>Since the process_name_t struct isn't an object, it doesn't have an >>initialization function like so many of the other objects in the code. >>This is what we have. > >Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed >that. > >Thanks! > >> peer_name.jobid = ORTE_PROC_MY_NAME->jobid; >> peer_name.vpid = peer_idx; >> +peer_name.epoch = ORTE_EPOCH_INVALID; >> peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); I'm still not convinced this is a rational coding style. It seems to me that if you're just going to copy the epoch from peer_name to peer_name, just assign the epoch to INVALID and be done with it. There's no need for both the assignment added in this patch and the line after it. Now, normally I'd say that this is fixing a bug and having to fix other people's bad code shouldn't be your problem. But since you wrote both lines, fixing it to make sense seems reasonable to me ;). Brian -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia National Laboratories
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: > The reason is because valgrind was complaining about uninitialized values > that were passed into proc_get_epoch. I saw the same warnings from valgrind > when I ran it. I added the code to initialize the values to what really > should be the default value and the warnings went away. Since the > process_name_t struct isn't an object, it doesn't have an initialization > function like so many of the other objects in the code. This is what we have. Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed that. Thanks! > peer_name.jobid = ORTE_PROC_MY_NAME->jobid; > peer_name.vpid = peer_idx; > +peer_name.epoch = ORTE_EPOCH_INVALID; > peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
The reason is because valgrind was complaining about uninitialized values that were passed into proc_get_epoch. I saw the same warnings from valgrind when I ran it. I added the code to initialize the values to what really should be the default value and the warnings went away. Since the process_name_t struct isn't an object, it doesn't have an initialization function like so many of the other objects in the code. This is what we have. In reality this code is pretty useless and the value does get overridden. The purpose of passing the process_name into the function is to be able to get out the jobid and vpid. The epoch value is completely ignored except when printing out the process_name struct. Wes On Mon, Aug 8, 2011 at 11:18 AM, Jeff Squyres wrote: > FYI: Ralph's out today. He'll be back tomorrow. > > I'm not really part of this ORTE discussion, but I am curious about a code > style that I see in this commit: assigning ORTE_EPOCH_INVALID to a field, > and then immediately overwriting that field with another value. E.g.: > > > peer_name.jobid = ORTE_PROC_MY_NAME->jobid; > > peer_name.vpid = peer_idx; > > +peer_name.epoch = ORTE_EPOCH_INVALID; > > peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); > > > This technique is used throughout this patch. > > What is the purpose for this? As I understand it, this won't squash any > valgrind warnings, and may even get eliminated by the compiler as dead code > because it seems to be useless. > > > > On Aug 8, 2011, at 11:11 AM, wbl...@osl.iu.edu wrote: > > > Author: wbland > > Date: 2011-08-08 11:11:55 EDT (Mon, 08 Aug 2011) > > New Revision: 25015 > > URL: https://svn.open-mpi.org/trac/ompi/changeset/25015 > > > > Log: > > Make sure that the epoch is initialized everywhere so we don't get weird > output > > during valgrind. This shouldn't have caused any problems with any actual > > execution. Just extra warnings in valgrind. > > > > > > Text files modified: > > trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 2 ++ > > trunk/ompi/proc/proc.c| 2 +- > > trunk/orte/mca/ess/alps/ess_alps_module.c | 1 + > > trunk/orte/mca/ess/env/ess_env_module.c | 1 + > > trunk/orte/mca/ess/lsf/ess_lsf_module.c | 1 + > > trunk/orte/mca/ess/slave/ess_slave_module.c | 1 + > > trunk/orte/mca/ess/slurm/ess_slurm_module.c | 1 + > > trunk/orte/mca/grpcomm/base/grpcomm_base_coll.c |12 > +++- > > trunk/orte/mca/iof/hnp/iof_hnp.c | 1 + > > trunk/orte/mca/odls/base/odls_base_default_fns.c | 1 + > > trunk/orte/mca/odls/base/odls_base_open.c | 1 + > > trunk/orte/mca/plm/base/plm_base_launch_support.c | 1 + > > trunk/orte/mca/plm/base/plm_base_orted_cmds.c | 2 ++ > > trunk/orte/mca/plm/base/plm_base_receive.c| 1 + > > trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c| 3 +++ > > trunk/orte/mca/rmaps/rank_file/rmaps_rank_file.c | 1 + > > trunk/orte/mca/rmaps/seq/rmaps_seq.c | 1 + > > trunk/orte/mca/rml/oob/rml_oob_component.c| 4 > > trunk/orte/mca/routed/binomial/routed_binomial.c | 4 > > trunk/orte/mca/routed/cm/routed_cm.c | 4 > > trunk/orte/mca/routed/linear/routed_linear.c | 2 ++ > > trunk/orte/mca/routed/radix/routed_radix.c| 3 +++ > > trunk/orte/mca/routed/slave/routed_slave.c| 1 + > > trunk/orte/mca/sstore/central/sstore_central_global.c | 1 + > > trunk/orte/mca/sstore/stage/sstore_stage_global.c | 1 + > > trunk/orte/orted/orted_comm.c | 1 + > > trunk/orte/test/system/oob_stress.c | 2 +- > > trunk/orte/test/system/orte_ring.c| 2 ++ > > trunk/orte/test/system/orte_spawn.c | 1 + > > 29 files changed, 48 insertions(+), 11 deletions(-) > > > > Modified: trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > > > == > > --- trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c(original) > > +++ trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c2011-08-08 11:11:55 > EDT (Mon, 08 Aug 2011) > > @@ -5284,6 +5284,7 @@ > > */ > > peer_name.jobid = ORTE_PROC_MY_NAME->jobid; > > peer_name.vpid = peer_idx; > > +peer_name.epoch = ORTE_EPOCH_INVALID; > > peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); > > > > if( NULL == (peer_ref = find_peer(peer_name))) { > > @@ -5345,6 +5346,7 @@ > > > > peer_name.jobid = ORTE_PROC_MY_NAME->jobid; > > peer_name.vpid = peer_idx; > > +peer_name.epoch = ORTE_EPOCH_INVALID; > > peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); > > > > if ( 0 > (ret = orte_rml.recv_buffer_nb(&peer_nam
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
FYI: Ralph's out today. He'll be back tomorrow. I'm not really part of this ORTE discussion, but I am curious about a code style that I see in this commit: assigning ORTE_EPOCH_INVALID to a field, and then immediately overwriting that field with another value. E.g.: > peer_name.jobid = ORTE_PROC_MY_NAME->jobid; > peer_name.vpid = peer_idx; > +peer_name.epoch = ORTE_EPOCH_INVALID; > peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); This technique is used throughout this patch. What is the purpose for this? As I understand it, this won't squash any valgrind warnings, and may even get eliminated by the compiler as dead code because it seems to be useless. On Aug 8, 2011, at 11:11 AM, wbl...@osl.iu.edu wrote: > Author: wbland > Date: 2011-08-08 11:11:55 EDT (Mon, 08 Aug 2011) > New Revision: 25015 > URL: https://svn.open-mpi.org/trac/ompi/changeset/25015 > > Log: > Make sure that the epoch is initialized everywhere so we don't get weird > output > during valgrind. This shouldn't have caused any problems with any actual > execution. Just extra warnings in valgrind. > > > Text files modified: > trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 2 ++ > > trunk/ompi/proc/proc.c| 2 +- > > trunk/orte/mca/ess/alps/ess_alps_module.c | 1 + > > trunk/orte/mca/ess/env/ess_env_module.c | 1 + > > trunk/orte/mca/ess/lsf/ess_lsf_module.c | 1 + > > trunk/orte/mca/ess/slave/ess_slave_module.c | 1 + > > trunk/orte/mca/ess/slurm/ess_slurm_module.c | 1 + > > trunk/orte/mca/grpcomm/base/grpcomm_base_coll.c |12 +++- > > trunk/orte/mca/iof/hnp/iof_hnp.c | 1 + > > trunk/orte/mca/odls/base/odls_base_default_fns.c | 1 + > > trunk/orte/mca/odls/base/odls_base_open.c | 1 + > > trunk/orte/mca/plm/base/plm_base_launch_support.c | 1 + > > trunk/orte/mca/plm/base/plm_base_orted_cmds.c | 2 ++ > > trunk/orte/mca/plm/base/plm_base_receive.c| 1 + > > trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c| 3 +++ > > trunk/orte/mca/rmaps/rank_file/rmaps_rank_file.c | 1 + > > trunk/orte/mca/rmaps/seq/rmaps_seq.c | 1 + > > trunk/orte/mca/rml/oob/rml_oob_component.c| 4 > > trunk/orte/mca/routed/binomial/routed_binomial.c | 4 > > trunk/orte/mca/routed/cm/routed_cm.c | 4 > > trunk/orte/mca/routed/linear/routed_linear.c | 2 ++ > > trunk/orte/mca/routed/radix/routed_radix.c| 3 +++ > > trunk/orte/mca/routed/slave/routed_slave.c| 1 + > > trunk/orte/mca/sstore/central/sstore_central_global.c | 1 + > > trunk/orte/mca/sstore/stage/sstore_stage_global.c | 1 + > > trunk/orte/orted/orted_comm.c | 1 + > > trunk/orte/test/system/oob_stress.c | 2 +- > > trunk/orte/test/system/orte_ring.c| 2 ++ > > trunk/orte/test/system/orte_spawn.c | 1 + > > 29 files changed, 48 insertions(+), 11 deletions(-) > > Modified: trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > == > --- trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c(original) > +++ trunk/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c2011-08-08 11:11:55 EDT > (Mon, 08 Aug 2011) > @@ -5284,6 +5284,7 @@ > */ > peer_name.jobid = ORTE_PROC_MY_NAME->jobid; > peer_name.vpid = peer_idx; > +peer_name.epoch = ORTE_EPOCH_INVALID; > peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); > > if( NULL == (peer_ref = find_peer(peer_name))) { > @@ -53