On Aug 8, 2011, at 1:23 PM, Thomas Herault wrote: > proc_get_epoch( proc ); > calls ORTE_NAME_PRINT(proc) > when in debug mode. > > When this is the case, ORTE_NAME_PRINT prints all fields of proc, including > epoch. So, proc->epoch = proc_get_epoch(proc) triggers the valgrind warning, > because of the print, in debug mode. > This is the only reason why proc->epoch must be initialized before calling > proc->epoch = proc_get_epoch( proc ); > > I agree with you it's not a good approach, for coding style / avoiding this > kind of discussions. > > Alternative approaches include: > - having proc_get_epoch have a side effect on proc->epoch instead of > returning the correct epoch value. Assignment can then be done before the > debug message. But this is inconsistent with the rest of the interface used > in proc_name_fns; > - having proc_get_epoch use another helper than ORTE_NAME_PRINT to print the > debug info. But this requires coding an alternative ORTE_SOMETHING_PRINT > function, only for this use; > - having proc be defined as an opal_object, and set epoch to INVALID (or even > UNSET) into the constructor. This could induce changes at many places, and > there is always the risk that some changes are left incomplete.
Just an FYI here. We have had the discussion of converting orte_process_name_t to an opal_object several times. The biggest reason for not doing so is that the memory consumed by the opal_object_t itself is actually greater than the memory used by the name fields. Since we store a lot of process names, this results in a lot of overhead memory loss. However, we do have the problem of names not being initialized, and this has caused trouble a number of times over the last few years. So maybe it would be worth the footprint to resolve the problem once and for all? Something we should discuss, methinks. > > Thomas > > Le 8 août 2011 à 12:20, Barrett, Brian W a écrit : > >> Ok, that makes some sense. It's a really weird way of doing >> initialization, though. Some of it is probably how Ralph chose to share >> nidpid map code, so perhaps it's unavoidable. But since the name isn't >> use in proc_get_epoch() (or, better not be used in there), I'm not >> entirely sure why that made valgrind happy... >> >> Brian >> >> On 8/8/11 9:53 AM, "Wesley Bland" <wbl...@eecs.utk.edu> wrote: >> >>> It's not just copying the value from one to the other. >>> The value is initialized on the first line. The proc_name is passed into >>> the function where the jobid and vpid are used (not the epoch). The >>> function returns a new (the correct) value for the epoch based on the >>> passed in jobid and vpid which is assigned to the process name. >>> >>> This is the same thing as initializing the value to NULL. We do that all >>> the time to avoid compiler warnings. I just can't do that here because >>> this isn't a pointer. If it would make the code look better I can move >>> the first assignment to the top of the function where the other >>> initializations are. >>> >>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W <bwba...@sandia.gov> >>> wrote: >>> >>> On 8/8/11 9:34 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote: >>> >>>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote: >>>> >>>>> The reason is because valgrind was complaining about uninitialized >>>>> values that were passed into proc_get_epoch. I saw the same warnings >>>>> from valgrind when I ran it. I added the code to initialize the values >>>>> to what really should be the default value and the warnings went away. >>>>> Since the process_name_t struct isn't an object, it doesn't have an >>>>> initialization function like so many of the other objects in the code. >>>>> This is what we have. >>>> >>>> Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed >>>> that. >>>> >>>> Thanks! >>>> >>>>> peer_name.jobid = ORTE_PROC_MY_NAME->jobid; >>>>> peer_name.vpid = peer_idx; >>>>> + peer_name.epoch = ORTE_EPOCH_INVALID; >>>>> peer_name.epoch = orte_ess.proc_get_epoch(&peer_name); >>> >>> >>> I'm still not convinced this is a rational coding style. It seems to me >>> that if you're just going to copy the epoch from peer_name to peer_name, >>> just assign the epoch to INVALID and be done with it. There's no need for >>> both the assignment added in this patch and the line after it. >>> >>> Now, normally I'd say that this is fixing a bug and having to fix other >>> people's bad code shouldn't be your problem. But since you wrote both >>> lines, fixing it to make sense seems reasonable to me ;). >>> >>> Brian >>> >>> -- >>> Brian W. Barrett >>> Dept. 1423: Scalable System Software >>> Sandia National Laboratories >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >> >> >> -- >> Brian W. Barrett >> Dept. 1423: Scalable System Software >> Sandia National Laboratories >> >> >> >> >> >> >> _______________________________________________ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel