Okay, I finally had time to parse this and fix it. Thanks!

On May 16, 2011, at 1:02 PM, Peter Thompson wrote:

> Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() 
> beforehand, and then calling putenv() with the string duplicated from env[j]. 
>  Of course, if the strdup fails, then we bail out. 
> As for why it's suddenly a problem, I'm not quite as certain.   The problem 
> we do show is a double free, so someone has already freed that memory used by 
> putenv(), and I do know that while that used to be just flagged as an event 
> before, now we seem to be unable to continue past it.   Not sure if that is 
> our change or a library/system change. 
> PeterT
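
For readers following the thread, here is a minimal sketch of the approach Peter describes above: duplicate each string, hand the copy to putenv(), and bail out if the duplication fails. The helper name push_env_copies and its 0/-1 return convention are hypothetical and are not taken from orterun.c, which returns OPAL_ERR_OUT_OF_RESOURCE instead.

```c
#define _XOPEN_SOURCE 700   /* ask for the putenv() and strdup() declarations */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of orterun.c.
 * Pushes every "NAME=VALUE" entry of a NULL-terminated array into the
 * process environment. putenv() stores the pointer it is given rather
 * than copying the string, so each entry is duplicated first and the
 * duplicate is intentionally never freed: it now belongs to environ.
 * Returns 0 on success, -1 if a copy could not be made or putenv() failed. */
static int push_env_copies(char **env)
{
    assert(env != NULL);
    for (size_t j = 0; env[j] != NULL; ++j) {
        char *s = strdup(env[j]);
        if (s == NULL) {
            return -1;          /* out of memory: bail out, as the patch does */
        }
        if (putenv(s) != 0) {
            free(s);            /* putenv() did not take the pointer */
            return -1;
        }
    }
    return 0;
}
```

With this in place the caller can free or reuse the original array afterwards, and getenv() still returns the values, because environ only refers to the private copies.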
> 
> 
> Ralph Castain wrote:
>> On May 16, 2011, at 12:45 PM, Peter Thompson wrote:
>> 
>>  
>>> Hi Ralph,
>>> 
>>> We've had a number of user complaints about this.   Since it seems on the 
>>> face of it that it is a debugger issue, it may not have made its way back 
>>> here.  Is your objection that the patch basically aborts if it gets a bad 
>>> value?   I could understand that being a concern.   Of course, it aborts on 
>>> TotalView now if we attempt to move forward without this patch.
>>> 
>>>    
>> 
>> No - my concern is that you appear to be removing the "putenv" calls. OMPI 
>> places some values into the local environment so the user can control 
>> behavior. Removing those causes problems.
>> 
>> What I need to know is why, after it has worked with TV for years, these 
>> putenv's are suddenly a problem. Is the problem occurring during shutdown? 
>> Or is this something that causes TV to break?
>> 
>> 
>>  
>>> I've passed your comment back to the engineer, with a suspicion about the 
>>> concerns about the abort, but if you have other objections, let me know.
>>> 
>>> Cheers,
>>> PeterT
>>> 
>>> 
>>> Ralph Castain wrote:
>>>    
>>>> That would be a problem, I fear. We need to push those envars into the 
>>>> environment.
>>>> 
>>>> Is there some particular problem causing what you see? We have no other 
>>>> reports of this issue, and orterun has had that code forever.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On May 11, 2011, at 2:05 PM, Peter Thompson <peter.thomp...@roguewave.com> 
>>>> wrote:
>>>> 
>>>>       
>>>>> We've gotten a few reports of problems with memory debugging when using 
>>>>> OpenMPI under TotalView.  Usually, TotalView will attach to the processes 
>>>>> started after an MPI_Init.  However in the case where memory debugging is 
>>>>> enabled, things seemed to run away or fail.   My analysis showed that we 
>>>>> had a number of core files left over from the attempt, and all were 
>>>>> mpirun (or orterun) cores.   It seemed to be a regression on our part, 
>>>>> since testing seemed to indicate this worked okay before TotalView 
>>>>> 8.9.0-0, so I filed an internal bug and passed it to engineering.   After 
>>>>> giving our engineer a brief tutorial on how to build a debug version of 
>>>>> OpenMPI, he found what appears to be a problem in the code for orterun.c. 
>>>>>   He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 
>>>>> 1.4.4rc2 and 1.5.3, those being the versions he's tested with so far.    
>>>>> He doesn't subscribe to this list that I know of, so I offered to pass 
>>>>> this by the group.   Of course, I'm not sure if this is exactly the right 
>>>>> place to submit patches, but I'm sure you'd tell me where to put it if 
>>>>> I'm in the wrong here.   It's a short patch, so I'll cut and paste it, 
>>>>> and attach as well, since cut and paste can do weird things to formatting.
>>>>> 
>>>>> Credit goes to Ariel Burton for this patch.  Of course he used TotalView 
>>>>> to find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 
>>>>> 'totalview mpirun -a -np 4 ./foo'
>>>>> 
>>>>> Cheers,
>>>>> PeterT
>>>>> 
>>>>> 
>>>>> more ~/patches/anbs-patch
>>>>> *** orte/tools/orterun/orterun.c        2010-04-13 13:30:34.000000000 -0400
>>>>> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c        2011-05-09 20:28:16.588183000 -0400
>>>>> ***************
>>>>> *** 1578,1588 ****
>>>>>   }
>>>>>   if (NULL != env) {
>>>>>       size1 = opal_argv_count(env);
>>>>>       for (j = 0; j < size1; ++j) {
>>>>> !             putenv(env[j]);
>>>>>       }
>>>>>   }
>>>>>   /* All done */
>>>>> --- 1578,1600 ----
>>>>>   }
>>>>>   if (NULL != env) {
>>>>>       size1 = opal_argv_count(env);
>>>>>       for (j = 0; j < size1; ++j) {
>>>>> !             /* Use-after-Free error possible here.  putenv does not copy
>>>>> !                the string passed to it, and instead stores only the pointer.
>>>>> !                env[j] may be freed later, in which case the pointer
>>>>> !                in environ will now be left dangling into a deallocated
>>>>> !                region.
>>>>> !                So we make a copy of the variable.
>>>>> !             */
>>>>> !             char *s = strdup(env[j]);
>>>>> !
>>>>> !             if (NULL == s) {
>>>>> !                 return OPAL_ERR_OUT_OF_RESOURCE;
>>>>> !             }
>>>>> !             putenv(s);
>>>>>       }
>>>>>   }
>>>>>   /* All done */
>>>>> 
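
As a point of comparison only, and not what the patch above does: POSIX setenv() copies both the name and the value, so the lifetime of the caller's string stops mattering. A minimal sketch follows, assuming each entry has the usual "NAME=VALUE" form; the helper name set_env_entry is hypothetical.

```c
#define _XOPEN_SOURCE 700   /* ask for the setenv() declaration */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of orterun.c.
 * Splits one "NAME=VALUE" entry at the first '=' and installs it with
 * setenv(), which makes its own copies of the name and the value.
 * Returns 0 on success, -1 on a malformed entry or allocation failure. */
static int set_env_entry(const char *entry)
{
    assert(entry != NULL);

    const char *eq = strchr(entry, '=');
    if (eq == NULL || eq == entry) {
        return -1;                      /* no '=' or empty name */
    }

    size_t name_len = (size_t)(eq - entry);
    char *name = malloc(name_len + 1);
    if (name == NULL) {
        return -1;
    }
    memcpy(name, entry, name_len);
    name[name_len] = '\0';

    int rc = setenv(name, eq + 1, 1);   /* 1 = overwrite an existing value */
    free(name);                         /* safe: setenv() kept its own copy */
    return rc == 0 ? 0 : -1;
}
```

The trade-off is an extra split and a temporary allocation per entry, against not having to leave a strdup() copy alive for the rest of the process.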
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>           
>> 
>>  
> 

