Re: [OMPI users] Displaying MAIN in Totalview

2011-03-21 Thread Peter Thompson
Gee,  I had tried posting that info earlier today, but my post was 
rejected because my email address has changed.  This is as much a test 
of that address change request as it is a confirmation of the info Dave 
reports.  (Of course I'm the one who sent them the info, so it's only a 
little self-serving ;-)


Cheers,
Peter Thompson


Ralph Castain wrote:

Ick - appears that got dropped a long time ago. I'll add it back in and post a 
CMR for 1.4 and 1.5 series.

Thanks!
Ralph


On Mar 21, 2011, at 11:08 AM, David Turner wrote:

  

Hi,

About a month ago, this topic was discussed with no real resolution:

http://www.open-mpi.org/community/lists/users/2011/02/15538.php

We noticed the same problem (TV does not display the user's MAIN
routine upon initial startup), and contacted the TV developers.
They suggested a simple OMPI code modification, which we implemented
and tested; it seems to work fine.  Hopefully, this capability
can be restored in future releases.

Here is the body of our communication with the TV developers:

--

Interestingly enough, someone else asked this very same question recently and I 
finally dug into it last week and figured out what was going on. TotalView 
publishes a public interface which allows any MPI implementor to set things up 
so that it should work fairly seamlessly with TotalView. I found that one of the 
defines in the interface is

MPIR_force_to_main

and when we find this symbol defined in mpirun (or orterun in Open MPI's case) 
then we spend a bit more effort to focus the source pane on the main routine. 
As you may guess, this is NOT being defined in OpenMPI 1.4.2. It was being 
defined in the 1.2.x builds though, in a file called totalview.c. OpenMPI 
has been reworked significantly since then, and totalview.c has been replaced 
by debuggers.c in orte/tools/orterun. Around lines 130 to 140 (depending on any 
changes since my look at the 1.4.1 sources) you should find a number of MPIR_ 
symbols being defined.

struct MPIR_PROCDESC *MPIR_proctable = NULL;
int MPIR_proctable_size = 0;
int MPIR_being_debugged = 0;
volatile int MPIR_debug_state = 0;
volatile int MPIR_i_am_starter = 0;
volatile int MPIR_partial_attach_ok = 1;


I believe you should be able to insert the line:

int MPIR_force_to_main = 0;

into this section, and then the behavior you are looking for should work after 
you rebuild OpenMPI. I haven't yet had the time to do that myself, but that was 
all that existed in the 1.2.x sources, and I know those achieved the desired 
effect. It's quite possible that someone realized the symbol was initialized but 
wasn't being used anywhere, so they removed it without realizing we were looking 
for it in the debugger. When I pointed this out to the other user, he said he 
would try it out and pass it on to the Open MPI group. I just checked on that 
thread and didn't see any update, so I passed on the info myself.
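
A sketch of how that block of declarations might read with the suggested line 
added (the exact position within the block is an assumption and should not matter):

/* Debugger interface symbols published by orterun, as in the snippet above,
 * with the suggested MPIR_force_to_main added so TotalView will refocus the
 * source pane on the user's main routine. */
struct MPIR_PROCDESC *MPIR_proctable = NULL;
int MPIR_proctable_size = 0;
int MPIR_being_debugged = 0;
int MPIR_force_to_main = 0;            /* suggested addition */
volatile int MPIR_debug_state = 0;
volatile int MPIR_i_am_starter = 0;
volatile int MPIR_partial_attach_ok = 1;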

--

--
Best regards,

David Turner
User Services Group      email: dptur...@lbl.gov
NERSC Division           phone: (510) 486-4027
Lawrence Berkeley Lab    fax:   (510) 486-4316






[OMPI users] TotalView Memory debugging and OpenMPI

2011-05-11 Thread Peter Thompson
We've gotten a few reports of problems with memory debugging when using 
OpenMPI under TotalView.  Usually, TotalView will attach to the 
processes started after an MPI_Init.  However, in the case where memory 
debugging is enabled, things seemed to run away or fail.  My analysis 
showed that we had a number of core files left over from the attempt, 
and all were mpirun (or orterun) cores.  It seemed to be a regression 
on our part, since testing seemed to indicate this worked okay before 
TotalView 8.9.0-0, so I filed an internal bug and passed it to 
engineering.  After giving our engineer a brief tutorial on how to 
build a debug version of OpenMPI, he found what appears to be a problem 
in the code for orterun.c.  He's made a slight change that fixes the 
issue in 1.4.2, 1.4.3, 1.4.4rc2, and 1.5.3, those being the versions he's 
tested with so far.  He doesn't subscribe to this list that I know of, 
so I offered to pass this by the group.  Of course, I'm not sure if 
this is exactly the right place to submit patches, but I'm sure you'd 
tell me where to put it if I'm in the wrong here.  It's a short patch, 
so I'll cut and paste it, and attach it as well, since cut and paste can 
do weird things to formatting.
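
To illustrate the underlying issue, here is a minimal standalone sketch (not 
part of the patch; the variable name is just an example):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* putenv(3) does not copy its argument; the environment keeps the caller's
 * pointer.  If the caller later frees that buffer (as can happen to env[j]
 * in orterun), environ is left pointing into deallocated memory.  Handing
 * putenv() a private strdup()'d copy that is never freed avoids this. */
int main(void)
{
    char *copy = strdup("MY_EXAMPLE_VAR=value");  /* copy now owned by the environment */
    if (NULL == copy) {
        return 1;                                 /* out of memory */
    }
    putenv(copy);                                 /* safe: copy is never freed here */
    printf("MY_EXAMPLE_VAR=%s\n", getenv("MY_EXAMPLE_VAR"));
    return 0;
}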


Credit goes to Ariel Burton for this patch.  Of course he used TotalView 
to find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo' or 
'totalview mpirun -a -np 4 ./foo'.


Cheers,
PeterT


more ~/patches/anbs-patch

*** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
  }

  if (NULL != env) {
  size1 = opal_argv_count(env);
  for (j = 0; j < size1; ++j) {
! putenv(env[j]);
  }
  }

  /* All done */

--- 1578,1600 
  }

  if (NULL != env) {
  size1 = opal_argv_count(env);
  for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
! 
! if (NULL == s) {
! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
  }
  }

  /* All done */



Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Peter Thompson

Hi Ralph,

We've had a number of user complaints about this.  Since it seems on 
the face of it that it is a debugger issue, it may not have made its 
way back here.  Is your objection that the patch basically aborts if it 
gets a bad value?  I could understand that being a concern.  Of 
course, it aborts under TotalView now if we attempt to move forward without 
this patch.


I've passed your comment back to the engineer, with my suspicion that the 
concern is about the abort, but if you have other objections, let me know.


Cheers,
PeterT


Ralph Castain wrote:

That would be a problem, I fear. We need to push those envars into the 
environment.

Is there some particular problem causing what you see? We have no other reports 
of this issue, and orterun has had that code forever.



Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson  
wrote:

  

We've gotten a few reports of problems with memory debugging when using OpenMPI 
under TotalView.  Usually, TotalView will attach to the processes started after 
an MPI_Init.  However, in the case where memory debugging is enabled, things 
seemed to run away or fail.  My analysis showed that we had a number of core 
files left over from the attempt, and all were mpirun (or orterun) cores.  It 
seemed to be a regression on our part, since testing seemed to indicate this 
worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it 
to engineering.  After giving our engineer a brief tutorial on how to build a 
debug version of OpenMPI, he found what appears to be a problem in the code for 
orterun.c.  He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 
1.4.4rc2, and 1.5.3, those being the versions he's tested with so far.  He 
doesn't subscribe to this list that I know of, so I offered to pass this by the 
group.  Of course, I'm not sure if this is exactly the right place to submit 
patches, but I'm sure you'd tell me where to put it if I'm in the wrong here.  
It's a short patch, so I'll cut and paste it, and attach it as well, since cut 
and paste can do weird things to formatting.

Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo' or 'totalview 
mpirun -a -np 4 ./foo'.

Cheers,
PeterT


more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
}
if (NULL != env) {
size1 = opal_argv_count(env);
for (j = 0; j < size1; ++j) {
! putenv(env[j]);
}
}
/* All done */
--- 1578,1600 
}
if (NULL != env) {
size1 = opal_argv_count(env);
for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
!
! if (NULL == s) {
! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
}
}
/* All done */

*** orte/tools/orterun/orterun.c2010-04-13 13:30:34.0 -0400
--- 
/home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c
2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
 }

 if (NULL != env) {
 size1 = opal_argv_count(env);
 for (j = 0; j < size1; ++j) {
! putenv(env[j]);
 }
 }

 /* All done */

--- 1578,1600 
 }

 if (NULL != env) {
 size1 = opal_argv_count(env);
 for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
! 
! if (NULL == s) {

! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
 }
 }

 /* All done */







Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Peter Thompson
Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() 
beforehand, and then calling putenv() with the string duplicated from 
env[j].  Of course, if the strdup fails, then we bail out. 

As for why it's suddenly a problem, I'm not quite as certain.   The 
problem we do show is a double free, so someone has already freed that 
memory used by putenv(), and I do know that while that used to be just 
flagged as an event before, now we seem to be unable to continue past 
it.   Not sure if that is our change or a library/system change. 


PeterT


Ralph Castain wrote:

On May 16, 2011, at 12:45 PM, Peter Thompson wrote:

  

Hi Ralph,

We've had a number of user complaints about this.  Since it seems on the face 
of it that it is a debugger issue, it may not have made its way back here.  Is 
your objection that the patch basically aborts if it gets a bad value?  I 
could understand that being a concern.  Of course, it aborts under TotalView now 
if we attempt to move forward without this patch.




No - my concern is that you appear to be removing the "putenv" calls. OMPI 
places some values into the local environment so the user can control behavior. Removing 
those causes problems.

What I need to know is why, after it has worked with TV for years, these 
putenv's are suddenly a problem. Is the problem occurring during shutdown? Or 
is this something that causes TV to break?


  

I've passed your comment back to the engineer, with my suspicion that the 
concern is about the abort, but if you have other objections, let me know.

Cheers,
PeterT


Ralph Castain wrote:


That would be a problem, I fear. We need to push those envars into the 
environment.

Is there some particular problem causing what you see? We have no other reports 
of this issue, and orterun has had that code forever.



Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson  
wrote:

 
  

We've gotten a few reports of problems with memory debugging when using OpenMPI 
under TotalView.  Usually, TotalView will attach to the processes started after 
an MPI_Init.  However, in the case where memory debugging is enabled, things 
seemed to run away or fail.  My analysis showed that we had a number of core 
files left over from the attempt, and all were mpirun (or orterun) cores.  It 
seemed to be a regression on our part, since testing seemed to indicate this 
worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it 
to engineering.  After giving our engineer a brief tutorial on how to build a 
debug version of OpenMPI, he found what appears to be a problem in the code for 
orterun.c.  He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 
1.4.4rc2, and 1.5.3, those being the versions he's tested with so far.  He 
doesn't subscribe to this list that I know of, so I offered to pass this by the 
group.  Of course, I'm not sure if this is exactly the right place to submit 
patches, but I'm sure you'd tell me where to put it if I'm in the wrong here.  
It's a short patch, so I'll cut and paste it, and attach it as well, since cut 
and paste can do weird things to formatting.

Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo' or 'totalview 
mpirun -a -np 4 ./foo'.

Cheers,
PeterT


more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
   }
   if (NULL != env) {
   size1 = opal_argv_count(env);
   for (j = 0; j < size1; ++j) {
! putenv(env[j]);
   }
   }
   /* All done */
--- 1578,1600 
   }
   if (NULL != env) {
   size1 = opal_argv_count(env);
   for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
!
! if (NULL == s) {
! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
   }
   }
   /* All done */

*** orte/tools/orterun/orterun.c2010-04-13 13:30:34.0 -0400
--- 
/home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c
2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
}

if (NULL != env) {
size1 = o

Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-20 Thread Peter Thompson
Thanks Ralph.  I've seen the messages generated in b...@open-mpi.org, so 
I figured something was up!  I was going to provide the unified diff, 
but then ran into another issue in testing where we immediately ran into 
a seg fault, even with this fix.  It turns out that prepending 
/lib64 (and maybe /usr/lib64) to LD_LIBRARY_PATH works around that one 
though, so I don't think it's directly related, but it threw me off, 
along with the beta testing we're doing...


Cheers,
PeterT


Ralph Castain wrote:

Okay, I finally had time to parse this and fix it. Thanks!

On May 16, 2011, at 1:02 PM, Peter Thompson wrote:

  
Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() beforehand, and then calling putenv() with the string duplicated from env[j].  Of course, if the strdup fails, then we bail out. 
As for why it's suddenly a problem, I'm not quite as certain.   The problem we do show is a double free, so someone has already freed that memory used by putenv(), and I do know that while that used to be just flagged as an event before, now we seem to be unable to continue past it.   Not sure if that is our change or a library/system change. 
PeterT



Ralph Castain wrote:
    

On May 16, 2011, at 12:45 PM, Peter Thompson wrote:

 
  

Hi Ralph,

We've had a number of user complaints about this.  Since it seems on the face 
of it that it is a debugger issue, it may not have made its way back here.  Is 
your objection that the patch basically aborts if it gets a bad value?  I 
could understand that being a concern.  Of course, it aborts under TotalView now 
if we attempt to move forward without this patch.

   


No - my concern is that you appear to be removing the "putenv" calls. OMPI 
places some values into the local environment so the user can control behavior. Removing 
those causes problems.

What I need to know is why, after it has worked with TV for years, these 
putenv's are suddenly a problem. Is the problem occurring during shutdown? Or 
is this something that causes TV to break?


 
  

I've passed your comment back to the engineer, with my suspicion that the 
concern is about the abort, but if you have other objections, let me know.

Cheers,
PeterT


Ralph Castain wrote:
   


That would be a problem, I fear. We need to push those envars into the 
environment.

Is there some particular problem causing what you see? We have no other reports 
of this issue, and orterun has had that code forever.



Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson  
wrote:

  
  

We've gotten a few reports of problems with memory debugging when using OpenMPI 
under TotalView.  Usually, TotalView will attach to the processes started after 
an MPI_Init.  However, in the case where memory debugging is enabled, things 
seemed to run away or fail.  My analysis showed that we had a number of core 
files left over from the attempt, and all were mpirun (or orterun) cores.  It 
seemed to be a regression on our part, since testing seemed to indicate this 
worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it 
to engineering.  After giving our engineer a brief tutorial on how to build a 
debug version of OpenMPI, he found what appears to be a problem in the code for 
orterun.c.  He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 
1.4.4rc2, and 1.5.3, those being the versions he's tested with so far.  He 
doesn't subscribe to this list that I know of, so I offered to pass this by the 
group.  Of course, I'm not sure if this is exactly the right place to submit 
patches, but I'm sure you'd tell me where to put it if I'm in the wrong here.  
It's a short patch, so I'll cut and paste it, and attach it as well, since cut 
and paste can do weird things to formatting.

Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo' or 'totalview 
mpirun -a -np 4 ./foo'.

Cheers,
PeterT


more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
  }
  if (NULL != env) {
  size1 = opal_argv_count(env);
  for (j = 0; j < size1; ++j) {
! putenv(env[j]);
  }
  }
  /* All done */
--- 1578,1600 
  }
  if (NULL != env) {
  size1 = opal_argv_count(env);
  for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!  

[OMPI users] Open MPI job not joining up under TotalView.

2011-07-12 Thread Peter Thompson
I wonder if someone might have ideas to explore as to why this 
program might not be working correctly under TotalView.  Essentially a 
user is running a very simple hello-world-like program that does this:


#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
 MPI_Init( &argc, &argv );
 int rank, size;
 MPI_Comm_rank( MPI_COMM_WORLD, &rank );
 MPI_Comm_size( MPI_COMM_WORLD, &size );
 printf("rank = %d, size = %d\n", rank, size );
 MPI_Finalize();
 exit( EXIT_SUCCESS );
}


When run under

mpirun -np 8 ./foo

It spits out 8 lines with ranks 0 through 7 and a size of 8.  But when 
run under TotalView, he sees 8 lines, all of rank 0, and a size of 1.  So 
it looks like the processes never join up on his system under 
TV.  To make this a bit more interesting, when I run the same program 
here, I do NOT see this separate behavior.  The processes all join up 
and everything looks okay.  My machine, like his, is a Mac running 
Darwin 10.6.8.  To try and keep this replicable, I just used the native 
OpenMPI on Darwin.  TotalView is started up with


totalview foo

and then Parallel is chosen from the Startup Parameters window, and Open 
MPI and 8 processes are chosen.  Any thoughts about why I might see one 
8-process job while he sees 8 single-process jobs?  Are there any hidden 
debug flags I can use?


Thanks,
PeterT




[OMPI users] memalign usage in OpenMPI and its consequences for TotalView

2009-10-01 Thread Peter Thompson
We had a question from a user who had turned on memory debugging in TotalView 
and experienced a memory event error, "Invalid memory alignment request".  Having a 
1.3.3 build of OpenMPI handy, I tested it and, sure enough, saw the error.  I 
traced it down to, surprise, a call to memalign.  I find there are a few places 
where memalign is called, but the one I think I was dealing with was from 
malloc.c in ompi/mca//io/romio/romio/adio/common in the following lines:



#ifdef ROMIO_XFS
new = (void *) memalign(XFS_MEMALIGN, size);
#else
new = (void *) malloc(size);
#endif

I searched, but couldn't find a value for XFS_MEMALIGN, so maybe it was from 
opal_pt_malloc2_component.c instead, where the call is


p = memalign(1, 1024 * 1024);

There are only 10 to 12 references to memalign in the code that I can see, so it 
shouldn't be too hard to find.  What I can tell you is that the value that 
TotalView saw for the alignment, the first arg, was 1, and the second, the size, 
was 0x100000, which is probably right for 1024 squared.


The man page for memalign says that the first argument is the alignment that the 
allocated memory must use, and it must be a power of two.  The second is the length 
you want allocated.  One could argue that 1 is a power of two, but it seems a 
bit specious to me, and TotalView's memory debugger certainly objects to it. 
Can anyone tell me what the intent here is, and whether the memalign alignment 
argument is thought to be valid?  Or is this a bug (that might not affect anyone 
other than TotalView memory debug users)?
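
For comparison, here is a minimal sketch (not from the OpenMPI sources) of the 
same size allocation done with posix_memalign(), whose alignment argument must 
be a power of two and a multiple of sizeof(void *):

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = NULL;
    /* The alignment must be a power of two and a multiple of sizeof(void *);
     * 16 is a safe choice, unlike the memalign(1, ...) call above. */
    int rc = posix_memalign(&p, 16, 1024 * 1024);
    if (rc != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", rc);
        return 1;
    }
    free(p);
    return 0;
}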


Thanks,
Peter Thompson


Re: [OMPI users] memalign usage in OpenMPI and its consequences for TotalView

2009-10-01 Thread Peter Thompson
Took a look at the changes and that looks like it should work.  It's certainly 
not in 1.3.3, but as long as you guys are on top of it, that relieves my 
concerns ;-)


Thanks,
PeterT


Samuel K. Gutierrez wrote:

Ticket created (#2040).  I hope it's okay ;-).

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Oct 1, 2009, at 11:58 AM, Jeff Squyres wrote:


Did that make it over to the v1.3 branch?


On Oct 1, 2009, at 1:39 PM, Samuel K. Gutierrez wrote:


Hi,

I think Jeff has already addressed this problem.

https://svn.open-mpi.org/trac/ompi/changeset/21744

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Oct 1, 2009, at 11:25 AM, Peter Thompson wrote:

> We had a question from a user who had turned on memory debugging in
> TotalView and experienced a memory event error, "Invalid memory
> alignment request".  Having a 1.3.3 build of OpenMPI handy, I tested
> it and sure enough, saw the error.  I traced it down to, surprise, a
> call to memalign.  I find there are a few places where memalign is
> called, but the one I think I was dealing with was from malloc.c in
> ompi/mca//io/romio/romio/adio/common in the following lines:
>
>
> #ifdef ROMIO_XFS
>new = (void *) memalign(XFS_MEMALIGN, size);
> #else
>new = (void *) malloc(size);
> #endif
>
> I searched, but couldn't find a value for XFS_MEMALIGN, so maybe it
> was from opal_pt_malloc2_component.c instead, where the call is
>
>p = memalign(1, 1024 * 1024);
>
> There are only 10 to 12 references to memalign in the code that I
> can see, so it shouldn't be too hard to find.  What I can tell you
> is that the value that TotalView saw for alignment, the first arg,
> was 1, and the second, the size, was 0x100000, which is probably
> right for 1024 squared.
>
> The man page for memalign says that the first argument is the
> alignment that the allocated memory must use, and it must be a power of
> two.  The second is the length you want allocated.  One could argue
> that 1 is a power of two, but it seems a bit specious to me, and
> TotalView's memory debugger certainly objects to it. Can anyone tell
> me what the intent here is, and whether the memalign alignment
> argument is thought to be valid?  Or is this a bug (that might not
> affect anyone other than TotalView memory debug users?)
>
> Thanks,
> Peter Thompson





--
Jeff Squyres
jsquy...@cisco.com







Re: [OMPI users] memalign usage in OpenMPI and its consequences for TotalView

2009-10-01 Thread Peter Thompson
The value of 4 might be invalid (though maybe on a 32-bit machine, it would be 
okay?) but it's enough to allow TotalView to continue on without raising a 
memory event, so I'm okay with it ;-)


PeterT

Ashley Pittman wrote:

Simple malloc() returns pointers that are at least eight-byte aligned
anyway; I'm not sure what the reason for calling memalign() with a value
of four would be anyway.

Ashley,

On Thu, 2009-10-01 at 20:19 +0200, Åke Sandgren wrote:

No it didn't. And memalign is obsolete according to the manpage.
posix_memalign is the one to use.



https://svn.open-mpi.org/trac/ompi/changeset/21744




[OMPI users] 1.4 OpenMPI build not working well with TotalView on Darwin

2010-01-08 Thread Peter Thompson
I've tried a few builds of 1.4 on Snow Leopard, and trying to start up TotalView 
gets some of the more 'standard' problems.  Either the typdef for MPIR_PROCDESC 
can't be found, or MPIR_PROCTABLE is missing.  You can get things to work if you 
start up TotalView first and then pick your program and go to the Parallel tab 
and pick OpenMPI.  But it would be nice to get the classic launch working as well.


Cheers,
PeterT


Re: [OMPI users] 1.4 OpenMPI build not working well with TotalView on Darwin

2010-01-20 Thread Peter Thompson

Hi Jeff,

Sorry, speaking in shorthand again.

Jeff Squyres wrote:

On Jan 8, 2010, at 5:03 PM, Peter Thompson wrote:


I've tried a few builds of 1.4 on Snow Leopard, and trying to start up TotalView
gets some of the more 'standard' problems.  


I don't quite know what you mean by "standard" problems...?


That's more or less the 'standard problems' that I hear described when someone tries 
to build an MPI (not just OpenMPI) and things don't work on the first try.  I don't 
know if you've worked on the interface directly, but you are probably aware that 
TotalView has an API where we set up a structure, MPIR_PROCTABLE, based on a 
typedef MPIR_PROCDESC, which gets filled in with what processes are started up 
on which nodes, which allows the debugger to attach to things automatically. 
If the build is done so that the files that hold these structures are optimized, 
sometimes the typedef is optimized away.  Or, in the case of other builds, the 
file may have the correct optimization (none) but the symbol info is stripped in 
the link phase.  So it's a typical, or 'standard', issue I face, but hopefully 
not for you.
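
For reference, a rough sketch of those declarations as they typically appear in 
an MPIR-style starter (field names follow the common MPIR interface; not copied 
from any particular OpenMPI release):

/* One entry per MPI process, filled in by the starter (mpirun/orterun)
 * so the debugger can attach to each rank automatically. */
struct MPIR_PROCDESC {
    char *host_name;        /* node on which the process was started */
    char *executable_name;  /* path to that rank's executable */
    int   pid;              /* process id on that node */
};

struct MPIR_PROCDESC *MPIR_proctable = NULL;  /* the table itself */
int MPIR_proctable_size = 0;                  /* number of entries */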



Either the typedef for MPIR_PROCDESC
can't be found, or MPIR_PROCTABLE is missing.  You can get things to work if you
start up TotalView first and then pick your program and go to the Parallel tab
and pick OpenMPI.  But it would be nice to get the classic launch working as 
well.


I'm unclear on how you could find these symbols if you start TV first, etc., 
but it won't work automatically.


One of the solutions we came up with to work around this problem was to start up 
TotalView a different way, so that we need not rely on the symbol information at 
all.  If you start TotalView the 'classic' way, mpirun/mpiexec -tv -np 4 ./foo, 
it will look for MPIR_PROCTABLE and the others.  If you use the newer 'indirect' 
launch, we actually start up the debug servers with MPI, and then use some 
cached info to figure out the correct process to start up with the debug servers and 
how many processes to start.  With this method, the symbol information is not 
needed.  This method works with OpenMPI on just about all platforms.  However, 
some users prefer the classic launch with -tv, and this seems to be failing with 
the latest builds I've done on Darwin.  The debug info appears to be preserved 
in the .o files, but does not always seem complete.  It probably needs another 
look on my part, to make sure I'm doing it right.  The fact that Snow Leopard 
(and maybe some earlier releases) now includes OpenMPI also confuses the issue, 
as the version that comes with Darwin does NOT contain the symbol info, and it's 
easy enough to get the native OpenMPI, and not pick up the build you intended.


Does that make any more sense?

I'll try playing around with 1.4.1 and see if it's me, or the compilers, or 
maybe OpenMPI.


PeterT



Do you have deeper knowledge (given your email address) on exactly what is 
going wrong?



[OMPI users] Building from the SRPM version creates an rpm with stripped libraries

2010-05-24 Thread Peter Thompson
I have a user who prefers building RPMs from the SRPM.  That's okay, 
but for debugging via TotalView it creates a version with the openmpi 
.so files stripped and we can't gain control of the processes when 
launched via mpirun -tv.  I've verified this with my own build of a 
1.4.1 rpm which I then installed and noticed the same behavior that the 
user reports.  I was hoping to give them some advice as to how to avoid 
the stripping, as it appears that the actual build of those libraries is 
done with -g and everything looks fine.  But I can't figure out in the 
build (from the log file I created) just where that stripping takes 
place, or how to get around it if need be.  The best guess I have is 
that it may be happening at the very end when an rpm-tmp file is 
executed, but that file has disappeared so I don't really know what it 
does.  I thought it might be apparent in the spec file, but it's 
certainly not apparent to me!  Any help or advice would be appreciated.


Cheers,
Peter Thompson



Re: [OMPI users] Building from the SRPM version creates an rpm with stripped libraries

2010-05-26 Thread Peter Thompson
Thanks Ashley,  that did work, though I must say that %define __strip  
/bin/true is NOT very intuitive!


I did get my symbols in the needed libraries, but unfortunately, at 
least for the compiler I used to build, I still have a typedef 
undefined, and that also prevents that method of launching TV.  But we 
have our own workarounds for that. 


Cheers,
PeterT

Ashley Pittman wrote:

This is a standard rpm feature although like most things it can be disabled.

According to this mail and its replies, the two %defines below will prevent 
stripping and the building of debuginfo rpms.

http://lists.rpm.org/pipermail/rpm-list/2009-January/000122.html

%define debug_package %{nil}
%define __strip /bin/true
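
For example, a possible way to pass those on the rpmbuild command line instead 
of editing the spec file (the SRPM filename here is just illustrative):

rpmbuild --rebuild \
    --define 'debug_package %{nil}' \
    --define '__strip /bin/true' \
    openmpi-1.4.1-1.src.rpm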

Ashley.

On 25 May 2010, at 00:25, Peter Thompson wrote:

  

I have a user who prefers building RPMs from the SRPM.  That's okay, but for 
debugging via TotalView it creates a version with the openmpi .so files 
stripped and we can't gain control of the processes when launched via mpirun 
-tv.  I've verified this with my own build of a 1.4.1 rpm which I then 
installed and noticed the same behavior that the user reports.  I was hoping to 
give them some advice as to how to avoid the stripping, as it appears that the 
actual build of those libraries is done with -g and everything looks fine.  But 
I can't figure out in the build (from the log file I created) just where that 
stripping takes place, or how to get around it if need be.  The best guess I 
have is that it may be happening at the very end when an rpm-tmp file is 
executed, but that file has disappeared so I don't really know what it does.  I 
thought it might be apparent in the spec file, but it's certainly not apparent 
to me!  Any help or advice would be appreciated.



  





[OMPI users] Debug info on Darwin

2010-06-04 Thread Peter Thompson
We've had a couple of reports of users trying to debug with Open MPI and 
TotalView on Darwin and not being able to use the classic


mpirun -tv -np 4 ./foo

launch.  The typical problem shows up as something like

Can't find typedef for MPIR_PROCDESC

and then TotalView can't attach to the spawned processes.  While the 
Open MPI build may correctly compile the needed files with -g, the 
problem arises in that the DWARF info on Darwin is kept in the .o 
files.  If these files are kept around, we might be able to find that 
info and be happy debugging.  But if they are deleted after the build, 
or things are moved around, then we are unable to locate the .o files 
containing the debug info, and no one is pleased. 

It was suggested by our CTO that if these files were compiled so as to 
produce STABS debug info, rather than DWARF, then the debug info would 
be copied into the executables and shared libraries, and we would then 
be able to debug with Open MPI without a problem.  I'm not sure if this 
is the best place to offer that suggestion, but I imagine it's not a bad 
place to start.  ;-)
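
As a rough sketch of that suggestion (assuming a GCC toolchain that still 
accepts -gstabs; the install prefix is just an example), the Open MPI build 
could be configured with:

./configure CFLAGS="-gstabs" CXXFLAGS="-gstabs" --prefix=/opt/openmpi-stabs
make all install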


Regards,
Peter Thompson