Re: [OMPI devel] Autogen improvements: ready for blast off

2010-09-17 Thread Ralph Castain
After chatting with Jeff, we decided it would be good to introduce this into 
the trunk over the weekend so it can settle before people start beating on 
it. Please note:

WARNING: Work on the temp branch being merged here encountered problems due to 
bugs in Subversion. Considerable effort has gone into validating the branch. 
However, not all conditions could be checked, so it may be advisable not to 
update from the trunk for a few days, to allow MTT to identify 
platform-specific issues.

See Jeff's notes below about THINGS YOU NEED TO KNOW.

Ralph

On Sep 15, 2010, at 12:34 PM, Jeff Squyres wrote:

> Ya, timezone differences and limited communication make it hard / make 
> miscommunication easy.
> 
> We're not going to wait a year.  We're not going to wait a month.  
> 
> Ralph and I just need to sync up and get this stuff in.  I'm *guessing* it'll 
> be ready in a week or so.  I'll be on the call to discuss next Tuesday.
> 
> 
> On Sep 13, 2010, at 3:49 PM, Ralph Castain wrote:
> 
>> Just to correct: I will almost certainly not be on this week's call, and 
>> will definitely not be on for the next two weeks either.
>> 
>> A last-minute concern raised by Jeff makes it doubtful this will come into 
>> the trunk any time soon, and may see it delayed for more than a year until 
>> we are ready for a 1.7 series. Unclear at this time as I don't understand 
>> the nature of his concern, and communication is difficult across the globe.
>> 
>> For now, we can safely table this issue. It isn't coming in anytime soon.
>> 
>> 
>> On Sep 12, 2010, at 2:40 AM, Jeff Squyres wrote:
>> 
>>> (Terry: please add this to the agenda for the Tuesday call -- Ralph will 
>>> talk about it since I may not be on the call)
>>> 
>>> Ralph sent a mail a while ago describing improvements to autogen and the 
>>> build process that Brian, Ralph, and I have been working on.  We think this 
>>> work is now complete, and would like to bring it back to the SVN trunk.  
>>> Here's the bitbucket where this stuff lives:
>>> 
>>>  http://bitbucket.org/rhc/ompi-agen
>>> 
>>> We'd like to bring this stuff in to the SVN trunk by the end of the week.  
>>> Please examine our changes and/or test the things you care about in the 
>>> bitbucket.  The SVN commit to the trunk will look large mainly because it 
>>> makes almost-identical changes in many Makefile.am's and configure.m4's 
>>> (and we removed all configure.params files).
>>> 
>>> 
>>> *** THE MOST IMPORTANT THING DEVELOPERS NEED TO KNOW ***
>>> 
>>> 
>>> 
>>> If your component has a configure.m4 file, it MUST call AC_CONFIG_FILES for 
>>> your Makefile.am!  (and/or any files that you want configure to generate).  
>>> We converted all existing configure.m4 files -- the 
>>> ompi/mca/btl/tcp/configure.m4 is a nice simple example to see what I mean.
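
A minimal sketch of what such a configure.m4 might look like, for a 
hypothetical "foo" BTL component (the macro name and the trailing action 
argument are assumptions for illustration; the real 
ompi/mca/btl/tcp/configure.m4 in the branch is the authoritative example):

    # configure.m4 for a hypothetical "foo" BTL component
    AC_DEFUN([MCA_btl_foo_CONFIG],[
        # Register this component's Makefile so that configure generates it
        AC_CONFIG_FILES([ompi/mca/btl/foo/Makefile])

        # ... component-specific checks would go here ...

        [$1]    dnl action to take if the component can build
    ])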
>>> 
>>> 
>>> There are some other changes and improvements, but most of them are behind 
>>> the scenes.  We'll update the relevant wiki pages with all the other 
>>> details:
>>> 
>>>  https://svn.open-mpi.org/trac/ompi/wiki/devel/Autogen
>>>  https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponent
>>>  https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateFramework
>>> 
>>> We understand that Mellanox may have some changes to their local branch of 
>>> the OMPI build system; it is unknown whether they conflict with our new 
>>> stuff or not.  Mellanox is out for ~2 weeks for holidays; we'd like to 
>>> bring this stuff in to the SVN trunk sooner rather than waiting 2 weeks and 
>>> letting the branch get overly stale.  Of course, when Mellanox does update 
>>> and get the new stuff, if there are any problems, I'm happy to work through 
>>> the issues with them.
>>> 




Re: [OMPI devel] Autogen improvements: ready for blast off

2010-09-17 Thread Ralph Castain
Okay, some things I am already discovering. If you do an "svn up", there is 
some file cleanup you'll need to do to get this to build again. Specifically, 
you need to:

rm config/mca_m4_config_include.m4

as this is a stale file that will linger and screw things up.
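
For anyone who wants the whole sequence in one place, a hedged sketch (the 
autogen script name and the usual build steps are assumptions here, not 
prescriptions from this mail):

    svn up
    rm -f config/mca_m4_config_include.m4   # stale leftover from the old autogen
    ./autogen.pl    # or ./autogen.sh, whichever your updated checkout provides
    ./configure && make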



Re: [OMPI devel] NP64 _gather_ problem

2010-09-17 Thread Terry Dontje
Does setting the mca parameter mpi_preconnect_mpi to 1 help at all?  This 
might help determine whether it is actually the connection setup between 
processes that is out of sync, as opposed to something in the gather 
algorithm itself.


--td
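
For anyone who wants to try this, a hedged example command line (the hostfile 
and benchmark arguments are illustrative assumptions, not taken from this 
thread):

    # Establish all pairwise connections during MPI_Init so the gather itself
    # runs over already-established connections:
    mpirun --mca mpi_preconnect_mpi 1 -np 64 -hostfile hosts ./IMB-MPI1 gather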

Steve Wise wrote:
Here's a clue:  ompi_coll_tuned_gather_intra_dec_fixed() switches to a 
binomial algorithm for job sizes > 60.  I changed the threshold to 100 and my 
NP64 jobs run fine.  Now to try and understand what about 
ompi_coll_tuned_gather_intra_binomial() is causing these connect delays...
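
A hedged aside: instead of editing the threshold in the source, the tuned 
collective component can usually be steered with MCA parameters (the parameter 
names and the value-to-algorithm mapping below are assumptions about 
coll_tuned, not taken from this thread -- "ompi_info --param coll tuned" lists 
the real ones for your build):

    # Honor user-forced algorithm selection, then force a non-binomial gather:
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_gather_algorithm 1 \
           -np 64 ./IMB-MPI1 gather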



On 9/16/2010 1:01 PM, Steve Wise wrote:
Oops.  One key typo here:  This is the IMB-MPI1 gather test, not 
barrier. :(



On 9/16/2010 12:05 PM, Steve Wise wrote:

 Hi,

I'm debugging a performance problem with running IMB-MPI1/barrier on an NP64 
cluster (8 nodes, 8 cores each).  I'm using openmpi-1.4.1 from the OFED-1.5.1 
distribution.  The BTL is openib/iWARP via Chelsio's T3 RNIC.  In short, NP60 
and smaller runs complete in a timely manner as expected, but NP61 and larger 
runs come to a crawl at the 8KB IO size and take ~5-10 min to complete.  They 
do complete, though.  It behaves this way even if I run on > 8 nodes so there 
are available cores; i.e. an NP64 run on a 16-node cluster still behaves the 
same way even though there are only 4 ranks on each node.  So it's apparently 
not a thread starvation issue due to lack of cores.

When in the stalled state, I see on the order of 100 or so established iWARP 
connections on each node, and the connection count increases VERY slowly and 
sporadically (at its peak there are around 800 connections for an NP64 gather 
operation).  In comparison, when I run the <= NP60 cases, the connections 
quickly ramp up to the expected amount.  I added hooks in the openib BTL to 
track the time it takes to set up each connection.  In all runs, both <= NP60 
and > NP60, the average connection setup time is around 200 ms, and the max 
setup time seen is never much above this value.  That tells me that it's not 
individual connection setup that is the issue.

I then added printfs/fflushes in librdmacm to visually see when a connection 
is attempted and when it is accepted.  With these printfs, I see the 
connections get set up quickly and evenly in the <= NP60 case.  Initially, 
when the job is started, I see a small flurry of connections getting set up; 
then the run begins, and at around the 1KB IO size I see a second large 
flurry of connection setups.  Then the test continues and completes.  With 
the > NP60 case, this second round of connection setups is very sporadic and 
slow.  Very slow!  I'll see little bursts of ~10-20 connections set up, then 
long random pauses.  The net is that full connection setup for the job takes 
5-10 min.  During this time the ranks are basically spinning idle awaiting 
the connections to get set up.  So I'm concluding that something above the 
BTL layer isn't issuing the endpoint connect requests in a timely manner.


Attached are 3 padb dumps during the stall.  Anybody see anything 
interesting in these?


Any ideas how I can further debug this?  Once I get above the 
openib  BTL layer my eyes glaze over and I get lost quickly. :)  I 
would greatly appreciate any ideas from the OpenMPI experts!



Thanks in advance,

Steve.








[OMPI devel] New Romio for OpenMPI available in bitbucket

2010-09-17 Thread Pascal Deveze

Hi all,

In charge of ticket 1888 (see 
https://svn.open-mpi.org/trac/ompi/ticket/1888),

I have put the resulting code in bitbucket at:
http://bitbucket.org/devezep/new-romio-for-openmpi/

The work in this repo consisted of refreshing ROMIO to a newer
version: the one from the latest MPICH2 release (mpich2-1.3b1).

Testing:
 1. Runs fine on various file systems, except for one minor error (see the 
explanation below).
 2. Runs fine with Lustre, but a small patch had to be added in 
romio/adio/ad_lustre_open.c.
 3. See below for how to run efficiently with Lustre.

You are invited to test and send comments.

Enjoy !

Pascal

=== The minor error ===
The test error.c fails because OpenMPI does not correctly handle the
"two-level" error functions of ROMIO:
    error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
                                      myname, __LINE__, MPI_ERR_ARG,
                                      "**iobaddisp", 0);
OpenMPI limits its view to MPI_ERR_ARG, but the real error is 
"**iobaddisp".


=== How to test performance with Lustre ===
1) Compile with the Lustre ADIO driver. For this, add the flag
       --with-io-romio-flags="--with-file-system=ufs+nfs+lustre"
   to your configure command.

2) Of course, you should have a Lustre file system mounted on all the nodes 
you will run on.

3) Take an application like coll_perf.c (in the test directory). In this 
application, change the three dimensions to 1000; that will create a file of 
4 GB (big files are required in order to reach good performance with Lustre).

4) Put the highest possible striping_factor in the hints. One solution is:
   - If your Lustre file system has 16 OSTs, create a hints file with the 
     following line:
        striping_factor 16
   - Export the path to this file in the ROMIO_HINTS variable:
        export ROMIO_HINTS=my_directory/my_hints
   If you do not specify the striping_factor, Lustre will use its default 
   value (often only 2). You can verify the striping factor chosen by Lustre 
   with the following command:
        lfs getstripe <file>   (look at the value of lmm_stripe_count)
   Note: the striping_factor is set once at file creation and cannot be 
   changed afterwards.

5) Run your test, specifying a file located in the Lustre file system. (A 
consolidated sketch of these steps follows below.)
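
A consolidated, hedged sketch of steps 1-5 (the paths, OST count, and the 
coll_perf argument syntax are illustrative assumptions):

    # 1) Build Open MPI with the Lustre ADIO driver
    ./configure --with-io-romio-flags="--with-file-system=ufs+nfs+lustre"   # plus your usual options
    make && make install

    # 4) Create a hints file matching your number of OSTs and point ROMIO at it
    echo "striping_factor 16" > $HOME/romio_hints
    export ROMIO_HINTS=$HOME/romio_hints

    # 5) Run against a file on the Lustre mount, then verify the striping
    mpirun -np 16 ./coll_perf -fname /lustre/scratch/testfile   # "-fname" assumed; check the test's usage
    lfs getstripe /lustre/scratch/testfile                      # look at lmm_stripe_count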



[OMPI devel] Checkpoint is broken in trunk

2010-09-17 Thread ananda.mudar
I downloaded the nightly build of the trunk (r23756) and found that the 
checkpoint functionality is broken. My MPI program is a simple hello-world 
program that increments and prints a counter every few seconds.

Following are the steps:
1. mpirun with NP set to 32
2. call ompi-checkpoint with the "-term" option; it terminates the program 
after successful checkpoint image creation
3. call ompi-restart using the checkpoint image; the restart terminates with 
a segmentation fault

I tried these steps with 1.5rc6 and 1.4.2 and I am able to restart the process 
using the checkpoint image. Am I missing any steps here? Reducing the number of 
processes didn't change the behavior.
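
For reference, a hedged sketch of the three steps above (the ompi-checkpoint 
line and the mpirun PID are illustrative assumptions; the actual mpirun and 
ompi-restart command lines appear verbatim in the output below):

    mpirun -am ft-enable-cr -np 32 -hostfile hostfile-32 ./hellompi &
    ompi-checkpoint -term <pid_of_mpirun>   # e.g. 13933, matching the snapshot name below
    ompi-restart ompi_global_snapshot_<pid_of_mpirun>.ckpt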

Following is the output from my checkpoint attempt:

=== Output START ==
mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1 --mca 
sstore_stage_global_is_shared 1 --mca sstore_base_global_snapshot_dir 
/scratch/hpl005/UIT_test/amudar/FWI --mca mpi_paffinity_alone 1  -np 32 
-hostfile hostfile-32 ../hellompi
Hello, world, I am 0 of 32
  1 Hello, world, I am 4 of 32
  1 Hello, world, I am 5 of 32
  1 Hello, world, I am 1 of 32
  1 Hello, world, I am 9 of 32
  1 Hello, world, I am 8 of 32
  1 Hello, world, I am 2 of 32
  1 Hello, world, I am 7 of 32
  1 Hello, world, I am 16 of 32
  1 Hello, world, I am 10 of 32
  1 Hello, world, I am 14 of 32
  1 Hello, world, I am 3 of 32
  1 Hello, world, I am 11 of 32
  1 Hello, world, I am 13 of 32
  1 Hello, world, I am 15 of 32
  1 Hello, world, I am 20 of 32
  1 Hello, world, I am 18 of 32
  1 Hello, world, I am 17 of 32
  1 Hello, world, I am 23 of 32
  1 Hello, world, I am 24 of 32
  1 Hello, world, I am 22 of 32
  1 Hello, world, I am 19 of 32
  1 Hello, world, I am 21 of 32
  1 Hello, world, I am 28 of 32
  1 Hello, world, I am 6 of 32
  1 Hello, world, I am 26 of 32
  1 Hello, world, I am 27 of 32
  1 Hello, world, I am 25 of 32
  1 Hello, world, I am 30 of 32
  1 Hello, world, I am 31 of 32
  1 Hello, world, I am 29 of 32
  1 Hello, world, I am 12 of 32
  1   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
  2   2   2   2   2   2   2   2   2   2   2   2   2   3   3   3   3   3   3   3 
  3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
  3   3   3   3   3   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
  4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   5   5   5 
  5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
  5   5   5   5   5   5   5   5   5   6   6   6   6   6   6   6   6   6   6   6 
  6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6 
  6   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7 
  7   7   7   7   7   7   7   7   7   7   7   7   7   8   8   8   8   8   8   8 
  8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8 
  8   8   8   8   8   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9 
  9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9 
[hplcnlj158:13937] OPAL CR Timing:  Summary Begin
[hplcnlj158:13937] opal_cr: timing: Start Entry Point  =          0.01 s    1.22 s        0.57
[hplcnlj158:13937] opal_cr: timing: CRCP Protocol      =          0.43 s    1.22 s       35.45
[hplcnlj158:13937] opal_cr: timing: P2P Suspend        =          0.00 s    1.22 s        0.34
[hplcnlj158:13937] opal_cr: timing: Checkpoint         =          0.64 s    1.22 s       52.87
[hplcnlj158:13937] opal_cr: timing: P2P Reactivation   = -1284678958.98 s    1.22 s  -105438618322.51
[hplcnlj158:13937] opal_cr: timing: CRCP Cleanup       =          0.00 s    1.22 s        0.00
[hplcnlj158:13937] opal_cr: timing: Finish Entry Point =  1284678959.11 s    1.22 s   105438618333.28
[hplcnlj158:13937] OPAL CR Timing:  Summary End
hplcnlj158> ompi-restart -am ft-enable-cr --mca opal_cr_enable_timer 1 
-hostfile hostfile-32 --mca sstore_stage_global_is_shared 1 --mca 
sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI 
ompi_global_snapshot_13933.ckpt
  9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9 
  9   9   9   9   9   9   9   9   9   9   9   9 [hplcnlj158:13937] *** Process 
received signal ***
[hplcnlj158:13937] Signal: Segmentation fault (11)
[hplcnlj158:13937] Signal code: Address not mapped (1)
[hplcnlj158:13937] Failing at address: 0x2aaa0001
[hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
[hplcnlj158:13937] [ 1] 
/users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262)
 [0x2d96628a]
[hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so 
[0x2f0a55e8]
[hplcnlj158:13937] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b4018c3c11b]
[hplcnlj158:13937] [ 4] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) 
[0x2b4018c3b70b]
[hplcnlj1

Re: [OMPI devel] NP64 _gather_ problem

2010-09-17 Thread Steve Wise
Yes it does.  With mpi_preconnect_mpi set to 1, NP64 doesn't stall.  So 
it's not the algorithm in and of itself, but rather some interplay 
between the algorithm and connection setup, I guess.



On 9/17/2010 5:24 AM, Terry Dontje wrote:
Does setting the mca parameter mpi_preconnect_mpi to 1 help at all?  This 
might help determine whether it is actually the connection setup between 
processes that is out of sync, as opposed to something in the gather 
algorithm itself.


--td





Re: [OMPI devel] NP64 _gather_ problem

2010-09-17 Thread Steve Wise
Does anyone have an NP64 IB cluster handy?  I'd be interested to know if IB 
behaves this way when running with the rdmacm connect method, i.e. with:


 --mca btl_openib_cpc_include rdmacm  --mca btl openib,sm,self

Steve.
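
For concreteness, a hedged example of the full command line being asked about 
(the hostfile and benchmark arguments are illustrative assumptions):

    mpirun --mca btl_openib_cpc_include rdmacm \
           --mca btl openib,sm,self \
           -np 64 -hostfile hosts ./IMB-MPI1 gather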






Re: [OMPI devel] NP64 _gather_ problem

2010-09-17 Thread Terry Dontje
Right, by default all connections are set up on the fly.  So when an 
MPI_Send is executed to a process to which there is not yet a connection, a 
connection-setup dance happens between the sender and the receiver.  Why this 
happens with np > 60 may have to do with how many connections are being set 
up at the same time, or with the destination of a connection request not 
being inside the MPI library at that moment.


It would be interesting to figure out where in the timeline of the job such 
requests are being delayed.  You can get such a timeline by using a tool like 
the Solaris Studio collector/analyzer (which actually has a Linux version).


--td

Steve Wise wrote:
Yes it does.  With mpi_preconnect_mpi set to 1, NP64 doesn't stall.  So 
it's not the algorithm in and of itself, but rather some interplay 
between the algorithm and connection setup, I guess.




Re: [OMPI devel] NP64 _gather_ problem

2010-09-17 Thread Steve Wise
I'll look into Solaris Studio.  I think the connections are somehow getting 
single-threaded or funneled by the gather algorithm.  And since each one 
takes ~160 ms to set up, and there are ~3600 connections being set up, we end 
up with a ~7 minute run time.  Now, 160 ms seems way too high for setting up 
even an iWARP connection, which has some streaming-mode TCP exchanges as part 
of connection setup; I would think it should be around a few hundred 
_usecs_.  So I'm pursuing the connect latency too.


Thanks,

Steve.

On 9/17/2010 12:13 PM, Terry Dontje wrote:
Right, by default all connections are set up on the fly.  So when an 
MPI_Send is executed to a process to which there is not yet a connection, a 
connection-setup dance happens between the sender and the receiver.  Why this 
happens with np > 60 may have to do with how many connections are being set 
up at the same time, or with the destination of a connection request not 
being inside the MPI library at that moment.


It would be interesting to figure out where in the timeline of the job such 
requests are being delayed.  You can get such a timeline by using a tool like 
the Solaris Studio collector/analyzer (which actually has a Linux version).


--td
