Re: [OMPI devel] UD BTL alltoall hangs

2007-09-21 Thread Andrew Friedley
Thanks George.  I figured out the problem (two of them, actually) based
on a pointer from Gleb (thanks, Gleb).  The UD BTL has two types of send
queues -- one per-module and one per-endpoint -- and I had missed
looking for stuck frags on the per-endpoint queues.


So something is wrong with the per-endpoint queues and their interaction
with the per-module queue.  Disabling the per-endpoint queues makes the
problem go away, and I'm not sure I liked having them in the first place.
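
Roughly, the drain logic in progress has to cover both kinds of queues,
along the lines of the sketch below (the struct and field names here are
illustrative stand-ins, not the actual ofud source):

#include <stddef.h>
#include "opal/class/opal_list.h"

/* Illustrative stand-ins; the real types live in the ofud headers. */
struct ep_sketch  { opal_list_t pending_frags; };      /* per-endpoint queue */
struct btl_sketch {
    int sd_wqe;                         /* free send WQE slots */
    opal_list_t pending_frags;          /* per-module queue */
    struct ep_sketch **endpoints;
    size_t num_endpoints;
};

/* Posts the frag to the QP and decrements btl->sd_wqe. */
extern void sketch_post_send(struct btl_sketch *btl, opal_list_item_t *frag);

/* Call whenever send completions hand WQE slots back: drain the
 * per-module queue AND every per-endpoint queue, or frags get stuck. */
static void sketch_drain_pending(struct btl_sketch *btl)
{
    while (btl->sd_wqe > 0 && !opal_list_is_empty(&btl->pending_frags)) {
        sketch_post_send(btl, opal_list_remove_first(&btl->pending_frags));
    }
    for (size_t i = 0; i < btl->num_endpoints; ++i) {
        struct ep_sketch *ep = btl->endpoints[i];
        while (btl->sd_wqe > 0 && !opal_list_is_empty(&ep->pending_frags)) {
            sketch_post_send(btl, opal_list_remove_first(&ep->pending_frags));
        }
    }
}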


But this still left a similar problem at 2KB messages.  I had static
limits set on the free list lengths, based on the btl_ofud_sd_num MCA
parameter.  Switching the max to unlimited makes this problem go away
too.  Good enough to get some runs through for now :)
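
To be concrete about what "static limits" means here: the frag free list
was capped at a size derived from btl_ofud_sd_num, roughly like this
stand-in (not the real ompi_free_list code; max < 0 plays the role of
"unlimited"):

#include <stdlib.h>

/* Stand-in for the frag free list, not the real ompi_free_list code. */
struct frag_list {
    void **cache;       /* stack of recycled frags */
    int    ncached;
    int    allocated;   /* total frags handed out so far */
    int    max;         /* cap on total frags; < 0 means unlimited */
    size_t frag_size;
};

static void *frag_alloc(struct frag_list *fl)
{
    if (fl->ncached > 0) {
        return fl->cache[--fl->ncached];
    }
    if (fl->max >= 0 && fl->allocated >= fl->max) {
        return NULL;    /* cap (derived from sd_num) hit -- caller must wait */
    }
    fl->allocated++;
    return malloc(fl->frag_size);
}

With max tied to btl_ofud_sd_num the allocation path can start returning
NULL; with no cap it never does, which matches the 2KB problem going away.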


Andrew


Re: [OMPI devel] UD BTL alltoall hangs

2007-09-21 Thread George Bosilca

Andrew,

There is an option in the message queue support that allows you to see
all internal pending requests. On the current trunk, edit the file
ompi/debuggers/ompi_dll.c at line 736 and set
p_info->show_internal_requests to 1. Now compile and install it, and
then restart TotalView. You should be able to get access to all pending
requests, even those created by the collective modules.
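
In other words, the change is just flipping that flag (the exact line
number will drift as the trunk moves):

/* ompi/debuggers/ompi_dll.c, around line 736 */
p_info->show_internal_requests = 1;   /* default hides internal requests */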


Moreover, the missing sends should be somewhere. If they are not in the
BTL, and if they are not completed, then hopefully they are in the PML,
on the send_pending list. As the collective works on all other BTLs, I
suppose the communication pattern is correct, so there is something
happening with the requests when using the UD BTL.
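
A quick way to check that from the debugger is to evaluate something like
the helper below (it assumes the mca_pml_ob1 global and its send_pending
list as on the current trunk; adjust the names if they have moved):

#include <stdio.h>
#include "opal/class/opal_list.h"
#include "ompi/mca/pml/ob1/pml_ob1.h"

/* Debug-only: how many sends are parked on ob1's send_pending list? */
void dbg_count_ob1_send_pending(void)
{
    printf("ob1 send_pending: %lu queued sends\n",
           (unsigned long) opal_list_get_size(&mca_pml_ob1.send_pending));
}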


If the requests are not in the PML send_pending queue, the next thing
you can do is modify the receive handlers in the OB1 PML and print all
incoming match headers. You will have to somehow sort the output, but at
least you can figure out what is happening with the missing messages.
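
For example, something along these lines dropped into the ob1 receive
callback would log every incoming match header (the field names assume
the ob1 match header of the current trunk; adjust as needed):

#include <stdio.h>
#include "ompi/mca/pml/ob1/pml_ob1_hdr.h"

/* Debug-only print of an incoming match header; call it from the ob1
 * fragment callback before matching. */
static void dbg_print_match_hdr(int myrank,
                                const mca_pml_ob1_match_hdr_t *hdr)
{
    fprintf(stderr, "[%d] match hdr: ctx=%u src=%d tag=%d seq=%u\n",
            myrank, (unsigned) hdr->hdr_ctx, (int) hdr->hdr_src,
            (int) hdr->hdr_tag, (unsigned) hdr->hdr_seq);
}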


  george.


Re: [OMPI devel] UD BTL alltoall hangs

2007-09-11 Thread Andrew Friedley
First off, I've managed to reproduce this with nbcbench using only 16 
procs (two per node), and setting btl_ofud_sd_num to 12 -- eases 
debugging with fewer procs to look at.


ompi_coll_tuned_alltoall_intra_basic_linear is the alltoall routine being
called.  What I'm seeing from TotalView is that some random number of
procs (usually 1-5, varying from run to run) are sitting with a send and
a recv outstanding to every other proc.  The other procs, however, have
moved on to the next collective.  This is hard to see with the default
nbcbench code since it only calls alltoall repeatedly -- adding a barrier
after the MPI_Alltoall() call makes it easier to see, as the barrier has
a different tag number and communication pattern.  So what I see is a
few procs stuck in the alltoall, while the rest are waiting in the
following barrier.
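
The benchmark-loop change is nothing more than something like this (a
sketch, not the actual nbcbench source):

#include <mpi.h>

/* Run the alltoall under test, then a barrier whose different tag and
 * communication pattern makes any ranks still stuck in the alltoall
 * stand out in TotalView. */
static void alltoall_with_debug_barrier(void *sbuf, void *rbuf, int count,
                                        int iters)
{
    for (int i = 0; i < iters; i++) {
        MPI_Alltoall(sbuf, count, MPI_BYTE, rbuf, count, MPI_BYTE,
                     MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);   /* added for debugging only */
    }
}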


I've also verified with TotalView that there are no outstanding send
WQEs at the UD BTL, and all procs are polling progress.  The procs in
the alltoall are polling in the opal_condition_wait() called from
ompi_request_wait_all().


Not sure what to ask or where to look further, other than: what should I
look at to see which requests are outstanding in the PML?


Andrew



Re: [OMPI devel] UD BTL alltoall hangs

2007-08-30 Thread Andrew Friedley

George Bosilca wrote:
Until then you should use the command "tv8 mpirun -a -np 2 -bynode
`pwd`/NPmpi".  The `pwd` is really important for some reason; otherwise
TotalView is unable to find the executable.  The problem is that the
name of the process will be "./NPmpi" and TotalView does not have access
to the path where the executable was launched (at least that's the
reason, I think).




Thanks George.  That works, except for one catch: when I'm asked on
startup if I want to stop the parallel job (and hit yes), TotalView
waits forever trying to connect to a remote server.  I see this in the
xterm (shortened in a few places):


Launching TotalView Debugger Servers with command:
srun --jobid=0 -N1 -n1 -w`awk -F. 'BEGIN {ORS=","} {if (NR==1) ORS=""; 
print $1}' $PWD/TVT1Pa4Fjm` -l --input=none 
/usr/global/tools/totalview.8.1.0-1/linux-x86-64/bin/tvdsvr 
-callback_host atlas34 -callback_ports atlas31:16382 -set_pws 
47319a24:4688a7a2 -verbosity info -working_directory $PWD/NetPIPE_3.6.2

srun: error: Invalid numeric value "0" for jobid.

I got around this by hitting cancel in the 'waiting to connect' dialog,
then setting my SLURM jobid manually in File -> Preferences -> Bulk
Launch -> Command instead of the %J filler, and restarting.  Is there a
better workaround for this?


Andrew


Re: [OMPI devel] UD BTL alltoall hangs

2007-08-29 Thread George Bosilca


On Aug 29, 2007, at 7:05 PM, Andrew Friedley wrote:


$ mpirun -debug -np 2 -bynode -debug-daemons ./NPmpi
--------------------------------------------------------------------------
Internal error -- the orte_base_user_debugger MCA parameter was not able to
be found.  Please contact the Open RTE developers; this should not
happen.
--------------------------------------------------------------------------

Grepping for that param in ompi_info shows:

 MCA orte: parameter "orte_base_user_debugger" (current value:
   "totalview @mpirun@ -a @mpirun_args@ : ddt -n @np@
   -start @executable@ @executable_argv@ @single_app@ :
   fxp @mpirun@ -a @mpirun_args@")


This has been broken for a while. It's a long story to explain, but a
fix is on the way.


Until then you should use the command "tv8 mpirun -a -np 2 -bynode
`pwd`/NPmpi".  The `pwd` is really important for some reason; otherwise
TotalView is unable to find the executable.  The problem is that the
name of the process will be "./NPmpi" and TotalView does not have access
to the path where the executable was launched (at least that's the
reason, I think).


Once you do this, you should be good to go.

  george.


Re: [OMPI devel] UD BTL alltoall hangs

2007-08-29 Thread Andrew Friedley
Thanks for the suggestion, though that appears to hang with no output
whatsoever.


Andrew

Aurelien Bouteiller wrote:

You should try mpirun -np 2 -bynode totalview ./NPmpi

Aurelien

Re: [OMPI devel] UD BTL alltoall hangs

2007-08-29 Thread Aurelien Bouteiller

You should try mpirun -np 2 -bynode totalview ./NPmpi

Aurelien

Re: [OMPI devel] UD BTL alltoall hangs

2007-08-29 Thread Andrew Friedley
OK, I've never used TotalView before, so after some FAQ reading I got an
xterm on an Atlas node (odin doesn't have TotalView, AFAIK).  Trying a
simple NetPIPE run just to get familiar with things results in this:


$ mpirun -debug -np 2 -bynode -debug-daemons ./NPmpi
--------------------------------------------------------------------------
Internal error -- the orte_base_user_debugger MCA parameter was not able to
be found.  Please contact the Open RTE developers; this should not
happen.
--------------------------------------------------------------------------

Grepping for that param in ompi_info shows:

MCA orte: parameter "orte_base_user_debugger" (current value:
  "totalview @mpirun@ -a @mpirun_args@ : ddt -n @np@
  -start @executable@ @executable_argv@ @single_app@ :
  fxp @mpirun@ -a @mpirun_args@")

What's going on?  I also tried running TotalView directly, using a line
like this:

totalview mpirun -a -np 2 -bynode -debug-daemons ./NPmpi

TotalView comes up and seems to be debugging the mpirun process, with
only one thread.  It doesn't seem to be aware that this is an MPI job
with other MPI processes... any ideas?


Andrew



Re: [OMPI devel] UD BTL alltoall hangs

2007-08-28 Thread George Bosilca
The first step will be to figure out which version of the alltoall
you're using.  I suppose you use the default parameters, in which case
the decision function in the tuned component says it is using the linear
all-to-all.  As the name states, this means that every node will post
one receive from every other node and then start sending the respective
fragment to every other node.  This will lead to a lot of outstanding
sends and receives.  I doubt that the receives can cause a problem, so I
expect the problem is coming from the send side.
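
Roughly, the linear algorithm is the moral equivalent of the following
(a sketch of the pattern, not the actual
ompi_coll_tuned_alltoall_intra_basic_linear source):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Basic linear alltoall pattern: post a receive from every other rank,
 * then a send to every other rank, then wait on all of them -- so up to
 * 2*(size-1) requests are outstanding per process. */
static int linear_alltoall_sketch(const char *sbuf, char *rbuf, int bytes,
                                  MPI_Comm comm)
{
    int rank, size, nreq = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Request *reqs = malloc(2 * (size - 1) * sizeof(MPI_Request));

    /* Our own block is a local copy. */
    memcpy(rbuf + (size_t)rank * bytes, sbuf + (size_t)rank * bytes, bytes);

    for (int peer = 0; peer < size; peer++) {
        if (peer == rank) continue;
        MPI_Irecv(rbuf + (size_t)peer * bytes, bytes, MPI_BYTE,
                  peer, 0, comm, &reqs[nreq++]);
    }
    for (int peer = 0; peer < size; peer++) {
        if (peer == rank) continue;
        MPI_Isend((void *)(sbuf + (size_t)peer * bytes), bytes, MPI_BYTE,
                  peer, 0, comm, &reqs[nreq++]);
    }

    int rc = MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
    return rc;
}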


Do you have TotalView installed on odin?  If yes, there is a simple way
to see how many sends are pending and where... That might pinpoint [at
least] the process where you should look to see what's wrong.


  george.


[OMPI devel] UD BTL alltoall hangs

2007-08-28 Thread Andrew Friedley
I'm having a problem with the UD BTL and hoping someone might have some 
input to help solve it.


What I'm seeing is hangs when running alltoall benchmarks with nbcbench 
or an LLNL program called mpiBench -- both hang exactly the same way. 
With the code on the trunk running nbcbench on IU's odin using 32 nodes 
and a command line like this:


mpirun -np 128 -mca btl ofud,self ./nbcbench -t MPI_Alltoall -p 128-128 
-s 1-262144


hangs consistently when testing 256-byte messages.  There are two things 
I can do to make the hang go away until running at larger scale.  First 
is to increase the 'btl_ofud_sd_num' MCA param from its default value of 
128.  This allows you to run with more procs/nodes before hitting the 
hang, but AFAICT doesn't fix the actual problem.  What this parameter 
does is control the maximum number of outstanding send WQEs posted at 
the IB level -- when the limit is reached, frags are queued on an 
opal_list_t and later sent by progress as IB sends complete.
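
For reference, the WQE accounting works roughly like the sketch below --
the struct and function names are illustrative stand-ins, not the actual
btl_ofud source:

#include "opal/class/opal_list.h"

struct ud_btl_sketch {
    int sd_wqe;                 /* free send WQE slots, starts at btl_ofud_sd_num */
    opal_list_t pending_frags;  /* frags waiting for a free WQE slot */
};

/* Stand-in for the ibv_post_send() wrapper. */
extern void sketch_post_ib_send(opal_list_item_t *frag);

static void sketch_send(struct ud_btl_sketch *btl, opal_list_item_t *frag)
{
    if (btl->sd_wqe > 0) {
        btl->sd_wqe--;
        sketch_post_ib_send(frag);
    } else {
        /* WQE limit reached: park the frag; progress re-posts it later. */
        opal_list_append(&btl->pending_frags, frag);
    }
}

/* Called from component progress for each send completion reaped from
 * the CQ: return the WQE slot, then immediately reuse it for a queued frag. */
static void sketch_send_complete(struct ud_btl_sketch *btl)
{
    btl->sd_wqe++;
    if (!opal_list_is_empty(&btl->pending_frags)) {
        btl->sd_wqe--;
        sketch_post_ib_send(opal_list_remove_first(&btl->pending_frags));
    }
}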


The other way I've found is to play games with calling 
mca_btl_ud_component_progress() in mca_btl_ud_endpoint_post_send().  In 
fact I replaced the CHECK_FRAG_QUEUES() macro used around 
btl_ofud_endpoint.c:77 with a version that loops on progress until a 
send WQE slot is available (as opposed to queueing).  Same result -- I 
can run at larger scale, but still hit the hang eventually.
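
The spin-instead-of-queue experiment amounts to something like this (a
sketch of the change, not the actual patch; the variable names are
illustrative):

/* In place of the queueing CHECK_FRAG_QUEUES() logic around
 * btl_ofud_endpoint.c:77: spin on component progress until a send WQE
 * slot frees up, then post the frag directly. */
while (ud_btl->sd_wqe <= 0) {
    mca_btl_ud_component_progress();   /* reaps completions, frees WQE slots */
}
/* ... fall through to the normal ibv_post_send() path ... */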


It appears that when the job hangs, progress is being polled very 
quickly, and after spinning for a while there are no outstanding send 
WQEs or queued sends in the BTL.  I'm not sure where further up things 
are spinning/blocking, as I can't produce the hang at less than 32 nodes 
/ 128 procs and don't have a good way of debugging that (suggestions 
appreciated).


Furthermore, both ob1 and dr PMLs result in the same behavior, except 
that DR eventually trips a watchdog timeout, fails the BTL, and 
terminates the job.


Other collectives such as allreduce and allgather do not hang -- only 
alltoall.  I can also reproduce the hang on LLNL's Atlas machine.


Can anyone else reproduce this (Torsten might have to make a copy of 
nbcbench available)?  Anyone have any ideas as to what's wrong?


Andrew