Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-10 Thread Sylvain Jeaugey

Hi Jeff,

Thanks for jumping in.

On Tue, 9 Jun 2009, Jeff Squyres wrote:

2. Note that your solution presupposes that one MPI process can detect that 
the entire job is deadlocked.  This is not quite correct.  What exactly do 
you want to detect -- that one process may be imbalanced on its receives 
(waiting for long periods of time without doing anything), or that the entire 
job is deadlocked?  The former may be ok -- it depends on the app.  If the 
latter, it requires a bit more work -- e.g., if one process detects that 
nothing has happened for a long time, it can initiate a 
collective/distributed deadlock detection algorithm with all the other MPI 
processes in the job.  Only if *all* processes agree, then you can say "this 
job is deadlocked, we might as well abort."  IIRC, there are some 3rd party 
tools / libraries that do this kind of stuff...?  (although it might be cool 
/ useful to incorporate some of this technology into OMPI itself)
My approach was based on a per-process detection. Of course this does not 
indicate that the job is stuck, but tools like ganglia will quickly show 
you whether all processes are in the "sleep" state or not (maybe combined 
with debugging tools, to see if all are really in MPI and not blocked in 
I/O or something). Then, the user or the admin can decide whether to 
abort the job or not. The "sleep" was only a way for me to bring the 
information to the user/admin. But as Ralph stated, a log would be even 
better in this case (more precise, no performance penalty, ...), although 
it needs to be coupled with other tools (whereas the sleep was naturally 
coupled with ganglia).
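
(For illustration only: the "only if all processes agree" check Jeff 
describes could look roughly like the application-level sketch below. 
The function name and threshold are made up, MPI_LAND over an "idle" flag 
is just one way to express the agreement, and of course a rank stuck 
inside a blocking call could never reach such code -- which is why a real 
implementation would need an out-of-band path, e.g. through the HNP.)

#include <mpi.h>

/* Sketch only: each rank decides locally whether it has been idle too
 * long; the job is declared deadlocked only if every rank agrees. */
int job_looks_deadlocked(double last_progress_time, double idle_threshold_sec)
{
    int local_idle = (MPI_Wtime() - last_progress_time) > idle_threshold_sec;
    int all_idle = 0;

    MPI_Allreduce(&local_idle, &all_idle, 1, MPI_INT, MPI_LAND,
                  MPI_COMM_WORLD);
    return all_idle;
}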


3. As Ralph noted, how exactly do you know when "nothing happens for a long 
time" is a bad thing?  a) some codes are structured that way -- that they'll 
have no MPI activity for a long time, even if they have pending non-blocking 
receives pre-posted.  b) are you looking within the scope of *one* MPI 
blocking call?  I.e., if nothing happens *within the span of one blocking MPI 
call*, or are you looking if nothing happens across successive calls to 
opal_progress() (which may be few and far between after OMPI hits steady 
state when using non-TCP networks)?  It seems like there would need to be a 
[thread safe] "reset" at some point -- indicating that something has 
happened.  That either would be when something has happened, or that a 
blocking MPI call has exited, or ?  Need to make sure that that "reset" 
doesn't get expensive.
Uh. This is way more complicated than my patch. From the various 
reactions, it seems my RFC is misleading. I only work in 
opal_condition_wait(), which calls opal_progress(). The idea was only to 
sleep when we had been blocked in an MPI_Wait (or similar) for a long 
time. So, we sleep only if there is no possible background computation: 
the MPI process is waiting and basically doing nothing else. MPI_Test 
functions will never call sleep. Whether opal_progress() actually made 
progress does not matter; the only question is: how long have we been in 
opal_condition_wait()?
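
(A minimal sketch of that idea, not the actual patch: opal_condition_wait() 
is far more involved, and progress_once(), the threshold and the sleep 
length below are placeholders. The point is simply that the elapsed time 
spent in the wait loop is what triggers the low-pressure mode.)

#include <time.h>
#include <unistd.h>

#define IDLE_THRESHOLD_SEC 600          /* e.g. 10 minutes without completing */

static int progress_once(void)          /* stand-in for opal_progress() */
{
    return 0;                           /* 0 = nothing completed this pass */
}

/* Spin on the progress engine; once we have been blocked "for a long
 * time", stop burning CPU (this is also where a log or a message to the
 * HNP could go instead of the sleep). */
static void blocking_wait(volatile const int *completed)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);

    while (!*completed) {
        progress_once();
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now.tv_sec - start.tv_sec >= IDLE_THRESHOLD_SEC) {
            usleep(1000);
        }
    }
}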


So, what I would want to do now is to replace the sleep with a message sent 
to the HNP indicating "I'm blocked for X minutes", then X minutes later 
"I'm blocked for 2X minutes", etc.


The HNP would then aggregate those messages and when every process has 
sent one, log "Everyone is blocked for X minutes", then (I presume) X 
minutes later, "Everyone is blocked for 2X minutes", etc.


I would then let users, admin or admin tools decide whether or not to 
abort the job.


If a process finally receives something, it should send a message to the HNP 
indicating that it is no longer blocked; or maybe just looking at the logs 
would suffice to see whether block times keep increasing or not.
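
(Purely as a sketch of the aggregation the HNP side could do -- a 
hypothetical helper, no real ORTE calls: it records the highest "blocked 
for k*X minutes" level reported by each rank, logs once every rank has 
reached a given level, and resets when a rank reports it is no longer 
blocked.)

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int nprocs;
    int *level;         /* highest "blocked for level*X minutes" per rank */
    int logged_level;   /* last level for which "everyone is blocked" was logged */
} blocked_tracker_t;

static void tracker_init(blocked_tracker_t *t, int nprocs)
{
    t->nprocs = nprocs;
    t->level = calloc(nprocs, sizeof(int));
    t->logged_level = 0;
}

/* level == 0 means "rank is no longer blocked" */
static void tracker_report(blocked_tracker_t *t, int rank, int level,
                           int x_minutes)
{
    t->level[rank] = level;

    int min_level = t->level[0];
    for (int i = 1; i < t->nprocs; i++) {
        if (t->level[i] < min_level) min_level = t->level[i];
    }

    if (min_level > t->logged_level) {
        t->logged_level = min_level;
        printf("Everyone is blocked for %d minutes\n", min_level * x_minutes);
    } else if (level == 0) {
        t->logged_level = 0;            /* someone woke up; reset */
    }
}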


Since I'm only working on opal_condition_wait(), deadlocks in applications 
using only MPI_Test calls will not be detected (but is that possible in 
the first place?).
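
(It is at least conceivable: for example, ranks that each busy-poll 
MPI_Test on a receive whose matching send never arrives will spin forever 
without ever entering opal_condition_wait(). A contrived sketch, for 
illustration only:)

#include <mpi.h>

int main(int argc, char **argv)
{
    int buf = 0, flag = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);

    /* Every rank polls for a message that nobody ever sends: the job is
     * stuck, yet no rank ever blocks inside the library. */
    MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
    while (!flag) {
        MPI_Test(&req, &flag, &status);
    }

    MPI_Finalize();
    return 0;
}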


Sylvain


On Jun 9, 2009, at 6:43 AM, Ralph Castain wrote:


Couple of other things to help stimulate the thinking:

1. it isn't that OMPI -couldn't- receive a message, but rather that it 
-didn't- receive a message. This may or may not indicate that there is a 
problem. Could just be an application that doesn't need to communicate for 
awhile, as per my example. I admit, though, that 10 minutes is a tad 
long...but I've seen some bizarre apps around here :-)


2. instead of putting things to sleep or even adjusting the loop rate, you 
might want to consider using the orte_notifier capability and notify the 
system that the job may be stalled. Or perhaps adding an API to the 
orte_errmgr framework to notify it that nothing has been received for 
awhile, and let people implement different strategies for detecting what 
might be "wrong" and what they want to do about it.
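
(To make the shape of that idea concrete, a purely hypothetical sketch -- 
none of these names are real ORTE or errmgr APIs: the detection side only 
raises a "stalled" event, and pluggable handlers decide whether to log, 
notify, abort, or do nothing.)

#include <stdio.h>

#define MAX_STALL_HANDLERS 8

typedef void (*stall_handler_fn)(int rank, double idle_seconds);

static stall_handler_fn stall_handlers[MAX_STALL_HANDLERS];
static int num_stall_handlers = 0;

static int register_stall_handler(stall_handler_fn fn)
{
    if (num_stall_handlers >= MAX_STALL_HANDLERS) return -1;
    stall_handlers[num_stall_handlers++] = fn;
    return 0;
}

/* Called by the detection code when nothing has been received for a while;
 * it does not decide anything itself. */
static void report_stall(int rank, double idle_seconds)
{
    for (int i = 0; i < num_stall_handlers; i++) {
        stall_handlers[i](rank, idle_seconds);
    }
}

/* One possible strategy: just log it and let a human (or a site tool) decide. */
static void log_stall(int rank, double idle_seconds)
{
    fprintf(stderr, "rank %d: no MPI progress for %.0f seconds\n",
            rank, idle_seconds);
}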


My point with this second bullet is that there are other response options 
than hard-wiring the process to sleep. You could let someone know so 
a human can decide what to do.

[OMPI devel] Does open MPI support nodes behind NAT or Firewall

2009-06-10 Thread Anjin Pradhan
Hi Everyone,

I wanted to know whether Open MPI supports nodes that are behind a NAT or a 
firewall.

If it doesn't do this by default, can anyone let me know how I should go 
about making Open MPI work across NAT and firewalls?

LEO




Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ashley Pittman

Ralph,

If I may say this is exactly the type of problem the tool I have been
working on recently aims to help with and I'd be happy to help you
through it.

Firstly I'd say that of the three collectives you mention, MPI_Allgather,
MPI_Reduce and MPI_Bcast, one exhibits a many-to-many, one a many-to-one
and the last a one-to-many communication pattern.  The scenario of a
root process falling behind and getting swamped in comms is a plausible
one for MPI_Reduce only but doesn't hold water with the other two.  You
also don't mention if the loop is over a single collective or if you
have a loop calling a number of different collectives each iteration.

padb, the tool I've been working on, has the ability to look at parallel
jobs and report on the state of collective comms, and should help narrow
down which processes are erroneous and which are simply blocked waiting
for comms.  I'd recommend using it to look at maybe four or five instances
where the application has hung and looking for any common features between
them.

Let me know if you are willing to try this route and I'll talk you through
it; the code is downloadable from http://padb.pittman.org.uk and if you want
the full collective functionality you'll need to patch Open MPI with the
patch from http://padb.pittman.org.uk/extensions.html

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Hi Ashley

Thanks! I would definitely be interested and will look at the tool.
Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph


On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman wrote:

>
> Ralph,
>
> If I may say this is exactly the type of problem the tool I have been
> working on recently aims to help with and I'd be happy to help you
> through it.
>
> Firstly I'd say of the three collectives you mention, MPI_Allgather,
> MPI_Reduce and MPI_Bcast one exhibit a many-to-many, one a many-to-one
> and the last a many-to-one communication pattern.  The scenario of a
> root process falling behind and getting swamped in comms is a plausible
> one for MPI_Reduce only but doesn't hold water with the other two.  You
> also don't mention if the loop is over a single collective or if you
> have loop calling a number of different collectives each iteration.
>
> padb, the tool I've been working on has the ability to look at parallel
> jobs and report on the state of collective comms and should help narrow
> you down on erroneous processes and those simply blocked waiting for
> comms.  I'd recommend using it to look at maybe four or five instances
> where the application has hung and look for any common features between
> them.
>
> Let me know if you are willing to try this route and I'll talk, the code
> is downloadable from http://padb.pittman.org.uk and if you want the full
> collective functionality you'll need to patch openmp with the patch from
> http://padb.pittman.org.uk/extensions.html
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ashley Pittman
On Wed, 2009-06-10 at 09:07 -0600, Ralph Castain wrote:
> Hi Ashley
> 
> Thanks! I would definitely be interested and will look at the tool.

Great.  My plan was to introduce the tool to this list today or tomorrow
anyway, but this problem falls right in its target area so I brought
it up early.

>  Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
> 
> https://svn.open-mpi.org/trac/ompi/ticket/1944

One thing that springs to mind: does the Fortran reproducer hang on
other machines if you use the same process geometry?  That would tell us
whether we are looking at a pure Open MPI problem or a wider issue,
potentially eliminating any questions about NUMA memory layout.

> Will be back after I look at the tool.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Sylvain Jeaugey
Hmm, very glad that padb works with Open MPI; I couldn't live without it. 
In my opinion, it is the best debugging tool for parallel applications and, 
more importantly, the only one that scales.


About the issue, I couldn't reproduce it on my platform (tried 2 nodes 
with 2 to 8 processes each, nodes are twin 2.93 GHz Nehalem, IB is 
Mellanox QDR).


So my feeling about that is that it may be very hardware related. 
Especially if you use the hierarch component, some transactions will be 
done through RDMA on one side and read directly through shared memory on 
the other side, which can, depending on the hardware, produce very 
different timings and bugs. Did you try with a different collective 
component (i.e. not hierarch)? Or with another interconnect? [Yes, of 
course, if it is a race condition, we might well avoid the bug because 
timings will be different, but that's still information.]


Perhaps none of this makes sense, or you have already thought about it; 
anyway, if you want me to try different things, just let me know.


Sylvain

On Wed, 10 Jun 2009, Ralph Castain wrote:


Hi Ashley

Thanks! I would definitely be interested and will look at the tool. Meantime, I 
have filed a bunch of data on this in
ticket #1944, so perhaps you might take a glance at that and offer some 
thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph


On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman  wrote:

  Ralph,

  If I may say this is exactly the type of problem the tool I have been
  working on recently aims to help with and I'd be happy to help you
  through it.

  Firstly I'd say of the three collectives you mention, MPI_Allgather,
  MPI_Reduce and MPI_Bcast one exhibit a many-to-many, one a many-to-one
  and the last a many-to-one communication pattern.  The scenario of a
  root process falling behind and getting swamped in comms is a plausible
  one for MPI_Reduce only but doesn't hold water with the other two.  You
  also don't mention if the loop is over a single collective or if you
  have loop calling a number of different collectives each iteration.

  padb, the tool I've been working on has the ability to look at parallel
  jobs and report on the state of collective comms and should help narrow
  you down on erroneous processes and those simply blocked waiting for
  comms.  I'd recommend using it to look at maybe four or five instances
  where the application has hung and look for any common features between
  them.

  Let me know if you are willing to try this route and I'll talk, the code
  is downloadable from http://padb.pittman.org.uk and if you want the full
  collective functionality you'll need to patch openmp with the patch from
  http://padb.pittman.org.uk/extensions.html

  Ashley,

  --

  Ashley Pittman, Bath, UK.

  Padb - A parallel job inspection tool for cluster computing
  http://padb.pittman.org.uk






Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Bogdan Costescu

On Wed, 10 Jun 2009, Ralph Castain wrote:


Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944


I wasn't able to reproduce this. I have run with the following setup:
- OS is Scientific Linux 5.1 with a custom compiled kernel based on 
2.6.22.19, but (due to circumstances that I can't control):


checking if MCA component maffinity:libnuma can compile... no

- Intel compiler 10.1
- OpenMPI 1.3.2
- nodes have 2 CPUs of type E5440 (quad core), 16GB RAM and a ConnectX 
IB DDR


I've used the platform file that you have provided, but took out the 
references to PanFS and fixed the paths. I've also used the MCA file 
that you have provided.


I have run with nodes=1:ppn=8 and nodes=2:ppn=8 and the test finished 
successfully with m=50 several times. This, together with the earlier 
post also describing a negative result, points to a problem related to 
your particular setup...


--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
I appreciate the input and have captured it in the ticket. Since this
appears to be a NUMA-related issue, the lack of NUMA support in your setup
makes the test difficult to interpret.

I agree, though, that this is likely something peculiar to our particular
setup. Of primary concern is that it might be related to the relatively old
kernel (2.6.18) on these machines. There has been a lot of change since that
kernel was released, and some of those changes may be relevant to this
problem.

Unfortunately, upgrading the kernel will take a persuasive argument. We are
going to try to run the reproducers on machines with more modern kernels to
see if we get different behavior.

Please feel free to follow this further on the ticket.

Thanks again!
Ralph


On Wed, Jun 10, 2009 at 11:29 AM, Bogdan Costescu <
bogdan.coste...@iwr.uni-heidelberg.de> wrote:

> On Wed, 10 Jun 2009, Ralph Castain wrote:
>
>  Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
>> you might take a glance at that and offer some thoughts?
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/1944
>>
>
> I wasn't able to reproduce this. I have run with the following setup:
> - OS is Scientific Linux 5.1 with a custom compiled kernel based on
> 2.6.22.19, but (due to circumstances that I can't control):
>
> checking if MCA component maffinity:libnuma can compile... no
>
> - Intel compiler 10.1
> - OpenMPI 1.3.2
> - nodes have 2 CPUs of type E5440 (quad core), 16GB RAM and a ConnectX IB
> DDR
>
> I've used the platform file that you have provided, but took out the
> references to PanFS and fixed the paths. I've also used the MCA file that
> you have provided.
>
> I have run with nodes=1:ppn=8 and nodes=2:ppn=8 and the test finished
> successfully with m=50 several times. This, together with the earlier post
> also describing a negative result, points to a problem related to your
> particular setup...
>
> --
> Bogdan Costescu
>
> IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
> Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
> E-mail: bogdan.coste...@iwr.uni-heidelberg.de
>


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Much appreciated!

Per some of my other comments on this thread and on the referenced ticket,
can you tell me what kernel you have on that machine? I assume you have NUMA
support enabled, given that chipset?

Thanks!
Ralph

On Wed, Jun 10, 2009 at 10:29 AM, Sylvain Jeaugey
wrote:

> Hum, very glad that padb works with Open MPI, I couldn't live without it.
> In my opinion, the best debug tool for parallel applications, and more
> importantly, the only one that scales.
>
> About the issue, I couldn't reproduce it on my platform (tried 2 nodes with
> 2 to 8 processes each, nodes are twin 2.93 GHz Nehalem, IB is Mellanox QDR).
>
> So my feeling about that is that is may be very hardware related.
> Especially if you use the hierarch component, some transactions will be done
> through RDMA on one side and read directly through shared memory on the
> other side, which can, depending on the hardware, produce very different
> timings and bugs. Did you try with a different collective component (i.e.
> not hierarch) ? Or with another interconnect ? [Yes, of course, if it is a
> race condition, we might well avoid the bug because timings will be
> different, but that's still information]
>
> Perhaps all what I'm saying makes no sense or you already thought about
> this, anyway, if you want me to try different things, just let me know.
>
> Sylvain
>
>
> On Wed, 10 Jun 2009, Ralph Castain wrote:
>
>  Hi Ashley
>>
>> Thanks! I would definitely be interested and will look at the tool.
>> Meantime, I have filed a bunch of data on this in
>> ticket #1944, so perhaps you might take a glance at that and offer some
>> thoughts?
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/1944
>>
>> Will be back after I look at the tool.
>>
>> Thanks again
>> Ralph
>>
>>
>> On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman 
>> wrote:
>>
>>  Ralph,
>>
>>  If I may say this is exactly the type of problem the tool I have been
>>  working on recently aims to help with and I'd be happy to help you
>>  through it.
>>
>>  Firstly I'd say of the three collectives you mention, MPI_Allgather,
>>  MPI_Reduce and MPI_Bcast one exhibit a many-to-many, one a
>> many-to-one
>>  and the last a many-to-one communication pattern.  The scenario of a
>>  root process falling behind and getting swamped in comms is a
>> plausible
>>  one for MPI_Reduce only but doesn't hold water with the other two.
>>  You
>>  also don't mention if the loop is over a single collective or if you
>>  have loop calling a number of different collectives each iteration.
>>
>>  padb, the tool I've been working on has the ability to look at
>> parallel
>>  jobs and report on the state of collective comms and should help
>> narrow
>>  you down on erroneous processes and those simply blocked waiting for
>>  comms.  I'd recommend using it to look at maybe four or five
>> instances
>>  where the application has hung and look for any common features
>> between
>>  them.
>>
>>  Let me know if you are willing to try this route and I'll talk, the
>> code
>>  is downloadable from http://padb.pittman.org.uk and if you want the
>> full
>>  collective functionality you'll need to patch openmp with the patch
>> from
>>  http://padb.pittman.org.uk/extensions.html
>>
>>  Ashley,
>>
>>  --
>>
>>  Ashley Pittman, Bath, UK.
>>
>>  Padb - A parallel job inspection tool for cluster computing
>>  http://padb.pittman.org.uk
>>
>


[OMPI devel] padb and orte

2009-06-10 Thread Ashley Pittman

All,

As mentioned in another thread, I've recently ported padb, a command-line
job inspection tool (kind of like a parallel debugger), to orte and
Open MPI.  Padb is an existing, stable product which has worked for a
number of years on Slurm and RMS; orte support is new and not widely
tested yet, although it works for all the cases I've tried.

For those who haven't used it, padb is an open-source command-line tool
which, among other things, can collect stack traces, display MPI message
queues and present a lot of process information about parallel jobs to
the user in an accessible way.

Ideally padb will find its place within the day-to-day workings of
Open MPI developers and become a recommended tool for users as well.  It
also has a mode where it can be launched automatically to gather
information about job hangs without human intervention; I'd be willing
to work with the Open MPI team to integrate this into the MTT code if
desired.

I would encourage you to download it and try it out.  If it works for you
and you like it, that's great; if not, let me know and I'll do what I can
to fix it.  There is a website and public mailing lists for padb issues,
or I am happy to discuss orte-specific issues on this list.

The website is at http://padb.pittman.org.uk and I welcome any feedback,
either here, off-list or on either of the padb mailing lists.

Yours,

Ashley Pittman,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Bogdan Costescu

On Wed, 10 Jun 2009, Ralph Castain wrote:

I appreciate the input and have captured it in the ticket. Since 
this appears to be a NUMA-related issue, the lack of NUMA support in 
your setup makes the test difficult to interpret.


Based on this reasoning, disabling libnuma support in your Open MPI 
build should also solve the problem, or do I interpret things the 
wrong way?


--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Well, it would - except then -all- the procs would run real slow! :-)

Still, might be a reasonable diagnostic step to try...will give it a shot.

On Wed, Jun 10, 2009 at 1:12 PM, Bogdan Costescu <
bogdan.coste...@iwr.uni-heidelberg.de> wrote:

> On Wed, 10 Jun 2009, Ralph Castain wrote:
>
>  I appreciate the input and have captured it in the ticket. Since this
>> appears to be a NUMA-related issue, the lack of NUMA support in your setup
>> makes the test difficult to interpret.
>>
>
> Based on this reasoning, disabling libnuma support in your OpenMPI build
> should also solve the problem, or do I interpret things the wrong way ?
>
>
> --
> Bogdan Costescu
>
> IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
> Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
> E-mail: bogdan.coste...@iwr.uni-heidelberg.de