Re: [OMPI users] "self scheduled" work & mpi receive???

2010-09-24 Thread Lewis, Ambrose J.
Good points...I'll see if anything can be done to speed up the master.  If we 
can shrink the number of MPI processes without hurting overall throughput maybe 
I could save enough to fit another run on the freed cores.  Thanks for the 
ideas!
I was also worried about contention on the nodes since I'm running multiple MPI 
processes on the same multi-core box.  A typical run is 120 MPI processes on 5 
nodes, each with 24 cores. I may play a little with the "--bynode" parameter to 
see if this has any (significant) effect.
THANXS
amb


-Original Message-
From: users-boun...@open-mpi.org on behalf of Richard Treumann
Sent: Fri 9/24/2010 9:16 AM
To: Open MPI Users
Subject: Re: [OMPI users] "self scheduled" work & mpi receive???
 
Amb 

It sounds like you have more workers than you can keep fed. Workers are 
finishing up and requesting their next assignment but sit idle because 
there are so many other idle workers too.

Load balance does not really matter if the choke point is the master.  The 
work is being done as fast as the master can hand it out.

Consider using fewer workers and seeing if your load balance improves and 
your total thruput stays the same. If you want to use all the workers you 
have efficiently, you need to find a way to make the master deliver 
assignments as fast as workers finish them. 

Compute processes do not care about fairness. Having half the processes 
busy 100% of the time and the other half idle  vs. having all the 
processes busy 50% of the time gives the same thruput and the hard workers 
will not complain. 


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363




From:
Mikael Lavoie 
To:
Open MPI Users 
Date:
09/23/2010 05:08 PM
Subject:
Re: [OMPI users] "self scheduled" work & mpi receive???
Sent by:
users-boun...@open-mpi.org



Hi Ambrose,

I'm interested in your work; I have an app to convert myself, and I don't 
know the MPI structure and syntax well enough to do it...

So if you want to share your app, I'm interested in taking a look at it!! 

Thanks and have a nice day!!

Mikael Lavoie
2010/9/23 Lewis, Ambrose J. 
Hi All:
I've written an openmpi program that "self schedules" the work.  
The master task is in a loop chunking up an input stream and handing off 
jobs to worker tasks.  At first the master gives the next job to the next 
highest rank.  After all ranks have their first job, the master waits via 
an MPI receive call for the next free worker.  The master parses out the 
rank from the MPI receive and sends the next job to this node.  The jobs 
aren't all identical, so they run for slightly different durations based 
on the input data.
 
When I plot a histogram of the number of jobs each worker performed, the 
lower mpi ranks are doing much more work than the higher ranks.  For 
example, in a 120 process run, rank 1 did 32 jobs while rank 119 only did 
2.  My guess is that openmpi returns the lowest rank from the MPI Recv 
when I've got MPI_ANY_SOURCE set and multiple sends have happened since 
the last call.
 
Is there a different Recv call to make that will spread out the data 
better?
 
THANXS!
amb
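[For illustration only -- this sketch is not from the original thread. One way to reduce the bias toward low ranks is to pre-post one nonblocking receive per worker and service every completed request returned by MPI_Waitsome() before waiting again; whether that evens out the histogram depends on timing, but it avoids matching a blocking MPI_Recv(MPI_ANY_SOURCE) over and over. The helpers next_job()/send_job() and the int completion payload are hypothetical placeholders for the application's own job handling.]

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical helpers (not from the original program):
     * next_job() returns 0 when the input stream is exhausted,
     * send_job() wraps the MPI_Send of one job descriptor.      */
    int  next_job(int *job);
    void send_job(int job, int worker_rank);

    static void master(int nworkers)   /* workers are ranks 1..nworkers */
    {
        MPI_Request *req = (MPI_Request *)malloc(nworkers * sizeof(MPI_Request));
        int *result = (int *)malloc(nworkers * sizeof(int));
        int *idx    = (int *)malloc(nworkers * sizeof(int));
        int job, w, i, ndone;

        /* One pre-posted completion receive per worker. */
        for (w = 0; w < nworkers; w++)
            MPI_Irecv(&result[w], 1, MPI_INT, w + 1, 0, MPI_COMM_WORLD, &req[w]);

        /* First round of assignments. */
        for (w = 0; w < nworkers && next_job(&job); w++)
            send_job(job, w + 1);

        int have_job = next_job(&job);
        while (have_job) {
            /* Waitsome returns *all* currently completed requests, so every
             * worker that finished gets serviced before we wait again.      */
            MPI_Waitsome(nworkers, req, &ndone, idx, MPI_STATUSES_IGNORE);
            for (i = 0; i < ndone && have_job; i++) {
                w = idx[i];
                MPI_Irecv(&result[w], 1, MPI_INT, w + 1, 0, MPI_COMM_WORLD, &req[w]);
                send_job(job, w + 1);
                have_job = next_job(&job);
            }
        }
        /* ... drain outstanding requests and send a termination message
         *     to each worker here ...                                    */
        free(req); free(result); free(idx);
    }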
 

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Shared memory

2010-09-24 Thread Durga Choudhury
I think the 'middle ground' approach can be simplified even further if
the data file is on a shared device (e.g. an NFS/Samba mount) that can be
mounted at the same location in the file system tree on all nodes. I
have never tried it, though, and mmap()'ing a non-POSIX-compliant file
system such as Samba might have issues I am unaware of.

However, I do not see why you should not be able to do this even if
the file is being written to as long as you call msync() before using
the mapped pages.

Durga
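[A minimal sketch of the read-only file-mapping idea, with error handling abbreviated; every process on a node that maps the same file shares the physical pages through the page cache. Whether this behaves well on NFS or Samba is exactly the open question raised above.]

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    /* Map a read-only data file; all processes on a node that map the same
     * file share the physical pages through the page cache.                */
    static const void *map_input_file(const char *path, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return NULL; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }
        *len = (size_t)st.st_size;

        void *p = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping remains valid after close() */
        return (p == MAP_FAILED) ? NULL : p;
    }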


On Fri, Sep 24, 2010 at 12:31 PM, Eugene Loh  wrote:
> It seems to me there are two extremes.
>
> One is that you replicate the data for each process.  This has the
> disadvantage of consuming lots of memory "unnecessarily."
>
> Another extreme is that shared data is distributed over all processes.  This
> has the disadvantage of making at least some of the data less accessible,
> whether in programming complexity and/or run-time performance.
>
> I'm not familiar with Global Arrays.  I was somewhat familiar with HPF.  I
> think the natural thing to do with those programming models is to distribute
> data over all processes, which may relieve the excessive memory consumption
> you're trying to address but which may also just put you at a different
> "extreme" of this spectrum.
>
> The middle ground I think might make most sense would be to share data only
> within a node, but to replicate the data for each node.  There are probably
> multiple ways of doing this -- possibly even GA, I don't know.  One way
> might be to use one MPI process per node, with OMP multithreading within
> each process|node.  Or (and I thought this was the solution you were looking
> for), have some idea which processes are collocal.  Have one process per
> node create and initialize some shared memory -- mmap, perhaps, or SysV
> shared memory.  Then, have its peers map the same shared memory into their
> address spaces.
>
> You asked what source code changes would be required.  It depends.  If
> you're going to mmap shared memory in on each node, you need to know which
> processes are collocal.  If you're willing to constrain how processes are
> mapped to nodes, this could be easy.  (E.g., "every 4 processes are
> collocal".)  If you want to discover dynamically at run time which are
> collocal, it would be harder.  The mmap stuff could be in a stand-alone
> function of about a dozen lines.  If the shared area is allocated as one
> piece, substituting the single malloc() call with a call to your mmap
> function should be simple.  If you have many malloc()s you're trying to
> replace, it's harder.
>
> Andrei Fokau wrote:
>
> The data are read from a file and processed before calculations begin, so I
> think that mapping will not work in our case.
> Global Arrays look promising indeed. As I said, we need to put just a part
> of data to the shared section. John, do you (or may be other users) have an
> experience of working with GA?
> http://www.emsl.pnl.gov/docs/global/um/build.html
> When GA runs with MPI:
> MPI_Init(..)      ! start MPI
> GA_Initialize()   ! start global arrays
> MA_Init(..)       ! start memory allocator
>     do work
> GA_Terminate()    ! tidy up global arrays
> MPI_Finalize()    ! tidy up MPI
>                   ! exit program
> On Fri, Sep 24, 2010 at 13:44, Reuti  wrote:
>>
>> Am 24.09.2010 um 13:26 schrieb John Hearns:
>>
>> > On 24 September 2010 08:46, Andrei Fokau 
>> > wrote:
>> >> We use a C-program which consumes a lot of memory per process (up to
>> >> few
>> >> GB), 99% of the data being the same for each process. So for us it
>> >> would be
>> >> quite reasonable to put that part of data in a shared memory.
>> >
>> > http://www.emsl.pnl.gov/docs/global/
>> >
>> > Is this any help? Apologies if I'm talking through my hat.
>>
>> I was also thinking of this when I read "data in a shared memory" (besides
>> approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't
>> this also one idea behind "High Performance Fortran" - running in parallel
>> across nodes even without knowing that it's across nodes at all while
>> programming and access all data like it's being local.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
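[For what it's worth, a rough sketch of the per-node shared-memory approach Eugene describes above, assuming a node-local communicator has already been built (e.g. by comparing host names -- that discovery step is omitted) and using a made-up segment name; on Linux this needs -lrt.]

    #include <mpi.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/mman.h>

    /* node_comm must contain exactly the ranks running on this node. */
    static void *attach_node_shared(MPI_Comm node_comm, size_t bytes)
    {
        const char *name = "/myapp_shared_data";   /* made-up segment name */
        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        if (node_rank == 0) {                      /* one process creates and sizes it */
            int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
            ftruncate(fd, (off_t)bytes);
            close(fd);
        }
        MPI_Barrier(node_comm);                    /* segment exists before peers attach */

        int fd = shm_open(name, O_RDWR, 0600);
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);

        /* node_rank 0 would now read the input file into this region, followed
         * by another MPI_Barrier(node_comm) before the others start using it.  */
        return (p == MAP_FAILED) ? NULL : p;
    }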



Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Terry Dontje

Eloi Gaudry wrote:

Terry,

You were right, the error indeed seems to come from the message coalescing 
feature.
If I turn it off using the "--mca btl_openib_use_message_coalescing 0", I'm not able to 
observe the "hdr->tag=0" error.

There are some trac requests associated to very similar error (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352 
that might be related), aren't they ? What would you suggest Terry ?


  
Interesting, though it looks to me like the segv in ticket 2352 would 
have happened on the send side instead of the receive side like you 
have.  As to what to do next it would be really nice to have some sort 
of reproducer that we can try and debug what is really going on.  The 
only other thing to do without a reproducer is to inspect the code on 
the send side to figure out what might make it generate a 0 hdr->tag.  
Or maybe instrument the send side to stop when it is about ready to send 
a 0 hdr->tag and see if we can see how the code got there.


I might have some cycles to look at this Monday.

--td

Eloi


On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
  

Eloi Gaudry wrote:


Terry,

No, I haven't tried any other values than P,65536,256,192,128 yet.

The reason why is quite simple. I've been reading and reading again this
thread to understand the btl_openib_receive_queues meaning and I can't
figure out why the default values seem to induce the hdr->tag=0 issue
(http://www.open-mpi.org/community/lists/users/2009/01/7808.php).


Yeah, the size of the fragments and number of them really should not
cause this issue.  So I too am a little perplexed about it.



Do you think that the default shared received queue parameters are
erroneous for this specific Mellanox card ? Any help on finding the
proper parameters would actually be much appreciated.
  

I don't necessarily think it is the queue size for a specific card but
more so the handling of the queues by the BTL when using certain sizes.
At least that is one gut feel I have.

In my mind the tag being 0 is either something below OMPI is polluting
the data fragment or OMPI's internal protocol is some how getting messed
up.  I can imagine (no empirical data here) the queue sizes could change
how the OMPI protocol sets things up.  Another thing may be the
coalescing feature in the openib BTL which tries to gang multiple
messages into one packet when resources are running low.   I can see
where changing the queue sizes might affect the coalescing.  So, it
might be interesting to turn off the coalescing.  You can do that by
setting "--mca btl_openib_use_message_coalescing 0" in your mpirun line.

If that doesn't solve the issue then obviously there must be something
else going on :-).

Note, the reason I am interested in this is I am seeing a similar error
condition (hdr->tag == 0) on a development system.  Though my failing
case fails with np=8 using the connectivity test program which is mainly
point to point and there are not a significant amount of data transfers
going on either.

--td



Eloi

On Friday 24 September 2010 14:27:07 you wrote:
  

That is interesting.  So does the number of processes affect your runs
any.  The times I've seen hdr->tag be 0 usually has been due to protocol
issues.  The tag should never be 0.  Have you tried to do other
receive_queue settings other than the default and the one you mention.

I wonder if you did a combination of the two receive queues causes a
failure or not.  Something like

P,128,256,192,128:P,65536,256,192,128

I am wondering if it is the first queuing definition causing the issue
or possibly the SRQ defined in the default.

--td

Eloi Gaudry wrote:


Hi Terry,

The messages being sent/received can be of any size, but the error
seems to happen more often with small messages (such as an int being
broadcast or allreduced). The failing communication differs from one
run to another, but some spots are more likely to fail than others.
As far as I know, they are always located next to a small-message
communication (an int being broadcast, for instance). Other typical
message sizes are >10k but can be much larger.


I've been checking the HCA being used; it's from Mellanox (with
vendor_part_id=26428). There are no receive_queues parameters associated
with it.

 $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
[...]

  # A.k.a. ConnectX
  [Mellanox Hermon]
  vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
  vendor_part_id =
  25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
  use_eager_rdma = 1
  mtu = 2048
  max_inline_data = 128

[..]

$ ompi_info --param btl openib --parsable | grep receive_queues

 mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128
 :S ,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
 

Re: [OMPI users] Shared memory

2010-09-24 Thread Eugene Loh




It seems to me there are two extremes.

One is that you replicate the data for each process.  This has the
disadvantage of consuming lots of memory "unnecessarily."

Another extreme is that shared data is distributed over all processes. 
This has the disadvantage of making at least some of the data less
accessible, whether in programming complexity and/or run-time
performance.

I'm not familiar with Global Arrays.  I was somewhat familiar with
HPF.  I think the natural thing to do with those programming models is
to distribute data over all processes, which may relieve the excessive
memory consumption you're trying to address but which may also just put
you at a different "extreme" of this spectrum.

The middle ground I think might make most sense would be to share data
only within a node, but to replicate the data for each node.  There are
probably multiple ways of doing this -- possibly even GA, I don't
know.  One way might be to use one MPI process per node, with OMP
multithreading within each process|node.  Or (and I thought this was
the solution you were looking for), have some idea which processes are
collocal.  Have one process per node create and initialize some shared
memory -- mmap, perhaps, or SysV shared memory.  Then, have its peers
map the same shared memory into their address spaces.

You asked what source code changes would be required.  It depends.  If
you're going to mmap shared memory in on each node, you need to know
which processes are collocal.  If you're willing to constrain how
processes are mapped to nodes, this could be easy.  (E.g., "every 4
processes are collocal".)  If you want to discover dynamically at run
time which are collocal, it would be harder.  The mmap stuff could be
in a stand-alone function of about a dozen lines.  If the shared area
is allocated as one piece, substituting the single malloc() call with a
call to your mmap function should be simple.  If you have many
malloc()s you're trying to replace, it's harder.

Andrei Fokau wrote:

The data are read from a file and processed before calculations begin, so I
think that mapping will not work in our case.

Global Arrays look promising indeed. As I said, we need to put just a part
of data to the shared section. John, do you (or may be other users) have an
experience of working with GA?

http://www.emsl.pnl.gov/docs/global/um/build.html

When GA runs with MPI:

MPI_Init(..)      ! start MPI
GA_Initialize()   ! start global arrays
MA_Init(..)       ! start memory allocator

    do work

GA_Terminate()    ! tidy up global arrays
MPI_Finalize()    ! tidy up MPI
                  ! exit program

On Fri, Sep 24, 2010 at 13:44, Reuti wrote:

Am 24.09.2010 um 13:26 schrieb John Hearns:

> On 24 September 2010 08:46, Andrei Fokau wrote:
>> We use a C-program which consumes a lot of memory per process (up to few
>> GB), 99% of the data being the same for each process. So for us it would be
>> quite reasonable to put that part of data in a shared memory.
>
> http://www.emsl.pnl.gov/docs/global/
>
> Is this any help? Apologies if I'm talking through my hat.

I was also thinking of this when I read "data in a shared memory" (besides
approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't
this also one idea behind "High Performance Fortran" - running in parallel
across nodes even without knowing that it's across nodes at all while
programming and access all data like it's being local.





Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Eloi Gaudry
Terry,

You were right, the error indeed seems to come from the message coalescing 
feature.
If I turn it off using the "--mca btl_openib_use_message_coalescing 0", I'm not 
able to observe the "hdr->tag=0" error.

There are some trac requests associated with a very similar error 
(https://svn.open-mpi.org/trac/ompi/search?q=coalescing), but they are all 
closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, 
which might be related), aren't they? What would you suggest, Terry?

Eloi
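[For reference, both workarounds discussed in this thread are plain MCA parameters on the mpirun command line; the process count and executable name below are placeholders:]

    $ mpirun -np 16 --mca btl_openib_use_message_coalescing 0 ./my_app
    $ mpirun -np 16 --mca btl_openib_receive_queues P,65536,256,192,128 ./my_app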


On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
> Eloi Gaudry wrote:
> > Terry,
> > 
> > No, I haven't tried any other values than P,65536,256,192,128 yet.
> > 
> > The reason why is quite simple. I've been reading and reading again this
> > thread to understand the btl_openib_receive_queues meaning and I can't
> > figure out why the default values seem to induce the hdr->tag=0 issue
> > (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
> 
> Yeah, the size of the fragments and number of them really should not
> cause this issue.  So I too am a little perplexed about it.
> 
> > Do you think that the default shared received queue parameters are
> > erroneous for this specific Mellanox card ? Any help on finding the
> > proper parameters would actually be much appreciated.
> 
> I don't necessarily think it is the queue size for a specific card but
> more so the handling of the queues by the BTL when using certain sizes.
> At least that is one gut feel I have.
> 
> In my mind the tag being 0 is either something below OMPI is polluting
> the data fragment or OMPI's internal protocol is some how getting messed
> up.  I can imagine (no empirical data here) the queue sizes could change
> how the OMPI protocol sets things up.  Another thing may be the
> coalescing feature in the openib BTL which tries to gang multiple
> messages into one packet when resources are running low.   I can see
> where changing the queue sizes might affect the coalescing.  So, it
> might be interesting to turn off the coalescing.  You can do that by
> setting "--mca btl_openib_use_message_coalescing 0" in your mpirun line.
> 
> If that doesn't solve the issue then obviously there must be something
> else going on :-).
> 
> Note, the reason I am interested in this is I am seeing a similar error
> condition (hdr->tag == 0) on a development system.  Though my failing
> case fails with np=8 using the connectivity test program which is mainly
> point to point and there are not a significant amount of data transfers
> going on either.
> 
> --td
> 
> > Eloi
> > 
> > On Friday 24 September 2010 14:27:07 you wrote:
> >> That is interesting.  So does the number of processes affect your runs
> >> any.  The times I've seen hdr->tag be 0 usually has been due to protocol
> >> issues.  The tag should never be 0.  Have you tried to do other
> >> receive_queue settings other than the default and the one you mention.
> >> 
> >> I wonder if you did a combination of the two receive queues causes a
> >> failure or not.  Something like
> >> 
> >> P,128,256,192,128:P,65536,256,192,128
> >> 
> >> I am wondering if it is the first queuing definition causing the issue
> >> or possibly the SRQ defined in the default.
> >> 
> >> --td
> >> 
> >> Eloi Gaudry wrote:
> >>> Hi Terry,
> >>> 
> >>> The messages being send/received can be of any size, but the error
> >>> seems to happen more often with small messages (as an int being
> >>> broadcasted or allreduced). The failing communication differs from one
> >>> run to another, but some spots are more likely to be failing than
> >>> another. And as far as I know, there are always located next to a
> >>> small message (an int being broadcasted for instance) communication.
> >>> Other typical messages size are >10k but can be very much larger.
> >>> 
> >>> I've been checking the hca being used, its' from mellanox (with
> >>> vendor_part_id=26428). There is no receive_queues parameters associated
> >>> to it.
> >>> 
> >>>  $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
> >>> [...]
> >>> 
> >>>   # A.k.a. ConnectX
> >>>   [Mellanox Hermon]
> >>>   vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
> >>>   vendor_part_id =
> >>>   25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
> >>>   use_eager_rdma = 1
> >>>   mtu = 2048
> >>>   max_inline_data = 128
> >>> 
> >>> [..]
> >>> 
> >>> $ ompi_info --param btl openib --parsable | grep receive_queues
> >>> 
> >>>  mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128
> >>>  :S ,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> >>>  mca:btl:openib:param:btl_openib_receive_queues:data_source:default
> >>>  value mca:btl:openib:param:btl_openib_receive_queues:status:writable
> >>>  mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited,
> >>>  comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> >>>  mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
> 

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Eloi Gaudry
Terry,

No, I haven't tried any other values than P,65536,256,192,128 yet.

The reason why is quite simple. I've been reading and reading again this thread 
to understand the btl_openib_receive_queues meaning and I can't figure out why 
the default values seem to induce the 
"hdr->tag=0" issue 
(http://www.open-mpi.org/community/lists/users/2009/01/7808.php). 

Do you think that the default shared receive queue parameters are erroneous 
for this specific Mellanox card? Any help on finding the proper parameters 
would be much appreciated.

Eloi

On Friday 24 September 2010 14:27:07 you wrote:
> That is interesting.  So does the number of processes affect your runs
> any.  The times I've seen hdr->tag be 0 usually has been due to protocol
> issues.  The tag should never be 0.  Have you tried to do other
> receive_queue settings other than the default and the one you mention.
> 
> I wonder if you did a combination of the two receive queues causes a
> failure or not.  Something like
> 
> P,128,256,192,128:P,65536,256,192,128
> 
> I am wondering if it is the first queuing definition causing the issue or
> possibly the SRQ defined in the default.
> 
> --td
> 
> Eloi Gaudry wrote:
> > Hi Terry,
> > 
> > The messages being send/received can be of any size, but the error seems
> > to happen more often with small messages (as an int being broadcasted or
> > allreduced). The failing communication differs from one run to another,
> > but some spots are more likely to be failing than another. And as far as
> > I know, there are always located next to a small message (an int being
> > broadcasted for instance) communication. Other typical messages size are
> > >10k but can be very much larger.
> > 
> > I've been checking the hca being used, its' from mellanox (with
> > vendor_part_id=26428). There is no receive_queues parameters associated
> > to it.
> > 
> >  $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
> > [...]
> > 
> >   # A.k.a. ConnectX
> >   [Mellanox Hermon]
> >   vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
> >   vendor_part_id =
> >   25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
> >   use_eager_rdma = 1
> >   mtu = 2048
> >   max_inline_data = 128
> > 
> > [..]
> > 
> > $ ompi_info --param btl openib --parsable | grep receive_queues
> > 
> >  mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S
> >  ,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> >  mca:btl:openib:param:btl_openib_receive_queues:data_source:default
> >  value mca:btl:openib:param:btl_openib_receive_queues:status:writable
> >  mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited,
> >  comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> >  mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
> > 
> > I was wondering if these parameters (automatically computed at openib btl
> > init for what I understood) were not incorrect in some way and I plugged
> > some others values: "P,65536,256,192,128" (someone on the list used that
> > values when encountering a different issue) . Since that, I haven't been
> > able to observe the segfault (occuring as hrd->tag = 0 in
> > btl_openib_component.c:2881) yet.
> > 
> > Eloi
> > 
> > 
> > /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
> > 
> > On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
> >> Eloi, I am curious about your problem.  Can you tell me what size of job
> >> it is?  Does it always fail on the same bcast,  or same process?
> >> 
> >> Eloi Gaudry wrote:
> >>> Hi Nysal,
> >>> 
> >>> Thanks for your suggestions.
> >>> 
> >>> I'm now able to get the checksum computed and redirected to stdout,
> >>> thanks (I forgot the  "-mca pml_base_verbose 5" option, you were
> >>> right). I haven't been able to observe the segmentation fault (with
> >>> hdr->tag=0) so far (when using pml csum) but I 'll let you know when I
> >>> am.
> >>> 
> >>> I've got two others question, which may be related to the error
> >>> observed:
> >>> 
> >>> 1/ does the maximum number of MPI_Comm that can be handled by OpenMPI
> >>> somehow depends on the btl being used (i.e. if I'm using openib, may I
> >>> use the same number of MPI_Comm object as with tcp) ? Is there
> >>> something as MPI_COMM_MAX in OpenMPI ?
> >>> 
> >>> 2/ the segfaults only appears during a mpi collective call, with very
> >>> small message (one int is being broadcast, for instance) ; i followed
> >>> the guidelines given at http://icl.cs.utk.edu/open-
> >>> mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug-build
> >>> of OpenMPI asserts if I use a different min-size that 255. Anyway, if I
> >>> deactivate eager_rdma, the segfaults remains. Does the openib btl
> >>> handle very small message differently (even with eager_rdma
> >>> deactivated) than tcp ?
> >> 
> >> Others on the list does coalescing happen with non-eager_rdma?  If so
> >> then that would possibly be one difference between the 

Re: [OMPI users] "self scheduled" work & mpi receive???

2010-09-24 Thread Richard Treumann
Amb 

It sounds like you have more workers than you can keep fed. Workers are 
finishing up and requesting their next assignment but sit idle because 
there are so many other idle workers too.

Load balance does not really matter if the choke point is the master.  The 
work is being done as fast as the master can hand it out.

Consider using fewer workers and seeing if your load balance improves and 
your total thruput stays the same. If you want to use all the workers you 
have efficiently, you need to find a way to make the master deliver 
assignments as fast as workers finish them. 

Compute processes do not care about fairness. Having half the processes 
busy 100% of the time and the other half idle  vs. having all the 
processes busy 50% of the time gives the same thruput and the hard workers 
will not complain. 


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363




From:
Mikael Lavoie 
To:
Open MPI Users 
Date:
09/23/2010 05:08 PM
Subject:
Re: [OMPI users] "self scheduled" work & mpi receive???
Sent by:
users-boun...@open-mpi.org



Hi Ambrose,

I'm interested in your work; I have an app to convert myself, and I don't 
know the MPI structure and syntax well enough to do it...

So if you want to share your app, I'm interested in taking a look at it!! 

Thanks and have a nice day!!

Mikael Lavoie
2010/9/23 Lewis, Ambrose J. 
Hi All:
I’ve written an openmpi program that “self schedules” the work.  
The master task is in a loop chunking up an input stream and handing off 
jobs to worker tasks.  At first the master gives the next job to the next 
highest rank.  After all ranks have their first job, the master waits via 
an MPI receive call for the next free worker.  The master parses out the 
rank from the MPI receive and sends the next job to this node.  The jobs 
aren’t all identical, so they run for slightly different durations based 
on the input data.
 
When I plot a histogram of the number of jobs each worker performed, the 
lower mpi ranks are doing much more work than the higher ranks.  For 
example, in a 120 process run, rank 1 did 32 jobs while rank 119 only did 
2.  My guess is that openmpi returns the lowest rank from the MPI Recv 
when I’ve got MPI_ANY_SOURCE set and multiple sends have happened since 
the last call.
 
Is there a different Recv call to make that will spread out the data 
better?
 
THANXS!
amb
 

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Display in terminal of error message using throw std::runtime_error on distant node...

2010-09-24 Thread Olivier Riff
That is already an answer that makes sense. I understand that it is really
not a trivial issue. I have seen other recent threads about "running on
crashed nodes", and that the Open MPI team is working hard on it. We
will wait, and will be glad to test the first versions when they are
released (I understand it will take some time).

Thanks for this quick reply,

Olivier

2010/9/24 Jeff Squyres 

> Open MPI's fault tolerance is still somewhat rudimentary; it's a complex
> topic within the entire scope of MPI.  There has been much research into MPI
> and fault tolerance over the years; the MPI Forum itself is grappling with
> terms and definitions that make sense.  It's by no means a "solved" problem.
>
> It's unfortunately unsurprising that Open MPI may hang in the case of a
> node crash.  I wish that I had a better answer for you, but I don't.  :-\
>
>
> On Sep 24, 2010, at 3:36 AM, Olivier Riff wrote:
>
> > Hello,
> >
> > My question concerns the display of error message generated by a throw
> std::runtime_error("Explicit error message").
> > I am launching on a terminal an openMPI program on several machines
> using:
> > mpirun -v -machinefile MyMachineFile.txt MyProgram.
> > I am wondering why I cannot see an error message displayed on the
> terminal when one of my distant node (meaning not the node where the
> terminal is used) is crashing. I was expecting that following try catch
> could also generates a display in the terminal:
> > try {...My code where a crash happens... }
> > {
> >   throw std::runtime_error( "Explicit error message" );
> > }
> >
> > Generally, my problem is that one of the node crashes and the global
> application waits forever data from this node. On the terminal, nothing is
> displayed indicating that the node has crashed and generated a useful
> information of the crash nature.
> >
> > ( I don't think these information are relevant here, but just in case: I
> am using openMPI 1.4.2, on a Mandriva 2008 system )
> >
> > Thanks in advance for any help/info/advice.
> >
> > Olivier
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Shared memory

2010-09-24 Thread Andrei Fokau
The data are read from a file and processed before calculations begin, so I
think that mapping will not work in our case.

Global Arrays look promising indeed. As I said, we need to put just a part
of the data into the shared section. John, do you (or maybe other users) have
any experience of working with GA?

http://www.emsl.pnl.gov/docs/global/um/build.html
When GA runs with MPI:

MPI_Init(..)  ! start MPI
GA_Initialize()   ! start global arrays
MA_Init(..)   ! start memory allocator

    do work

GA_Terminate()! tidy up global arrays
MPI_Finalize()! tidy up MPI
  ! exit program



On Fri, Sep 24, 2010 at 13:44, Reuti  wrote:

> Am 24.09.2010 um 13:26 schrieb John Hearns:
>
> > On 24 September 2010 08:46, Andrei Fokau 
> wrote:
> >> We use a C-program which consumes a lot of memory per process (up to few
> >> GB), 99% of the data being the same for each process. So for us it would
> be
> >> quite reasonable to put that part of data in a shared memory.
> >
> > http://www.emsl.pnl.gov/docs/global/
> >
> > Is this any help? Apologies if I'm talking through my hat.
>
> I was also thinking of this when I read "data in a shared memory" (besides
> approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't
> this also one idea behind "High Performance Fortran" - running in parallel
> across nodes even without knowing that it's across nodes at all while
> programming and access all data like it's being local.
>
> -- Reuti
>
>


Re: [OMPI users] How to know which process is running on which core?

2010-09-24 Thread Jeff Squyres
I completely neglected to mention that you could also use hwloc (Hardware 
Locality), a small utility library for learning topology-kinds of things 
(including if you're bound, where you're bound, etc.).  Hwloc is a sub-project 
of Open MPI:

http://www.open-mpi.org/projects/hwloc/

Open MPI uses hwloc internally, but you could also link your application 
against hwloc and call its C functions to get information about your process' 
locality, etc.
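[As a rough sketch of the kind of call sequence involved -- written against the hwloc bitmap API; names may differ between hwloc versions, so treat this as an assumption rather than a recipe:]

    #include <stdio.h>
    #include <stdlib.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* Ask which processing units this process is currently bound to. */
        hwloc_bitmap_t set = hwloc_bitmap_alloc();
        if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
            char *str;
            hwloc_bitmap_asprintf(&str, set);    /* e.g. "0x00000005" */
            printf("bound to cpuset %s\n", str);
            free(str);
        }

        hwloc_bitmap_free(set);
        hwloc_topology_destroy(topo);
        return 0;
    }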



On Sep 24, 2010, at 8:14 AM, Jeff Squyres wrote:

> On the OMPI SVN trunk, we have an "Open MPI extension" call named 
> OMPI_Affinity_str().  Below is an excerpt from the man page.  If this is 
> desirable, we can probably get it into 1.5.1.
> 
> -
> 
> NAME
>   OMPI_Affinity_str  -  Obtain  prettyprint strings of processor affinity
>   information for this process
> 
> 
> SYNTAX
> C Syntax
>   #include <mpi.h>
>   #include <mpi-ext.h>
> 
>   int OMPI_Affinity_str(char ompi_bound[OMPI_AFFINITY_STRING_MAX],
> char current_binding[OMPI_AFFINITY_STRING_MAX],
> char exists[OMPI_AFFINITY_STRING_MAX])
> 
> [snip]
> OUTPUT PARAMETERS
>   ompi_bound
> A prettyprint string describing what  processor(s)  Open  MPI
> bound  this  process to, or a string indicating that Open MPI
> did not bind this process.
> 
>   current_binding
> A  prettyprint  string  describing  what  processor(s)   this
> process  is  currently  bound to, or a string indicating that
> the process is bound to  all  available  processors  (and  is
> therefore considered "unbound").
> 
>   exists    A  prettyprint  string  describing  the available sockets and
> cores on this host.
> 
> 
> 
> 
> On Sep 23, 2010, at 11:10 PM, Ralph Castain wrote:
> 
>> You mean via an API of some kind? Not through an MPI call, but you can do it 
>> (though your code will wind up OMPI-specific). Look at the OMPI source code 
>> in opal/mca/paffinity/paffinity.h and you'll see the necessary calls as well 
>> as some macros to help parse the results.
>> 
>> Depending upon what version you are using, there may also be a function in 
>> opal/mca/paffinity/base/base.h to pretty-print that info for you. I believe 
>> it may only be in the developer's trunk right now, or it may have made it 
>> into the 1.5.0 release candidate
>> 
>> 
>> On Thu, Sep 23, 2010 at 11:24 AM, Fernando Saez  
>> wrote:
>> Hi all, I'm new in the list. I don't know if this post has been treated 
>> before.
>> 
>> My question is:
>> 
>> Is there a way in the OMPI library to report which process is running
>> on which core in a SMP system? I need to know processor affinity for 
>> optimizations issues.
>> 
>> Regards
>> 
>> Fernando Saez
>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Terry Dontje
That is interesting.  So does the number of processes affect your runs 
any.  The times I've seen hdr->tag be 0 usually has been due to protocol 
issues.  The tag should never be 0.  Have you tried to do other 
receive_queue settings other than the default and the one you mention.


I wonder if you did a combination of the two receive queues causes a 
failure or not.  Something like


P,128,256,192,128:P,65536,256,192,128

I am wondering if it is the first queuing definition causing the issue or 
possibly the SRQ defined in the default.

--td

Eloi Gaudry wrote:

Hi Terry,

The messages being sent/received can be of any size, but the error seems to 
happen more often with small messages (such as an int being broadcast or 
allreduced).
The failing communication differs from one run to another, but some spots are 
more likely to fail than others. As far as I know, they are always located 
next to a small-message communication (an int being broadcast, for instance). 
Other typical message sizes are >10k but can be much larger.


I've been checking the HCA being used; it's from Mellanox (with 
vendor_part_id=26428). There are no receive_queues parameters associated with it.
 $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
[...]
  # A.k.a. ConnectX
  [Mellanox Hermon]
  vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
  vendor_part_id = 
25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
  use_eager_rdma = 1
  mtu = 2048
  max_inline_data = 128
[..]

$ ompi_info --param btl openib --parsable | grep receive_queues
 
mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
 mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
 mca:btl:openib:param:btl_openib_receive_queues:status:writable
 mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma 
delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
 mca:btl:openib:param:btl_openib_receive_queues:deprecated:no

I was wondering if these parameters (automatically computed at openib btl init, 
as far as I understand) were incorrect in some way, so I plugged in other 
values: "P,65536,256,192,128" (someone on the list used those values when 
encountering a different issue). Since then, I haven't been able to observe 
the segfault (occurring as hdr->tag = 0 in btl_openib_component.c:2881).


Eloi


/home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/

On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
  

Eloi, I am curious about your problem.  Can you tell me what size of job
it is?  Does it always fail on the same bcast,  or same process?

Eloi Gaudry wrote:


Hi Nysal,

Thanks for your suggestions.

I'm now able to get the checksum computed and redirected to stdout,
thanks (I forgot the  "-mca pml_base_verbose 5" option, you were right).
I haven't been able to observe the segmentation fault (with hdr->tag=0)
so far (when using pml csum) but I 'll let you know when I am.

I've got two others question, which may be related to the error observed:

1/ does the maximum number of MPI_Comm that can be handled by OpenMPI
somehow depends on the btl being used (i.e. if I'm using openib, may I
use the same number of MPI_Comm object as with tcp) ? Is there something
as MPI_COMM_MAX in OpenMPI ?

2/ the segfaults only appears during a mpi collective call, with very
small message (one int is being broadcast, for instance) ; i followed
the guidelines given at http://icl.cs.utk.edu/open-
mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug-build
of OpenMPI asserts if I use a different min-size that 255. Anyway, if I
deactivate eager_rdma, the segfaults remains. Does the openib btl handle
very small message differently (even with eager_rdma deactivated) than
tcp ?
  

Others on the list does coalescing happen with non-eager_rdma?  If so
then that would possibly be one difference between the openib btl and
tcp aside from the actual protocol used.



 is there a way to make sure that large messages and small messages are
 handled the same way ?
  

Do you mean so they all look like eager messages?  How large of messages
are we talking about here 1K, 1M or 10M?

--td



Regards,
Eloi

On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
  

Hi Eloi,
Create a debug build of OpenMPI (--enable-debug) and while running with
the csum PML add "-mca pml_base_verbose 5" to the command line. This
will print the checksum details for each fragment sent over the wire.
I'm guessing it didnt catch anything because the BTL failed. The
checksum verification is done in the PML, which the BTL calls via a
callback function. In your case the PML callback is never called
because the hdr->tag is invalid. So enabling checksum tracing also
might not be of much use. Is it the first Bcast that fails or the nth
Bcast and what is the message size? I'm not sure what could be the
problem at this 

Re: [OMPI users] Running on crashing nodes

2010-09-24 Thread Joshua Hursey
As one of the Open MPI developers actively working on the MPI layer 
stabilization/recovery feature set, I don't think we can give you a specific 
timeframe for availability, especially availability in a stable release. Once 
the initial functionality is finished, we will open it up for user testing by 
making a public branch available. After addressing the concerns highlighted by 
public testing, we will attempt to work this feature into the mainline trunk 
and eventual release.

Unfortunately it is difficult to assess the time needed to go through these 
development stages. What I can tell you is that the work to this point on the 
MPI layer is looking promising, and that as soon as we feel that the code is 
ready we will make it available to the public for further testing.

-- Josh

On Sep 24, 2010, at 3:37 AM, Andrei Fokau wrote:

> Ralph, could you tell us when this functionality will be available in the 
> stable version? A rough estimate will be fine.
> 
> 
> On Fri, Sep 24, 2010 at 01:24, Ralph Castain  wrote:
> In a word, no. If a node crashes, OMPI will abort the currently-running job 
> if it had processes on that node. There is no current ability to "ride-thru" 
> such an event.
> 
> That said, there is work being done to support "ride-thru". Most of that is 
> in the current developer's code trunk, and more is coming, but I wouldn't 
> consider it production-quality just yet.
> 
> Specifically, the code that does what you specify below is done and works. It 
> is recovery of the MPI job itself (collectives, lost messages, etc.) that 
> remains to be completed.
> 
> 
> On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau  
> wrote:
> Dear users,
> 
> Our cluster has a number of nodes which have high probability to crash, so it 
> happens quite often that calculations stop due to one node getting down. May 
> be you know if it is possible to block the crashed nodes during run-time when 
> running with OpenMPI? I am asking about principal possibility to program such 
> behavior. Does OpenMPI allow such dynamic checking? The scheme I am curious 
> about is the following:
> 
> 1. A code starts its tasks via mpirun on several nodes
> 2. At some moment one node gets down
> 3. The code realizes that the node is down (the results are lost) and 
> excludes it from the list of nodes to run its tasks on
> 4. At later moment the user restarts the crashed node
> 5. The code notices that the node is up again, and puts it back to the list 
> of active nodes
> 
> 
> Regards,
> Andrei
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey




Re: [OMPI users] Display in terminal of error message using throw std::runtime_error on distant node...

2010-09-24 Thread Jeff Squyres
Open MPI's fault tolerance is still somewhat rudimentary; it's a complex topic 
within the entire scope of MPI.  There has been much research into MPI and 
fault tolerance over the years; the MPI Forum itself is grappling with terms 
and definitions that make sense.  It's by no means a "solved" problem.

It's unfortunately unsurprising that Open MPI may hang in the case of a node 
crash.  I wish that I had a better answer for you, but I don't.  :-\


On Sep 24, 2010, at 3:36 AM, Olivier Riff wrote:

> Hello,
> 
> My question concerns the display of error message generated by a throw 
> std::runtime_error("Explicit error message").
> I am launching on a terminal an openMPI program on several machines using:
> mpirun -v -machinefile MyMachineFile.txt MyProgram.
> I am wondering why I cannot see an error message displayed on the terminal 
> when one of my distant node (meaning not the node where the terminal is used) 
> is crashing. I was expecting that following try catch could also generates a 
> display in the terminal:
> try {...My code where a crash happens... } 
> {
>   throw std::runtime_error( "Explicit error message" );
> }
> 
> Generally, my problem is that one of the node crashes and the global 
> application waits forever data from this node. On the terminal, nothing is 
> displayed indicating that the node has crashed and generated a useful 
> information of the crash nature.
> 
> ( I don't think these information are relevant here, but just in case: I am 
> using openMPI 1.4.2, on a Mandriva 2008 system )
> 
> Thanks in advance for any help/info/advice.
> 
> Olivier
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] How to know which process is running on which core?

2010-09-24 Thread Jeff Squyres
On the OMPI SVN trunk, we have an "Open MPI extension" call named 
OMPI_Affinity_str().  Below is an excerpt from the man page.  If this is 
desirable, we can probably get it into 1.5.1.

-

NAME
   OMPI_Affinity_str  -  Obtain  prettyprint strings of processor affinity
   information for this process


SYNTAX
C Syntax
    #include <mpi.h>
    #include <mpi-ext.h>

   int OMPI_Affinity_str(char ompi_bound[OMPI_AFFINITY_STRING_MAX],
 char current_binding[OMPI_AFFINITY_STRING_MAX],
 char exists[OMPI_AFFINITY_STRING_MAX])

[snip]
OUTPUT PARAMETERS
   ompi_bound
 A prettyprint string describing what  processor(s)  Open  MPI
 bound  this  process to, or a string indicating that Open MPI
 did not bind this process.

   current_binding
 A  prettyprint  string  describing  what  processor(s)   this
 process  is  currently  bound to, or a string indicating that
 the process is bound to  all  available  processors  (and  is
 therefore considered "unbound").

    exists    A  prettyprint  string  describing  the available sockets and
 cores on this host.
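[A hedged usage sketch based only on the synopsis above -- the header names and the OMPI_AFFINITY_STRING_MAX constant are taken from the excerpt, and since this is an Open MPI extension it only builds where mpi-ext.h provides it:]

    #include <stdio.h>
    #include <mpi.h>
    #include <mpi-ext.h>

    int main(int argc, char **argv)
    {
        char ompi_bound[OMPI_AFFINITY_STRING_MAX];
        char current_binding[OMPI_AFFINITY_STRING_MAX];
        char exists[OMPI_AFFINITY_STRING_MAX];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Fill the three strings described in the man page excerpt. */
        OMPI_Affinity_str(ompi_bound, current_binding, exists);
        printf("rank %d: ompi_bound=%s current=%s exists=%s\n",
               rank, ompi_bound, current_binding, exists);

        MPI_Finalize();
        return 0;
    }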




On Sep 23, 2010, at 11:10 PM, Ralph Castain wrote:

> You mean via an API of some kind? Not through an MPI call, but you can do it 
> (though your code will wind up OMPI-specific). Look at the OMPI source code 
> in opal/mca/paffinity/paffinity.h and you'll see the necessary calls as well 
> as some macros to help parse the results.
> 
> Depending upon what version you are using, there may also be a function in 
> opal/mca/paffinity/base/base.h to pretty-print that info for you. I believe 
> it may only be in the developer's trunk right now, or it may have made it 
> into the 1.5.0 release candidate
> 
> 
> On Thu, Sep 23, 2010 at 11:24 AM, Fernando Saez  
> wrote:
> Hi all, I'm new in the list. I don't know if this post has been treated 
> before.
> 
> My question is:
> 
> Is there a way in the OMPI library to report which process is running
> on which core in a SMP system? I need to know processor affinity for 
> optimizations issues.
> 
> Regards
> 
> Fernando Saez
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Shared memory

2010-09-24 Thread Reuti
Am 24.09.2010 um 13:26 schrieb John Hearns:

> On 24 September 2010 08:46, Andrei Fokau  wrote:
>> We use a C-program which consumes a lot of memory per process (up to few
>> GB), 99% of the data being the same for each process. So for us it would be
>> quite reasonable to put that part of data in a shared memory.
> 
> http://www.emsl.pnl.gov/docs/global/
> 
> Is this any help? Apologies if I'm talking through my hat.

I was also thinking of this when I read "data in a shared memory" (besides 
approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't this 
also one idea behind "High Performance Fortran" - running in parallel across 
nodes without the programmer even knowing that it's across nodes at all, and 
accessing all data as if it were local?

-- Reuti


> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Shared memory

2010-09-24 Thread John Hearns
On 24 September 2010 08:46, Andrei Fokau  wrote:
> We use a C-program which consumes a lot of memory per process (up to few
> GB), 99% of the data being the same for each process. So for us it would be
> quite reasonable to put that part of data in a shared memory.

http://www.emsl.pnl.gov/docs/global/

Is this any help? Apologies if I'm talking through my hat.


Re: [OMPI users] Shared memory

2010-09-24 Thread Durga Choudhury
Is the data coming from a read-only file? In that case, a better way
might be to memory map that file in the root process and share the map
pointer in all the slave threads. This, like shared memory, will work
only for processes within a node, of course.


On Fri, Sep 24, 2010 at 3:46 AM, Andrei Fokau
 wrote:
> We use a C-program which consumes a lot of memory per process (up to few
> GB), 99% of the data being the same for each process. So for us it would be
> quite reasonable to put that part of data in a shared memory.
> In the source code, the memory is allocated via malloc() function. What
> would it require for us to change in the source code to be able to put that
> repeating data in a shared memory?
> The code is normally run on several nodes.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Shared memory

2010-09-24 Thread Andrei Fokau
We use a C program which consumes a lot of memory per process (up to a few
GB), with 99% of the data being the same for each process. So for us it would
be quite reasonable to put that part of the data in shared memory.

In the source code, the memory is allocated via malloc() function. What
would it require for us to change in the source code to be able to put that
repeating data in a shared memory?

The code is normally run on several nodes.


Re: [OMPI users] Running on crashing nodes

2010-09-24 Thread Andrei Fokau
Ralph, could you tell us when this functionality will be available in the
stable version? A rough estimate will be fine.


On Fri, Sep 24, 2010 at 01:24, Ralph Castain  wrote:

> In a word, no. If a node crashes, OMPI will abort the currently-running job
> if it had processes on that node. There is no current ability to "ride-thru"
> such an event.
>
> That said, there is work being done to support "ride-thru". Most of that is
> in the current developer's code trunk, and more is coming, but I wouldn't
> consider it production-quality just yet.
>
> Specifically, the code that does what you specify below is done and works.
> It is recovery of the MPI job itself (collectives, lost messages, etc.) that
> remains to be completed.
>
>
>  On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau <
> andrei.fo...@neutron.kth.se> wrote:
>
>>  Dear users,
>>
>> Our cluster has a number of nodes which have high probability to crash, so
>> it happens quite often that calculations stop due to one node getting down.
>> May be you know if it is possible to block the crashed nodes during run-time
>> when running with OpenMPI? I am asking about principal possibility to
>> program such behavior. Does OpenMPI allow such dynamic checking? The scheme
>> I am curious about is the following:
>>
>> 1. A code starts its tasks via mpirun on several nodes
>> 2. At some moment one node gets down
>> 3. The code realizes that the node is down (the results are lost) and
>> excludes it from the list of nodes to run its tasks on
>> 4. At later moment the user restarts the crashed node
>> 5. The code notices that the node is up again, and puts it back to the
>> list of active nodes
>>
>>
>> Regards,
>> Andrei
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Display in terminal of error message using throw std::runtime_error on distant node...

2010-09-24 Thread Olivier Riff
Hello,

My question concerns the display of error message generated by a throw
std::runtime_error("Explicit error message").
I am launching on a terminal an openMPI program on several machines using:
mpirun -v -machinefile MyMachineFile.txt MyProgram.
I am wondering why I cannot see an error message displayed on the terminal
when one of my distant nodes (meaning not the node where the terminal is
used) crashes. I was expecting that the following try/catch could also
generate a display in the terminal:
try {...My code where a crash happens... }
catch (...)
{
  throw std::runtime_error( "Explicit error message" );
}

Generally, my problem is that one of the nodes crashes and the global
application waits forever for data from this node. On the terminal, nothing is
displayed indicating that the node has crashed, and no useful information
about the nature of the crash is given.

(I don't think this information is relevant here, but just in case: I am
using Open MPI 1.4.2 on a Mandriva 2008 system.)

Thanks in advance for any help/info/advice.

Olivier
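[Not an answer to the node-crash case, where no user code gets to run at all, but for plain C++ exceptions one pattern that at least produces output on the launching terminal and keeps the job from hanging is to catch at the top level of each rank, print to stderr, and call MPI_Abort(). A sketch, with do_work() standing in for the real application code:]

    #include <mpi.h>
    #include <cstdio>
    #include <stdexcept>

    // Placeholder for the application code that may throw.
    static void do_work()
    {
        throw std::runtime_error("Explicit error message");
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        try {
            do_work();
        } catch (const std::exception &e) {
            // mpirun forwards each rank's stderr, so this appears on the launch terminal
            std::fprintf(stderr, "rank %d: fatal error: %s\n", rank, e.what());
            MPI_Abort(MPI_COMM_WORLD, 1);   // abort the whole job instead of hanging
        }

        MPI_Finalize();
        return 0;
    }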


Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Eloi Gaudry
Hi Terry,

The messages being sent/received can be of any size, but the error seems to 
happen more often with small messages (such as an int being broadcast or 
allreduced).
The failing communication differs from one run to another, but some spots are 
more likely to fail than others. As far as I know, they are always located 
next to a small-message communication (an int being broadcast, for instance). 
Other typical message sizes are >10k but can be much larger.

I've been checking the HCA being used; it's from Mellanox (with 
vendor_part_id=26428). There are no receive_queues parameters associated with it.
 $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
[...]
  # A.k.a. ConnectX
  [Mellanox Hermon]
  vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
  vendor_part_id = 
25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
  use_eager_rdma = 1
  mtu = 2048
  max_inline_data = 128
[..]

$ ompi_info --param btl openib --parsable | grep receive_queues
 
mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
 mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
 mca:btl:openib:param:btl_openib_receive_queues:status:writable
 mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma 
delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
 mca:btl:openib:param:btl_openib_receive_queues:deprecated:no

I was wondering if these parameters (automatically computed at openib btl init, 
as far as I understand) were incorrect in some way, so I plugged in other 
values: "P,65536,256,192,128" (someone on the list used those values when 
encountering a different issue). Since then, I haven't been able to observe 
the segfault (occurring as hdr->tag = 0 in btl_openib_component.c:2881).

Eloi
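[For reference, the debugging recipe from Nysal quoted below boils down to a command line like the following, run against an Open MPI build configured with --enable-debug; the process count and executable name are placeholders:]

    $ mpirun -np 32 --mca pml csum --mca pml_base_verbose 5 ./my_app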


/home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/

On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
> Eloi, I am curious about your problem.  Can you tell me what size of job
> it is?  Does it always fail on the same bcast,  or same process?
> 
> Eloi Gaudry wrote:
> > Hi Nysal,
> > 
> > Thanks for your suggestions.
> > 
> > I'm now able to get the checksum computed and redirected to stdout,
> > thanks (I forgot the  "-mca pml_base_verbose 5" option, you were right).
> > I haven't been able to observe the segmentation fault (with hdr->tag=0)
> > so far (when using pml csum) but I 'll let you know when I am.
> > 
> > I've got two others question, which may be related to the error observed:
> > 
> > 1/ does the maximum number of MPI_Comm that can be handled by OpenMPI
> > somehow depends on the btl being used (i.e. if I'm using openib, may I
> > use the same number of MPI_Comm object as with tcp) ? Is there something
> > as MPI_COMM_MAX in OpenMPI ?
> > 
> > 2/ the segfaults only appears during a mpi collective call, with very
> > small message (one int is being broadcast, for instance) ; i followed
> > the guidelines given at http://icl.cs.utk.edu/open-
> > mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug-build
> > of OpenMPI asserts if I use a different min-size that 255. Anyway, if I
> > deactivate eager_rdma, the segfaults remains. Does the openib btl handle
> > very small message differently (even with eager_rdma deactivated) than
> > tcp ?
> 
> Others on the list does coalescing happen with non-eager_rdma?  If so
> then that would possibly be one difference between the openib btl and
> tcp aside from the actual protocol used.
> 
> >  is there a way to make sure that large messages and small messages are
> >  handled the same way ?
> 
> Do you mean so they all look like eager messages?  How large of messages
> are we talking about here 1K, 1M or 10M?
> 
> --td
> 
> > Regards,
> > Eloi
> > 
> > On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
> >> Hi Eloi,
> >> Create a debug build of OpenMPI (--enable-debug) and while running with
> >> the csum PML add "-mca pml_base_verbose 5" to the command line. This
> >> will print the checksum details for each fragment sent over the wire.
> >> I'm guessing it didnt catch anything because the BTL failed. The
> >> checksum verification is done in the PML, which the BTL calls via a
> >> callback function. In your case the PML callback is never called
> >> because the hdr->tag is invalid. So enabling checksum tracing also
> >> might not be of much use. Is it the first Bcast that fails or the nth
> >> Bcast and what is the message size? I'm not sure what could be the
> >> problem at this moment. I'm afraid you will have to debug the BTL to
> >> find out more.
> >> 
> >> --Nysal
> >> 
> >> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry  wrote:
> >>> Hi Nysal,
> >>> 
> >>> thanks for your response.
> >>> 
> >>> I've been unable so far to write a test case that could illustrate the
> >>> hdr->tag=0 error.
> >>> Actually, I'm only observing this issue when running an internode
> >>> 

Re: [OMPI users] How to know which process is running on which core?

2010-09-24 Thread Ralph Castain
You mean via an API of some kind? Not through an MPI call, but you can do it
(though your code will wind up OMPI-specific). Look at the OMPI source code
in opal/mca/paffinity/paffinity.h and you'll see the necessary calls as well
as some macros to help parse the results.

Depending upon what version you are using, there may also be a function in
opal/mca/paffinity/base/base.h to pretty-print that info for you. I believe
it may only be in the developer's trunk right now, or it may have made it
into the 1.5.0 release candidate


On Thu, Sep 23, 2010 at 11:24 AM, Fernando Saez wrote:

> Hi all, I'm new in the list. I don't know if this post has been treated
> before.
>
> My question is:
>
> Is there a way in the OMPI library to report which process is running
> on which core in a SMP system? I need to know processor affinity for
> optimizations issues.
>
> Regards
>
> Fernando Saez
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>