from:"Pavel Shamis \(Pasha\)"

Re: [OMPI users] Bad Infiniband latency with subounce

2010-02-18 Thread Pavel Shamis (Pasha)


Steve, thank for the details.

What is the command line that you use to run the benchmark ?
Can you try to add follow mca parameters to your command line:
"--mca btl openib,sm,self --mca btl_openib_max_btls 1"

Thanks,
Pasha

Repsher, Stephen J wrote:

Thanks for keeping on this Hopefully this answers all the questions:

The cluster has some blades with XRC, others without.  I've tested on both with 
the same results. For MVAPICH, a flag is set to turn on XRC; I'm not sure how 
OpenMPI handles it but my build is configured --enable-openib-connectx-xrc.

OpenMPI is built on a head node with a 2-port HCA (1 active) and installed on a 
shared file system.  The compute blades I'm using are Infinihost IIIs, 1-port 
HCAs.

As for nRepeats in bounce, I could increase it, but if that were the problem 
then I'd expect MVAPICH to report sporadic results as well.

I just downloaded the OSU benchmarks and tried osu_latency It's report ~40 
microsecs for OpenMPI, and ~3 micrcosecs for MVAPICH.  Still puzzled...

Steve


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Pavel Shamis (Pasha)
Sent: Thursday, February 18, 2010 3:33 AM
To: Open MPI Users
Subject: Re: [OMPI users] Bad Infiniband latency with subounce

Hey,
I only may to add the XRC and RC have the same latency.
What is the command line that you use to run this benchmark ?
What is the system configuration  (one hca, one active port ) ?
Any addition information about system configuration, mpi command line, etc. 
will help to analyze your issue.

Regards,
Pasha (Mellanox guy :-) )

Jeff Squyres wrote:
  

I'll defer to the Mellanox guys to reply more in detail, but here's a few 
thoughts:

- Is MVAPICH using XRC?  (I never played with XRC much; it would 
surprise me if it caused instability on the order of up to 100 micros 
-- I ask just to see if it is an apples-to-apples comparison)


- The nRepeats value in this code is only 10, meaning that it only seems to be 
doing 10 iterations on each size.  For small sizes, this might well be not 
enough to be accurate.  Have you tried increasing it?  Or using a different 
benchmark app, such as NetPIPE, osu_latency, ...etc.?



On Feb 16, 2010, at 8:49 AM, Repsher, Stephen J wrote:

  


Well the "good" news is I can end your debate over binding here...setting 
mpi_paffinity_alone 1 did nothing. (And personally as a user, I don't care what the 
default is so long as info is readily apparent in the main docs...and I did see the FAQs 
on it).

It did lead me to try another parameter though, -mca mpi_preconnect_all 1, 
which seems to reduce the measured latency reliably of subounce, but it's still 
sporadic and order ~10-100 microseconds.  It leads me to think that OpenMPI has 
issues with the method of measurement, which is simply to send progressively 
larger blocked messages right after calling MPI_Init (starting at 0 bytes which 
it times as the latency). OpenMPI's lazy connections clearly mess with this.

But still not consistently 1-2 microsecs...

Steve


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] 
On Behalf Of Ralph Castain

Sent: Monday, February 15, 2010 11:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Bad Infiniband latency with subounce


On Feb 15, 2010, at 8:44 PM, Terry Frankcombe wrote:


  

On Mon, 2010-02-15 at 20:18 -0700, Ralph Castain wrote:
  


Did you run it with -mca mpi_paffinity_alone 1? Given this is 1.4.1, you can 
set the bindings to -bind-to-socket or -bind-to-core. Either will give you 
improved performance.

IIRC, MVAPICH defaults to -bind-to-socket. OMPI defaults to no binding.

  
Is this sensible?  Won't most users want processes bound?  OMPI's 
supposed to "to the right thing" out of the box, right?
  


Well, that depends on how you look at it. Been the subject of a lot of debate 
within the devel community. If you bind by default and it is a shared node 
cluster, then you can really mess people up. On the other hand, if you don't 
bind by default, then people that run benchmarks without looking at the options 
can get bad numbers. Unfortunately, there is no automated way to tell if the 
cluster is configured for shared use or dedicated nodes.

I honestly don't know that "most users want processes bound". One 
installation I was at set binding by default using the system mca 
param file, and got yelled at by a group of users that had threaded 
apps - and most definitely did -not- want their processes bound. 
After a while, it became clear that nothing we could do would make 
everyone happy :-/


I doubt there is a right/wrong answer - at least, we sure can't find one. So we don't 
bind by default so we "do no harm", and put out FAQs, man pages, mpirun option 
help messages,

Re: [OMPI users] Bad Infiniband latency with subounce

2010-02-18 Thread Pavel Shamis (Pasha)


Hey,
I only may to add the XRC and RC have the same latency.
What is the command line that you use to run this benchmark ?
What is the system configuration  (one hca, one active port ) ?
Any addition information about system configuration, mpi command line, 
etc. will help to analyze your issue.


Regards,
Pasha (Mellanox guy :-) )

Jeff Squyres wrote:

I'll defer to the Mellanox guys to reply more in detail, but here's a few 
thoughts:

- Is MVAPICH using XRC?  (I never played with XRC much; it would surprise me if 
it caused instability on the order of up to 100 micros -- I ask just to see if 
it is an apples-to-apples comparison)

- The nRepeats value in this code is only 10, meaning that it only seems to be 
doing 10 iterations on each size.  For small sizes, this might well be not 
enough to be accurate.  Have you tried increasing it?  Or using a different 
benchmark app, such as NetPIPE, osu_latency, ...etc.?



On Feb 16, 2010, at 8:49 AM, Repsher, Stephen J wrote:

  

Well the "good" news is I can end your debate over binding here...setting 
mpi_paffinity_alone 1 did nothing. (And personally as a user, I don't care what the 
default is so long as info is readily apparent in the main docs...and I did see the FAQs 
on it).

It did lead me to try another parameter though, -mca mpi_preconnect_all 1, 
which seems to reduce the measured latency reliably of subounce, but it's still 
sporadic and order ~10-100 microseconds.  It leads me to think that OpenMPI has 
issues with the method of measurement, which is simply to send progressively 
larger blocked messages right after calling MPI_Init (starting at 0 bytes which 
it times as the latency). OpenMPI's lazy connections clearly mess with this.

But still not consistently 1-2 microsecs...

Steve


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Monday, February 15, 2010 11:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Bad Infiniband latency with subounce


On Feb 15, 2010, at 8:44 PM, Terry Frankcombe wrote:



On Mon, 2010-02-15 at 20:18 -0700, Ralph Castain wrote:
  

Did you run it with -mca mpi_paffinity_alone 1? Given this is 1.4.1, you can 
set the bindings to -bind-to-socket or -bind-to-core. Either will give you 
improved performance.

IIRC, MVAPICH defaults to -bind-to-socket. OMPI defaults to no binding.


Is this sensible?  Won't most users want processes bound?  OMPI's
supposed to "to the right thing" out of the box, right?
  

Well, that depends on how you look at it. Been the subject of a lot of debate 
within the devel community. If you bind by default and it is a shared node 
cluster, then you can really mess people up. On the other hand, if you don't 
bind by default, then people that run benchmarks without looking at the options 
can get bad numbers. Unfortunately, there is no automated way to tell if the 
cluster is configured for shared use or dedicated nodes.

I honestly don't know that "most users want processes bound". One installation 
I was at set binding by default using the system mca param file, and got yelled at by a 
group of users that had threaded apps - and most definitely did -not- want their 
processes bound. After a while, it became clear that nothing we could do would make 
everyone happy :-/

I doubt there is a right/wrong answer - at least, we sure can't find one. So we don't 
bind by default so we "do no harm", and put out FAQs, man pages, mpirun option 
help messages, etc. that explain the situation and tell you when/how to bind.




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
  

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] [btl_openib_component.c:1373:btl_openib_component_progress] error polling HP CQ with -2 errno says Success

2009-09-26 Thread Pavel Shamis (Pasha)


Very strange. MPI tries to access CQ context and it get immediate error.
Please make sure that you limits configuration is ok, take a look on 
this FAQ - http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages


Pasha.


Charles Wright wrote:

Hello,
   I just got some new cluster hardware :)  :(

I can't seem to overcome an openib problem
I get this at run time

error polling HP CQ with -2 errno says Success

I've tried 2 different IB switches and multiple sets of nodes all on 
one switch or the other to try to eliminate the hardware.   (IPoIB 
pings work and IB switches ree
I've tried both v1.3.3 and v1.2.9 and get the same errors.I'm not 
really sure what these errors mean or how to get rid of them.
My MPI application work if all the CPUs are on the same node (self btl 
only probably)


Any advice would be appreciated.  Thanks.

asnrcw@dmc:~> qsub -I -l nodes=32,partition=dmc,feature=qc226 -q sysadm
qsub: waiting for job 232035.mds1.asc.edu to start
qsub: job 232035.mds1.asc.edu ready


# Alabama Supercomputer Center - PBS Prologue
# Your job id is : 232035
# Your job name is : STDIN
# Your job's queue is : sysadm
# Your username for this job is : asnrcw
# Your groupfor this job is : analyst
# Your job used : #   8 CPUs on dmc101
#   8 CPUs on dmc102
#   8 CPUs on dmc103
#   8 CPUs on dmc104
# Your job started at : Fri Sep 25 10:20:05 CDT 2009

asnrcw@dmc101:~> asnrcw@dmc101:~> asnrcw@dmc101:~> asnrcw@dmc101:~> 
asnrcw@dmc101:~> cd mpiprintrank

asnrcw@dmc101:~/mpiprintrank> which mpirun
/apps/openmpi-1.3.3-intel/bin/mpirun
asnrcw@dmc101:~/mpiprintrank> mpirun ./mpiprintrank-dmc-1.3.3-intel 
[dmc103][[46071,1],19][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],16][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],17][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],18][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],20][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],21][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],23][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],6][btl_openib_component.c:3047:poll_device] 
[dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success

error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],7][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],22][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] 
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success

error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],3][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],4][btl_openib_component.c:3047:poll_device] 
[dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],0][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success

error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],1][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] 
[dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success

error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],5][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] 
[dmc101][[46071,1],2][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success

Re: [OMPI users] running open mpi on ubuntu 9.04

2009-09-21 Thread Pavel Shamis (Pasha)

You will not be need the trick if you will configure Open Mpi with 
follow flag:

--enable-mpirun-prefix-by-default

Pasha.



Hodgess, Erin wrote:


the LD_LIBRARY_PATH did the trick;
thanks so much!

Sincerely,
Erin


Erin M. Hodgess, PhD
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: hodge...@uhd.edu



-Original Message-
From: users-boun...@open-mpi.org on behalf of Marce
Sent: Sat 9/19/2009 3:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] running open mpi on ubuntu 9.04

2009/9/18 Hodgess, Erin :
> There is no hosts file there originally
> I put in
>
>  cat hosts
> 127.0.0.1   localhost
>
>
> but still get the same thing
>
> thanks,
> Erin
>
> erin@erin-laptop:~$
> Erin M. Hodgess, PhD
> Associate Professor
> Department of Computer and Mathematical Sciences
> University of Houston - Downtown
> mailto: hodge...@uhd.edu
>
>
>
> -Original Message-
> From: users-boun...@open-mpi.org on behalf of Whit Armstrong
> Sent: Fri 9/18/2009 7:36 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] running open mpi on ubuntu 9.04
>
> yes, I had this issue before (we are on 9.04 as well).
> it has to do with the hosts file.
>
> Erin, can you send your hosts file?
>
> I think you want to make this the first line of your host file:
> 127.0.0.1   localhost
>
> Which Ubuntu, if memory serves defaults to the name of the machine 
instead

> of localhost.
>
> -Whit
>
>
> On Fri, Sep 18, 2009 at 8:31 AM, Ralph Castain  wrote:
>
>> It doesn't matter - 1.3 isn't going to launch another daemon on the 
local

>> node.
>> The problem here is that OMPI isn't recognizing your local host as 
being
>> "local" - i.e., it thinks that the host mpirun is executing on is 
somehow

>> not the the local host. This has come up before with ubuntu - you might
>> search the user mailing list for "ubuntu" to see earlier threads on 
this

>> issue.
>>
>> I forget the final solution, but those earlier threads will explain 
what
>> needs to be done. I'm afraid this is something quite specific to 
ubuntu.

>>
>>
>> On Sep 18, 2009, at 6:23 AM, Whit Armstrong wrote:
>>
>> can you "ssh localhost" without a password?
>> -Whit
>>
>>
>> On Thu, Sep 17, 2009 at 11:50 PM, Hodgess, Erin  
wrote:

>>
>>> It's 1.3, please.
>>>
>>> Thanks,
>>>
>>> Erin
>>>
>>>
>>> Erin M. Hodgess, PhD
>>> Associate Professor
>>> Department of Computer and Mathematical Sciences
>>> University of Houston - Downtown
>>> mailto: hodge...@uhd.edu
>>>
>>>
>>>
>>> -Original Message-
>>> From: users-boun...@open-mpi.org on behalf of Ralph Castain
>>> Sent: Thu 9/17/2009 10:39 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] running open mpi on ubuntu 9.04
>>>
>>> I gather you must be running a version of the old 1.2 series? Or are
>>> you running 1.3?
>>>
>>> It does make a difference as to the nature of the problem, and the
>>> recommended solution.
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Sep 17, 2009, at 8:51 PM, Hodgess, Erin wrote:
>>>
>>> > Dear Open MPI people:
>>> >
>>> > I'm trying to run a simple "hello world" program on Ubuntu 9.04
>>> >
>>> > It's on a dual core laptop; no other machines.
>>> >
>>> > Here is the output:
>>> > erin@erin-laptop:~$ mpirun -np 2 a.out
>>> > ssh: connect to host erin-laptop port 22: Connection refused
>>> >
>>>
>>> 
--

>>> > A daemon (pid 11854) died unexpectedly with status 255 while
>>> > attempting
>>> > to launch so we are aborting.
>>> >
>>> > There may be more information reported by the environment (see 
above).

>>> >
>>> > This may be because the daemon was unable to find all the needed
>>> > shared
>>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>> > have the
>>> > location of the shared libraries on the remote nodes and this will
>>> > automatically be forwarded to the remote nodes.
>>> >
>>>
>>> 
--

>>> >
>>>
>>> 
--
>>> > mpirun noticed that the job aborted, but has no info as to the 
process

>>> > that caused that situation.
>>> >
>>>
>>> 
--

>>> > mpirun: clean termination accomplished
>>> >
>>> > erin@erin-laptop:~$
>>> >
>>> > Any help would be much appreciated.
>>> >
>>> > Sincerely,
>>> > Erin
>>> >
>>> >
>>> > Erin M. Hodgess, PhD
>>> > Associate Professor
>>> > Department of Computer and Mathematical Sciences
>>> > University of Houston - Downtown
>>> > mailto: hodge...@uhd.edu
>>> >
>>> >
>>> > ___
>>> > users mailing list
>>> > us...@open-mpi.org
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Job fails after hours of running on a specific node

2009-09-21 Thread Pavel Shamis (Pasha)


Sangamesh,

The ib tunings that you added to your command line only delay the 
problem but doesn't resolve it.
The node-0-2.local gets asynchronous event "IBV_EVENT_PORT_ERROR" and as 
result
the processes fails to deliver packets to some remote hosts and as 
result you see bunch of IB errors.


The IBV_EVENT_PORT_ERROR error means that the IB port gone from ACTIVE 
state do DOWN state.
Or in other words you have problem with your IB networks that cause all 
these networks errors.
Source cause of such issue maybe some bad cable or some problematic port 
on switch.


For the IB network debug I propose you use Ibdiaget, it is open source 
IB network diagnostic tool :

http://linux.die.net/man/1/ibdiagnet
The tool is part of OFED distribution.

Pasha.


Sangamesh B wrote:

Dear all,
 
 The CPMD application which is compiled with OpenMPI-1.3 (Intel 
10.1 compilers) on CentOS-4.5, fails only, when a specific node i.e. 
node-0-2 is involved. But runs well on other nodes.
 
  Initially job failed after 5-10 mins (on node-0-2 + some other 
nodes). After googling error, I added options "-mca 
btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20" to 
mpirun command in the SGE script:
 
$ cat cpmdrun.sh

#!/bin/bash
#$ -N cpmd-acw
#$ -S /bin/bash
#$ -cwd
#$ -e err.$JOB_ID.$JOB_NAME
#$ -o out.$JOB_ID.$JOB_NAME
#$ -pe ib 32
unset SGE_ROOT  
PP_LIBRARY=/home/user1/cpmdrun/wac/prod/PP

CPMD=/opt/apps/cpmd/3.11/ompi/SOURCE/cpmd311-ompi-mkl.x
MPIRUN=/opt/mpi/openmpi/1.3/intel/bin/mpirun
$MPIRUN -np $NSLOTS -hostfile $TMPDIR/machines -mca 
btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20 $CPMD 
wac_md26.in   $PP_LIBRARY > wac_md26.out
After adding these options, job executed for 24+ hours then failed 
with the same error as earlier. The error is:
 
$ cat err.6186.cpmd-acw

--
The OpenFabrics stack has reported a network error event.  Open MPI
will try to continue, but your job may end up failing.
  Local host:node-0-2.local
  MPI process PID:   11840
  Error number:  10 (IBV_EVENT_PORT_ERR)
This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] Set MCA parameter "orte_base_help_aggregate" to 
0 to see all help / error messages
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 15 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 16 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 16 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[[718,1],20][btl_openib_component.c:2902:handle_wc] from 
node-0-22.local to: node-0-2 
--

The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:
 4.096 microseconds * (2^btl_openib_ib_timeout)
  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
Below is some informat

Re: [OMPI users] RETRY EXCEEDED ERROR status number 12

2009-08-21 Thread Pavel Shamis (Pasha)

You may try to use ibdiagnet tool:
http://linux.die.net/man/1/ibdiagnet

The tool is part of OFED (http://www.openfabrics.org/)

Pasha.

Prentice Bisbal wrote:

Several jobs on my cluster just died with the error below.

Are there any IB/Open MPI diagnostics I should use to diagnose, should I
just reboot the nodes, or should I have the user who submitted these
jobs just increase the retry count/timeout paramters?

[0,1,6][../../../../../ompi/mca/btl/openib/btl_openib_component.c:1375:btl_openib_component_progress]
from node14.aurora to: node40.aurora error polling HP CQ with status
RETRY EXCEEDED ERROR status number 12 for wr_id 13606831800 opcode 9
--
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).

* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 10). The actual timeout value used is calculated as:

4.096 microseconds * (2^btl_openib_ib_timeout)

See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Pavel Shamis (Pasha)




However, setting:
-mca btl_openib_eager_limit 65536
gave a 15% improvement so OpenMPI is now down to 326 (from previous 
376 seconds). Still a lot more than ScaliMPI with 214 seconds.
Can you please run ibv_devinfo on one of compute nodes ? It is 
interesting to know what kind of IB HW you have on our cluster.


Pasha

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Pavel Shamis (Pasha)





If the above doesn't improve anything the next question is do you know 
what the sizes of the messages are? For very small messages I believe 
Scali shows a 2x better performance than Intel and OMPI (I think this 
is due to a fastpath optimization).


I remember that mvapich was faster that scali for small messages (I'm 
talking only about IB, no sm).
Ompi 1.3 latency is very close to mvapich latency. So I do not see how 
Scali latency may be better than OMPI.


Pasha

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Pavel Shamis (Pasha)


Torgny,
We have one know issue in openib btl that it related to IPROBE - 
https://svn.open-mpi.org/trac/ompi/ticket/1362
Theoretical it maybe source cause of the performance degradation, but 
for me the performance difference sounds too big.


* Do you know what is typical message size for this application?
* Did you enable live pinned ? (--mca mpi_leave_pinned 1) ?

Also I recommend you to read this FAQ - 
http://netmirror.org/mirror/open-mpi.org/faq/?category=tuning#running-perf-numbers


Pasha


Torgny Faxen wrote:

Pasha,
no collectives are being used.

A simple grep in the code reveals the following MPI functions being used:
MPI_Init
MPI_wtime
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_BUFFER_ATTACH
MPI_BSEND
MPI_PACK
MPI_UNPACK
MPI_PROBE
MPI_GET_COUNT
MPI_RECV
MPI_IPROBE
MPI_FINALIZE

where MPI_IPROBE is the clear winner in terms of number of calls.

/Torgny

Pavel Shamis (Pasha) wrote:

Do you know if the application use some collective operations ?

Thanks

Pasha

Torgny Faxen wrote:

Hello,
we are seeing a large difference in performance for some 
applications depending on what MPI is being used.


Attached are performance numbers and oprofile output (first 30 
lines) from one out of 14 nodes from one application run using 
OpenMPI, IntelMPI and Scali MPI respectively.


Scali MPI is faster the other two MPI:s with a factor of 1.6 and 1.75:

ScaliMPI: walltime for the whole application is 214 seconds
OpenMPI: walltime for the whole application is 376 seconds
Intel MPI: walltime for the whole application is 346 seconds.

The application is running with the main send receive commands being:
MPI_Bsend
MPI_Iprobe followed by MPI_Recv (in case of there being a message). 
Quite often MPI_Iprobe is being called just to check whether there 
is a certain message pending.


Any idea on tuning tips, performance analysis, code modifications to 
improve the OpenMPI performance? A lot of time is being spent in 
"mca_btl_sm_component_progress", "btl_openib_component_progress" and 
other internal routines.


The code is running on a cluster with 140 HP ProLiant DL160 G5 
compute servers. Infiniband interconnect. Intel Xeon E5462 
processors. The profiled application is using 144 cores on 18 nodes 
over Infiniband.


Regards / Torgny
=0 


OpenMPI  1.3b2
=0 



Walltime: 376 seconds

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %image name   app name 
symbol name
668288   22.2113  mca_btl_sm.sorco2.24pe
mca_btl_sm_component_progress
441828   14.6846  rco2.24perco2.24pe
step_
335929   11.1650  libmlx4-rdmav2.sorco2.24pe
(no symbols)
301446   10.0189  mca_btl_openib.sorco2.24pe
btl_openib_component_progress
1610335.3521  libopen-pal.so.0.0.0 rco2.24pe
opal_progress
1570245.2189  libpthread-2.5.sorco2.24pe
pthread_spin_lock
99526 3.3079  no-vmlinux   no-vmlinux   
(no symbols)
93887 3.1204  mca_btl_sm.sorco2.24pe
opal_using_threads
69979 2.3258  mca_pml_ob1.so   rco2.24pe
mca_pml_ob1_iprobe
58895 1.9574  mca_bml_r2.sorco2.24pe
mca_bml_r2_progress
55095 1.8311  mca_pml_ob1.so   rco2.24pe
mca_pml_ob1_recv_request_match_wild
49286 1.6381  rco2.24perco2.24pe
tracer_
41946 1.3941  libintlc.so.5rco2.24pe
__intel_new_memcpy
40730 1.3537  rco2.24perco2.24pe
scobi_
36586 1.2160  rco2.24perco2.24pe
state_
20986 0.6975  rco2.24perco2.24pe
diag_
19321 0.6422  libmpi.so.0.0.0  rco2.24pe
PMPI_Unpack
18552 0.6166  libmpi.so.0.0.0  rco2.24pe
PMPI_Iprobe
17323 0.5757  rco2.24perco2.24pe
clinic_
16194 0.5382  rco2.24perco2.24pe
k_epsi_
15330 0.5095  libmpi.so.0.0.0  rco2.24pe
PMPI_Comm_f2c
13778 0.4579  libmpi_f77.so.0.0.0  rco2.24pe
mpi_iprobe_f
13241 0.4401  rco2.24perco2.24pe
s_recv_
12386 0.4117  rco2.24perco2.24pe
growth_
11699 0.3888  rco2.24perco2.24pe
testnrecv_
11268 0.3745  libmpi.so.0.0.0  rco2.24pe
mca_pml_base_recv_request_construct
10971 0.3646  libmpi.so.0.0.0  rco2.24pe
ompi_co

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Pavel Shamis (Pasha)


Do you know if the application use some collective operations ?

Thanks

Pasha

Torgny Faxen wrote:

Hello,
we are seeing a large difference in performance for some applications 
depending on what MPI is being used.


Attached are performance numbers and oprofile output (first 30 lines) 
from one out of 14 nodes from one application run using OpenMPI, 
IntelMPI and Scali MPI respectively.


Scali MPI is faster the other two MPI:s with a factor of 1.6 and 1.75:

ScaliMPI: walltime for the whole application is 214 seconds
OpenMPI: walltime for the whole application is 376 seconds
Intel MPI: walltime for the whole application is 346 seconds.

The application is running with the main send receive commands being:
MPI_Bsend
MPI_Iprobe followed by MPI_Recv (in case of there being a message). 
Quite often MPI_Iprobe is being called just to check whether there is 
a certain message pending.


Any idea on tuning tips, performance analysis, code modifications to 
improve the OpenMPI performance? A lot of time is being spent in 
"mca_btl_sm_component_progress", "btl_openib_component_progress" and 
other internal routines.


The code is running on a cluster with 140 HP ProLiant DL160 G5 compute 
servers. Infiniband interconnect. Intel Xeon E5462 processors. The 
profiled application is using 144 cores on 18 nodes over Infiniband.


Regards / Torgny
=0 


OpenMPI  1.3b2
=0 



Walltime: 376 seconds

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %image name   app name 
symbol name
668288   22.2113  mca_btl_sm.sorco2.24pe
mca_btl_sm_component_progress

441828   14.6846  rco2.24perco2.24pestep_
335929   11.1650  libmlx4-rdmav2.sorco2.24pe
(no symbols)
301446   10.0189  mca_btl_openib.sorco2.24pe
btl_openib_component_progress
1610335.3521  libopen-pal.so.0.0.0 rco2.24pe
opal_progress
1570245.2189  libpthread-2.5.sorco2.24pe
pthread_spin_lock
99526 3.3079  no-vmlinux   no-vmlinux   
(no symbols)
93887 3.1204  mca_btl_sm.sorco2.24pe
opal_using_threads
69979 2.3258  mca_pml_ob1.so   rco2.24pe
mca_pml_ob1_iprobe
58895 1.9574  mca_bml_r2.sorco2.24pe
mca_bml_r2_progress
55095 1.8311  mca_pml_ob1.so   rco2.24pe
mca_pml_ob1_recv_request_match_wild
49286 1.6381  rco2.24perco2.24pe
tracer_
41946 1.3941  libintlc.so.5rco2.24pe
__intel_new_memcpy
40730 1.3537  rco2.24perco2.24pe
scobi_
36586 1.2160  rco2.24perco2.24pe
state_

20986 0.6975  rco2.24perco2.24pediag_
19321 0.6422  libmpi.so.0.0.0  rco2.24pe
PMPI_Unpack
18552 0.6166  libmpi.so.0.0.0  rco2.24pe
PMPI_Iprobe
17323 0.5757  rco2.24perco2.24pe
clinic_
16194 0.5382  rco2.24perco2.24pe
k_epsi_
15330 0.5095  libmpi.so.0.0.0  rco2.24pe
PMPI_Comm_f2c
13778 0.4579  libmpi_f77.so.0.0.0  rco2.24pe
mpi_iprobe_f
13241 0.4401  rco2.24perco2.24pe
s_recv_
12386 0.4117  rco2.24perco2.24pe
growth_
11699 0.3888  rco2.24perco2.24pe
testnrecv_
11268 0.3745  libmpi.so.0.0.0  rco2.24pe
mca_pml_base_recv_request_construct
10971 0.3646  libmpi.so.0.0.0  rco2.24pe
ompi_convertor_unpack
10034 0.3335  mca_pml_ob1.so   rco2.24pe
mca_pml_ob1_recv_request_match_specific

10003 0.3325  libimf.sorco2.24peexp.L
9375  0.3116  rco2.24perco2.24pe
subbasin_
8912  0.2962  libmpi_f77.so.0.0.0  rco2.24pe
mpi_unpack_f




=0 


Intel MPI, version 3.2.0.011/
=0 



Walltime: 346 seconds

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %image name   app name 
symbol name

486712   17.7537  rco2 rco2 step_
431941   15.755

Re: [OMPI users] Tuned collectives: How to choose them dynamically? (-mca coll_tuned_dynamic_rules_filename dyn_rules)"

2009-08-04 Thread Pavel Shamis (Pasha)


Lenny,
You can find some details here:
http://icl.cs.utk.edu/news_pub/submissions/Flex-collective-euro-pvmmpi-2006.pdf

Pasha


Lenny Verkhovsky wrote:

Hi,
I am looking too for a file example of rules for dynamic collectives,
Have anybody tried it ? Where can I find a proper syntax for it ?
 
thanks.

Lenny.


 
On Thu, Jul 23, 2009 at 3:08 PM, Igor Kozin > wrote:


Hi Gus,
I played with collectives a few months ago. Details are here
http://www.cse.scitech.ac.uk/disco/publications/WorkingNotes.ConnectX.pdf
That was in the context of 1.2.6

You can get available tuning options by doing
ompi_info -all -mca coll_tuned_use_dynamic_rules 1 | grep alltoall
and similarly for other collectives.

Best,
Igor

2009/7/23 Gus Correa mailto:g...@ldeo.columbia.edu>>:
> Dear OpenMPI experts
>
> I would like to experiment with the OpenMPI tuned collectives,
> hoping to improve the performance of some programs we run
> in production mode.
>
> However, I could not find any documentation on how to select the
> different collective algorithms and other parameters.
> In particular, I would love to read an explanation clarifying
> the syntax and meaning of the lines on "dyn_rules"
> file that is passed to
> "-mca coll_tuned_dynamic_rules_filename ./dyn_rules"
>
> Recently there was an interesting discussion on the list
> about this topic.  It showed that choosing the right collective
> algorithm can make a big difference in overall performance:
>
> http://www.open-mpi.org/community/lists/users/2009/05/9355.php
> http://www.open-mpi.org/community/lists/users/2009/05/9399.php
> http://www.open-mpi.org/community/lists/users/2009/05/9401.php
> http://www.open-mpi.org/community/lists/users/2009/05/9419.php
>
> However, the thread was concentrated on "MPI_Alltoall".
> Nothing was said about other collective functions.
> Not much was said about the
> "tuned collective dynamic rules" file syntax,
> the meaning of its parameters, etc.
>
> Is there any source of information about that which I missed?
> Thank you for any pointers or clarifications.
>
> Gus Correa
>
-
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
>
-
> ___
> users mailing list
> us...@open-mpi.org 
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Using dual infiniband HCA cards

2009-07-30 Thread Pavel Shamis (Pasha)





We have a computational cluster which is  consisting of 8 HP Proliant
ML370G5 with 32GB ram.
Each node has  a Melanox single port infiniband DDR HCA  card (20Gbit/s)
and connected each other through
a Voltaire ISR9024D-M DDR infiniband switch.

Now we want to increase the bandwidth to 40GBit/s adding second
infiniband cards to each node.

I want to ask if this is possible, if yes how?
  

You need to check if it possible to add one more Infiniband card to
your motherboard. As well you need verify that you PCI-EX link and the 
chipset
will allow to utilize resources of 2 HCAs.  
You may temporary take 2 hca from some of your machines
and add them to another pair machines. It will allow to you make some 
benchmarking with 2 hcas.


From driver and OpenMPI perspective 2 (and more) hca configuration is 
supported by default.


Pasha.

Re: [OMPI users] [OMPI devel] selectively bind MPI to one HCA out of available ones

2009-07-16 Thread Pavel Shamis (Pasha)


Hi,
You can select ib device used with openib btl by using follow parametres:
MCA btl: parameter "btl_openib_if_include" (current value: , data 
source: default value)
 Comma-delimited list of devices/ports to be 
used (e.g. "mthca0,mthca1:2"; empty value means to
 use all ports found).  Mutually exclusive with 
btl_openib_if_exclude.
MCA btl: parameter "btl_openib_if_exclude" (current value: , data 
source: default value)
 Comma-delimited list of device/ports to be 
excluded (empty value means to not exclude any
 ports).  Mutually exclusive with 
btl_openib_if_include.


For example, if you want to use first port on mthc0 you command line 
will look like:


mpirun -np. --mca btl_openib_if_include mthca0:1 

Pasha

nee...@crlindia.com wrote:


Hi all,

I have a cluster where both HCA's of blade are active, but 
connected to different subnet.
Is there an option in MPI to select one HCA out of available 
one's? I know it can be done by making changes in openmpi code, but i 
need clean interface like option during mpi launch time to select 
mthca0 or mthca1?


Any help is appreciated. Btw i just checked Mvapich and 
feature is there inside.


Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863  (Fax) +91-20-6620 9862
M: +91.9225520634

=-=-= Notice: The information contained in this 
e-mail message and/or attachments to it may contain confidential or 
privileged information. If you are not the intended recipient, any 
dissemination, use, review, distribution, printing or copying of the 
information contained in this e-mail message and/or attachments to it 
are strictly prohibited. If you have received this communication in 
error, please notify us by reply e-mail or telephone and immediately 
and permanently delete the message and any attachments. Internet 
communications cannot be guaranteed to be timely, secure, error or 
virus-free. The sender does not accept liability for any errors or 
omissions.Thank you =-=-=




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

Re: [OMPI users] 50% performance reduction due to OpenMPI v 1.3.2 forcing all MPI traffic over Ethernet instead of using Infiniband

2009-06-23 Thread Pavel Shamis (Pasha)


Jim,
Can you please share with us you mca conf file.

Pasha.
Jim Kress ORG wrote:

For the app I am using, ORCA (a Quantum Chemistry program), when it was
compiled using openMPI 1.2.8 and run under 1.2.8 with the following in
the openmpi-mca-params.conf file:

btl=self,openib

the app ran fine with no traffic over my Ethernet network and all
traffic over my Infiniband network.

However, now that ORCA has been recompiled with openMPI v1.3.2 and run
under 1.3.2 (using the same openmpi-mca-params.conf file), the
performance has been reduced by 50% and all the MPI traffic is going
over the Ethernet network.

As a matter of fact, the openMPI v1.3.2 performance now looks exactly
like the performance I get if I use MPICH 1.2.7.

Anyone have any ideas:

1) How could this have happened?

2) How can I fix it?

a 50% reduction in performance is just not acceptable.  Ideas/
suggestions would be appreciated.

Jim

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] scaling problem with openmpi

2009-05-21 Thread Pavel Shamis (Pasha)




I tried to run with the first dynamic rules file that Pavel proposed
and it works, the time per one MD step on 48 cores decreased from 2.8
s to 1.8 s as expected. 
  

Good news :-)

Pasha.

Thanks

Roman

On Wed, May 20, 2009 at 7:18 PM, Pavel Shamis (Pasha)  wrote:
  

Tomorrow I will add some printf to collective code and check what really
happens there...

Pasha

Peter Kjellstrom wrote:


On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:

  

Disabling basic_linear seems like a good idea but your config file sets
the cut-off at 128 Bytes for 64-ranks (the field you set to 8192 seems
to
result in a message size of that value divided by the number of ranks).

In my testing bruck seems to win clearly (at least for 64 ranks on my
IB)
up to 2048. Hence, the following line may be better:

 131072 2 0 0 # switch to pair wise for size 128K/nranks

Disclaimer: I guess this could differ quite a bit for nranks!=64 and
different btls.

  

Sounds strange for me. From the code is looks that we take the threshold
as
is without dividing by number of ranks.



Interesting, I may have had to little or too much coffe but the figures in
my previous e-mail (3rd run, bruckto2k_pair) was run with the above line.
And it very much looks like it switched at 128K/64=2K, not at 128K (which
would have been above my largest size of 3000 and as such equiv. to
all_bruck).

I also ran tests with:
 8192 2 0 0 # ...
And it seemed to switch between 10 Bytes and 500 Bytes (most likely then
at 8192/64=128).

My testprogram calls MPI_Alltoall like this:
 time1 = MPI_Wtime();
 for (i = 0; i < repetitions; i++) {
   MPI_Alltoall(sbuf, message_size, MPI_CHAR,
rbuf, message_size, MPI_CHAR, MPI_COMM_WORLD);
 }
 time2 = MPI_Wtime();

/Peter
 

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
  

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Pavel Shamis (Pasha)

Tomorrow I will add some printf to collective code and check what really 
happens there...


Pasha

Peter Kjellstrom wrote:

On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
  

Disabling basic_linear seems like a good idea but your config file sets
the cut-off at 128 Bytes for 64-ranks (the field you set to 8192 seems to
result in a message size of that value divided by the number of ranks).

In my testing bruck seems to win clearly (at least for 64 ranks on my IB)
up to 2048. Hence, the following line may be better:

 131072 2 0 0 # switch to pair wise for size 128K/nranks

Disclaimer: I guess this could differ quite a bit for nranks!=64 and
different btls.
  

Sounds strange for me. From the code is looks that we take the threshold as
is without dividing by number of ranks.



Interesting, I may have had to little or too much coffe but the figures in my 
previous e-mail (3rd run, bruckto2k_pair) was run with the above line. And it 
very much looks like it switched at 128K/64=2K, not at 128K (which would have 
been above my largest size of 3000 and as such equiv. to all_bruck).


I also ran tests with:
 8192 2 0 0 # ...
And it seemed to switch between 10 Bytes and 500 Bytes (most likely then at 
8192/64=128).


My testprogram calls MPI_Alltoall like this:
  time1 = MPI_Wtime();
  for (i = 0; i < repetitions; i++) {
MPI_Alltoall(sbuf, message_size, MPI_CHAR,
 rbuf, message_size, MPI_CHAR, MPI_COMM_WORLD);
  }
  time2 = MPI_Wtime();

/Peter
  



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Pavel Shamis (Pasha)



Disabling basic_linear seems like a good idea but your config file sets the 
cut-off at 128 Bytes for 64-ranks (the field you set to 8192 seems to result 
in a message size of that value divided by the number of ranks).


In my testing bruck seems to win clearly (at least for 64 ranks on my IB) up 
to 2048. Hence, the following line may be better:


 131072 2 0 0 # switch to pair wise for size 128K/nranks

Disclaimer: I guess this could differ quite a bit for nranks!=64 and different 
btls.
  

Sounds strange for me. From the code is looks that we take the threshold as
is without dividing by number of ranks.


Pasha,

Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Pavel Shamis (Pasha)




The correct MCA parameters are the following:
-mca coll_tuned_use_dynamic_rules 1
-mca coll_tuned_dynamic_rules_filename ./dyn_rules

Ohh..it was my mistake



You can also run the following command:
ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
This will give some insight into all the various algorithms that make 
up the tuned collectives.


If I am understanding what is happening, it looks like the original 
MPI_Alltoall made use of three algorithms. (You can look in 
coll_tuned_decision_fixed.c)


If message size < 200 or communicator size > 12
bruck
else if message size < 3000
basic linear
else
pairwise
end

Yep it is correct.


With the file Pavel has provided things have changed to the following. 
(maybe someone can confirm)


If message size < 8192
bruck
else
pairwise
end
You are right here. Target of my conf file is disable basic_linear for 
medium message side.



Pasha.

Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Pavel Shamis (Pasha)


Default algorithm thresholds in mvapich are different from ompi.
Using tunned collectives in Open MPI you may configure the Open MPI 
Alltoall threshold as Mvapich defaults.
The follow mca parameters configure Open MPI to use custom rules that 
are defined in configure(txt) file.

"--mca use_dynamic_rules 1 --mca dynamic_rules_filename"

Here is example of dynamic_rules_filename that should make Ompi Alltoall 
tuning similar to Mvapich:

1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of com sizes
64 # comm size 8
2 # number of msg sizes
0 3 0 0 # for message size 0, bruck 1, topo 0, 0 segmentation
8192 2 0 0 # 8k+, pairwise 2, no topo or segmentation
# end of first collective


Thanks,
Pasha

Peter Kjellstrom wrote:

On Tuesday 19 May 2009, Peter Kjellstrom wrote:
  

On Tuesday 19 May 2009, Roman Martonak wrote:


On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom  wrote:
  

On Tuesday 19 May 2009, Roman Martonak wrote:
...



openmpi-1.3.2   time per one MD step is 3.66 s
   ELAPSED TIME :0 HOURS  1 MINUTES 25.90 SECONDS
 = ALL TO ALL COMM   102033. BYTES   4221.  =
 = ALL TO ALL COMM 7.802  MB/S  55.200 SEC  =
  

...



With TASKGROUP=2 the summary looks as follows
  

...



 = ALL TO ALL COMM   231821. BYTES   4221.  =
 = ALL TO ALL COMM82.716  MB/S  11.830 SEC  =
  

Wow, according to this it takes 1/5th the time to do the same number (4221)
of alltoalls if the size is (roughly) doubled... (ten times better
performance with the larger transfer size)

Something is not quite right, could you possibly try to run just the
alltoalls like I suggested in my previous e-mail?



I was curious so I ran som tests. First it seems that the size reported by 
CPMD is the total size of the data buffer not the message size. Running 
alltoalls with 231821/64 and 102033/64 gives this (on a similar setup):


bw for   4221x 1595 B :  36.5 Mbytes/s   time was:  23.3 s
bw for   4221x 3623 B : 125.4 Mbytes/s   time was:  15.4 s
bw for   4221x 1595 B :  36.4 Mbytes/s   time was:  23.3 s
bw for   4221x 3623 B : 125.6 Mbytes/s   time was:  15.3 s

So it does seem that OpenMPI has some problems with small alltoalls. It is 
obviously broken when you can get things across faster by sending more...


As a reference I ran with a commercial MPI using the same program and node-set 
(I did not have MVAPICH nor IntelMPI on this system):


bw for   4221x 1595 B :  71.4 Mbytes/s   time was:  11.9 s
bw for   4221x 3623 B : 125.8 Mbytes/s   time was:  15.3 s
bw for   4221x 1595 B :  71.1 Mbytes/s   time was:  11.9 s
bw for   4221x 3623 B : 125.5 Mbytes/s   time was:  15.3 s

To see when OpenMPI falls over I ran with an increasing packet size:

bw for   10  x 2900 B :  59.8 Mbytes/s   time was:  61.2 ms
bw for   10  x 2925 B :  59.2 Mbytes/s   time was:  62.2 ms
bw for   10  x 2950 B :  59.4 Mbytes/s   time was:  62.6 ms
bw for   10  x 2975 B :  58.5 Mbytes/s   time was:  64.1 ms
bw for   10  x 3000 B : 113.5 Mbytes/s   time was:  33.3 ms
bw for   10  x 3100 B : 116.1 Mbytes/s   time was:  33.6 ms

The problem seems to be for packets with 1000Bytes < size < 3000Bytes with a 
hard edge at 3000Bytes. Your CPMD was communicating at more or less the worst 
case packet size.


These are the figures for my "reference" MPI:

bw for   10  x 2900 B : 110.3 Mbytes/s   time was:  33.1 ms
bw for   10  x 2925 B : 110.4 Mbytes/s   time was:  33.4 ms
bw for   10  x 2950 B : 111.5 Mbytes/s   time was:  33.3 ms
bw for   10  x 2975 B : 112.4 Mbytes/s   time was:  33.4 ms
bw for   10  x 3000 B : 118.2 Mbytes/s   time was:  32.0 ms
bw for   10  x 3100 B : 114.1 Mbytes/s   time was:  34.2 ms

Setup-details:
hw: dual socket quad core harpertowns with ConnectX IB and 1:1 2-level tree
sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on 
OFED from CentOS (1.3.2-ish I think).


/Peter
  



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] scaling problem with openmpi

2009-05-18 Thread Pavel Shamis (Pasha)


Roman,
Can you please share with us Mvapich numbers that you get . Also what is 
mvapich version that you use.
Default mvapich and openmpi IB tuning is very similar, so it is strange 
to see so big difference. Do you know what kind of collectives operation 
is used in this specific application.


Pasha.

Roman Martonak wrote:

I've been using --mca mpi_paffinity_alone 1 in all simulations. Concerning "-mca
 mpi_leave_pinned 1", I tried it with openmpi 1.2.X versions and it
makes no difference.

Best regards

Roman

On Mon, May 18, 2009 at 4:57 PM, Pavel Shamis (Pasha)  wrote:
  

1) I was told to add "-mca mpi_leave_pinned 0" to avoid problems with
Infinband.  This was with OpenMPI 1.3.1.  Not
  

Actually for 1.2.X version I will recommend you to enable leave pinned "-mca
mpi_leave_pinned 1"


sure if the problems were fixed on 1.3.2, but I am hanging on to that
setting just in case.
  

We had data corruption issue in 1.3.1 but it was resolved in 1.3.2. In 1.3.2
version leave_pinned is enabled by default.

If I remember correct mvapich enables affinity mode by default, so I can
recommend you to try to enable it too:
"--mca mpi_paffinity_alone 1". For more details please check FAQ -
http://www.open-mpi.org/faq/?category=tuning#using-paffinity

Thanks,
Pasha.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] scaling problem with openmpi

2009-05-18 Thread Pavel Shamis (Pasha)





1) I was told to add "-mca mpi_leave_pinned 0" to avoid problems with 
Infinband.  This was with OpenMPI 1.3.1.  Not
Actually for 1.2.X version I will recommend you to enable leave pinned 
"-mca mpi_leave_pinned 1"
sure if the problems were fixed on 1.3.2, but I am hanging on to that 
setting just in case.
We had data corruption issue in 1.3.1 but it was resolved in 1.3.2. In 
1.3.2 version leave_pinned is enabled by default.


If I remember correct mvapich enables affinity mode by default, so I can 
recommend you to try to enable it too:
"--mca mpi_paffinity_alone 1". For more details please check FAQ - 
http://www.open-mpi.org/faq/?category=tuning#using-paffinity


Thanks,
Pasha.

Re: [OMPI users] Problems with "error polling LP CQ with status RNR"

2009-05-14 Thread Pavel Shamis (Pasha)

RNR , receive is not ready - It means that on recv side MPI don't have 
buffers to get the data.
It may point to some broken configuration in MPI/ofud or credit leak in 
OFUD code.



Åke Sandgren wrote:

Hi!

I'm having problem with getting the "error polling LP CQ with status
RNR..." on an otherwise completely empty system.
There are no errors visible in the error counters in any of the HCAs or
switches or anywhere else.

I'm running OMPI 1.3.2 built with pathscale 3.2

If i add -mca btl 'ofud,self,sm' the same code works ok.

It usually only shows up on runs with nodes=16:ppn=8 or higher, i.e. 8x8
works ok.

This might very well be a pathscale problem since when running with the
debug version of ompi 1.3.2 the problem goes away.

Complete error is:
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR
status number 13 for wr_id 465284992 opcode -1  vendor error 135 qp_idx
0

Any ideas to where in the ompi code i should start reducing optimization
levels to pinpoint this?

I'll try some more tests tomorrow with a hopefully fresh mind...

Re: [OMPI users] Slightly off topic: Ethernet and InfiniBand speed evolution

2009-05-07 Thread Pavel Shamis (Pasha)




The (low level verbs) latency has AFAIR changed only a few times:

1) started at 5-6us with PCI-X Infinihost3
2) dropped to 3-4us with PCI-express Infinihost3
3) dropped to ~1us with PCI-express ConnectX
  
I would like to add that on PCI-EX Gen2 platforms the latency is sub 
micro (~0.8-0.95)

Disclaimer: rough figures and only for Mellanox chips.
  

The same declaimer here.

Pasha
  

Regards

Neeraj Chourasia




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Slightly off topic: Ethernet and InfiniBand speed evolution

2009-05-05 Thread Pavel Shamis (Pasha)



I can't find a similar data set for Infiniband. I would appreciate any 
comment/links.

Here is IB roadmap http://www.infinibandta.org/itinfo/IB_roadmap
...But I do not see there SDR

Pasha

Re: [OMPI users] users Digest, Vol 1217, Issue 2, Message3

2009-05-05 Thread Pavel Shamis (Pasha)


Jan,
I guess that you have OFED driver installed on you machines. You may do 
basic network verification with ibdiagnet utility 
(http://linux.die.net/man/1/ibdiagnet) that is part of OFED installation. 


Regards,
Pasha


Jeff Squyres wrote:

On May 4, 2009, at 9:50 AM, jan wrote:


Thank you Jeff. I have passed the mail to the IB vendor Dell company(the
blade was ordered from Dell Taiwan), but he todl me that he didn't
understand  "layer 0 diagnostics". Coluld you help us to get more
information of "layer 0 diagnostics". Thanks again.



Layer 0 = your physical network layer.  Specifically: ensure that your 
IB network is actually functioning properly at both the physical and 
driver layer.  Cisco was an IB vendor for several years; I can tell 
you from experience that it is *not* enough to just plug everything in 
and run a few trivial tests to ensure that network traffic seems to be 
passed properly.  You need to have your vendor run a full set of layer 
0 diagnostics to ensure that all the cables are good, all the HCAs are 
good, all the drivers are functioning properly, etc.  This involves 
running diagnostic network testing patterns, checking various error 
counters on the HCAs and IB switches, etc.


This is something that Dell should know how to do.

I say all this because the problem that you are seeing *seems* to be a 
network-related problem, not an OMPI-related problem.  One can never 
know for sure, but it is fairly clear that the very first step in your 
case is to verify that the network is functioning 100% properly.  
FWIW: this was standard operating procedure when Cisco was selling IB 
hardware.

Re: [OMPI users] [Fwd: mpi alltoall memory requirement]

2009-04-26 Thread Pavel Shamis (Pasha)

You may try to use XRC, it should decrease openib btl memory footprint, 
especially on multi-core system, like you have. The follow command will 
switch default OMPI config to XRC:
" --mca btl_openib_receive_queues 
X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32"

Regards,
Pasha

Jeff Squyres wrote:

I think Ashley still has the right general idea.

You need to see how much memory the OS is taking off the top.  Then 
see how much memory the application images consume (before using any 
memory).  Open MPI itself then takes up a bunch of memory for its own 
internal buffering.  Remember, too, that Open MPI will default to both 
shared memory and OpenFabrics -- both of which have their own, 
separate buffering.

You can disable shared memory, if you want, by specifying

   mpirun --mca btl openib,self ...

("sm" is the shared memory btl, so by not specifying it, Open MPI 
won't use it)

If you have recent Mellanox HCAs, you should probably be using OMPI's 
XRC support, which will decrease OMPI's memory usage even further 
(I'll let Mellanox comment on this further if they want).

Finally, there's a bunch of information on this FAQ page describing 
how to tune Open MPI's OpenFabrics usage:

http://www.open-mpi.org/faq/?category=openfabrics

On Apr 23, 2009, at 1:35 PM, Viral Mehta wrote:

yes of course i m sure about wellness of providers and i m using 
ofed-1.4.1-rc3
i m running 24 proc per node on 8 node cluster. so as i showed in 
calculation that i require 36G mem.
i just need to know if my calculation has not some obvious flaw 
and/or if i m missing anything about setting up system environment or 
anything like that

On Thu, Apr 23, 2009 at 10:36 PM, gossips J  wrote:
What is the NIC you use?
What OFED build?
Are you sure about wellness of provider lib/drivers..?

It is strange that you get out of mem in all to all tests... should 
not happen on 32G system,..!!!

-polk.

On 4/23/09, viral@gmail.com  wrote:
or any link which helps to understand system reuirement for certain 
test scenario ..

On Apr 23, 2009 12:42pm, viral@gmail.com wrote:
> Hi
> Thanks for your response.
> However, I am running
> mpiexec  -ppn 24 -n 192 /opt/IMB-MPI1 alltaoll -msglen /root/temp
>
> And file /root/temp contains entry upto 65535 size only. That means 
alltoall test will run upto 65K size only

>
> So, in that case I will require very less memory but then in that 
case also test is running out-of-memory. Please help someone to 
understand the scenario.
> Or do I need to switch to some algorithm or do I need to set some 
other environment variables ? or anything like that ?

>
> On Apr 22, 2009 6:43pm, Ashley Pittman ash...@pittman.co.uk> wrote:
> > On Wed, 2009-04-22 at 12:40 +0530, vkm wrote:
> >
> >
> >
> > > The same amount of memory required for recvbuf. So at the least 
each

> >
> > > node should have 36GB of memory.
> >
> > >
> >
> > > Am I calculating right ? Please correct.
> >
> >
> >
> > Your calculation looks correct, the conclusion is slightly wrong
> >
> > however. The Application buffers will consume 36Gb of memory, the 
rest

> >
> > of the application, any comms buffers and the usual OS overhead 
will be

> >
> > on top of this so putting only 36Gb of ram in your nodes will still
> >
> > leave you short.
> >
> >
> >
> > Ashley,
> >
> >
> >

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Thanks,
Viral Mehta

Re: [OMPI users] mlx4 error - looking for guidance

2009-03-05 Thread Pavel Shamis (Pasha)

The fw version 2.3.0 is too old. I recommend you to upgrade to the 
latest version (2.6.0) from
Mellanox website 
http://www.mellanox.com/content/pages.php?pg=firmware_table_ConnectXIB


Thanks,
Pasha

Jeff Layton wrote:

Oops. I ran it on the head node and not the compute node. Here is the
output from a compute node:

hca_id: mlx4_0
   fw_ver: 2.3.000
   node_guid:  0018:8b90:97fe:1b6d
   sys_image_guid: 0018:8b90:97fe:1b70
   vendor_id:  0x02c9
   vendor_part_id: 25418
   hw_ver: 0xA0
   board_id:   DEL08C001
   phys_port_cnt:  2
   max_mr_size:0x
   page_size_cap:  0xf000
   max_qp: 131008
   max_qp_wr:  16351
   device_cap_flags:   0x001c1c66
   max_sge:32
   max_sge_rd: 0
   max_cq: 65408
   max_cqe:4194303
   max_mr: 131056
   max_pd: 32764
   max_qp_rd_atom: 16
   max_ee_rd_atom: 0
   max_res_rd_atom:2096128
   max_qp_init_rd_atom:128
   max_ee_init_rd_atom:0
   atomic_cap: ATOMIC_HCA (1)
   max_ee: 0
   max_rdd:0
   max_mw: 0
   max_raw_ipv6_qp:0
   max_raw_ethy_qp:0
   max_mcast_grp:  8192
   max_mcast_qp_attach:56
   max_total_mcast_qp_attach:  458752
   max_ah: 0
   max_fmr:0
   max_srq:65472
   max_srq_wr: 16383
   max_srq_sge:31
   max_pkeys:  128
   local_ca_ack_delay: 15
   port:   1
   state:  PORT_ACTIVE (4)
   max_mtu:2048 (4)
   active_mtu: 2048 (4)
   sm_lid: 41
   port_lid:   70
   port_lmc:   0x00
   max_msg_sz: 0x4000
   port_cap_flags: 0x02510868
   max_vl_num: 8 (4)
   bad_pkey_cntr:  0x0
   qkey_viol_cntr: 0x0
   sm_sl:  0
   pkey_tbl_len:   128
   gid_tbl_len:128
   subnet_timeout: 18
   init_type_reply:0
   active_width:   4X (2)
   active_speed:   5.0 Gbps (2)
   phys_state: LINK_UP (5)
   GID[  0]:   
fe80::::0018:8b90:97fe:1b6e


   port:   2
   state:  PORT_DOWN (1)
   max_mtu:2048 (4)
   active_mtu: 2048 (4)
   sm_lid: 0
   port_lid:   0
   port_lmc:   0x00
   max_msg_sz: 0x4000
   port_cap_flags: 0x02510868
   max_vl_num: 8 (4)
   bad_pkey_cntr:  0x0
   qkey_viol_cntr: 0x0
   sm_sl:  0
   pkey_tbl_len:   128
   gid_tbl_len:128
   subnet_timeout: 0
   init_type_reply:0
   active_width:   4X (2)
   active_speed:   2.5 Gbps (1)
   phys_state: POLLING (2)
   GID[  0]:   
fe80::::0018:8b90:97fe:1b6f






Do you have the same HCA adapter type on all of your machines ?
In the error log I see mlx4 error message , and mlx4 is connectX driver,
but ibv_devinfo show some older hca.


Jeff,
Can you please provide more information about you HCA type 
(ibv_devinfo -v).
Do you see this error immediate during startup, or you get it 
during your run ?


Thanks,
Pasha

Jeff Layton wrote:

Evening everyone,

I'm running a CFD code on IB and I've encountered an error I'm not 
sure about and I'm looking for some guidance on where to start 
looking. Here's the error:


mlx4: local QP operat

Re: [OMPI users] mlx4 error - looking for guidance

2009-03-05 Thread Pavel Shamis (Pasha)



Do you have the same HCA adapter type on all of your machines ?
In the error log I see mlx4 error message , and mlx4 is connectX driver,
but ibv_devinfo show some older hca.

Pasha

Jeff Layton wrote:

Pasha,

Here you go... :) Thanks for looking at this.

Jeff

hca_id: mthca0
   fw_ver: 4.8.200
   node_guid:  0003:ba00:0100:38ac
   sys_image_guid: 0003:ba00:0100:38af
   vendor_id:  0x02c9
   vendor_part_id: 25208
   hw_ver: 0xA0
   board_id:   MT_00B0010001
   phys_port_cnt:  2
   max_mr_size:0x
   page_size_cap:  0xf000
   max_qp: 64512
   max_qp_wr:  65535
   device_cap_flags:   0x1c76
   max_sge:59
   max_sge_rd: 0
   max_cq: 65408
   max_cqe:131071
   max_mr: 131056
   max_pd: 32768
   max_qp_rd_atom: 4
   max_ee_rd_atom: 0
   max_res_rd_atom:258048
   max_qp_init_rd_atom:128
   max_ee_init_rd_atom:0
   atomic_cap: ATOMIC_HCA (1)
   max_ee: 0
   max_rdd:0
   max_mw: 0
   max_raw_ipv6_qp:0
   max_raw_ethy_qp:0
   max_mcast_grp:  8192
   max_mcast_qp_attach:56
   max_total_mcast_qp_attach:  458752
   max_ah: 0
   max_fmr:0
   max_srq:960
   max_srq_wr: 65535
   max_srq_sge:31
   max_pkeys:  64
   local_ca_ack_delay: 15
   port:   1
   state:  PORT_ACTIVE (4)
   max_mtu:2048 (4)
   active_mtu: 2048 (4)
   sm_lid: 41
   port_lid:   41
   port_lmc:   0x00
   max_msg_sz: 0x8000
   port_cap_flags: 0x02510a6a
   max_vl_num: 8 (4)
   bad_pkey_cntr:  0x0
   qkey_viol_cntr: 0x0
   sm_sl:  0
   pkey_tbl_len:   64
   gid_tbl_len:32
   subnet_timeout: 18
   init_type_reply:0
   active_width:   4X (2)
   active_speed:   2.5 Gbps (1)
   phys_state: LINK_UP (5)
   GID[  0]:   
fe80::::0003:ba00:0100:38ad


   port:   2
   state:  PORT_DOWN (1)
   max_mtu:2048 (4)
   active_mtu: 512 (2)
   sm_lid: 0
   port_lid:   0
   port_lmc:   0x00
   max_msg_sz: 0x8000
   port_cap_flags: 0x02510a68
   max_vl_num: 8 (4)
   bad_pkey_cntr:  0x0
   qkey_viol_cntr: 0x0
   sm_sl:  0
   pkey_tbl_len:   64
   gid_tbl_len:32
   subnet_timeout: 0
   init_type_reply:0
   active_width:   4X (2)
   active_speed:   2.5 Gbps (1)
   phys_state: POLLING (2)
   GID[  0]:   
fe80::::0003:ba00:0100:38ae




Jeff,
Can you please provide more information about you HCA type 
(ibv_devinfo -v).
Do you see this error immediate during startup, or you get it during 
your run ?


Thanks,
Pasha

Jeff Layton wrote:

Evening everyone,

I'm running a CFD code on IB and I've encountered an error I'm not 
sure about and I'm looking for some guidance on where to start 
looking. Here's the error:


mlx4: local QP operation err (QPN 260092, WQE index 9a9e, vendor 
syndrome 6f, opcode = 5e)
[0,1,6][btl_openib_component.c:1392:btl_openib_component_progress] 
from compute-2-0.local to: compute-2-0.local erro
r polling HP CQ with status LOCAL QP OPERATION ERROR

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Pavel Shamis (Pasha)




Thanks Pasha!
ibdiagnet reports the following:

-I---
-I- IPoIB Subnets Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=0x001e0b4ced75 dev=25218 can not join
due to rate:2.5Gbps < group:10Gbps

I guess this may indicate a bad adapter.  Now, I just need to find what
system this maps to.
  

I guess it is some bad cable

I also ran ibcheckerrors and it reports a lot of problems with buffer
overruns.  Here's the tail end of the output, with only some of the last
ports reported:

#warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23  (threshold 10) lid 193 port 14
#warn: counter RcvErrors = 15641(threshold 10) lid 193 port 14
#warn: counter RcvSwRelayErrors = 225   (threshold 100) lid 193 port 14
#warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14:  FAILED 
#warn: counter LinkRecovers = 181   (threshold 10) lid 193 port 1

#warn: counter RcvSwRelayErrors = 2417  (threshold 100) lid 193 port 1
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1:  FAILED 
#warn: counter LinkRecovers = 103   (threshold 10) lid 193 port 3

#warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
#warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3:  FAILED 
#warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4

#warn: counter RcvErrors = 109  (threshold 10) lid 193 port 4
#warn: counter RcvSwRelayErrors = 507   (threshold 100) lid 193 port 4
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4:  FAILED 


## Summary: 209 nodes checked, 0 bad nodes found
##  716 ports checked, 103 ports have errors beyond threshold

  
It reports a lot of symbol errors. I recommend you to reset all these 
counters (if i remember correct it is
-c flag in ibdiagnet) and rerun the testing again after the mpi process 
failure.


Thanks,
Pasha

Re: [OMPI users] mlx4 error - looking for guidance

2009-03-05 Thread Pavel Shamis (Pasha)


Jeff,
Can you please provide more information about you HCA type (ibv_devinfo -v).
Do you see this error immediate during startup, or you get it during 
your run ?


Thanks,
Pasha

Jeff Layton wrote:

Evening everyone,

I'm running a CFD code on IB and I've encountered an error I'm not 
sure about and I'm looking for some guidance on where to start 
looking. Here's the error:


mlx4: local QP operation err (QPN 260092, WQE index 9a9e, vendor 
syndrome 6f, opcode = 5e)
[0,1,6][btl_openib_component.c:1392:btl_openib_component_progress] 
from compute-2-0.local to: compute-2-0.local erro
r polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 
for wr_id 37742320 opcode 0
mpirun noticed that job rank 0 with PID 21220 on node 
compute-2-0.local exited on signal 15 (Terminated).

78 additional processes aborted (not shown)


This is openmpi-1.2.9rc2 (sorry - need to upgrade to 1.3.0). The code 
works correctly for smaller cases, but when I run larger cases I get 
this error.


I'm heading to bed but I'll check email tomorrow (so to sleep and run 
but it's been a long day).


TIA!

Jeff




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Pavel Shamis (Pasha)




Time to dig up diagnostics tools and look at port statistics.
  
You may use ibdiagnet tool for the network debug - 
*http://linux.die.net/man/1/ibdiagnet. *This tool is part of OFED.


Pasha.

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Pavel Shamis (Pasha)

Usually "retry exceeded error" points to some network issues, like bad 
cable or some bad connector. You may use ibdiagnet tool for the network 
debug - *http://linux.die.net/man/1/ibdiagnet. *This tool is part of OFED.


Pasha

Brett Pemberton wrote:

Hey,

I've had a couple of errors recently, of the form:

[[1176,1],0][btl_openib_component.c:2905:handle_wc] from 
tango092.vpac.org to: tango090 error polling LP CQ with status RETRY 
EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
-- 


The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

My first thought was to increase the retry count, but it is already at 
maximum.


I've checked connections between the two nodes, and they seem ok

[root@tango090 ~]# ibv_rc_pingpong
  local address:  LID 0x005f, QPN 0xe4045d, PSN 0xdd13f0
  remote address: LID 0x005d, QPN 0xfe0425, PSN 0xc43fe2
8192000 bytes in 0.07 seconds = 996.93 Mbit/sec
1000 iters in 0.07 seconds = 65.74 usec/iter

How can I stop this happening in the future, without increasing the 
retry count?


cheers,

/ Brett



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] BTL question

2008-12-29 Thread Pavel Shamis (Pasha)




You may specify:
--mca btl openib,sm,self

Application sometime runs fast, sometimes runs slow

When you specify the parameter above, open mpi will use only three btls
openib - for Infiniband
sm - for shared memory communication
self - for "self" communication

NO other btl will be used.

And OpenMPI will use IB and shared memory for communication.
--mca btl tcp,sm,self

Allpication always runs fast. So...
is there a way to determine (from my application code) which
BTL is really being used?

And with this parameter you use TCP btl instead of IB.

So you see better performance with tcp btl ?!

I'm not sure that you may see list of active btls in 1.2.X. But anyway 
when you explicitly specify BTLs in command line, only these btls are used.


Thanks,
Pasha


I appreciate any help I can get,
S.


And OpenMPI will use TCP and shared memory for communication.

Thanks,
Pasha

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Problem with openmpi and infiniband

2008-12-28 Thread Pavel Shamis (Pasha)



Another thing to try is a change that we made late in the Open MPI 
v1.2 series with regards to IB:



http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion 



Thanks, this is something worth investigating. What would be the exact 
syntax to use to turn off pml_ob1_use_early_completion? 
Your problem definitely maybe related to the know issue with early 
completions. The exact syntax is:|

--mca pml_ob1_use_early_completion 0|

Do you think the same problem can also happen in the 1.1(.2) release, 
which is the one I have also tested, since it comes with Ofed 1.2.5? Would
I'm not sure , but I think it is very old issue, so it is big chance 
that it exist in 1.1 as well.


it be worth to try the 1.3? So far I have avoided it since it is 
tagged as "prerelease".
The early completion issue was resolved in 1.3. You may try 1.3, i hope 
that it will work for you.


Pasha

Re: [OMPI users] Problem with openmpi and infiniband

2008-12-24 Thread Pavel Shamis (Pasha)

If the basic test run the installation is ok. So what happens when you 
try to run your application ? What is command line ? What is the error 
message ? do you run the application on the same set of machines with 
the same command line as IMB ?

Pasha




yes to both questions: the OMPI version is the one that comes with 
OFED (1.1.2-1) and the basic tests run fine. For instance, IMB-MPI1 
(which is more than basic, as far as I can see) reports for the last 
test:


#---
# Benchmarking Barrier
# #processes = 6
#---
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
 100022.9322.9522.94


for the openib,self btl (6 processes, all processes on different nodes)
and

#---
# Benchmarking Barrier
# #processes = 6
#---
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
 1000   191.30   191.42   191.34

for the tcp,self btl (same test)

No anomalies for other tests (ping-pong, all-to-all etc.)

Thanks,
Biagio

Re: [OMPI users] BTL question

2008-12-24 Thread Pavel Shamis (Pasha)


Teige, Scott W wrote:

Greetings,

I have observed strange behavior with an application running with
OpenMPI 1.2.8, OFED 1.2. The application runs in two "modes", fast
and slow. The exectution time is either within one second of 108 sec.
or within one second of 67 sec. My cluster has 1 Gig ethernet and
DDR Infiniband so the byte transport layer is a prime suspect.

So, is there a way to determine (from my application code) which
BTL is really being used?

You may specify:
--mca btl openib,sm,self
And OpenMPI will use IB and shared memory for communication.
--mca btl tcp,sm,self
And OpenMPI will use TCP and shared memory for communication.

Thanks,
Pasha

Re: [OMPI users] Problem with openmpi and infiniband

2008-12-24 Thread Pavel Shamis (Pasha)


Biagio Lucini wrote:

Hello,

I am new to this list, where I hope to find a solution for a problem 
that I have been having for quite a longtime.


I run various versions of openmpi (from 1.1.2 to 1.2.8) on a cluster 
with Infiniband interconnects that I use and administer at the same 
time. The openfabric stac is OFED-1.2.5, the compilers gcc 4.2 and 
Intel. The queue manager is SGE 6.0u8. 
Do you use OpenMPI version that is included in OFED ? Did you was able 
to run basic OFED/OMPI tests/benchmarks between two nodes ?


Pasha

Re: [OMPI users] infiniband problem

2008-11-23 Thread Pavel Shamis (Pasha)




recommend you upgrade your Open MPI installation.  v1.2.8 has
a lot of bugfixes relative to v1.2.2. Also, Open MPI 1.3 should be
available "next month"...  so watch for an announcement on that front.
  
BTW OMPI 1.2.8 also will be available as part of OFED 1.4 that will be 
released in end of the month.

Pasha.

Re: [OMPI users] OpenMPI with openib partitions

2008-10-07 Thread Pavel Shamis (Pasha)


Matt,
I guess that you have some problem with partition configuration.
Can you share with us your partition configuration file (by default 
opensm use /etc/opensm/partitions.conf) and guid from your machines ( 
ibstat | grep GUID ) ?


Regards,
Pasha

Matt Burgess wrote:

Hi,


I'm trying to get openmpi working over openib partitions. On this 
cluster, the partition number is 0x109. The ib interfaces are pingable 
over the appropriate ib0.8109 interface:


d2:/opt/openmpi-ib # ifconfig ib0.8109
ib0.8109  Link encap:UNSPEC  HWaddr 
80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00 
  inet addr:10.21.48.2 <http://10.21.48.2>  
Bcast:10.21.255.255 <http://10.21.255.255>  Mask:255.255.0.0 
<http://255.255.0.0>

  inet6 addr: fe80::202:c902:26:ca01/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
  RX packets:16811 errors:0 dropped:0 overruns:0 frame:0
  TX packets:15848 errors:0 dropped:1 overruns:0 carrier:0
  collisions:0 txqueuelen:256
  RX bytes:102229428 (97.4 Mb)  TX bytes:102324172 (97.5 Mb)


I have tried the following:

/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca 
btl openib,self -mca btl_openib_max_btls 1 -mca btl_openib_ib_pkey_val 
0x8109 -mca btl_openib_ib_pkey_ix 1 /cluster/pallas/x86_64-ib/IMB-MPI1


but I just get a RETRY EXCEEDED ERROR. Is there a MCA parameter I am 
missing?


I was successful using tcp only:

/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca 
btl tcp,self -mca btl_openib_max_btls 1 -mca btl_openib_ib_pkey_val 
0x8109 /cluster/pallas/x86_64-ib/IMB-MPI1




Thanks,
Matt Burgess


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
--
Pavel Shamis (Pasha)
Mellanox Technologies LTD.

Re: [OMPI users] Problem with btl_openib_endpoint_post_rr

2008-08-26 Thread Pavel Shamis (Pasha)


Hi,
Can you please provide more information about your setup:
- OpenMPI version
- Runtime tuning
- Platform
- IB vendor and driver version

Thanks,
Pasha

Åke Sandgren wrote:

Hi!

We have a code that (at least sometimes) gets the following error
message:
[p-bc2909][0,1,98][btl_openib_endpoint.h:201:btl_openib_endpoint_post_rr] error
posting receive errno says Numerical result out of range


Any ideas as to where i should start searching for the problem?

Re: [OMPI users] Fail to install openmpi 1.2.5 on bladecenter with OFED 1.3

2008-08-13 Thread Pavel Shamis (Pasha)

Usually OFED installs only 64 bit version of libibverbs. If you want to 
install 32bit and 64bit version  you need pass "--build32" flag to OFED 
install. So after reinstalling OFED with 32bit support, you may rebuild 
the OMPI for 32 bit support.


Regards,
Pasha


Mohd Radzi Nurul Azri wrote:

Hi,


Thanks for the prompt reply. This might be basic but typically where 
is the 32 bit ofed libs? I think the default install prefix is /usr 
and my guess is the 64 bit libs is in /usr/lib64 . Where do I look for 
the 32 bit ofed libs? I remembered during the ofed build that passing 
32 bit build argument failed - will it still install an OFED 32 bit libs?




On Wed, Aug 13, 2008 at 1:40 AM, Jeff Squyres > wrote:


You probably need to add
--with-openib-libdir=/path/to/your/32/bit/ofed/libs.  I'm guessing
that the system installed the 64 bit libs in the default location
and the 32 bit libs in a different location.  If that's the case,
then --with-openib-libdir will tell OMPI specifically where to
look for those libs and use those instead.



On Aug 12, 2008, at 1:32 PM, Mohd Radzi Nurul Azri wrote:


Hi,


I've been trying to install openmpi 1.2.5 on my cluster system
running RHEL 4 (x64) with OFED 1.3. I need openmpi 1.2.5 (32
bit) and OFED seems to only install 64 bit version. I tried to
build OFED with 32 bit support but it failed so I figure it's
best to just compile 32 bit openmpi. I followed the FAQ and
few user experience on the web.

I ran this command:
./configure --prefix=/usr/mpi/gcc/32bit --with-openib=/usr
CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32 FCFLAGS=-m32

and after few scrolling lines, it stops here:
--- MCA component btl:openib (m4 configuration macro)
checking for MCA component btl:openib compile mode... dso
looking for header without includes
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
looking for library without search path
checking for ibv_open_device in -libverbs... no
looking for library in lib
checking for ibv_open_device in -libverbs... no
looking for library in lib64
checking for ibv_open_device in -libverbs... no
checking for ibv_create_srq... no
checking for ibv_get_device_list... no
checking for ibv_resize_cq... no
configure: WARNING: OpenFabrics support requested (via
--with-openib) but not fo  und.
configure: WARNING: If you are using libibverbs v1.0 (i.e.,
OFED v1.0 or v1.1),   you *MUST* have both the libsysfs
headers and libraries installed.  Later versio  ns of
libibverbs do not require libsysfs.
configure: error: Aborting.


What went wrong? From the error it says early OFED version
which is not the one I'm using (running OFED 1.3 now).

Any advice is greatly appreciated.


-- 
Thank you.


azri
___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users



-- 
Jeff Squyres

Cisco Systems

___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Thank you.

Nurul Azri Mohd Radzi


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] How can I start building apps in Open MPI? any docs?

2008-07-27 Thread Pavel Shamis (Pasha)


Amir Saad wrote:
I'll be starting some parallel programs in Open  MPI and I would like 
to find a guide or any docs of Open MPI, any suggestions please? I 
couldn't find any docs on the website, how do I know about the APIs or 
the functions that I should use?

Here are videos about OpenMPI/MPI - http://www.open-mpi.org/video/
FAQ - http://www.open-mpi.org/faq/

Pasha.

Re: [OMPI users] OpenMPI locking up only on IB

2008-07-03 Thread Pavel Shamis (Pasha)


Brock Palen wrote:
Ok it looks like a bigger problem.  The segfault is not related to 
OMPI because when I go and rebuild 1.2 or another version we use with 
IB all the time, it will now fail with a segfault when forcing IB.  
The old libs of the same version still work.  They of-course do not 
have the flag to turn off early completion.


Was there an older version of OpenMPI that did not suffer from the 
early completion problem? 
The issue was fixed in 1.3 branch, all versions before 1.3 have this 
problem.
We have many installed and for a quick test latest and greatest would 
not be of much concern while we track down the problem on our end.


We are on RHEL4  using OFED provided by redhat.  The error is "address 
not mapped to object"
I think that best for you will be try to install Mellanox OFED 
distribution that already include pre-build versions on OpenMPI 1.2.6 
with Intel and Pgi compilers:

http://www.mellanox.com/products/ofed.php



Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Jul 3, 2008, at 8:38 AM, Jeff Squyres wrote:

On Jul 2, 2008, at 11:51 PM, Pavel Shamis (Pasha) wrote:

In trying to build 1.2.6 with the pgi compilers it makes an MPI 
library that works with tcp, sm.  But it segfaults on openib.


Both our intel compiler version and pgi version of 1.2.6 blow up 
like this when we force IB.  So this is a new issue.
I have ompi 1.2.6 installed on my machines with Intel compiler 
(version 10.1) and Pgi compiler (version 7.1-5), both of them works
with IB without any problem. BTW Mellanox provides Mellanox OFED 
binary distribution that include Intel and Pgi Open MPI 1.2.6 build.

You can download it from here http://www.mellanox.com/products/ofed.php



Is there a way to shut off early completion in 1.2.3?
Sure, just add "--mca |pml_ob1_use_early_completion 0" to your 
command line.| ||


Note that this flag was not added until v1.2.6; it has no effect in 
v1.2.3.


Or the the above a known issues and i should use 1.2.7-pre  or grab 
a 1.3 snap shot?

1.2.6 should be ok.



The upcoming v1.3 series works a little differently; there's no need 
to use this flag in the v1.3 series (i.e., this flag only exists in 
the v1.2 series starting with v1.2.6).


--
Jeff Squyres
Cisco Systems

Re: [OMPI users] OpenMPI locking up only on IB

2008-07-03 Thread Pavel Shamis (Pasha)



In trying to build 1.2.6 with the pgi compilers it makes an MPI 
library that works with tcp, sm.  But it segfaults on openib.


Both our intel compiler version and pgi version of 1.2.6 blow up like 
this when we force IB.  So this is a new issue.
I have ompi 1.2.6 installed on my machines with Intel compiler (version 
10.1) and Pgi compiler (version 7.1-5), both of them works
with IB without any problem. BTW Mellanox provides Mellanox OFED binary 
distribution that include Intel and Pgi Open MPI 1.2.6 build.

You can download it from here http://www.mellanox.com/products/ofed.php



Is there a way to shut off early completion in 1.2.3?
Sure, just add "--mca |pml_ob1_use_early_completion 0" to your command 
line.| ||
Or the the above a known issues and i should use 1.2.7-pre  or grab a 
1.3 snap shot?

1.2.6 should be ok.

Regards,
Pasha




On Jul 2, 2008, at 10:42 AM, Pavel Shamis (Pasha) wrote:
May be this FAQ will help : 
http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion 



Brock Palen wrote:
We have a code (arts)  that locks up only when running on IB.  Works 
fine on tcp and sm.


When we ran it in a debugger.  It locked up on a MPI_Comm_split()  
That as far as I could tell was valid.
Because the split was a hack they did to use MPI_File_open() on a 
single cpu,  we reworked it to remove the split.  The code then 
locks up again.


This time its locked up on an MPI_Allreduce()  Which was really 
strange.  When running on 8 cpus only rank 4 would get sucks.  The 
rest of the ranks are fine and get the right value.  (we are using 
ddt as our debugger).


Its very strange.  Do you have any idea what could cause this to 
happen?  We are using openmpi-1.2.3/1.2.6  with PGI compilers.



Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] OpenMPI locking up only on IB

2008-07-02 Thread Pavel Shamis (Pasha)

May be this FAQ will help : 
http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion


Brock Palen wrote:
We have a code (arts)  that locks up only when running on IB.  Works 
fine on tcp and sm.


When we ran it in a debugger.  It locked up on a MPI_Comm_split()  
That as far as I could tell was valid.
Because the split was a hack they did to use MPI_File_open() on a 
single cpu,  we reworked it to remove the split.  The code then locks 
up again.


This time its locked up on an MPI_Allreduce()  Which was really 
strange.  When running on 8 cpus only rank 4 would get sucks.  The 
rest of the ranks are fine and get the right value.  (we are using ddt 
as our debugger).


Its very strange.  Do you have any idea what could cause this to 
happen?  We are using openmpi-1.2.3/1.2.6  with PGI compilers.



Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Fw: Re: Open MPI timeout problems.

2008-06-19 Thread Pavel Shamis (Pasha)



I appreciate the feedback. I'm assuming that this upgrade to the Open 
Fabric
driver is something that the System Admin. of the cluster should be 
concerned with and not I ?

Driver upgrade will require root permissions.
Thanks,
Pasha



Thanks,

Peter

Peter Diamessis wrote:



--- On *Thu, 6/19/08, Pavel Shamis (Pasha) 
//* wrote:


From: Pavel Shamis (Pasha) 
Subject: Re: [OMPI users] Open MPI timeout problems.
To: pj...@cornell.edu, "Open MPI Users" 
Date: Thursday, June 19, 2008, 5:20 AM

Usually the retry exceed point to some network issue on your 
cluster. I see from the logs that you still
use MVAPI. If i remember correct,  MVAPI include IBADM 
application that should be able to check and debug the network.
BTW I recommend you to update your MVAPI driver to latest 
OpenFabric driver.


Peter Diamessis wrote:
> Dear folks,
>
> I would appreciate your help on the following:
>
> I'm running a parallel CFD code on the Army Research Lab's MJM
Linux
> cluster, which uses Open-MPI. I've run the same code on other 
Linux

> clusters that use MPICH2 and had never run into this problem.
>
> I'm quite convinced that the bottleneck for my code is this data
> transposition routine, although I have not done any rigorous 
profiling
> to check on it. This is where 90% of the parallel communication 
takes
> place. I'm running a CFD code that uses a 3-D rectangular 
domain which

> is partitioned across processors in such a way that each processor
> stores vertical slabs that are contiguous in the x-direction 
but shared
> across processors in the y-dir. . When a 2-D Fast Fourier 
Transform
> (FFT) needs to be done, data is transposed such that the 
vertical slabs

> are now contiguous in the y-dir. in each processor. >
> The code would normally be run for about 10,000 timesteps. In the
> specific case which blocks, the job crashes after ~200 
timesteps and at
> each timestep a large number of 2-D FFTs are performed. For a 
domain
> with resolution of Nx * Ny * Nz points and P processors, during 
one FFT,
> each processor performs P Sends and P Receives of a message of 
size
> (Nx*Ny*Nz)/P, i.e. there are a total of 2*P^2 such 
Sends/Receives. >
> I've focused on a case using P=32 procs with Nx=256, Ny=128, 
Nz=175.

You
> can see that each FFT involves 2048 communications. I totally 
rewrote my

> data transposition routine to no longer use specific blocking/non-
> blocking Sends/Receives but to use MPI_ALLTOALL which I would 
hope is
> optimized for the specific MPI Implementation to do data 
transpositions.
> Unfortunately, my code still crashes with time-out problems 
like before.

>
> This happens for P=4, 8, 16 & 32 processors. The same MPI_ALLTOALL
code
> worked fine on a smaller cluster here. Note that in the future 
I would
> like to work with resolutions of (Nx,Ny,Nz)=(512,256,533) and 
P=128 or
> 256 procs. which will involve an order of magnitude more 
communication.

>
> Note that I ran the job by submitting it to an LSF queue 
system. I've
> attached the script file used for that. I basically enter bsub 
-x <

> script_openmpi at the command line. >
> When I communicated with a consultant at ARL, he recommended I use
> 3 specific script files which I've attached. I believe these 
enable
> control over some of the MCA parameters. I've experimented with 
values
> of  btl_mvapi_ib_timeout = 14, 18, 20, 24 and 30 and I still 
have this
> problem. I am still in contact with this consultant but thought 
it would

> be good to contact you folks directly.
>
> Note:
> a) echo $PATH returns: >
> /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/bin:
> 
/opt/compiler/pgi/linux86-64/6.2/bin:/usr/lsf/6.2/linux2.6-glibc2.3-

> ia32e/bin:/usr/lsf/6.2/linux2.6-glibc2.3-
> ia32e/etc:/usr/cta/modules/3.1.6/bin:
> 
/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:

> .:/usr/lib/java/bin:/opt/gm/bin:/opt/mx/bin:/opt/PST/bin
>
> b) echo $LD_LIBRARY_PATH returns:
> /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/lib:
> /opt/compiler/pgi/linux86-64/6.2/lib:
> 
/opt/compiler/pgi/linux86-64/6.2/libso:/usr/lsf/6.2/linux2.6-glibc2.3-

> ia32e/lib
>
> I've attached the following files:
> 1) Gzipped versions of the .out & .err files of the failed job.
> 2) ompi_info.log: The output of ompi_info -all
> 3) mpirun, mpirun.lsf, openmpi_wrapper: the three script files 
provided
> to me by the ARL consultant. I store these in my home direc

Re: [OMPI users] Open MPI timeout problems.

2008-06-19 Thread Pavel Shamis (Pasha)

Usually the retry exceed point to some network issue on your cluster. I 
see from the logs that you still
use MVAPI. If i remember correct,  MVAPI include IBADM application that 
should be able to check and debug the network.

BTW I recommend you to update your MVAPI driver to latest OpenFabric driver.

Peter Diamessis wrote:

Dear folks,

I would appreciate your help on the following:

I'm running a parallel CFD code on the Army Research Lab's MJM Linux
cluster, which uses Open-MPI. I've run the same code on other Linux
clusters that use MPICH2 and had never run into this problem.

I'm quite convinced that the bottleneck for my code is this data
transposition routine, although I have not done any rigorous profiling
to check on it. This is where 90% of the parallel communication takes
place. I'm running a CFD code that uses a 3-D rectangular domain which
is partitioned across processors in such a way that each processor
stores vertical slabs that are contiguous in the x-direction but shared
across processors in the y-dir. . When a 2-D Fast Fourier Transform
(FFT) needs to be done, data is transposed such that the vertical slabs
are now contiguous in the y-dir. in each processor. 


The code would normally be run for about 10,000 timesteps. In the
specific case which blocks, the job crashes after ~200 timesteps and at
each timestep a large number of 2-D FFTs are performed. For a domain
with resolution of Nx * Ny * Nz points and P processors, during one FFT,
each processor performs P Sends and P Receives of a message of size
(Nx*Ny*Nz)/P, i.e. there are a total of 2*P^2 such Sends/Receives. 


I've focused on a case using P=32 procs with Nx=256, Ny=128, Nz=175. You
can see that each FFT involves 2048 communications. I totally rewrote my
data transposition routine to no longer use specific blocking/non-
blocking Sends/Receives but to use MPI_ALLTOALL which I would hope is
optimized for the specific MPI Implementation to do data transpositions.
Unfortunately, my code still crashes with time-out problems like before.

This happens for P=4, 8, 16 & 32 processors. The same MPI_ALLTOALL code
worked fine on a smaller cluster here. Note that in the future I would
like to work with resolutions of (Nx,Ny,Nz)=(512,256,533) and P=128 or
256 procs. which will involve an order of magnitude more communication.

Note that I ran the job by submitting it to an LSF queue system. I've
attached the script file used for that. I basically enter bsub -x <
script_openmpi at the command line. 


When I communicated with a consultant at ARL, he recommended I use
3 specific script files which I've attached. I believe these enable
control over some of the MCA parameters. I've experimented with values
of  btl_mvapi_ib_timeout = 14, 18, 20, 24 and 30 and I still have this
problem. I am still in contact with this consultant but thought it would
be good to contact you folks directly.

Note:
a) echo $PATH returns: 


/opt/mpi/x86_64/pgi/6.2/openmpi-1.2/bin:
/opt/compiler/pgi/linux86-64/6.2/bin:/usr/lsf/6.2/linux2.6-glibc2.3-
ia32e/bin:/usr/lsf/6.2/linux2.6-glibc2.3-
ia32e/etc:/usr/cta/modules/3.1.6/bin:
/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:
.:/usr/lib/java/bin:/opt/gm/bin:/opt/mx/bin:/opt/PST/bin

b) echo $LD_LIBRARY_PATH returns:
/opt/mpi/x86_64/pgi/6.2/openmpi-1.2/lib:
/opt/compiler/pgi/linux86-64/6.2/lib:
/opt/compiler/pgi/linux86-64/6.2/libso:/usr/lsf/6.2/linux2.6-glibc2.3-
ia32e/lib

I've attached the following files:
1) Gzipped versions of the .out & .err files of the failed job.
2) ompi_info.log: The output of ompi_info -all
3) mpirun, mpirun.lsf, openmpi_wrapper: the three script files provided
to me by the ARL consultant. I store these in my home directory and
experimented with the MCA parameter btl_mvapi_ib_timeout in mpirun.
4) The script file script_openmpi that I use to submit the job.

I am unable to provide you with the config.log file as I cannot find it
in the top level Open MPI directory.

I am also unable to provide you with details on the specific cluster
that I'm running in terms of the network. I know they use Infiniband and
some more detail may be found on:

http://www.arl.hpc.mil/Systems/mjm.html

Some other info:
a) uname -a returns: 
Linux l1 2.6.5-7.308-smp.arl-msrc #2 SMP Thu Jan 10 09:18:41 EST 2008

x86_64 x86_64 x86_64 GNU/Linux

b) ulimit -l returns: unlimited

I cannot see a pattern as to which nodes are bad and which are good ...


Note that I found in the mail archives that someone had a similar
problem in transposing a matrix with 16 million elements. The only
answer I found in the thread was to increase the value of
btl_mvapi_ib_timeout to 14 or 16, something I've done already.

I'm hoping that there must be a way out of this problem. I need to
get my code running as I'm under pressure to produce results for a
grant that's paying me.

If you have any feedback I would be hugely grateful.

Sincerely,

Peter Diamessis
Cornell University


  
--

Re: [OMPI users] OpenMPI scaling > 512 cores

2008-06-04 Thread Pavel Shamis (Pasha)


Scott Shaw wrote:

Hi, I hope this is the right forum for my questions.  I am running into
a problem when scaling >512 cores on a infiniband cluster which has
14,336 cores. I am new to openmpi and trying to figure out the right
-mca options to pass to avoid the "mca_oob_tcp_peer_complete_connect:
connection failed:" on a cluster which has infiniband HCAs and OFED
v1.3GA release.  Other MPI implementation like Intel MPI and mvapich
work fine using uDAPL or VERBs IB layers for MPI communications.
  
Did you have chance to see this FAQ - 
http://www.open-mpi.org/faq/?category=troubleshooting#large-job-tcp-oob-timeout

I find it difficult to understand which network interface or IB layer
being used. When I explicitly state not to use eth0,lo,ib1, or ib1:0
interfaces with the cmdline option "-mca oob_tcp_exclude" openmpi will
continue to probe these interfaces.  For all MPI traffic openmpi should
use IB0 which is the 10.148 network. But with debugging enabled I see
references trying the 10.149 network which is IB1.  Below is the
ifconfig network device output for a compute node.

Questions:

1. Is there away to determine which network device is being used and not
have openmpi fallback to another device? With Intel MPI or HP MPI you
can state not to use a fallback device.  I thought "-mca
oob_tcp_exclude" would be the correct option to pass but I maybe wrong. 
  

If you want to use the IB verbs , you may specify:
-mca btl sm.self,openib
sm - shmem
self - self comunication
openib - IB communication (IB verbs)


2. How can I determine infiniband openib device is actually being used?
When running a MPI app I continue to see counters for in/out packets at
a tcp level increasing when it should be using the IB RDMA device for
all MPI comms over the IB0 or mtcha0 device? OpenMPI was bundled with
OFED v1.3 so I am assuming the openib interface should work.  Running
ompi_info shows btl_open_* references. 


/usr/mpi/openmpi-1.2-2/intel/bin/mpiexec -mca
btl_openib_warn_default_gid_prefix 0 -mca oob_tcp_exclude
eth0,lo,ib1,ib1:0  -mca btl openib,sm,self -machinefile mpd.hosts.$$ -np
1024 ~/bin/test_ompi < input1
  

http://www.open-mpi.org/community/lists/users/2008/05/5583.php

Re: [OMPI users] infiniband

2008-05-01 Thread Pavel Shamis (Pasha)


Another nice tools for ib monitoring.

1. perfquery (part of  OFED), example of report:

Port counters: Lid 12 port 1
PortSelect:..1
CounterSelect:...0x
SymbolErrors:7836
LinkRecovers:255
LinkDowned:..0
RcvErrors:...24058
RcvRemotePhysErrors:.6159
RcvSwRelayErrors:0
XmtDiscards:.3176
XmtConstraintErrors:.0
RcvConstraintErrors:.0
LinkIntegrityErrors:.0
ExcBufOverrunErrors:.0
VL15Dropped:.0
XmtData:.1930
RcvData:.1708
XmtPkts:.114
RcvPkts:.114

2. collectl - http://collectl.sourceforge.net/, example of report:

#<CPU><---Memory--><--InfiniBand-->
#cpu sys inter  ctxsw free buff cach inac slab  map   KBin  pktIn  KBOut 
pktOut Errs
  1   0   847   1273   1G 264M   3G 594M   1G 234M  2 29  
2 29 123242
  2   1   851   2578   1G 264M   3G 594M   1G 234M  1  5  
1  5 123391




Pavel Shamis (Pasha) wrote:

SLIM H.A. wrote:
  

Is it possible to get information about the usage of hca ports similar
to the result of the mx_endpoint_info command for Myrinet boards?

The ibstat command gives information like this:

Port 1:
State: Active
Physical state: LinkUp

but does not say whether a job is actually using an infiniband port or
comunicates through plain ethernet. 


I would be grateful for any advice
  

You have access to some counters in 
/sys/class/infiniband/mlx4_0/ports/1/counters/  (counters for hca - 
mlx4_0 , port 1)


  



--
Pavel Shamis (Pasha)
Mellanox Technologies

Re: [OMPI users] infiniband

2008-04-29 Thread Pavel Shamis (Pasha)


SLIM H.A. wrote:

Is it possible to get information about the usage of hca ports similar
to the result of the mx_endpoint_info command for Myrinet boards?

The ibstat command gives information like this:

Port 1:
State: Active
Physical state: LinkUp

but does not say whether a job is actually using an infiniband port or
comunicates through plain ethernet. 


I would be grateful for any advice
  
You have access to some counters in 
/sys/class/infiniband/mlx4_0/ports/1/counters/  (counters for hca - 
mlx4_0 , port 1)


--
Pavel Shamis (Pasha)
Mellanox Technologies

Re: [OMPI users] multi-rail failover with IB

2008-04-03 Thread Pavel Shamis (Pasha)


Jeff Squyres wrote:

can OpenMPI also deal with one of the subnets failing?
ie. will OpenMPI automatically fall back to using the last remaining
working IB port out of a node, or even fallback to GigE if all the IB
fails?



Not in the 1.2 series.

The 1.3 series *may* include "APM" support (automatic path migration  
-- a feature in IB).  It looks positive that that'll make the 1.3 cut,  
but I don't have definite information yet.
  
Current ompi-trunk have APM implementation. If you enable APM ompi will 
use only first port on the
HCA for data transmission and second one will be reserver for back-up. 
On network failure on the first port
all connections will migrate to second port. The APM works only on the 
HCA level - I mean that you can not migrate between

different HCAs, you can migrate only between 2 ports of the same HCA.


--
Pavel Shamis (Pasha)
Mellanox Technologies

Re: [OMPI users] MPI-2 Supported on Open MPI 1.2.5?

2008-03-12 Thread Pavel Shamis (Pasha)

Jeff Pummill wrote:
Haha, yeah we found out about that one when trying to run Linpack with 
threaded BLAS implementations.
BTW Goto BLAS threaded implementation doesn't use any MPI calls. So you 
may build Linpack with Goto_thread + OMPI non-threaded and it will run

without any problem.

Pasha.

On the MPI-2 note, anyone running MATLAB and the parallel toolkit 
under Open MPI? They are unreasonably obscure about what MPI they need 
although I do believe they need MPI-2 functions.

If it will work with Open MPI, and is not a nightmare to set up, I may 
try it as some of my users would be elated. If there are excessive 
problems, I may opt for SciLab which is supposed to be an "equivalent" 
and open source.

Jeff F. Pummill
Senior Linux Cluster Administrator
University of Arkansas
Fayetteville, Arkansas 72701
(479) 575 - 4590
http://hpc.uark.edu

"In theory, there is no difference between theory and
practice. But in practice, there is!" /-- anonymous/

Brian Budge wrote:
One small (or to some, not so small) note is that full 
multi-threading with OpenMPI is very unlikely to work with infiniband 
right now.

  Brian

On Mon, Mar 10, 2008 at 6:24 AM, Michael <mailto:mk...@ieee.org>> wrote:

Quick answer, till you get a complete answer, Yes, OpenMPI has long
supported most of the MPI-2 features.

Michael

On Mar 7, 2008, at 7:44 AM, Jeff Pummill wrote:
> Just a quick question...
>
> Does Open MPI 1.2.5 support most or all of the MPI-2 directives and
> features?
>
> I have a user who specified MVAPICH2 as he needs some features like
> extra task spawning, but I am trying to standardize on Open MPI
> compiled against Infiniband for my primary software stack.
>
> Thanks!
>
> --
> Jeff F. Pummill
> Senior Linux Cluster Administrator
> University of Arkansas

___
users mailing list
us...@open-mpi.org <mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Pavel Shamis (Pasha)
Mellanox Technologies

Re: [OMPI users] Set GID

2008-03-12 Thread Pavel Shamis (Pasha)


Ok, I will do.

Jeff Squyres wrote:

Sure, that would be fine.

Can you write it up in a little more FAQ-ish style?  I can add it to 
the web page.  See this wiki item:


https://svn.open-mpi.org/trac/ompi/wiki/OMPIFAQEntries



On Mar 12, 2008, at 5:33 AM, Pavel Shamis (Pasha) wrote:


Run opensm with follow parameters:
opensm -c -o

It will make 1 loop and will exit (-o parameter). The -c option says to
opensm to generate option file.
The file will be generated under /var/cache/opensm//opensm.opts
Open the file and find line with "subnet_prefix". Replace the prefix
with new one.
Re-ran opensm: /etc/init.d/opensm start
It will automatically load the option file from cache repository and
will use new prefix

Jeff ,
Do we want to add it to Open-MPI FAQ ?

Regards,
Pasha


Jon Mason wrote:

I am getting the following error when I try to run OMPI over my dual
port IB adapter:

-- 


WARNING: There are more than one active ports on host 'vic20', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

 http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid 



NOTE: You can turn off this warning by setting the MCA parameter
 btl_openib_warn_default_gid_prefix to 0.
-- 




I understand why it is doing the above based on the description in the
link above, but I cannot seem to find anywhere that will tell me how to
change the GID to something else.

Thanks,
Jon
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





--
Pavel Shamis (Pasha)
Mellanox Technologies

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






--
Pavel Shamis (Pasha)
Mellanox Technologies

Re: [OMPI users] Set GID

2008-03-12 Thread Pavel Shamis (Pasha)


Run opensm with follow parameters:
opensm -c -o

It will make 1 loop and will exit (-o parameter). The -c option says to 
opensm to generate option file.

The file will be generated under /var/cache/opensm//opensm.opts
Open the file and find line with "subnet_prefix". Replace the prefix 
with new one.

Re-ran opensm: /etc/init.d/opensm start
It will automatically load the option file from cache repository and 
will use new prefix


Jeff ,
Do we want to add it to Open-MPI FAQ ?

Regards,
Pasha


Jon Mason wrote:

I am getting the following error when I try to run OMPI over my dual
port IB adapter:

--
WARNING: There are more than one active ports on host 'vic20', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
  btl_openib_warn_default_gid_prefix to 0.
--


I understand why it is doing the above based on the description in the
link above, but I cannot seem to find anywhere that will tell me how to
change the GID to something else.

Thanks,
Jon
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

  



--
Pavel Shamis (Pasha)
Mellanox Technologies

Re: [OMPI users] job running question

2006-04-10 Thread Pavel Shamis (Pasha)

Mpirun opens separate shell on each machine/node, so the "ulimit" will 
not be available in new sheel.  I think if you will add "ulimit -c 
unlimited" to you default shell configuration file (~/.bashrc in BASH 
case ant ~/.tcshrc in TCSH/CSH case) you will find your core files :)


Regards,
Pavel Shamis (Pasha)

Adams Samuel D Contr AFRL/HEDR wrote:

I set bash to have unlimited size core files like this:

$ ulimit -c unlimited

But, it was not dropping core files for some reason when I was running with
mpirun.  Just to make sure it would do what I expected, I wrote a little C
program that was kind of like this

int ptr = 4;
fprintf(stderr,"bad! %s\n", (char*)ptr);

That would give a segmentation fault.  It dropped a core file like you would
expect.  Am I missing something?  


Sam Adams
General Dynamics - Network Systems
Phone: 210.536.5945

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Jeff Squyres (jsquyres)
Sent: Saturday, April 08, 2006 6:25 AM
To: Open MPI Users
Subject: Re: [OMPI users] job running question

Some process is exiting on a segv -- are you getting any corefiles?

If not, can you increase your coredumpsize to unlimited?  This should
let you get a corefile; can you send the backtrace from that corefile?
 


-Original Message-
From: users-boun...@open-mpi.org 
[mailto:users-boun...@open-mpi.org] On Behalf Of Adams Samuel 
D Contr AFRL/HEDR

Sent: Friday, April 07, 2006 11:53 AM
To: 'us...@open-mpi.org'
Subject: [OMPI users] job running question

We are trying to build a new cluster running OpenMPI.  We 
were previous

running LAM-MPI.  To run jobs we would do the following:

$ lamboot lam-host-file
$ mpirun C program

I am not sure if this works more or less the same way with 
ompi.  We were

trying to run it like this:

$ [james.parker@Cent01 FORTRAN]$ mpirun --np 2 f_5x5 localhost
mpirun noticed that job rank 1 with PID 0 on node "localhost" 
exited on

signal 11.
[Cent01.brooks.afmc.ds.af.mil:16124] ERROR: A daemon on node localhost
failed to start as expected.
[Cent01.brooks.afmc.ds.af.mil:16124] ERROR: There may be more 
information

available from
[Cent01.brooks.afmc.ds.af.mil:16124] ERROR: the remote shell 
(see above).

[Cent01.brooks.afmc.ds.af.mil:16124] The daemon received a signal 11.
1 additional process aborted (not shown)
[james.parker@Cent01 FORTRAN]$

We have ompi installed to /usr/local, and these are our environment
variables:

[james.parker@Cent01 FORTRAN]$ export
declare -x COLORTERM="gnome-terminal"
declare -x 
DBUS_SESSION_BUS_ADDRESS="unix:abstract=/tmp/dbus-sfzFctmRFS"

declare -x DESKTOP_SESSION="default"
declare -x DISPLAY=":0.0"
declare -x GDMSESSION="default"
declare -x GNOME_DESKTOP_SESSION_ID="Default"
declare -x GNOME_KEYRING_SOCKET="/tmp/keyring-x8WQ1E/socket"
declare -x
GTK_RC_FILES="/etc/gtk/gtkrc:/home/BROOKS-2K/james.parker/.gtk
rc-1.2-gnome2"
declare -x G_BROKEN_FILENAMES="1"
declare -x HISTSIZE="1000"
declare -x HOME="/home/BROOKS-2K/james.parker"
declare -x HOSTNAME="Cent01"
declare -x INPUTRC="/etc/inputrc"
declare -x KDEDIR="/usr"
declare -x LANG="en_US.UTF-8"
declare -x LD_LIBRARY_PATH="/usr/local/lib:/usr/local/lib/openmpi"
declare -x LESSOPEN="|/usr/bin/lesspipe.sh %s"
declare -x LOGNAME="james.parker"
declare -x
LS_COLORS="no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=
40;33;01:cd=40
;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.ex
e=00;32:*.com=
00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;
31:*.tgz=00;31
:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z
=00;31:*.gz=00
;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31
:*.jpg=00;35:*
.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.t
if=00;35:"
declare -x MAIL="/var/spool/mail/james.parker"
declare -x 
OLDPWD="/home/BROOKS-2K/james.parker/build/SuperLU_DIST_2.0"

declare -x
PATH="/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R
6/bin:/home/BR
OOKS-2K/james.parker/bin:/usr/local/bin"
declare -x
PERL5LIB="/usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-mul
ti:/usr/lib/pe
rl5/site_perl/5.8.5"
declare -x 
PWD="/home/BROOKS-2K/james.parker/build/SuperLU_DIST_2.0/FORTRAN"

declare -x
SESSION_MANAGER="local/Cent01.brooks.afmc.ds.af.mil:/tmp/.ICE-
unix/14516"
declare -x SHELL="/bin/bash"
declare -x SHLVL="2"
declare -x SSH_AGENT_PID="14541"
declare -x SSH_ASKPASS="/usr/libexec/openssh/gnome-ssh-askpass"
declare -x SSH_AUTH_SOCK="/tmp/ssh-JUIxl14540/agent.14540"
declare -x TERM="xterm"
declare -x USER="james.parker"
declare -x WINDOWID="35651663"
decla

55 matches

Mail list logo