Re: [OMPI users] OpenMPI data transfer error

2011-07-26 Thread Ashley Pittman

On 26 Jul 2011, at 19:59, Jack Bryan wrote:
> Any help is appreciated. 

Your best option is to distill this down to a short example program which shows 
what's happening versus what you think should be happening.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] OMPI free() error

2011-03-18 Thread Ashley Pittman

On 18 Mar 2011, at 06:07, Jack Bryan wrote:

> Hi, 
> 
> I am running a C++ program with OMPI.
> I got error: 
> 
> *** glibc detected *** /nsga2b: free(): invalid next size (fast): 
> 0x01817a90 ***

This error indicates that when glibc tried to free some memory the internal 
data structures it uses were corrupt.

> In valgrind, 
> 
> there are some invalid reads and writes but no errors about this 
>  free(): invalid next size .

You need to fix the invalid write errors; the error above is almost certainly a 
symptom of these.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] This must be ssh problem, but I can't figure out what it is...

2011-02-18 Thread Ashley Pittman

On 18 Feb 2011, at 09:09, Tena Sakai wrote:
> I had created a security group "intra."  I opened ssh port from 0 to
> 65535, and launched instances (I unleashed 2 at a time in a same
> geography zone) each belonging to the group intra.  So, here, ssh
> is a security rule of a security group intra.  A field for each
> rule is "source."  I had different settings for the source field,
> but what I had been failing to do is to have this field known by
> the name of the group, namely intra.  By doing so, each instance
> that belongs to this group can get to each other.

I'm glad you got to the bottom of the problem.  I've never fully understood the 
EC2 "Security Groups", but I found that the default group was adequate and I 
didn't need to create my own.  Now that I look at it more closely it appears to 
open all incoming ports to the local instances and incoming port 22 to the 
world, which would agree with what I've seen.

> Many thanks for your guidance all along.  In a week or two, I look
> forward to put together a mini "how-to openMPI on cloud".

If you do this I would appreciate the chance to proof-read it before you go 
public; I have many thousands of hours of EC2 time to my name and have spent 
much of it configuring and testing MPI libraries within EC2 to allow me to test 
my debugger, which sits on top of them.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] How are IP addresses determined?

2011-02-17 Thread Ashley Pittman

On 17 Feb 2011, at 04:56, Barnet Wagman wrote:

> I've run into a problem involving accessing a remote host via a router and I 
> think I need to understand how ompi determines ip addresses.  If there's 
> anything posted on this subject, please point me to it.
> 
> Here's the problem:
> 
> I've installed ompi (1.4.3) on a remote system (an Amazon ec2 instance).  If 
> the local system I'm working on has a static ip address (and a direct 
> connection to the internet), there's no problem.  But if the local system 
> accesses the internet through a router (which itself gets its ip via dhcp), 
> a call to the mpirun command hangs.

I would strongly recommend that all machines involved in an Open MPI job are at 
the same geographical location.  This includes all nodes doing computation but 
also the "submission host".  For EC2 this would mean all in the same region.

As you correctly noticed, not all of your hosts are on the same network, which 
means that they won't all be able to contact each other over the network; 
without this Open MPI is not going to be able to work.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Ashley Pittman

On 14 Feb 2011, at 21:10, Tena Sakai wrote:
> Regarding firewall, they are different:

> 
> I don't understand what they mean.

vixen has a normal, or empty, config and as such has no firewall; dasher has a 
number of firewall rules configured, which could easily be the cause of the 
problem between these two machines.  To be able to run OpenMPI across these two 
machines you'll need to disable the firewall on dasher.

To disable the firewall the command (as root) is "service iptables stop" to turn 
it off until the next boot, or "chkconfig iptables off" to turn it off permanently 
from the next boot; obviously you should check with your network administrator 
before doing this.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Ashley Pittman

"sudo" and "su" are two similar commands for doing nearly identical things, you 
should be running one or the other but there is no need to run both.  "sudo -s" 
is probably the command you should have used.  It's a very common mistake.

sudo is a command that allows you to run commands as another user, either 
using your own password or none at all; su is a command that allows you to run 
commands as another user using their password.  What "sudo su" does is run a 
command as root which then runs a shell as root; "sudo -s" is a much better way 
of achieving the same effect.

Ashley.

On 13 Feb 2011, at 22:16, Tena Sakai wrote:

> Thank you, Ashley, for your comments.
> 
> I do have a question.
> I was using 'sudo su' to document the problem I am running
> into for people who read this mailing list, as well as for
> my own record.  Why would you say I shouldn't be doing so?
> 
> Regards,
> 
> Tena
> 
> 
> On 2/13/11 1:29 PM, "Ashley Pittman"  wrote:
> 
>> On 12 Feb 2011, at 14:06, Ralph Castain wrote:
>> 
>>> Have you searched the email archive and/or web for openmpi and Amazon cloud?
>>> Others have previously worked through many of these problems for that
>>> environment - might be worth a look to see if someone already solved this, 
>>> or
>>> at least a contact point for someone who is already running in that
>>> environment.
>> 
>> I've run Open MPI on Amazon ec2 for over a year and never experienced any
>> problems like the original poster describes.
>> 
>>> IIRC, there are some unique problems with running on that platform.
>> 
>> 
>> None that I'm aware of.
>> 
>> EC2 really is no different from any other environment I've used, either real
>> or virtual, a simple download, ./configure, make and make install has always
>> resulted in a working OpenMPI assuming a shared install location and home
>> directory (for launching applications from).
>> 
>> When I'm using EC2 I tend to re-name machines into something that is easier 
>> to
>> follow, typically "cloud[0-15].ec2" assuming I am running 16 machines, I
>> change the hostname of each host and then write a /etc/hosts file to convert
>> from hostname to internal IP address.  I then export /home from cloud0.ec2 to
>> all the other nodes and configure OpenMPI with --prefix=/home/ashley/install
>> so that the code is installed everywhere.
>> 
>> For EC2 Instances I commonly use Fedora but have also used Ubuntu and 
>> Solaris,
>> all have been fundamentally similar.
>> 
>> My other tip for using EC2 would be to use a persistent "home" folder by
>> renting a disk partition and attaching it to the first instance you boot in a
>> session.  You pay for this by Gb/Month, I was able to use a 5Gb device which 
>> I
>> mounted at /home in cloud0.ec2 and NFS exported to the other instances, again
>> at /home.  You'll need to add "ForwardAgent yes" to your personal .ssh/config
>> to allow you to hop around inside the virtual cluster without entering a
>> password.  The persistent devices are called "Volumes" in EC2 speak, there is
>> no need to create snapshots unless you want to share your volume with other
>> people.
>> 
>> Ashley.
>> 
>> Ps, I would recommend reading up on sudo and su, "sudo su" is not a command
>> you should be typing.
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-13 Thread Ashley Pittman
On 12 Feb 2011, at 14:06, Ralph Castain wrote:

> Have you searched the email archive and/or web for openmpi and Amazon cloud? 
> Others have previously worked through many of these problems for that 
> environment - might be worth a look to see if someone already solved this, or 
> at least a contact point for someone who is already running in that 
> environment.

I've run Open MPI on Amazon ec2 for over a year and never experienced any 
problems like the original poster describes.

> IIRC, there are some unique problems with running on that platform.


None that I'm aware of.

EC2 really is no different from any other environment I've used, either real or 
virtual: a simple download, ./configure, make and make install has always 
resulted in a working OpenMPI, assuming a shared install location and home 
directory (for launching applications from).

When I'm using EC2 I tend to rename machines into something that is easier to 
follow, typically "cloud[0-15].ec2" assuming I am running 16 machines; I change 
the hostname of each host and then write a /etc/hosts file to convert from 
hostname to internal IP address.  I then export /home from cloud0.ec2 to all 
the other nodes and configure OpenMPI with --prefix=/home/ashley/install so 
that the code is installed everywhere.

For EC2 Instances I commonly use Fedora but have also used Ubuntu and Solaris, 
all have been fundamentally similar.

My other tip for using EC2 would be to use a persistent "home" folder by 
renting a disk partition and attaching it to the first instance you boot in a 
session.  You pay for this by the GB per month; I was able to use a 5 GB device 
which I mounted at /home on cloud0.ec2 and NFS-exported to the other instances, 
again at /home.  You'll need to add "ForwardAgent yes" to your personal 
.ssh/config to allow you to hop around inside the virtual cluster without 
entering a password.  The persistent devices are called "Volumes" in EC2 speak; 
there is no need to create snapshots unless you want to share your volume with 
other people.

Ashley.

PS: I would recommend reading up on sudo and su; "sudo su" is not a command you 
should be typing.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Issue Resolved: Re: bash: orted: ...

2011-01-26 Thread Ashley Pittman

On 26 Jan 2011, at 13:30, Kedar Soparkar wrote:

> Thanks for your assistance, Jeff. The issue has been resolved.
> 
> I discovered that on non-interactive ssh login, only .bash_rc shell
> startup file gets executed.

If I understand the issue correctly another option would have been using the 
--prefix option to mpirun or configuring OpenMPI with the 
--enable-mpirun-prefix-by-default option.

See:
http://www.open-mpi.org/faq/?category=running#run-prereqs
http://www.open-mpi.org/faq/?category=running#mpirun-prefix

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] mixing send and bcast

2011-01-08 Thread Ashley Pittman

On 8 Jan 2011, at 12:05, Hicham Mouline wrote:

> Hi
>  
> Will MPI_Probe return that there is a message pending reception if the sender 
> MPI_Bcast a message?

No.

> Is the only way to receive a broadcast from the root is to call MPI_BCast in 
> the slave?

Yes.

Broadcast and the other collective operations are just that, "collective" and 
have to be called from all ranks in a communicator with the same parameters and 
in the same order.
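
As a minimal sketch of the correct pattern (assuming rank 0 is the root), every 
rank makes the same MPI_Bcast call and the root's buffer ends up everywhere; 
there is nothing for MPI_Probe to see on the receiving side:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            value = 42;          /* only the root has the data to send */

        /* Every rank, root included, calls MPI_Bcast with the same
           arguments; there is no separate "receive" call to probe for. */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d has value %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }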

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Method for worker to determine its "rank" on a single machine?

2010-12-10 Thread Ashley Pittman

For a much simpler approach you could also use these two environment variables; 
this is on my current system, which is 1.5 based, YMMV of course.

OMPI_COMM_WORLD_LOCAL_RANK
OMPI_COMM_WORLD_LOCAL_SIZE

Actually orte seems to set both OMPI_COMM_WORLD_LOCAL_RANK and 
OMPI_COMM_WORLD_NODE_RANK, I can't see any difference between the two.
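
A small sketch of reading them from C, assuming your launcher really does set 
them (check the environment of a launched process first):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Set by the orte launcher; fall back to -1 if missing, for
           example when running under a different launcher.            */
        const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
        const char *lsize = getenv("OMPI_COMM_WORLD_LOCAL_SIZE");
        int local_rank = lrank ? atoi(lrank) : -1;
        int local_size = lsize ? atoi(lsize) : -1;

        printf("local rank %d of %d on this node\n", local_rank, local_size);
        MPI_Finalize();
        return 0;
    }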

Ashley.

On 10 Dec 2010, at 18:25, Ralph Castain wrote:
> 
> So if you wanted to get your own local rank, you would call:
> 
> my_local_rank = orte_ess.proc_get_local_rank(ORTE_PROC_MY_NAME);

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Question about collective messages implementation

2010-11-02 Thread Ashley Pittman

On 2 Nov 2010, at 10:21, Jerome Reybert wrote:
>  - in my implementation, is MPI_Bcast aware that it should use shared memory
> memory communication? Is data go through the network? It seems it is the case,
> considering the first results.
>  - is there any other methods to group task by machine, OpenMPI being aware
> that it is grouping task by shared memory?
>  - is it possible to assign a policy (in this case, a shared memory policy) to
> a Bcast or a Barrier call?
>  - do you have any better idea for this problem? :)

Interesting stuff, two points quickly spring to mind from the above:

MPI_Comm_split() is an expensive operation; sure, the manual says it's low cost, 
but it shouldn't be used inside any critical loops, so be sure you are doing the 
MPI_Comm_split() at startup and then re-using the resulting communicator as and 
when needed.
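
As a sketch of what I mean (the "8 ranks per node" colour is just a placeholder 
assumption, you would compute a real per-machine colour yourself), do the split 
once and keep the communicator around for the whole run:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, colour, iter;
        double buffer[1024] = {0};
        MPI_Comm node_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* colour must be identical for all ranks sharing a machine. */
        colour = rank / 8;               /* placeholder assumption   */

        /* Do the (expensive) split exactly once, at startup.        */
        MPI_Comm_split(MPI_COMM_WORLD, colour, rank, &node_comm);

        for (iter = 0; iter < 100; iter++) {
            /* Re-use the communicator inside the loop rather than
               creating a new one every iteration.                   */
            MPI_Bcast(buffer, 1024, MPI_DOUBLE, 0, node_comm);
        }

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }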

Any blocking call into OpenMPI will poll, consuming CPU cycles until the call is 
complete; you can mitigate this by telling OpenMPI to aggressively call yield 
whilst polling, which would mean that your parallel LAPACK function could get 
the CPU resources it requires.  Have a look at this FAQ entry for details 
of the option and what you can expect it to do.

http://www.open-mpi.org/faq/?category=running#force-aggressive-degraded

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] open MPI please recommend a debugger for open MPI

2010-10-29 Thread Ashley Pittman

It's not any use without a list of hostnames, no; if you can get that, then 
I have something to work with.  From looking around on Google, -n might help 
here.  Once I have this info you'll need to verify that you are able to ssh to 
these nodes without a password and that pdsh is installed, and give me the name 
of an environment variable that PBS sets for ranks within a job.

I'm sure we can get something working but it might be better to take this 
off-list or to the padb-users list to avoid spamming the Open-MPI users list.

Ashley.

On 29 Oct 2010, at 18:44, Jack Bryan wrote:

> Hi, 
> 
> this is what I got :
> 
> -bash-3.2$ qstat -n -u myName
> 
> clsuter:
>  
> Req'd  Req'd   Elap
> Job ID   Username QueueJobname  SessID NDS   TSK 
> Memory Time  S Time
>     -- - --- 
> -- - - -
> 48933.cluster.e myName   develmyJob  107835 1  ----  
> 00:02 C 00:00
>n20/0
> 
> Any help is appreciated. 
> 
> thanks
> 
> > From: ash...@pittman.co.uk
> > Date: Fri, 29 Oct 2010 18:38:25 +0100
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] open MPI please recommend a debugger for open MPI
> > 
> > 
> > Can you try the following and send me the output.
> > 
> > qstat -n -u `whoami` @clusterName
> > 
> > The output sent before implies that your cluster is called "clusterName" 
> > rather than "cluster" which is a little surprising but let's see what it 
> > gives us if we query on that basis.
> > 
> > Ashley.
> > 
> > On 29 Oct 2010, at 18:29, Jack Bryan wrote:
> > 
> > > thanks
> > > 
> > > I have run padb (the new one with your patch) on my system and got :
> > > 
> > > -bash-3.2$ padb -Ormgr=pbs -Q 48516.cluster
> > > $VAR1 = {};
> > > Job 48516.cluster is not active
> > > 
> > > Actually, the job is running. 
> > > 
> > > How to check whether my system has pbs_pro ?
> > > 
> > > Any help is appreciated. 
> > > 
> > > thanks
> > > Jinxu Ding
> > > 
> > > Oct. 29 2010
> > > 
> > > 
> > > > From: ash...@pittman.co.uk
> > > > Date: Fri, 29 Oct 2010 18:21:46 +0100
> > > > To: us...@open-mpi.org
> > > > Subject: Re: [OMPI users] open MPI please recommend a debugger for open 
> > > > MPI
> > > > 
> > > > 
> > > > On 29 Oct 2010, at 12:06, Jeremy Roberts wrote:
> > > > 
> > > > > I'd suggest looking into TotalView (http://www.totalviewtech.com) 
> > > > > and/or DDT (http://www.allinea.com/). I've used TotalView pretty 
> > > > > extensively and found it to be pretty easy to use. They are both 
> > > > > commercial, however, and not cheap. 
> > > > > 
> > > > > As far as I know, there isn't a whole lot of open source support for 
> > > > > parallel debugging. The Parallel Tools Platform of Eclipse claims to 
> > > > > provide a parallel debugger, though I have yet to try it 
> > > > > (http://www.eclipse.org/ptp/).
> > > > 
> > > > Jeremy has covered the graphical parallel debuggers that I'm aware of, 
> > > > for a different approach there is padb which isn't a "parallel 
> > > > debugger" in the traditional model but is able to show you the same 
> > > > type of information, it won't allow you to point-and-click through the 
> > > > source or single step through the code but it is lightweight and will 
> > > > show you the information which you need to know. 
> > > > 
> > > > Padb needs to integrate with the resource manager, I know it works with 
> > > > pbs_pro but it seems there are a few issues on your system which is pbs 
> > > > (without the pro). I can help you with this and work through the 
> > > > problems but only if you work with me and provide details of the 
> > > > integration, in particular I've sent you a version which has a small 
> > > > patch and some debug printfs added, if you could send me the output 
> > > > from this I'd be able to tell you if it was likely to work and how to 
> > > > go about making it do so.
> > > > 
> > > > Ashley.
> > > > 
> > > > -- 
> > > > 
>

Re: [OMPI users] open MPI please recommend a debugger for open MPI

2010-10-29 Thread Ashley Pittman

Can you try the following and send me the output.

qstat -n -u `whoami` @clusterName

The output sent before implies that your cluster is called "clusterName" rather 
than "cluster" which is a little surprising but let's see what it gives us if 
we query on that basis.

Ashley.

On 29 Oct 2010, at 18:29, Jack Bryan wrote:

> thanks
> 
> I have run padb (the new one with your patch) on my system and got :
> 
> -bash-3.2$ padb -Ormgr=pbs -Q 48516.cluster
> $VAR1 = {};
> Job 48516.cluster  is not active
> 
> Actually, the job is running. 
> 
> How to check whether my system has pbs_pro ?
> 
> Any help is appreciated. 
> 
> thanks
> Jinxu Ding
> 
> Oct. 29 2010
> 
> 
> > From: ash...@pittman.co.uk
> > Date: Fri, 29 Oct 2010 18:21:46 +0100
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] open MPI please recommend a debugger for open MPI
> > 
> > 
> > On 29 Oct 2010, at 12:06, Jeremy Roberts wrote:
> > 
> > > I'd suggest looking into TotalView (http://www.totalviewtech.com) and/or 
> > > DDT (http://www.allinea.com/). I've used TotalView pretty extensively and 
> > > found it to be pretty easy to use. They are both commercial, however, and 
> > > not cheap. 
> > > 
> > > As far as I know, there isn't a whole lot of open source support for 
> > > parallel debugging. The Parallel Tools Platform of Eclipse claims to 
> > > provide a parallel debugger, though I have yet to try it 
> > > (http://www.eclipse.org/ptp/).
> > 
> > Jeremy has covered the graphical parallel debuggers that I'm aware of, for 
> > a different approach there is padb which isn't a "parallel debugger" in the 
> > traditional model but is able to show you the same type of information, it 
> > won't allow you to point-and-click through the source or single step 
> > through the code but it is lightweight and will show you the information 
> > which you need to know. 
> > 
> > Padb needs to integrate with the resource manager, I know it works with 
> > pbs_pro but it seems there are a few issues on your system which is pbs 
> > (without the pro). I can help you with this and work through the problems 
> > but only if you work with me and provide details of the integration, in 
> > particular I've sent you a version which has a small patch and some debug 
> > printfs added, if you could send me the output from this I'd be able to 
> > tell you if it was likely to work and how to go about making it do so.
> > 
> > Ashley.
> > 
> > -- 
> > 
> > Ashley Pittman, Bath, UK.
> > 
> > Padb - A parallel job inspection tool for cluster computing
> > http://padb.pittman.org.uk
> > 
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] open MPI please recommend a debugger for open MPI

2010-10-29 Thread Ashley Pittman

On 29 Oct 2010, at 12:06, Jeremy Roberts wrote:

> I'd suggest looking into TotalView (http://www.totalviewtech.com) and/or DDT 
> (http://www.allinea.com/).  I've used TotalView pretty extensively and found 
> it to be pretty easy to use.  They are both commercial, however, and not 
> cheap.  
> 
> As far as I know, there isn't a whole lot of open source support for parallel 
> debugging. The Parallel Tools Platform of Eclipse claims to provide a 
> parallel debugger, though I have yet to try it (http://www.eclipse.org/ptp/).

Jeremy has covered the graphical parallel debuggers that I'm aware of; for a 
different approach there is padb, which isn't a "parallel debugger" in the 
traditional model but is able to show you the same type of information.  It 
won't allow you to point-and-click through the source or single-step through 
the code, but it is lightweight and will show you the information which you need 
to know.

Padb needs to integrate with the resource manager; I know it works with pbs_pro 
but it seems there are a few issues on your system, which is pbs (without the 
pro).  I can help you with this and work through the problems, but only if you 
work with me and provide details of the integration.  In particular, I've sent 
you a version which has a small patch and some debug printfs added; if you 
could send me the output from this I'd be able to tell you if it was likely to 
work and how to go about making it do so.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Open MPI program cannot complete

2010-10-25 Thread Ashley Pittman

On 25 Oct 2010, at 20:18, Jack Bryan wrote:

> Thanks
> I have downloaded 
> http://padb.googlecode.com/files/padb-3.0.tgz
> 
> and compile it.
> 
> But, no user manual, I can not use it by padb -aQ.

The -a flag is a shortcut for all jobs; if you are providing a jobid (which is 
normally numeric) then don't set the -a flag.

> Do you have use manual about how to use it ? 

In my previous mail I was assuming you were using orte to launch the jobs, but 
if you are using PBS then you'll need to use the 3.2 beta as the PBS code is 
new.  Alternatively, you could find the host where the PBS script itself runs and 
check if the "ompi-ps" command gives you any output; if it does then you could 
run it from there, giving it the orte jobid.

A bit of background about resource managers (in which I'm including orte and 
PBS): padb supports many resource managers and tries to automatically detect 
which ones you have installed on your system.  If you don't specify one then 
it'll see what is installed; if there is more than one resource manager 
installed then it'll see which of them claim to have active jobs - if only one 
resource manager meets this criterion then it'll pick that one - hence 99% of 
the time it should just work.  If more than one resource manager claims to have 
active jobs then padb will refuse to run and ask the user to specify one 
explicitly.

You should try the following in order once you have 3.2 installed.

padb -Ormgr=pbs -Q 

Or - find the node where the PBS script is being executed, check that the 
ompi-ps command is returning the jobid and then run

padb -Ormgr=orte -Q 

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Open MPI program cannot complete

2010-10-25 Thread Ashley Pittman

On 25 Oct 2010, at 17:26, Jack Bryan wrote:

> Thanks, the problem is still there. 
> 
> I used: 
> 
> Only process 0 returns. Other processes are still struck in
> MPI_Finalize(). 
> 
> Any help is appreciated. 

You can use the command "padb -aQ" to show you the message queues for your 
application, you'll need to download and install padb then simply run your job, 
allow it to hang and they run padb - it'll show you the message queues for each 
rank that it can find processes for (the ones that haven't exited).  If this 
isn't any help run "padb -axt" for the stack traces and send the output to this 
list.

The web-site is in my signature or there is a new beta release out this week at 
http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Running simple MPI program

2010-10-23 Thread Ashley Pittman

On 23 Oct 2010, at 17:58, Brandon Fulcher wrote:
> So I checked the OMPI package details on both machines, they each are running 
> Open MPI 1.3. . . but then I noticed that the packages are different 
> versions.   Basically, the slave is running the previous Ubuntu release, and 
> the master is running the current one. Both have the most recent packages for 
> their release. . .but perhaps that is enough of a difference? 

You need to have exactly the same version of Open MPI installed on both 
machines.  Typically in a cluster all machines are identical in terms of 
software; if this isn't the case for your systems then the easiest way might be 
to compile Open MPI from source (on the older of the two machines would be 
best) and to install it to a common directory on both machines.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Ashley Pittman

This smacks of a firewall issue; I thought you'd said you weren't using one but 
now I read back your emails I can't see anywhere where you say that.  Are you 
running a firewall or any iptables rules on any of the nodes?  It looks to me 
like you may have some rules set up on the worker nodes.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Thread as MPI process

2010-09-21 Thread Ashley Pittman

On 21 Sep 2010, at 09:54, Mikael Lavoie wrote:

> Hi,
> 
> Sorry, but i get lost in what i wanna do, i have build a small home cluster 
> with Pelican_HPC, that user openMPI, and i was trying to find a way to get a 
> multithreaded program work in a multiprocess way without taking the time to 
> learn MPI. And my vision was a sort of wrapper that take C posix app src 
> code, and convert it from pthread to a multiprocessMPI app. But the problem 
> is the remote memory access, that will only be implemented in MPI 3.0(for 
> what i've read of it).
> 
> So, after 12 hour of intensive reading about MPI and POSIX, the best way to 
> deal with my problem(running a C pthreaded app in my cluster) is to convert 
> the src in a SPMD way.
> I didn't mentionned that basicly, my prog open huge text file, take each 
> string and process it through lot's of cryptographic iteration and then save 
> the result in an output.out like file.
> So i will need to make the master process split the input file and then send 
> them as input for the worker process.
> 
> But if you or someone else know a kind of interpretor like program to run a 
> multithreaded C program and convert it logically to a master/worker 
> multiprocess MPI that will be sended by ssh to the interpreter on the worker 
> side and then lunched.
> 
> This is what i've tried to explain in the last msg. A dream for the hobyist 
> that want to get the full power of a night-time cluster, without having to 
> learn all the MPI syntax and structure.
> 
> If it doesn't exist, this would be a really great tool i think.
> 
> Thank you for your reply, but i think i have answered my question alone... No 
> Pain, No Gain...

What you are thinking of is I believe something more like ScaleMP or Mosix, 
neither of which I have first-hand experience of.  It's a hard problem to solve 
and I don't believe there is any general solution available.

It sounds like your application would be a fairly easy conversion to MPI, but to 
do that you will need to re-code areas of your application; it almost sounds 
like you could get away with just using MPI_Init, MPI_Scatter and MPI_Gather.  
Typically you would use the head-node to launch the job but not do any 
computation; rank 0 in the job would then do the marshalling of data and all 
ranks would be started simultaneously.  You'll find this easier than having one 
single-rank job spawn more ranks as required.
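
A very rough sketch of that shape, assuming every rank gets an equal, fixed-size 
chunk of the input (real code would read the file, handle uneven sizes and do 
the actual cryptographic work where the memcpy placeholder is):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK 1024            /* bytes of input per rank (assumed) */

    int main(int argc, char **argv)
    {
        int rank, size;
        char *all_input = NULL, *all_results = NULL;
        char chunk[CHUNK], result[CHUNK];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            all_input   = calloc((size_t)size, CHUNK);  /* read file here */
            all_results = calloc((size_t)size, CHUNK);
        }

        /* Rank 0 hands one chunk to every rank, including itself.   */
        MPI_Scatter(all_input, CHUNK, MPI_CHAR,
                    chunk,     CHUNK, MPI_CHAR, 0, MPI_COMM_WORLD);

        memcpy(result, chunk, CHUNK);   /* placeholder for real work */

        /* Rank 0 collects the per-rank results back in rank order.  */
        MPI_Gather(result,      CHUNK, MPI_CHAR,
                   all_results, CHUNK, MPI_CHAR, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            /* write all_results to the output file here */
            free(all_input);
            free(all_results);
        }
        MPI_Finalize();
        return 0;
    }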

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Thread as MPI process

2010-09-21 Thread Ashley Pittman

On 20 Sep 2010, at 22:24, Mikael Lavoie wrote:
> I wanna know if it exist a implementation that permit to run a single host 
> process on the master of the cluster, that will then spawn 1 process per -np 
> X defined thread at the host specified in the host list. The host will then 
> act as a syncronized sender/collecter of the work done.

I don't fully understand your explanation either, but I may be able to help clear 
up what you are asking for:

If you mean "pthreads" or "linux threads" then no, you cannot have different 
threads on different nodes under any programming paradigm.

However if you mean "execution threads" or in MPI parlance "ranks" then yes, 
under OpenMPI each "rank" will be a separate process on one of the nodes in the 
host list, as Jody says look at MPI_Comm_Spawn for this.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] latency #2

2010-09-13 Thread Ashley Pittman

On 13 Sep 2010, at 12:20, Georges Markomanolis wrote:

> Dear all,
> 
> Hi again, after using MPI_Ssend seems to be what I was looking for but I 
> would like to know more about MPI_Send.
> 
> For example sending 1 byte with MPI_Send it takes 8.69 microsec but with 
> MPI_Ssend it takes 152.9 microsec. I understand the difference but it seems 
> that from one message's size and after their difference is not so big like 
> trying for 518400 bytes where it needs 3515.78 microsec with MPI_Send and 
> 3584.1 microsec with MPI_Ssend.

It sounds like you are measuring send overhead rather than latency; in fact, as 
far as I know it's impossible to measure the send latency on its own, as you have 
no way of knowing when to 'stop the clock', which is why ping-pong latency is 
always quoted.  I suspect the underlying latency of the two sends is very 
similar in practice.

> So is there any rule to figure out (of course it depends on the hardware) 
> the threshold after which the difference between the timings of 
> MPI_Send and MPI_Ssend is not so big, or at least how to find it for my 
> hardware?

Yes there is, but I'm not familiar enough with OMPI to be able to tell you; I'm 
sure somebody can though.  If my suspicion above is correct, I doubt that 
knowing what this value is would help you much in terms of application 
performance.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Ashley Pittman

On 9 Sep 2010, at 21:40, Richard Treumann wrote:

> 
> Ashley 
> 
> Can you provide an example of a situation in which these semantically 
> redundant barriers help? 

I'm not making the case for semantically redundant barriers, I'm making a case 
for implicit synchronisation in every iteration of an application.  Many 
applications have this already by nature of the data-flow required; anything 
that calls mpi_allgather or mpi_allreduce is the easiest to verify, but there 
are many other ways of achieving the same thing.  My point is about the subset 
of programs which don't have this attribute and are therefore susceptible to 
synchronisation problems.  It's my experience that for low iteration counts 
these codes can run fine, but once they hit a problem they go over a cliff edge 
performance-wise and there is no way back from there until the end of the job.  
The email from Gabriele would appear to be a case that demonstrates this 
problem but I've seen it many times before.

Using your previous email as an example, I would describe adding barriers to a 
problem as a way of artificially reducing the "elasticity" of the program to 
ensure balanced use of resources.

> I may be missing something but my statement for the text book would be 
> 
> "If adding a barrier to your MPI program makes it run faster, there is almost 
> certainly a flaw in it that is better solved another way." 
> 
> The only exception I can think of is some sort of one direction data 
> dependancy with messages small enough to go eagerly.  A program that calls 
> MPI_Reduce with a small message and the same root every iteration and  calls 
> no other collective would be an example. 
> 
> In that case, fast tasks at leaf positions would run free and a slow task 
> near the root could pile up early arrivals and end up with some additional 
> slowing. Unless it was driven into paging I cannot imagine the slowdown would 
> be significant though. 

I've diagnosed problems where the cause was a receive queue of tens of 
thousands of messages; in this case each and every receive performs slowly 
unless the descriptor is near the front of the queue, so the concern is not 
purely about memory usage at individual processes, although that can also be a 
factor.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Ashley Pittman

On 9 Sep 2010, at 21:10, jody wrote:

> Hi
> @Ashley:
> What is the exact semantics of an asynchronous barrier,

I'm not sure of the exact semantics, but once you've got your head around the 
concept it's fairly simple to understand how to use it: you call MPI_IBarrier() 
and it gives you a handle you can test with MPI_Test() or block on with 
MPI_Wait().  The tricky part comes in how many times you can call 
MPI_IBarrier() on a communicator without waiting for the previous barriers to 
complete, but I haven't been following the discussion on this one to know the 
specifics.

> and is it part of the MPI specs?

It will be a part of the next revision of the standard I believe.  It's been a 
long time coming and there is at least one implementation out there already, but 
I can't comment on its usability today.  To be clear, it's something I've long 
advocated and have implemented and played around with in the past; however it's 
not yet available to users today, but I believe it will be shortly, and as you'll 
have read my belief is it's going to be a very useful addition to the MPI 
offering.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Ashley Pittman

On 9 Sep 2010, at 17:00, Gus Correa wrote:

> Hello All
> 
> Gabrielle's question, Ashley's recipe, and Dick Treutmann's cautionary words, 
> may be part of a larger context of load balance, or not?
> 
> Would Ashley's recipe of sporadic barriers be a silver bullet to
> improve load imbalance problems, regardless of which collectives or
> even point-to-point calls are in use?

No, it only holds where there is no data dependency between some of the ranks.  
In particular, if there are any non-rooted collectives in an iteration of your 
code then it cannot make any difference at all; likewise if you have a reduce 
followed by a barrier using the same root, for example, then you already have 
global synchronisation each iteration and it won't help.  My feeling is that it 
applies to a significant minority of problems; certainly the phrase "adding 
barriers can make codes faster" should be textbook stuff if it isn't already.

> Would sporadic barriers in the flux coupler "shake up" these delays?

I don't fully understand your description but it sounds like it might set the 
program back to a clean slate, which would give you per-iteration delays only 
rather than cumulative or worse delays.

> Ashley:  How did you get to the magic number of 25 iterations for the
> sporadic barriers?

Experience and finger in the air.  The major factors in picking this number are 
the likelihood of a positive feedback cycle of delays happening, the delay 
these delays add and the cost of a barrier itself.  Having too low a value will 
slightly reduce performance; having too high a value can drastically reduce 
performance.

As a further item (because I like them), the asynchronous barrier is even better 
again if used properly: in the good case it doesn't ever cause any process to 
block, so the cost is only that of the CPU cycles the code takes itself; in the 
bad case where it has to delay a rank, this tends to have a positive impact 
on performance.

> Would it be application/communicator pattern dependent?

Absolutely.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Ashley Pittman

On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:

> On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
>> As people have said, these time values are to be expected. All they
>> reflect is the time difference spent in reduce waiting for the slowest
>> process to catch up to everyone else. The barrier removes that factor
>> by forcing all processes to start from the same place.
>> 
>> 
>> No mystery here - just a reflection of the fact that your processes
>> arrive at the MPI_Reduce calls at different times.
> 
> 
> Yes, however, it seems Gabriele is saying the total execution time
> *drops* by ~500 s when the barrier is put *in*.  (Is that the right way
> around, Gabriele?)
> 
> That's harder to explain as a sync issue.

Not really, you need some way of keeping processes in sync or else the slow 
ones get slower and the fast ones stay fast.  If you have an un-balanced 
algorithm then you can end up swamping certain ranks, and when they get behind 
they get even slower and performance goes off a cliff edge.

Adding sporadic barriers keeps everything in sync and running nicely; if things 
are performing well then the barrier only slows things down, but if there is a 
problem it'll bring all processes back together and destroy the positive feedback 
cycle.  This is why you often only need a synchronisation point every so often.  
I'm also a huge fan of asynchronous barriers, as a full sync is a blunt and slow 
operation; using asynchronous barriers you can allow small differences in timing 
but prevent them from getting too large, with very little overhead in the common 
case where processes are synced already.  I'm thinking specifically of starting 
a sync-barrier on iteration N, waiting for it on N+25 and immediately starting 
another one, again waiting for it 25 steps later.
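
As a sketch of that pattern, using the non-blocking barrier as it later appeared 
in MPI-3 as MPI_Ibarrier (not available in the Open MPI releases of the time):

    #include <mpi.h>

    #define SYNC_INTERVAL 25

    int main(int argc, char **argv)
    {
        int iter, niters = 1000;
        MPI_Request barrier_req = MPI_REQUEST_NULL;

        MPI_Init(&argc, &argv);

        for (iter = 0; iter < niters; iter++) {
            /* ... one iteration of real work goes here ... */

            if (iter % SYNC_INTERVAL == 0) {
                /* Complete the barrier started SYNC_INTERVAL iterations
                   ago; in the good case every rank has already entered
                   it and this wait costs almost nothing.               */
                if (barrier_req != MPI_REQUEST_NULL)
                    MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);
                /* Immediately start the next one.                      */
                MPI_Ibarrier(MPI_COMM_WORLD, &barrier_req);
            }
        }
        if (barrier_req != MPI_REQUEST_NULL)
            MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }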

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Reduce performance

2010-09-08 Thread Ashley Pittman

On 8 Sep 2010, at 10:21, Gabriele Fatigati wrote:
> So, im my opinion, it is better put MPI_Barrier before any MPI_Reduce to 
> mitigate "asynchronous" behaviour of MPI_Reduce in OpenMPI. I suspect the 
> same for others collective communications. Someone can explaine me why 
> MPI_reduce has this strange behaviour?

There are many cases where adding an explicit barrier before a call to 
reduce would be superfluous, so the standard rightly says that it isn't needed 
and need not be performed.  As you've seen though there are also cases where it 
can help.  I'd be interested to know the effect if you only added a barrier 
before MPI_Reduce occasionally, perhaps every one or two hundred iterations; 
this can also have a beneficial effect, as a barrier every iteration adds 
significant overhead.
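
As a rough sketch of the sporadic version (the interval of 100 is only a guess 
to be tuned, and the work producing 'local' is omitted):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int iter, niters = 10000;
        double local = 1.0, total;

        MPI_Init(&argc, &argv);
        for (iter = 0; iter < niters; iter++) {
            /* ... compute 'local' for this iteration ... */

            /* Re-synchronise only occasionally; a barrier on every
               iteration adds measurable overhead of its own.        */
            if (iter % 100 == 0)
                MPI_Barrier(MPI_COMM_WORLD);

            MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }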

This is a textbook example of where the new asynchronous barrier could help; in 
theory it should have the effect of being able to keep processes in sync without 
any additional overhead in the case that they are already well synchronised.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] simplest way to check message queues

2010-09-02 Thread Ashley Pittman
On 1 Sep 2010, at 23:32, Jaison Mulerikkal wrote:

> Hi,
> 
> I am getting interested in this thread.
> 
> I'm looking for some solutions, where I can redirect a task/message 
> (MPI_send) to a particular process (say rank 1), which is in a queue (at rank 
> 1) to another process (say rank 2), if the queue is longer at rank 1. 
> 
> How can I do it?
> 
> First of all, I need to know the queue length at a particular process (rank 
> 1) at a particular instant. how can I use padb to get that info?
> 
> Then on the basis of that info 'send'  some (queued up) messages (from rank 
> 1) to some other process (say rank 2) which are relatively free. Is that 
> possible?


The tools being discussed are for querying the state of message queues within a 
parallel job from outside of that job and are not suitable for the type of 
introspection you are talking about.

It sounds like you are looking for some kind of shared receive queue which 
multiple ranks can pull messages off.  I can't think of anything in MPI that 
would allow this kind of functionality short of having a RTS/CTS protocol in 
the application layer.  The easiest might be to have a single rank receive all 
messages and keep them in a queue, and then use MPI_Ssend() to forward messages 
to your "consumer" ranks.  Substitute ranks for threads in the above text as 
you feel is appropriate.
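
A rough sketch of the forwarding rank (the incoming receive and the queue are 
replaced here by a generated payload, and the work-item format, count and 
shutdown marker are all assumptions):

    #include <mpi.h>

    #define NMSGS 100             /* number of work items (assumed) */

    int main(int argc, char **argv)
    {
        int rank, size, i, next = 1;
        double payload = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0 && size > 1) {
            for (i = 0; i < NMSGS; i++) {
                payload = i;     /* stand-in for a received/queued item */
                /* MPI_Ssend completes only once the consumer has begun
                   its receive, so work cannot pile up behind a busy
                   consumer rank.                                       */
                MPI_Ssend(&payload, 1, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
                next = (next % (size - 1)) + 1;  /* round-robin 1..size-1 */
            }
            payload = -1.0;      /* shutdown marker for every consumer   */
            for (i = 1; i < size; i++)
                MPI_Ssend(&payload, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
        } else if (rank > 0) {
            for (;;) {
                MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (payload < 0.0)
                    break;       /* shutdown marker received */
                /* ... process the work item here ... */
            }
        }
        MPI_Finalize();
        return 0;
    }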

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] simplest way to check message queues

2010-09-02 Thread Ashley Pittman

On 2 Sep 2010, at 15:56, Brock Palen wrote:

> Ashly still having trouble using padb with openmpi/1.4.2
> 
> [dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
> [nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: 
> Communication retries exceeded.  Can not communicate with peer
> [nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in 
> file util/comm/comm.c at line 62
> [nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in 
> file orte-ps.c at line 799
> [nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: 
> Communication retries exceeded.  Can not communicate with peer
> No active jobs could be found for user 'dianawon'
> 
> The job is running, I get this error running just orte-ps, 

If orte-ps isn't running correctly then there is very little padb can do.  If 
that is the case, try using the "mpirun" resource manager interface rather than 
"orte"; this will cause padb to use the MPIR interface and try to get the 
information directly from the mpirun process before launching itself via pdsh.  
It doesn't scale as well as the orte integration (pdsh runs out of file 
descriptors eventually) but is more generic and might get you to somewhere that 
works.  If your job spans more than 32 nodes you may need to set the FANOUT 
variable for pdsh to work.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] simplest way to check message queues

2010-09-01 Thread Ashley Pittman

padb as a binary (it's a perl script) needs to exist on all nodes as it calls 
orterun on itself; try installing it to a shared directory or copying padb to 
/tmp on every node.

To access the message queues padb needs a compiled helper program which is 
installed in $PREFIX/lib, so I would recommend re-building padb giving it a 
prefix of an NFS-shared directory.  I can help you more with this if required.

Ashley,

On 1 Sep 2010, at 23:01, Brock Palen wrote:

> We have ddt, but we do not have licenses to attach to the number of cores 
> these jobs run at.
> 
> I tried padb,  but it fails, 
> 
> Example:
> 
> ssh to root node for running MPI job:
> /tmp/padb -Q -a
> 
> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: 
> Communication retries exceeded.  Can not communicate with peer
> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in 
> file util/comm/comm.c at line 62
> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in 
> file orte-ps.c at line 799
> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: 
> Communication retries exceeded.  Can not communicate with peer
> einner: 
> --
> einner: orterun was unable to launch the specified application as it could 
> not access
> einner: or execute an executable:
> Unexpected EOF from Inner stdout (connecting)
> Unexpected EOF from Inner stderr (connecting)
> Unexpected exit from parallel command (state=connecting)
> Bad exit code from parallel command (exit_code=131)

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] simplest way to check message queues

2010-09-01 Thread Ashley Pittman

On 1 Sep 2010, at 21:13, Brock Palen wrote:

> I have a code for a user (namd if anyone cares)  that on a specific case will 
> lock up,  a quick ltrace shows the processes doing Iprobes over and over, so 
> this makes me think that a process someplace is blocking on communication.  
> 
> What is the best way to look at message queues? To see what process is stuck 
> and to drill into.

The only three programs I know of which can do this are TotalView, DDT and Padb.  
TotalView and DDT are graphical parallel debuggers and are commercial products; 
Padb is a command-line tool and is open-source.

Ashley (padb developer)

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] padb and openmpi

2010-08-17 Thread Ashley Pittman

On 17 Aug 2010, at 21:20, Steve Wise wrote:
> [ompi@hpc-hn1 ~]$ padb  --show-jobs --config-option rmgr=orte
> 65427
> [ompi@hpc-hn1 ~]$ padb --all --proc-summary --config-option rmgr=orte
> Warning, failed to locate ranks [0-3]
> 
> Any ideas on what I am doing wrong?

Nothing that springs to mind, you don't appear to be doing anything unusual.  
Could you try the same command and add "--debug all=all" to the command line 
and send me the output; I'll see if I can see anything.  One quick thing to 
check is that the ompi-ps command is giving the correct output: this should 
contain the hostname and pids of each of your processes, so you could check this 
is correct and send me the output as well to check the format hasn't changed 
again.

The 3.2 beta release of padb is proving very good; it's purely time that's 
stopped me turning the handle and making it a fully fledged release, so you 
should try this to see if it makes a difference to your problem.  The website 
for padb (containing links to its own mailing lists) is in my signature.

Ashley (the padb developer)

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Ashley Pittman

On 11 Aug 2010, at 05:10, Randolph Pullen wrote:

> Sure, but broadcasts are faster - less reliable apparently, but much faster 
> for large clusters.

Going off-topic here but I think it's worth saying:

If you have a dataset that requires collective communication then use the 
function call that best matches what you are trying to do.  Far too many people 
try to re-implement the collectives in their own code and it nearly always 
goes badly; as someone who's spent many years implementing collectives I've 
lost count of the number of times I've made someone's code go faster by 
replacing 500+ lines of code with a single call to MPI_Gather().

In the rare case that you find that some collectives are slower than they 
should be for your specific network and message size then the best thing to do 
is to work with the Open-MPI developers to tweak the thresholds so a better 
algorithm gets picked by the library.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Debugging OpenMPI with GDB

2010-06-25 Thread Ashley Pittman

On 25 Jun 2010, at 19:18, Немања Илић (Nemanja Ilic) wrote:

> Dear Sir or Madam,
> 
> I am about to start a project that includes MPI communication. My question 
> is: "Is there a way to debug parallel OpenMPI applications on linux in 
> console mode on one computer using gdb?"

You can debug individual processes in a job with gdb directly.  Alternatively, 
if you want to see the global picture of what a job is doing at a point in time, 
follow the link in my sig; it won't allow deep debugging with breakpoints and 
register dumps but it should allow you to narrow in on problems quickly.
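
One common trick when doing this is to park a chosen rank in a loop at startup, 
attach gdb to the printed pid and then release it from inside the debugger; a 
small sketch (holding only rank 0 here, which is an arbitrary choice):

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        volatile int hold = 1;   /* set to 0 from gdb to continue */
        int rank;
        char host[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        gethostname(host, sizeof(host));
        printf("rank %d pid %d on %s\n", rank, (int)getpid(), host);

        /* Attach with gdb to the pid printed above, then issue
           "set var hold = 0" and "continue".                     */
        while (rank == 0 && hold)
            sleep(1);

        /* ... rest of the program ... */
        MPI_Finalize();
        return 0;
    }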

Also, and this is unique to MPI, it's possible to see the "message queues" for 
ranks within an MPI application, which can help with programming.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Building from the SRPM version creates an rpm with striped libraries

2010-05-25 Thread Ashley Pittman

This is a standard rpm feature although like most things it can be disabled.

According to this mail and its replies the two %defines below will prevent 
stripping and the building of debuginfo rpms.

http://lists.rpm.org/pipermail/rpm-list/2009-January/000122.html

%define debug_package %{nil}
%define __strip /bin/true

Ashley.

On 25 May 2010, at 00:25, Peter Thompson wrote:

> I have a user who prefers building rpm's from the srpm.  That's okay, but for 
> debugging via TotalView it creates a version with the openmpi .so files 
> stripped and we can't gain control of the processes when launched via mpirun 
> -tv.  I've verified this with my own build of a 1.4.1 rpm which I then 
> installed and noticed the same behavior that the user reports.  I was hoping 
> to give them some advice as to how to avoid the stripping, as it appears that 
> the actual build of those libraries is done with -g and everything looks 
> fine.  But I can't figure out in the build (from the log file I created) just 
> where that stripping takes place, or how to get around it if need be.  The 
> best guess I have is that it may be happening at the very end when an rpm-tmp 
> file is executed, but that file has disappeared so I don't really know what 
> it does.  I thought it might be apparent in the spec file, but it's certainly 
> not apparent to me!  Any help or advice would be appreciated.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] open-mpi behaviour on Fedora, Ubuntu, Debian and CentOS

2010-04-26 Thread Ashley Pittman

On 25 Apr 2010, at 22:27, Asad Ali wrote:

> Yes I use different machines such as 
> 
> machine 1 uses AMD Opterons. (Fedora)
> 
> machine 2 and 3 use Intel Xeons. (CentOS)
> 
> machine 4 uses slightly older Intel Xeons. (Debian)
> 
> Only machine 1 gives correct results.  While CentOS and Debian results are 
> same but are wrong and different from those of machine 1.

Have you verified they are actually wrong, or are they just different?  It's 
actually perfectly possible for the same program to get different results from 
run to run, even on the same hardware and the same OS.  All floating point 
operations by the MPI library are expected to be deterministic, but changing the 
process layout or any MPI settings can affect this, and of course anything the 
application does can introduce differences as well.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] viewing message queues for running job

2010-03-15 Thread Ashley Pittman

On 15 Mar 2010, at 20:18, Brock Palen wrote:
> Is there a way to view what outstanding messages are in queues for an already 
> running job?  I know I can do this via ddt (parallel debugger)  but for 
> normal non debugged jobs is there a way to just ask open-mpi  "show 
> outstanding messages you have"?

This is one of the bits of information Padb can tell you, as well as lots of 
other detail about running jobs; the message queue data isn't as concise as it 
could be when looking at large process counts but the data is there.

http://padb.pittman.org.uk/modes.html#mpi-queue

> Thanks, this would be really useful for jobs that only hang randomly or after 
> very long runtimes.

You're right, for example it's used to good effect in the open-mpi automated 
testing as well as at numerous other sites from the large to the small.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] totalview and message queue, empty windows

2010-01-29 Thread Ashley Pittman

On 28 Jan 2010, at 21:04, DevL wrote:

> Hi,
> it looks that there is an issue with totalview and
> openmpi
>  
> message queue just empty and output shows:
> WARNING: Field mtc_ndims_or_nnodes of type mca_topo_base_comm_1_0_0_t not 
> found!
> WARNING: Field mtc_dims_or_index of type mca_topo_base_comm_1_0_0_t not found!
> WARNING: Field mtc_periods_or_edges of type mca_topo_base_comm_1_0_0_t not 
> found!
> WARNING: Field mtc_reorder of type mca_topo_base_comm_1_0_0_t not found!
> WARNING: Field mtc_ndims_or_nnodes of type mca_topo_base_comm_1_0_0_t not 
> found!
> WARNING: Field mtc_dims_or_index of type mca_topo_base_comm_1_0_0_t not found!
> WARNING: Field mtc_periods_or_edges of type mca_topo_base_comm_1_0_0_t not 
> found!
> WARNING: Field mtc_reorder of type mca_topo_base_comm_1_0_0_t not found!
> [
>  (Open MPI) 1.4a1r21427
> and
> totalview.8.7.0-7/linux-x86-64
>  
> is this a known issue?

I've not seen it before but I do know of problems with the 
mca_topo_base_comm_1_0_0_t type and the debugger plugin (which TotalView is 
calling).

> and if so - how to overcome it ?

I'm afraid I don't know.

The Debugger plugin looks for the type (it's a struct) and then looks for some 
offsets within the struct.  I've seen it fail to find the struct completely 
whereas this error appears to claim it can't find the entries within the 
struct.  Perhaps the difference is that I found the problem using padb and you 
are using TotalView.

You could try the attached patch which allows the code to continue if the type 
isn't found, if you are seeing a different symptom of the same error then it 
might work for you.



ompi-topo-type.patch
Description: Binary data


As to the cause I've no idea; I've only seen it once or twice in the last six 
months and not on installations I've installed myself, and I've never been able 
to find out the underlying cause and why some machines report this error and 
some don't.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] 1.4 OpenMPI build not working well with TotalView on Darwin

2010-01-22 Thread Ashley Pittman
On Wed, 2010-01-20 at 21:18 -0500, Peter Thompson wrote:
> Hi Jeff,
> 
> Sorry, speaking in shorthand again.
> 
> Jeff Squyres wrote:
> > On Jan 8, 2010, at 5:03 PM, Peter Thompson wrote:
> > 
> >> I've tried a few builds of 1.4 on Snow Leopard, and trying to start up 
> >> TotalView
> >> gets some of the more 'standard' problems.  
> > 
> > I don't quite know what you mean by "standard" problems...?
> 
> That's more or less 'standard problems' that I hear described when someone 
> tries 
> to build and MPI (not just OpenMPI) and things don't work on first try.  I 
> don't 
> know if you've worked on the interface directly, but you are probably aware 
> that 
> TotalView has an API where we set up a structure, MPIR_PROCTABLE, based on a 
> typedef MPIR_PROCDESC, which gets filled in as to what processes are started 
> up 
> on which nodes.  Which allows the debugger to attach to things automatically. 
> If the build is done so that the files that hold these structures are 
> optimized, 
> sometimes the typedef is optimized away.  Or in the case of other builds, the 
> file may have the correct optimization (none) but the symbol info is stripped 
> in 
> the link phase.  So it's a typical, or 'standard' issue I face, but hopefully 
> not for you.

I've seen several OpenMPI installs in the wild like this where the type
information for MPIR_PROCTABLE is missing.  The fact the type
information is missing however doesn't affect the code or contents of
memory at all, just that it's not described by debug information.  As
there is a standard (sort of) to describe MPIR_PROCTABLE what I choose
to do in padb is to use the standard to calculate the struct size and
offsets rather than the debug info.  This allows padb to work even when
the debug information is missing.

If the debug information is available I verify that it matches what I
expect it to be.

Don't use the debug info but rather use fixed sizes and offsets:
http://code.google.com/p/padb/source/detail?r=355

Verify the type information if present:
http://code.google.com/p/padb/source/detail?r=386

> However, 
> some users prefer the classic launch with -tv, and this seems to be failing 
> with 
> the latest builds I've done on Darwin.

I've seen this 'problem' on Linux as well.  I'm unsure of the OpenMPI
version although I could ask the organisation concerned if required.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] MPI debugger

2010-01-15 Thread Ashley Pittman

On 11 Jan 2010, at 06:20, Jed Brown wrote:

> On Sun, 10 Jan 2010 19:29:18 +0000, Ashley Pittman  
> wrote:
>> It'll show you parallel stack traces but won't let you single step for
>> example.
> 
> Two lightweight options if you want stepping, breakpoints, watchpoints,
> etc.
> 
> * Use serial debuggers on some interesting processes, for example with
> 
>mpiexec -n 1 xterm -e gdb --args ./trouble args : -n 2 ./trouble args : -n 
> 1 xterm -e gdb --args ./trouble args
> 
>  to put an xterm on rank 0 and 3 of a four process job (there are lots
>  of other ways to get here).

You can also achieve something similar with padb by starting the job normally 
and then using padb to launch xterms in a similar manner although it's been 
pointed out to me that this only works with one process per node right now.

> * MPICH2 has a poor-man's parallel debugger, mpiexec.mpd -gdb allows you
>  to send the same gdb commands to each process and collate the output.

True, I'd forgotten about that, the MPICH2 people are moving away from mpd 
though so I don't know how much longer that will be an option.

Ashley,


Re: [OMPI users] MPI debugger

2010-01-10 Thread Ashley Pittman
On Fri, 2010-01-08 at 11:36 +0530, Arunkumar C R wrote:

> I do MPI programs using Fortran 90 in a Quad Core Machine with Fedora
> OS. Could any one of you suggest a good debugger to resolve the
> compilation/ run time errors?

It depends on what you mean by a debugger, there are two "parallel
debuggers" on the market, TotalView and DDT, both closed source and
fairly expensive.  They are both graphical apps that allow you to start
a job under their control or attach to existing jobs and allow full view
of the job and control its execution (stepping and setting breakpoints
as you would in a non-parallel debugger).

There is also padb which is a tool I develop, it's open-source and
command line based, it doesn't allow you to dig as deep but does provide
a lot of information about the state of a parallel job.  It'll show you
parallel stack traces but won't let you single step for example.

The most basic way of using it and sample output are on-line here: 
http://padb.pittman.org.uk/full-report.html

All three of these tools will allow you to see the "Message queues"
contained within the parallel job as well.

In addition I believe Eclipse has some support for parallel programs,
I've not used it however so can't comment on its features.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Is OpenMPI's orted = MPICH2's smpd?

2009-12-22 Thread Ashley Pittman
On Tue, 2009-12-22 at 09:59 +0530, Sangamesh B wrote:
> Hi,
> 
> MPICh2 has different process managers: MPD, SMPD, GFORKER etc.

It also has Hydra.

>  Is the Open MPI's startup daemon orted similar to MPICH2's smpd? Or
> something else?

My understanding is that SMPD is for launching on Windows which isn't
something I know about.

orte is similar to MPD although without the requirement that you start
the ring before-hand.

A quick summary of orte: Orte takes a list of nodes and a process count,
given these it will start a job of the given size on the given nodes.
No prior configuration or starting of daemons is required.  No effort is
made to prevent multiple jobs from starting on the same nodes and no
effort is made to maintain a "queue" of jobs waiting for nodes to become
free.  Each job is independent, and runs where you tell it to
immediately.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Debugging spawned processes

2009-12-18 Thread Ashley Pittman
On Wed, 2009-12-16 at 12:06 +0100, jody wrote:

> Has anybody got some hints on how to debug spawned processes?

If you can live with the processes starting normally and attaching gdb
to them after they have started then you could use padb.

Assuming you only have one job active (replace -a with the job-id if you
don't) and watch to target the first spawned job then the following
command will launch an xterm for each rank in the job and automatically
attach to the process for you.

padb -Oorte-job-step=2 --command -Ocommand="xterm -T %r -e 'gdb -p %p'"
-a

You'll need to use the SVN version of padb for this, the "orte-job-step"
option tells it to attach to the first spawned job, use orte-ps to see
the list of job steps.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Ashley Pittman
On Thu, 2009-12-17 at 14:40 +, Min Zhu wrote:

> Here is the content of openmpi-mpirun file, so maybe something needs to
> be changed?
> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
> usage
> exit -1
> fi
> 
> MYARGS=$*

Shouldn't this be MYARGS=$@ ?  It'll change the way quoted args are
forwarded to the parallel job.

Ashley,


-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] OpenMPI problem on Fedora Core 12

2009-12-14 Thread Ashley Pittman
On Sun, 2009-12-13 at 19:04 +0100, Gijsbert Wiesenekker wrote:
> The following routine gives a problem after some (not reproducible)
> time on Fedora Core 12. The routine is a CPU usage friendly version of
> MPI_Barrier.

There are some proposals for Non-blocking collectives before the MPI
forum currently and I believe a working implementation which can be used
as a plug-in for OpenMPI, I would urge you to look at these rather than
try and implement your own.

> My question is: is there a problem with this routine that I overlooked
> that somehow did not show up until now

Your code both does all-to-all communication and also uses probe, both
of these can easily be avoided when implementing Barrier.

> Is there a way to see which messages have been sent/received/are
> pending?

Yes, there is a message queue interface allowing tools to peek inside
the MPI library and see these queues.  That I know of there are three
tools which use this, either TotalView, DDT or my own tool, padb.
TotalView and DDT are both full-featured graphical debuggers and
commercial products, padb is a open-source text based tool.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Mimicking timeout for MPI_Wait

2009-12-10 Thread Ashley Pittman
On Tue, 2009-12-08 at 10:14 +, Number Cruncher wrote:
> Whilst MPI has traditionally been run on dedicated hardware, the rise of 
> cheap multicore CPUs makes it very attractive for ISVs such as ourselves 
> (http://www.cambridgeflowsolutions.com/) to build a *single* executable 
> that can be run in batch mode on a dedicated cluster *or* interactively 
> on a user's workstation.
> 
> Once you've taken the pain of writing a distributed-memory app (rather 
> than shared-memory/multithreaded), MPI provides a transparent API to 
> cover both use cases above. *However*, at the moment, the lack of 
> select()-like behaviour (instead of polling) means we have to write 
> custom code to avoid hogging a workstation. A runtime-selectable 
> mechanism would be perfect!

Speaking as an independent observer here (i.e. not an OMPI developer) I
don't think you'll find anyone who wouldn't view what you are asking for
as a good thing; it's something that has been and continues to be
discussed often.  I for one would love to see it; whilst as Richard says
it can increase latency it can also reduce noise so help performance on
larger systems.

As you say you are one of a new breed of MPI users and this feature
would most likely benefit you more than the traditional
dedicated-machine users of MPI, I expect it to become more of an issue
as MPI is adopted by a wider audience.  As OpenMPI is an open-source
project the question should not be what appetite is there amongst users
but is there any one user who is both motivated enough, able to do the
work and finally not busy doing other things.  I've implemented this
before and it's not an easy feature to add by any means and tends to be
very intrusive into the code-base which itself causes problems.

There was another thread on this mailing list this week where Ralph
recommended setting the yield_when_idle mca param ("--mca
yield_when_idle 1) which will cause threads to call sched_yield() when
polling.  The end result here is that they will still consume 100% of
idle CPU time but then other programs want to use the CPU the MPI
processes will not hog it but rather let the other processes use as much
CPU time as they want and just spin when the CPU would otherwise be
idle.  This is something I use daily and greatly increases the
responsiveness of systems which are mixing idle MPI with other
applications.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] mpirun only works when -np <4

2009-12-09 Thread Ashley Pittman
On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
> There are 8 physical cores, or 16 with hyperthreading enabled. 

That should be meaty enough.

> 1st of all, let me say that when I specify that -np is less than 4
> processors (1, 2, or 3), both programs seem to work as expected. Also,
> the non-mpi version of each of them works fine.

Presumably the non-MPI version is serial however?  That doesn't mean
the program is bug-free or that the parallel version isn't broken.
There are any number of apps that don't work above N processes; in fact
probably all programs break for some value of N, it's normally a little
higher than 3 however.

> Thus, I am pretty sure that this is a problem with MPI rather that
> with the program code or something else.  
> 
> What happens is simply that the program hangs..

I presume you mean here the output stops?  The program continues to use
CPU cycles but no longer appears to make any progress?

I'm of the opinion that this is most likely a error in your program, I
would start by using either valgrind or padb.

You can run the app under valgrind using the following mpirun options,
this will give you four files named v.log.0 to v.log.3 which you can
check for errors in the normal way.  The "--mca btl tcp,self" option
will disable shared memory which can create false positives.

mpirun -n 4 --mca btl tcp,self valgrind --log-file=v.log.%q{OMPI_COMM_WORLD_RANK}

Alternatively you can run the application, wait for it to hang and then
in another window run my tool, padb, which will show you the MPI message
queues and stack traces which should show you where it's hung,
instructions and sample output are on this page.

http://padb.pittman.org.uk/full-report.html

> There are no error messages, and there is no clue from anything else
> (system working fine otherwise- no RAM issues, etc). It does not hang
> at the same place everytime, sometimes in the very beginning, sometime
> near the middle..  
> 
> Could this an issue with hyperthreading? A conflict with something?

Unlikely, if there was a problem in OMPI running more than 3 processes
it would have been found by now.  I regularly run 8 process applications
on my dual-core netbook alongside all my desktop processes without
issue, it runs fine, a little slowly but fine.

All this talk about binding and affinity won't help either, process
binding is about squeezing the last 15% of performance out of a system
and making performance reproducible, it has no bearing on correctness or
scalability.  If you're not running on a dedicated machine, which with
Firefox running I guess you aren't, then there is a good case for
leaving it off anyway.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-03 Thread Ashley Pittman
On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
> On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
> > On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
> >> The attached code, is an example where openmpi/1.3.2 will lock up, if
> >> ran on 48 cores, of IB (4 cores per node),
> >> The code loops over recv from all processors on rank 0 and sends from
> >> all other ranks, as far as I know this should work, and I can't see
> >> why not.
> >> Note yes I know we can do the same thing with a gather, this is a
> >> simple case to demonstrate the issue.
> >> Note that if I increase the openib eager limit, the program runs,
> >> which normally means improper MPI, but I can't on my own figure out
> >> the problem with this code.
> >
> > What are you increasing the eager limit from and too?
> 
> The same value as ethernet on our system,
> mpirun --mca btl_openib_eager_limit 655360 --mca  
> btl_openib_max_send_size 655360 ./a.out
> 
> Huge values compared to the defaults, but works,

My understanding of the code is that each message will be 256k long and
the code pretty much guarantees that at some point there will be 46
messages in the queue in front of the one you are looking to receive
which makes a total of 11.5Mb, slightly less if you take shared memory
into account.

If the MPI_SEND isn't blocking then each rank will send 50 messages to
rank zero and you'll have 2000 messages and 500Mb of data being received
with the message you want being somewhere towards the end of the queue.

These numbers are far from huge but then compared to an eager limit of
64k they aren't small either.

I suspect the eager limit is being reached on COMM_WORLD rank 0 and it's
not pulling any more messages off the network pending some of the
existing ones being out of the queue but they never will be because the
message being waited for is one that's stuck on the network.  As I say
the message queue for rank 0 when it's deadlocked would be interesting
to look at.

In summary this code makes heavy use of unexpected messages and network
buffering, it's not surprising to me that it only works with eager
limits set fairly high.

Ashley,
-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-01 Thread Ashley Pittman
On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
> The attached code, is an example where openmpi/1.3.2 will lock up, if  
> ran on 48 cores, of IB (4 cores per node),
> The code loops over recv from all processors on rank 0 and sends from  
> all other ranks, as far as I know this should work, and I can't see  
> why not.
> Note yes I know we can do the same thing with a gather, this is a  
> simple case to demonstrate the issue.
> Note that if I increase the openib eager limit, the program runs,  
> which normally means improper MPI, but I can't on my own figure out  
> the problem with this code.

What are you increasing the eager limit from and to?  There is a
moderate amount of data flowing and, as the receives are made
synchronously and in order, it could be that there are several
thousand unexpected messages arriving before the one you are looking for,
which will lead to long receive queues and a need to buffer lots of
data.

> Any input on why code like this locks up.

If you ran padb against this code when it had locked up you should be
able to get some more information, in particular the message queues for
rank zero.  Hopefully this information would be useful.

http://padb.pittman.org.uk/full-report.html

Ashley Pittman.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Help tracing casue of readv errors

2009-11-25 Thread Ashley Pittman
On Wed, 2009-11-25 at 12:36 +0100, Atle Rudshaug wrote:

> I got a similar error when using non-blocking communication on large 
> datasets. I could not figure out why this was happening, since it seemed 
> sort of random. I eventually bypassed the problem by switching to 
> blocking communication, which felt kind of sad...If anyone knows if this 
> is a bug in OpenMPI or connected to hardware somehow, please share.

You could easily be running out of memory on one node by saturating it
with messages, all of which may need to be buffered.  Have you checked
the offending nodes for messages from the OOM killer?

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Ashley Pittman
On Wed, 2009-11-18 at 01:28 -0800, Bill Broadley wrote:
> A rather stable production code that has worked with various versions
> of MPI
> on various architectures started hanging with gcc-4.4.2 and openmpi
> 1.3.33
> 
> Which lead me to this thread. 

If you're investigating hangs in a parallel job take a look at the tool
linked to below (padb), it should be able to give you a parallel stack
trace and the message queues for the job.

http://padb.pittman.org.uk/full-report.html

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] sending/receiving large buffers

2009-11-09 Thread Ashley Pittman
On Sun, 2009-11-08 at 20:40 -0800, Martin Siegert wrote:
> Hi,
> 
> I am running into a problem with mpi_allreduce when large buffers
> are used. But does not appear to be unique for mpi_allreduce; it
> occurs with mpi_send/mpi_recv as well; program is attached.
> 1) run this using MPI_Allreduce:

> allreduce completed 2.700941
> enter array size (integer; negative to stop):
> 32000
> 
> At this point the program just hangs forever.

You could use padb (It's linked to in my sig) to tell you where the
application is stuck - it could just be swapping.

> All programs/libraries are 64bit, interconnect is IB.
> I expect problems with sizes larger than 2^31-1, but these array sizes
> are still much smaller.

Whilst the message counts are smaller than 2^31-1 you should be aware
that the message sizes are larger as they are multiplied by
sizeof(double) so I wouldn't rule out this theory.

Also, you are mallocing at least 4Gb per process and quite possibly a
large amount for buffering in the MPI library as well, it could be that
you are simply running out of memory.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] memchecker overhead?

2009-10-26 Thread Ashley Pittman
On Mon, 2009-10-26 at 16:21 -0400, Jeff Squyres wrote:

> there's a tiny/ 
> small amount of overhead inserted by OMPI telling Valgrind "this  
> memory region is ok", but we live in an intensely competitive HPC  
> environment.

I may be wrong but I seem to remember Julian saying the overhead is
twelve cycles for the valgrind calls.  Of course calculating what to
pass to valgrind may add to this.

> The option to enable this Valgrind Goodness in OMPI is --with- 
> valgrind.  I *think* the option may be the same for libibverbs, but I  
> don't remember offhand.
> 
> That being said, I'm guessing that we still have bunches of other  
> valgrind warnings that may be legitimate.  We can always use some help  
> to stamp out these warnings...  :-)

I note there is a bug for this, being "Valgrind clean" is a very
desirable feature for any software and particularly a library IMHO.

https://svn.open-mpi.org/trac/ompi/ticket/1720

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Re : Yet another stdin problem

2009-10-07 Thread Ashley Pittman

Or better still if you want to be able to pass the filename and args on
the mpirun command line use the following and then run it as 

mpirun -np 64 ./input_wrapper inputs.txt my_exe

#!/bin/bash

FILE=$1
shift

"$@" < $FILE

In general though using stdin on parallel applications is rarely a good
solution.

Ashley.

On Wed, 2009-10-07 at 18:42 +0300, Roman Cheplyaka wrote:
> As a slight modification, you can write a wrapper script
> 
> #!/bin/sh
> my_exe < inputs.txt
> 
> and pass it to mpirun.

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Program hangs when run in the remote host ...

2009-10-06 Thread Ashley Pittman
On Tue, 2009-10-06 at 12:22 +0530, souvik bhattacherjee wrote:

> This implies that one has to copy the executables in the remote host
> each time one requires to run a program which is different from the
> previous one. 

This is correct, the name of the executable is passed to each node and
that executable is then executed locally.

> Is the implication correct or is there some way around.

Typically some kind of a shared filesystem would be used, nfs for
example.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Is Iprobe fast when there is no message to recieve

2009-10-03 Thread Ashley Pittman
On Sat, 2009-10-03 at 07:05 -0400, Jeff Squyres wrote:
> That being said, if you just want to send a quick "notify" that an  
> event has occurred, you might want to use a specific tag and/or  
> communicator for these extraordinary messages.  Then, when the event  
> occurs, send a very short message on this special tag/communicator  
> (potentially even a 0-byte message).

> You can MPI_TEST for  
> the completion of this short/0-byte receive very quickly.  You can  
> then send the actual data of the event in a different non-blocking  
> receive that is only checked if the short "alert" message is received.

In general I would say that Iprobe is a bad thing to use, as Jeff says
post a receive in advance and then call test on this receive rather than
using Iprobe.

From your description it sounds like a zero byte send is all you need
which should be fast in all cases.
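
Something along these lines illustrates the idea (an untested sketch;
the tag value, function names and the work loop are all made up):

#include <mpi.h>

#define NOTIFY_TAG 999   /* arbitrary tag reserved for the "event" message */

/* Listener side: pre-post a zero-byte receive once, then poll it cheaply
 * with MPI_Test instead of calling MPI_Iprobe. */
void listen_for_event(MPI_Comm comm, int source)
{
    MPI_Request req;
    int done = 0;

    MPI_Irecv(NULL, 0, MPI_BYTE, source, NOTIFY_TAG, comm, &req);

    while (!done) {
        /* ... do a chunk of normal work here ... */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    /* The event has arrived; post a receive for the real data if needed. */
}

/* Notifier side: a zero-byte send is enough to signal the event. */
void signal_event(MPI_Comm comm, int dest)
{
    MPI_Send(NULL, 0, MPI_BYTE, dest, NOTIFY_TAG, comm);
}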

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] memalign usage in OpenMPI and it's consequencesfor TotalVIew

2009-10-01 Thread Ashley Pittman

Simple malloc() returns pointers that are at least eight byte aligned
anyway, I'm not sure what the reason for calling memalign() with a value
of four would be anyway.

Ashley,

On Thu, 2009-10-01 at 20:19 +0200, Åke Sandgren wrote:
> No it didn't. And memalign is obsolete according to the manpage.
> posix_memalign is the one to use.

> > > https://svn.open-mpi.org/trac/ompi/changeset/21744

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Is there an "flush()"-like function in MPI?

2009-09-27 Thread Ashley Pittman

There are tools available to allow you to see the "message queues" of a
process, this might help you identify why you aren't seeing the messages
that you are waiting on complete.  One such tool is linked to in my
signature, you could also look into TotalView or DDT as well.

I would also suggest that as you are seeing random hangs and crashes
running your code under Valgrind might be advantageous.

Ashley Pittman.

On Sun, 2009-09-27 at 02:05 +0800, guosong wrote:
> Yes, I know there should be a bug. But I do not know where and why.
> The strange thing was sometimes it worked but at this time there will
> be a segmentation fault. If it did not work, some process must sit
> there waiting for the message. There are many iterations in my
> program(using a loop). It would after a few iterations the "bug" would
> appear, which means the previous a few iterations the communication
> worked. I am quite comfused now.


-- 

Ashley Pittman, Bath, UK.

Padb - A open source job inspection tool for parallel computing
http://padb.pittman.org.uk



Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ashley Pittman
On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote:
> If it isn't already there, try putting a print statement tight at
> program start, another just prior to MPI_Init, and another just after
> MPI_Init. It could be that something is hanging somewhere during
> program startup since it sounds like everything is launching just
> fine.

If you suspect a hang then you can use the command orte-ps (on the node
where the mpirun is running) and it should show you your job.  This will
tell you if the job is started and still running or if there was a
problem launching.

If the program did start and has really hung then you can get more
in-depth information about it using padb which is linked to in my
signature.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Embarrassingly parallel problems with MapReduce and MPI ?

2009-07-13 Thread Ashley Pittman
On Mon, 2009-07-13 at 16:06 +0900, Ashika Umanga Umagiliya wrote:

> I am just curious, if the problem is embarrassingly parallel , then how 
> effective using MPI over a 'MapReduce' implementation(apache Hadoop ) .

Almost impossible.  You could implement MapReduce on top of MPI fairly
trivially however.

The problem being embarrassingly parallel is of no consequence beyond
the fact that, if it is, you wouldn't really need either MPI or MapReduce.
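
As a toy illustration of the "MapReduce on top of MPI" point (everything
about the data here is made up; the map step just squares numbers and the
reduce step is a sum performed by a single collective):

#include <mpi.h>
#include <stdio.h>

#define CHUNK 4   /* elements mapped by each rank; arbitrary for the example */

static long map_fn(long x) { return x * x; }   /* the "map" step */

int main(int argc, char **argv)
{
    int rank, size;
    long local[CHUNK], mapped_sum = 0, global_sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank "maps" over its own chunk of the (synthetic) input... */
    for (int i = 0; i < CHUNK; i++)
        local[i] = map_fn(rank * CHUNK + i);

    /* ...and the "reduce" step is a single collective. */
    for (int i = 0; i < CHUNK; i++)
        mapped_sum += local[i];
    MPI_Reduce(&mapped_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of squares of 0..%d = %ld\n", size * CHUNK - 1, global_sum);

    MPI_Finalize();
    return 0;
}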

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Problems in OpenMPI

2009-07-13 Thread Ashley Pittman
On Sun, 2009-07-12 at 19:49 -0500, Yin Feng wrote:
> Can you give me a further explanation about why results are different
> when it ran it on mutiprocessors against single processor?

Floating point numbers are problematic for a number of reasons; they
are only *approximations* of real numbers because of their limited
precision.  This means that when you do calculations with floating point
numbers you end up with approximations of the answers (because you only
really had an approximation of the question).

In parallel computing you find that the route taken to reach an answer
is different to that taken in serial computing and hence you get
different errors so the eventual answer is different.  Furthermore
you'll quite likely find that you get different answers running at
different scales, depending on how you spread out your job.

Unfortunately it's a fundamental limitation of finite-precision computer
arithmetic and one that people have learned to live with.

http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems
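
A trivial single-process example of the effect; re-ordering the same
three additions, which is exactly what a different parallel decomposition
does, changes the rounding and hence the result:

#include <stdio.h>

int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    /* Mathematically both sums are 1.0, but the rounding differs. */
    printf("(a + b) + c = %f\n", (a + b) + c);   /* prints 1.000000 */
    printf("a + (b + c) = %f\n", a + (b + c));   /* prints 0.000000 */
    return 0;
}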

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Problems in OpenMPI

2009-07-10 Thread Ashley Pittman
On Fri, 2009-07-10 at 14:35 -0500, Yin Feng wrote:
> I have my code run on supercomputer.
> First, I required allocation and then just run my code using mpirun.
> The supercomputer will assign 4 nodes but they are different at each
> time of requirement. So, I don't know the machines I will use before
> it runs.
> Do you know how to figure out under this situation?

The answer depends on what scheduler the computer is using, if it's
using SGE then I believe it's enough to compile Open-MPI with the
--with-sge flag and it figures it out for itself.  You'll probably need
to check with the local admins for a definitive answer.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Problems in OpenMPI

2009-07-10 Thread Ashley Pittman
On Thu, 2009-07-09 at 23:40 -0500, Yin Feng wrote:
> I am a beginner in MPI.
> 
> I ran an example code using OpenMPI and it seems work.
> And then I tried a parallel example in PETSc tutorials folder (ex5).
> 
> mpirun -np 4 ex5
> It can do but the results are not as accurate as just running ex5.
> Is that thing normal?

Not as accurate or just different?  Different is normal and in light of
that accurate is itself a vague concept.

> After that, send this job to supercomputer which allocates me 4 nodes
> and each node has 8 processors. When I check load on each node, I
> found:

> Does anyone have any idea about this?

I'd say it's obvious all 32 processes have been located on the same
node, what was the mpirun command you issued and the contents of the
machinefile you used?

Running "orte-ps" on the machine where the mpirun command is running
will tell you the hostname where every rank is running or if you want
more information (load, cpu usage etc) you can use padb, the link for
which is in my signature.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] quadrics support?

2009-07-08 Thread Ashley Pittman
On Wed, 2009-07-08 at 15:43 -0400, Michael Di Domenico wrote:
> On Wed, Jul 8, 2009 at 3:33 PM, Ashley Pittman wrote:
> >> When i run tping i get:
> >> ELAN_EXCEOPTIOn @ --: 6 (Initialization error)
> >> elan_init: Can't get capability from environment
> >>
> >> I am not using slurm or RMS at all, just trying to get openmpi to run
> >> between two nodes.
> >
> > To attach to the elan a process has to have a "capability" which is a
> > kernel attribute describing the size (number of nodes/ranks) of the job,
> > without this you'll get errors like the one from tping.  The only way to
> > generate these capabilities is by using RMS, Slurm or I believe pdsh
> > which can generate one and push it into the kernel before calling fork()
> > to create the user application.
> 
> I didn't realize it was an MPI type program, so I ran is using the
> QSNet version of mpirun and OpenMPI.  The process does start and runs
> through 0: and 2:, which i assume are packet sizes, but freezes at
> that point.
> 
> We have an existing XC cluster from HP, that we're trying to convert
> from XC to standard RHEL5.3 w/ Slurm and OpenMPI.  All i want to be
> able to do is load RHEL5 and the Quadrics NIC drivers, and run OpenMPI
> jobs between these two nodes I yanked from the cluster before we
> switch the whole thing over.

My advice would be to try OpenMPI on the (presumably functional) XC
cluster first and then migrate from there to RHEL5.3.  I don't recall
Slurm being hard to get working but it'll be a lot easier to diagnose if
you get OpenMPI and the resource manager working separately before
putting them together.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] quadrics support?

2009-07-08 Thread Ashley Pittman
On Wed, 2009-07-08 at 15:09 -0400, Michael Di Domenico wrote:
> On Wed, Jul 8, 2009 at 12:33 PM, Ashley Pittman wrote:
> > Is the machine configured correctly to allow non OpenMPI QsNet programs
> > to run, for example tping?
> >
> > Which resource manager are you running, I think slurm compiled for RMS
> > is essential.
> 
> I can ping via TCP/IP using the eip0 ports.
> 
> When i run tping i get:
> ELAN_EXCEOPTIOn @ --: 6 (Initialization error)
> elan_init: Can't get capability from environment
> 
> I am not using slurm or RMS at all, just trying to get openmpi to run
> between two nodes.

To attach to the elan a process has to have a "capability" which is a
kernel attribute describing the size (number of nodes/ranks) of the job,
without this you'll get errors like the one from tping.  The only way to
generate these capabilities is by using RMS, Slurm or I believe pdsh
which can generate one and push it into the kernel before calling fork()
to create the user application.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] quadrics support?

2009-07-08 Thread Ashley Pittman
On Tue, 2009-07-07 at 17:18 -0400, Michael Di Domenico wrote:
> So, first run i seem to have run into a bit of an issue.  All the
> Quadrics modules are compiled and loaded.  I can ping between nodes
> over the quadrics interfaces.  But when i try to run one of the hello
> mpi example from openmpi, i get:
> 
> first run, the process hung - killed with ctl-c
> though it doesnt seem to actually die and kill -9 doesn't work
> 
> second run, the process fails with
>   failed elan4_attach  Device or resource busy
>   
>   elan_allocSleepDesc  Failed to allocate IRQ cookie 2a: 22
> Invalid argument
> all subsequent runs fail the same way and i have to reboot the box to
> get the processes to go away
> 
> I'm not sure if this is a quadrics or openmpi issue at this point, but
> i figured since there are quadrics people on the list its a good place
> to start

Is the machine configured correctly to allow non OpenMPI QsNet programs
to run, for example tping?

Which resource manager are you running, I think slurm compiled for RMS
is essential.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] quadrics support?

2009-07-08 Thread Ashley Pittman
On Tue, 2009-07-07 at 15:30 -0400, Michael Di Domenico wrote:
> Does OpenMPI/Quadrics require the Quadrics Kernel patches in order to
> operate?  Or operate at full speed or are the Quadrics modules
> sufficient?

In theory you can run without them, although you'll find it easier and the
code faster if you apply the patches.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Segmentation fault - Address not mapped

2009-07-07 Thread Ashley Pittman

This is the error you get when an invalid communicator handle is passed
to an MPI function; the handle is dereferenced so you may or may not get a
SEGV from it depending on the value you pass.

The  0x44a0 address is an offset from 0x4400, the value of
MPI_COMM_WORLD in mpich2, my guess would be you are either picking up an
mpich2 mpi.h or the mpich2 mpicc.

Ashley,

On Tue, 2009-07-07 at 11:05 +0100, Catalin David wrote:
> Hello, all!
> 
> Just installed Valgrind (since this seems like a memory issue) and got
> this interesting output (when running the test program):
> 
> ==4616== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
> ==4616==at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so)
> ==4616==by 0x4236A75: opal_paffinity_linux_plpa_init (plpa_runtime.c:37)
> ==4616==by 0x423779B:
> opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501)
> ==4616==by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119)
> ==4616==by 0x447F114: opal_paffinity_base_select
> (paffinity_base_select.c:64)
> ==4616==by 0x444CD71: opal_init (opal_init.c:292)
> ==4616==by 0x43CE7E6: orte_init (orte_init.c:76)
> ==4616==by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342)
> ==4616==by 0x40A3444: PMPI_Init (pinit.c:80)
> ==4616==by 0x804875C: main (test.cpp:17)
> ==4616==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> ==4616==
> ==4616== Invalid read of size 4
> ==4616==at 0x4095772: ompi_comm_invalid (communicator.h:261)
> ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
> ==4616==by 0x8048770: main (test.cpp:18)
> ==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
> [denali:04616] *** Process received signal ***
> [denali:04616] Signal: Segmentation fault (11)
> [denali:04616] Signal code: Address not mapped (1)
> [denali:04616] Failing at address: 0x44a0
> [denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0]
> [denali:04616] [ 1]
> /users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f)
> [0x409581f]
> [denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771]
> [denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768]
> [denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681]
> [denali:04616] *** End of error message ***
> ==4616==
> ==4616== Invalid read of size 4
> ==4616==at 0x4095782: ompi_comm_invalid (communicator.h:261)
> ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
> ==4616==by 0x8048770: main (test.cpp:18)
> ==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
> 
> 
> The problem is that, now, I don't know where the issue comes from (is
> it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc
> broken?).
> 
> Any help would be highly appreciated.
> 
> Thanks,
> Catalin
> 
> 
> On Mon, Jul 6, 2009 at 3:36 PM, Catalin David 
> wrote:
> > On Mon, Jul 6, 2009 at 3:26 PM, jody wrote:
> >> Hi
> >> Are you also sure that you have the same version of Open-MPI
> >> on every machine of your cluster, and that it is the mpicxx of this
> >> version that is called when you run your program?
> >> I ask because you mentioned that there was an old version of Open-MPI
> >> present... die you remove this?
> >>
> >> Jody
> >
> > Hi
> >
> > I have just logged in a few other boxes and they all mount my home
> > folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get
> > what I expect to get, but this might be because I have set these
> > variables in the .bashrc file. So, I tried compiling/running like this
> >  ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace,
> > but I get the same errors.
> >
> > As for the previous version, I don't have root access, therefore I was
> > not able to remove it. I was just trying to outrun it by setting the
> > $PATH variable to point first at my local installation.
> >
> >
> > Catalin
> >
> >
> > --
> >
> > **
> > Catalin David
> > B.Sc. Computer Science 2010
> > Jacobs University Bremen
> >
> > Phone: +49-(0)1577-49-38-667
> >
> > College Ring 4, #343
> > Bremen, 28759
> > Germany
> > **
> >
> 
> 
> 
-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] quadrics support?

2009-07-02 Thread Ashley Pittman
On Thu, 2009-07-02 at 09:34 -0400, Michael Di Domenico wrote:
> Jeff,
> 
> Okay, thanks.  I'll give it a shot and report back.  I can't
> contribute any code, but I can certainly do testing...

I'm from the Quadrics stable so could certainty support a port should
you require it but I don't have access to hardware either currently.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Valgrind writev() errors with 1.3.2.

2009-06-09 Thread Ashley Pittman
On Mon, 2009-06-08 at 23:41 -0600, tom fogal wrote:
> George Bosilca  writes:
> > There is a whole page on valgrind web page about this topic. Please  
> > read http://valgrind.org/docs/manual/manual-core.html#manual-core.suppress 
> >   for more information.
> 
> Even better, Ralph (et al.) is if we could just make valgrind think
> this is defined memory.  One can do this with client requests:
> 
>   http://valgrind.org/docs/manual/mc-manual.html#mc-manual.clientreqs

Using the Valgrind client requests unnecessarily is a very bad idea,
they are intended for where applications use their own memory allocator
(i.e. replace malloc/free) or are using custom kernel modules or
hardware which Valgrind doesn't know about.

The correct solution is either to not send un-initialised memory or to
suppress the error using a suppression file as George said.  As the
error is from MPI_Init() you can safely ignore it from a end-user
perspective.

Ashley.

-- 

Ashley Pittman

Padb - A parallel job viewer for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Openmpi and processor affinity

2009-06-03 Thread Ashley Pittman
On Wed, 2009-06-03 at 11:27 -0400, Jeff Squyres wrote:
> On Jun 3, 2009, at 10:48 AM,  wrote:
> 
> > For HPL, try writing a bash script that pins processes to their  
> > local memory controllers using numactl before kicking off HPL.  This  
> > is particularly helpful when spawning more than 1 thread per  
> > process.  The last line of your script should look like "numactl -c  
> > $cpu_bind -m $ mem_bind $*".
> >
> > Believe it or not, I hit 94.5% HPL efficiency using this tactic on a  
> > 16 node cluster. Using processor affinity (various MPIs) my results  
> > were inconsistent and ranged between 88-93%
> >
> 
> If you're using multi-threaded HPL, that might be useful.  But if  
> you're not, I'd be surprised if you got any different results than  
> Open MPI binding itself.  If there really is a difference, we should  
> figure out why.  More specifically, calling numactl yourself should be  
> pretty much exactly what we do in OMPI (via API, not via calling  
> numactl).

Wasn't there a discussion about this recently on the list, OMPI binds
during MPI_Init() so it's possible for memory to be allocated on the
wrong quad, the discussion was about moving the binding to the orte
process as I recall?

From my testing of process affinity you tend to get much more consistent
results with it on and much more unpredictable results with it off; I'd
question whether it's working properly if you are seeing an 88-93% range in
the results.

Ashley Pittman.



Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Ashley Pittman
On Tue, 2009-05-19 at 14:01 -0400, Noam Bernstein wrote:

I'm glad you got to the bottom of it.

> With one of them, apparently, CP2K will silently go on if  
> the
> file is missing,  but then lock up in an MPI call (maybe it leaves
> some
> variables  uninitialized, and then uses them in the call to the MPI  
> function?).

More likely it takes a different path through the code and ends up
making mis-matched collective calls across processes.

Ashley,



Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Ashley Pittman
On Tue, 2009-05-19 at 11:01 -0400, Noam Bernstein wrote:

> I'd suspect the filesystem too, except that it's hung up in an MPI  
> call.  As I said
> before, the whole thing is bizarre.  It doesn't matter where the  
> executable is,
> just what CWD is (i.e. I can do mpirun /scratch/exec or mpirun /home/ 
> bernstei/exec,
> but if it's sitting in /scratch it'll hang).  And I've been running
> other codes both from NFS and from scratch directories for months,
> and never had a problem.

That is indeed odd but it shouldn't be too hard to track down, how often
does the failure occur?  Presumably when you say you have three
invocations of the program they communicate via files, is the location
of these files changing?

I assume you're certain it's actually hanging and not just failing to
converge?

Finally if you could run it with "--mca btl ^ofed" to rule out the ofed
stack causing the problem that would be useful.  You'd need to check the
syntax here.

> Using MVAPICH every process is stuck in a collective, but they're not  
> all the
> same collective (see stack traces below).  The 2 processes on the head  
> node
> are stuck on mpi_bcast, in various low level MPI routines.  The other 6
> processes are stuck on an mpi_allreduce, again in various low level mpi
> processes.  I don't know enough about the code to tell they're all  
> supposed
> to be part of the same communicator, and the fact that they're stuck on
> different collectives is suspicious.  I can look into that.

This isn't so suspicious, if there is a problem with some processes it's
common for other processes to continue till the next collective call.

Ashley,



Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Ashley Pittman
On Mon, 2009-05-18 at 17:05 -0400, Noam Bernstein wrote:
> The code is complicated, the input files are big and lead to long  
> computation
> times, so I don't think I'll be able to make a simple test case.   
> Instead
> I attached to the hanging processes (all 8 of them) with gdb
> during  the hang. The stack trace is below.  Nodes seem to spend most of
> their time in the  btl_openib_component_progress(), and occasionally in
> mca_pml_ob1_progress().  I.e. not completely stuck, but not making  
> progress.

Can you confirm that *all* processes are in PMPI_Allreduce at some
point, the collectives commonly get blamed for a lot of hangs and it's
not always the correct place to look.

> P.S. I get a similar hang with MVAPICH, in a nearby but different part  
> of the
> code (on an MPI_Bcast, specifically), increasing my tendency to believe
> that it's OFED's fault.  But maybe the stack trace will suggest to  
> someone
> where it might be stuck, and therefore perhaps an mca flag to try?

This strikes me as a filesystem problem more than MPI per se.  Again
with MVAPICH are all your processes in MPI_Bcast or just some of them?

Ashley,



Re: [OMPI users] [Fwd: mpi alltoall memory requirement]

2009-05-13 Thread Ashley Pittman
On Thu, 2009-04-23 at 07:12 +, viral@gmail.com wrote:
> Hi 
> Thanks for your response. 
> However, I am running 
> mpiexec  -ppn 24 -n 192 /opt/IMB-MPI1 alltaoll -msglen /root/temp 
> 
> And file /root/temp contains entry upto 65535 size only. That means
> alltoall test will run upto 65K size only
> 
> So, in that case I will require very less memory but then in that case
> also test is running out-of-memory. Please help someone to understand
> the scenario.
> Or do I need to switch to some algorithm or do I need to set some
> other environment variables ? or anything like that ?

I'm not sure but I seem to remember that IMB uses two application
buffers and alternates which one it uses, this itself will double the
memory requirement.  You should be able to plot performance against max
message size and see where the drop off occurs.

I've always used the compile options to specify max message size and rep
count, the -msglen option is not one I've seen before.

Ashley Pittman.



Re: [OMPI users] [Fwd: mpi alltoall memory requirement]

2009-04-22 Thread Ashley Pittman
On Wed, 2009-04-22 at 12:40 +0530, vkm wrote:

> The same amount of memory required for recvbuf. So at the least each 
> node should have 36GB of memory.
> 
> Am I calculating right ? Please correct.

Your calculation looks correct, the conclusion is slightly wrong
however.  The Application buffers will consume 36Gb of memory, the rest
of the application, any comms buffers and the usual OS overhead will be
on top of this so putting only 36Gb of ram in your nodes will still
leave you short.
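
For anyone checking the arithmetic, assuming the 192-rank, 24-rank-per-node
job mentioned elsewhere in this thread and the IMB default maximum message
size of 4 MiB, the per-node application buffers for an Alltoall alone come to:

  24 ranks/node x 192 peers x 4 MiB per peer = 18 GiB of send buffer per node
  plus the same again for the receive buffers = 36 GiB per node

which is where the 36GB figure comes from, and the library's own comms
buffering and the usual OS overhead sit on top of that.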

Ashley,



Re: [OMPI users] Collective operations and synchronization

2009-03-25 Thread Ashley Pittman
On Tue, 2009-03-24 at 07:03 -0800, Eugene Loh wrote:
> > Perhaps there is a better way of accomplishing the same thing however, 
> > MPI_Barrier syncronises all processes so is potentially a lot more 
> > heavyweight than it needs to be, in this example you only need to 
> > syncronise with your neighbours so it might be quicker to use a 
> > send/receive for each of your neighbours containing a true/false value 
> > rather than to rely on the existence of a message or not.  i.e. the 
> > barrier is needed because you don't know how many messages there are, 
> > it may well be quicker to have a fixed number of point to point 
> > messages rather than a extra global synchronisation.  The added 
> > advantage of doing it this way would be you could remove the Probe as 
> > well.
> 
> I'm not sure I understand this suggestion, so I'll say it the way I 
> understand it.  Would it be possible for each process to send an "all 
> done" message to each of its neighbors?  Conversely, each process would 
> poll its neighbors for messages, either processing graph operations or 
> collecting "all done" messages depending on whether the message 
> indicates a graph operation or signals "all done".

Exactly, that way you have a defined number of messages which can be
calculated locally for each process and hence there is no need to use
Probe and you can get rid of the MPI_Barrier call.

Ashley Pittman.



Re: [OMPI users] Collective operations and synchronization

2009-03-24 Thread Ashley Pittman


On 23 Mar 2009, at 23:36, Shaun Jackman wrote:

loop {
MPI_Ibsend (for every edge of every leaf node)
MPI_barrier
MPI_Iprobe/MPI_Recv (until no messages pending)
MPI_Allreduce (number of nodes removed)
} until (no nodes removed by any node)

Previously, I attempted to use a single MPI_Allreduce without the 
MPI_Barrier:


You need both the MPI_Barrier and the synchronisation semantics of the 
MPI_Allreduce in this example; it's important that each send matches a 
recv for the same iteration, so you need to ensure all sends have been 
sent before you call probe, and a Barrier is one way of doing this.  You 
also need the synchronisation semantics of the Allreduce to ensure the 
Iprobe doesn't match a send from the next iteration of the loop.


Perhaps there is a better way of accomplishing the same thing however. 
MPI_Barrier synchronises all processes so is potentially a lot more 
heavyweight than it needs to be; in this example you only need to 
synchronise with your neighbours, so it might be quicker to use a 
send/receive for each of your neighbours containing a true/false value 
rather than to rely on the existence of a message or not.  i.e. the 
barrier is needed because you don't know how many messages there are, 
and it may well be quicker to have a fixed number of point to point 
messages rather than an extra global synchronisation.  The added 
advantage of doing it this way would be you could remove the Probe as 
well.
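
A rough sketch of what that looks like (untested; the neighbour list is
assumed to be known in advance and the names are made up):

#include <stdlib.h>
#include <mpi.h>

/* One round of the exchange: every process sends exactly one value
 * (e.g. a true/false "removed something" flag) to each neighbour, so the
 * number of receives is known in advance and no Probe or Barrier is
 * needed. */
void exchange_with_neighbours(MPI_Comm comm, const int *neighbour,
                              int nneigh, int my_flag, int *their_flags)
{
    MPI_Request *reqs = malloc(2 * nneigh * sizeof(MPI_Request));

    for (int i = 0; i < nneigh; i++)
        MPI_Irecv(&their_flags[i], 1, MPI_INT, neighbour[i], 0, comm,
                  &reqs[i]);
    for (int i = 0; i < nneigh; i++)
        MPI_Isend(&my_flag, 1, MPI_INT, neighbour[i], 0, comm,
                  &reqs[nneigh + i]);

    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}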


Potentially it would be possible to remove the Allreduce as well and 
use the tag to identify the iteration count, assuming of course you 
don't need to know the global number of branches at any iteration.  One 
problem with this approach can be that one process can get very slow 
and swamped with unexpected messages however assuming your neighbour 
count is small this shouldn't be a problem.  I'd expect there to not 
only be a net gain from changing to this way but for the application to 
scale better as well.


Finally I've always favoured iRecv/Send over Ibsend/Recv as in the 
majority of cases this tends to be faster, you'd have to benchmark your 
specific setup however.


Ashley,



Re: [OMPI users] Collective operations and synchronization

2009-03-23 Thread Ashley Pittman


On 23 Mar 2009, at 21:11, Ralph Castain wrote:
Just one point to emphasize - Eugene said it, but many times people 
don't fully grasp the implication.


On an MPI_Allreduce, the algorithm requires that all processes -enter- 
the call before anyone can exit.


It does -not- require that they all exit at the same time.

So if you want to synchronize on the -exit-, as your question 
indicated, then you must add the MPI_Barrier as you describe.


All MPI_Barrier requires is that all processes enter the call before 
anyone can exit, I'm not sure that "synchronising on exit" has any 
particular meaning at all.


Putting an MPI_Barrier call immediately after an MPI_Allreduce call would 
be superfluous.


Ashley,



Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-06 Thread Ashley Pittman


On 5 Mar 2009, at 15:25, Jeff Squyres wrote:
I don't remember who originally said it, but I've repeated the 
statement: any MPI program that relies on a barrier for correctness is 
an incorrect MPI application.


I'm not 100% sure this holds although it's a good rule of thumb, I've 
certainly written programs which need barriers but that's using 
one-sided comms so is slightly different.


There's anecdotal evidence that throwing in a barrier every once in a 
while can help reduce unexpected messages (and other things), and 
therefore improve performance a bit.  But that's very application 
dependent, and usually not frequent.


I've seen this a number of times: a number of algorithms work fairly 
well as long as things are vaguely in sync but slow down drastically if 
they are not, without barriers there is no way to recover from this 
slowdown.  Basically if one rank is slow for whatever reason other 
ranks try to communicate with it and the unexpected messages cause it 
to slow down further and you get a positive feedback loop.


I sometimes feel that Barriers have a bad reputation and maybe it is 
because they can be used to hide sloppy coding and allow incorrect MPI 
applications to run, I don't see that as a reason not to use them 
however, just be sure you need one.


On 5 Mar 2009, at 15:52, Shanyuan Gao wrote:
My current research is trying to rewrite some collective MPI 
operations to work with our system.  Barrier is my first step, maybe I 
will have bcast and reduce in the future.  I understand that some 
applications used too many unnecessary barriers.  But here what I want 
is just an application to measure the performance improvement versus 
normal MPI_Barrier.  And the improvement can only be measured if the 
barriers are executed many times.  I have done some synthetic tests, 
all I need now are real applications.


I've done a lot of work on Barrier and on collectives in general; my 
advice would be to implement a non-blocking barrier.  Barriers can be 
slow and *always* delay the application for the duration of the 
barrier.  If you can write a non-blocking barrier and pipeline it with 
your application steps then, assuming the application is working well, 
the CPU cost of the barrier is almost zero (I got it down to .15uS), and 
if the application isn't working well then the barrier will still bring 
it back in step.
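
The shape of what I mean, expressed here with the MPI_Ibarrier call from
MPI-3 (which post-dates this thread; at the time you would have to roll
your own non-blocking barrier), and with placeholder work functions:

#include <mpi.h>

static void do_work_part_one(void) { /* application work, placeholder */ }
static void do_work_part_two(void) { /* more application work, placeholder */ }

/* Pipeline a non-blocking barrier with useful work: start the barrier,
 * keep computing, and only wait on it when the next step genuinely needs
 * everyone to have arrived. */
void pipelined_step(MPI_Comm comm)
{
    MPI_Request barrier_req;

    do_work_part_one();

    MPI_Ibarrier(comm, &barrier_req);           /* start the barrier, don't block */

    do_work_part_two();                         /* overlapped with the barrier */

    MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);  /* now everyone has arrived */
}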


Another interesting challenge is to benchmark MPI_Barrier, it's not as 
easy as you might think...


Ashley Pittman.



Re: [OMPI users] OpenMPI and Valgrind

2009-02-12 Thread Ashley Pittman


On 12 Feb 2009, at 15:53, Reuben D. Budiardja wrote:

Hello,
I am having problem that if a program is compiled with OpenMPI, 
Valgrind

doesn't work correctly, i.e: it does not show the memory leak like it
supposed too. The same test program compiled with regular "gfortran" 
and run

under Valgrind will show the memory leak.


Not only will Valgrind not show the memory leak but it also won't show 
buffer over-runs; as it doesn't understand the allocator it will assume 
all memory handled by the allocator is readable/writeable even if it is 
a redzone or hasn't been allocated.


I search the list archive and found this post here, which exactly 
described my
problem: 
http://www.open-mpi.org/community/lists/users/2008/07/6058.php,

but I don't understand if there is resolution to it.


It's worth reading the whole of that thread, in particular

http://www.open-mpi.org/community/lists/users/2008/07/6076.php

I am using OpenMPI-1.2.8 with all the default configure option. What 
should I

do to be able use Valgrind with program compiled by OpenMPI ?


From memory and reading the above links (i.e. untested advice):
1) Use OpenMPI 1.3, where the default is not to include this.
2) Configure Open MPI 1.2.8 with the --disable-memory-manager option.
3) Compile your application without the -lopen-pal option.  In practice 
this means running "mpicc --show" and cut-and-pasting the underlying 
compile line, minus -lopen-pal, into your application build procedure 
(illustrative command lines below).  I was able to do this for hello 
world but I can imagine it'll be a real PITA for anything larger.
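
A hedged illustration of option 3, assuming the "--showme" spelling of 
the wrapper's show option and with purely indicative flags (real output 
will differ between installations):

$ mpicc --showme
gcc -I/opt/openmpi/include -pthread -L/opt/openmpi/lib -lmpi -lopen-rte -lopen-pal -ldl -lutil -lm

$ gcc my_app.c -o my_app -I/opt/openmpi/include -pthread \
    -L/opt/openmpi/lib -lmpi -lopen-rte -ldl -lutil -lm

i.e. the second line is the line the wrapper prints with -lopen-pal 
removed.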


I'm experimenting with Open MPI and Valgrind at the moment; if you are 
still having problems I may be able to help further.


Ashley,



Re: [OMPI users] Supporting OpenMPI compiled for multiple compilers

2009-02-11 Thread Ashley Pittman


On 11 Feb 2009, at 14:13, Prentice Bisbal wrote:


Douglas Guptill wrote:
Thanks. I did end up building for all the compilers under separate
trees. It looks like the --exec-prefix option is only of use if your
compiling 32-bit and 64-bit versions using the same compiler.


This is what I decided to do when I was packaging up Open MPI; it's not 
ideal, but it's the only way I could think of doing it.  If there is a 
better way I'd be eager to hear it.


Ashley.



Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Ashley Pittman
On Fri, 2009-01-23 at 06:51 -0500, Jeff Squyres wrote:
> > This behaviour sometimes can cause some problems with a lot of
> > processors in the jobs.

> Can you describe what exactly you mean?  The MPI spec specifically  
> allows this behavior; OMPI made specific design choices and  
> optimizations to support this behavior.  FWIW, I'd be pretty surprised  
> if any optimized MPI implementation defaults to fully synchronous  
> collective operations.

As Jeff says the spec encourages the kind of behaviour you describe.  I
have, however, seen this causing problems in applications before, and it's
not uncommon for adding barriers to improve the performance of an
application.  You might find that it's better to add barriers after
every N collectives rather than after every single collective, along
the lines of the sketch below.
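
A minimal sketch of that pattern (the collective, the buffer and the
value of N are purely illustrative):

#include <mpi.h>

void iterate(double *buf, int count, int niter)
{
    const int N = 50;                    /* re-synchronise every N collectives */

    for (int i = 0; i < niter; i++) {
        MPI_Bcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        if ((i + 1) % N == 0)
            MPI_Barrier(MPI_COMM_WORLD); /* occasional explicit re-sync */
    }
}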

> > Is there an OpenMPI parameter to lock all process in the collective
> > call until is finished? Otherwise  i have to insert many MPI_Barrier
> > in my code and it is very tedious and strange..
> 
> As you have notes, MPI_Barrier is the *only* collective operation that  
> MPI guarantees to have any synchronization properties

AllGather, AllReduce and AlltoAll also have an implicit barrier by
virtue of the dataflow required: all processes need input from all other
processes before they can return.

Ashley Pittman.



Re: [OMPI users] How to know which task on which node

2009-01-19 Thread Ashley Pittman
On Mon, 2009-01-19 at 12:50 +0530, gaurav gupta wrote:
> Hello,
> 
> I want to know that which task is running on which node. Is there any
> way to know this. 

From where?  From the command line, outside of a running job, the new
open-ps command in v1.3 will give you this information.  In 1.2 it's a
little more difficult to get at, IIRC.

Ashley,



Re: [OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Ashley Pittman
On Sat, 2008-10-18 at 00:16 +0900, Raymond Wan wrote:
> 
> Is there a package that I neglected to install?  I did an "aptitude 
> search openmpi" and installed everything listed...  :-)  Or perhaps I 
> haven't removed all trace of mpich?

According to packages.debian.org there isn't an openmpi package which
contains mpirun, which, as you note, isn't what you would expect.  There
is an orterun command, however, which you could use instead.

The Etch version of openmpi is very old; openmpi has made a lot of
progress since 1.1-2.3, so I'd recommend building from source if you are
able to.

Ashley.



Re: [OMPI users] Performance: MPICH2 vs OpenMPI

2008-10-08 Thread Ashley Pittman
On Wed, 2008-10-08 at 09:46 -0400, Jeff Squyres wrote:
> - Have you tried compiling Open MPI with something other than GCC?   
> Just this week, we've gotten some reports from an OMPI member that  
> they are sometimes seeing *huge* performance differences with OMPI  
> compiled with GCC vs. any other compiler (Intel, PGI, Pathscale).
> We  
> are working to figure out why; no root cause has been identified yet.

Jeff,

You probably already know this, but the obvious candidate here is the
memcpy() function: icc sticks in its own, which in some cases is much
better than the libc one.  It's unusual for compilers to produce *huge*
differences from code optimisations alone.

Ashley,



Re: [OMPI users] problem with alltoall with ppn=8

2008-08-16 Thread Ashley Pittman
On Sat, 2008-08-16 at 08:03 -0400, Jeff Squyres wrote:
> - large all to all operations are very stressful on the network, even  
> if you have very low latency / high bandwidth networking such as DDR IB
> 
> - if you only have 1 IB HCA in a machine with 8 cores, the problem  
> becomes even more difficult because all 8 of your MPI processes will  
> be hammering the HCA with read and write requests; it's a simple I/O  
> resource contention issue

That alone doesn't explain the sudden jump (drop) in performance
figures.

> - there are several different algorithms in Open MPI for performing  
> alltoall, but they were not tuned for ppn>4 (honestly, they were tuned  
> for ppn=1, but they still usually work "well enough" for ppn<=4).  In  
> Open MPI v1.3, we introduce the "hierarch" collective module, which  
> should greatly help with ppn>4 kinds of scenarios for collectives  
> (including, at least to some degree, all to all)

Is there a way to tell or influence which algorithm is used in the
current case?  Looking through the code I can see several but cannot see
how to tune the thresholds.

Ashley.



Re: [OMPI users] Heap profiling with OpenMPI

2008-08-05 Thread Ashley Pittman

One tip is to use the --log-file=valgrind.out.%q{OMPI_MCA_ns_nds_vpid}
option to valgrind, which will name the output file according to rank.
In the 1.3 series the variable has changed from OMPI_MCA_ns_nds_vpid to
OMPI_COMM_WORLD_RANK.
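
For example (the application name is a placeholder, and this relies on
valgrind being recent enough to support the %q{} expansion):

mpirun -np 4 valgrind --log-file=valgrind.out.%q{OMPI_MCA_ns_nds_vpid} ./my_app

which should leave one log file per rank, valgrind.out.0 through
valgrind.out.3.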

Ashley.

On Tue, 2008-08-05 at 17:51 +0200, George Bosilca wrote: 
> Jan,
> 
> I'm using valgrind with Open MPI on a [very] regular basis and I never  
> had any problems. I usually want to know the execution path on the MPI  
> applications. For this I use:
> mpirun -np XX valgrind --tool=callgrind -q --log-file=some_file ./my_app
> 
> I just run your example:
>  mpirun -np 2 -bynode --mca btl tcp,self valgrind --tool=massif - 
> q ./NPmpi -u 4
> and I got 2 non empty files in the current directory:
>  bosilca@dancer:~/NetPIPE_3.6.2$ ls -l massif.out.*
>  -rw--- 1 bosilca bosilca 140451 2008-08-05 11:57 massif.out. 
> 21197
>  -rw--- 1 bosilca bosilca 131471 2008-08-05 11:57 massif.out. 
> 21210
> 
>george.




Re: [OMPI users] Missing F90 modules

2008-07-31 Thread Ashley Pittman
On Wed, 2008-07-30 at 10:45 -0700, Scott Beardsley wrote:
> I'm attempting to move to OpenMPI from another MPICH-derived 
> implementation. I compiled openmpi 1.2.6 using the following configure:
> 
> ./configure --build=x86_64-redhat-linux-gnu 
> --host=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux-gnu 
> --program-prefix= --prefix=/usr/mpi/pathscale/openmpi-1.2.6 
> --exec-prefix=/usr/mpi/pathscale/openmpi-1.2.6 
> --bindir=/usr/mpi/pathscale/openmpi-1.2.6/bin 
> --sbindir=/usr/mpi/pathscale/openmpi-1.2.6/sbin 
> --sysconfdir=/usr/mpi/pathscale/openmpi-1.2.6/etc 
> --datadir=/usr/mpi/pathscale/openmpi-1.2.6/share 
> --includedir=/usr/mpi/pathscale/openmpi-1.2.6/include 
> --libdir=/usr/mpi/pathscale/openmpi-1.2.6/lib64 
> --libexecdir=/usr/mpi/pathscale/openmpi-1.2.6/libexec 
> --localstatedir=/var 
> --sharedstatedir=/usr/mpi/pathscale/openmpi-1.2.6/com 
> --mandir=/usr/mpi/pathscale/openmpi-1.2.6/share/man 
> --infodir=/usr/share/info --with-openib=/usr 
> --with-openib-libdir=/usr/lib64 CC=pathcc CXX=pathCC F77=pathf90 
> FC=pathf90 --with-psm-dir=/usr --enable-mpirun-prefix-by-default 
> --with-mpi-f90-size=large

Nothing to do with Fortran, but I think I'm right in saying that a lot
of these command-line options aren't needed: you simply set --prefix and
the rest of the options default to paths relative to it.
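
For example, something close to the following (untested; any option
whose value genuinely differs from the default, such as --libdir, would
still need to be given explicitly) should be equivalent:

./configure --prefix=/usr/mpi/pathscale/openmpi-1.2.6 \
    CC=pathcc CXX=pathCC F77=pathf90 FC=pathf90 \
    --with-openib=/usr --with-openib-libdir=/usr/lib64 --with-psm-dir=/usr \
    --enable-mpirun-prefix-by-default --with-mpi-f90-size=large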

Ashley.



Re: [OMPI users] Valgrind Functionality

2008-07-14 Thread Ashley Pittman
On Sun, 2008-07-13 at 09:16 -0400, Jeff Squyres wrote:
> On Jul 13, 2008, at 9:11 AM, Tom Riddle wrote:
> 
> > Does anyone know if this feature has been incorporated yet? I did a
> > ./configure --help but do not see the enable-ptmalloc2-internal  
> > option.
> >
> > - The ptmalloc2 memory manager component is now by default built as
> >   a standalone library named libopenmpi-malloc.  Users wanting to
> >   use leave_pinned with ptmalloc2 will now need to link the library
> >   into their application explicitly.  All other users will use the
> >   libc-provided allocator instead of Open MPI's ptmalloc2.  This  
> > change
> >   may be overriden with the configure option enable-ptmalloc2-internal
> >   --> Expected: 1.3
> 
> This is on the trunk/v1.3 branch, yes.
> 
> The default in v1.3 is that ptmalloc2 is *not* built into libopen- 
> pal.  This is different than v1.2, where ptmalloc2 *was* included in  
> libopen-pal unless you specified --disable-memory-manager.

Thank you for clearing that up, Jeff.  What is the cost of using this
option?  The comments in the code led me to believe this was more to do
with pinning memory than anything else.

Would it be advisable to add an mpicc option to enable and disable
linking this library?  With 1.2.6 I was successfully able to compile and
run an application without it by simply changing the gcc compile line.

Ashley,



Re: [OMPI users] Valgrind Functionality

2008-07-11 Thread Ashley Pittman
On Tue, 2008-07-08 at 18:01 -0700, Tom Riddle wrote:
> Thanks Ashley, after going through your suggestions we tried our test
> with valgrind 3.3.0 and with glibc-devel-2.5-18.el5_1.1, both exhibit
> the same results. A simple non-MPI test prog however returns expected
> responses, so valgrind itself look ok. We then checked that the same
> (shared) libc gets linked in both the MPI and non-MPI cases, adding
> -pthread to the cc command line yields the same result, the only
> difference it appears is the open mpi libraries.
> 
> Now mpicc links against libopen-pal which defines malloc for it's own
> purposes. The big difference seems to be that libopen-pal.so is
> providing its own malloc replacement 

This will be the problem.  I've tested on an openmpi (1.2.6) machine
here and I see exactly the same behaviour as you.  I re-compiled the
application without libopen-pal and valgrind works as expected.  To do
this I used mpicc -show to see what compile line it was using and ran
the command myself without the -lopen-pal option.  This clearly isn't an
acceptable long-term solution, but it might help you in the short term.

I'm an MPI expert, but I normally work on a different MPI to openmpi; I
have, however, done a lot of work with Valgrind on different platforms,
so I pick up questions related to it here.  I think this problem is
going to need input from one of the openmpi guys...

The problem seems to be that the presence of malloc() and free()
functions in the libopen-pal library prevents valgrind from intercepting
these functions in glibc, and hence dramatically reduces the benefit
which valgrind brings.

Ashley Pittman.



Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ashley Pittman
On Fri, 2008-07-11 at 08:01 -0600, Ralph H Castain wrote:
> >> I believe this is partly what motivated the creation of the MPI envars - to
> >> create a vehicle that -would- be guaranteed stable for just these purposes.
> >> The concern was that users were doing things that accessed internal envars
> >> which we changed from version to version. The new envars will remain fixed.
> > 
> > Absolutely, these are useful time and time again so should be part of
> > the API and hence stable.  Care to mention what they are and I'll add it
> > to my note as something to change when upgrading to 1.3 (we are looking
> > at testing a snapshot in the near future).
> 
> Surely:
> 
> OMPI_COMM_WORLD_SIZE#procs in the job
> OMPI_COMM_WORLD_LOCAL_SIZE  #procs in this job that are sharing the node
> OMPI_UNIVERSE_SIZE  total #slots allocated to this user
> (across all nodes)
> OMPI_COMM_WORLD_RANKproc's rank
> OMPI_COMM_WORLD_LOCAL_RANK  local rank on node - lowest rank'd proc on
> the node is given local_rank=0
> 
> If there are others that would be useful, now is definitely the time to
> speak up!

The only other one I'd like to see is some kind of global identifier for
the job, but as far as I can see openmpi doesn't have such a concept.

Ashley Pittman.



Re: [OMPI users] Outputting rank and size for all outputs.

2008-07-11 Thread Ashley Pittman
On Fri, 2008-07-11 at 07:59 -0600, Ralph H Castain wrote:
> Not until next week's meeting, but I would guess we would simply prepend the
> rank. The issue will be how often to tag the output since we write it in
> fragments to avoid blocking - so do we tag the fragment, look for newlines
> and tag each line, etc.

I don't know if you are familiar with it, but pdsh is a very useful
parallel shell that prefixes its output with a "$HOSTNAME: " syntax.
Alongside it there is a dshbak command which can take output in that
form and present it to the user in a number of different ways.  It would
be a nice bonus if openmpi were also able to benefit from this command.

There is a dshbak manpage on-line but unfortunately no examples.
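
For instance, a hedged example (dshbak reads "hostname: output" lines on
stdin, and -c coalesces hosts whose output is identical):

pdsh -w node[01-04] uname -r | dshbak -c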

Feel free to contact me on or off-list if you want an example or
further information.

Ashley,



Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ashley Pittman
On Fri, 2008-07-11 at 07:42 -0600, Ralph H Castain wrote:
> 
> 
> On 7/11/08 7:32 AM, "Ashley Pittman" 
> wrote:
> 
> > On Fri, 2008-07-11 at 07:20 -0600, Ralph H Castain wrote:
> >> This variable is only for internal use and has no applicability to a user.
> >> Basically, it is used by the local daemon to tell an application process 
> >> its
> >> rank when launched.
> >> 
> >> Note that it disappears in v1.3...so I wouldn't recommend looking for it. 
> >> Is
> >> there something you are trying to do with it?
> > 
> > Recently on this list I recommended somebody use it for their needs.
> > 
> > http://www.open-mpi.org/community/lists/users/2008/06/5983.php
> 
> Ah - yeah, that one slid by me. I'll address it directly.

I was quite surprised that openmpi didn't have a command option for
this, actually; it's quite a common thing to use.

> >> Reason I ask: some folks wanted to know things like the MPI rank prior to
> >> calling MPI_Init, so we added a few MPI envar's that are available from
> >> beginning of process execution, if that is what you are looking for.
> > 
> > It's also essential for Valgrind support which can use it to name
> > logfiles according to rank using the --log-file=valgrind.out.%
> > q{OMPI_MCA_ns_nds_vpid} option.
> 
> Well, it won't hurt for now - but it won't work with 1.3 or beyond. It's
> always risky to depend upon a code's internal variables as developers feel
> free to change those as circumstances dictate since users aren't supposed to
> be affected.
> 
> I believe this is partly what motivated the creation of the MPI envars - to
> create a vehicle that -would- be guaranteed stable for just these purposes.
> The concern was that users were doing things that accessed internal envars
> which we changed from version to version. The new envars will remain fixed.

Absolutely, these are useful time and time again so should be part of
the API and hence stable.  Care to mention what they are and I'll add it
to my note as something to change when upgrading to 1.3 (we are looking
at testing a snapshot in the near future).

Ashley Pittman.



Re: [OMPI users] Query regarding OMPI_MCA_ns_nds_vpid env variable

2008-07-11 Thread Ashley Pittman
On Fri, 2008-07-11 at 07:20 -0600, Ralph H Castain wrote:
> This variable is only for internal use and has no applicability to a user.
> Basically, it is used by the local daemon to tell an application process its
> rank when launched.
> 
> Note that it disappears in v1.3...so I wouldn't recommend looking for it. Is
> there something you are trying to do with it?

Recently on this list I recommended somebody use it for their needs.

http://www.open-mpi.org/community/lists/users/2008/06/5983.php

> Reason I ask: some folks wanted to know things like the MPI rank prior to
> calling MPI_Init, so we added a few MPI envar's that are available from
> beginning of process execution, if that is what you are looking for.

It's also essential for Valgrind support, which can use it to name
logfiles according to rank using the
--log-file=valgrind.out.%q{OMPI_MCA_ns_nds_vpid} option.

Ashley,



Re: [OMPI users] Valgrind Functionality

2008-07-08 Thread Ashley Pittman
On Mon, 2008-07-07 at 19:09 -0700, Tom Riddle wrote:
> 
> I was attempting to get valgrind working with a simple MPI app
> (osu_latency) on OpenMPI. While it appears to report uninitialized
> values it fails to report any mallocs or frees that have been
> conducted. 

The normal reason for this is either using static applications or having
a very stripped glibc.  It doesn't appear you've done the former, as you
are linking in libpthread, but the latter is a possibility; you might
benefit from installing the glibc-devel package.  I don't recall RHEL
being the worst offender at stripping libc, however.

> I am using RHEL 5, gcc 4.2.3 and a drop from the repo labeled
> openmpi-1.3a1r18303. configured with  
> 
>  $ ../configure --prefix=/opt/wkspace/openmpi-1.3a1r18303 CC=gcc 
> CXX=g++ --disable-mpi-f77 --enable-debug --enable-memchecker 
> --with-psm=/usr/include --with-valgrind=/opt/wkspace/valgrind-3.3.0/

> As the FAQ's suggest I am running a later version of valgrind,
> enabling the memchecker and debug. I tested a slightly modified
> osu_latency test which has a simple char buffer malloc and free but
> the valgrind summary shows no malloc/free activity whatsoever. This is
> running on a dual node system using Infinipath HCAs.  Here is a
> trimmed output.

Although you configured openmpi with what appears to be valgrind 3.3.0,
the version of valgrind you are actually running is 3.2.1; perhaps you
want to specify the full path to valgrind on the mpirun command line?

> [tom@lab01 ~]$ mpirun --mca pml cm -np 2 --hostfile my_hostfile
> valgrind ./osu_latency1 
> ==17839== Memcheck, a memory error detector.
> ==17839== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et
> al.
> ==17839== Using LibVEX rev 1658, a library for dynamic binary
> translation.
> ==17839== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> ==17839== Using valgrind-3.2.1, a dynamic binary instrumentation
> framework.
> ==17839== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et
> al.
> ==17839== For more details, rerun with: -v

Ashley Pittman.



Re: [OMPI users] Outputting rank and size for all outputs.

2008-06-24 Thread Ashley Pittman

If you are using the openmpi mpirun then you can put the following in a
wrapper script which will prefix stdout in a manner similar to what you
appear to want.  Simply add the wrapper script before the name of your
application.

Is this the kind of thing you were aiming for?  I'm quite surprised
mpirun doesn't have an option for this, actually; it's a fairly common
thing to want.

Ashley Pittman.

#!/bin/sh
# Prefix each line of the wrapped command's stdout with [rk:<rank>,sz:<nprocs>].

"$@" | sed "s/^/[rk:$OMPI_MCA_ns_nds_vpid,sz:$OMPI_MCA_ns_nds_num_procs]/"

On Tue, 2008-06-24 at 11:06 -0400, Mark Dobossy wrote:
> Lately I have been doing a great deal of MPI debugging.  I have, on an  
> occasion or two, fallen into the trap of "Well, that error MUST be  
> coming from rank X.  There is no way it could be coming from any other  
> rank..."  Then proceeding to debug what's happening at rank X, only to  
> find out a few frustrating hours later that rank Y is throwing the  
> output (I'm sure no one else out there has fallen into this trap).  It  
> was at that point, I decided to write up some code to automatically  
> (sort of) output the rank and size of my domain with every output.  I  
> write mostly in C++, and this is what I came up with:
> 
> #include <mpi.h>
> #include <iostream>
> 
> std::ostream &mpi_info(std::ostream &s) {
>   int rank, size;
>   rank = MPI::COMM_WORLD.Get_rank();
>   size = MPI::COMM_WORLD.Get_size();
>   s << "[rk:" << rank << ",sz:" << size << "]: ";
>   return s;
> }
> 
> Then in my code, I have changed:
> 
> std::cerr << "blah" << std::endl;
> 
> to:
> 
> std::cerr << mpi_info << "blah" << std::endl;
> 
> (or cout, or file stream, etc...)
> 
> where "blah" is some amazingly informative error message.
> 
> Are there other ways people do this?  Simpler ways perhaps?
> 
> -Mark
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Different CC for orte and opmi?

2008-06-10 Thread Ashley Pittman

Sorry, I'll try and fill in the background.  I'm attempting to package
openmpi for a number of customers we have; whenever possible on our
clusters we use modules to provide users with a choice of MPI
environment.

I'm using the 1.2.6 stable release and have built the code twice, once
to /opt/openmpi-1.2.6/gnu and once to /opt/openmpi-1.2.6/intel.  I have
created two module environments called openmpi-gnu and openmpi-intel and
am also using an existing one called intel-compiler.  The build was
successful in both cases.

If I load the openmpi-gnu module I can compile and run code using
mpicc/mpirun as expected; if I load openmpi-intel and intel-compiler I
find I can compile code, but I get an error about missing libimf.so when
I try to run it (reproduced below).

The application *will* run if I add the line "module load
intel-compiler" to my bashrc as this allows orted to link.  What I think
I want to do is to compile the actual library with icc but to compile
orted with gcc so that I don't need to load the intel environment by
default.  I'm assuming that the link problems only exist with orted and
not with the actual application as the LD_LIBRARY_PATH is set correctly
in the shell which is launching the program.

Ashley Pittman.

sccomp@demo4-sles-10-1-fe:~/benchmarks/IMB_3.0/src> mpirun -H comp00,comp01 
./IMB-MPI1
/opt/openmpi-1.2.6/intel/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory
/opt/openmpi-1.2.6/intel/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory
[demo4-sles-10-1-fe:29303] ERROR: A daemon on node comp01 failed to start as 
expected.
[demo4-sles-10-1-fe:29303] ERROR: There may be more information available from
[demo4-sles-10-1-fe:29303] ERROR: the remote shell (see above).
[demo4-sles-10-1-fe:29303] ERROR: The daemon exited unexpectedly with status 
127.
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 275
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1166
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c 
at line 90
[demo4-sles-10-1-fe:29303] ERROR: A daemon on node comp00 failed to start as 
expected.
[demo4-sles-10-1-fe:29303] ERROR: There may be more information available from
[demo4-sles-10-1-fe:29303] ERROR: the remote shell (see above).
[demo4-sles-10-1-fe:29303] ERROR: The daemon exited unexpectedly with status 
127.
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 188
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1198
--
mpirun was unable to cleanly terminate the daemons for this job. Returned value 
Timeout instead of ORTE_SUCCESS.
--

$ ldd /opt/openmpi-1.2.6/intel/bin/orted
linux-vdso.so.1 =>  (0x7fff877fe000)
libopen-rte.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-rte.so.0 
(0x7fe97f3ac000)
libopen-pal.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-pal.so.0 
(0x7fe97f239000)
libdl.so.2 => /lib64/libdl.so.2 (0x7fe97f135000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x7fe97f01f000)
libutil.so.1 => /lib64/libutil.so.1 (0x7fe97ef1c000)
libm.so.6 => /lib64/libm.so.6 (0x7fe97edc7000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fe97ecba000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7fe97eba3000)
libc.so.6 => /lib64/libc.so.6 (0x7fe97e972000)
libimf.so => /opt/intel/compiler_10.1/x86_64/lib/libimf.so 
(0x7fe97e61)
libsvml.so => /opt/intel/compiler_10.1/x86_64/lib/libsvml.so 
(0x7fe97e489000)
libintlc.so.5 => /opt/intel/compiler_10.1/x86_64/lib/libintlc.so.5 
(0x7fe97e35)
/lib64/ld-linux-x86-64.so.2 (0x7fe97f525000)
$ ssh comp00 ldd /opt/openmpi-1.2.6/intel/bin/orted
libopen-rte.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-rte.so.0 
(0x2b1f0c0c5000)
libopen-pal.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-pal.so.0 
(0x2b1f0c23e000)
libdl.so.2 => /lib64/libdl.so.2 (0x2b1f0c3bc000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x2b1f0c4c)
libutil.so.1 => /lib64/libutil.so.1 (0x2b1f0c5d7000)
libm.so.6 => /lib64/libm.so.6 (0x2b1f0c6da000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b1f0c82f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x2b1f0c93d000)
libc.so.6 => /lib64/libc.so.6 (0x2b1f0ca54000)
/lib64/ld-linux-x86-64.so.2 (0x2b1f0bfa9000)
libimf.so => not found
libsvml.so =>

Re: [OMPI users] Different CC for orte and opmi?

2008-06-09 Thread Ashley Pittman

Putting to side any religious views I might have about static linking
how would that help in this case?   It appears to be orted itself that
fails to link, I'm assuming that the application would actually run,
either because the LD_LIBRARY_PATH is set correctly on the front end or
the --prefix option to mpirun.

Or do you mean static linking of the tools?  I could go for that if
there is a configure option for it.

Ashley Pittman.

On Mon, 2008-06-09 at 08:27 -0700, Doug Reeder wrote:
> Ashley,
> 
> It could work but I think you would be better off to try and  
> statically link the intel libraries.
> 
> Doug Reeder
> On Jun 9, 2008, at 4:34 AM, Ashley Pittman wrote:
> 
> >
> > Is there a way to use a different compiler for the orte component and
> > the shared library component when using openmpi?  We are finding  
> > that if
> > we use icc to compile openmpi then orted fails with link errors when I
> > try and launch a job as the intel environment isn't loaded by default.
> >
> > We use the module command heavily and have modules for openmpi-gnu and
> > openmpi-intel as well as a intel_compiler module.  To use openmpi- 
> > intel
> > we have to load intel_compiler by default on the compute nodes which
> > isn't ideal, is it possible to compile the orte component with gcc and
> > the library component with icc?
> >
> > Yours,
> >
> > Ashley Pittman,
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] Different CC for orte and opmi?

2008-06-09 Thread Ashley Pittman

Is there a way to use a different compiler for the orte component and
the shared library component when using openmpi?  We are finding that if
we use icc to compile openmpi then orted fails with link errors when I
try and launch a job as the intel environment isn't loaded by default.

We use the module command heavily and have modules for openmpi-gnu and
openmpi-intel as well as an intel_compiler module.  To use openmpi-intel
we have to load intel_compiler by default on the compute nodes, which
isn't ideal; is it possible to compile the orte component with gcc and
the library component with icc?

Yours,

Ashley Pittman,



[OMPI users] File download sizes

2008-05-30 Thread Ashley Pittman

I notice that on the download page all the file sizes are listed as 0KB;
this is presumably an error somewhere.

http://www.open-mpi.org/software/ompi/v1.2/

Ashley,


