date:20090406

Re: [OMPI users] 'orte_ess_base_select failed'

2009-04-06 Thread Russell McQueeney


Jeff Squyres wrote:
Run with "--mca ess_base_verbose 1000" on the mpirun command line and 
send the output, such as:


  mpirun --mca ess_base_verbose 1000 rest of your command here...


On Mar 30, 2009, at 5:33 PM, Russell McQueeney wrote:


I only invoked orted manually to see the error message, as it wasn't
showing up on the node's monitor or the xterm window i used to run
mpirun.  And no, no prefix command, no aliases, no absolute path,
environment variables set.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Sorry, I was away for a few days.  Anyway, here's the verbose output.



a.doc
Description: MS-Word document

[OMPI users] Interaction between Intel and OpenMPI floating point exceptions

2009-04-06 Thread Steve Lowder

Recently I've been running an MPI code that uses the LAPACK slamch 
routine to determine machine precision parameters.  This software is 
compiled using the latest Intel Fortran compiler and setting the -fpe0 
argument to watch for certain  floating point errors.  The slamch 
routines crashed and printed an OpenMPI stacktrace to report an 
underflow error, however the Intel -fpe0 setting doesn't abort on 
underflow.  When this software is not compiled and linked with OpenMPI, 
it ignores the underflow and doesn't abort when compiled with  -fpe0.


When I run the MPI version and set --mca opal_signal 6,7,11 the code 
doesn't abort on underflow.  I'd like to know if I'm interpreting this 
behavior correctly, it appears that the mpi versus no mpi cases handle 
underflow differently. I'm assuming OpenMPI has a handler that processes 
the interrupts ahead of the Fortran RTL, stopping execution.  Otherwise 
the Fortran RTL handler would just ignore the underflow.  Do I sort of 
understand what is going on here?  Is there another solution short of 
the --mca opal_signal switch?


thanks
Steve

Re: [OMPI users] Factor of 10 loss in performance with 1.3.x

2009-04-06 Thread Steve Kargl

On Mon, Apr 06, 2009 at 02:04:16PM -0700, Eugene Loh wrote:
> Steve Kargl wrote:
> 
> >I recently upgraded OpenMPI from 1.2.9 to 1.3 and then 1.3.1.
> >One of my colleagues reported a dramatic drop in performance
> >with one of his applications.  My investigation shows a factor
> >of 10 drop in communication over the memory bus.  I've placed
> >a figure that iilustrates the problem at 
> >
> >http://troutmask.apl.washington.edu/~kargl/ompi_cmp.jpg
> >
> >The legend in the figure has 'ver. 1.2.9  11 <--> 18'.  This
> >means communication between node 11 and node 18 over GigE 
> >ethernet in my cluster.  'ver. 1.2.9  20 <--> 20' means
> >communication between processes on node 20 where node 20 has
> >8 processors.  The image clearly shows
> >
> Not so clearly in my mind since I have trouble discriminating between 
> the colors and the overlapping lines and so on.  But I'll take your word 
> for it that the plot illustrates the point you are reporting.

OK.  I've removed the GigE results in the graph and plotted with
points as well as lines.  You'll see a red line by itself.  The
green and blue lines overlap.  The original data is now 

http://troutmask.apl.washington.edu/~kargl/ompi_cmp_new.jpg

> It appears that you used to have just better than 1-usec latency (which 
> is reasonable), but then it skyrocketed just over 10x with 1.3.  I did 
> some sm work, but that first appears in 1.3.2.

According to netpipe, I have

version 1.3.1
0: node20.cimu.org
1: node20.cimu.org
Latency: 0.09131
Sync Time: 0.18241
Now starting main loop

version 1.2.9
0: node20.cimu.org
1: node20.cimu.org
Latency: 0.00669
Sync Time: 0.01811

So, the latency has indeed gone up.

> The huge sm latencies are, so far as I know, inconsistent with
> everyone else's experience with 1.3.  Is there any chance you
> could rebuild all three versions and really confirm that the
> observed difference can actually be attributed to differences
> in the OMPI source code?  And/or run with "--mca btl 
> self,sm" to make sure that the on-node message passing is indeed using sm?
> 

The command lines I used are

/usr/local/openmpi-1.2.9/bin/mpicc -o z -O -static GetOpt.c netmpi.c
/usr/local/openmpi-1.2.9/bin/mpiexec -machinefile mf_ompi_2 -n 2 ./z

/usr/local/openmpi-1.3.1/bin/mpicc -o z -O -static GetOpt.c netmpi.c
/usr/local/openmpi-1.3.1/bin/mpiexec --mca btl self,sm -machinefile \
   mf_ompi_2 -n 2 ./z

There is no change in the results as can be seen at

http://troutmask.apl.washington.edu/~kargl/ompi_cmp_self.sm.jpg

The machinefile contains the single line 'node20.cimu.org slots=2'.


I can rebuild 1.2.9 and 1.3.1.  Is there any particular configure
options that I should enable/disable?

-- 
Steve

Re: [OMPI users] Factor of 10 loss in performance with 1.3.x

2009-04-06 Thread Eugene Loh


Steve Kargl wrote:


I recently upgraded OpenMPI from 1.2.9 to 1.3 and then 1.3.1.
One of my colleagues reported a dramatic drop in performance
with one of his applications.  My investigation shows a factor
of 10 drop in communication over the memory bus.  I've placed
a figure that iilustrates the problem at 


http://troutmask.apl.washington.edu/~kargl/ompi_cmp.jpg

The legend in the figure has 'ver. 1.2.9  11 <--> 18'.  This
means communication between node 11 and node 18 over GigE 
ethernet in my cluster.  'ver. 1.2.9  20 <--> 20' means

communication between processes on node 20 where node 20 has
8 processors.  The image clearly shows

Not so clearly in my mind since I have trouble discriminating between 
the colors and the overlapping lines and so on.  But I'll take your word 
for it that the plot illustrates the point you are reporting.


It appears that you used to have just better than 1-usec latency (which 
is reasonable), but then it skyrocketed just over 10x with 1.3.  I did 
some sm work, but that first appears in 1.3.2.  The huge sm latencies 
are, so far as I know, inconsistent with everyone else's experience with 
1.3.  Is there any chance you could rebuild all three versions and 
really confirm that the observed difference can actually be attributed 
to differences in the OMPI source code?  And/or run with "--mca btl 
self,sm" to make sure that the on-node message passing is indeed using sm?



that communication over
GigE is consistent among the versions of OpenMPI.  However, some
change in going from 1.2.9 to 1.3.x is causing a drop in
communication between processes on a single node.

Things to note.  Nodes 11, 18, and 20 are essentially idle
before and after a test.  configure was run with the same set
of options except with 1.3 and 1.3.1 I needed to disable ipv6:

 ./configure --prefix=/usr/local/openmpi-1.2.9 \
  --enable-orterun-prefix-by-default --enable-static
  --disable-shared

 ./configure --prefix=/usr/local/openmpi-1.3.1 \
  --enable-orterun-prefix-by-default --enable-static
  --disable-shared --disable-ipv6

 ./configure --prefix=/usr/local/openmpi-1.3.1 \
  --enable-orterun-prefix-by-default --enable-static
  --disable-shared --disable-ipv6

The operating system is FreeBSD 8.0 where nodes 18 and 20
are quad-core, dual-cpu opteron based systems and node 11
is a dual-core, dual-cpu opteron based system.  For additional
information, I've placed the output of ompi_info at

http://troutmask.apl.washington.edu/~kargl/ompi_info-1.2.9
http://troutmask.apl.washington.edu/~kargl/ompi_info-1.3.0
http://troutmask.apl.washington.edu/~kargl/ompi_info-1.3.1

Any hints on tuning 1.3.1 would be appreciated?

Re: [OMPI users] ssh MPi and program tests

2009-04-06 Thread Gus Correa


Hi Francesco


See answers inline.

Francesco Pietra wrote:

Hi Gus:
Partial quick answers below. I have reestablished the ssh connection
so that tomorrow I'll run the tests. Everything that relates to
running amber is on the "parallel computer", where I have access to
everything.

On Mon, Apr 6, 2009 at 7:53 PM, Gus Correa  wrote:

Hi Francesco, list

Francesco Pietra wrote:

On Mon, Apr 6, 2009 at 5:21 PM, Gus Correa  wrote:

Hi Francesco

Did you try to run examples/connectivity_c.c,
or examples/hello_c.c before trying amber?
They are in the directory where you untarred the OpenMPI tarball.
It is easier to troubleshoot
possible network and host problems
with these simpler programs.

I have found the "examples". Should they be compiled? how? This is my
only question here.

cd examples/
/full/path/to/openmpi/bin/mpicc -o connectivity_c connectivity_c.c

Then run it with, say:

/full/path/to/openmpi/bin/mpirun -host {whatever_hosts_you_want}
-n {as_many_processes_you_want} connectivity_c

Likewise for hello_c.c


What's below is info. Although amber parallel
would have not compiled with faulty openmpi, I'll run openmpi tests as
soon as I understand how.


Also, to avoid confusion,
you may use a full path name to mpirun,
in case you have other MPI flavors in your system.
Often times the mpirun your path is pointing to is not what you
may think it is.


which mpirun
/usr/local/bin/mpirun

Did you install OpenMPI on /usr/local ?
When you do "mpirun -help", do you see "mpirun (Open MPI) 1.3"?


mpirun -help
mpirun (Open MPI) 1.3.1
on the 1st line, then follow the options


Ok, it looks like you installed OpenMPI 1.3.1 with the default
"--prefix" which is /usr/local.





How about the output of "orte_info" ?

orte_info was not installed. See below what has been installed.



Sorry, my fault.
I meant ompi_info (not orte_info).
Please try ompi_info or "ompi_info --config".
It will tell you the compilers used to build OpenMPI, etc.

I presume all of this is being done in the "parallel computer",
i.e., in one of the AMD64 Debian systems, right?




Does it show your Intel compilers, etc?


I guess so, otherwise amber would have not been compiled, but I don't
know the commands to prove it. The intel compilers are on the path:
/opt/intel/cce/10.1.015/bin:/opt/intel/fce/10.1.015/bin and the mkl
are sourced in .bashrc.



Again, all in the AMD64 system, right?


I ask because many Linux distributions come with one or more flavors
of MPI (OpenMPI, MPICH, LAM, etc), some compilers also do (PGI for
instance), some tools (Intel MKL?) may also have their MPI,
and you end up with a bunch of MPI commands
on your path that may produce a big mixup.
This is a pretty common problem that affect new users on this list,
on the MPICH list, on clustering lists, etc.
The errors messages often don't help find the source of the problem,
and people spend a lot of time trying to troubleshoot network,
etc, when is often just a path problem.

So, this is why when you begin, you may want to use full path
names, to avoid confusion.
After the basic MPI functionality is working,
then you can go and fix your path chain,
and rely on your path chain.


there is no other accessible MPI (one application, DOT2, has mpich but
it is a static compilation; DOT2 parallelizatuion requires thar the
computer knows itself, i.e." ssh hostname date" should afford the date
passwordless. The reported issues in testing amber have destroyed this
situation: now deb64 has port22 closed, evem to itself.


Have you tried to reboot the master node, to see if it comes back
to the original ssh setup?
You need ssh to be functional to run OpenMPI code,
including the tests above.


I don't know if you want to run on amd64 alone (master node?)
or on a cluster.
In any case, you may use a list of hosts
or a hostfile on the mpirun command line,
to specify where you want to run.

With amber I use the parallel computer directly and the amber
installation is chown to me. The ssh connection, in this case, only
serves to get file from. or send files to, my desktop.


It is unclear to me what you mean by "the parallel computer directly".
Can you explain better which computers are in this game?
Your desktop and a cluster perhaps?
Are they both Debian 64 Linux?
Where do you compile the programs?
Where do you want to run the programs?


In my .bashrc:

(for amber)
MPI_HOME=/usr/local
export MPI_HOME

(for openmpi)
if [ "$LD_LIBRARY_PATH" ] ; then
 export LD_LIBRARY_PATH="$LD_LIBRARY_PATH'/usr/local/lib"
else
 export LD_LIBRARY_PATH="/usr/local/lib"
fi


Is this on your desktop or on the "parallel computer"?



On both "parallel computers" (there is my desktop, ssh to two uma-type
dual-opteron "parallel computers". 
Only one was active when the "test"

problems arose. While the (ten years old) destop is i386, both other
machines are amd64, i.e., all debian lenny. I prepare the input files
on the i386 and use it also as storage for backups. 


So, you only use your i386 desktop to ssh to

Re: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1

2009-04-06 Thread Yvan Fournier

Hello to all,

I have also encountered a similar bug with MPI-IO
with Open MPI 1.3.1, reading a Code_Saturne preprocessed mesh file
(www.code-saturne.org). Reading the file can be done using 2 MPI-IO
modes, or one non-MPI-IO mode.

The first MPI-IO mode uses individual file pointers, and  involves a
series of MPI_File_Read_all with all ranks using the same view (for
record headers), interlaced with MPI_File_Read_all with ranks using
different views (for record data, successive blocks being read by each
rank).

The second MPI-IO mode uses explicit file offsets, with
MPI_File_read_at_all instead of MPI_File_read_all.

Both MPI-IO modes seem to work fine with OpenMPI 1.2, MPICH 2,
and variants on IBM Blue Gene/L and P, as well as Bull Novascale,
but with OpenMPI 1.3.1, data read seems to be corrupt on at least
one file using the individual file pointers approach (though it
works well using explicit offsets).

The bug does not appear in unit tests, and it only appears after several
records are read on the case that does fail (on 2 ranks), so to
reproduce it with a simple program, I would have to extract the exact
file access patterns from the exact case which fails, which would
require a few extra hours of work.

If the bug is not reproduced in a simpler manner first, I will try
to build a simple program reproducing the bug within a week or 2,
but In the meantime, I just want to confirm Scott's observation
(hoping it is the same bug).

Best regards,

Yvan Fournier

On Mon, 2009-04-06 at 16:03 -0400, users-requ...@open-mpi.org wrote:

> Date: Mon, 06 Apr 2009 12:16:18 -0600
> From: Scott Collis 
> Subject: [OMPI users] Incorrect results with MPI-IO under OpenMPI
>   v1.3.1
> To: us...@open-mpi.org
> Message-ID: 
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
> 
> I have been a user of MPI-IO for 4+ years and have a code that has run  
> correctly with MPICH, MPICH2, and OpenMPI 1.2.*
> 
> I recently upgraded to OpenMPI 1.3.1 and immediately noticed that my  
> MPI-IO generated output files are corrupted.  I have not yet had a  
> chance to debug this in detail, but it appears that  
> MPI_File_write_all() commands are not placing information correctly on  
> their file_view when running with more than 1 processor (everything is  
> okay with -np 1).
> 
> Note that I have observed the same incorrect behavior on both Linux  
> and OS-X.  I have also gone back and made sure that the same code  
> works with MPICH, MPICH2, and OpenMPI 1.2.* so I'm fairly confident  
> that something has been changed or broken as of OpenMPI 1.3.*.  Just  
> today, I checked out the SVN repository version of OpenMPI and built  
> and tested my code with that and the results are incorrect just as for  
> the 1.3.1 tarball.
> 
> While I plan to continue to debug this and will try to put together a  
> small test that demonstrates the issue, I thought that I would first  
> send out this message to see if this might trigger a thought within  
> the OpenMPI development team as to where this issue might be.
> 
> Please let me know if you have any ideas as I would very much  
> appreciate it!
> 
> Thanks in advance,
> 
> Scott
> --
> Scott Collis
> sscol...@me.com
>

Re: [OMPI users] libnuma issue

2009-04-06 Thread Prentice Bisbal

Francesco Pietra wrote:
> I am posting again more specifically because it may have been buried
> in a more generic thread.
> 
> With debian linux amd64 lenny and openmpi-1.3.1
> 
> ./configure cc=/opt/intel/cce/10.1.015/bin/icc
> cxx=/opt/intel/cce/10.1.015/bin/icpc
> F77=/opt/intel/fce/10.1.015/bin/ifort
> FC=/opt/intel/fce/10.1.015/bin/ifort --with-libnuma=/usr/lib
> 
> failed because
> 
> "expected file /usr/lib/include/numa.h was not found"
> 
> In debian amd64 lenny numa.h has a different location
> "/usr/include/numa.h". Attached is the config.log.
> 
> I would appreciate help in circumventing the problem.
> 

I believe you need

--with-libnuma=/usr.

The configure script then assumes the includes files are under
/usr/include and the libs are under /usr/lib

-- 
Prentice

Re: [OMPI users] mpirun: symbol lookup error: /usr/local/lib/openmpi/mca_plm_lsf.so: undefined symbol: ls b_init

2009-04-06 Thread Prentice Bisbal

Alessandro Surace wrote:
> Hi guys, I try to repost my question...
> I've a problem with the last stable build and the last nightly snapshot.
> 
> When I run a job directly with mpirun no problem.
> If I try to submit it with lsf:
> bsub -a openmpi -m grid01 mpirun.lsf /mnt/ewd/mpi/fibonacci/fibonacci_mpi
> 
> I get the follow error:
> mpirun: symbol lookup error: /usr/local/lib/openmpi/mca_plm_lsf.so:
> undefined symbol: lsb_init
> Job  /opt/lsf/7.0/linux2.6-glibc2.3-x86/bin/openmpi_wrapper
> /mnt/ewd/mpi/fibonacci/fibonacci_mpi
> 
> I've verified that the lsb_init symbol is present in the library:
> [root@grid01 lib]# strings libbat.* |grep lsb_init
> lsb_init
> sch_lsb_init
> lsb_init()
> lsb_init
> sch_lsb_init
> sch_lsb_init
> sch_lsb_init
> sch_lsb_init
> lsb_init()
> sch_lsb_init
> 

Can you verify that LSF is passing your evironment along correctly? It
looks like your LD_LIBRARY_PATH is set in your login environment, but
not the environment that the LSF job runs in

You can check this by submitting a jog that executes just the command
'printenv'. Compare the output to what you get when you type 'printenv'
on the command. Compare the values for LD_LIBRARY_PATH, in particular.

If that looks okay, then try running a job that just executes

ldd /mnt/ewd/mpi/fibonacci/fibonacci_mpi

This will show you any libraries that ld can't find in the LSF run-time
environment.

-- 
Prentice

Re: [OMPI users] ssh MPi and program tests

2009-04-06 Thread Francesco Pietra

Hi Gus:
Partial quick answers below. I have reestablished the ssh connection
so that tomorrow I'll run the tests. Everything that relates to
running amber is on the "parallel computer", where I have access to
everything.

On Mon, Apr 6, 2009 at 7:53 PM, Gus Correa  wrote:
> Hi Francesco, list
>
> Francesco Pietra wrote:
>>
>> On Mon, Apr 6, 2009 at 5:21 PM, Gus Correa  wrote:
>>>
>>> Hi Francesco
>>>
>>> Did you try to run examples/connectivity_c.c,
>>> or examples/hello_c.c before trying amber?
>>> They are in the directory where you untarred the OpenMPI tarball.
>>> It is easier to troubleshoot
>>> possible network and host problems
>>> with these simpler programs.
>>
>> I have found the "examples". Should they be compiled? how? This is my
>> only question here.
>
> cd examples/
> /full/path/to/openmpi/bin/mpicc -o connectivity_c connectivity_c.c
>
> Then run it with, say:
>
> /full/path/to/openmpi/bin/mpirun -host {whatever_hosts_you_want}
> -n {as_many_processes_you_want} connectivity_c
>
> Likewise for hello_c.c
>
>> What's below is info. Although amber parallel
>> would have not compiled with faulty openmpi, I'll run openmpi tests as
>> soon as I understand how.
>>
>>> Also, to avoid confusion,
>>> you may use a full path name to mpirun,
>>> in case you have other MPI flavors in your system.
>>> Often times the mpirun your path is pointing to is not what you
>>> may think it is.
>>
>>
>> which mpirun
>> /usr/local/bin/mpirun
>
> Did you install OpenMPI on /usr/local ?
> When you do "mpirun -help", do you see "mpirun (Open MPI) 1.3"?

mpirun -help
mpirun (Open MPI) 1.3.1
on the 1st line, then follow the options


> How about the output of "orte_info" ?
orte_info was not installed. See below what has been installed.


> Does it show your Intel compilers, etc?

I guess so, otherwise amber would have not been compiled, but I don't
know the commands to prove it. The intel compilers are on the path:
/opt/intel/cce/10.1.015/bin:/opt/intel/fce/10.1.015/bin and the mkl
are sourced in .bashrc.

>
> I ask because many Linux distributions come with one or more flavors
> of MPI (OpenMPI, MPICH, LAM, etc), some compilers also do (PGI for
> instance), some tools (Intel MKL?) may also have their MPI,
> and you end up with a bunch of MPI commands
> on your path that may produce a big mixup.
> This is a pretty common problem that affect new users on this list,
> on the MPICH list, on clustering lists, etc.
> The errors messages often don't help find the source of the problem,
> and people spend a lot of time trying to troubleshoot network,
> etc, when is often just a path problem.
>
> So, this is why when you begin, you may want to use full path
> names, to avoid confusion.
> After the basic MPI functionality is working,
> then you can go and fix your path chain,
> and rely on your path chain.
>
>>
>> there is no other accessible MPI (one application, DOT2, has mpich but
>> it is a static compilation; DOT2 parallelizatuion requires thar the
>> computer knows itself, i.e." ssh hostname date" should afford the date
>> passwordless. The reported issues in testing amber have destroyed this
>> situation: now deb64 has port22 closed, evem to itself.
>>
>
> Have you tried to reboot the master node, to see if it comes back
> to the original ssh setup?
> You need ssh to be functional to run OpenMPI code,
> including the tests above.
>
>>
>>> I don't know if you want to run on amd64 alone (master node?)
>>> or on a cluster.
>>> In any case, you may use a list of hosts
>>> or a hostfile on the mpirun command line,
>>> to specify where you want to run.
>>
>> With amber I use the parallel computer directly and the amber
>> installation is chown to me. The ssh connection, in this case, only
>> serves to get file from. or send files to, my desktop.
>>
>
> It is unclear to me what you mean by "the parallel computer directly".
> Can you explain better which computers are in this game?
> Your desktop and a cluster perhaps?
> Are they both Debian 64 Linux?
> Where do you compile the programs?
> Where do you want to run the programs?
>
>> In my .bashrc:
>>
>> (for amber)
>> MPI_HOME=/usr/local
>> export MPI_HOME
>>
>> (for openmpi)
>> if [ "$LD_LIBRARY_PATH" ] ; then
>>  export LD_LIBRARY_PATH="$LD_LIBRARY_PATH'/usr/local/lib"
>> else
>>  export LD_LIBRARY_PATH="/usr/local/lib"
>> fi
>>
>
> Is this on your desktop or on the "parallel computer"?


On both "parallel computers" (there is my desktop, ssh to two uma-type
dual-opteron "parallel computers". Only one was active when the "test"
problems arose. While the (ten years old) destop is i386, both other
machines are amd64, i.e., all debian lenny. I prepare the input files
on the i386 and use it also as storage for backups. The "parallel
computer" has only the X server and a minimal window for a
two-dimensional graphics of amber. The other parallel computer has a
GeForce 6600 card with GLSL support, which I use to elaborate
graphically the outputs from the numerical computations (using VM

[OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1

2009-04-06 Thread Scott Collis

I have been a user of MPI-IO for 4+ years and have a code that has run  
correctly with MPICH, MPICH2, and OpenMPI 1.2.*


I recently upgraded to OpenMPI 1.3.1 and immediately noticed that my  
MPI-IO generated output files are corrupted.  I have not yet had a  
chance to debug this in detail, but it appears that  
MPI_File_write_all() commands are not placing information correctly on  
their file_view when running with more than 1 processor (everything is  
okay with -np 1).


Note that I have observed the same incorrect behavior on both Linux  
and OS-X.  I have also gone back and made sure that the same code  
works with MPICH, MPICH2, and OpenMPI 1.2.* so I'm fairly confident  
that something has been changed or broken as of OpenMPI 1.3.*.  Just  
today, I checked out the SVN repository version of OpenMPI and built  
and tested my code with that and the results are incorrect just as for  
the 1.3.1 tarball.


While I plan to continue to debug this and will try to put together a  
small test that demonstrates the issue, I thought that I would first  
send out this message to see if this might trigger a thought within  
the OpenMPI development team as to where this issue might be.


Please let me know if you have any ideas as I would very much  
appreciate it!


Thanks in advance,

Scott
--
Scott Collis
sscol...@me.com

Re: [OMPI users] ssh MPi and program tests

2009-04-06 Thread Gus Correa


Hi Francesco, list

Francesco Pietra wrote:

On Mon, Apr 6, 2009 at 5:21 PM, Gus Correa  wrote:

Hi Francesco

Did you try to run examples/connectivity_c.c,
or examples/hello_c.c before trying amber?
They are in the directory where you untarred the OpenMPI tarball.
It is easier to troubleshoot
possible network and host problems
with these simpler programs.


I have found the "examples". Should they be compiled? how? This is my
only question here. 


cd examples/
/full/path/to/openmpi/bin/mpicc -o connectivity_c connectivity_c.c

Then run it with, say:

/full/path/to/openmpi/bin/mpirun -host {whatever_hosts_you_want}
-n {as_many_processes_you_want} connectivity_c

Likewise for hello_c.c


What's below is info. Although amber parallel
would have not compiled with faulty openmpi, I'll run openmpi tests as
soon as I understand how.


Also, to avoid confusion,
you may use a full path name to mpirun,
in case you have other MPI flavors in your system.
Often times the mpirun your path is pointing to is not what you
may think it is.



which mpirun
/usr/local/bin/mpirun


Did you install OpenMPI on /usr/local ?
When you do "mpirun -help", do you see "mpirun (Open MPI) 1.3"?
How about the output of "orte_info" ?
Does it show your Intel compilers, etc?

I ask because many Linux distributions come with one or more flavors
of MPI (OpenMPI, MPICH, LAM, etc), some compilers also do (PGI for 
instance), some tools (Intel MKL?) may also have their MPI,

and you end up with a bunch of MPI commands
on your path that may produce a big mixup.
This is a pretty common problem that affect new users on this list,
on the MPICH list, on clustering lists, etc.
The errors messages often don't help find the source of the problem,
and people spend a lot of time trying to troubleshoot network,
etc, when is often just a path problem.

So, this is why when you begin, you may want to use full path
names, to avoid confusion.
After the basic MPI functionality is working,
then you can go and fix your path chain,
and rely on your path chain.



there is no other accessible MPI (one application, DOT2, has mpich but
it is a static compilation; DOT2 parallelizatuion requires thar the
computer knows itself, i.e." ssh hostname date" should afford the date
passwordless. The reported issues in testing amber have destroyed this
situation: now deb64 has port22 closed, evem to itself.



Have you tried to reboot the master node, to see if it comes back
to the original ssh setup?
You need ssh to be functional to run OpenMPI code,
including the tests above.




I don't know if you want to run on amd64 alone (master node?)
or on a cluster.
In any case, you may use a list of hosts
or a hostfile on the mpirun command line,
to specify where you want to run.


With amber I use the parallel computer directly and the amber
installation is chown to me. The ssh connection, in this case, only
serves to get file from. or send files to, my desktop.



It is unclear to me what you mean by "the parallel computer directly".
Can you explain better which computers are in this game?
Your desktop and a cluster perhaps?
Are they both Debian 64 Linux?
Where do you compile the programs?
Where do you want to run the programs?


In my .bashrc:

(for amber)
MPI_HOME=/usr/local
export MPI_HOME

(for openmpi)
if [ "$LD_LIBRARY_PATH" ] ; then
  export LD_LIBRARY_PATH="$LD_LIBRARY_PATH'/usr/local/lib"
else
  export LD_LIBRARY_PATH="/usr/local/lib"
fi



Is this on your desktop or on the "parallel computer"?



There is also

MPICH_HOME=/usr/local
export MPICH_HOME

this is for DOCK, which, with this env variabl, accepts openmpi (at
lest it was so with v 1.2.6)



Oh, well, it looks like there is MPICH already installed on /usr/local.
So, this may be part of the confusion, the path confusion I referred to.

I would suggest installing OpenMPI on a different directory,
using the --prefix option of the OpenMPI configure script.
Do configure --help for details about all configuration options.



the intel compilers (compiled ifort and icc, are sourced in both my
.bashrc and root home .bashrc.

Thanks and apologies for my low level in these affairs. It is the
first time I am faced by such problems, with amd64, same intel
compilers, and openmpi 1.2.6 everything was in order.



To me it doesn't look like the problem is related to the new version
of OpenMPI.

Try the test programs with full path names first.
It may not solve the problem, but it may clarify things a bit.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


francesco




Do "/full/path/to/openmpi/bin/mpirun --help" for details.

I am not familiar to amber, but how does it find your openmpi
libraries and compiler wrappers?
Don't you need to give it the paths during configuration,
say,
/configure_amber -openmpi=/full/path/to/open

Re: [OMPI users] Problem with running openMPI program

2009-04-06 Thread Gus Correa


Hi Ankush

Ankush Kaul wrote:

I am not able to check if NFS export/mount of /tmp is working,
when i give the command *ssh 192.168.45.65 192.168.67.18* i get the 
error : bash: 192.168.67.18 : command not found




The ssh command syntax above is wrong.
Use only one IP address, which should be your remote machine's IP.

Assuming you are logged in to 192.168.67.18 (is this the master ?),
and want to ssh to 192.168.45.65 (is this the slave ?),
and run the command 'my_command' there, do:

ssh 192.168.45.65 'my_command'

If you already set up the passwordless ssh connection,
this should work.


let me explain what i understood using an example.

First, i make a folder '/work directory' on my master node.



Yes ...
... but don't use spaces in Linux/Unix names! Never!
It is either "/work"
or "/work_directory".
Using "/work directory" with a blank space in-between
is to ask for real trouble!
This is OK in Windows, but raises the hell on Linux/Unix.
In Linux/Unix blank space is a separator for everything,
so it will interpret only the first chunk of your directory name,
and think that what comes after the blank is another directory name,
or a command option, or whatever else.

You can create subdirectories there also, to put your own
programs.
Or maybe one subdirectory
for each user, and change the ownership of each subdirectory
to the corresponding user.

As root, on the master node, do:

cd /work
whoami  (this will give you your own user-name)
mkdir user-name
chown  user-name:user-name  user-name  (pay attention to the : and blanks!)

Then i mount this directory on a folder named '/work directory/mnt' on 
the slave node


is this correct?


No.
The easy thing to do is to use the same name for the mountpoint
as the original directory, say, /work only, if you called
it /work on the master node.
Again, don't use white space on Linux/Unix names!

Create a mountpoint directory called /work on the slave node:

mkdir /work

Don't populate the slave node /work directory,
as it is just a mountpoint.
Leave it empty.
Then use it to mount the actual /work directory that you
want to export from the master node.



also how and where (is it on the master node) do i give the list of 
hosts? 


On the master node, in the mpirun command line.

As I said, do "/full/path/to/openmpi/bin/mpirun --help" to get
a lot of information about the mpirun command options.



and by hosts you mean the compute nodes.




By hosts I mean whatever computers you want to run your MPI program on.
It can be the master only, the slave only, or both.

The (excellent) OpenMPI FAQ may also help you:

http://www.open-mpi.org/faq/

Many of your questions may have been answered there already.
I encourage you to read them, particularly the General Information,
Building, and Running Jobs ones.

Plez bear with me as this is the first time i am doin a project on Linux 
clustering.




Welcome, and good luck!

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

On Mon, Apr 6, 2009 at 9:27 PM, Gus Correa > wrote:


Hi Ankush

If I remember right,
mpirun will put you on your home directory, not on /tmp,
when it starts your ssh session.
To run on /tmp (or on /mnt/nfs)
you may need to use "-path" option.

Likewise, you may want to give mpirun a list of hosts (-host option)
or a hostfile (-hostfile option), to specify where you want the
program to run.

Do
"/full/path/to/openmpi/mpriun -help"
for details.

Make sure your NFS export/mount of /tmp is working,
say, by doing:

ssh slave_node 'hostname; ls /tmp; ls /mnt/nfs'

or similar, and see if your  program "pi" is really there (and where).

Actually, it may be confusing to export /tmp, as it is part
of the basic Linux directory tree,
which is the reason why you mounted it on /mnt/nfs.
You may want to choose to export/mount
a directory that is not so generic as /tmp,
so that you can use a consistent name on both computers.
For instance, you can create a /my_export or /work directory
(or whatever name you prefer) on the master node,
export it to the slave node, mount it on the slave node
with the same name/mountpoint, and use it for your MPI work.

I hope this helps.
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Ankush Kaul wrote:

Thank you sir,
one more thing i am confused about, suppose i have 2 run a 'pi'
program using open mpi, where do i place the program?

currently i have placed it in /t

[OMPI users] Factor of 10 loss in performance with 1.3.x

2009-04-06 Thread Steve Kargl

Hi,

I recently upgraded OpenMPI from 1.2.9 to 1.3 and then 1.3.1.
One of my colleagues reported a dramatic drop in performance
with one of his applications.  My investigation shows a factor
of 10 drop in communication over the memory bus.  I've placed
a figure that iilustrates the problem at 

http://troutmask.apl.washington.edu/~kargl/ompi_cmp.jpg

The legend in the figure has 'ver. 1.2.9  11 <--> 18'.  This
means communication between node 11 and node 18 over GigE 
ethernet in my cluster.  'ver. 1.2.9  20 <--> 20' means
communication between processes on node 20 where node 20 has
8 processors.  The image clearly shows that communication over
GigE is consistent among the versions of OpenMPI.  However, some
change in going from 1.2.9 to 1.3.x is causing a drop in
communication between processes on a single node.

Things to note.  Nodes 11, 18, and 20 are essentially idle
before and after a test.  configure was run with the same set
of options except with 1.3 and 1.3.1 I needed to disable ipv6:

  ./configure --prefix=/usr/local/openmpi-1.2.9 \
   --enable-orterun-prefix-by-default --enable-static
   --disable-shared

  ./configure --prefix=/usr/local/openmpi-1.3.1 \
   --enable-orterun-prefix-by-default --enable-static
   --disable-shared --disable-ipv6

  ./configure --prefix=/usr/local/openmpi-1.3.1 \
   --enable-orterun-prefix-by-default --enable-static
   --disable-shared --disable-ipv6

The operating system is FreeBSD 8.0 where nodes 18 and 20
are quad-core, dual-cpu opteron based systems and node 11
is a dual-core, dual-cpu opteron based system.  For additional
information, I've placed the output of ompi_info at

http://troutmask.apl.washington.edu/~kargl/ompi_info-1.2.9
http://troutmask.apl.washington.edu/~kargl/ompi_info-1.3.0
http://troutmask.apl.washington.edu/~kargl/ompi_info-1.3.1

Any hints on tuning 1.3.1 would be appreciated?

-- 
steve
-- 
Steve

Re: [OMPI users] ssh MPi and program tests

2009-04-06 Thread Francesco Pietra

On Mon, Apr 6, 2009 at 5:21 PM, Gus Correa  wrote:
> Hi Francesco
>
> Did you try to run examples/connectivity_c.c,
> or examples/hello_c.c before trying amber?
> They are in the directory where you untarred the OpenMPI tarball.
> It is easier to troubleshoot
> possible network and host problems
> with these simpler programs.

I have found the "examples". Should they be compiled? how? This is my
only question here. What's below is info. Although amber parallel
would have not compiled with faulty openmpi, I'll run openmpi tests as
soon as I understand how.

>
> Also, to avoid confusion,
> you may use a full path name to mpirun,
> in case you have other MPI flavors in your system.
> Often times the mpirun your path is pointing to is not what you
> may think it is.


which mpirun
/usr/local/bin/mpirun

there is no other accessible MPI (one application, DOT2, has mpich but
it is a static compilation; DOT2 parallelizatuion requires thar the
computer knows itself, i.e." ssh hostname date" should afford the date
passwordless. The reported issues in testing amber have destroyed this
situation: now deb64 has port22 closed, evem to itself.


>
> I don't know if you want to run on amd64 alone (master node?)
> or on a cluster.
> In any case, you may use a list of hosts
> or a hostfile on the mpirun command line,
> to specify where you want to run.

With amber I use the parallel computer directly and the amber
installation is chown to me. The ssh connection, in this case, only
serves to get file from. or send files to, my desktop.

In my .bashrc:

(for amber)
MPI_HOME=/usr/local
export MPI_HOME

(for openmpi)
if [ "$LD_LIBRARY_PATH" ] ; then
  export LD_LIBRARY_PATH="$LD_LIBRARY_PATH'/usr/local/lib"
else
  export LD_LIBRARY_PATH="/usr/local/lib"
fi


There is also

MPICH_HOME=/usr/local
export MPICH_HOME

this is for DOCK, which, with this env variabl, accepts openmpi (at
lest it was so with v 1.2.6)

the intel compilers (compiled ifort and icc, are sourced in both my
.bashrc and root home .bashrc.

Thanks and apologies for my low level in these affairs. It is the
first time I am faced by such problems, with amd64, same intel
compilers, and openmpi 1.2.6 everything was in order.

francesco



>
> Do "/full/path/to/openmpi/bin/mpirun --help" for details.
>
> I am not familiar to amber, but how does it find your openmpi
> libraries and compiler wrappers?
> Don't you need to give it the paths during configuration,
> say,
> /configure_amber -openmpi=/full/path/to/openmpi
> or similar?
>
> I hope this helps.
> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
>
>
> Francesco Pietra wrote:
>>
>> I have compiled openmpi 1.3.1 on debian amd64 lenny with icc/ifort
>> (10.1.015) and libnuma. Tests passed:
>>
>> ompi_info | grep libnuma
>>  MCA affinity: libnuma (MCA v 2.0, API 2.0)
>>
>> ompi_info | grep maffinity
>>  MCA affinity: first use (MCA as above)
>>  MCA affinity: libnuma as above.
>>
>> Then, I have compiled parallel a molecular dynamics package, amber10,
>> without error signals but I am having problems in testing the amber
>> parallel installation.
>>
>> amber10 configure was set as:
>>
>> ./configure_amber -openmpi -nobintray ifort
>>
>> just as I used before with openmpi 1.2.6. Could you say if the
>> -openmpi should be changed?
>>
>> cd tests
>>
>> export DO_PARALLEL='mpirun -np 4'
>>
>> make test.parallel.MM  < /dev/null
>>
>> cd cytosine && ./Run.cytosine
>> The authenticity of host deb64 (which is the hostname) (127.0.1.1)
>> can't be established.
>> RSA fingerprint .
>> connecting ?
>>
>> I stopped the ssh daemon, whereby tests were interrupted because deb64
>> (i.e., itself) could no more be accessed. Further attempts under these
>> conditions failed for the same reason. Now, sshing to deb64 is no more
>> possible: port 22 closed. In contrast, sshing from deb64 to other
>> computers occurs passwordless. No such problems arose at the time of
>> amd64 etch with the same
>> configuration of ssh, same compilers, and openmpi 1.2.6.
>>
>> I am here because the warning from the amber site is that I should to
>> learn how to use my installation of MPI. Therefore, if there is any
>> clue ..
>>
>> thanks
>> francesco pietra
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

Re: [OMPI users] Problem with running openMPI program

2009-04-06 Thread Ankush Kaul

I am not able to check if NFS export/mount of /tmp is working,
when i give the command *ssh 192.168.45.65 192.168.67.18* i get the error :
bash: 192.168.67.18: command not found

let me explain what i understood using an example.

First, i make a folder '/work directory' on my master node.

Then i mount this directory on a folder named '/work directory/mnt' on the
slave node

is this correct?

also how and where (is it on the master node) do i give the list of hosts?
and by hosts you mean the compute nodes.

Plez bear with me as this is the first time i am doin a project on Linux
clustering.

On Mon, Apr 6, 2009 at 9:27 PM, Gus Correa  wrote:

> Hi Ankush
>
> If I remember right,
> mpirun will put you on your home directory, not on /tmp,
> when it starts your ssh session.
> To run on /tmp (or on /mnt/nfs)
> you may need to use "-path" option.
>
> Likewise, you may want to give mpirun a list of hosts (-host option)
> or a hostfile (-hostfile option), to specify where you want the
> program to run.
>
> Do
> "/full/path/to/openmpi/mpriun -help"
> for details.
>
> Make sure your NFS export/mount of /tmp is working,
> say, by doing:
>
> ssh slave_node 'hostname; ls /tmp; ls /mnt/nfs'
>
> or similar, and see if your  program "pi" is really there (and where).
>
> Actually, it may be confusing to export /tmp, as it is part
> of the basic Linux directory tree,
> which is the reason why you mounted it on /mnt/nfs.
> You may want to choose to export/mount
> a directory that is not so generic as /tmp,
> so that you can use a consistent name on both computers.
> For instance, you can create a /my_export or /work directory
> (or whatever name you prefer) on the master node,
> export it to the slave node, mount it on the slave node
> with the same name/mountpoint, and use it for your MPI work.
>
> I hope this helps.
> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
>
> Ankush Kaul wrote:
>
>> Thank you sir,
>> one more thing i am confused about, suppose i have 2 run a 'pi' program
>> using open mpi, where do i place the program?
>>
>> currently i have placed it in /tmp folder on de master node. this /tmp
>> folder is mounted on /mnt/nfs of the compute node.
>>
>> i run de progam from the tmp folder on de master node, is this correct?
>>
>> i m a newbie n really need some help, thanks in advance
>>
>> On Mon, Apr 6, 2009 at 8:43 PM, John Hearns > hear...@googlemail.com>> wrote:
>>
>>2009/4/6 Ankush Kaul >>:
>> >> Also how do i come to know that the program is using resources
>>of both the
>> > nodes?
>>
>>Log into the second node before you start the program.
>>Run 'top'
>>Seriously - top is a very, very useful utility.
>>___
>>users mailing list
>>us...@open-mpi.org 
>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> 
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

Re: [OMPI users] Problem with running openMPI program

2009-04-06 Thread Gus Correa

Hi Ankush

If I remember right,
mpirun will put you on your home directory, not on /tmp,
when it starts your ssh session.
To run on /tmp (or on /mnt/nfs)
you may need to use "-path" option.

Likewise, you may want to give mpirun a list of hosts (-host option)
or a hostfile (-hostfile option), to specify where you want the
program to run.

Do
"/full/path/to/openmpi/mpriun -help"
for details.

Make sure your NFS export/mount of /tmp is working,
say, by doing:

ssh slave_node 'hostname; ls /tmp; ls /mnt/nfs'

or similar, and see if your  program "pi" is really there (and where).

Actually, it may be confusing to export /tmp, as it is part
of the basic Linux directory tree,
which is the reason why you mounted it on /mnt/nfs.
You may want to choose to export/mount
a directory that is not so generic as /tmp,
so that you can use a consistent name on both computers.
For instance, you can create a /my_export or /work directory
(or whatever name you prefer) on the master node,
export it to the slave node, mount it on the slave node
with the same name/mountpoint, and use it for your MPI work.

I hope this helps.
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Ankush Kaul wrote:

Thank you sir,
one more thing i am confused about, suppose i have 2 run a 'pi' program 
using open mpi, where do i place the program?

currently i have placed it in /tmp folder on de master node. this /tmp 
folder is mounted on /mnt/nfs of the compute node.

i run de progam from the tmp folder on de master node, is this correct?

i m a newbie n really need some help, thanks in advance

On Mon, Apr 6, 2009 at 8:43 PM, John Hearns > wrote:

2009/4/6 Ankush Kaul mailto:ankush.rk...@gmail.com>>:
 >> Also how do i come to know that the program is using resources
of both the
 > nodes?

Log into the second node before you start the program.
Run 'top'
Seriously - top is a very, very useful utility.
___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] ssh MPi and program tests

2009-04-06 Thread Gus Correa


Hi Francesco

Did you try to run examples/connectivity_c.c,
or examples/hello_c.c before trying amber?
They are in the directory where you untarred the OpenMPI tarball.
It is easier to troubleshoot
possible network and host problems
with these simpler programs.

Also, to avoid confusion,
you may use a full path name to mpirun,
in case you have other MPI flavors in your system.
Often times the mpirun your path is pointing to is not what you
may think it is.

I don't know if you want to run on amd64 alone (master node?)
or on a cluster.
In any case, you may use a list of hosts
or a hostfile on the mpirun command line,
to specify where you want to run.

Do "/full/path/to/openmpi/bin/mpirun --help" for details.

I am not familiar to amber, but how does it find your openmpi
libraries and compiler wrappers?
Don't you need to give it the paths during configuration,
say,
/configure_amber -openmpi=/full/path/to/openmpi
or similar?

I hope this helps.
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Francesco Pietra wrote:

I have compiled openmpi 1.3.1 on debian amd64 lenny with icc/ifort
(10.1.015) and libnuma. Tests passed:

ompi_info | grep libnuma
 MCA affinity: libnuma (MCA v 2.0, API 2.0)

ompi_info | grep maffinity
 MCA affinity: first use (MCA as above)
 MCA affinity: libnuma as above.

Then, I have compiled parallel a molecular dynamics package, amber10,
without error signals but I am having problems in testing the amber
parallel installation.

amber10 configure was set as:

./configure_amber -openmpi -nobintray ifort

just as I used before with openmpi 1.2.6. Could you say if the
-openmpi should be changed?

cd tests

export DO_PARALLEL='mpirun -np 4'

make test.parallel.MM  < /dev/null

cd cytosine && ./Run.cytosine
The authenticity of host deb64 (which is the hostname) (127.0.1.1)
can't be established.
RSA fingerprint .
connecting ?

I stopped the ssh daemon, whereby tests were interrupted because deb64
(i.e., itself) could no more be accessed. Further attempts under these
conditions failed for the same reason. Now, sshing to deb64 is no more
possible: port 22 closed. In contrast, sshing from deb64 to other
computers occurs passwordless. No such problems arose at the time of
amd64 etch with the same
configuration of ssh, same compilers, and openmpi 1.2.6.

I am here because the warning from the amber site is that I should to
learn how to use my installation of MPI. Therefore, if there is any
clue ..

thanks
francesco pietra
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Problem with running openMPI program

2009-04-06 Thread Ankush Kaul

Thank you sir,
one more thing i am confused about, suppose i have 2 run a 'pi' program
using open mpi, where do i place the program?

currently i have placed it in /tmp folder on de master node. this /tmp
folder is mounted on /mnt/nfs of the compute node.

i run de progam from the tmp folder on de master node, is this correct?

i m a newbie n really need some help, thanks in advance

On Mon, Apr 6, 2009 at 8:43 PM, John Hearns  wrote:

> 2009/4/6 Ankush Kaul :
> >> Also how do i come to know that the program is using resources of both
> the
> > nodes?
>
> Log into the second node before you start the program.
> Run 'top'
> Seriously - top is a very, very useful utility.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

Re: [OMPI users] Problem with running openMPI program

2009-04-06 Thread John Hearns

2009/4/6 Ankush Kaul :
>> Also how do i come to know that the program is using resources of both the
> nodes?

Log into the second node before you start the program.
Run 'top'
Seriously - top is a very, very useful utility.

Re: [OMPI users] ssh MPi and program tests

2009-04-06 Thread Ralph Castain

You might first try and see if you can run something other than amber  
with your new installation. Make sure you have the PATH and  
LD_LIBRARY_PATH set correctly on the remote node, or add --prefix to  
your mpirun cmd line.


Also, did you remember to install the OMPI 1.3 libraries on the remote  
nodes?


One thing I see below is that host deb64 was resolved to the loopback  
interface - was that correct? Seems unusual - even if you are on that  
host, it usually would resolve to some public IP address.



On Apr 6, 2009, at 8:51 AM, Francesco Pietra wrote:


I have compiled openmpi 1.3.1 on debian amd64 lenny with icc/ifort
(10.1.015) and libnuma. Tests passed:

ompi_info | grep libnuma
MCA affinity: libnuma (MCA v 2.0, API 2.0)

ompi_info | grep maffinity
MCA affinity: first use (MCA as above)
MCA affinity: libnuma as above.

Then, I have compiled parallel a molecular dynamics package, amber10,
without error signals but I am having problems in testing the amber
parallel installation.

amber10 configure was set as:

./configure_amber -openmpi -nobintray ifort

just as I used before with openmpi 1.2.6. Could you say if the
-openmpi should be changed?

cd tests

export DO_PARALLEL='mpirun -np 4'

make test.parallel.MM  < /dev/null

cd cytosine && ./Run.cytosine
The authenticity of host deb64 (which is the hostname) (127.0.1.1)
can't be established.
RSA fingerprint .
connecting ?

I stopped the ssh daemon, whereby tests were interrupted because deb64
(i.e., itself) could no more be accessed. Further attempts under these
conditions failed for the same reason. Now, sshing to deb64 is no more
possible: port 22 closed. In contrast, sshing from deb64 to other
computers occurs passwordless. No such problems arose at the time of
amd64 etch with the same
configuration of ssh, same compilers, and openmpi 1.2.6.

I am here because the warning from the amber site is that I should to
learn how to use my installation of MPI. Therefore, if there is any
clue ..

thanks
francesco pietra
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

[OMPI users] ssh MPi and program tests

2009-04-06 Thread Francesco Pietra

I have compiled openmpi 1.3.1 on debian amd64 lenny with icc/ifort
(10.1.015) and libnuma. Tests passed:

ompi_info | grep libnuma
 MCA affinity: libnuma (MCA v 2.0, API 2.0)

ompi_info | grep maffinity
 MCA affinity: first use (MCA as above)
 MCA affinity: libnuma as above.

Then, I have compiled parallel a molecular dynamics package, amber10,
without error signals but I am having problems in testing the amber
parallel installation.

amber10 configure was set as:

./configure_amber -openmpi -nobintray ifort

just as I used before with openmpi 1.2.6. Could you say if the
-openmpi should be changed?

cd tests

export DO_PARALLEL='mpirun -np 4'

make test.parallel.MM  < /dev/null

cd cytosine && ./Run.cytosine
The authenticity of host deb64 (which is the hostname) (127.0.1.1)
can't be established.
RSA fingerprint .
connecting ?

I stopped the ssh daemon, whereby tests were interrupted because deb64
(i.e., itself) could no more be accessed. Further attempts under these
conditions failed for the same reason. Now, sshing to deb64 is no more
possible: port 22 closed. In contrast, sshing from deb64 to other
computers occurs passwordless. No such problems arose at the time of
amd64 etch with the same
configuration of ssh, same compilers, and openmpi 1.2.6.

I am here because the warning from the amber site is that I should to
learn how to use my installation of MPI. Therefore, if there is any
clue ..

thanks
francesco pietra

Re: [OMPI users] Problem with running openMPI program

2009-04-06 Thread Ankush Kaul

Thank you Sir the problem was with the paths of 'bin' and 'lib' folders so i
used de *mpirun --prefix* command. I want to run a program 'pi' now using
the cluster, so where do i place de file on de master and the compute nodes?

Also how do i come to know that the program is using resources of both the
nodes?

On Sat, Apr 4, 2009 at 7:05 PM, Jeff Squyres  wrote:

> It might be best to:
>
> 1. Setup a non-root user to run MPI applications
> 2. Setup SSH keys between the hosts for this non-root user so that you can
> "ssh  uptime" and not be prompted for a password/passphrase
>
> This should help.
>
>
>
> On Apr 4, 2009, at 5:51 AM, Ankush Kaul wrote:
>
>   I followed the steps given here to setup up openMPI cluster :
>> http://www.ps3cluster.umassd.edu/step3mpi.html
>>
>> My cluster consists of two nodes, master(192.168.67.18) and
>> salve(192.168.45.65), connected directly through a cross cable.
>>
>> After setting up the cluster n configuring the master node, i mounted
>>  /tmp folder of master node on the slave node(i had some problems with nfs
>> at first but i worked my way out of it).
>>
>> Then i copied the 'pi.c' program in the /tmp folder and successfully
>> complied it, giving me a binary file 'pi'.
>>
>> Now when i try to run the binary file using the following command
>>
>> #mpirun –np 2 ./Pi
>>
>> root@192.168.45.65's password:
>> 
>>
>> after entering the password it gives the following error:
>>
>> bash: orted: command not found
>> [ccomp.cluster:18963] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 275
>> [ccomp.cluster:18963] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> pls_rsh_module.c at line 1166
>> [ccomp.cluster:18963] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c
>> at line 90
>> [ccomp.cluster:18963] ERROR: A daemon on node 192.168.45.65 failed to
>> start as expected.
>> [ccomp.cluster:18963] ERROR: There may be more information available from
>> [ccomp.cluster:18963] ERROR: the remote shell (see above).
>> [ccomp.cluster:18963] ERROR: The daemon exited unexpectedly with status
>> 127.
>> [ccomp.cluster:18963] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 188
>> [ccomp.cluster:18963] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> pls_rsh_module.c at line 1198
>> --
>> mpirun was unable to cleanly terminate the daemons for this job. Returned
>> value Timeout instead of ORTE_SUCCESS.
>> --
>>
>> I am totally lost now, as this is the first time i am working on a cluster
>> project, and need some help
>>
>> Thank you
>> Ankush
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

Re: [OMPI users] Bogus memcpy or bogus valgrind record

2009-04-06 Thread Number Cruncher

I'd like to add my concern to the thread at 
http://www.open-mpi.org/community/lists/users/2009/03/8661.php that the 
latest 1.3 series produces far too much memory-checker noise.


We use Valgrind extensively during debugging, and although I'm using the 
latest snapshot (1.3.2a1r20901) and latest Valgrind, and have 
--with-valgrind turned on to suppress the PLPA-check related errors, I'm 
still getting far too many issues from the following simple test:

#include 
#include 

int
main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  int myRank;
  if(!MPI_Comm_rank(MPI_COMM_WORLD, &myRank)) {
std::cout << "Hello World from " << myRank << std::endl;
  }

  MPI_Finalize();
  return 0;
}

Running this via "mpirun -np 2 valgrind hello_mpi" gives:
==16829== Memcheck, a memory error detector.
==16829== Copyright (C) 2002-2008, and GNU GPL'd, by Julian Seward et al.
==16829== Using LibVEX rev 1884, a library for dynamic binary translation.
==16829== Copyright (C) 2004-2008, and GNU GPL'd, by OpenWorks LLP.
==16829== Using valgrind-3.4.1, a dynamic binary instrumentation framework.
==16829== Copyright (C) 2000-2008, and GNU GPL'd, by Julian Seward et al.
==16829== For more details, rerun with: -v
==16829==
==16830== Memcheck, a memory error detector.
==16830== Copyright (C) 2002-2008, and GNU GPL'd, by Julian Seward et al.
==16830== Using LibVEX rev 1884, a library for dynamic binary translation.
==16830== Copyright (C) 2004-2008, and GNU GPL'd, by OpenWorks LLP.
==16830== Using valgrind-3.4.1, a dynamic binary instrumentation framework.
==16830== Copyright (C) 2000-2008, and GNU GPL'd, by Julian Seward et al.
==16830== For more details, rerun with: -v
==16830==
==16830== Syscall param writev(vector[...]) points to uninitialised byte(s)
==16830==at 0x34DE2C9F0C: writev (in /lib64/libc-2.6.so)
==16830==by 0x5CD213: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:265)
==16830==by 0x5C5B6A: mca_oob_tcp_peer_send (oob_tcp_peer.c:197)
==16830==by 0x5CB958: mca_oob_tcp_send_nb (oob_tcp_send.c:167)
==16830==by 0x5DB136: orte_rml_oob_send (rml_oob_send.c:137)
==16830==by 0x5D: orte_rml_oob_send_buffer (rml_oob_send.c:269)
==16830==by 0x5AFF7E: allgather (grpcomm_bad_module.c:369)
==16830==by 0x5B0805: modex (grpcomm_bad_module.c:497)
==16830==by 0x453518: ompi_mpi_init (ompi_mpi_init.c:626)
==16830==by 0x476CF8: PMPI_Init (pinit.c:80)
==16830==by 0x423DE0: main (helloMPI.cpp:8)
==16830==  Address 0x4e9e383 is 107 bytes inside a block of size 128 alloc'd
==16830==at 0x4A05FBB: malloc (vg_replace_malloc.c:207)
==16830==by 0x61684E: opal_dss_buffer_extend 
(dss_internal_functions.c:68)

==16830==by 0x5F36CE: opal_dss_pack_byte (dss_pack.c:198)
==16830==by 0x616974: opal_dss_store_data_type 
(dss_internal_functions.c:117)

==16830==by 0x5F31FF: opal_dss_pack (dss_pack.c:37)
==16830==by 0x5AFD65: allgather (grpcomm_bad_module.c:351)
==16830==by 0x5B0805: modex (grpcomm_bad_module.c:497)
==16830==by 0x453518: ompi_mpi_init (ompi_mpi_init.c:626)
==16830==by 0x476CF8: PMPI_Init (pinit.c:80)
==16830==by 0x423DE0: main (helloMPI.cpp:8)
==16829== Syscall param writev(vector[...]) points to uninitialised byte(s)
==16829==at 0x34DE2C9F0C: writev (in /lib64/libc-2.6.so)
==16829==by 0x5CD213: mca_oob_tcp_msg_send_handler (oob_tcp_msg.c:265)
==16829==by 0x5C5B6A: mca_oob_tcp_peer_send (oob_tcp_peer.c:197)
==16829==by 0x5CB958: mca_oob_tcp_send_nb (oob_tcp_send.c:167)
==16829==by 0x5DB136: orte_rml_oob_send (rml_oob_send.c:137)
==16829==by 0x5D: orte_rml_oob_send_buffer (rml_oob_send.c:269)
==16829==by 0x5AFF7E: allgather (grpcomm_bad_module.c:369)
==16829==by 0x5B0805: modex (grpcomm_bad_module.c:497)
==16829==by 0x453518: ompi_mpi_init (ompi_mpi_init.c:626)
==16829==by 0x476CF8: PMPI_Init (pinit.c:80)
==16829==by 0x423DE0: main (helloMPI.cpp:8)
==16829==  Address 0x4e9e63b is 107 bytes inside a block of size 256 alloc'd
==16829==at 0x4A06092: realloc (vg_replace_malloc.c:429)
==16829==by 0x61681C: opal_dss_buffer_extend 
(dss_internal_functions.c:63)

==16829==by 0x6181D2: opal_dss_copy_payload (dss_load_unload.c:164)
==16829==by 0x5AFEC9: allgather (grpcomm_bad_module.c:363)
==16829==by 0x5B0805: modex (grpcomm_bad_module.c:497)
==16829==by 0x453518: ompi_mpi_init (ompi_mpi_init.c:626)
==16829==by 0x476CF8: PMPI_Init (pinit.c:80)
==16829==by 0x423DE0: main (helloMPI.cpp:8)
==16829==
==16829== Conditional jump or move depends on uninitialised value(s)
==16829==at 0x4A5F4C: mca_mpool_sm_alloc (mpool_sm_module.c:79)
==16829==by 0x4F3585: mpool_calloc (btl_sm.c:108)
==16829==by 0x4F3E3B: sm_btl_first_time_init (btl_sm.c:307)
==16829==by 0x4F436F: mca_btl_sm_add_procs (btl_sm.c:484)
==16829==by 0x54ECFB: mca_bml_r2_add_procs (bml_r2.c:206)
==16829==by 0x4C2DC4: mca_pml_ob1_add_procs (pml_ob1.c:308)
==16829==by 0x45362A: ompi_mpi_init (ompi_mpi_

Re: [OMPI users] 'orte_ess_base_select failed'

[OMPI users] Interaction between Intel and OpenMPI floating point exceptions

Re: [OMPI users] Factor of 10 loss in performance with 1.3.x

Re: [OMPI users] Factor of 10 loss in performance with 1.3.x

Re: [OMPI users] ssh MPi and program tests

Re: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1

Re: [OMPI users] libnuma issue

Re: [OMPI users] mpirun: symbol lookup error: /usr/local/lib/openmpi/mca_plm_lsf.so: undefined symbol: ls b_init

Re: [OMPI users] ssh MPi and program tests

[OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1

Re: [OMPI users] ssh MPi and program tests

Re: [OMPI users] Problem with running openMPI program

[OMPI users] Factor of 10 loss in performance with 1.3.x

Re: [OMPI users] ssh MPi and program tests

Re: [OMPI users] Problem with running openMPI program

Re: [OMPI users] Problem with running openMPI program

Re: [OMPI users] ssh MPi and program tests

Re: [OMPI users] Problem with running openMPI program

Re: [OMPI users] Problem with running openMPI program

Re: [OMPI users] ssh MPi and program tests

[OMPI users] ssh MPi and program tests

Re: [OMPI users] Problem with running openMPI program

Re: [OMPI users] Bogus memcpy or bogus valgrind record

23 matches

Site Navigation

Mail list logo

Footer information