Re: [OMPI users] OpenMPI on Windows - policy

2013-06-25 Thread Mathieu Gontier
Thanks, guys, for this information.
Mathieu.


On Tue, Jun 25, 2013 at 12:49 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> Also, we have no way of making 1.6.x releases for Windows any more.  So
> Windows support unfortunately ended mid-series.  :-\
>
>
> On Jun 24, 2013, at 6:40 PM, Ralph Castain  wrote:
>
> > Our Windows supporter has left for greener pastures. Long term, we may
> > have an org that will want to support it. However, for now, support has been
> > dropped in 1.7.
> >
> > Sent from my iPhone
> >
> > On Jun 24, 2013, at 1:42 PM, Mathieu Gontier 
> wrote:
> >
> >> Dear all,
> >>
> >> I have used OpenMPI for more than a decade, and four years ago I convinced
> >> my technical director to use it in official releases of our commercial
> >> applications. It solved all our problems and our users are quite happy with
> >> it.
> >>
> >> Now, I am looking for a good Windows MPI solution and I am testing
> >> OpenMPI-1.6.3 from the Windows installers I downloaded a few weeks ago.  But
> >> I am wondering about the long-term support: version 1.6.4 only offers a
> >> build for Cygwin.
> >>
> >> So, can someone share any information about Windows support in the coming
> >> versions? I am mainly interested in Windows installers; Cygwin support does
> >> not really match my expectations and constraints.
> >>
> >> Thanks a lot,
> >> Mathieu.
> >>
> >> --
> >> Mathieu Gontier
> >> - MSN: mathieu.gont...@gmail.com
> >> - Skype: mathieu_gontier
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>



-- 
Mathieu Gontier
- MSN: mathieu.gont...@gmail.com
- Skype: mathieu_gontier


Re: [OMPI users] OpenMPI 1.6.4 and Intel Composer_xe_2013.4.183: problem with remote runs, orted: error while loading shared libraries: libimf.so

2013-06-25 Thread Stefano Zaghi
Dear All,
I have performed some tests and finally ran mpiexec successfully without
symlinks. As Thomas said, my error was the LD_LIBRARY_PATH setting. The
correct setup is the following:

source /home/stefano/opt/intel/2013.4.183/bin/compilervars.sh intel64
export MPI=/home/stefano/opt/mpi/openmpi/1.6.4/intel
export PATH=${MPI}/bin:$PATH
export LD_LIBRARY_PATH=/home/stefano/opt/intel/2013.4.183/lib:${MPI}/lib/openmpi:${MPI}/lib:$LD_LIBRARY_PATH
export LD_RUN_PATH=${MPI}/lib/openmpi:${MPI}/lib:$LD_RUN_PATH

With the above setting, mpiexec (orted) finds all of its shared libraries,
also for runs on remote nodes. My previous setups were wrong because:

1) in the first test I had forgotten /home/stefano/opt/intel/2013.4.183/lib
in the LD_LIBRARY_PATH;
2) in the second test I had used /home/stefano/opt/intel/2013.4.183/lib/intel64
in the LD_LIBRARY_PATH.

It seems that sourcing compilervars.sh does not set the correct
LD_LIBRARY_PATH.

Thank you for all the suggestions,
sincerely


Stefano Zaghi
Ph.D. Aerospace Engineer,
Research Scientist, Dept. of Computational Hydrodynamics at CNR-INSEAN,
The Italian Ship Model Basin
(+39) 06.50299297 (Office)
My codes:
OFF, Open source Finite volumes Fluid dynamics code
Lib_VTK_IO, a Fortran library to write and read data conforming to the VTK standard
IR_Precision, a Fortran (standard 2003) module to develop portable codes


2013/6/21 Stefano Zaghi 

> Dear All,
> I have compiled OpenMPI 1.6.4 with Intel Composer_xe_2013.4.183.
>
> My configure is:
>
> ./configure --prefix=/home/stefano/opt/mpi/openmpi/1.6.4/intel CC=icc
> CXX=icpc F77=ifort FC=ifort
>
> Intel Composer has been installed in:
>
> /home/stefano/opt/intel/2013.4.183/composer_xe_2013.4.183
>
> In the .bashrc and .profile on all nodes there is:
>
> source /home/stefano/opt/intel/2013.4.183/bin/compilervars.sh intel64
> export MPI=/home/stefano/opt/mpi/openmpi/1.6.4/intel
> export PATH=${MPI}/bin:$PATH
> export LD_LIBRARY_PATH=${MPI}/lib/openmpi:${MPI}/lib:$LD_LIBRARY_PATH
> export LD_RUN_PATH=${MPI}/lib/openmpi:${MPI}/lib:$LD_RUN_PATH
>
> If I run a parallel job within a single node (e.g. mpirun -np 8 myprog), all
> works well. However, when I try to run a parallel job across several nodes of
> the cluster (remote runs), like the following:
>
> mpirun -np 16 --bynode --machinefile nodi.txt -x LD_LIBRARY_PATH -x
> LD_RUN_PATH myprog
>
> I got the following error:
>
> /home/stefano/opt/mpi/openmpi/1.6.4/intel/bin/orted: error while loading
> shared libraries: libimf.so: cannot open shared object file: No such file
> or directory
>
> I have read many FAQs and online resources, all indicating LD_LIBRARY_PATH
> as the possible problem (wrong setting). However, I am not able to figure
> out what is going wrong; the LD_LIBRARY_PATH seems to be set correctly on all
> nodes.
>
> It is worth noting that on the same cluster I have successfully installed
> OpenMPI 1.4.3 with Intel Composer_xe_2011_sp1.6.233 following exactly the
> same procedure.
>
> Thank you in advance for any suggestions,
> sincerely
>
> Stefano Zaghi
> Ph.D. Aerospace Engineer,
> Research Scientist, Dept. of Computational Hydrodynamics at CNR-INSEAN,
> The Italian Ship Model Basin
> (+39) 06.50299297 (Office)
> My codes:
> OFF, Open source Finite volumes Fluid dynamics code
> Lib_VTK_IO, a Fortran library to write and read data conforming to the VTK standard
> IR_Precision, a Fortran (standard 2003) module to develop portable codes
>


Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1

2013-06-25 Thread Ralph Castain
I'll ignore the rest of this thread as it kinda diverged from your original
question. I've been reviewing the code, and I think I'm getting a handle on
the issue.

Just to be clear - your hostname resolves to the 127 address? And you are
on a Linux (not one of the BSD flavors out there)?

If the answer to both is "yes", then the problem is that we ignore loopback
devices if anything else is present. When we check to see if the hostname
we were given is the local node, we resolve the name to the address and
then check our list of interfaces. The loopback device is ignored and
therefore not on the list. So if you resolve to the 127 address, we will
decide this is a different node than the one we are on.
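
To make that logic concrete, here is a minimal, self-contained C sketch of the check
described above (an illustration only, not Open MPI's actual code): resolve the given
hostname to an IPv4 address, walk the local interfaces while skipping loopback devices,
and compare addresses.

/* Illustrative sketch only (not Open MPI source code): does "name" resolve to
 * one of this node's non-loopback interface addresses? */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netdb.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netinet/in.h>

static bool hostname_is_local(const char *name)
{
    struct addrinfo hints, *res = NULL;
    struct ifaddrs *ifap = NULL, *ifa;
    struct in_addr want;
    bool local = false;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;              /* IPv4 only, to keep the sketch short */
    if (getaddrinfo(name, NULL, &hints, &res) != 0 || res == NULL)
        return false;
    want = ((struct sockaddr_in *)res->ai_addr)->sin_addr;  /* primary address, e.g. 127.0.1.1 */
    freeaddrinfo(res);

    if (getifaddrs(&ifap) != 0)
        return false;
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;
        if (ifa->ifa_flags & IFF_LOOPBACK)  /* loopback is skipped, so 127.x never matches */
            continue;
        if (want.s_addr == ((struct sockaddr_in *)ifa->ifa_addr)->sin_addr.s_addr) {
            local = true;
            break;
        }
    }
    freeifaddrs(ifap);
    return local;
}

int main(int argc, char **argv)
{
    const char *name = argc > 1 ? argv[1] : "localhost";
    printf("%s looks %s\n", name, hostname_is_local(name) ? "local" : "remote");
    return 0;
}

With an /etc/hosts like the one quoted below, where the hostname resolves to 127.0.1.1
first, this sketch reports the node's own name as "remote", which is the
misclassification described above.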

I can modify that logic, but want to ensure this accurately captures the
problem. I'll also have to discuss the change with the other developers to
ensure we don't shoot ourselves in the foot if we make it.



On Thu, Jun 20, 2013 at 2:56 AM, Riccardo Murri wrote:

> On 20 June 2013 06:33, Ralph Castain  wrote:
> > Been trying to decipher this problem, and think maybe I'm beginning to
> > understand it. Just to clarify:
> >
> > * when you execute "hostname", you get the .local response?
>
> Yes:
>
> [rmurri@nh64-2-11 ~]$ hostname
> nh64-2-11.local
>
> [rmurri@nh64-2-11 ~]$ uname -n
> nh64-2-11.local
>
> [rmurri@nh64-2-11 ~]$ hostname -s
> nh64-2-11
>
> [rmurri@nh64-2-11 ~]$ hostname -f
> nh64-2-11.local
>
>
> > * you somewhere have it set up so that 10.x.x.x resolves to , with no
> > ".local" extension?
>
> No. Host name resolution is correct, but the hostname resolves to the
> 127.0.1.1 address:
>
> [rmurri@nh64-2-11 ~]$ getent hosts `hostname`
> 127.0.1.1   nh64-2-11.local nh64-2-11
>
> Note that `/etc/hosts` also lists a 10.x.x.x address, which is the one
> actually assigned to the ethernet interface:
>
> [rmurri@nh64-2-11 ~]$ fgrep `hostname -s` /etc/hosts
> 127.0.1.1   nh64-2-11.local nh64-2-11
> 10.1.255.201nh64-2-11.local nh64-2-11
> 192.168.255.206 nh64-2-11-myri0
>
> If we remove the `127.0.1.1` line from `/etc/hosts`, then everything
> works again.  Also, everything works if we use only FQDNs in the
> hostfile.
>
> So it seems that the 127.0.1.1 address is treated specially.
>
> Thanks,
> Riccardo
>


Re: [OMPI users] Application hangs on mpi_waitall

2013-06-25 Thread eblosch
An update: I recoded the mpi_waitall as a loop over the requests with
mpi_test and a 30 second timeout.  The timeout happens unpredictably,
sometimes after 10 minutes of run time, other times after 15 minutes, for
the exact same case.

After 30 seconds, I print out the status of all outstanding receive
requests.  The message tags that are outstanding have definitely been
sent, so I am wondering why they are not being received.

As I said before, everybody posts non-blocking standard receives, then
non-blocking standard sends, then calls mpi_waitall. Each process is
typically waiting on 200 to 300 requests. Is deadlock possible via this
implementation approach under some kind of unusual conditions?
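
For concreteness, here is a minimal sketch in C of the approach described above
(illustrative only, not the actual application code; the peer list, buffers, tag,
and timeout are hypothetical parameters): post all non-blocking receives, then all
non-blocking sends, then poll the whole request array instead of calling
mpi_waitall, and report any request still pending once the timeout expires.

#include <mpi.h>
#include <stdio.h>

int exchange_with_timeout(int npeers, const int peer[], double *rbuf[],
                          double *sbuf[], const int count[],
                          double timeout_sec, MPI_Comm comm)
{
    MPI_Request reqs[2 * npeers];        /* C99 VLA: receives first, then sends */
    int nreq = 0;

    for (int n = 0; n < npeers; n++)     /* post all non-blocking receives ... */
        MPI_Irecv(rbuf[n], count[n], MPI_DOUBLE, peer[n], 0, comm, &reqs[nreq++]);
    for (int n = 0; n < npeers; n++)     /* ... then all non-blocking sends */
        MPI_Isend(sbuf[n], count[n], MPI_DOUBLE, peer[n], 0, comm, &reqs[nreq++]);

    /* Instead of MPI_Waitall: poll, and dump the stragglers after the timeout. */
    double start = MPI_Wtime();
    int done = 0;
    while (!done) {
        MPI_Testall(nreq, reqs, &done, MPI_STATUSES_IGNORE);
        if (!done && MPI_Wtime() - start > timeout_sec) {
            for (int i = 0; i < nreq; i++) {
                int flag = 0;
                MPI_Test(&reqs[i], &flag, MPI_STATUS_IGNORE);
                if (!flag)
                    fprintf(stderr, "request %d (%s peer %d) still pending\n",
                            i, i < npeers ? "recv from" : "send to",
                            peer[i % npeers]);
            }
            return MPI_ERR_PENDING;      /* let the caller decide how to react */
        }
    }
    return MPI_SUCCESS;
}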

Thanks again,

Ed

> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
> returns.  The case runs fine with MVAPICH.  The logic associated with the
> communications has been extensively debugged in the past; we don't think
> it has errors.   Each process posts non-blocking receives, non-blocking
> sends, and then does waitall on all the outstanding requests.
>
> The work is broken down into 960 chunks. If I run with 960 processes (60
> nodes of 16 cores each), things seem to work.  If I use 160 processes
> (each process handling 6 chunks of work), then each process is handling 6
> times as much communication, and that is the case that hangs with OpenMPI
> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
> start, diagnostically?  We're using the openib btl.
>
> Thanks,
>
> Ed




[OMPI users] error: unknown type name 'ompi_jobid_t'

2013-06-25 Thread Jeff Hammond
I observe this error with the OpenMPI 1.7.1 "feature":

Making all in mca/common/ofacm
make[2]: Entering directory
`/gpfs/mira-home/jhammond/MPI/openmpi-1.7.1/build-gcc/ompi/mca/common/ofacm'
  CC   common_ofacm_xoob.lo
../../../../../ompi/mca/common/ofacm/common_ofacm_xoob.c:158:91:
error: unknown type name 'ompi_jobid_t'
 static int xoob_ib_address_init(ofacm_ib_address_t *ib_addr, uint16_t
lid, uint64_t s_id, ompi_jobid_t ep_jobid)

^
../../../../../ompi/mca/common/ofacm/common_ofacm_xoob.c: In function
'xoob_ib_address_add_new':
../../../../../ompi/mca/common/ofacm/common_ofacm_xoob.c:189:5:
warning: implicit declaration of function 'xoob_ib_address_init'
[-Wimplicit-function-declaration]
 ret = xoob_ib_address_init(ib_addr, lid, s_id, ep_jobid);
 ^
make[2]: *** [common_ofacm_xoob.lo] Error 1
make[2]: Leaving directory
`/gpfs/mira-home/jhammond/MPI/openmpi-1.7.1/build-gcc/ompi/mca/common/ofacm'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory
`/gpfs/mira-home/jhammond/MPI/openmpi-1.7.1/build-gcc/ompi'
make: *** [all-recursive] Error 1

I invoked configure like this:

../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran
--prefix=/home/jhammond/MPI/openmpi-1.7.1/install-gcc --with-verbs
--enable-mpi-thread-multiple --enable-static --enable-shared

My config.log is attached with bzip2 compression; if you do not
trust binary attachments, please go to Dropbox and blindly download
the uncompressed text file.

https://www.dropbox.com/l/ZxZoE6FNROZuBY7I7wdsgc

Any suggestions?  I asked the Google and it had not heard of this
particular error message before.

Thanks,

Jeff

PS Please do not tell Pavan I was here :-)
PPS I recognize the Streisand effect is now in play and that someone
will deliberately disobey the previous request because I made it.

-- 
Jeff Hammond
jeff.scie...@gmail.com

