Sorry for the delayed response.....

I downloaded and built the 1.8.3-272-g4e4f997 tarball referenced below.


A single-node job runs OK, on the correct cores, etc.

A multi-node job dies with the following error (no core dumps now):


>>>A specified physical processor does not exist in this topology:

>>>  CPU number:     0
>>>  Cpu set given:  None
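
For what it's worth, one way to sanity-check what hwloc actually sees on each node is to run its lstopo tool there (assuming hwloc is installed on the nodes; this is just a debugging sketch, and node1/node2 are placeholder hostnames):

for host in node1 node2; do
    echo "== $host =="
    ssh "$host" lstopo --no-io
done

That lists every core/PU with both its logical (L#) and OS/physical (P#) index, so I can see whether a physical CPU 0 really exists on each node.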


My mpirun line looks like:

/apps/share/openmpi/1.8.3-272-g4e4f997/bin/mpirun \
    --prefix /apps/share/openmpi/1.8.3-272-g4e4f997 \
    --mca btl openib,tcp,sm,self \
    --x LD_LIBRARY_PATH \
    --np 64 myexe -i br.i -l tommy1.o
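
In case it helps, I can re-run the same command with Open MPI's mapping/binding diagnostics turned on (just a sketch; --report-bindings and --display-map are the only additions):

/apps/share/openmpi/1.8.3-272-g4e4f997/bin/mpirun \
    --prefix /apps/share/openmpi/1.8.3-272-g4e4f997 \
    --mca btl openib,tcp,sm,self \
    --x LD_LIBRARY_PATH \
    --report-bindings --display-map \
    --np 64 myexe -i br.i -l tommy1.o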


My configure options for Open MPI are:


version=1.8.3-272-g4e4f997

./configure \
    --disable-vt \
    --prefix=/apps/share/openmpi/$version \
    --disable-shared \
    --enable-static \
    --with-openib \
    --enable-mpirun-prefix-by-default \
    --with-memory-manager=none \
    --with-hwloc \
    --with-lsf=/apps/share/LSF/9.1.3/9.1 \
    --with-lsf-libdir=/apps/share/LSF/9.1.3/9.1/linux2.6-glibc2.3-x86_64/lib \
    --with-wrapper-cflags="-shared-intel" \
    --with-wrapper-cxxflags="-shared-intel" \
    --with-wrapper-ldflags="-shared-intel" \
    --with-wrapper-fcflags="-shared-intel" \
    --enable-mpi-ext
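
For completeness, this is how I check that the resulting build actually picked up the LSF and openib components (just a quick sanity check with ompi_info; the grep pattern is mine):

/apps/share/openmpi/$version/bin/ompi_info | grep -i -e lsf -e openib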


Can you see anything that should or shouldn't be there?


Thanks



________________________________
From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain 
<r...@open-mpi.org>
Sent: Monday, December 15, 2014 10:07 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] 1.8.4rc Status

My correction - the fix is in the nightly tarball from tonight. You can get it 
here:

openmpi-v1.8.3-272-g4e4f997.tar.bz2: http://www.open-mpi.org/nightly/v1.8/openmpi-v1.8.3-272-g4e4f997.tar.bz2



On Mon, Dec 15, 2014 at 2:40 PM, Ralph Castain <r...@open-mpi.org> wrote:
Hey Tom

Note that rc2 had a bug in the out-of-band messaging system - might be what you 
are hitting. I'd suggest working with rc4.


On Mon, Dec 15, 2014 at 12:57 PM, Tom Wurgler <twu...@goodyear.com> wrote:

I have to take it back.  While the first job used less than a node's worth of cores and ran properly on the cores I wanted, more testing is revealing other problems.

Anything that spans more than one node crashes and burns, with a core dump, and 
nothing in the files to indicate why.

Note this is still rc2....

More testing is ongoing....


________________________________
From: devel <devel-boun...@open-mpi.org> on behalf of Tom Wurgler <twu...@goodyear.com>
Sent: Monday, December 15, 2014 1:23 PM

To: Open MPI Developers
Subject: Re: [OMPI devel] 1.8.4rc Status


It seems to be working in rc2 after all.

I was still trying to use a rankfile, but it appears that is no longer needed.
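
For reference, the rankfile I had been feeding to mpirun looked roughly like the sketch below (hostnames are placeholders; the slot syntax is the socket:core form from the mpirun man page):

rank 0=nodeA slot=0:0
rank 1=nodeA slot=0:1
rank 2=nodeB slot=1:0
rank 3=nodeB slot=1:1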

Thanks!


________________________________
From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <r...@open-mpi.org>
Sent: Monday, December 15, 2014 8:45 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] 1.8.4rc Status

Should be there in rc4, and I thought it made it to rc2 for that matter. I'll 
take a gander.

FWIW: I'm working off-list with IBM to tighten the LSF integration so we 
correctly read and follow their binding directives. This will also be in 1.8.4, 
as we are in final testing with it now.

Ralph


On Mon, Dec 15, 2014 at 5:40 AM, Tom Wurgler <twu...@goodyear.com> wrote:
Forgive me if I've missed it, but I believe using physical OR logical core 
numbering was going to be reimplemented in the 1.8.4 series.


I've checked out rc2 and, as far as I can tell, it isn't there yet. Is this 
correct?
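
To illustrate what I mean by the two numbering schemes: lstopo reports both, roughly like the sketch below, where L# is the logical index and P# is the OS/physical index (the exact values here are made up):

  Core L#0
    PU L#0 (P#0)
    PU L#1 (P#16)
  Core L#1
    PU L#2 (P#1)
    PU L#3 (P#17)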


thanks!


________________________________
From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <r...@open-mpi.org>
Sent: Monday, December 15, 2014 8:35 AM
To: Open MPI Developers
Subject: [OMPI devel] 1.8.4rc Status

Hi folks

Trying to summarize the current situation on releasing 1.8.4. Remaining 
identified issues:

1. TCP BTL hang under mpi-thread-multiple. I've asked George to look into it.

2. hwloc updates required. Brice committed them to the hwloc 1.7 repo. Gilles 
volunteered to create the PR from there.

3. Fortran f08 bindings to be disabled for compilers not meeting certain conditions. 
PR from Gilles is awaiting review by Jeff.

4. Topo signature issue reported by IBM. Ralph is waiting for more debug output.

5. MPI/IO issue reported by Eric Chamberland. Gilles investigating.

6. make check issue on SPARC. Problem and fix reported by Paul Hargrove; Ralph 
will commit.

7. Linkage issue on Solaris 11 reported by Paul Hargrove. The multi-threaded C 
libraries are missing; apparently "-mt=yes" is needed in both the compile and 
link steps (see the sketch below). Someone needs to investigate.
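
A minimal sketch of what that would look like with the Solaris Studio compilers (hypothetical file names):

cc -mt=yes -c foo.c
cc -mt=yes foo.o -o foo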

Please let me know if I've missed anything.
Ralph


_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/12/16595.php

