Re: [OMPI devel] 1.7.5 status

2014-02-12 Thread Jeff Squyres (jsquyres)
idx_null is a datatype test, but it makes one datatype call into the MPI_File stuff. So I wonder if it's failing with the new ROMIO...? That being said, I'm unable to get this to fail manually. On Feb 11, 2014, at 10:18 PM, Ralph Castain wrote: > Things are looking relatively good - I see t

Re: [OMPI devel] v1.7.4, mpiexec "exit 1" and no other warning - behaviour changed to previous versions

2014-02-12 Thread Ralph Castain
Could you please give the nightly 1.7.5 tarball a try using the same cmd line options and send me the output? I see the problem, but am trying to understand how it happens. I've added a bunch of diagnostic statements that should help me track it down. Thanks Ralph On Feb 12, 2014, at 1:26 AM,

Re: [OMPI devel] v1.7.5a1: mpirun failure on ppc/linux (regression vs 1.7.4)

2014-02-12 Thread Paul Hargrove
Yes, Jeff, this was resolved. It was the same problem others were chasing on x86-64 platforms at about the same time. -Paul On Wed, Feb 12, 2014 at 5:58 AM, Jeff Squyres (jsquyres) wrote: > Has this issue been resolved? > > > On Feb 9, 2014, at 5:35 PM, Paul Hargrove wrote: > > > Below is som

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
Interesting - good to know. Thanks On Feb 12, 2014, at 10:38 AM, Adrian Reber wrote: > It seems this is indeed a Moab bug for interactive jobs. At least a bug > was opened against moab. Using non-interactive jobs the variables have > the correct values and mpirun has no problems detecting the co

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
It seems this is indeed a Moab bug for interactive jobs. At least a bug was opened against moab. Using non-interactive jobs the variables have the correct values and mpirun has no problems detecting the correct number of cores. On Wed, Feb 12, 2014 at 07:50:40AM -0800, Ralph Castain wrote: > Anoth

Re: [OMPI devel] fail when linking against libmpi.so

2014-02-12 Thread Jeff Squyres (jsquyres)
Mike -- this should be fixed. Has Jenkins been re-run yet? On Feb 12, 2014, at 9:30 AM, Ralph Castain wrote: > I can't reproduce this regardless - since you are using a git mirror, are > you sure you don't have a problem over there? > > > On Feb 11, 2014, at 11:05 PM, Bert Wesarg wrote:

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
Another possibility to check - it is entirely possible that Moab is miscommunicating the values to Slurm. You might need to check it - I'll install a copy of 2.6.5 on my machines and see if I get similar issues when Slurm does the allocation itself. On Feb 12, 2014, at 7:47 AM, Ralph Castain w

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
On Wed, Feb 12, 2014 at 07:47:53AM -0800, Ralph Castain wrote: > > > > $ msub -I -l nodes=3:ppn=8 > > salloc: Job is in held state, pending scheduler release > > salloc: Pending job allocation 131828 > > salloc: job 131828 queued and waiting for resources > > salloc: job 131828 has been allocated

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
On Feb 12, 2014, at 7:32 AM, Adrian Reber wrote: > > $ msub -I -l nodes=3:ppn=8 > salloc: Job is in held state, pending scheduler release > salloc: Pending job allocation 131828 > salloc: job 131828 queued and waiting for resources > salloc: job 131828 has been allocated resources > salloc: Gra

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
$ msub -I -l nodes=3:ppn=8
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 131828
salloc: job 131828 queued and waiting for resources
salloc: job 131828 has been allocated resources
salloc: Granted job allocation 131828
sh-4.1$ echo $SLURM_TASKS_PER_NODE
1 s

Re: [OMPI devel] 1.7.5 status

2014-02-12 Thread Nathan Hjelm
Yeah. The coll/ml changes fix intercomm_create. -Nathan On Tue, Feb 11, 2014 at 07:18:54PM -0800, Ralph Castain wrote: > Things are looking relatively good - I see two recurring failures: > > 1. idx_null - no idea what that test does, but it routinely fails > > 2. intercomm_create - this is the

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
...and your version of Slurm? On Feb 12, 2014, at 7:19 AM, Ralph Castain wrote: > What is your SLURM_TASKS_PER_NODE? > > On Feb 12, 2014, at 6:58 AM, Adrian Reber wrote: > >> No, the system has only a few MOAB_* variables and many SLURM_* >> variables: >> >> $BASH $IF

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
What is your SLURM_TASKS_PER_NODE? On Feb 12, 2014, at 6:58 AM, Adrian Reber wrote: > No, the system has only a few MOAB_* variables and many SLURM_* > variables: > > $BASH $IFS $SECONDS > $SLURM_PTY_PORT > $BASHOPTS

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
No, the system has only a few MOAB_* variables and many SLURM_* variables: $BASH $IFS $SECONDS $SLURM_PTY_PORT $BASHOPTS $LINENO $SHELL $SLURM_PTY_WIN_COL $BASHP

Re: [OMPI devel] fail when linking against libmpi.so

2014-02-12 Thread Ralph Castain
I can't reproduce this regardless - since you are using a git mirror, are you sure you don't have a problem over there? On Feb 11, 2014, at 11:05 PM, Bert Wesarg wrote: > On 02/12/2014 07:31 AM, Mike Dubman wrote: >> Hi, >> Following changes caused failure: >> >> >>1. Fixes #4239: Move

Re: [OMPI devel] SLURM affinity accounting in Open MPI

2014-02-12 Thread Ralph Castain
I'm not entirely comfortable with the solution, as the problem truly is that we are doing what you asked - i.e., if you tell Slurm to bind tasks to a single core, then we live within it. The problem with your proposed fix is that we override whatever the user may have actually wanted - e.g., if

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Ralph Castain
Seems rather odd - since this is managed by Moab, you shouldn't be seeing SLURM envars at all. What you should see are PBS_* envars, including a PBS_NODEFILE that actually contains the allocation. On Feb 12, 2014, at 4:42 AM, Adrian Reber wrote: > I tried the nightly snapshot (openmpi-1.7.5a1
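
A minimal diagnostic sketch for this question, assuming the msub session from the original report; it relies only on the SLURM_*/PBS_* variables already mentioned in the thread:

$ msub -I -l nodes=3:ppn=8
# compare what Moab, Slurm, and PBS each export to the job shell
$ env | grep -E '^(SLURM|MOAB|PBS)_' | sort
# the two values discussed in this thread
$ echo "SLURM_TASKS_PER_NODE=$SLURM_TASKS_PER_NODE"
$ [ -n "$PBS_NODEFILE" ] && cat "$PBS_NODEFILE"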

Re: [OMPI devel] v1.7.5a1: mpirun failure on ppc/linux (regression vs 1.7.4)

2014-02-12 Thread Ralph Castain
Yes - it was a missing changeset that Rolf tracked down and has since been applied On Feb 12, 2014, at 5:58 AM, Jeff Squyres (jsquyres) wrote: > Has this issue been resolved? > > > On Feb 9, 2014, at 5:35 PM, Paul Hargrove wrote: > >> Below is some info collected from a core generated from

Re: [OMPI devel] [PATCH] Re: Still having issues w/ opal_path_nfs and EPERM

2014-02-12 Thread Jeff Squyres (jsquyres)
This patch was applied to both trunk and v1.7; thanks Paul. On Feb 9, 2014, at 7:36 PM, Paul Hargrove wrote: > I found the source of the problem, and a solution. > > The following is r30612, in which Jeff thought he had fixed the problem: > > --- opal/util/path.c(revision 30611) > +++ opal

Re: [OMPI devel] v1.7.5a1: mpirun failure on ppc/linux (regression vs 1.7.4)

2014-02-12 Thread Jeff Squyres (jsquyres)
Has this issue been resolved? On Feb 9, 2014, at 5:35 PM, Paul Hargrove wrote: > Below is some info collected from a core generated from running ring_c > without mpirun. > It looks like a bogus btl_module pointer or corrupted object is the culprit > in this crash. > > -Paul > > Core was gen
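
A minimal sketch of how that kind of core-file information can be gathered, assuming gdb is available and the core was written next to the ring_c binary (the file names are placeholders):

$ gdb ./ring_c core
(gdb) bt full
(gdb) info registers

bt full prints the backtrace of the crashed thread with local variables; info registers shows the register state at the point of the crash.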

Re: [OMPI devel] v1.7 and trunk: hello_oshmemfh link failure with xlc/ppc32/linux

2014-02-12 Thread Jeff Squyres (jsquyres)
Filed as https://svn.open-mpi.org/trac/ompi/ticket/4262 On Feb 8, 2014, at 8:22 PM, Paul Hargrove wrote: > Testing the current v1.7 tarball (1.7.5a1r30634), I get a failure when > building the oshmem examples. > I've confirmed that the same problem exists on trunk (so not a problem with > the

[OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system with slurm and moab. I requested an interactive session using: msub -I -l nodes=3:ppn=8 and started a simple test case which fails: $ mpirun -np 2 ./mpi-test 1

[OMPI devel] SLURM affinity accounting in Open MPI

2014-02-12 Thread Artem Polyakov
Hello, I found that SLURM installations that use the cgroup plugin and have TaskAffinity=yes in cgroup.conf have problems with Open MPI: all processes on non-launch nodes are assigned to one core. This leads to quite poor performance. The problem can be seen only when using mpirun to start parallel applica
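
A minimal sketch that makes the symptom visible, assuming a Linux /proc filesystem and an illustrative rank count of 16: print each process's host and allowed CPU list.

# each rank reports its host and the CPUs it is allowed to run on
$ mpirun -np 16 bash -c 'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'

With the cgroup/TaskAffinity problem described above, the ranks on non-launch nodes would all report the same single core.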

Re: [OMPI devel] v1.7.4, mpiexec "exit 1" and no other warning - behaviour changed to previous versions

2014-02-12 Thread Paul Kapinos
As said, the change in behaviour is new in 1.7.4 - all previous versions worked. Moreover, setting "-mca oob_tcp_if_include ib0" is a workaround in older versions of Open MPI for a 60-second timeout when starting the same command (which is still successful); or for infinite waiting
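
For reference, a sketch of how that workaround is typically passed; the interface name ib0 comes from the message above, and the executable name is a placeholder:

# restrict the out-of-band TCP channel to the IPoIB interface
$ mpiexec -mca oob_tcp_if_include ib0 -np 2 ./a.out
# equivalently, set the MCA parameter through the environment
$ export OMPI_MCA_oob_tcp_if_include=ib0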

Re: [OMPI devel] fail when linking against libmpi.so

2014-02-12 Thread Bert Wesarg
On 02/12/2014 07:31 AM, Mike Dubman wrote: Hi, the following changes caused a failure: 1. Fixes #4239: Move r30642 to v1.7 branch (purge stale session dirs at startup) (detail / gitblit

[OMPI devel] fail in vt

2014-02-12 Thread Mike Dubman
Hi, the following changes caused a failure: 1. Fixes #4239: Move r30642 to v1.7 branch (purge stale session dirs at startup) (detail / gitblit