Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Syed Ahsan Ali
Hi Josh, It was my mistake. The status of the node generating the error is pasted below: Infiniband device 'mlx4_0' port 1 status: default gid: fe80::::0018:8b90:97fe:94fe base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state:

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Syed Ahsan Ali
Dear Pasha, The ibstatus output is not from two different machines; it is from the same machine. Two InfiniBand ports show up on all nodes. I checked all the nodes, and on each one one of the ports is always in the INIT state and the other is ACTIVE. Now please see below the ibstatus of the node causing the problem (co
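Checking "one port INIT, the other ACTIVE" by eye on every node is error-prone. The sketch below parses ibstatus text and lists every port whose logical state is not ACTIVE. It assumes the usual ibstatus layout (a "Infiniband device '...' port N status:" header followed by a "state: N: NAME" line); the sample output is hypothetical and abbreviated.

```python
import re

# Header printed once per port, e.g. "Infiniband device 'mlx4_0' port 1 status:"
DEVICE_RE = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")
# Logical state line, e.g. "\tstate:\t\t 2: INIT" (does not match "phys state:")
STATE_RE = re.compile(r"^\s*state:\s*\d+:\s*(\w+)")


def port_states(ibstatus_text):
    """Map (device, port) -> logical state name parsed from ibstatus output."""
    states = {}
    current = None
    for line in ibstatus_text.splitlines():
        header = DEVICE_RE.search(line)
        if header:
            current = (header.group(1), int(header.group(2)))
            continue
        state = STATE_RE.match(line)
        if state and current is not None:
            states[current] = state.group(1)
    return states


def non_active_ports(ibstatus_text):
    """List every (device, port) whose logical state is not ACTIVE."""
    return sorted(k for k, v in port_states(ibstatus_text).items() if v != "ACTIVE")


# Hypothetical, abbreviated ibstatus output for one node with two ports.
SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
\tbase lid:\t 0x0
\tstate:\t\t 2: INIT
\tphys state:\t 5: LinkUp
Infiniband device 'mlx4_0' port 2 status:
\tbase lid:\t 0x5
\tstate:\t\t 4: ACTIVE
\tphys state:\t 5: LinkUp
"""

print(non_active_ports(SAMPLE))  # [('mlx4_0', 1)]
```

Run against the saved ibstatus output of each node, an empty list means every port on that node is ACTIVE.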

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots

2014-07-22 Thread Jeff Squyres (jsquyres)
Hyperthreading is pretty great for non-HPC applications, which is why Intel makes it. But hyperthreading *generally* does not help HPC application performance. You're basically halving several on-chip resources / queues / pipelines, and that can hurt for performance-hungry HPC applications. T

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots

2014-07-22 Thread Lane, William
Ralph, The 32-slot systems/nodes I'm running my Open MPI test code on have only 16 physical cores; the rest of the slots are hyperthreads. I've done some more testing and noticed that if I limit the number of slots per node to 8 (via -npernode 8) everything works and 8 slots are used from each syst

Re: [OMPI users] Salloc and mpirun problem

2014-07-22 Thread Ralph Castain
Okay, the problem is that the connection back to mpirun isn't getting through. We are trying the 10.0.251.53 address - is that blocked, or should we be using something else? If so, you might want to direct us by adding "-mca oob_tcp_if_include foo", where foo is the interface you want us to use
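For scripted launches, the suggested MCA option is just extra arguments on the mpirun command line. A minimal sketch, assuming a hypothetical interface name "eth0" and program "./a.out" (substitute the interface that actually reaches the node running mpirun):

```python
import shlex


def mpirun_argv(iface, nprocs, program, *args):
    """Build an mpirun argument list that restricts out-of-band TCP to one interface."""
    return [
        "mpirun",
        "-mca", "oob_tcp_if_include", iface,  # interface name is site-specific
        "-np", str(nprocs),
        program, *args,
    ]


cmd = mpirun_argv("eth0", 4, "./a.out")
print(shlex.join(cmd))  # mpirun -mca oob_tcp_if_include eth0 -np 4 ./a.out
```

Passing the list to subprocess.run(cmd, check=True) inside the salloc allocation avoids any shell re-parsing of the arguments.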

Re: [OMPI users] Incorrect escaping of OMPI_MCA environment variables with spaces (for rsh?)

2014-07-22 Thread Ralph Castain
Thanks - I revised the module so it quotes all params and envar values, and CMRd it for 1.8.2 (added you on the ticket) On Jul 21, 2014, at 2:42 AM, Dirk Schubert wrote: > Hello Ralph, > > thanks for your answer. > > > I can look to see if there is something generic we can do (perhaps > > en
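The underlying problem is general: a remote shell (as with rsh/ssh launching) re-parses the command line, so an unquoted value containing spaces is split into several words. The sketch below illustrates quoting each value with Python's shlex.quote; it is not the actual Open MPI module fix, and OMPI_MCA_my_param is a hypothetical parameter name.

```python
import shlex


def env_prefix(env):
    """Render VAR=value pairs so a remote shell re-parses each value as one word."""
    return " ".join(f"{name}={shlex.quote(value)}" for name, value in env.items())


# Hypothetical MCA parameter whose value contains spaces.
env = {"OMPI_MCA_my_param": "value with spaces"}

naive = " ".join(f"{name}={value}" for name, value in env.items())
print(shlex.split(naive))            # ['OMPI_MCA_my_param=value', 'with', 'spaces']
print(shlex.split(env_prefix(env)))  # ['OMPI_MCA_my_param=value with spaces']
```

shlex.split stands in here for the remote POSIX shell's word splitting.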

Re: [OMPI users] Question on process and memory affinity with 1.8.1

2014-07-22 Thread Ralph Castain
I'm not aware of any way to tell using ompi_info, I'm afraid. I'd have to ponder a bit as to how we could do so since it's a link to a library down below the one we directly use. On Jul 21, 2014, at 3:00 PM, Blosch, Edwin L wrote: > In making the leap from 1.6 to 1.8, how can I check whether

Re: [OMPI users] Using the Hydra process manager to launch Open MPI applications

2014-07-22 Thread Ralph Castain
Certainly possible, though I haven't heard of anyone doing so - not sure of the motivation, but nothing prevents it. You would need to build OMPI --with-pmi=, but that should be all that is required On Jul 22, 2014, at 1:59 AM, Jukka-Pekka Kekkonen wrote: > Hi, > > I am trying to get the Hyd

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots

2014-07-22 Thread Ralph Castain
Hmmm...that's not a "bug", but just a packaging issue with the way CentOS distributed some variants of OMPI that requires you to install/update things in a specific order. On Jul 20, 2014, at 11:34 PM, Lane, William wrote: > Please see: > > http://bugs.centos.org/view.php?id=5812 > > From: user

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots

2014-07-22 Thread Jeff Squyres (jsquyres)
Can you try upgrading to OMPI 1.6.5? 1.6.5 has *many* bug fixes compared to 1.5.4. A little background... Open MPI is developed in terms of release version pairs: "1.odd" are feature releases. We add new (and remove old) features, etc. We do a lot of testing, but this is all done in lab/tes
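The odd/even convention in the 1.x series can be expressed as a one-line rule on the minor version number: odd minor numbers (1.5, 1.7) are feature series, even minor numbers (1.6, 1.8) are stable series. A minimal sketch covering only the 1.x scheme discussed here:

```python
def release_series(version):
    """Classify an Open MPI 1.x version by the odd/even convention of that era."""
    major, minor = (int(part) for part in version.split(".")[:2])
    if major != 1:
        raise ValueError("this sketch only covers the 1.x odd/even scheme")
    # Odd minor number -> feature series; even minor number -> stable series.
    return "feature" if minor % 2 == 1 else "stable"


for v in ("1.5.4", "1.6.5", "1.8.1"):
    print(v, release_series(v))
```

By this rule, 1.5.4 is a feature-series release and 1.6.5 a stable one, which is why the upgrade is suggested.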

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Shamis, Pavel
Hmm, this does not make sense. Your copy-n-paste shows that both machines (00 and 01) have the same guid/lid (sort of the equivalent of a MAC address in the Ethernet world). As you can guess, these two cannot be identical for two different machines (unless you moved the card around). Best, Pasha On Jul 2
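Because a port GUID should be unique across the fabric, collecting one GUID per node and looking for repeats quickly shows whether two pastes actually came from the same machine. A minimal sketch; the node names are hypothetical, and the first GUID is the lower half of the gid pasted earlier in this thread.

```python
from collections import defaultdict


def duplicate_guids(guid_by_node):
    """Return {guid: [nodes]} for every port GUID reported by more than one node."""
    nodes_by_guid = defaultdict(list)
    for node, guid in guid_by_node.items():
        nodes_by_guid[guid.lower()].append(node)
    return {guid: sorted(nodes) for guid, nodes in nodes_by_guid.items() if len(nodes) > 1}


# Hypothetical node names; the third GUID is also hypothetical.
reported = {
    "node-00": "0018:8b90:97fe:94fe",
    "node-01": "0018:8b90:97fe:94fe",
    "node-02": "0018:8b90:97fe:9500",
}
print(duplicate_guids(reported))  # {'0018:8b90:97fe:94fe': ['node-00', 'node-01']}
```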

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Joshua Ladd
Syed, You might try this link (or have your sysadmin do it if you do not have admin privileges.) To me it looks like your second port is in the "INIT" state but has not been added by the subnet manager. https://software.intel.com/en-us/articles/troubleshooting-infiniband-connection-issues-using-
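The numbers in ibstatus "state:" lines correspond to the InfiniBand logical port states, and INIT is exactly the "link up, but not yet configured by the subnet manager" case described above. A small lookup sketch covering states 1 to 4 only; the hint wording is illustrative, not ibstatus output.

```python
from enum import IntEnum


class PortState(IntEnum):
    """Logical InfiniBand port states, numbered as in ibstatus 'state:' lines."""
    DOWN = 1
    INIT = 2
    ARMED = 3
    ACTIVE = 4


HINTS = {
    PortState.DOWN: "no logical link; check cable, switch port and phys state",
    PortState.INIT: "link is up but the subnet manager has not configured the port yet",
    PortState.ARMED: "configured by the subnet manager, not yet moved to ACTIVE",
    PortState.ACTIVE: "usable for MPI traffic",
}


def diagnose(state_number):
    """Turn the number from an ibstatus 'state:' line into a short hint."""
    state = PortState(state_number)  # raises ValueError for states not modelled here
    return f"{state.name}: {HINTS[state]}"


print(diagnose(2))  # INIT: link is up but the subnet manager has not configured the port yet
```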

[OMPI users] Using the Hydra process manager to launch Open MPI applications

2014-07-22 Thread Jukka-Pekka Kekkonen
Hi, I am trying to get the Hydra process manager (that is typically associated with mpich2) to launch Open MPI applications. Has someone here managed to do this or is it even possible? / JP

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Syed Ahsan Ali
And where can I find the run/job/submission? On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel wrote: > > You have to check the ports states on *all* nodes in the > run/job/submission. Checking on a single node is not enough. > My guess is the 01-00 tries to connect 01-01 and the ports are down on > 0
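"The run/job/submission" here just means the set of nodes the job was launched on. If the job uses an Open MPI hostfile, that set can be read from it, and each node then checked with ibstatus. A minimal sketch, assuming a hypothetical hostfile and passwordless ssh; under SLURM, scontrol show hostnames "$SLURM_JOB_NODELIST" expands the allocation into the same kind of list.

```python
def hosts_in_hostfile(text):
    """Return unique host names, in first-seen order, from an Open MPI-style hostfile."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        host = line.split()[0]  # first token is the host; "slots=N" etc. follow
        if host not in hosts:
            hosts.append(host)
    return hosts


# Hypothetical hostfile for a two-node run.
HOSTFILE = """\
# nodes used by this job
node-00 slots=16
node-01 slots=16
"""

for host in hosts_in_hostfile(HOSTFILE):
    # One ibstatus check per node in the run (assumes passwordless ssh).
    print(["ssh", host, "ibstatus"])
```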