-----------------------------------------------------
Arnaud HERITIER
Meteo France International
+33 561432940
arnaud.herit...@mfi.fr
------------------------------------------------------
On Mon, Dec 5, 2011 at 6:12 PM, Ralph Castain <r...@open-mpi.org> wrote:

> On Dec 5, 2011, at 5:49 AM, arnaud Heritier wrote:
>
>> Hello,
>>
>> I found the solution, thanks to QLogic support.
>>
>> The "can't open /dev/ipath, network down (err=26)" message from the
>> ipath driver is really misleading. It is actually a hardware context
>> problem on the QLogic PSM: PSM cannot allocate any hardware context
>> for the job because other MPI jobs have already used all available
>> contexts. To avoid this, every MPI job has to set the
>> PSM_SHAREDCONTEXTS_MAX variable to the right value for the number of
>> processes that will run on the node. Without this variable, PSM
>> "greedily" uses all contexts for the first MPI job spawned on the
>> node.
>
> Sounds like we should be setting this value when starting the process -
> yes? If so, what is the "good" value, and how do we compute it?

The good value is:

    roundup( $OMPI_COMM_WORLD_LOCAL_SIZE / context share ratio )

where the context share ratio is at most 4 on my HCA. QLogic provided me
with a simple script to compute this value. I just changed my mpirun
script to call this script, set PSM_SHAREDCONTEXTS_MAX to the returned
value, and then call the MPI binary. Script attached.

Arnaud

>> Regards,
>>
>> Arnaud
>>
>> On Tue, Nov 29, 2011 at 6:44 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>
>>> On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:
>>>
>>>> I do have a contract and I tried to open a case, but their support
>>>> is ......
>>>
>>> What happens if you put a delay between the two jobs? E.g., if you
>>> just delay a few seconds before the 2nd job starts? Perhaps the ipath
>>> device just needs a little time before it will be available...?
>>> (that's a total guess)
>>>
>>> I suggest this because the PSM device will definitely give you better
>>> overall performance than the QLogic verbs support. Their verbs
>>> support basically barely works; PSM is their primary device and the
>>> one that we always recommend.
>>>
>>>> Anyway, I'm still working on the strange error message from mpirun
>>>> saying it can't allocate memory when at the same time it also
>>>> reports that the memory is unlimited ...
>>>>
>>>> Arnaud
>>>>
>>>> On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>>
>>>>> I'm afraid we don't have any contacts left at QLogic to ask them
>>>>> any more... do you have a support contract, perchance?
>>>>>
>>>>> On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I run into a strange problem with QLogic OFED and Open MPI. When I
>>>>>> submit (through SGE) 2 jobs on the same node, the second job ends
>>>>>> up with:
>>>>>>
>>>>>>   (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
>>>>>>
>>>>>> I'm pretty sure the InfiniBand fabric is working well, as the
>>>>>> other job runs fine.
>>>>>>
>>>>>> Here are the details of the configuration:
>>>>>>
>>>>>>   QLogic HCA: InfiniPath_QMH7342 (2 ports, but only one connected
>>>>>>   to a switch)
>>>>>>   qlogic_ofed-1.5.3-7.0.0.0.35 (Rocks cluster roll)
>>>>>>   Open MPI 1.5.4 (./configure --with-psm --with-openib --with-sge)
>>>>>>
>>>>>> -------------
>>>>>>
>>>>>> To work around this problem I recompiled Open MPI without PSM
>>>>>> support, but I then faced another problem:
>>>>>>
>>>>>>   The OpenFabrics (openib) BTL failed to initialize while trying to
>>>>>>   allocate some locked memory. This typically can indicate that the
>>>>>>   memlock limits are set too low. For most HPC installations, the
>>>>>>   memlock limits should be set to "unlimited".
>>>>>>   The failure occurred here:
>>>>>>
>>>>>>     Local host:     compute-0-6.local
>>>>>>     OMPI source:    btl_openib.c:329
>>>>>>     Function:       ibv_create_srq()
>>>>>>     Device:         qib0
>>>>>>     Memlock limit:  unlimited
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
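For the openib failure quoted above (allocation failing even though the
memlock limit is reported as unlimited), a quick sanity check, not from the
original thread but a common first step, is to compare the memlock limit in
an interactive shell with the one actually seen by processes that mpirun
starts on the affected node, since limits from /etc/security/limits.conf do
not always propagate to jobs spawned through SGE:

    # Debugging sketch; the node name is taken from the error output above.
    ulimit -l                                   # limit in the local shell
    mpirun -np 1 --host compute-0-6.local sh -c 'ulimit -l'
                                                # limit seen by a spawned process

If the two values differ, the limit is being lost between the login shell
and the daemon that launches the MPI processes, not inside Open MPI itself.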
get_psm_sharedcontexts_max.sh
Description: Bourne shell script
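The attachment itself is not reproduced in the archive. Below is a minimal
sketch of the kind of wrapper described above, assuming the roundup formula
and a context share ratio of 4 as stated for this HCA; the wrapper name and
the per-rank-wrapper approach (rather than deriving the process count from
SGE) are assumptions, not the original script:

    #!/bin/sh
    # psm_ctx_wrapper.sh - hypothetical reconstruction, not the original
    # attachment.  mpirun launches this wrapper in place of the MPI binary,
    # so Open MPI has already put OMPI_COMM_WORLD_LOCAL_SIZE (the number
    # of ranks on this node) into the environment.
    RATIO=4                                      # max context share ratio on this HCA
    LOCAL_SIZE=${OMPI_COMM_WORLD_LOCAL_SIZE:-1}  # fall back to 1 outside mpirun
    # roundup(LOCAL_SIZE / RATIO) via integer arithmetic: (a + b - 1) / b
    PSM_SHAREDCONTEXTS_MAX=$(( (LOCAL_SIZE + RATIO - 1) / RATIO ))
    export PSM_SHAREDCONTEXTS_MAX
    exec "$@"                                    # run the real MPI binary

Invoked as "mpirun -np 16 ./psm_ctx_wrapper.sh ./my_app", a job with 16
ranks on a node would set PSM_SHAREDCONTEXTS_MAX=4, leaving the remaining
hardware contexts free for other jobs on the same node.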