Hi Wodel,

As Howard mentioned, this is probably because many ranks are sending to a single one and exhausting the receive-request MQ. You can individually enlarge the receive/send request queues with the specific variables (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with PSM_MEMORY=max. Note that the psm library will allocate more system memory for the queues.
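To make that concrete, here is a minimal sketch of the queue-enlarging approach. The value 10000000 is only illustrative (the default reported in the error below is 1048576), and the launch line assumes the same host file and a shortened binary path:

```shell
# Enlarge the PSM MQ request queues before launching the job.
# 10000000 is an illustrative value; PSM allocates more system
# memory for the queues as these limits grow.
export PSM_MQ_RECVREQS_MAX=10000000
export PSM_MQ_SENDREQS_MAX=10000000
# Or let PSM size both queues as generously as it can:
#   export PSM_MEMORY=max

# Forward the variables to the remote ranks with mpirun's -x option, e.g.:
#   mpirun -np 512 -x PSM_MQ_RECVREQS_MAX -x PSM_MQ_SENDREQS_MAX \
#          --mca mtl psm --hostfile hosts32 ./hpcc hpccinf.txt
echo "PSM_MQ_RECVREQS_MAX=$PSM_MQ_RECVREQS_MAX"
```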
Thanks,
_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, January 31, 2017 6:38 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Error using hpcc benchmark

Hi Wodel,

The RandomAccess part of HPCC is probably causing this. Perhaps set the PSM env. variable:

export PSM_MQ_RECVREQS_MAX=10000000

or something like that. Alternatively, launch the job using

mpirun --mca pml ob1 --host ...

to avoid use of psm. Performance will probably suffer with this option, however.

Howard

wodel youchi <wodel.you...@gmail.com> schrieb am Di. 31. Jan. 2017 um 08:27:

Hi,

I am a newbie in the HPC world. I am trying to execute the hpcc benchmark on our cluster, but every time I start the job, I get this error, then the job exits:

compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)

-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated.
The first process to do so was:

Process name: [[19601,1],272]
Exit code: 255
--------------------------------------------------------------------------

Platform: IBM PHPC
OS: RHEL 6.5
One management node; 32 compute nodes: 16 cores, 32GB RAM, Intel QLogic QLE7340 one-port QDR InfiniBand 40Gb/s

I compiled hpcc against IBM MPI, Open MPI 2.0.1 (compiled with gcc 4.4.7), and Open MPI 1.8.1 (compiled with gcc 4.4.7). I get the errors every time, but each time on different compute nodes.

This is the command I used to start the job:

mpirun -np 512 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

Any help will be appreciated; if you need more details, let me know.

Thanks in advance.
Regards.

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
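For reference, the non-PSM fallback Howard suggests, applied to the launch command above, would look roughly like the sketch below. Note that ob1 is a component of the pml (point-to-point messaging layer) framework, so the switch is `--mca pml ob1`; the host file and binary path are taken from the original command:

```shell
# Force the ob1 point-to-point layer instead of the PSM MTL.
# This sidesteps the PSM MQ descriptor limit, at a performance cost.
mpirun -np 512 --mca pml ob1 --hostfile hosts32 \
       /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
```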