Hi, thank you for your replies, but :-) it didn't work for me.
Using hpcc compiled with Open MPI 2.0.1: I tried export PSM_MQ_RECVREQS_MAX=10000000 as mentioned by Howard, but the job didn't take the export into account (I start the job from the home directory of a user; the home directory is shared over NFS with all compute nodes). I also tried exporting the variable in .bash_profile, but the job didn't take it into account either. I got the same error:

  Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)

and, as I mentioned before, each time on different node(s).

From the help of the mpirun command, I read that to pass an environment variable we have to use -x with the command, i.e.:

  mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=10000000 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

But when tested, I get this error:

  PSM was unable to open an endpoint. Please make sure that the network link is active on the node and the hardware is functioning. Error: Ran out of memory

I tested with lower values; the only one that worked for me is 2097152, which is twice the default value of PSM_MQ_RECVREQS_MAX, but even with this value I get the same error (reporting the new value), and then the job exits:

  Exhausted 2097152 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=2097152)

PS: for Cabral, I didn't find any way to know the default value of PSM_MEMORY in order to be able to modify it. Any idea?

Could this be a problem in the InfiniBand configuration? Does the MTU have anything to do with this problem?
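[Editor's note] The two propagation approaches discussed above can be sketched as follows. This is only a sketch: the variable name and value are the ones from this thread, and the mpirun line (commented out, since it needs the actual cluster) mirrors the command wodel posted.

```shell
#!/bin/sh
# Sketch: make a PSM variable visible to every rank.

# 1) Export in the launching shell (only reaches remote ranks if the
#    launcher forwards the environment)...
export PSM_MQ_RECVREQS_MAX=10000000

# 2) ...and forward it explicitly with -x; "-x VAR" takes the value
#    from the launch environment, "-x VAR=value" sets it inline.
# mpirun -np 512 -x PSM_MQ_RECVREQS_MAX --mca mtl psm \
#        --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

echo "PSM_MQ_RECVREQS_MAX=$PSM_MQ_RECVREQS_MAX"
```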
ibv_devinfo
hca_id: qib0
        transport:              InfiniBand (0)
        fw_ver:                 0.0.0
        node_guid:              0011:7500:0070:59a6
        sys_image_guid:         0011:7500:0070:59a6
        vendor_id:              0x1175
        vendor_part_id:         29474
        hw_ver:                 0x2
        board_id:               InfiniPath_QLE7340
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       1
                        port_lmc:       0x00
                        link_layer:     InfiniBand

Regards.

2017-01-31 17:55 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Hi Wodel,
>
> As Howard mentioned, this is probably because many ranks are sending to a
> single one and exhausting the receive requests MQ. You can individually
> enlarge the receive/send request queues with the specific variables
> (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with
> PSM_MEMORY=max. Note that the psm library will allocate more system memory
> for the queues.
>
> Thanks,
>
> _MAC
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
> Sent: Tuesday, January 31, 2017 6:38 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Error using hpcc benchmark
>
> Hi Wodel,
>
> The RandomAccess part of HPCC is probably causing this.
>
> Perhaps set a PSM environment variable:
>
>   export PSM_MQ_RECVREQS_MAX=10000000
>
> or something like that.
>
> Alternatively, launch the job using
>
>   mpirun --mca pml ob1 --host ...
>
> to avoid use of PSM. Performance will probably suffer with this option
> however.
>
> Howard
>
> wodel youchi <wodel.you...@gmail.com> schrieb am Di. 31. Jan. 2017 um 08:27:
>
> Hi,
>
> I am a newbie in the HPC world.
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job, I get this error, then the job exits:
>
>   compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which
>   usually indicates a user program error or insufficient request descriptors
>   (PSM_MQ_RECVREQS_MAX=1048576)
>   compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which
>   usually indicates a user program error or insufficient request descriptors
>   (PSM_MQ_RECVREQS_MAX=1048576)
>   compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which
>   usually indicates a user program error or insufficient request descriptors
>   (PSM_MQ_RECVREQS_MAX=1048576)
>   -------------------------------------------------------
>   Primary job terminated normally, but 1 process returned a non-zero exit
>   code. Per user-direction, the job has been aborted.
>   -------------------------------------------------------
>   --------------------------------------------------------------------------
>   mpirun detected that one or more processes exited with non-zero status,
>   thus causing the job to be terminated. The first process to do so was:
>     Process name: [[19601,1],272]
>     Exit code: 255
>   --------------------------------------------------------------------------
>
> Platform: IBM PHPC
> OS: RHEL 6.5
> One management node
> 32 compute nodes: 16 cores, 32 GB RAM, Intel QLogic QLE7340 one-port QDR
> InfiniBand 40 Gb/s
>
> I compiled hpcc against: IBM MPI, Open MPI 2.0.1 (compiled with gcc 4.4.7),
> and Open MPI 1.8.1 (compiled with gcc 4.4.7).
>
> I get the errors, but each time on different compute nodes.
>
> This is the command I used to start the job:
>
>   mpirun -np 512 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
>
> Any help will be appreciated, and if you need more details, let me know.
>
> Thanks in advance.
>
> Regards.
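[Editor's note] Cabral's queue-sizing advice quoted above could be tried as in the sketch below. The variable names are the ones given in the thread; the values are illustrative, not recommendations.

```shell
#!/bin/sh
# Sketch of the quoted advice: enlarge both MQ request queues
# individually, or let PSM size them with PSM_MEMORY=max.
export PSM_MQ_RECVREQS_MAX=10000000
export PSM_MQ_SENDREQS_MAX=10000000
# Alternative named in the thread (PSM then allocates more system
# memory for the queues):
# export PSM_MEMORY=max

echo "recv=$PSM_MQ_RECVREQS_MAX send=$PSM_MQ_SENDREQS_MAX"
```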
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users