Re: [OMPI users] Error using hpcc benchmark
Hi Cabral, and thank you.

I started the hpcc benchmark using -x PSM_MEMORY=large and this time there were no errors. I haven't let the test run to completion yet, but I waited about 10 minutes without a problem; I even increased the Ns variable in hpccinf.txt and restarted the test without issue.

The cluster is composed of:
- one management node
- 32 compute nodes, each with 16 cores (2 sockets x 8 cores), 32 GB of RAM, and an Intel QLE7340 single-port InfiniBand 40 Gb/s card

I used this site to generate the input file for hpcc: http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/ with some modifications:

    1            # of problems sizes (N)
    331520       Ns
    1            # of NBs
    128          NBs
    0            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    16           Ps
    32           Qs

The Ns here represents almost 90% of the total memory of the cluster. The total number of processes is 512; each node starts 16 processes, one per core.

Before modifying the PSM_MEMORY value, the test exited with the mentioned error, even with lower values of Ns. I find it odd that this variable is not mentioned anywhere on the net, not even in the Intel True Scale OFED+ documentation.

Thanks again.

2017-02-01 22:12 GMT+01:00 Cabral, Matias A:
> Hi Wodel,
>
> As you already figured out, mpirun -x
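For readers reproducing this sizing: the generator linked above works from the rule of thumb that the HPL matrix (N x N doubles, 8 bytes each) should fill a chosen fraction of total cluster RAM, so N ≈ sqrt(fraction × total_bytes / 8). A back-of-the-envelope check for this cluster (32 nodes × 32 GB); the exact percentage depends on whether GB or GiB is meant:

    awk 'BEGIN {
        ram  = 32 * 32 * 1e9    # total cluster RAM in bytes (32 nodes x 32 GB)
        n    = 331520           # Ns from the hpccinf.txt excerpt above
        used = 8 * n * n        # HPL matrix: N x N doubles, 8 bytes each
        printf "matrix fills %.0f%% of RAM; N for 90%% would be ~%d\n",
               100 * used / ram, sqrt(0.90 * ram / 8)
    }'

With these numbers the chosen Ns works out to roughly 86% of total RAM, consistent with the "almost 90%" figure quoted above.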
Re: [OMPI users] Error using hpcc benchmark
Hi Wodel,

As you already figured out, mpirun -x
Re: [OMPI users] Error using hpcc benchmark
Hi,

Thank you for your replies, but :-) it didn't work for me.

Using hpcc compiled with OpenMPI 2.0.1: I tried export PSM_MQ_RECVREQS_MAX=1000 as mentioned by Howard, but the job didn't take the export into account (I am starting the job from the home directory of a user; the home directory is shared over NFS with all compute nodes). I also tried exporting the variable from .bash_profile, but the job didn't pick it up either (presumably because the remote processes are started through non-interactive shells, so login-shell profiles are not sourced), and I got the same error:

    Exhausted 1048576 MQ irecv request descriptors, which usually indicates
    a user program error or insufficient request descriptors
    (PSM_MQ_RECVREQS_MAX=1048576)

And as I mentioned before, each time on different node(s).

From the help of the mpirun command, I read that to pass an environment variable we have to use -x with the command, i.e.:

    mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=1000 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

But when tested, I get these errors:

    PSM was unable to open an endpoint. Please make sure that the network
    link is active on the node and the hardware is functioning.
    Error: Ran out of memory

I tested other values; the only one that worked for me is 2097152, which is 2 times the default value of PSM_MQ_RECVREQS_MAX, but even with this value I eventually get the same error, now reporting the new value, and the job exits:

    Exhausted 2097152 MQ irecv request descriptors, which usually indicates
    a user program error or insufficient request descriptors
    (PSM_MQ_RECVREQS_MAX=2097152)

PS for Cabral: I didn't find any way to read the default value of PSM_MEMORY so that I could modify it. Any idea?

Could this be a problem in the InfiniBand configuration? Does the MTU have anything to do with this problem?

    ibv_devinfo
    hca_id: qib0
            transport:       InfiniBand (0)
            fw_ver:          0.0.0
            node_guid:       0011:7500:0070:59a6
            sys_image_guid:  0011:7500:0070:59a6
            vendor_id:       0x1175
            vendor_part_id:  29474
            hw_ver:          0x2
            board_id:        InfiniPath_QLE7340
            phys_port_cnt:   1
                    port: 1
                            state:       PORT_ACTIVE (4)
                            max_mtu:     4096 (5)
                            active_mtu:  2048 (4)
                            sm_lid:      1
                            port_lid:    1
                            port_lmc:    0x00
                            link_layer:  InfiniBand

Regards.

2017-01-31 17:55 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:
> Hi Wodel,
>
> As Howard mentioned, this is probably because many ranks are sending to a
> single one and exhausting the receive requests MQ. You can individually
> enlarge the receive/send request queues with the specific variables
> (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with
> PSM_MEMORY=max. Note that the psm library will allocate more system memory
> for the queues.
>
> Thanks,
>
> _MAC
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
> Sent: Tuesday, January 31, 2017 6:38 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Error using hpcc benchmark
>
> Hi Wodel
>
> The RandomAccess part of HPCC is probably causing this.
>
> Perhaps set a PSM env. variable:
>
> export PSM_MQ_RECVREQS_MAX=1000
>
> or something like that.
>
> Alternatively, launch the job using
>
> mpirun --mca pml ob1 --host ...
>
> to avoid use of psm. Performance will probably suffer with this option however.
>
> Howard
>
> wodel youchi <wodel.you...@gmail.com> schrieb am Di. 31. Jan. 2017 um 08:27:
> Hi,
>
> I am a newbie in the HPC world.
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job, I get this error, then the job exits:
>
> compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> ---
> Primary job terminated normally, but 1 process returned a non-zero exit code. [...]
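For what it's worth, a quick way to confirm whether a variable passed with -x actually reaches the remote ranks is to launch a trivial command that prints it back; a minimal sketch, assuming the same hosts32 hostfile (the value shown is just an example):

    # Each rank echoes its hostname and the variable as it sees it.
    mpirun -np 4 -x PSM_MQ_RECVREQS_MAX=2097152 --hostfile hosts32 \
        sh -c 'echo "$(hostname): PSM_MQ_RECVREQS_MAX=$PSM_MQ_RECVREQS_MAX"'

If every line prints the expected value, the variable is propagating correctly and the remaining failures are not an environment-passing problem.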
Re: [OMPI users] Error using hpcc benchmark
Hi Wodel,

As Howard mentioned, this is probably because many ranks are sending to a single one and exhausting the receive requests MQ. You can individually enlarge the receive/send request queues with the specific variables (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with PSM_MEMORY=max. Note that the psm library will allocate more system memory for the queues.

Thanks,

_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, January 31, 2017 6:38 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Error using hpcc benchmark

Hi Wodel

The RandomAccess part of HPCC is probably causing this.

Perhaps set a PSM env. variable:

export PSM_MQ_RECVREQS_MAX=1000

or something like that.

Alternatively, launch the job using

mpirun --mca pml ob1 --host ...

to avoid use of psm. Performance will probably suffer with this option however.

Howard

wodel youchi <wodel.you...@gmail.com> schrieb am Di. 31. Jan. 2017 um 08:27:

Hi,

I am a newbie in the HPC world.

I am trying to execute the hpcc benchmark on our cluster, but every time I start the job, I get this error, then the job exits:

    compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which
    usually indicates a user program error or insufficient request
    descriptors (PSM_MQ_RECVREQS_MAX=1048576)
    compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which
    usually indicates a user program error or insufficient request
    descriptors (PSM_MQ_RECVREQS_MAX=1048576)
    compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which
    usually indicates a user program error or insufficient request
    descriptors (PSM_MQ_RECVREQS_MAX=1048576)
    ---
    Primary job terminated normally, but 1 process returned a non-zero
    exit code. Per user-direction, the job has been aborted.
    ---
    mpirun detected that one or more processes exited with non-zero status,
    thus causing the job to be terminated. The first process to do so was:

      Process name: [[19601,1],272]
      Exit code:    255

Platform: IBM PHPC
OS: RHEL 6.5
one management node
32 compute nodes: 16 cores, 32 GB RAM, Intel QLogic QLE7340 single-port QDR InfiniBand 40 Gb/s

I compiled hpcc against: IBM MPI, OpenMPI 2.0.1 (compiled with gcc 4.4.7) and OpenMPI 1.8.1 (compiled with gcc 4.4.7).

I get the errors, but each time on different compute nodes.

This is the command I used to start the job:

    mpirun -np 512 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

Any help will be appreciated, and if you need more details, let me know.
Thanks in advance.

Regards.
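Combining Matias's suggestion with the launch command used elsewhere in the thread would look roughly like this; a sketch only, and the doubled queue sizes are illustrative values rather than tested settings:

    mpirun -np 512 -x PSM_MEMORY=max \
        -x PSM_MQ_RECVREQS_MAX=2097152 -x PSM_MQ_SENDREQS_MAX=2097152 \
        --mca mtl psm --hostfile hosts32 \
        /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt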
Re: [OMPI users] Error using hpcc benchmark
Hi Wodel

The RandomAccess part of HPCC is probably causing this.

Perhaps set a PSM env. variable:

export PSM_MQ_RECVREQS_MAX=1000

or something like that.

Alternatively, launch the job using

mpirun --mca pml ob1 --host ...

to avoid use of psm. Performance will probably suffer with this option however.

Howard

wodel youchi schrieb am Di. 31. Jan. 2017 um 08:27:
> Hi,
>
> I am a newbie in the HPC world.
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job, I get this error, then the job exits:
>
> compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> ---
> Primary job terminated normally, but 1 process returned a non-zero exit code.
> Per user-direction, the job has been aborted.
> ---
> mpirun detected that one or more processes exited with non-zero status, thus
> causing the job to be terminated. The first process to do so was:
>
>   Process name: [[19601,1],272]
>   Exit code:    255
>
> Platform: IBM PHPC
> OS: RHEL 6.5
> one management node
> 32 compute nodes: 16 cores, 32 GB RAM, Intel QLogic QLE7340 single-port QDR
> InfiniBand 40 Gb/s
>
> I compiled hpcc against: IBM MPI, OpenMPI 2.0.1 (compiled with gcc 4.4.7)
> and OpenMPI 1.8.1 (compiled with gcc 4.4.7).
>
> I get the errors, but each time on different compute nodes.
>
> This is the command I used to start the job:
>
> mpirun -np 512 --mca mtl psm --hostfile hosts32
> /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
>
> Any help will be appreciated, and if you need more details, let me know.
> Thanks in advance.
>
> Regards.
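Howard's fallback of avoiding PSM altogether would look roughly like this with the same hostfile and binary (a sketch; expect noticeably lower performance, since traffic no longer uses the PSM fast path):

    mpirun -np 512 --mca pml ob1 --hostfile hosts32 \
        /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt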