Hi Wodel,

As Howard mentioned, this is probably because many ranks are sending to a 
single one and exhausting the receive-request MQ. You can individually enlarge 
the receive/send request queues with the specific variables 
(PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX), or increase both with 
PSM_MEMORY=max. Note that the PSM library will allocate more system memory for 
the queues.
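
For example, a minimal sketch (the queue sizes below are illustrative only, 
not tuned recommendations; mpirun's -x option forwards the variables to the 
remote ranks):

# Illustrative queue sizes; PSM will allocate more system memory accordingly.
export PSM_MQ_RECVREQS_MAX=10485760
export PSM_MQ_SENDREQS_MAX=10485760
# ...or grow both queues to their maximum in one step:
# export PSM_MEMORY=max
mpirun -np 512 -x PSM_MQ_RECVREQS_MAX -x PSM_MQ_SENDREQS_MAX ...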

Thanks,

_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard 
Pritchard
Sent: Tuesday, January 31, 2017 6:38 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Error using hpcc benchmark

Hi Wodel,

The RandomAccess part of HPCC is probably causing this.

Perhaps set the PSM environment variable:

export PSM_MQ_RECVREQS_MAX=10000000

or something like that.

Alternatively, launch the job using

mpirun --mca pml ob1 --host ....

to avoid use of PSM. Performance will probably suffer with this option, however.
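
For example, something along these lines (host file and binary path taken 
from your original command; forcing the TCP BTL is just one possible way to 
make sure PSM is bypassed):

mpirun -np 512 --mca pml ob1 --mca btl tcp,self --hostfile hosts32 \
    /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt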

Howard
wodel youchi <wodel.you...@gmail.com> wrote on Tue, 31 Jan 2017 at 08:27:
Hi,
I am a newbie in the HPC world.
I am trying to execute the hpcc benchmark on our cluster, but every time I 
start the job I get this error and then the job exits:
compute017.22840Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
compute024.22840Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
compute019.22847Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[19601,1],272]
  Exit code:    255
--------------------------------------------------------------------------
Platform: IBM PHPC
OS: RHEL 6.5
One management node
32 compute nodes: 16 cores, 32 GB RAM, Intel QLogic QLE7340 single-port QDR 
InfiniBand 40 Gb/s
I compiled hpcc against IBM MPI, Open MPI 2.0.1 (compiled with gcc 4.4.7), and 
Open MPI 1.8.1 (compiled with gcc 4.4.7).
I get the errors each time, but on different compute nodes.
This is the command I used to start the job:
mpirun -np 512 --mca mtl psm --hostfile hosts32 
/shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

Any help will be appreciated; if you need more details, let me know.
Thanks in advance.

Regards.
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users