Hi Wodel,

As you already figured out, mpirun -x <ENV_VAR=value> … is the right way to do 
it, so the psm library will read the values when it initializes on every node.
The default value for PSM_MEMORY is "normal" and you may change it to 
"large". If you want to look inside the code, it is at 
https://github.com/01org/psm . One useful variable to play with is 
PSM_TRACEMASK (set it only on the head node) to see what values are being used; 
I think 0xffff will dump a lot of info.
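For example, a hedged sketch reusing your existing command line (the values are 
illustrative only, not recommendations):

    export PSM_TRACEMASK=0xffff    # on the head node only, so only the local ranks print trace output
    mpirun -np 512 -x PSM_MEMORY=large --mca mtl psm --hostfile hosts32 \
        /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt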
As I mentioned below, playing with the size of the MQ is tricky since it will be 
using system memory. I think the right value will be a combination of a) the total 
number of ranks and the ranks per node, b) the memory on the hosts, and c) the HPCC 
parameters. The larger the number of ranks, the more ranks may be transmitting 
simultaneously to a single node (I would assume during a reduction), and a node 
could be posting receives at a faster rate than it completes them, so it will need 
a bigger MQ and therefore more memory. Would you share the number of ranks per 
node, the number of nodes, and the memory per node so we can get an idea? A quick 
test could be to start with a very small number of ranks to see if it runs.
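For instance, a quick sanity check along these lines (a sketch only; "hosts2" is a 
hypothetical hostfile listing just a couple of your nodes, and the request limit 
shown is the value PSM accepted in your earlier test):

    mpirun -np 32 -x PSM_MQ_RECVREQS_MAX=2097152 --mca mtl psm --hostfile hosts2 \
        /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

With fewer ranks you would also need the HPL process grid (P x Q) in hpccinf.txt to 
fit the smaller rank count.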

Thanks,
Regards,

_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of wodel youchi
Sent: Wednesday, February 01, 2017 3:36 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Error using hpcc benchmark

Hi,
Thank you for your replies, but :-) it didn't work for me.
Using hpcc compiled with Open MPI 2.0.1:
I tried to use export PSM_MQ_RECVREQS_MAX=10000000 as mentioned by Howard, but 
the job didn't take the export into account (I am starting the job from the 
home directory of a user; that home directory is shared via NFS with all the 
compute nodes).
I tried using .bash_profile to export the variable, but the job didn't take it 
into account either; I got the same error:
Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user 
program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
And as I mentioned before, it happens each time on different node(s).

From the help output of the mpirun command, I read that to pass an environment 
variable we have to use -x with the command, i.e.:
mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=10000000 --mca mtl psm --hostfile hosts32 
/shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

But when I tested that, I got these errors:

PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
Error: Ran out of memory
I tested with lower values; the only one that worked for me is 2097152, which 
is 2 times the default value of PSM_MQ_RECVREQS_MAX, but even with this value I 
get the same error (now reporting the new value), and then the job exits:
Exhausted 2097152 MQ irecv request descriptors, which usually indicates a user 
program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=2097152 )

PS: for Cabral, I couldn't find any way to determine the default value of 
PSM_MEMORY in order to modify it.
Any idea? Could this be a problem with the InfiniBand configuration?

Does the MTU have anything to do with this problem?

ibv_devinfo
hca_id: qib0
        transport:                      InfiniBand (0)
        fw_ver:                         0.0.0
        node_guid:                      0011:7500:0070:59a6
        sys_image_guid:                 0011:7500:0070:59a6
        vendor_id:                      0x1175
        vendor_part_id:                 29474
        hw_ver:                         0x2
        board_id:                       InfiniPath_QLE7340
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand



Regards.

2017-01-31 17:55 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:
Hi Wodel,

As Howard mentioned, this is probably because many ranks are sending to a 
single one and exhausting the receive requests MQ. You can individually enlarge 
the receive/send request queues with the specific variables 
(PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with 
PSM_MEMORY=max. Note that the psm library will allocate more system memory for 
the queues.
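For example (a hedged sketch based on the command line used in this thread; the 
values are placeholders to experiment with, and too large a value may itself fail 
to allocate memory):

    mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=10000000 -x PSM_MQ_SENDREQS_MAX=10000000 \
        --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

or, to grow both queues at once:

    mpirun -np 512 -x PSM_MEMORY=max --mca mtl psm --hostfile hosts32 \
        /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt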

Thanks,

_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, January 31, 2017 6:38 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Error using hpcc benchmark

Hi Wodel

The RandomAccess part of HPCC is probably causing this.

Perhaps set the PSM env. variable:

export PSM_MQ_RECVREQS_MAX=10000000

or something like that.

Alternatively launch the job using

mpirun --mca pml ob1 --host ....

to avoid the use of psm. Performance will probably suffer with this option, however.
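For example, a hedged version based on the original command line from this thread:

    mpirun -np 512 --mca pml ob1 --hostfile hosts32 \
        /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt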

Howard
wodel youchi <wodel.you...@gmail.com> wrote on Tue, Jan 31, 2017 at 08:27:
Hi,
I am a newbie in the HPC world.
I am trying to execute the hpcc benchmark on our cluster, but every time I 
start the job I get this error and then the job exits:
compute017.22840Exhausted 1048576 MQ irecv request descriptors, which usually 
indicates a user program error or insufficient request descriptors 
(PSM_MQ_RECVREQS_MAX=1048576)
compute024.22840Exhausted 1048576 MQ irecv request descriptors, which usually 
indicates a user program error or insufficient request descriptors 
(PSM_MQ_RECVREQS_MAX=1048576)
compute019.22847Exhausted 1048576 MQ irecv request descriptors, which usually 
indicates a user program error or insufficient request descriptors 
(PSM_MQ_RECVREQS_MAX=1048576)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[19601,1],272]
  Exit code:    255
--------------------------------------------------------------------------
Platform: IBM PHPC
OS: RHEL 6.5
One management node
32 compute nodes: 16 cores, 32 GB RAM each, Intel QLogic QLE7340 single-port 
QDR InfiniBand 40 Gb/s
I compiled hpcc against IBM MPI, Open MPI 2.0.1 (compiled with gcc 4.4.7), and 
Open MPI 1.8.1 (compiled with gcc 4.4.7).
I get the errors each time, but on different compute node(s).
This is the command I used to start the job:
mpirun -np 512 --mca mtl psm --hostfile hosts32 
/shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

Any help will be appreciated, and if you need more details, let me know.
Thanks in advance.

Regards.
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
