Hi,

Thank you for your replies, but :-) it didn't work for me.

Using hpcc compiled with Open MPI 2.0.1:
I tried export PSM_MQ_RECVREQS_MAX=10000000 as mentioned by Howard, but the
job did not take the export into account (I start the job from a user's home
directory, which is shared over NFS with all the compute nodes).
I also tried exporting the variable from .bash_profile, but the job ignored
it as well and I got the same error:


Exhausted 1048576 MQ irecv request descriptors, which usually indicates a
user program error or insufficient request descriptors
(PSM_MQ_RECVREQS_MAX=1048576)
And as I mentioned before, it happens each time on different node(s).
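
For what it's worth, this is the quick check I plan to run to see whether
the export actually reaches the remote processes (a small run is enough; I
am assuming plain env/grep here, and hosts32 is my hostfile):

export PSM_MQ_RECVREQS_MAX=10000000
# if the export is propagated, each rank's environment should contain it;
# if nothing is printed, the variable never reaches the remote nodes
mpirun -np 4 --hostfile hosts32 env | grep PSM_MQ_RECVREQS_MAX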


From the help of the mpirun command, I read that to pass an environment
variable we have to use -x on the command line, i.e.:

mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=10000000 --mca mtl psm --hostfile \
    hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

But when I tested it, I got this error:



PSM was unable to open an endpoint. Please make sure that the network link
is active on the node and the hardware is functioning.
Error: Ran out of memory

I tested with lower values; the only one that worked for me is 2097152,
which is twice the default value of PSM_MQ_RECVREQS_MAX. But even with this
value I get the same error, now reporting the new limit, and the job exits:

Exhausted 2097152 MQ irecv request descriptors, which usually indicates
a user program error or insufficient request descriptors
(PSM_MQ_RECVREQS_MAX=2097152)
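
Next I will try to narrow down the largest value the endpoint will still
accept, somewhere between 2097152 and Howard's 10000000; a crude sketch of
what I have in mind (the list of values is arbitrary):

# crude sweep: find the largest PSM_MQ_RECVREQS_MAX the endpoint accepts
for v in 3000000 4000000 6000000 8000000 10000000; do
    echo "=== trying PSM_MQ_RECVREQS_MAX=$v ==="
    mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=$v --mca mtl psm --hostfile hosts32 \
        /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
done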

PS: @Cabral, I did not find any way to know the default value of PSM_MEMORY
in order to modify it.
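The only other thing I can think of is to simply pass PSM_MEMORY=max the
same way, as you suggested, and let the library size the queues itself;
something like this (untested):

mpirun -np 512 -x PSM_MEMORY=max --mca mtl psm --hostfile hosts32 \
    /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt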

Any ideas? Could this be a problem with the InfiniBand configuration?

Does the MTU have anything to do with this problem?

ibv_devinfo
hca_id: qib0
        transport:                      InfiniBand (0)
        fw_ver:                         0.0.0
        node_guid:                      0011:7500:0070:59a6
        sys_image_guid:                 0011:7500:0070:59a6
        vendor_id:                      0x1175
        vendor_part_id:                 29474
        hw_ver:                         0x2
        board_id:                       InfiniPath_QLE7340
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)

                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand
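
In the meantime I will also try Howard's suggestion of taking PSM out of the
picture entirely, to see whether the error disappears (I believe the
parameter is pml rather than plm, and I expect performance to drop):

# run over the ob1 PML instead of the psm MTL, just as a diagnostic
mpirun -np 512 --mca pml ob1 --hostfile hosts32 \
    /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt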



Regards.

2017-01-31 17:55 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Hi Wodel,
>
>
>
> As Howard mentioned, this is probably because many ranks and sending to a
> single one and exhausting the receive requests MQ. You can individually
> enlarge the receive/send requests queues with the specific variables
> (PSM_MQ_RECVREQS_MAX/ PSM_MQ_SENDREQS_MAX) or increase both with
> PSM_MEMORY=max.  Note that the psm library will allocate more system memory
> for the queues.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, January 31, 2017 6:38 AM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Subject:* Re: [OMPI users] Error using hpcc benchmark
>
>
>
> Hi Wodel
>
>
>
> Randomaccess part of HPCC is probably causing this.
>
>
>
> Perhaps set PSM env. variable -
>
>
>
> Export PSM_MQ_REVCREQ_MAX=10000000
>
>
>
> or something like that.
>
>
>
> Alternatively launch the job using
>
>
>
> mpirun --mca plm ob1 --host ....
>
>
>
> to avoid use of psm.  Performance will probably suffer with this option
> however.
>
>
>
> Howard
>
> wodel youchi <wodel.you...@gmail.com> wrote on Tue, Jan 31, 2017 at
> 08:27:
>
> Hi,
>
> I am a newbie in the HPC world.
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job, I get this error, then the job exits
>
>
> compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned a non-zero exit
> code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[19601,1],272]
>   Exit code:    255
> --------------------------------------------------------------------------
>
> Platform : IBM PHPC
>
> OS : RHEL 6.5
>
> one management node
>
> 32 compute nodes: 16 cores, 32 GB RAM, Intel/QLogic QLE7340 single-port QDR
> InfiniBand 40 Gb/s
>
> I compiled hpcc against : IBM MPI, Openmpi 2.0.1 (compiled with gcc 4.4.7)
> and Openmpi 1.8.1 (compiled with gcc 4.4.7)
>
> I get the errors, but each time on different compute nodes.
>
> This is the command I used to start the job
> mpirun -np 512 --mca mtl psm --hostfile hosts32
> /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
>
>
>
> Any help will be appreciated, and if you need more details, let me know.
>
> Thanks in advance.
>
>
>
> Regards.
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
