Re: [OMPI users] Error using hpcc benchmark

2017-02-02 Thread wodel youchi
Hi Cabral, and thank you.

I started the hpcc benchmark using -x PSM_MEMORY=large without any error. I
haven't finished the test yet, but I waited about 10 minutes and this time saw
no errors. I even increased the Ns value in hpccinf.txt and started the test
again without any problem.
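
For reference, the full launch command was essentially the same one used
earlier in this thread, with the extra -x flag added (a sketch; hostfile and
binary path as in the rest of the thread):

mpirun -np 512 -x PSM_MEMORY=large --mca mtl psm --hostfile hosts32 \
       /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt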

The cluster is composed of:
- one management node
- 32 compute nodes, each with 16 cores (2 sockets x 8 cores), 32GB of
RAM, and an Intel QLE7340 single-port InfiniBand 40Gb/s card

I used this site to generate the input file for hpcc:
http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/
with some modifications :

1        # of problems sizes (N)
331520   Ns
1        # of NBs
128      NBs
0        PMAP process mapping (0=Row-,1=Column-major)
1        # of process grids (P x Q)
16       Ps
32       Qs

The Ns value here represents almost 90% of the total memory of the cluster. The
total number of processes is 512; each node will start 16 processes, one per
core.
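
As a quick sanity check on that sizing (using the usual HPL rule of thumb that
the matrix occupies Ns^2 x 8 bytes and that Ns should be a multiple of NB):

    331520^2 x 8 bytes ≈ 880 GB, against 32 nodes x 32 GB = 1024 GB aggregate
    331520 / 128 = 2590, so Ns is an exact multiple of NB

These figures are just arithmetic on the values above, not output from the
tuning tool.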

Before modifying the PSM_MEMORY value, the test exited with the mentioned
error, even with lower values of Ns.

I find it weird that there is no mention of this variable anywhere on the net,
not even in the Intel True Scale OFED+ documentation!

Thanks again.




2017-02-01 22:12 GMT+01:00 Cabral, Matias A :

> Hi Wodel,
>
>
>
> As you already figured out, mpirun -x 

Re: [OMPI users] Error using hpcc benchmark

2017-02-01 Thread Cabral, Matias A
Hi Wodel,

As you already figured out, mpirun -x 

Re: [OMPI users] Error using hpcc benchmark

2017-02-01 Thread wodel youchi
Hi,

Thank you for your replies, but :-) it didn't work for me.

Using hpcc compiled with Open MPI 2.0.1:
I tried to use export PSM_MQ_RECVREQS_MAX=1000 as mentioned by Howard, but the
job didn't take the export into account (I am starting the job from the home
directory of a user; the home directory is shared via NFS with all the compute
nodes).
I tried to use .bash_profile to export the variable, but the job didn't take it
into account either; I got the same error:


Exhausted 1048576 MQ irecv request descriptors, which usually indicates a
user program error or insufficient request descriptors
(PSM_MQ_RECVREQS_MAX=1048576)
And as I mentioned before, each time on different node(s).


From the help of the mpirun command, I read that to pass an environment
variable we have to use -x with the command, i.e.:
mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=1000 --mca mtl psm --hostfile
hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
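
(Note, assuming Open MPI's usual -x semantics: -x can be repeated, one variable
per flag, and a variable already exported in the launching shell can be
forwarded by name alone, without repeating the value, e.g.:

export PSM_MQ_RECVREQS_MAX=2097152
mpirun -np 512 -x PSM_MQ_RECVREQS_MAX --mca mtl psm --hostfile hosts32 \
       /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

The value shown here is just the one tested later in this message.)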

But when I tested this, I got this error:



PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
Error: Ran out of memory

I tested with lower values; the only one that worked for me is 2097152, which
is 2 times the default value of PSM_MQ_RECVREQS_MAX. But even with this value I
get the same error, now reporting the new limit, and the job exits:

Exhausted 2097152 MQ irecv request descriptors, which usually indicates
a user program error or insufficient request descriptors
(PSM_MQ_RECVREQS_MAX=2097152)

PS: for Cabral, I didn't find any way to find out the default value of
PSM_MEMORY in order to be able to modify it.

Any idea? Could this be a problem with the InfiniBand configuration?

Does the MTU have anything to do with this problem?

ibv_devinfo
hca_id: qib0
        transport:                      InfiniBand (0)
        fw_ver:                         0.0.0
        node_guid:                      0011:7500:0070:59a6
        sys_image_guid:                 0011:7500:0070:59a6
        vendor_id:                      0x1175
        vendor_part_id:                 29474
        hw_ver:                         0x2
        board_id:                       InfiniPath_QLE7340
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand



Regards.

2017-01-31 17:55 GMT+01:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Hi Wodel,
>
>
>
> As Howard mentioned, this is probably because many ranks are sending to a
> single one and exhausting the receive request MQ. You can individually
> enlarge the receive/send request queues with the specific variables
> (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with
> PSM_MEMORY=max.  Note that the psm library will allocate more system memory
> for the queues.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard
> Pritchard
> Sent: Tuesday, January 31, 2017 6:38 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] Error using hpcc benchmark
>
>
>
> Hi Wodel
>
>
>
> The RandomAccess part of HPCC is probably causing this.
>
>
>
> Perhaps set PSM env. variable -
>
>
>
> export PSM_MQ_RECVREQS_MAX=1000
>
>
>
> or something like that.
>
>
>
> Alternatively launch the job using
>
>
>
> mpirun --mca pml ob1 --host 
>
>
>
> to avoid use of psm.  Performance will probably suffer with this option
> however.
>
>
>
> Howard
>
> wodel youchi <wodel.you...@gmail.com> wrote on Tue, 31 Jan 2017 at
> 08:27:
>
> Hi,
>
> I am a newbie in the HPC world
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job, I get this error, then the job exits
>
> compute017.22840Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute024.22840Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute019.22847Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> --- Primary job
> terminated normall

Re: [OMPI users] Error using hpcc benchmark

2017-01-31 Thread Cabral, Matias A
Hi Wodel,

As Howard mentioned, this is probably because many ranks are sending to a
single one and exhausting the receive request MQ. You can individually enlarge
the receive/send request queues with the specific variables
(PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with
PSM_MEMORY=max.  Note that the psm library will allocate more system memory for
the queues.
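
For example, something along these lines (a sketch only; the variable names are
the ones from the error message and this thread, and the values are purely
illustrative, not tuned recommendations):

# enlarge the receive/send request queues individually, then forward them by name
export PSM_MQ_RECVREQS_MAX=4194304
export PSM_MQ_SENDREQS_MAX=4194304
mpirun -np 512 -x PSM_MQ_RECVREQS_MAX -x PSM_MQ_SENDREQS_MAX \
       --mca mtl psm --hostfile hosts32 \
       /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

# or let PSM size both queues with the single knob
mpirun -np 512 -x PSM_MEMORY=max --mca mtl psm --hostfile hosts32 \
       /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt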

Thanks,

_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard 
Pritchard
Sent: Tuesday, January 31, 2017 6:38 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Error using hpcc benchmark

Hi Wodel

The RandomAccess part of HPCC is probably causing this.

Perhaps set PSM env. variable -

export PSM_MQ_RECVREQS_MAX=1000

or something like that.

Alternatively launch the job using

mpirun --mca pml ob1 --host 

to avoid use of psm.  Performance will probably suffer with this option however.

Howard
wodel youchi <wodel.you...@gmail.com> wrote on Tue, 31 Jan 2017 at 08:27:
Hi,
I am a newbie in the HPC world.
I am trying to execute the hpcc benchmark on our cluster, but every time I
start the job, I get this error, then the job exits:
compute017.22840Exhausted 1048576 MQ irecv request descriptors, which usually
indicates a user program error or insufficient request descriptors
(PSM_MQ_RECVREQS_MAX=1048576)
compute024.22840Exhausted 1048576 MQ irecv request descriptors, which usually
indicates a user program error or insufficient request descriptors
(PSM_MQ_RECVREQS_MAX=1048576)
compute019.22847Exhausted 1048576 MQ irecv request descriptors, which usually
indicates a user program error or insufficient request descriptors
(PSM_MQ_RECVREQS_MAX=1048576)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus
causing the job to be terminated. The first process to do so was:

  Process name: [[19601,1],272]
  Exit code:    255
--------------------------------------------------------------------------
Platform: IBM PHPC
OS: RHEL 6.5
one management node
32 compute nodes: 16 cores, 32GB RAM, Intel QLogic QLE7340 single-port QDR
InfiniBand 40Gb/s
I compiled hpcc against: IBM MPI, Open MPI 2.0.1 (compiled with gcc 4.4.7) and
Open MPI 1.8.1 (compiled with gcc 4.4.7)
I get the errors, but each time on different compute nodes.
This is the command I used to start the job:
mpirun -np 512 --mca mtl psm --hostfile hosts32
/shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

Any help will be appreciated, and if you need more details, let me know.
Thanks in advance.

Regards.

Re: [OMPI users] Error using hpcc benchmark

2017-01-31 Thread Howard Pritchard
Hi Wodel

The RandomAccess part of HPCC is probably causing this.

Perhaps set PSM env. variable -

export PSM_MQ_RECVREQS_MAX=1000

or something like that.

Alternatively launch the job using

mpirun --mca pml ob1 --host 

to avoid use of psm.  Performance will probably suffer with this option
however.
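
For completeness, a sketch of that fallback (assuming the same hostfile and
binary as the original command; selecting the ob1 PML keeps the job off the
PSM MTL):

mpirun -np 512 --mca pml ob1 --hostfile hosts32 \
       /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt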

Howard
wodel youchi wrote on Tue, 31 Jan 2017 at 08:27:

> Hi,
>
> I am a newbie in the HPC world
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job, I get this error, then the job exits
>
> compute017.22840Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute024.22840Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute019.22847Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus
> causing the job to be terminated. The first process to do so was:
>
>   Process name: [[19601,1],272]
>   Exit code:    255
> --------------------------------------------------------------------------
>
> Platform: IBM PHPC
> OS: RHEL 6.5
> one management node
> 32 compute nodes: 16 cores, 32GB RAM, Intel QLogic QLE7340 single-port QDR
> InfiniBand 40Gb/s
>
> I compiled hpcc against: IBM MPI, Open MPI 2.0.1 (compiled with gcc 4.4.7)
> and Open MPI 1.8.1 (compiled with gcc 4.4.7)
>
> I get the errors, but each time on different compute nodes.
>
> This is the command I used to start the job:
>
> mpirun -np 512 --mca mtl psm --hostfile hosts32
> /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
>
> Any help will be appreciated, and if you need more details, let me know.
> Thanks in advance.
>
>
> Regards.