I am trying to run WRF on 1024 cores with Open MPI 1.3.3 and
1.4.  The code runs on 512 cores, but it crashes at startup
on 1024 cores with the following error message:

[n172][[43536,1],0][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
[n172][[43536,1],0][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect

Based on what I found through Google, I have tried changing the
btl_openib_receive_queues setting, but none of my attempts have worked.
Here is my latest attempt to reduce the total number of queue pairs:

mpirun -np 1024 \
   -mca btl_openib_receive_queues P,128,2048,128,128:S,65536,256,192,128 \
   wrf.exe

These settings did not help.
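
For reference, here is how I read the receive_queues value above. This is only a sketch based on my understanding that the entries are colon-separated, that the first field is the queue type (P = per-peer, S = shared receive queue), and that the second field is the buffer size in bytes. The helper names count_queues and describe_queues are my own and are not part of Open MPI.

```shell
#!/bin/sh
# Split a btl_openib_receive_queues value into its individual queue specs.
# Assumption: entries are colon-separated; field 1 is the queue type
# (P = per-peer, S = shared receive queue), field 2 is the buffer size.
queues="P,128,2048,128,128:S,65536,256,192,128"

# Hypothetical helper: count how many queue specs the value contains.
count_queues() {
    printf '%s\n' "$1" | tr ':' '\n' | wc -l | tr -d ' '
}

# Hypothetical helper: print type and buffer size for each queue spec.
describe_queues() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS=, read -r type size rest; do
        echo "type=$type buffer_size=$size remaining_fields=$rest"
    done
}

echo "queue specs in this value: $(count_queues "$queues")"
describe_queues "$queues"
```

If that reading is right, the value I passed defines one per-peer queue and one shared receive queue.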

Am I looking in the right place?

System setup:
Centos-5.3
Ofed-1.4.1
Intel Compiler 11.1.038
Openmpi-1.3.3 and 1.4

Build options:

./configure CC=icc CXX=icpc F77=ifort F90=ifort FC=ifort --prefix=/opt/openmpi/1.3.3-intel --without-sge --with-openib --enable-io-romio --with-io-romio-flags=--with-file-system=lustre --with-pic

Thanks,
Craig
