This looks like a bug. The variable source is reported as MCA_BASE_VAR_SOURCE_FILE both for variables read from mca-params.conf and for values read from the INI file (opal/mca/btl/openib/mca-btl-openib-device-params.ini). However, mca_btl_openib_tune_endpoint(), where this error is triggered, only checks the values from the INI file (via opal_btl_openib_ini_query()) when validating the receive queue settings.
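To make the suspected mismatch concrete, here is a minimal, self-contained C sketch. It is not the Open MPI code: every type and function name in it (var_source_t, file_kind_t, setting_t, is_user_override_coarse, is_user_override_fine) is hypothetical, and it assumes the check only needs to know whether the user set the value explicitly. It only illustrates why a single "file" source tag cannot tell a user's mca-params.conf entry apart from a device default taken from the INI file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a variable-source tag. */
typedef enum {
    SRC_DEFAULT,
    SRC_COMMAND_LINE,
    SRC_ENV,
    SRC_FILE
} var_source_t;

/* Hypothetical extra detail: which kind of file a SRC_FILE value came from. */
typedef enum {
    FILE_NONE,
    FILE_USER_PARAMS_CONF, /* e.g. ~/.openmpi/mca-params.conf */
    FILE_DEVICE_INI        /* e.g. mca-btl-openib-device-params.ini */
} file_kind_t;

typedef struct {
    const char  *value;     /* receive queue specification string */
    var_source_t source;    /* coarse source tag */
    file_kind_t  file_kind; /* only meaningful when source == SRC_FILE */
} setting_t;

/* Coarse check: any SRC_FILE value is treated like a device default,
 * so a value from the user's params file is NOT seen as an override. */
bool is_user_override_coarse(const setting_t *s)
{
    assert(s != NULL);
    return s->source == SRC_COMMAND_LINE || s->source == SRC_ENV;
}

/* Finer check: a SRC_FILE value counts as a user override only when it
 * came from the user's params file, not from the device INI file. */
bool is_user_override_fine(const setting_t *s)
{
    assert(s != NULL);
    if (s->source == SRC_FILE) {
        return s->file_kind == FILE_USER_PARAMS_CONF;
    }
    return s->source == SRC_COMMAND_LINE || s->source == SRC_ENV;
}
```

With the coarse check, the same queue string is accepted as a user override when it comes from the command line but not when it comes from the params file, which matches the behaviour reported in the quoted message. Whether the real fix should split the file source or handle the INI values separately is an open question.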
I am not sure whether MCA_BASE_VAR_SOURCE_ENV should be considered as the source for variables read from the .openmpi/mca-params.conf file. Nathan, any idea?

-Devendar

-----Original Message-----
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
Sent: Friday, December 26, 2014 10:35 AM
To: Open MPI Developers
Subject: [OMPI devel] openib receive queue settings

I still see an issue with the openib receive queue settings. Interestingly, it seems to work if I pass the setting with the mpirun command, e.g.

mpirun --mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 --npernode 1 -np 2 ./lat

but if I add it to the ${HOME}/.openmpi/mca-params.conf file, e.g.

---snip---
cat ~/.openmpi/mca-params.conf
...
btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,32
...
---snip---

I receive the following error message:

gabriel@crill:~> mpirun --npernode 1 -np 2 ./lat
--------------------------------------------------------------------------
The Open MPI receive queue configuration for the OpenFabrics devices
on two nodes are incompatible, meaning that MPI processes on two
specific nodes were unable to communicate with each other. This
generally happens when you are using OpenFabrics devices from
different vendors on the same network. You should be able to use the
mca_btl_openib_receive_queues MCA parameter to set a uniform receive
queue configuration for all the devices in the MPI job, and therefore
be able to run successfully.

  Local host:     crill-003
  Local adapter:  mlx4_0 (vendor 0x2c9, part ID 26418)
  Local queues:   S,12288,128,64,32:S,65536,128,64,32

  Remote host:    crill-004
  Remote adapter: (vendor 0x2c9, part ID 26418)
  Remote queues:  P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
--------------------------------------------------------------------------

Does anybody have an idea what I should be looking for to fix this? I can definitely confirm that the home file system is mounted correctly on all nodes (i.e. all processes can access the same mca-params.conf file), and that they have identical IB hardware (contrary to what the error message says).

Thanks
Edgar

--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science, University of Houston
Philip G. Hoffman Hall, Room 524, Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335

Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16731.php