This looks like a bug. The source of the variable is reported as
MCA_BASE_VAR_SOURCE_FILE both for values read from mca-params.conf and for
values read from the INI file
(opal/mca/btl/openib/mca-btl-openib-device-params.ini).
However, mca_btl_openib_tune_endpoint(), where this error is triggered, only
checks values from the INI file (opal_btl_openib_ini_query()) when validating
the receive queue settings.
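
To make the ambiguity concrete, here is a minimal, self-contained C sketch.
It is not Open MPI code: every sketch_* name and the two-origin enum are
hypothetical stand-ins made up for illustration, and the real
mca_base_var_source_t enum and the real check in
mca_btl_openib_tune_endpoint() may well differ. It only models the behaviour
described above: two different origins end up with the same "file" source
tag, and a comparison against the peer's INI queues then reports a mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the MCA variable source tags. */
typedef enum {
    SKETCH_SOURCE_DEFAULT,
    SKETCH_SOURCE_COMMAND_LINE,
    SKETCH_SOURCE_ENV,
    SKETCH_SOURCE_FILE
} sketch_source_t;

/* Hypothetical: where a receive_queues value really came from. */
typedef enum {
    SKETCH_ORIGIN_USER_PARAM_FILE, /* e.g. ~/.openmpi/mca-params.conf */
    SKETCH_ORIGIN_DEVICE_INI       /* e.g. the device-params INI file */
} sketch_origin_t;

/* Models the reported behaviour: both origins collapse to the same tag,
 * so later code cannot tell a user setting from a device default. */
sketch_source_t sketch_source_of(sketch_origin_t origin)
{
    (void)origin; /* the origin is deliberately lost here */
    return SKETCH_SOURCE_FILE;
}

/* Models a check that compares the local queue string with the queues
 * the peer would get from its INI entry. Returns true on a mismatch. */
bool sketch_queues_mismatch(const char *local_queues,
                            const char *peer_ini_queues)
{
    assert(local_queues != NULL && peer_ini_queues != NULL);
    return strcmp(local_queues, peer_ini_queues) != 0;
}
```

With the queue strings from the report below, sketch_queues_mismatch()
returns true even though both nodes read the same user setting, because the
comparison is made against the INI defaults rather than the user's value.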

I am not sure whether it should treat MCA_BASE_VAR_SOURCE_ENV as the source
for variables read from the .openmpi/mca-params.conf file.
Nathan - any idea?

-Devendar

-----Original Message-----
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
Sent: Friday, December 26, 2014 10:35 AM
To: Open MPI Developers
Subject: [OMPI devel] openib receive queue settings

I still see an issue with the openib receive queue settings.
Interestingly, it seems to work if I pass the setting on the mpirun command
line, e.g.

mpirun --mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 --npernode 1 -np 2 ./lat

but if I add it to the ${HOME}/.openmpi/mca-params.conf file, e.g.

---snip---
  cat  ~/.openmpi/mca-params.conf
...
btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,32
...
----------snip---

I receive the following error message:
gabriel@crill:~> mpirun  --npernode 1 -np 2 ./lat
--------------------------------------------------------------------------
The Open MPI receive queue configuration for the OpenFabrics devices on two 
nodes are incompatible, meaning that MPI processes on two specific nodes were 
unable to communicate with each other.  This generally happens when you are 
using OpenFabrics devices from different vendors on the same network.  You 
should be able to use the mca_btl_openib_receive_queues MCA parameter to set a 
uniform receive queue configuration for all the devices in the MPI job, and 
therefore be able to run successfully.

   Local host:       crill-003
   Local adapter:    mlx4_0 (vendor 0x2c9, part ID 26418)
   Local queues:     S,12288,128,64,32:S,65536,128,64,32

   Remote host:      crill-004
   Remote adapter:   (vendor 0x2c9, part ID 26418)
   Remote queues:    P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
--------------------------------------------------------------------------

Does anybody have an idea what I should be looking for to fix this? I can
definitely confirm that the home file system is mounted correctly on all nodes
(i.e. all processes can access the same mca-params.conf file), and that the
nodes have identical IB hardware (contrary to what the error message says).


Thanks
Edgar

--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/12/16731.php
