Hello John,
First, thanks for your feedback.
On 17 Dec 2010 at 16:13, John Hearns wrote:
On 17 December 2010 14:45, Gilbert Grosdidier
<gilbert.grosdid...@cern.ch> wrote:
Hello,
About this issue, for which I got NO feedback ;-)
Gilbert, as you have an SGI cluster, have you filed a support
request to SGI?
gg= Yes, I filed one, but with no luck so far.
Also, which firmware do you have installed?
I have Firmware version: 2.5.0
gg= I don't know, and firmware_revs does not seem to be available.
The only thing I got on a worker node was from lspci:
03:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX IB DDR, PCIe 2.0 5GT/s] (rev a0)
http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-docs/mlx4_release_notes.txt
gg= Looking into this one, I noticed pointers to /etc/infiniband/connectx.conf
and /sbin/connectx_port_config, but neither of them is available here.
Features that are enabled with FW 2.5.0 only:
- Send with invalidate and Local invalidate send queue work requests.
- Resize CQ support.
gg= I also spotted some special hooks in the openib code around
HAVE_IBV_GET_DEVICE_LIST, HAVE_IBV_CREATE_XRC_RCV_QP and
HAVE_IBV_FORK_INIT.
Could any of them be a problem in combination with ConnectX HCAs?
Thanks, Best, G.
I recently spotted, in the btl_openib.c code, that this error message
could come from an ibv_resize_cq function that is missing for the
ConnectX HCA. Well ...
I have not yet been able to figure out why/how this could occur, but I
now have a closely related question about the ConnectX InfiniBand HCA:
does anybody know which other IB functionalities
could be unimplemented for this ConnectX HCA?
This would allow me to patch the OpenMPI code appropriately by hand,
since I currently believe the configure step does not detect
these functionalities as missing.
Thanks, Best, G.
On 15 Dec 2010 at 08:59, Gilbert Grosdidier wrote:
Hello,
Running OpenMPI 1.4.3 on an SGI Altix cluster with 2048 cores,
I got this error message on all cores, right at startup:
btl_openib.c:211:adjust_cq] cannot resize completion queue, error: 12
What could be the culprit?
Is there a workaround?
Which parameter should be tuned?
Thanks in advance for any help, Best, G.
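
For reference, the number at the end of that log line reads like a standard errno value. Below is a minimal sketch in C, assuming a Linux node (where 12 is ENOMEM) and assuming the code may arrive with either sign. describe_verbs_error is a hypothetical helper written for this illustration; it is not part of Open MPI or libibverbs.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI or libibverbs function):
 * turn the numeric code from the log line into a short description. */
const char *describe_verbs_error(int rc)
{
    /* Assumption: the value may have been negated along the way,
     * so accept either sign. */
    int code = rc < 0 ? -rc : rc;

    if (code == 0)
        return "success";
    if (code == ENOMEM)
        return "ENOMEM: out of memory, or a memory-related limit was hit";
    if (code == ENOSYS)
        return "ENOSYS: operation not implemented";

    /* Anything else: fall back to the C library's generic message. */
    return strerror(code);
}
```

On Linux, describe_verbs_error(12) gives the ENOMEM text. ENOSYS is the code one would normally expect from an operation that is not implemented at all, and it is a different value from 12, so the two cases can be told apart from the log line alone.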