Hi Francesco,
Can you please provide the complete output of the ibv_devinfo -v command?
Also, it seems that you are running CentOS 5.8 with the MXM build for
CentOS 5.7 installed. We will check whether a distro version
incompatibility could be causing this and update you.
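
In the meantime, here is a rough sketch in Python of one way to summarize what
ibv_devinfo reports for each port. It assumes the 'hca_id:', 'port:', 'state:'
and 'link_layer:' line layout that ibv_devinfo prints; the link_layer line only
appears when the installed libibverbs reports it. parse_ports and SAMPLE are
hypothetical names used only for illustration, and SAMPLE is abridged from the
output you posted.

```python
import re

# Abridged from the ibv_devinfo output quoted in this thread.
SAMPLE = """\
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.000
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 4
                        port_lid:               6
                        port_lmc:               0x00
"""


def parse_ports(devinfo_text):
    """Return one dict per port with keys hca, port, state, link_layer.

    state and link_layer stay None when the text has no such line for
    that port. This is an illustrative parser, not part of any library.
    """
    ports = []
    hca = None
    for raw in devinfo_text.splitlines():
        # Tolerate mail-quoting prefixes ("> ") and indentation.
        line = raw.lstrip("> \t").strip()

        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            hca = m.group(1)
            continue

        # "port:   1" opens a new port section; "port_lid:" does not match.
        m = re.match(r"port:\s*(\d+)$", line)
        if m:
            ports.append({"hca": hca, "port": int(m.group(1)),
                          "state": None, "link_layer": None})
            continue

        if not ports:
            continue

        # Anchored at line start, so "phys_state:" is not mistaken for "state:".
        m = re.match(r"(state|link_layer):\s*(\S+)", line)
        if m:
            ports[-1][m.group(1)] = m.group(2)
    return ports


if __name__ == "__main__":
    for p in parse_ports(SAMPLE):
        print(p["hca"], "port", p["port"],
              "state:", p["state"],
              "link_layer:", p["link_layer"] or "(not reported)")
```

To run it against a real node, the text could come from something like
subprocess.run(["ibv_devinfo", "-v"], capture_output=True, text=True).stdout
instead of SAMPLE.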

Alina/Josh - please follow.

Regards
M

On Thu, Jan 17, 2013 at 4:09 PM, Francesco Simula <
francesco.sim...@roma1.infn.it> wrote:

> I tried building from OMPI 1.6.3 tarball with the following ./configure:
> ./configure \
> --prefix=/apotto/home1/homedirs/fsimula/Lavoro/openmpi-1.6.3/install/ \
> --disable-mpi-io \
> --disable-io-romio \
> --enable-dependency-tracking \
> --without-slurm \
> --with-platform=optimized \
> --disable-mpi-f77 \
> --disable-mpi-f90 \
> --with-openib \
> --disable-static \
> --enable-shared \
> --disable-vt \
> --enable-pty-support \
> --enable-mca-no-build=btl-ofud,pml-bfo \
> --with-mxm=/opt/mellanox/mxm \
> --with-mxm-libdir=/opt/mellanox/mxm/lib
>
> As you can see from the last two lines, I want to enable the MXM transport
> layer on a cluster made of SuperMicro X8DTG-D boards with dual Xeons and
> Mellanox MT26428 HCAs; the OS is CentOS 5.8.
>
> I tried with two different .rpm's for MXM, either
> 'mxm-1.1.ad085ef-1.x86_64-centos5u7.rpm' taken from here:
> http://www.mellanox.com/downloads/hpc/mxm/v1.1/mxm-latest.tar
>
> and 'mxm-1.5.f583875-1.x86_64-centos5u7.rpm' taken from here:
> http://www.mellanox.com/downloads/hpc/mxm/v1.5/mxm-latest.tar
>
> With both, even if the compilation concludes successfully, a simple test
> (osu_bw from the OSU Micro-Benchmarks 3.8) fails with the sort of message
> reported below; the lines:
>
> rdma_dev.c:122  MXM DEBUG Port 1 on mlx4_0 has a link layer different from
> IB. Skipping it
> rdma_dev.c:155  MXM ERROR An active IB port on a Mellanox device, with lid
> [any] gid [any] not found
>
> make it seem like MXM cannot access the HCA hardware: is that so? The
> very same test works when using '-mca pml ob1' (thus using the openib BTL).
>
> I'm quite ready to start pulling my hair out; any suggestions?
>
> The output of /usr/bin/ibv_devinfo for the two cluster nodes follows:
> [cut]
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.7.000
>         node_guid:                      0025:90ff:ff07:0ac4
>         sys_image_guid:                 0025:90ff:ff07:0ac7
>         vendor_id:                      0x02c9
>         vendor_part_id:                 26428
>         hw_ver:                         0xB0
>         board_id:                       SM_1061000001000
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 4
>                         port_lid:               6
>                         port_lmc:               0x00
> [/cut]
>
> [cut]
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.7.000
>         node_guid:                      0025:90ff:ff07:0acc
>         sys_image_guid:                 0025:90ff:ff07:0acf
>         vendor_id:                      0x02c9
>         vendor_part_id:                 26428
>         hw_ver:                         0xB0
>         board_id:                       SM_1061000001000
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 4
>                         port_lid:               8
>                         port_lmc:               0x00
> [/cut]
>
> The complete output of the failing test follows:
>
> [fsimula@agape5 osu-micro-benchmarks-3.8]$ mpirun -x MXM_LOG_LEVEL=poll
> -mca pml cm -mca mtl_mxm_np 1 -np 2 -host agape4,agape5
> install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
> [1358430343.266782] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> [1358430343.266815] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_HANDLE_ERRORS=bt
> [1358430343.266826] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_GDB_PATH=/usr/bin/gdb
> [1358430343.266838] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_DUMP_SIGNO=1
> [1358430343.266851] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_DUMP_LEVEL=conn
> [1358430343.266924] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_ASYNC_MODE=THREAD
> [1358430343.266936] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_TIME_ACCURACY=0.1
> [1358430343.266956] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_PTLS=self,shm,rdma
> [1358430343.267249] [agape5:8596 :0]     mpool.c:265  MXM DEBUG mpool
> 'ptl_self_recv_ev': allocated chunk 0xc075f40 of 96016 bytes with 1000
> elements
> [1358430343.267308] [agape5:8596 :0]     mpool.c:156  MXM DEBUG mpool
> 'ptl_self_recv_ev': align 16, maxelems 1000, elemsize 88, padding 8
> [1358430343.267316] [agape5:8596 :0]      self.c:410  MXM DEBUG Created
> ptl_self
> [1358430343.267333] [agape5:8596 :0]   shm_ptl.c:56   MXM DEBUG Created
> ptl_shm
> [1358430343.268457] [agape5:8596 :0]  rdma_ptl.c:65   MXM TRACE Got 1 IB
> devices
> [1358430343.268640] [agape5:8596 :0]  rdma_ptl.c:112  MXM DEBUG added
> device mlx4_0
> [1358430343.268665] [agape5:8596 :0]    memreg.c:187  MXM TRACE Created
> memory registration cache on 1 devices
> [1358430343.268676] [agape5:8596 :0]  rdma_ptl.c:133  MXM DEBUG Created
> ptl_rdma
> [1358430343.268689] [agape5:8596 :0]     event.c:353  MXM FUNC
>  mxm_event_init(event=0x2b73e0ee3038 mode=2 time_accuracy=160000000)
> [1358430343.268698] [agape5:8596 :0]    timerq.c:55   MXM FUNC
>  mxm_timerq_init(timerq=0x2b73e0ee3060 accuracy=160000000
> max_interval=1600000000)
> [1358430343.268706] [agape5:8596 :0]     event.c:292  MXM FUNC
>  mxm_event_add_thread_context(thread=0x2b73e0ee30d0)
> [1358430343.268732] [agape5:8596 :0]     event.c:198  MXM FUNC
>  mxm_set_fd_nonblock(fd=10)
> [1358430343.268741] [agape5:8596 :0]     event.c:198  MXM FUNC
>  mxm_set_fd_nonblock(fd=11)
> [1358430343.268841] [agape5:8596 :0]       mxm.c:162  MXM INFO  context
> 0x2b73e0ee3010 created
> [1358430343.269090] [agape5:8596 :1]     event.c:41   MXM FUNC
>  __call_handler(handler->cb=0x2b73e0ab28a0 handler->arg=0x2b73e0ee3038)
> [1358430343.269104] [agape5:8596 :1]    timerq.c:88   MXM FUNC
>  mxm_timerq_sweep(timerq=0x2b73e0ee3060 current_time=568595527963578)
> [1358430343.274685] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_ENABLE_HUGETLB=1
> [1358430343.274700] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_ENABLE_TIMEOUTS=y
> [1358430343.274709] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_ACK_TIMEOUT=0.3
> [1358430343.274721] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_POLL_INTERVAL=0.1
> [1358430343.274742] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_WINDOW_SIZE=512
> [1358430343.274755] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_TX_BATCH=1
> [1358430343.274764] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_CQ_MODERATION=64
> [1358430343.274773] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_DRAIN_CQ=n
> [1358430343.274782] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_RNDV_THRESH=65536
> [1358430343.274791] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_ZCOPY_THRESH=2040
> [1358430343.274815] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_RESIZE_CQ=y
> [1358430343.274826] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_UD_MTU=65536
> [1358430343.274836] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_UD_RX_QUEUE_LEN=16000
> [1358430343.274849] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_UD_TX_QUEUE_LEN=64
> [1358430343.274859] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_UD_RX_MAX_BUFFERS=128000
> [1358430343.274877] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_UD_TX_MAX_BUFFERS=8192
> [1358430343.274887] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_UD_RX_DROP_RATE=0
> [1358430343.274896] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_UD_ENABLE_NAK=y
> [1358430343.274904] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_UD_RX_FILL_THRESH=0.6
> [1358430343.274915] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_UD_TX_MAX_INLINE=128
> [1358430343.274925] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_SHM_RX_MAX_BUFFERS=2000
> [1358430343.274941] [agape5:8596 :0] config_parser.c:168  MXM DEBUG
> default: MXM_RDMA_ALLOC=1
> [1358430343.274968] [agape5:8596 :0]        ep.c:36   MXM FUNC
>  mxm_ep_create(context=0x2b73e0ee3010)
> [1358430343.274984] [agape5:8596 :0]      self.c:380  MXM DEBUG Created
> ptl_self EP(rank=3767085072)
> [1358430343.275028] [agape5:8596 :0] shm_queue.c:230  MXM DEBUG shm_ep=0,
> shmid=6815750
> [1358430343.275072] [agape5:8596 :0]     mpool.c:265  MXM DEBUG mpool
> 'shm_ep_recv': allocated chunk 0x2aaaadd0c010 of 65824016 bytes with 2000
> elements
> [1358430343.278550] [agape5:8596 :0]     mpool.c:156  MXM DEBUG mpool
> 'shm_ep_recv': align 16, maxelems 2000, elemsize 32904, padding 8
> [1358430343.278584] [agape5:8596 :0]    timerq.c:139  MXM FUNC
>  mxm_timer_schedule(timerq=0x2b73e0ee3060 timer=0xc029538
> expiration=568595550657300)
> [1358430343.278594] [agape5:8596 :0]    timerq.c:43   MXM FUNC
>  mxm_timerq_insert_timer(put timer 0xc029538 expiration 568595550657300 in
> slot 10)
> [1358430343.278608] [agape5:8596 :0]    timerq.c:145  MXM TRACE added
> timer 0xc029538 expiration 568595550657300 interval 160000000
> [1358430343.278617] [agape5:8596 :0]    shm_ep.c:176  MXM DEBUG Created
> ptl_shm EP (rank=0, ctx_id=1)
> [1358430343.278641] [agape5:8596 :0]   rdma_ep.c:317  MXM FUNC
>  mxm_rdma_ep_create()
> [1358430343.278722] [agape5:8596 :0]  rdma_dev.c:194  MXM FUNC
>  mxm_rdma_dev_init(dev=0xc0b3f00)
> [1358430343.278924] [agape5:8596 :0]  rdma_dev.c:122  MXM DEBUG Port 1 on
> mlx4_0 has a link layer different from IB. Skipping it
> [1358430343.278939] [agape5:8596 :0]  rdma_dev.c:155  MXM ERROR An active
> IB port on a Mellanox device, with lid [any] gid [any] not found
> [1358430343.278954] [agape5:8596 :0]    timerq.c:150  MXM FUNC
>  mxm_timer_cancel(timerq=0x2b73e0ee3060 timer=0xc029538)
> [1358430343.279454] [agape5:8596 :0]     mpool.c:184  MXM DEBUG mpool
> 'shm_ep_recv': destroyed
> [1358430343.279466] [agape5:8596 :0]      self.c:287  MXM FUNC
>  mxm_self_ep_destroy(ep=0xc094600)
> --------------------------------------------------------------------------
> MXM was unable to create an endpoint. Please make sure that the network
> link is
> active on the node and the hardware is functioning.
>
>   Error: No such device
>
> --------------------------------------------------------------------------
> [1358430343.287336] [agape5:8596 :0]     event.c:400  MXM FUNC
>  mxm_event_cleanup(event=0x2b73e0ee3038)
> [1358430343.287348] [agape5:8596 :0]     event.c:338  MXM FUNC
>  mxm_event_remove_thread_context(thread=0x2b73e0ee30d0)
> [1358430343.287355] [agape5:8596 :0]     event.c:145  MXM FUNC
>  mxm_event_thread_wakeup()
> [1358430343.371011] [agape5:8596 :0]    timerq.c:76   MXM FUNC
>  mxm_timerq_cleanup(timerq=0x2b73e0ee3060)
> [1358430343.371030] [agape5:8596 :0]    memreg.c:194  MXM TRACE Destroying
> memory registration cache
> [1358430343.371129] [agape5:8596 :0]   shm_ptl.c:34   MXM FUNC
>  ptl_shm_destroy(ptl=0xc0729b0)
> [1358430343.371139] [agape5:8596 :0]      self.c:340  MXM FUNC
>  mxm_self_destroy(ptl=0xc0699a0)
> [1358430343.371148] [agape5:8596 :0]     mpool.c:184  MXM DEBUG mpool
> 'ptl_self_recv_ev': destroyed
> [1358430343.371156] [agape5:8596 :0]       mxm.c:197  MXM INFO  context
> 0x2b73e0ee3010 destroyed
> --------------------------------------------------------------------------
> No available pml components were found!
>
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
>
> This is a fatal error; your MPI process is likely to abort.  Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system.  You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components.
> --------------------------------------------------------------------------
> [agape5:08596] PML cm cannot be selected
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 8596 on
> node agape5 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> Regards,
> Francesco
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
