Subbu,

We don't have any tips for setting up IPoIB. It looks like Linux can't find the interface address of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I think you didn't assign an address to ib0 (or the assignment failed) before loading o2iblnd on the first try. I can reproduce exactly the same error with:

1. modprobe ib_ipoib
2. ifconfig ib0 up // without assigning any address
3. modprobe ko2iblnd
4. lctl network up
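To guard against this ordering problem, something like the following shell sketch could check that ib0 already has an IPv4 address before ko2iblnd is loaded, and translate the negative error codes that show up in the logs in this thread. This is only an illustration, not part of the Lustre tools: the function names has_ipv4_addr, errno_name and load_o2iblnd are made up here, and the errno numbers are the standard Linux values.

```shell
#!/bin/sh
# Sketch only: all three function names below are hypothetical helpers.

# Succeeds if the ifconfig-style text on stdin contains an IPv4 address line.
# Matches both the older "inet addr:a.b.c.d" and the newer "inet a.b.c.d" form;
# "inet6" lines do not match because a space must follow "inet".
has_ipv4_addr() {
    grep -Eq 'inet (addr:)?[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
}

# Map the Linux errno values seen in this thread to their symbolic names.
# Accepts the value with or without the leading minus sign used in kernel logs.
errno_name() {
    case "${1#-}" in
        5)   echo EIO ;;
        22)  echo EINVAL ;;
        99)  echo EADDRNOTAVAIL ;;
        100) echo ENETDOWN ;;
        *)   echo UNKNOWN ;;
    esac
}

# Refuse to load ko2iblnd while the given interface has no IPv4 address.
# Intended to be run as root on a real node, e.g.: load_o2iblnd ib0
load_o2iblnd() {
    if ifconfig "$1" | has_ipv4_addr; then
        modprobe ko2iblnd
    else
        echo "$1 has no IPv4 address; assign one before loading ko2iblnd" >&2
        return 1
    fi
}
```

With these definitions, `errno_name -99` prints EADDRNOTAVAIL and `errno_name -100` prints ENETDOWN, which matches the two failures logged when LNet started before ib0 had an address.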
Regards
Liang

subbu kl:
> Liang,
> after executing the following echo:
> echo +neterror > /proc/sys/lnet/printk
>
> lctl ping now shows the following error:
>
> # lctl ping 172.24.198....@o2ib
> failed to ping 172.24.198....@o2ib: Input/output error
>
> Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198....@o2ib: ROUTE ERROR -22
> Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198....@o2ib: connection failed
>
> It looks like there is some problem with the "IB connection manager"!
>
> 1. Do we have any help docs for setting up IPoIB and Lustre? The Lustre
>    operations manual has very minimal info about this. I think I am
>    missing some part of the IPoIB setup here.
> 2. Or is the manual assignment of IP addresses to "ib0" creating
>    some problem?
>
> Some more supporting info:
> A subnet manager of the following version is also running: OpenSM 3.1.8
>
> Initially I got this error for the MDS mount:
>
> Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get IP address for interface ib0
> Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB interface ib0: -99
> Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting up LNI o2ib
> Jan 16 09:45:21 p128 kernel: LustreError: 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed
> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): Input/output error
> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): Unknown symbol in module, or unknown parameter (see dmesg)
> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req
> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get
> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ptlrpc_lprocfs_register_obd
> .
> .
> .
>
> Then I manually set the IP address for ib0 as follows:
> ifconfig ib0 172.24.198.111
>
> [r...@p186 ~]# ifconfig ib0
> ib0       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet addr:172.24.198.112  Bcast:172.24.255.255  Mask:255.255.0.0
>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:256
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>
> Then it mounted successfully:
>
> Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198....@o2ib [8/64]
> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started
> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000
> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk, initializing
> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving dev (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with recovery enabled
> Jan 16 09:47:09 p128 kernel: Lustre: 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device /dev/loop0 has started
> .
> .
> .
>
> ~subbu
>
> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <zhen.li...@sun.com> wrote:
>
> Subbu,
>
> I'd suggest:
> 1) make sure ko2iblnd has been brought up (please check whether there
>    is any error message when starting ko2iblnd)
> 2) echo +neterror > /proc/sys/lnet/printk, then try lctl ping;
>    if it still doesn't work, please post the error messages
>
> Regards
> Liang
>
> subbu kl:
>
> The problem is similar to
> http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
> but from reading that thread I could not really find the solution
> to the problem.
>
> I have two RHEL5 Linux servers installed with the following packages:
>
> kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
> kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> e2fsprogs-1.40.7.sun3-0redhat
>
> machine 1: ib0 IP address 172.24.198.111
> machine 2: ib0 IP address 172.24.198.112
>
> /etc/modprobe.conf contains
> options lnet networks=o2ib
>
> TCP networking worked fine. Now I am trying the InfiniBand network
> and finding it difficult to communicate with the IB nodes.
> The mount attempt throws the following error:
>
> [r...@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
> mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error
> Is the MGS running?
>
> /var/log/messages:
> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds
> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds
> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
> Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from mgc172.24.198....@o2ib to NID 172.24.198....@o2ib 5s ago has timed out (limit 5s).
> Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1062:server_start_targets()) Required registration failed for lustre-OSTffff: -5
> Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error with the MGS. Is the MGS running?
> Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5
> Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff
> Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OSTffff not registered
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
> Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OSTffff complete
> Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-5)
>
> All attempts to ping the IB NIDs, local and remote, also failed.
> I can ping the IP addresses:
> [r...@p186 ~]# ping 172.24.198.112
> PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
> 64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
> 64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
>
> --- 172.24.198.112 ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
> [r...@p186 ~]# ping 172.24.198.111
> PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
> 64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
> 64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
>
> --- 172.24.198.111 ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms
>
> but I can't ping the NIDs:
> [r...@p186 ~]# lctl ping 172.24.198....@o2ib
> failed to ping 172.24.198....@o2ib: Input/output error
> [r...@p186 ~]# lctl ping 172.24.198....@o2ib
> failed to ping 172.24.198....@o2ib: Input/output error
>
> Any idea why LNet can't ping the NIDs?
>
> Some more configuration:
> [r...@p186 ~]# ibstat
> CA 'mthca0'
>     CA type: MT23108
>     Number of ports: 2
>     Firmware version: 3.5.0
>     Hardware version: a1
>     Node GUID: 0x0002c9020021550c
>
> The machines are connected via an IB switch.
>
> Looking forward to your help.
>
> ~subbu
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> --
> . . . s u b b u
> "You've got to be original, because if you're like someone else, what
> do they need you for?"