Hi all, I'm running a small Lustre system (1.8.1.1): 1 MDS, 1 OSS, 2 clients.
Each node has Gigabit Ethernet and InfiniBand (mlx4_0), with IPoIB set up. I'm trying to use the IB transport. /etc/modprobe.conf is the same on all nodes:
----------
alias eth0 e1000e
alias eth1 e1000e
alias eth2 8139too
alias scsi_hostadapter aic79xx
alias scsi_hostadapter1 ata_piix
alias ib0 ib_ipoib
alias ib1 ib_ipoib
options lnet ip2nets="o2ib0(ib0) 172.16.0.[1-255]"
-----------------------

MDS has IP 192.168.104.4, IPoIB 172.16.0.3.
OSS has IP 192.168.104.5, IPoIB 172.16.0.4.

On the MDS, "/usr/sbin/mkfs.lustre" succeeded, but "mount -t lustre" failed, and the OSS is also unable to mount because it cannot reach the MDS. Below is some log output.

Here is /var/log/messages from the MDS:
----------------------
Apr 19 09:59:18 i3 kernel: Lustre: Added LNI 172.16....@o2ib [8/64/0/0]
Apr 19 09:59:19 i3 kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Apr 19 09:59:19 i3 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:19 i3 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:19 i3 kernel: LDISKFS-fs: recovery complete.
Apr 19 09:59:19 i3 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Apr 19 09:59:19 i3 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:19 i3 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:19 i3 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Apr 19 09:59:19 i3 kernel: Lustre: MGS MGS started
Apr 19 09:59:19 i3 kernel: Lustre: mgc172.16....@o2ib: Reactivating import
Apr 19 09:59:19 i3 kernel: Inside function: ldlm_cli_enqueue
Apr 19 09:59:19 i3 kernel: Inside function:ldlm_handle_enqueue
Apr 19 09:59:19 i3 kernel: Inside function ldlm_lock_enqueue
Apr 19 09:59:19 i3 kernel: Time spent in policy function to grab the lock5
Apr 19 09:59:19 i3 kernel: Time spent in ldlm_lock_enqueue 24
Apr 19 09:59:19 i3 kernel: Time spent in ptlrpc_queue_wait 269
Apr 19 09:59:19 i3 kernel: Inside function ldlm_lock_enqueue
Apr 19 09:59:19 i3 kernel: Inside function:ldlm_completion_ast
Apr 19 09:59:19 i3 kernel: Lustre: Enabling user_xattr
Apr 19 09:59:19 i3 kernel: Lustre: MDT lustre-MDT0000 now serving lustre-MDT0000_UUID (lustre-MDT0000/d92432e1-1f9a-b963-44cf-c7d529e44575) with recovery enabled
Apr 19 09:59:19 i3 kernel: Lustre: 24287:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
Apr 19 09:59:19 i3 kernel: Lustre: lustre-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Apr 19 09:59:19 i3 kernel: LustreError: 24287:0:(events.c:460:ptlrpc_uuid_to_peer()) No NID found for 192.168.10...@tcp
Apr 19 09:59:19 i3 kernel: LustreError: 24287:0:(client.c:69:ptlrpc_uuid_to_connection()) cannot find peer 192.168.10...@tcp!
Apr 19 09:59:19 i3 kernel: LustreError: 24287:0:(ldlm_lib.c:329:client_obd_setup()) can't add initial connection
Apr 19 09:59:19 i3 kernel: LustreError: 24287:0:(obd_config.c:370:class_setup()) setup lustre-OST0000-osc failed (-2)
Apr 19 09:59:19 i3 kernel: LustreError: 24287:0:(obd_config.c:1197:class_config_llog_handler()) Err -2 on cfg command:
Apr 19 09:59:19 i3 kernel: Lustre: cmd=cf003 0:lustre-OST0000-osc 1:lustre-OST0000_UUID 2:192.168.10...@tcp
Apr 19 09:59:19 i3 kernel: LustreError: 15c-8: mgc172.16....@o2ib: The configuration from log 'lustre-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Apr 19 09:59:19 i3 kernel: LustreError: 23990:0:(obd_mount.c:1114:server_start_targets()) failed to start server lustre-MDT0000: -2
Apr 19 09:59:19 i3 kernel: LustreError: 23990:0:(obd_mount.c:1629:server_fill_super()) Unable to start targets: -2
Apr 19 09:59:19 i3 kernel: Lustre: Failing over lustre-MDT0000
Apr 19 09:59:19 i3 kernel: Lustre: *** setting obd lustre-MDT0000 device 'dm-2' read-only ***
Apr 19 09:59:19 i3 kernel: Turning device dm-2 (0xfd00002) read-only
Apr 19 09:59:19 i3 kernel: Lustre: Failing over lustre-mdtlov
Apr 19 09:59:19 i3 kernel: Lustre: lustre-MDT0000: shutting down for failover; client state will be preserved.
Apr 19 09:59:19 i3 kernel: Lustre: MDT lustre-MDT0000 has stopped.
Apr 19 09:59:19 i3 kernel: LustreError: 23990:0:(ldlm_request.c:1074:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Apr 19 09:59:19 i3 kernel: LustreError: 23990:0:(ldlm_request.c:1579:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Apr 19 09:59:19 i3 kernel: Lustre: MGS has stopped.
Apr 19 09:59:19 i3 kernel: Removing read-only on unknown block (0xfd00002)
Apr 19 09:59:19 i3 kernel: Lustre: server umount lustre-MDT0000 complete
Apr 19 09:59:19 i3 kernel: LustreError: 23990:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount (-2)
Apr 19 09:59:30 i3 kernel: Lustre: 24058:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-172.16....@o2ib portal 26 match 1333458968248321 offset 0 length 368: 2
Apr 19 09:59:36 i3 kernel: Lustre: 24060:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-172.16....@o2ib portal 26 match 1333458974539777 offset 0 length 368: 2
Apr 19 09:59:42 i3 kernel: Lustre: 24061:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-172.16....@o2ib portal 26 match 1333458980831233 offset 0 length 368: 2
-------------------

Here is /var/log/messages from the OSS:
----------------
Apr 19 09:59:30 i4 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:30 i4 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Apr 19 09:59:30 i4 restorecond: Will not restore a file with more than one hard link (/etc/resolv.conf) Invalid argument
Apr 19 09:59:30 i4 restorecond: Will not restore a file with more than one hard link (/etc/resolv.conf) Invalid argument
Apr 19 09:59:30 i4 kernel: Lustre: OBD class driver, http://www.lustre.org/
Apr 19 09:59:30 i4 kernel: Lustre: Lustre Version: 1.8.1.1
Apr 19 09:59:30 i4 kernel: Lustre: Build Version: 1.8.1.1-20091009095116-PRISTINE-2.6.18-128.7.1.el5-lustre.1.8.1.1smp-cust
Apr 19 09:59:30 i4 kernel: Lustre: Listener bound to ib0:172.16.0.4:987:mlx4_0
Apr 19 09:59:30 i4 kernel: Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1
Apr 19 09:59:30 i4 kernel: Lustre: Added LNI 172.16....@o2ib [8/64/0/0]
Apr 19 09:59:30 i4 kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Apr 19 09:59:30 i4 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:30 i4 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Apr 19 09:59:30 i4 kernel: kjournald starting. Commit interval 5 seconds
Apr 19 09:59:30 i4 kernel: LDISKFS FS on dm-2, internal journal
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: file extents enabled
Apr 19 09:59:30 i4 kernel: LDISKFS-fs: mballoc enabled
Apr 19 09:59:35 i4 kernel: Lustre: 6853:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1333458968248321 sent from mgc172.16....@o2ib to NID 172.16....@o2ib 5s ago has timed out (limit 5s).
Apr 19 09:59:35 i4 kernel: r...@ffff8100bfc31400 x1333458968248321/t0 o250->m...@mgc172.16.0.3@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1271685575 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 19 09:59:35 i4 kernel: LustreError: 6775:0:(obd_mount.c:1085:server_start_targets()) Required registration failed for lustre-OSTffff: -5
Apr 19 09:59:35 i4 kernel: LustreError: 15f-b: Communication error with the MGS. Is the MGS running?
Apr 19 09:59:35 i4 kernel: LustreError: 6775:0:(obd_mount.c:1629:server_fill_super()) Unable to start targets: -5
Apr 19 09:59:35 i4 kernel: LustreError: 6775:0:(obd_mount.c:1412:server_put_super()) no obd lustre-OSTffff
Apr 19 09:59:35 i4 kernel: LustreError: 6775:0:(obd_mount.c:136:server_deregister_mount()) lustre-OSTffff not registered
Apr 19 09:59:35 i4 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
Apr 19 09:59:35 i4 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
Apr 19 09:59:35 i4 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
Apr 19 09:59:35 i4 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
Apr 19 09:59:35 i4 kernel: Lustre: server umount lustre-OSTffff complete
Apr 19 09:59:35 i4 kernel: LustreError: 6775:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount (-5)
-------------------------

What's wrong with my config?

Thanks!
-Neutron

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
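For reference, the `options lnet ip2nets="o2ib0(ib0) 172.16.0.[1-255]"` line in the modprobe.conf of this post tells LNET which network to bring up on a node by matching the node's own IP addresses against the listed pattern. The following is a minimal Python sketch of that matching, assuming a simplified pattern syntax: literal octets, `*`, and bracketed lists of numbers or `a-b` ranges. The real LNET syntax is richer (for example, several `;`-separated entries), and the helper names `octet_matches`, `ip_matches` and `select_network` are hypothetical, not part of Lustre. Applied to the four addresses listed in the post, only the two IPoIB addresses select `o2ib0(ib0)`; the 192.168.104.x addresses match nothing, which is consistent with each log showing a single "Added LNI 172.16....@o2ib" line and no tcp interface.

```python
from typing import Optional


def octet_matches(pattern: str, value: int) -> bool:
    """Match one octet against a simplified ip2nets octet pattern (assumed syntax)."""
    if pattern == "*":
        return True
    if pattern.startswith("[") and pattern.endswith("]"):
        # Bracketed list such as [1-255] or [1,3-5]; strides are not supported here.
        for item in pattern[1:-1].split(","):
            item = item.strip()
            if "-" in item:
                low, high = (int(x) for x in item.split("-", 1))
                if low <= value <= high:
                    return True
            elif int(item) == value:
                return True
        return False
    return int(pattern) == value


def ip_matches(pattern: str, ip: str) -> bool:
    """True if a dotted-quad IP matches a pattern like 172.16.0.[1-255]."""
    pat_octets = pattern.split(".")
    ip_octets = [int(x) for x in ip.split(".")]
    if len(pat_octets) != 4 or len(ip_octets) != 4:
        return False
    return all(octet_matches(p, v) for p, v in zip(pat_octets, ip_octets))


def select_network(entry: str, ip: str) -> Optional[str]:
    """Return the network spec (e.g. 'o2ib0(ib0)') if ip matches one entry's patterns."""
    net, *patterns = entry.split()
    if any(ip_matches(p, ip) for p in patterns):
        return net
    return None


if __name__ == "__main__":
    entry = "o2ib0(ib0) 172.16.0.[1-255]"  # the ip2nets value from /etc/modprobe.conf
    addrs = {
        "MDS ethernet": "192.168.104.4",
        "MDS IPoIB": "172.16.0.3",
        "OSS ethernet": "192.168.104.5",
        "OSS IPoIB": "172.16.0.4",
    }
    for label, ip in addrs.items():
        print(f"{label:<13} {ip:<15} -> {select_network(entry, ip) or 'no match'}")
```

The sketch handles a single entry only; a node's full LNET configuration would come from evaluating every entry in the option.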