Ok, I figured it out. I have some creative /etc/sysconfig/network-scripts/ifcfg-ib* scripts that may choose to do nothing if no device is present (or some other esoteric, specific-to-jeffs-cluster criterion is met) -- they call "exit 0" in this case. This apparently causes the top-level /etc/init.d/openibd to exit (!). I've fixed this (they now never call "exit"); now everything works as expected.

Upon reflection, I can see that this was totally my fault -- ifcfg-* scripts are always sourced and should therefore never call "exit".
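A minimal sketch of the effect, assuming a POSIX shell. The fragment files and the /nonexistent/device check are made-up stand-ins, not the actual ifcfg-ib* scripts or anything from openibd. The "parent" is run in a child shell so that the demo itself survives the "exit".

```shell
#!/bin/sh
# Demonstration: "exit" inside a sourced file terminates the sourcing script.
tmpdir=$(mktemp -d)

# Hypothetical stand-in for an ifcfg-style fragment that bails out with "exit".
cat > "$tmpdir/frag-exit" <<'EOF'
[ -e /nonexistent/device ] || exit 0
DEVICE=ib0
EOF

# Same fragment using "return", which only leaves the sourced file.
cat > "$tmpdir/frag-return" <<'EOF'
[ -e /nonexistent/device ] || return 0
DEVICE=ib0
EOF

# Each "parent" sources a fragment and then tries to keep going.
out_exit=$(sh -c '. "$1"; echo "parent continued"' sh "$tmpdir/frag-exit")
out_return=$(sh -c '. "$1"; echo "parent continued"' sh "$tmpdir/frag-return")

echo "with exit:   '${out_exit}'"     # prints: with exit:   ''
echo "with return: '${out_return}'"   # prints: with return: 'parent continued'
rm -rf "$tmpdir"
```

Note that "return" is only safe if the file is always sourced; if the same file were ever executed directly, "return" at top level would be an error, so simply falling through to the end of the file is the more robust choice.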

But given that /etc/init.d/openibd is sooo complex and has sooo many moving parts, it would be nice if there were a way to track down problems a little more easily; perhaps a "verbose" setting in /etc/infiniband/openibd.conf, or somesuch. Indeed, since OFED is targeted at the datacenter, monitors attached to the servers in question and/or serial consoles may not be readily available. Hence, having the ability to drop some verbose output into syslog during boot, for example, might be quite useful to sysadmins/network admins when troubleshooting.
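As a rough illustration of the idea only: the VERBOSE variable and the vlog helper below are hypothetical and do not exist in OFED's openibd.conf or init script, as far as this thread indicates. The sketch assumes the standard logger(1) utility is available and falls back to stderr if it is not.

```shell
#!/bin/sh
# Hypothetical knob; an init script could pick this up from its config file.
VERBOSE=${VERBOSE:-no}

vlog() {
    # Stay silent unless verbose output was requested.
    [ "$VERBOSE" = "yes" ] || return 0
    # logger(1) sends the message to syslog, tagged "openibd" at daemon.info;
    # fall back to stderr if logger is missing or cannot reach syslog.
    logger -t openibd -p daemon.info -- "$*" 2>/dev/null \
        || echo "openibd: $*" >&2
}

vlog "loading HCA driver"
```

Calls like this sprinkled before and after each step would leave a trail in syslog showing how far the script got during boot, without needing a console.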

Just my $0.02.

Thanks for the tips on where to look, Woody!



On May 13, 2009, at 3:18 PM, Jeff Squyres (jsquyres) wrote:

On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote:

> Check to see if some other driver failed to load.
> I think I have seen before that if another driver
> fails to load, the start script bails out and
> does not load the other drivers.
>
> Perhaps try doing a /etc/init.d/openibd restart
> manually to see if something is failing to load.
>

Weird -- doing it manually shows no problem:

[r...@svbu-mpi055 ~]# /etc/init.d/openibd restart
Unloading HCA driver:                                      [  OK  ]
Loading HCA driver and Access Layer:                       [  OK  ]
Setting up InfiniBand network interfaces:
Bringing up interface ib0:                                 [  OK  ]
Bringing up interface ib1:                                 [  OK  ]
Setting up service network . . .                           [  done  ]
[r...@svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm
crw-rw-rw-  1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm
[r...@svbu-mpi055 ~]#

Something must be going wrong during the bootup.  I'm unfortunately
several thousand miles from the server and don't have a serial
console.  I guess I'll insert some initlog's in /etc/init.d/openibd...

--
Jeff Squyres
Cisco Systems

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


--
Jeff Squyres
Cisco Systems
