On Wed, 2008-04-09 at 13:39 -0700, Hal Rosenstock wrote:
> Hi Christopher,
>
> On Wed, 2008-04-09 at 13:14 -0600, Maestas, Christopher Daniel wrote:
> > Hello Hal,
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, April 09, 2008 12:38 PM
> > To: Maestas, Christopher Daniel
> > Cc: [email protected]
> > Subject: Re: running opensm 3.0.3 on 4000+ node system
> >
> > On Wed, 2008-04-09 at 12:26 -0600, Maestas, Christopher Daniel wrote:
> > > I'm trying to run opensm on a 4000+ node system,
> >
> > Which version ? Do you mean 3.0.3 (or 3.0.13) ?
> >
> > cdm> Version 3.0.13 ... you're right on that
> > # rpm -q opensm
> > opensm-3.0.3-6.el5_1.1
> > ---
> > Apr 9 12:49:53 HOST OpenSM[3295]: /var/log/osm.log log file opened
> > Apr 9 12:49:53 HOST OpenSM[3295]: OpenSM Rev:openib-3.0.13
> > Apr 9 12:49:53 HOST kernel: user_mad: process opensm did not enable P_Key index support.
> > Apr 9 12:49:53 HOST kernel: user_mad: Documentation/infiniband/user_mad.txt has info on the new ABI.
> > Apr 9 12:49:59 HOST OpenSM[3295]: Entering MASTER state
> > Apr 9 12:50:02 HOST OpenSM[3295]: Errors during initialization
>
> Your subnet has errors :-(
>
> > Apr 9 12:50:16 HOST OpenSM[3295]: SUBNET UP
> > Apr 9 12:50:22 HOST kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
> > Apr 9 12:50:30 HOST OpenSM[3295]: Errors during initialization
> > Apr 9 12:51:05 HOST last message repeated 2 times
> > Apr 9 12:52:17 HOST last message repeated 3 times
> > Apr 9 12:53:27 HOST last message repeated 3 times
> > ...
> >
> > > and seem to be having difficulties in keeping the opensm around.
> > > When I attach to the process w/ strace it does:
> > > ---
> > > # strace -p 5921
> > > Process 5921 attached - interrupt to quit
> > > restart_syscall(<... resuming interrupted call ...>) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > ...
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, NULL) = 0
> > > nanosleep({10, 0}, <unfinished ...>
> > > +++ killed by SIGSEGV +++
> > > ---
> > >
> > > I have ofed 1.1 and 1.2 drivers loaded on the system. I've done this in
> > > the past using opensm 3.0.0 svn tag 10188 from ofed 1.0 clients and had
> > > no issues before. Here's how opensm is running:
> > > ---
> > > 6079 pts/0 Sl 0:08 /usr/sbin/opensm -d 3 -maxsmps 0 -s 300 -t 1000 -f /var/log/osm.log -V -g 0
> > > ---
> > >
> > > I have lots of data in the osm.log as you can imagine ... I don't know
> > > offhand what I should be looking at/for.
> >
> > What's towards the end of the log ?
> >
> > cdm>
> > I rebooted the node ... then brought ib0, then restarted opensmd ... It
> > died when file got this big:
> > # ls -l osm.log -h
> > -rw-r--r-- 1 root root 3.2G Apr 9 13:12 osm.log
> > # tail osm.log
> > Apr 09 13:12:31 439877 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0089 Port 12 TID:0x00000000000032d3
> > Apr 09 13:12:31 440370 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00D0 Port 3 TID:0x0000000000007480
> > Apr 09 13:12:31 440669 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00B3 Port 7 TID:0x00000000000058dd
> > Apr 09 13:12:31 440987 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0082 Port 21 TID:0x000000000000285a
> > Apr 09 13:12:31 441228 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00E8 Port 10 TID:0x00000000000095a2
> > Apr 09 13:12:31 441579 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x004A Port 1 TID:0x0000000000010d29
> > Apr 09 13:12:31 441847 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0063 Port 24 TID:0x000000000000e40c
> > Apr 09 13:12:31 442130 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x000A Port 23 TID:0x000000000006fca2
> > Apr 09 13:12:31 442469 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 18 TID:0x0000000000059fc4
> > Apr 09 13:12:31 442710 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 17 TID:0x0000000000059fc5
>
> Those are flow control watchdog errors.

One possible explanation for this: SM could be (mis)configuring mismatched
OperVLs at the two ends of these links. Not sure why.

-- Hal

> Any special opensm options set
> in the option file or are you running with the defaults ?
>
> -- Hal
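
With a 3.2G osm.log, tail only shows the last few traps. To see which links are producing the trap 131 (flow control watchdog) notices, the log could be tallied by reporting LID and port. The following is a rough sketch only, not part of OpenSM: the regular expression is inferred from the ten log lines quoted above, it assumes each log entry sits on a single line in the actual file, and the names TRAP_RE, count_traps and SAMPLE are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread (assumption);
# other OpenSM versions may word trap messages differently.
TRAP_RE = re.compile(
    r"Received Generic Notice type:0x(?P<type>[0-9a-fA-F]+) "
    r"num:(?P<num>\d+) Producer:(?P<producer>\d+) "
    r"from LID:0x(?P<lid>[0-9a-fA-F]+) Port\s+(?P<port>\d+)"
)


def count_traps(lines, trap_num=131):
    """Count notices with the given trap number, keyed by (LID, port)."""
    counts = Counter()
    for line in lines:
        m = TRAP_RE.search(line)
        if m and int(m.group("num")) == trap_num:
            key = (int(m.group("lid"), 16), int(m.group("port")))
            counts[key] += 1
    return counts


# Three of the entries quoted above, used as sample input.
SAMPLE = [
    "Apr 09 13:12:31 439877 [43204940] -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0089 "
    "Port 12 TID:0x00000000000032d3",
    "Apr 09 13:12:31 442469 [43204940] -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 "
    "Port 18 TID:0x0000000000059fc4",
    "Apr 09 13:12:31 442710 [41E02940] -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 "
    "Port 17 TID:0x0000000000059fc5",
]

if __name__ == "__main__":
    # Print the noisiest (LID, port) pairs first.
    for (lid, port), n in count_traps(SAMPLE).most_common():
        print(f"LID 0x{lid:04X} port {port}: {n}")
```

For the real log, passing the open file object (for example `count_traps(open("/var/log/osm.log"))`) reads it line by line, so the whole file never has to fit in memory. The ports that top the list are the link ends whose OperVLs would be worth comparing.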
