On Wed, 2008-04-09 at 13:39 -0700, Hal Rosenstock wrote:
> Hi Christopher,
> 
> On Wed, 2008-04-09 at 13:14 -0600, Maestas, Christopher Daniel wrote:
> > Hello Hal,
> > 
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, April 09, 2008 12:38 PM
> > To: Maestas, Christopher Daniel
> > Cc: [email protected]
> > Subject: Re: running opensm 3.0.3 on 4000+ node system
> > 
> > On Wed, 2008-04-09 at 12:26 -0600, Maestas, Christopher Daniel wrote:
> > > I'm trying to run opensm on a 4000+ node system,
> > 
> > Which version ? Do you mean 3.0.3 (or 3.0.13) ?
> > 
> > cdm> Version 3.0.13 ... you're right on that
> > # rpm -q opensm
> > opensm-3.0.3-6.el5_1.1
> > ---
> > Apr  9 12:49:53 HOST OpenSM[3295]: /var/log/osm.log log file opened
> > Apr  9 12:49:53 HOST OpenSM[3295]: OpenSM Rev:openib-3.0.13
> > Apr  9 12:49:53 HOST kernel: user_mad: process opensm did not enable P_Key 
> > index support.
> > Apr  9 12:49:53 HOST kernel: user_mad:   
> > Documentation/infiniband/user_mad.txt has info on the new ABI.
> > Apr  9 12:49:59 HOST OpenSM[3295]: Entering MASTER state
> > Apr  9 12:50:02 HOST OpenSM[3295]: Errors during initialization
> 
> Your subnet has errors :-(
> 
> > Apr  9 12:50:16 HOST OpenSM[3295]: SUBNET UP
> > Apr  9 12:50:22 HOST kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes 
> > ready
> > Apr  9 12:50:30 HOST OpenSM[3295]: Errors during initialization
> > Apr  9 12:51:05 HOST last message repeated 2 times
> > Apr  9 12:52:17 HOST last message repeated 3 times
> > Apr  9 12:53:27 HOST last message repeated 3 times
> > ...
> > 
> > >  and seem to be having difficulties in keeping the opensm around.
> > > When I attach to the process w/ strace it does:
> > > ---
> > > # strace -p 5921
> > > > Process 5921 attached - interrupt to quit
> > > > restart_syscall(<... resuming interrupted call ...>) = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > ...
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0}, NULL)                = 0
> > > nanosleep({10, 0},  <unfinished ...>
> > > +++ killed by SIGSEGV +++
> > > ---
> > >
> > > I have ofed 1.1 and 1.2 drivers loaded on the system.  I've done this in 
> > > the past using opensm 3.0.0 svn tag 10188 from ofed 1.0 clients and had 
> > > no issues before.  Here's how opensm is running:
> > > ---
> > >  6079 pts/0    Sl     0:08 /usr/sbin/opensm -d 3 -maxsmps 0 -s 300 -t 
> > > 1000 -f /var/log/osm.log -V -g 0
> > > ---
> > >
> > > I have lots of data in the osm.log as you can imagine ... I don't know 
> > > offhand what I should be looking at/for.
> > 
> > What's towards the end of the log ?
> > 
> > cdm>
> > > I rebooted the node ... then brought up ib0, then restarted opensmd ... It 
> > > died when the log file got this big:
> > # ls -l osm.log -h
> > -rw-r--r-- 1 root root 3.2G Apr  9 13:12 osm.log
> > # tail osm.log
> > Apr 09 13:12:31 439877 [43204940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0089 Port 
> > 12 TID:0x00000000000032d3
> > Apr 09 13:12:31 440370 [41E02940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00D0 Port 3 
> > TID:0x0000000000007480
> > Apr 09 13:12:31 440669 [43204940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00B3 Port 7 
> > TID:0x00000000000058dd
> > Apr 09 13:12:31 440987 [41E02940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0082 Port 
> > 21 TID:0x000000000000285a
> > Apr 09 13:12:31 441228 [43204940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00E8 Port 
> > 10 TID:0x00000000000095a2
> > Apr 09 13:12:31 441579 [41E02940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x004A Port 1 
> > TID:0x0000000000010d29
> > Apr 09 13:12:31 441847 [43204940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0063 Port 
> > 24 TID:0x000000000000e40c
> > Apr 09 13:12:31 442130 [41E02940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x000A Port 
> > 23 TID:0x000000000006fca2
> > Apr 09 13:12:31 442469 [43204940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 
> > 18 TID:0x0000000000059fc4
> > Apr 09 13:12:31 442710 [41E02940] -> __osm_trap_rcv_process_request: 
> > Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 
> > 17 TID:0x0000000000059fc5
> 
> Those are flow control watchdog errors.

One possible explanation for this: the SM could be configuring
mismatched OperVLs at the two ends of these links. I'm not sure why.
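
To see which links are producing these traps, it may help to tally the
trap 131 notices per LID/port rather than reading a 3.2G log by eye. The
following is only a rough sketch: it assumes each entry sits on a single
line in the real osm.log (the lines above are wrapped by the mailer) and
that the format matches the tail output shown. The names count_traps and
summarize are made up for illustration and are not part of opensm.

```python
import re
from collections import Counter

# Matches the trap lines shown in the osm.log tail above.
# Assumption: in the real file each entry is on one line.
TRAP_RE = re.compile(
    r"Received Generic Notice type:0x(?P<type>[0-9A-Fa-f]+) "
    r"num:(?P<num>\d+) Producer:\d+ "
    r"from LID:0x(?P<lid>[0-9A-Fa-f]+) Port (?P<port>\d+)"
)


def count_traps(lines, trap_num=131):
    """Count notices with the given trap number, keyed by (LID, port)."""
    counts = Counter()
    for line in lines:
        match = TRAP_RE.search(line)
        if match and int(match.group("num")) == trap_num:
            key = (int(match.group("lid"), 16), int(match.group("port")))
            counts[key] += 1
    return counts


def summarize(path, top=20):
    """Print the (LID, port) pairs that reported the most traps."""
    # Iterating the file object streams it line by line, so a
    # multi-gigabyte log is never loaded into memory at once.
    with open(path, errors="replace") as log:
        for (lid, port), n in count_traps(log).most_common(top):
            print(f"LID 0x{lid:04X} port {port:2d}: {n} traps")
```

Calling summarize("/var/log/osm.log") would list the noisiest ports first;
those are the links whose two ends would be worth comparing.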

-- Hal

>  Any special opensm options set
> in the option file or are you running with the defaults ?
> 
> -- Hal
