Hi Christopher,

On Wed, 2008-04-09 at 13:14 -0600, Maestas, Christopher Daniel wrote:
> Hello Hal,
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, April 09, 2008 12:38 PM
> To: Maestas, Christopher Daniel
> Cc: [email protected]
> Subject: Re: running opensm 3.0.3 on 4000+ node system
> 
> On Wed, 2008-04-09 at 12:26 -0600, Maestas, Christopher Daniel wrote:
> > I'm trying to run opensm on a 4000+ node system,
> 
> Which version ? Do you mean 3.0.3 (or 3.0.13) ?
> 
> cdm> Version 3.0.13 ... you're right on that
> # rpm -q opensm
> opensm-3.0.3-6.el5_1.1
> ---
> Apr  9 12:49:53 HOST OpenSM[3295]: /var/log/osm.log log file opened
> Apr  9 12:49:53 HOST OpenSM[3295]: OpenSM Rev:openib-3.0.13
> Apr  9 12:49:53 HOST kernel: user_mad: process opensm did not enable P_Key index support.
> Apr  9 12:49:53 HOST kernel: user_mad:   Documentation/infiniband/user_mad.txt has info on the new ABI.
> Apr  9 12:49:59 HOST OpenSM[3295]: Entering MASTER state
> Apr  9 12:50:02 HOST OpenSM[3295]: Errors during initialization

"Errors during initialization" means your subnet has errors :-(

> Apr  9 12:50:16 HOST OpenSM[3295]: SUBNET UP
> Apr  9 12:50:22 HOST kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
> Apr  9 12:50:30 HOST OpenSM[3295]: Errors during initialization
> Apr  9 12:51:05 HOST last message repeated 2 times
> Apr  9 12:52:17 HOST last message repeated 3 times
> Apr  9 12:53:27 HOST last message repeated 3 times
> ...
> 
> >  and seem to be having difficulties in keeping the opensm around.
> > When I attach to the process w/ strace it does:
> > ---
> > # strace -p 5921
> > Process 5921 attached - interrupt to quit
> > restart_syscall(<... resuming interrupted call ...>) = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > ...
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0}, NULL)                = 0
> > nanosleep({10, 0},  <unfinished ...>
> > +++ killed by SIGSEGV +++
> > ---
> >
> > I have ofed 1.1 and 1.2 drivers loaded on the system.  I've done this in 
> > the past using opensm 3.0.0 svn tag 10188 from ofed 1.0 clients and had no 
> > issues before.  Here's how opensm is running:
> > ---
> >  6079 pts/0    Sl     0:08 /usr/sbin/opensm -d 3 -maxsmps 0 -s 300 -t 1000 
> > -f /var/log/osm.log -V -g 0
> > ---
> >
> > I have lots of data in the osm.log as you can imagine ... I don't know 
> > offhand what I should be looking at/for.
> 
> What's towards the end of the log ?
> 
> cdm>
> I rebooted the node ... then brought up ib0, then restarted opensmd ... It died 
> when the log file got this big:
> # ls -l osm.log -h
> -rw-r--r-- 1 root root 3.2G Apr  9 13:12 osm.log
> # tail osm.log
> Apr 09 13:12:31 439877 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0089 Port 12 TID:0x00000000000032d3
> Apr 09 13:12:31 440370 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00D0 Port 3 TID:0x0000000000007480
> Apr 09 13:12:31 440669 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00B3 Port 7 TID:0x00000000000058dd
> Apr 09 13:12:31 440987 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0082 Port 21 TID:0x000000000000285a
> Apr 09 13:12:31 441228 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x00E8 Port 10 TID:0x00000000000095a2
> Apr 09 13:12:31 441579 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x004A Port 1 TID:0x0000000000010d29
> Apr 09 13:12:31 441847 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0063 Port 24 TID:0x000000000000e40c
> Apr 09 13:12:31 442130 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x000A Port 23 TID:0x000000000006fca2
> Apr 09 13:12:31 442469 [43204940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 18 TID:0x0000000000059fc4
> Apr 09 13:12:31 442710 [41E02940] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 17 TID:0x0000000000059fc5

Those are flow control update watchdog timer traps (trap 131). Are any
special opensm options set in the options file, or are you running with
the defaults?
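
With a log that size it may be easier to tally which LID/port pairs are
sending these traps than to read it by eye. Below is a rough sketch, not
anything shipped with opensm: the regular expression is only my assumption
about the log format, based on the tail output quoted above, and the names
count_traps and report are made up for illustration. It expects one log
entry per line, as in the file itself rather than the wrapped mail text.

```python
import re
from collections import Counter

# Assumed format, taken from the quoted tail of osm.log; other OpenSM
# versions or verbosity levels may log these notices differently.
TRAP_RE = re.compile(
    r"Received\s+Generic Notice type:0x[0-9A-Fa-f]+\s+num:(\d+)\s+"
    r"Producer:\d+\s+from LID:0x([0-9A-Fa-f]+)\s+Port\s+(\d+)"
)


def count_traps(lines, trap_num=131):
    """Count matching trap notices per (LID, port) pair.

    `lines` is any iterable of log lines, so a multi-gigabyte file can be
    streamed without loading it into memory.
    """
    counts = Counter()
    for line in lines:
        m = TRAP_RE.search(line)
        if m and int(m.group(1)) == trap_num:
            lid = m.group(2).upper()
            port = int(m.group(3))
            counts[(lid, port)] += 1
    return counts


def report(lines, trap_num=131, top=20):
    """Print the `top` noisiest (LID, port) pairs for the given trap number."""
    for (lid, port), n in count_traps(lines, trap_num).most_common(top):
        print(f"LID 0x{lid} port {port}: {n} notices")
```

For example, report(open("/var/log/osm.log", errors="replace")) would list
the ports generating the most trap 131 notices, which are the links to look
at first.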

-- Hal