Hi Christopher,
On Wed, 2008-04-09 at 13:14 -0600, Maestas, Christopher Daniel wrote:
> Hello Hal,
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 09, 2008 12:38 PM
> To: Maestas, Christopher Daniel
> Cc: [email protected]
> Subject: Re: running opensm 3.0.3 on 4000+ node system
>
> On Wed, 2008-04-09 at 12:26 -0600, Maestas, Christopher Daniel wrote:
> > I'm trying to run opensm on a 4000+ node system,
>
> Which version ? Do you mean 3.0.3 (or 3.0.13) ?
>
> cdm> Version 3.0.13 ... you're right on that
> # rpm -q opensm
> opensm-3.0.3-6.el5_1.1
> ---
> Apr 9 12:49:53 HOST OpenSM[3295]: /var/log/osm.log log file opened
> Apr 9 12:49:53 HOST OpenSM[3295]: OpenSM Rev:openib-3.0.13
> Apr 9 12:49:53 HOST kernel: user_mad: process opensm did not enable P_Key
> index support.
> Apr 9 12:49:53 HOST kernel: user_mad:
> Documentation/infiniband/user_mad.txt has info on the new ABI.
> Apr 9 12:49:59 HOST OpenSM[3295]: Entering MASTER state
> Apr 9 12:50:02 HOST OpenSM[3295]: Errors during initialization
Your subnet has errors :-(
> Apr 9 12:50:16 HOST OpenSM[3295]: SUBNET UP
> Apr 9 12:50:22 HOST kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
> Apr 9 12:50:30 HOST OpenSM[3295]: Errors during initialization
> Apr 9 12:51:05 HOST last message repeated 2 times
> Apr 9 12:52:17 HOST last message repeated 3 times
> Apr 9 12:53:27 HOST last message repeated 3 times
> ...
>
> > and seem to be having difficulties in keeping the opensm around.
> > When I attach to the process w/ strace it does:
> > ---
> > # strace -p 5921
> > Process 5921 attached - interrupt to quit
> > restart_syscall(<... resuming interrupted call ...>) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > ...
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, NULL) = 0
> > nanosleep({10, 0}, <unfinished ...>
> > +++ killed by SIGSEGV +++
> > ---
> >
> > I have ofed 1.1 and 1.2 drivers loaded on the system. I've done this in
> > the past using opensm 3.0.0 svn tag 10188 from ofed 1.0 clients and had no
> > issues before. Here's how opensm is running:
> > ---
> > 6079 pts/0 Sl 0:08 /usr/sbin/opensm -d 3 -maxsmps 0 -s 300 -t 1000
> > -f /var/log/osm.log -V -g 0
> > ---
> >
> > I have lots of data in the osm.log as you can imagine ... I don't know
> > offhand what I should be looking at/for.
>
> What's towards the end of the log ?
>
> cdm>
> I rebooted the node ... then brought up ib0, then restarted opensmd ... It died
> when the file got this big:
> # ls -l osm.log -h
> -rw-r--r-- 1 root root 3.2G Apr 9 13:12 osm.log
> # tail osm.log
> Apr 09 13:12:31 439877 [43204940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x0089 Port 12
> TID:0x00000000000032d3
> Apr 09 13:12:31 440370 [41E02940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x00D0 Port 3
> TID:0x0000000000007480
> Apr 09 13:12:31 440669 [43204940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x00B3 Port 7
> TID:0x00000000000058dd
> Apr 09 13:12:31 440987 [41E02940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x0082 Port 21
> TID:0x000000000000285a
> Apr 09 13:12:31 441228 [43204940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x00E8 Port 10
> TID:0x00000000000095a2
> Apr 09 13:12:31 441579 [41E02940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x004A Port 1
> TID:0x0000000000010d29
> Apr 09 13:12:31 441847 [43204940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x0063 Port 24
> TID:0x000000000000e40c
> Apr 09 13:12:31 442130 [41E02940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x000A Port 23
> TID:0x000000000006fca2
> Apr 09 13:12:31 442469 [43204940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 18
> TID:0x0000000000059fc4
> Apr 09 13:12:31 442710 [41E02940] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:131 Producer:2 from LID:0x0009 Port 17
> TID:0x0000000000059fc5
Those are flow control update watchdog timer traps (trap 131). Are any
special opensm options set in the option file, or are you running with
the defaults?
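
With a log this size, it may help to tally these traps per LID and port to
see whether a handful of links account for most of them. Here is a minimal
sketch, assuming each entry sits on a single line in the real osm.log (the
excerpt above was wrapped by the mailer) and matches the format shown there.
The count_traps helper and the script are hypothetical, not part of opensm.

```python
import re
import sys
from collections import Counter

# Matches trap entries in the format shown in the osm.log excerpt, e.g.
#   ... Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x0089 Port 12 ...
# Assumption: one entry per line in the real log file.
TRAP_RE = re.compile(
    r"Received Generic Notice type:0x[0-9A-Fa-f]+ num:(\d+) "
    r"Producer:\d+ from LID:(0x[0-9A-Fa-f]+) Port (\d+)"
)


def count_traps(lines, trap_num=131):
    """Count traps with the given number, keyed by (LID, port)."""
    counts = Counter()
    for line in lines:
        m = TRAP_RE.search(line)
        if m and int(m.group(1)) == trap_num:
            counts[(m.group(2), int(m.group(3)))] += 1
    return counts


# Only runs when a log path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    # Iterate over the file object so a multi-gigabyte log is streamed,
    # not loaded into memory.
    with open(sys.argv[1], errors="replace") as log:
        for (lid, port), n in count_traps(log).most_common(20):
            print(f"{lid} port {port}: {n}")
```

Saved under a name of your choosing and run with the log path as its only
argument, it prints the 20 busiest LID/port pairs. Ports that dominate the
tally may be worth checking first for a bad cable or link.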
-- Hal