Hal Rosenstock wrote:

Or perhaps something crashed and didn't clean up properly. Does this occur
immediately after a boot?


After a fresh reboot of the machines on the switch, I get the log at
http://www.cs.rutgers.edu/~bohra/osm-v2.log

The opensm process does not crash but hangs. The state of the port never changes.

Now there is an oops in dmesg:

Oct 28 13:52:13 hora-3 OpenSM[5168]: OpenSM Rev:openib-1.1.0
Oct 28 13:52:14 hora-3 kernel: Unable to handle kernel paging request at virtual address 09000010
Oct 28 13:52:14 hora-3 kernel:  printing eip:
Oct 28 13:52:14 hora-3 kernel: f883f12d
Oct 28 13:52:14 hora-3 kernel: *pde = 00000000
Oct 28 13:52:14 hora-3 kernel: Oops: 0000 [#1]
Oct 28 13:52:14 hora-3 kernel: SMP
Oct 28 13:52:14 hora-3 kernel: Modules linked in: ib_uverbs ib_umad ipv6 i2c_dev i2c_core sunrpc dm_mod video button battery ac uhci_hcd hw_random ib_mthca ib_mad ib_core e1000 floppy
Oct 28 13:52:14 hora-3 kernel: CPU:    1
Oct 28 13:52:14 hora-3 kernel: EIP:    0060:[<f883f12d>]    Not tainted VLI
Oct 28 13:52:14 hora-3 kernel: EFLAGS: 00010286   (2.6.13bohra)
Oct 28 13:52:14 hora-3 kernel: EIP is at ib_post_send_mad+0x1c/0x1b1 [ib_mad]
Oct 28 13:52:14 hora-3 kernel: eax: 09000000 ebx: c1a7d900 ecx: c1a7d918 edx: 00000000
Oct 28 13:52:14 hora-3 kernel: esi: c1a7d918 edi: f6571f68 ebp: f6571efc esp: f6571ed8
Oct 28 13:52:14 hora-3 kernel: ds: 007b   es: 007b   ss: 0068
Oct 28 13:52:14 hora-3 kernel: Process opensm (pid: 5224, threadinfo=f6570000 task=f7dfb020)
Oct 28 13:52:14 hora-3 kernel: Stack: f883ef5a 00000000 c1a7d800 080bd018 f6571efc 00000000 f6a42900 a0f684f6
Oct 28 13:52:14 hora-3 kernel: f6571f68 f6571f74 f88f1728 00000000 00000018 000000e8 000000d0 f6a42948
Oct 28 13:52:14 hora-3 kernel: f68bda24 00000000 00000009 a0f684f6 00000009 c1a7d918 00000000 00000100
Oct 28 13:52:14 hora-3 kernel: Call Trace:
Oct 28 13:52:14 hora-3 kernel:  [<c0104848>] show_stack+0x7c/0x92
Oct 28 13:52:14 hora-3 kernel:  [<c01049c9>] show_registers+0x152/0x1ca
Oct 28 13:52:14 hora-3 kernel:  [<c0104bcd>] die+0xf4/0x16f
Oct 28 13:52:14 hora-3 kernel:  [<c011885c>] do_page_fault+0x463/0x649
Oct 28 13:52:14 hora-3 kernel:  [<c01044bb>] error_code+0x4f/0x54
Oct 28 13:52:14 hora-3 kernel:  [<f88f1728>] ib_umad_write+0x2d0/0x30e [ib_umad]
Oct 28 13:52:14 hora-3 kernel:  [<c015d69b>] vfs_write+0x155/0x15a
Oct 28 13:52:14 hora-3 kernel:  [<c015d741>] sys_write+0x3d/0x64
Oct 28 13:52:14 hora-3 kernel:  [<c01038d3>] sysenter_past_esp+0x54/0x75
Oct 28 13:52:14 hora-3 kernel: Code: e8 d8 63 af c7 89 d8 83 c4 0c 5b 5e 5f 5d c3 55 89 e5 57 56 89 c6 53 83 ec 18 85 f6 89 55 f0 0f 84 ff 00 00 00 8b 46 08 8d 5e e8 <8b> 50 10 8b 7b 14 85 d2 0f 84 7c 01 00 00 8b 4e 18 85 c9 74 0b
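
One way to read the oops above: the bytes marked <8b> 50 10 in the Code: line
are the instruction that faulted. The sketch below is only a rough cross-check.
It assumes 32-bit x86 ModRM encoding, handles just the [base + disp8] form, and
uses register values copied by hand from the log; decode_disp8 is a hypothetical
helper, not part of any OpenIB tool. If the assumptions hold, the faulting load
reads from eax + 0x10, and eax was itself loaded from 0x8(%esi) two
instructions earlier.

```python
# Register values copied by hand from the oops above.
FAULT_ADDR = 0x09000010  # "Unable to handle kernel paging request at ..."
EAX = 0x09000000
EBX = 0xC1A7D900
ESI = 0xC1A7D918

REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
OPCODES = {0x8B: "mov", 0x8D: "lea"}  # only the two opcodes needed here


def decode_disp8(insn):
    """Decode a 3-byte `opcode modrm disp8` instruction (hypothetical helper).

    Returns (mnemonic, dest_reg, base_reg, signed_displacement).
    Only the [base + disp8] addressing form without a SIB byte is handled.
    """
    opcode, modrm, disp = insn
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode not in OPCODES or mod != 1 or rm == 4:
        raise ValueError("unsupported encoding for this sketch")
    if disp >= 0x80:  # disp8 is a signed byte
        disp -= 0x100
    return OPCODES[opcode], REGS[reg], REGS[rm], disp


# The last three instructions in the Code: line, ending at the marked one.
for raw in ("8b 46 08", "8d 5e e8", "8b 50 10"):
    mnemonic, dest, base, disp = decode_disp8(bytes.fromhex(raw))
    print(f"{raw}: {mnemonic} {disp:#x}(%{base}),%{dest}")

# The faulting load dereferences eax + 0x10, which matches the fault address.
print(hex(EAX + 0x10), hex(FAULT_ADDR))
# The lea just before it computed ebx = esi - 0x18, which matches the dump.
print(hex(ESI - 0x18), hex(EBX))
```

Under those assumptions the value in eax (09000000) does not look like a
kernel pointer on this machine, where the other pointers in the dump are
at c1a7d900 and above, so whatever ib_post_send_mad read from 0x8(%esi) may
already have been bad when ib_umad_write passed it in.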


Thanks
Aniruddha

________________________________

From: [EMAIL PROTECTED] on behalf of Sean Hefty
Sent: Fri 10/28/2005 12:01 PM
To: Aniruddha Bohra
Cc: openib-general@openib.org
Subject: Re: [openib-general] OpenSM crash with today's trunk



Aniruddha Bohra wrote:
Oh well, I guess this is a different bug.  Is there an oops or
anything in your kernel log, or is this just a userspace crash?

This is what I see:
Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM

Is this useful?

Is there any chance opensm is already running on the system?  It sounds like
something has already registered to receive the same MADs that opensm wants to
receive.

- Sean
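
Sean's question about a second opensm can be checked from userspace. The
following is a minimal sketch, assuming a Linux procfs mounted at /proc;
find_processes is a hypothetical helper name. It covers only one of the
possibilities raised above: it would not show some other agent that has
registered for the same MADs.

```python
import os


def find_processes(name, proc_root="/proc"):
    """Return sorted PIDs whose command name in <proc_root>/<pid>/stat is `name`."""
    pids = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return pids  # no procfs at this path
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # process exited, or the file is not readable
        # stat starts like "5168 (opensm) S 1 ..."; the name is in parentheses.
        start, end = stat.find("("), stat.rfind(")")
        if start != -1 and end > start and stat[start + 1:end] == name:
            pids.append(int(entry))
    return sorted(pids)


if __name__ == "__main__":
    running = find_processes("opensm")
    if len(running) > 1:
        print("more than one opensm is running:", running)
    else:
        print("opensm PIDs found:", running)
```

Running this before starting opensm by hand should report an empty list; any
PID listed at that point would be an instance that is already running.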
_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


